Thanks Thanks:  0
Likes Likes:  0
Results 1 to 1 of 1

Thread: [Beginners Tutorial] File Compression And Splitting [How It Works]

  1. #1
    veteran Chimaera's Avatar
    Join Date
    Apr 2008
    Post Thanks
    Rep Power

    [Beginners Tutorial] File Compression And Splitting [How It Works]

    Hidden Content

    Ok i thought i would put up an offering to help those who don't understand why files are packed and you have to extract them to get the actual file you need. There is a certain amount of technical information in this post which is unavoidable

    There are many types of compression some better than others but ill deal with the more common ones in this article

    First off ill deal with how it actually works using the .zip format.

    If you download many programs and files off the Internet, you've probably encountered ZIP files before. This compression system is a very handy invention, especially for Web users, because it lets you reduce the overall number of bits and bytes in a file so it can be transmitted faster over slower Internet connections, or take up less space on a disk.
    Once you download the file, your computer uses a program such as WinZip to expand the file back to its original size. If everything works correctly, the expanded file is identical to the original file before it was compressed.

    At first glance, this seems very mysterious. How can you reduce the number of bits and bytes and then add those exact bits and bytes back later?
    As it turns out, the basic idea behind the process is fairly straightforward. In this article, we'll examine this simple method as we take a very small file through the basic process of compression.

    File Calculation
    Most types of computer files are fairly redundant -- they have the same information listed over and over again. File-compression programs simply get rid of the redundancy. Instead of listing a piece of information over and over again, a file-compression program lists that information once and then refers back to it whenever it appears in the original program.

    Lossless Compression
    The type of compression we're discussing here is called lossless compression, because it lets you recreate the original file exactly. All lossless compression is based on the idea of breaking a file into a "smaller" form for transmission or storage and then putting it back together on the other end so it can be used again.

    As an example, let's look at a type of information we're all familiar with: words.

    In John F. Kennedy's 1961 inaugural address, he delivered this famous line:

    "Ask not what your country can do for you -- ask what you can do for your country."

    The quote has 17 words, made up of 61 letters, 16 spaces, one dash and one period. If each letter, space or punctuation mark takes up one unit of memory, we get a total file size of 79 units. To get the file size down, we need to look for redundancies.

    Immediately, we notice that:

    * "ask" appears two times
    * "what" appears two times
    * "your" appears two times
    * "country" appears two times
    * "can" appears two times
    * "do" appears two times
    * "for" appears two times
    * "you" appears two times

    Ignoring the difference between capital and lower-case letters, roughly half of the phrase is redundant. Nine words -- ask, not, what, your, country, can, do, for, you -- give us almost everything we need for the entire quote. To construct the second half of the phrase, we just point to the words in the first half and fill in the spaces and punctuation.

    Redundancy and Algorithms
    Most compression programs use a variation of the LZ adaptive dictionary-based algorithm to shrink files. "LZ" refers to Lempel and Ziv, the algorithm's creators, and "dictionary" refers to the method of cataloging pieces of data.

    The system for arranging dictionaries varies, but it could be as simple as a numbered list. When we go through Kennedy's famous words, we pick out the words that are repeated and put them into the numbered index. Then, we simply write the number instead of writing out the whole word.

    So, if this is our dictionary:

    1. ask
    2. what
    3. your
    4. country
    5. can
    6. do
    7. for
    8. you

    Our sentence now reads:

    "1 not 2 3 4 5 6 7 8 -- 1 2 8 5 6 7 3 4"

    If you knew the system, you could easily reconstruct the original phrase using only this dictionary and number pattern. This is what the expansion program on your computer does when it expands a downloaded file.
    You might also have encountered compressed files that open themselves up. To create this sort of file, the programmer includes a simple expansion program with the compressed file. It automatically reconstructs the original file once it's downloaded.

    But how much space have we actually saved with this system?
    "1 not 2 3 4 5 6 7 8 -- 1 2 8 5 6 7 3 4" is certainly shorter than "Ask not what your country can do for you; ask what you can do for your country;" but keep in mind that we need to save the dictionary itself along with the file.

    In an actual compression scheme, figuring out the various file requirements would be fairly complicated; but for our purposes, let's go back to the idea that every character and every space takes up one unit of memory.
    We already saw that the full phrase takes up 79 units. Our compressed sentence (including spaces) takes up 37 units, and the dictionary (words and numbers) also takes up 37 units. This gives us a file size of 74, so we haven't reduced the file size by very much.

    But this is only one sentence! You can imagine that if the compression program worked through the rest of Kennedy's speech, it would find these words and others repeated many more times, it would also be rewriting the dictionary to get the most efficient organization possible.

    In our previous example, we picked out all the repeated words and put those in a dictionary. To us, this is the most obvious way to write a dictionary. But a compression program sees it quite differently:
    It doesn't have any concept of separate words -- it only looks for patterns. And in order to reduce the file size as much as possible, it carefully selects which patterns to include in the dictionary.

    If we approach the phrase from this perspective, we end up with a completely different dictionary.

    If the compression program scanned Kennedy's phrase, the first redundancy it would come across would be only a couple of letters long. In "ask not what your," there is a repeated pattern of the letter "t" followed by a space -- in "not" and "what." If the compression program wrote this to the dictionary, it could write a "1" every time a "t" were followed by a space. But in this short phrase, this pattern doesn't occur enough to make it a worthwhile entry, so the program would eventually overwrite it.

    The next thing the program might notice is "ou," which appears in both "your" and "country." If this were a longer document, writing this pattern to the dictionary could save a lot of space -- "ou" is a fairly common combination in the English language. But as the compression program worked through this sentence, it would quickly discover a better choice for a dictionary entry: Not only is "ou" repeated, but the entire words "your" and "country" are both repeated, and they are actually repeated together, as the phrase "your country." In this case, the program would overwrite the dictionary entry for "ou" with the entry for "your country."

    The phrase "can do for" is also repeated, one time followed by "your" and one time followed by "you," giving us a repeated pattern of "can do for you." This lets us write 15 characters (including spaces) with one number value, while "your country" only lets us write 13 characters (with spaces) with one number value, so the program would overwrite the "your country" entry as just "r country," and then write a separate entry for "can do for you." The program proceeds in this way, picking up all repeated bits of information and then calculating which patterns it should write to the dictionary. This ability to rewrite the dictionary is the "adaptive" part of LZ adaptive dictionary-based algorithm. The way a program actually does this is fairly complicated but we are simplifying here to give you the grasp of what we are talking about.

    No matter what specific method you use, this in-depth searching system lets you compress the file much more efficiently than you could by just picking out words. Using the patterns we picked out above, and adding "__" for spaces, we come up with this larger dictionary:

    1. ask__
    2. what__
    3. you
    4. r__country
    5. __can__do__for__you

    And this smaller sentence:


    The sentence now takes up 18 units of memory, and our dictionary takes up 41 units. So we've compressed the total file size from 79 units to 59 units! This is just one way of compressing the phrase, and not necessarily the most efficient one.

    So how good is this system? The file-reduction ratio depends on a number of factors, including file type, file size and compression scheme.

    In most languages of the world, certain letters and words often appear together in the same pattern. Because of this high rate of redundancy, text files compress very well. A reduction of 50 percent or more is typical for a good-sized text file.
    Most programming languages are also very redundant because they use a relatively small collection of commands, which frequently go together in a set pattern. Files that include a lot of unique information, such as graphics or MP3 files, cannot be compressed much with this system because they don't repeat many patterns

    If a file has a lot of repeated patterns, the rate of reduction typically increases with file size. You can see this just by looking at our example -- if we had more of Kennedy's speech, we would be able to refer to the patterns in our dictionary more often, and so get more out of each entry's file space.
    Also, more pervasive patterns might emerge in the longer work, allowing us to create a more efficient dictionary.

    This efficiency also depends on the specific algorithm used by the compression program. Some programs are particularly suited to picking up patterns in certain types of files, and so may compress them more succinctly. Others have dictionaries within dictionaries, which might compress efficiently for larger files but not for smaller ones.
    While all compression programs of this sort work with the same basic idea, there is actually a good deal of variation in the manner of execution. Programmers are always trying to build a better system.

    Lossy Compression

    Lossy compression works very differently. These programs simply eliminate "unnecessary" bits of information, tailoring the file so that it is smaller. This type of compression is used a lot for reducing the file size of bitmap pictures, which tend to be fairly bulky. To see how this works, let's consider how your computer might compress a scanned photograph.

    A lossless compression program can't do much with this type of file. While large parts of the picture may look the same -- the whole sky is blue, for example -- most of the individual pixels are a little bit different. To make this picture smaller without compromising the resolution, you have to change the color value for certain pixels. If the picture had a lot of blue sky, the program would pick one color of blue that could be used for every pixel. Then, the program rewrites the file so that the value for every sky pixel refers back to this information. If the compression scheme works well, you won't notice the change, but the file size will be significantly reduced.

    Of course, with lossy compression, you can't get the original file back after it has been compressed. You're stuck with the compression program's reinterpretation of the original.
    For this reason, you can't use this sort of compression for anything that needs to be reproduced exactly, including software applications, databases and presidential inauguration speeches.

    There are limits to what can be done, it all depends on how good the dictionary is to the type of file your trying to compact.

    Other Things To Remember
    There are many types of compression but the main three are:
    Hidden Content [.rar]
    Hidden Content src="Hidden Content border="0" alt="" />[.zip]
    Hidden Content src="Hidden Content border="0" alt="" />[.7z]
    Hidden Content personal preference is Winrar but thats what it is a preference, you will find the majority of files tend to be in winrar ( .rar ) format these days.

    There is a certain amount of crossover but its generally safer to use the same program that made the file when you unpack it because the dictionary's are not exactly the same, so you can get unpredictable results.

    Now another area of compression is splitting and it generally confuses people.

    File Splitters
    Hidden Content [.001]
    Hidden Content file method that is quite common these days is .001 and this one causes no end of confusion because it is not a compression format it is a splitter, It splits files of any type and size. Does not need to be installed. This program can even handle files which are larger than 10Gb!

    So if you find a download that has files .001 to .008 for example then use hjsplit to stick them back together.
    And you should end up with the original file after joining.

    Also bear in mind winrar also splits into smaller sections for uploading so you may have more than one file to extract and it wont if you don't have them all

    Common Problems With File Extraction
    There is a veritable minefield that goes with hand in hand with file compression, namely the tricks people do to protect their files once they have cracked them and packaged them for release.

    Just because you can see the files available for download on whatever your chosen format DONT assume that the files are:

    1: Complete
    2: Packed Correctly
    3: Actually contain what you want
    4: Don't have a Password on the archive
    5: Don't need a password on the second compression inside the first compression
    6: Will extract with the correct file ending eg: .avi
    7: Aren't damaged in some way so you cant extract them anyway
    8: Are actually in the correct language
    9: Dont contain spyware, trojans, viruses etc etc
    10: And many other things besides................................

    Why would you assume this would happen?
    Because the file you see in its glory on Wod or anywhere else for that matter could have passed through many hands before it gets to the download you see in front of you.

    People often take credit for others work by removing the crackers .nfo file and passing it of as their own.
    Or they remove one part of the set so you have to contact them to get the missing file and if the download has jumped through many forums before you see it you may never be able to get the missing piece.

    Passwords for eg:
    Someone may have found the file and packed it again before upping it onto rapidshare to share with people without realizing the original file was already passworded so you may open the first but not the second
    Or they may have not copied the password correctly so you still cant open it

    Mislabeling is common as well if one of the kiddies gets a file which is clearly labelled as a CAM of a film then packs it in an archive labeled DVDRIP lots of people are gonna download it only to find its incorrect (this is very common), do your research know what stage a film is at so if its too early for the dvdrip to be available save yourself the wasted download eg: if its not out at the cinemas for eg it wont be available as a dvd rip yet....

    Use common sense when you find things look at the site you are getting it from, most of the stuff on here is put up by competent people who know what they are doing, but mistakes still happen as detailed above.

    Please don't whinge because your file didn't unpack because the chances are it may not be the posters fault as detailed above

    I hope you find this useful and if there is anything you think out to be included let me know


    creds to for some of the technical material
    Last edited by Chimaera; 28-06-2009 at 02:03 PM.

Similar Threads

  1. [Beginners Tutorial] On How To Install Winrar
    By Chimaera in forum General Discussion
    Replies: 15
    Last Post: 27-10-2014, 06:58 AM
  2. Video compression/conversion for mobiles
    By trinitys_mate in forum General Computer Topics
    Replies: 1
    Last Post: 29-01-2007, 09:15 PM
  3. Cable splitting
    By plymbrew in forum Cable Dbox2
    Replies: 7
    Last Post: 18-11-2006, 10:42 AM


Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts