LZ77 and LZ78
Encyclopedia : L : LZ : LZ7 : LZ77 and LZ78
LZ77 and LZ78 are the names for the two lossless data compression algorithms published in papers by Abraham Lempel and Jacob Ziv in 1977 and 1978. These two algorithms form the basis for most of the LZ variations including LZW, LZSS and others. They are both dictionary coders, unlike minimum redundancy coders or run length coders. LZ77 is the "sliding window" compression algorithm, which was later shown to be equivalent to the explicit dictionary technique first given in LZ78.
LZ77
LZ77 algorithms achieve compression by replacing portions of the data with references to matching data that has already passed through both encoder and decoder. A match is encoded by a pair of numbers called a length-distance pair, which is equivalent to the statement "each of the next length characters is equal to the character exactly distance characters behind it in the uncompressed stream." (The "distance" is sometimes called the "offset" instead.)
The encoder and decoder must both keep track of some amount of the most recent data, such as the last 2KB, 4KB, or 32KB. The structure this data is held in is called a sliding window, which is why LZ77 is sometimes called sliding window compression. The encoder needs to keep this data to look for matches, and the decoder needs to keep this data to interpret the matches the encoder refers to. This is why the encoder can use a smaller size sliding window than the decoder, but not vice-versa.
Many documents which talk about LZ77 algorithms describe a length-distance pair as a command to "copy" data from the sliding window: "Go back distance characters in the buffer and copy length characters, starting from that point." While those used to imperative programming may find this model intuitive, it may also make it hard to understand a feature of LZ77 encoding: namely, that it is not only acceptable but frequently useful to have a length-distance pair where the length actually exceeds the distance. As a copy command, this is puzzling: "Go back one character in the buffer and copy seven characters, starting from that point." How can seven characters be copied from the buffer when only one of the specified characters is actually in the buffer? Looking at a length-distance pair as a statement of identity, however, clarifies the confusion: each of the next seven characters is identical to the character that comes one before it. This means that each character can be determined by looking back in the buffer -- even if the character looked back to was not in the buffer when the decoding of the current pair began. Since by definition a pair like this will be repeating a sequence of distance characters multiple times, it means that LZ77 incorporates a flexible and easy form of run-length encoding.
Even though all LZ77 algorithms work by definition on the same basic principle, they can vary widely in how they output their encoded data -- what values are possible for lengths and distances, for example, and how length-distance pairs are distinguished from literals (single characters encoded as themselves, rather than as part of a length-distance pair.) A few examples:
- The algorithm illustrated in Lempel and Ziv's original 1977 paper output all its data three values at a time: the length and distance of the longest match found in the buffer, and the literal which followed that match. If two successive characters in the input stream could only be encoded as literals, the length would be 0.
- In the PalmDoc format, a length-distance pair is always encoded by a two-byte sequence. Of the 16 bits that make up these two bytes, 11 bits go to encoding the distance, 3 go to encoding the length, and the remaining two are used to make sure the decoder can identify the first byte as the beginning of such a two-byte sequence.
- As of 2004, the most popular LZ77 based compression method is called DEFLATE; it combines LZ77 with Huffman coding. Literals, lengths, and a symbol to indicate the end of the current block of data are all placed together into one alphabet. Distances can be safely placed into a separate alphabet; since a distance only occurs just after a length, it cannot be mistaken for another kind of symbol or vice-versa.
LZ78
While the LZ77 algorithm works on past data, the LZ78 algorithm attempts to work on future data. It does this by forward scanning the input buffer and matching it against a dictionary it maintains. It will scan into the buffer until it cannot find a match in the dictionary. At this point it will output the location of the word in the dictionary, if one is available, the match length and the character that caused a match failure. The resulting word is then added to the dictionary.Though initially popular, the popularity of LZ78 later dampened, possibly because for the first few decades after it was introduced, parts of LZ78 were patent encumbered in the United States. The most popular form of LZ78 compression was the LZW algorithm, a modification of the LZ78 algorithm made by Terry Welch.
LZ78 Example
This example shows the LZ78 algorithm in action, showing the status of the output and the dictionary at every stage, both in encoding and decoding the message. In order to keep things clear, let us assume that we're dealing with a simple alphabet - capital letters only, and no punctuation or spaces. This example has been constructed to give reasonable compression on a very short message; when used on real data, repetition is generally less pronounced, and so the initial parts of a message will see little compression. As the message grows, however, the compression ratio tends asymptotically to the maximum.Jacob Ziv and Abraham Lempel; [Compression of Individual Sequences Via Variable-Rate Coding], IEEE Transactions on Information Theory, September 1978. A message to be sent might then look like the following:
TOBEORNOTTOBEORTOBEORNOT#The # is a marker used to show that the end of the message has been reached. Clearly, then, we have 27 symbols in our alphabet. A computer will render these as strings of bits; 5-bit strings are needed to give sufficient combinations to encompass the entire dictionary. As the dictionary grows, the strings will need to grow in length to accommodate the additional entries. A 5-bit string gives 25 = 32 possible combinations of bits, and so when the 33rd dictionary word is created, the algorithm will have to start using 6-bit strings. Note that since the all-zero string 00000 is used, and is labeled "0", the 33rd dictionary entry will be labeled 32. The initial dictionary, then, will consist of the following:
# = 00000 A = 00001 B = 00010 C = 00011 . . . Z = 11010
Encoding
If we weren't using LZ78, and just sent the message as it stands (25 symbols at 5 bits each), it would require 125 bits. We will be able to compare this figure to the LZ78 output later. We are now in a position to apply LZ78 to the message.
Symbol: Bit Code: New Dictionary Entry: (= output)In using LZ78 we have made a saving of 28 bits out of 125 -- we have reduced the message by almost 22%. If the message were longer, then the dictionary words would begin to represent longer and longer sections of text, allowing repeated words to be sent very compactly.T 20 = 10100 28: TO O 15 = 01111 29: OB B 2 = 00010 30: BE E 5 = 00101 31: EO O 15 = 01111 32: OR <--- start using 6-bit strings R 18 = 010010 33: RN N 14 = 001110 34: NO O 15 = 001111 35: OT T 20 = 010100 36: TT TO 28 = 011100 37: TOB BE 30 = 011110 38: BEO OR 32 = 100000 39: ORT TOB 37 = 100101 40: TOBE EO 31 = 011111 41: EOR RN 33 = 100001 42: RNO OT 35 = 100011 43: OT# # 0 = 000000
Total Length = 5*5 + 12*6 = 97 bits.
Decoding
Imagine now that we have received the message produced above, and wish to decode it. We need to know in advance the initial dictionary used, but we can reconstruct the additional entries as we go, since they are always simply concatenations of previous entries.
Bits: Output: New Entry: Full: Partial:The only slight complication comes if the newly-created dictionary word is sent immediately. In the decoding example above, when the decoder receives the first symbol, T, it knows that symbol 28 begins with a T, but what does it end with? The problem is illustrated below. We are decoding part of a message that reads ABABA:10100 = 20 T 28: T? 01111 = 15 O 28: TO 29: O? 00010 = 2 B 29: OB 30: B? 00101 = 5 E 30: BE 31: E? 01111 = 15 O 31: EO 32: O? <--- start using 6-bit strings 010010 = 18 R 32: OR 33: R? 001110 = 14 N 33: RN 34: N? 001111 = 15 O 34: NO 35: O? 010100 = 20 T 35: OT 36: T? 011100 = 28 TO 36: TT 37: TO? <--- for 36, only add 1st element 011110 = 30 BE 37: TOB 38: BE? of next dictionary word 100000 = 32 OR 38: BEO 39: OR? 100101 = 37 TOB 39: ORT 40: TOB? 011111 = 31 EO 40: TOBE 41: EO? 100001 = 33 RN 41: EOR 42: RN? 100011 = 35 OT 42: RNO 43: OT? 000000 = 0 #
Bits: Output: New Entry: Full: Partial:At first glance, this may appear to be asking the impossible of the decoder. We know ahead of time that entry 47 should be ABA, but how can the decoder work this out? The critical step is to note that 47 is built out of 29 plus whatever comes next. 47, therefore, ends with "whatever comes next". But, since it was sent immediately, it must also start with "whatever comes next", and so must end with the same symbol it starts with, namely A. This trick allows the decoder to see that 47 must be ABA.. . . 011101 = 29 AB 46: (word) 47: AB? 101111 = 47 AB? <--- what do we do here?
More generally the situation occurs whenever the encoder encounters the input of the form cScSc, where c is a single character, S is a string and cS is already in the dictionary. The encoder outputs the symbol for cS putting new symbol for cSc in the dictionary. Next it sees the cSc in the input and sends the new symbol it just inserted into the dictionary. By the reasoning presented in the above example this is the only case were the newly-created symbol is sent immediately.
References
- Jacob Ziv and Abraham Lempel; [A Universal Algorithm for Sequential Data Compression], IEEE Transactions on Information Theory, May 1977.
External links
- [List of LZ77 algorithm (and its derivatives) libraries, papers and sources]
From Wikipedia, the Free Encyclopedia. Original article here. Support Wikipedia by contributing or donating.
All text is available under the terms of the GNU Free Documentation License See Wikipedia Copyrights for details.
