Warning, /cpukit/compression/zlib/doc/rfc1951.txt is written in an unsupported language. File is not indexed.
0001
0002
0003
0004
0005
0006
0007 Network Working Group P. Deutsch
0008 Request for Comments: 1951 Aladdin Enterprises
0009 Category: Informational May 1996
0010
0011
0012 DEFLATE Compressed Data Format Specification version 1.3
0013
0014 Status of This Memo
0015
0016 This memo provides information for the Internet community. This memo
0017 does not specify an Internet standard of any kind. Distribution of
0018 this memo is unlimited.
0019
0020 IESG Note:
0021
0022 The IESG takes no position on the validity of any Intellectual
0023 Property Rights statements contained in this document.
0024
0025 Notices
0026
0027 Copyright (c) 1996 L. Peter Deutsch
0028
0029 Permission is granted to copy and distribute this document for any
0030 purpose and without charge, including translations into other
0031 languages and incorporation into compilations, provided that the
0032 copyright notice and this notice are preserved, and that any
0033 substantive changes or deletions from the original are clearly
0034 marked.
0035
0036 A pointer to the latest version of this and related documentation in
0037 HTML format can be found at the URL
0038 <ftp://ftp.uu.net/graphics/png/documents/zlib/zdoc-index.html>.
0039
0040 Abstract
0041
0042 This specification defines a lossless compressed data format that
0043 compresses data using a combination of the LZ77 algorithm and Huffman
0044 coding, with efficiency comparable to the best currently available
0045 general-purpose compression methods. The data can be produced or
0046 consumed, even for an arbitrarily long sequentially presented input
0047 data stream, using only an a priori bounded amount of intermediate
0048 storage. The format can be implemented readily in a manner not
0049 covered by patents.
0050
0051
0052
0053
0054
0055
0056
0057
0058 Deutsch Informational [Page 1]
0059
0060 RFC 1951 DEFLATE Compressed Data Format Specification May 1996
0061
0062
0063 Table of Contents
0064
0065 1. Introduction ................................................... 2
0066 1.1. Purpose ................................................... 2
0067 1.2. Intended audience ......................................... 3
0068 1.3. Scope ..................................................... 3
0069 1.4. Compliance ................................................ 3
0070 1.5. Definitions of terms and conventions used ................ 3
0071 1.6. Changes from previous versions ............................ 4
0072 2. Compressed representation overview ............................. 4
0073 3. Detailed specification ......................................... 5
0074 3.1. Overall conventions ....................................... 5
0075 3.1.1. Packing into bytes .................................. 5
0076 3.2. Compressed block format ................................... 6
0077 3.2.1. Synopsis of prefix and Huffman coding ............... 6
0078 3.2.2. Use of Huffman coding in the "deflate" format ....... 7
0079 3.2.3. Details of block format ............................. 9
0080 3.2.4. Non-compressed blocks (BTYPE=00) ................... 11
0081 3.2.5. Compressed blocks (length and distance codes) ...... 11
0082 3.2.6. Compression with fixed Huffman codes (BTYPE=01) .... 12
0083 3.2.7. Compression with dynamic Huffman codes (BTYPE=10) .. 13
0084 3.3. Compliance ............................................... 14
0085 4. Compression algorithm details ................................. 14
0086 5. References .................................................... 16
0087 6. Security Considerations ....................................... 16
0088 7. Source code ................................................... 16
0089 8. Acknowledgements .............................................. 16
0090 9. Author's Address .............................................. 17
0091
0092 1. Introduction
0093
0094 1.1. Purpose
0095
0096 The purpose of this specification is to define a lossless
0097 compressed data format that:
0098 * Is independent of CPU type, operating system, file system,
0099 and character set, and hence can be used for interchange;
0100 * Can be produced or consumed, even for an arbitrarily long
0101 sequentially presented input data stream, using only an a
0102 priori bounded amount of intermediate storage, and hence
0103 can be used in data communications or similar structures
0104 such as Unix filters;
0105 * Compresses data with efficiency comparable to the best
0106 currently available general-purpose compression methods,
0107 and in particular considerably better than the "compress"
0108 program;
0109 * Can be implemented readily in a manner not covered by
0110 patents, and hence can be practiced freely;
0111
0112
0113
0114 Deutsch Informational [Page 2]
0115
0116 RFC 1951 DEFLATE Compressed Data Format Specification May 1996
0117
0118
0119 * Is compatible with the file format produced by the current
0120 widely used gzip utility, in that conforming decompressors
0121 will be able to read data produced by the existing gzip
0122 compressor.
0123
0124 The data format defined by this specification does not attempt to:
0125
0126 * Allow random access to compressed data;
0127 * Compress specialized data (e.g., raster graphics) as well
0128 as the best currently available specialized algorithms.
0129
0130 A simple counting argument shows that no lossless compression
0131 algorithm can compress every possible input data set. For the
0132 format defined here, the worst case expansion is 5 bytes per 32K-
0133 byte block, i.e., a size increase of 0.015% for large data sets.
0134 English text usually compresses by a factor of 2.5 to 3;
0135 executable files usually compress somewhat less; graphical data
0136 such as raster images may compress much more.
0137
0138 1.2. Intended audience
0139
0140 This specification is intended for use by implementors of software
0141 to compress data into "deflate" format and/or decompress data from
0142 "deflate" format.
0143
0144 The text of the specification assumes a basic background in
0145 programming at the level of bits and other primitive data
0146 representations. Familiarity with the technique of Huffman coding
0147 is helpful but not required.
0148
0149 1.3. Scope
0150
0151 The specification specifies a method for representing a sequence
0152 of bytes as a (usually shorter) sequence of bits, and a method for
0153 packing the latter bit sequence into bytes.
0154
0155 1.4. Compliance
0156
0157 Unless otherwise indicated below, a compliant decompressor must be
0158 able to accept and decompress any data set that conforms to all
0159 the specifications presented here; a compliant compressor must
0160 produce data sets that conform to all the specifications presented
0161 here.
0162
0163 1.5. Definitions of terms and conventions used
0164
0165 Byte: 8 bits stored or transmitted as a unit (same as an octet).
0166 For this specification, a byte is exactly 8 bits, even on machines
0167
0168
0169
0170 Deutsch Informational [Page 3]
0171
0172 RFC 1951 DEFLATE Compressed Data Format Specification May 1996
0173
0174
0175 which store a character on a number of bits different from eight.
0176 See below, for the numbering of bits within a byte.
0177
0178 String: a sequence of arbitrary bytes.
0179
0180 1.6. Changes from previous versions
0181
0182 There have been no technical changes to the deflate format since
0183 version 1.1 of this specification. In version 1.2, some
0184 terminology was changed. Version 1.3 is a conversion of the
0185 specification to RFC style.
0186
0187 2. Compressed representation overview
0188
0189 A compressed data set consists of a series of blocks, corresponding
0190 to successive blocks of input data. The block sizes are arbitrary,
0191 except that non-compressible blocks are limited to 65,535 bytes.
0192
0193 Each block is compressed using a combination of the LZ77 algorithm
0194 and Huffman coding. The Huffman trees for each block are independent
0195 of those for previous or subsequent blocks; the LZ77 algorithm may
0196 use a reference to a duplicated string occurring in a previous block,
0197 up to 32K input bytes before.
0198
0199 Each block consists of two parts: a pair of Huffman code trees that
0200 describe the representation of the compressed data part, and a
0201 compressed data part. (The Huffman trees themselves are compressed
0202 using Huffman encoding.) The compressed data consists of a series of
0203 elements of two types: literal bytes (of strings that have not been
0204 detected as duplicated within the previous 32K input bytes), and
0205 pointers to duplicated strings, where a pointer is represented as a
0206 pair <length, backward distance>. The representation used in the
0207 "deflate" format limits distances to 32K bytes and lengths to 258
0208 bytes, but does not limit the size of a block, except for
0209 uncompressible blocks, which are limited as noted above.
0210
0211 Each type of value (literals, distances, and lengths) in the
0212 compressed data is represented using a Huffman code, using one code
0213 tree for literals and lengths and a separate code tree for distances.
0214 The code trees for each block appear in a compact form just before
0215 the compressed data for that block.
0216
0217
0218
0219
0220
0221
0222
0223
0224
0225
0226 Deutsch Informational [Page 4]
0227
0228 RFC 1951 DEFLATE Compressed Data Format Specification May 1996
0229
0230
0231 3. Detailed specification
0232
0233 3.1. Overall conventions In the diagrams below, a box like this:
0234
0235 +---+
0236 | | <-- the vertical bars might be missing
0237 +---+
0238
0239 represents one byte; a box like this:
0240
0241 +==============+
0242 | |
0243 +==============+
0244
0245 represents a variable number of bytes.
0246
0247 Bytes stored within a computer do not have a "bit order", since
0248 they are always treated as a unit. However, a byte considered as
0249 an integer between 0 and 255 does have a most- and least-
0250 significant bit, and since we write numbers with the most-
0251 significant digit on the left, we also write bytes with the most-
0252 significant bit on the left. In the diagrams below, we number the
0253 bits of a byte so that bit 0 is the least-significant bit, i.e.,
0254 the bits are numbered:
0255
0256 +--------+
0257 |76543210|
0258 +--------+
0259
0260 Within a computer, a number may occupy multiple bytes. All
0261 multi-byte numbers in the format described here are stored with
0262 the least-significant byte first (at the lower memory address).
0263 For example, the decimal number 520 is stored as:
0264
0265 0 1
0266 +--------+--------+
0267 |00001000|00000010|
0268 +--------+--------+
0269 ^ ^
0270 | |
0271 | + more significant byte = 2 x 256
0272 + less significant byte = 8
0273
0274 3.1.1. Packing into bytes
0275
0276 This document does not address the issue of the order in which
0277 bits of a byte are transmitted on a bit-sequential medium,
0278 since the final data format described here is byte- rather than
0279
0280
0281
0282 Deutsch Informational [Page 5]
0283
0284 RFC 1951 DEFLATE Compressed Data Format Specification May 1996
0285
0286
0287 bit-oriented. However, we describe the compressed block format
0288 in below, as a sequence of data elements of various bit
0289 lengths, not a sequence of bytes. We must therefore specify
0290 how to pack these data elements into bytes to form the final
0291 compressed byte sequence:
0292
0293 * Data elements are packed into bytes in order of
0294 increasing bit number within the byte, i.e., starting
0295 with the least-significant bit of the byte.
0296 * Data elements other than Huffman codes are packed
0297 starting with the least-significant bit of the data
0298 element.
0299 * Huffman codes are packed starting with the most-
0300 significant bit of the code.
0301
0302 In other words, if one were to print out the compressed data as
0303 a sequence of bytes, starting with the first byte at the
0304 *right* margin and proceeding to the *left*, with the most-
0305 significant bit of each byte on the left as usual, one would be
0306 able to parse the result from right to left, with fixed-width
0307 elements in the correct MSB-to-LSB order and Huffman codes in
0308 bit-reversed order (i.e., with the first bit of the code in the
0309 relative LSB position).
0310
0311 3.2. Compressed block format
0312
0313 3.2.1. Synopsis of prefix and Huffman coding
0314
0315 Prefix coding represents symbols from an a priori known
0316 alphabet by bit sequences (codes), one code for each symbol, in
0317 a manner such that different symbols may be represented by bit
0318 sequences of different lengths, but a parser can always parse
0319 an encoded string unambiguously symbol-by-symbol.
0320
0321 We define a prefix code in terms of a binary tree in which the
0322 two edges descending from each non-leaf node are labeled 0 and
0323 1 and in which the leaf nodes correspond one-for-one with (are
0324 labeled with) the symbols of the alphabet; then the code for a
0325 symbol is the sequence of 0's and 1's on the edges leading from
0326 the root to the leaf labeled with that symbol. For example:
0327
0328
0329
0330
0331
0332
0333
0334
0335
0336
0337
0338 Deutsch Informational [Page 6]
0339
0340 RFC 1951 DEFLATE Compressed Data Format Specification May 1996
0341
0342
0343 /\ Symbol Code
0344 0 1 ------ ----
0345 / \ A 00
0346 /\ B B 1
0347 0 1 C 011
0348 / \ D 010
0349 A /\
0350 0 1
0351 / \
0352 D C
0353
0354 A parser can decode the next symbol from an encoded input
0355 stream by walking down the tree from the root, at each step
0356 choosing the edge corresponding to the next input bit.
0357
0358 Given an alphabet with known symbol frequencies, the Huffman
0359 algorithm allows the construction of an optimal prefix code
0360 (one which represents strings with those symbol frequencies
0361 using the fewest bits of any possible prefix codes for that
0362 alphabet). Such a code is called a Huffman code. (See
0363 reference [1] in Chapter 5, references for additional
0364 information on Huffman codes.)
0365
0366 Note that in the "deflate" format, the Huffman codes for the
0367 various alphabets must not exceed certain maximum code lengths.
0368 This constraint complicates the algorithm for computing code
0369 lengths from symbol frequencies. Again, see Chapter 5,
0370 references for details.
0371
0372 3.2.2. Use of Huffman coding in the "deflate" format
0373
0374 The Huffman codes used for each alphabet in the "deflate"
0375 format have two additional rules:
0376
0377 * All codes of a given bit length have lexicographically
0378 consecutive values, in the same order as the symbols
0379 they represent;
0380
0381 * Shorter codes lexicographically precede longer codes.
0382
0383
0384
0385
0386
0387
0388
0389
0390
0391
0392
0393
0394 Deutsch Informational [Page 7]
0395
0396 RFC 1951 DEFLATE Compressed Data Format Specification May 1996
0397
0398
0399 We could recode the example above to follow this rule as
0400 follows, assuming that the order of the alphabet is ABCD:
0401
0402 Symbol Code
0403 ------ ----
0404 A 10
0405 B 0
0406 C 110
0407 D 111
0408
0409 I.e., 0 precedes 10 which precedes 11x, and 110 and 111 are
0410 lexicographically consecutive.
0411
0412 Given this rule, we can define the Huffman code for an alphabet
0413 just by giving the bit lengths of the codes for each symbol of
0414 the alphabet in order; this is sufficient to determine the
0415 actual codes. In our example, the code is completely defined
0416 by the sequence of bit lengths (2, 1, 3, 3). The following
0417 algorithm generates the codes as integers, intended to be read
0418 from most- to least-significant bit. The code lengths are
0419 initially in tree[I].Len; the codes are produced in
0420 tree[I].Code.
0421
0422 1) Count the number of codes for each code length. Let
0423 bl_count[N] be the number of codes of length N, N >= 1.
0424
0425 2) Find the numerical value of the smallest code for each
0426 code length:
0427
0428 code = 0;
0429 bl_count[0] = 0;
0430 for (bits = 1; bits <= MAX_BITS; bits++) {
0431 code = (code + bl_count[bits-1]) << 1;
0432 next_code[bits] = code;
0433 }
0434
0435 3) Assign numerical values to all codes, using consecutive
0436 values for all codes of the same length with the base
0437 values determined at step 2. Codes that are never used
0438 (which have a bit length of zero) must not be assigned a
0439 value.
0440
0441 for (n = 0; n <= max_code; n++) {
0442 len = tree[n].Len;
0443 if (len != 0) {
0444 tree[n].Code = next_code[len];
0445 next_code[len]++;
0446 }
0447
0448
0449
0450 Deutsch Informational [Page 8]
0451
0452 RFC 1951 DEFLATE Compressed Data Format Specification May 1996
0453
0454
0455 }
0456
0457 Example:
0458
0459 Consider the alphabet ABCDEFGH, with bit lengths (3, 3, 3, 3,
0460 3, 2, 4, 4). After step 1, we have:
0461
0462 N bl_count[N]
0463 - -----------
0464 2 1
0465 3 5
0466 4 2
0467
0468 Step 2 computes the following next_code values:
0469
0470 N next_code[N]
0471 - ------------
0472 1 0
0473 2 0
0474 3 2
0475 4 14
0476
0477 Step 3 produces the following code values:
0478
0479 Symbol Length Code
0480 ------ ------ ----
0481 A 3 010
0482 B 3 011
0483 C 3 100
0484 D 3 101
0485 E 3 110
0486 F 2 00
0487 G 4 1110
0488 H 4 1111
0489
0490 3.2.3. Details of block format
0491
0492 Each block of compressed data begins with 3 header bits
0493 containing the following data:
0494
0495 first bit BFINAL
0496 next 2 bits BTYPE
0497
0498 Note that the header bits do not necessarily begin on a byte
0499 boundary, since a block does not necessarily occupy an integral
0500 number of bytes.
0501
0502
0503
0504
0505
0506 Deutsch Informational [Page 9]
0507
0508 RFC 1951 DEFLATE Compressed Data Format Specification May 1996
0509
0510
0511 BFINAL is set if and only if this is the last block of the data
0512 set.
0513
0514 BTYPE specifies how the data are compressed, as follows:
0515
0516 00 - no compression
0517 01 - compressed with fixed Huffman codes
0518 10 - compressed with dynamic Huffman codes
0519 11 - reserved (error)
0520
0521 The only difference between the two compressed cases is how the
0522 Huffman codes for the literal/length and distance alphabets are
0523 defined.
0524
0525 In all cases, the decoding algorithm for the actual data is as
0526 follows:
0527
0528 do
0529 read block header from input stream.
0530 if stored with no compression
0531 skip any remaining bits in current partially
0532 processed byte
0533 read LEN and NLEN (see next section)
0534 copy LEN bytes of data to output
0535 otherwise
0536 if compressed with dynamic Huffman codes
0537 read representation of code trees (see
0538 subsection below)
0539 loop (until end of block code recognized)
0540 decode literal/length value from input stream
0541 if value < 256
0542 copy value (literal byte) to output stream
0543 otherwise
0544 if value = end of block (256)
0545 break from loop
0546 otherwise (value = 257..285)
0547 decode distance from input stream
0548
0549 move backwards distance bytes in the output
0550 stream, and copy length bytes from this
0551 position to the output stream.
0552 end loop
0553 while not last block
0554
0555 Note that a duplicated string reference may refer to a string
0556 in a previous block; i.e., the backward distance may cross one
0557 or more block boundaries. However a distance cannot refer past
0558 the beginning of the output stream. (An application using a
0559
0560
0561
0562 Deutsch Informational [Page 10]
0563
0564 RFC 1951 DEFLATE Compressed Data Format Specification May 1996
0565
0566
0567 preset dictionary might discard part of the output stream; a
0568 distance can refer to that part of the output stream anyway)
0569 Note also that the referenced string may overlap the current
0570 position; for example, if the last 2 bytes decoded have values
0571 X and Y, a string reference with <length = 5, distance = 2>
0572 adds X,Y,X,Y,X to the output stream.
0573
0574 We now specify each compression method in turn.
0575
0576 3.2.4. Non-compressed blocks (BTYPE=00)
0577
0578 Any bits of input up to the next byte boundary are ignored.
0579 The rest of the block consists of the following information:
0580
0581 0 1 2 3 4...
0582 +---+---+---+---+================================+
0583 | LEN | NLEN |... LEN bytes of literal data...|
0584 +---+---+---+---+================================+
0585
0586 LEN is the number of data bytes in the block. NLEN is the
0587 one's complement of LEN.
0588
0589 3.2.5. Compressed blocks (length and distance codes)
0590
0591 As noted above, encoded data blocks in the "deflate" format
0592 consist of sequences of symbols drawn from three conceptually
0593 distinct alphabets: either literal bytes, from the alphabet of
0594 byte values (0..255), or <length, backward distance> pairs,
0595 where the length is drawn from (3..258) and the distance is
0596 drawn from (1..32,768). In fact, the literal and length
0597 alphabets are merged into a single alphabet (0..285), where
0598 values 0..255 represent literal bytes, the value 256 indicates
0599 end-of-block, and values 257..285 represent length codes
0600 (possibly in conjunction with extra bits following the symbol
0601 code) as follows:
0602
0603
0604
0605
0606
0607
0608
0609
0610
0611
0612
0613
0614
0615
0616
0617
0618 Deutsch Informational [Page 11]
0619
0620 RFC 1951 DEFLATE Compressed Data Format Specification May 1996
0621
0622
0623 Extra Extra Extra
0624 Code Bits Length(s) Code Bits Lengths Code Bits Length(s)
0625 ---- ---- ------ ---- ---- ------- ---- ---- -------
0626 257 0 3 267 1 15,16 277 4 67-82
0627 258 0 4 268 1 17,18 278 4 83-98
0628 259 0 5 269 2 19-22 279 4 99-114
0629 260 0 6 270 2 23-26 280 4 115-130
0630 261 0 7 271 2 27-30 281 5 131-162
0631 262 0 8 272 2 31-34 282 5 163-194
0632 263 0 9 273 3 35-42 283 5 195-226
0633 264 0 10 274 3 43-50 284 5 227-257
0634 265 1 11,12 275 3 51-58 285 0 258
0635 266 1 13,14 276 3 59-66
0636
0637 The extra bits should be interpreted as a machine integer
0638 stored with the most-significant bit first, e.g., bits 1110
0639 represent the value 14.
0640
0641 Extra Extra Extra
0642 Code Bits Dist Code Bits Dist Code Bits Distance
0643 ---- ---- ---- ---- ---- ------ ---- ---- --------
0644 0 0 1 10 4 33-48 20 9 1025-1536
0645 1 0 2 11 4 49-64 21 9 1537-2048
0646 2 0 3 12 5 65-96 22 10 2049-3072
0647 3 0 4 13 5 97-128 23 10 3073-4096
0648 4 1 5,6 14 6 129-192 24 11 4097-6144
0649 5 1 7,8 15 6 193-256 25 11 6145-8192
0650 6 2 9-12 16 7 257-384 26 12 8193-12288
0651 7 2 13-16 17 7 385-512 27 12 12289-16384
0652 8 3 17-24 18 8 513-768 28 13 16385-24576
0653 9 3 25-32 19 8 769-1024 29 13 24577-32768
0654
0655 3.2.6. Compression with fixed Huffman codes (BTYPE=01)
0656
0657 The Huffman codes for the two alphabets are fixed, and are not
0658 represented explicitly in the data. The Huffman code lengths
0659 for the literal/length alphabet are:
0660
0661 Lit Value Bits Codes
0662 --------- ---- -----
0663 0 - 143 8 00110000 through
0664 10111111
0665 144 - 255 9 110010000 through
0666 111111111
0667 256 - 279 7 0000000 through
0668 0010111
0669 280 - 287 8 11000000 through
0670 11000111
0671
0672
0673
0674 Deutsch Informational [Page 12]
0675
0676 RFC 1951 DEFLATE Compressed Data Format Specification May 1996
0677
0678
0679 The code lengths are sufficient to generate the actual codes,
0680 as described above; we show the codes in the table for added
0681 clarity. Literal/length values 286-287 will never actually
0682 occur in the compressed data, but participate in the code
0683 construction.
0684
0685 Distance codes 0-31 are represented by (fixed-length) 5-bit
0686 codes, with possible additional bits as shown in the table
0687 shown in Paragraph 3.2.5, above. Note that distance codes 30-
0688 31 will never actually occur in the compressed data.
0689
0690 3.2.7. Compression with dynamic Huffman codes (BTYPE=10)
0691
0692 The Huffman codes for the two alphabets appear in the block
0693 immediately after the header bits and before the actual
0694 compressed data, first the literal/length code and then the
0695 distance code. Each code is defined by a sequence of code
0696 lengths, as discussed in Paragraph 3.2.2, above. For even
0697 greater compactness, the code length sequences themselves are
0698 compressed using a Huffman code. The alphabet for code lengths
0699 is as follows:
0700
0701 0 - 15: Represent code lengths of 0 - 15
0702 16: Copy the previous code length 3 - 6 times.
0703 The next 2 bits indicate repeat length
0704 (0 = 3, ... , 3 = 6)
0705 Example: Codes 8, 16 (+2 bits 11),
0706 16 (+2 bits 10) will expand to
0707 12 code lengths of 8 (1 + 6 + 5)
0708 17: Repeat a code length of 0 for 3 - 10 times.
0709 (3 bits of length)
0710 18: Repeat a code length of 0 for 11 - 138 times
0711 (7 bits of length)
0712
0713 A code length of 0 indicates that the corresponding symbol in
0714 the literal/length or distance alphabet will not occur in the
0715 block, and should not participate in the Huffman code
0716 construction algorithm given earlier. If only one distance
0717 code is used, it is encoded using one bit, not zero bits; in
0718 this case there is a single code length of one, with one unused
0719 code. One distance code of zero bits means that there are no
0720 distance codes used at all (the data is all literals).
0721
0722 We can now define the format of the block:
0723
0724 5 Bits: HLIT, # of Literal/Length codes - 257 (257 - 286)
0725 5 Bits: HDIST, # of Distance codes - 1 (1 - 32)
0726 4 Bits: HCLEN, # of Code Length codes - 4 (4 - 19)
0727
0728
0729
0730 Deutsch Informational [Page 13]
0731
0732 RFC 1951 DEFLATE Compressed Data Format Specification May 1996
0733
0734
0735 (HCLEN + 4) x 3 bits: code lengths for the code length
0736 alphabet given just above, in the order: 16, 17, 18,
0737 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15
0738
0739 These code lengths are interpreted as 3-bit integers
0740 (0-7); as above, a code length of 0 means the
0741 corresponding symbol (literal/length or distance code
0742 length) is not used.
0743
0744 HLIT + 257 code lengths for the literal/length alphabet,
0745 encoded using the code length Huffman code
0746
0747 HDIST + 1 code lengths for the distance alphabet,
0748 encoded using the code length Huffman code
0749
0750 The actual compressed data of the block,
0751 encoded using the literal/length and distance Huffman
0752 codes
0753
0754 The literal/length symbol 256 (end of data),
0755 encoded using the literal/length Huffman code
0756
0757 The code length repeat codes can cross from HLIT + 257 to the
0758 HDIST + 1 code lengths. In other words, all code lengths form
0759 a single sequence of HLIT + HDIST + 258 values.
0760
0761 3.3. Compliance
0762
0763 A compressor may limit further the ranges of values specified in
0764 the previous section and still be compliant; for example, it may
0765 limit the range of backward pointers to some value smaller than
0766 32K. Similarly, a compressor may limit the size of blocks so that
0767 a compressible block fits in memory.
0768
0769 A compliant decompressor must accept the full range of possible
0770 values defined in the previous section, and must accept blocks of
0771 arbitrary size.
0772
0773 4. Compression algorithm details
0774
0775 While it is the intent of this document to define the "deflate"
0776 compressed data format without reference to any particular
0777 compression algorithm, the format is related to the compressed
0778 formats produced by LZ77 (Lempel-Ziv 1977, see reference [2] below);
0779 since many variations of LZ77 are patented, it is strongly
0780 recommended that the implementor of a compressor follow the general
0781 algorithm presented here, which is known not to be patented per se.
0782 The material in this section is not part of the definition of the
0783
0784
0785
0786 Deutsch Informational [Page 14]
0787
0788 RFC 1951 DEFLATE Compressed Data Format Specification May 1996
0789
0790
0791 specification per se, and a compressor need not follow it in order to
0792 be compliant.
0793
0794 The compressor terminates a block when it determines that starting a
0795 new block with fresh trees would be useful, or when the block size
0796 fills up the compressor's block buffer.
0797
0798 The compressor uses a chained hash table to find duplicated strings,
0799 using a hash function that operates on 3-byte sequences. At any
0800 given point during compression, let XYZ be the next 3 input bytes to
0801 be examined (not necessarily all different, of course). First, the
0802 compressor examines the hash chain for XYZ. If the chain is empty,
0803 the compressor simply writes out X as a literal byte and advances one
0804 byte in the input. If the hash chain is not empty, indicating that
0805 the sequence XYZ (or, if we are unlucky, some other 3 bytes with the
0806 same hash function value) has occurred recently, the compressor
0807 compares all strings on the XYZ hash chain with the actual input data
0808 sequence starting at the current point, and selects the longest
0809 match.
0810
0811 The compressor searches the hash chains starting with the most recent
0812 strings, to favor small distances and thus take advantage of the
0813 Huffman encoding. The hash chains are singly linked. There are no
0814 deletions from the hash chains; the algorithm simply discards matches
0815 that are too old. To avoid a worst-case situation, very long hash
0816 chains are arbitrarily truncated at a certain length, determined by a
0817 run-time parameter.
0818
0819 To improve overall compression, the compressor optionally defers the
0820 selection of matches ("lazy matching"): after a match of length N has
0821 been found, the compressor searches for a longer match starting at
0822 the next input byte. If it finds a longer match, it truncates the
0823 previous match to a length of one (thus producing a single literal
0824 byte) and then emits the longer match. Otherwise, it emits the
0825 original match, and, as described above, advances N bytes before
0826 continuing.
0827
0828 Run-time parameters also control this "lazy match" procedure. If
0829 compression ratio is most important, the compressor attempts a
0830 complete second search regardless of the length of the first match.
0831 In the normal case, if the current match is "long enough", the
0832 compressor reduces the search for a longer match, thus speeding up
0833 the process. If speed is most important, the compressor inserts new
0834 strings in the hash table only when no match was found, or when the
0835 match is not "too long". This degrades the compression ratio but
0836 saves time since there are both fewer insertions and fewer searches.
0837
0838
0839
0840
0841
0842 Deutsch Informational [Page 15]
0843
0844 RFC 1951 DEFLATE Compressed Data Format Specification May 1996
0845
0846
0847 5. References
0848
0849 [1] Huffman, D. A., "A Method for the Construction of Minimum
0850 Redundancy Codes", Proceedings of the Institute of Radio
0851 Engineers, September 1952, Volume 40, Number 9, pp. 1098-1101.
0852
0853 [2] Ziv J., Lempel A., "A Universal Algorithm for Sequential Data
0854 Compression", IEEE Transactions on Information Theory, Vol. 23,
0855 No. 3, pp. 337-343.
0856
0857 [3] Gailly, J.-L., and Adler, M., ZLIB documentation and sources,
0858 available in ftp://ftp.uu.net/pub/archiving/zip/doc/
0859
0860 [4] Gailly, J.-L., and Adler, M., GZIP documentation and sources,
0861 available as gzip-*.tar in ftp://prep.ai.mit.edu/pub/gnu/
0862
0863 [5] Schwartz, E. S., and Kallick, B. "Generating a canonical prefix
0864 encoding." Comm. ACM, 7,3 (Mar. 1964), pp. 166-169.
0865
0866 [6] Hirschberg and Lelewer, "Efficient decoding of prefix codes,"
0867 Comm. ACM, 33,4, April 1990, pp. 449-459.
0868
0869 6. Security Considerations
0870
0871 Any data compression method involves the reduction of redundancy in
0872 the data. Consequently, any corruption of the data is likely to have
0873 severe effects and be difficult to correct. Uncompressed text, on
0874 the other hand, will probably still be readable despite the presence
0875 of some corrupted bytes.
0876
0877 It is recommended that systems using this data format provide some
0878 means of validating the integrity of the compressed data. See
0879 reference [3], for example.
0880
0881 7. Source code
0882
0883 Source code for a C language implementation of a "deflate" compliant
0884 compressor and decompressor is available within the zlib package at
0885 ftp://ftp.uu.net/pub/archiving/zip/zlib/.
0886
0887 8. Acknowledgements
0888
0889 Trademarks cited in this document are the property of their
0890 respective owners.
0891
0892 Phil Katz designed the deflate format. Jean-Loup Gailly and Mark
0893 Adler wrote the related software described in this specification.
0894 Glenn Randers-Pehrson converted this document to RFC and HTML format.
0895
0896
0897
0898 Deutsch Informational [Page 16]
0899
0900 RFC 1951 DEFLATE Compressed Data Format Specification May 1996
0901
0902
0903 9. Author's Address
0904
0905 L. Peter Deutsch
0906 Aladdin Enterprises
0907 203 Santa Margarita Ave.
0908 Menlo Park, CA 94025
0909
0910 Phone: (415) 322-0103 (AM only)
0911 FAX: (415) 322-1734
0912 EMail: <ghost@aladdin.com>
0913
0914 Questions about the technical content of this specification can be
0915 sent by email to:
0916
0917 Jean-Loup Gailly <gzip@prep.ai.mit.edu> and
0918 Mark Adler <madler@alumni.caltech.edu>
0919
0920 Editorial comments on this specification can be sent by email to:
0921
0922 L. Peter Deutsch <ghost@aladdin.com> and
0923 Glenn Randers-Pehrson <randeg@alumni.rpi.edu>
0924
0925
0926
0927
0928
0929
0930
0931
0932
0933
0934
0935
0936
0937
0938
0939
0940
0941
0942
0943
0944
0945
0946
0947
0948
0949
0950
0951
0952
0953
0954 Deutsch Informational [Page 17]
0955