Back to home page

LXR

 
 

    


Warning, /cpukit/compression/zlib/doc/rfc1951.txt is written in an unsupported language. File is not indexed.

0001 
0002 
0003 
0004 
0005 
0006 
0007 Network Working Group                                         P. Deutsch
0008 Request for Comments: 1951                           Aladdin Enterprises
0009 Category: Informational                                         May 1996
0010 
0011 
0012         DEFLATE Compressed Data Format Specification version 1.3
0013 
0014 Status of This Memo
0015 
0016    This memo provides information for the Internet community.  This memo
0017    does not specify an Internet standard of any kind.  Distribution of
0018    this memo is unlimited.
0019 
0020 IESG Note:
0021 
0022    The IESG takes no position on the validity of any Intellectual
0023    Property Rights statements contained in this document.
0024 
0025 Notices
0026 
0027    Copyright (c) 1996 L. Peter Deutsch
0028 
0029    Permission is granted to copy and distribute this document for any
0030    purpose and without charge, including translations into other
0031    languages and incorporation into compilations, provided that the
0032    copyright notice and this notice are preserved, and that any
0033    substantive changes or deletions from the original are clearly
0034    marked.
0035 
0036    A pointer to the latest version of this and related documentation in
0037    HTML format can be found at the URL
0038    <ftp://ftp.uu.net/graphics/png/documents/zlib/zdoc-index.html>.
0039 
0040 Abstract
0041 
0042    This specification defines a lossless compressed data format that
0043    compresses data using a combination of the LZ77 algorithm and Huffman
0044    coding, with efficiency comparable to the best currently available
0045    general-purpose compression methods.  The data can be produced or
0046    consumed, even for an arbitrarily long sequentially presented input
0047    data stream, using only an a priori bounded amount of intermediate
0048    storage.  The format can be implemented readily in a manner not
0049    covered by patents.
0050 
0051 
0052 
0053 
0054 
0055 
0056 
0057 
0058 Deutsch                      Informational                      [Page 1]
0059 
0060 RFC 1951      DEFLATE Compressed Data Format Specification      May 1996
0061 
0062 
0063 Table of Contents
0064 
0065    1. Introduction ................................................... 2
0066       1.1. Purpose ................................................... 2
0067       1.2. Intended audience ......................................... 3
0068       1.3. Scope ..................................................... 3
0069       1.4. Compliance ................................................ 3
0070       1.5.  Definitions of terms and conventions used ................ 3
0071       1.6. Changes from previous versions ............................ 4
0072    2. Compressed representation overview ............................. 4
0073    3. Detailed specification ......................................... 5
0074       3.1. Overall conventions ....................................... 5
0075           3.1.1. Packing into bytes .................................. 5
0076       3.2. Compressed block format ................................... 6
0077           3.2.1. Synopsis of prefix and Huffman coding ............... 6
0078           3.2.2. Use of Huffman coding in the "deflate" format ....... 7
0079           3.2.3. Details of block format ............................. 9
0080           3.2.4. Non-compressed blocks (BTYPE=00) ................... 11
0081           3.2.5. Compressed blocks (length and distance codes) ...... 11
0082           3.2.6. Compression with fixed Huffman codes (BTYPE=01) .... 12
0083           3.2.7. Compression with dynamic Huffman codes (BTYPE=10) .. 13
0084       3.3. Compliance ............................................... 14
0085    4. Compression algorithm details ................................. 14
0086    5. References .................................................... 16
0087    6. Security Considerations ....................................... 16
0088    7. Source code ................................................... 16
0089    8. Acknowledgements .............................................. 16
0090    9. Author's Address .............................................. 17
0091 
0092 1. Introduction
0093 
0094    1.1. Purpose
0095 
0096       The purpose of this specification is to define a lossless
0097       compressed data format that:
0098           * Is independent of CPU type, operating system, file system,
0099             and character set, and hence can be used for interchange;
0100           * Can be produced or consumed, even for an arbitrarily long
0101             sequentially presented input data stream, using only an a
0102             priori bounded amount of intermediate storage, and hence
0103             can be used in data communications or similar structures
0104             such as Unix filters;
0105           * Compresses data with efficiency comparable to the best
0106             currently available general-purpose compression methods,
0107             and in particular considerably better than the "compress"
0108             program;
0109           * Can be implemented readily in a manner not covered by
0110             patents, and hence can be practiced freely;
0111 
0112 
0113 
0114 Deutsch                      Informational                      [Page 2]
0115 
0116 RFC 1951      DEFLATE Compressed Data Format Specification      May 1996
0117 
0118 
0119           * Is compatible with the file format produced by the current
0120             widely used gzip utility, in that conforming decompressors
0121             will be able to read data produced by the existing gzip
0122             compressor.
0123 
0124       The data format defined by this specification does not attempt to:
0125 
0126           * Allow random access to compressed data;
0127           * Compress specialized data (e.g., raster graphics) as well
0128             as the best currently available specialized algorithms.
0129 
0130       A simple counting argument shows that no lossless compression
0131       algorithm can compress every possible input data set.  For the
0132       format defined here, the worst case expansion is 5 bytes per 32K-
0133       byte block, i.e., a size increase of 0.015% for large data sets.
0134       English text usually compresses by a factor of 2.5 to 3;
0135       executable files usually compress somewhat less; graphical data
0136       such as raster images may compress much more.
0137 
0138    1.2. Intended audience
0139 
0140       This specification is intended for use by implementors of software
0141       to compress data into "deflate" format and/or decompress data from
0142       "deflate" format.
0143 
0144       The text of the specification assumes a basic background in
0145       programming at the level of bits and other primitive data
0146       representations.  Familiarity with the technique of Huffman coding
0147       is helpful but not required.
0148 
0149    1.3. Scope
0150 
0151       The specification specifies a method for representing a sequence
0152       of bytes as a (usually shorter) sequence of bits, and a method for
0153       packing the latter bit sequence into bytes.
0154 
0155    1.4. Compliance
0156 
0157       Unless otherwise indicated below, a compliant decompressor must be
0158       able to accept and decompress any data set that conforms to all
0159       the specifications presented here; a compliant compressor must
0160       produce data sets that conform to all the specifications presented
0161       here.
0162 
0163    1.5.  Definitions of terms and conventions used
0164 
0165       Byte: 8 bits stored or transmitted as a unit (same as an octet).
0166       For this specification, a byte is exactly 8 bits, even on machines
0167 
0168 
0169 
0170 Deutsch                      Informational                      [Page 3]
0171 
0172 RFC 1951      DEFLATE Compressed Data Format Specification      May 1996
0173 
0174 
0175       which store a character on a number of bits different from eight.
0176       See below, for the numbering of bits within a byte.
0177 
0178       String: a sequence of arbitrary bytes.
0179 
0180    1.6. Changes from previous versions
0181 
0182       There have been no technical changes to the deflate format since
0183       version 1.1 of this specification.  In version 1.2, some
0184       terminology was changed.  Version 1.3 is a conversion of the
0185       specification to RFC style.
0186 
0187 2. Compressed representation overview
0188 
0189    A compressed data set consists of a series of blocks, corresponding
0190    to successive blocks of input data.  The block sizes are arbitrary,
0191    except that non-compressible blocks are limited to 65,535 bytes.
0192 
0193    Each block is compressed using a combination of the LZ77 algorithm
0194    and Huffman coding. The Huffman trees for each block are independent
0195    of those for previous or subsequent blocks; the LZ77 algorithm may
0196    use a reference to a duplicated string occurring in a previous block,
0197    up to 32K input bytes before.
0198 
0199    Each block consists of two parts: a pair of Huffman code trees that
0200    describe the representation of the compressed data part, and a
0201    compressed data part.  (The Huffman trees themselves are compressed
0202    using Huffman encoding.)  The compressed data consists of a series of
0203    elements of two types: literal bytes (of strings that have not been
0204    detected as duplicated within the previous 32K input bytes), and
0205    pointers to duplicated strings, where a pointer is represented as a
0206    pair <length, backward distance>.  The representation used in the
0207    "deflate" format limits distances to 32K bytes and lengths to 258
0208    bytes, but does not limit the size of a block, except for
0209    uncompressible blocks, which are limited as noted above.
0210 
0211    Each type of value (literals, distances, and lengths) in the
0212    compressed data is represented using a Huffman code, using one code
0213    tree for literals and lengths and a separate code tree for distances.
0214    The code trees for each block appear in a compact form just before
0215    the compressed data for that block.
0216 
0217 
0218 
0219 
0220 
0221 
0222 
0223 
0224 
0225 
0226 Deutsch                      Informational                      [Page 4]
0227 
0228 RFC 1951      DEFLATE Compressed Data Format Specification      May 1996
0229 
0230 
0231 3. Detailed specification
0232 
0233    3.1. Overall conventions In the diagrams below, a box like this:
0234 
0235          +---+
0236          |   | <-- the vertical bars might be missing
0237          +---+
0238 
0239       represents one byte; a box like this:
0240 
0241          +==============+
0242          |              |
0243          +==============+
0244 
0245       represents a variable number of bytes.
0246 
0247       Bytes stored within a computer do not have a "bit order", since
0248       they are always treated as a unit.  However, a byte considered as
0249       an integer between 0 and 255 does have a most- and least-
0250       significant bit, and since we write numbers with the most-
0251       significant digit on the left, we also write bytes with the most-
0252       significant bit on the left.  In the diagrams below, we number the
0253       bits of a byte so that bit 0 is the least-significant bit, i.e.,
0254       the bits are numbered:
0255 
0256          +--------+
0257          |76543210|
0258          +--------+
0259 
0260       Within a computer, a number may occupy multiple bytes.  All
0261       multi-byte numbers in the format described here are stored with
0262       the least-significant byte first (at the lower memory address).
0263       For example, the decimal number 520 is stored as:
0264 
0265              0        1
0266          +--------+--------+
0267          |00001000|00000010|
0268          +--------+--------+
0269           ^        ^
0270           |        |
0271           |        + more significant byte = 2 x 256
0272           + less significant byte = 8
0273 
0274       3.1.1. Packing into bytes
0275 
0276          This document does not address the issue of the order in which
0277          bits of a byte are transmitted on a bit-sequential medium,
0278          since the final data format described here is byte- rather than
0279 
0280 
0281 
0282 Deutsch                      Informational                      [Page 5]
0283 
0284 RFC 1951      DEFLATE Compressed Data Format Specification      May 1996
0285 
0286 
0287          bit-oriented.  However, we describe the compressed block format
0288          in below, as a sequence of data elements of various bit
0289          lengths, not a sequence of bytes.  We must therefore specify
0290          how to pack these data elements into bytes to form the final
0291          compressed byte sequence:
0292 
0293              * Data elements are packed into bytes in order of
0294                increasing bit number within the byte, i.e., starting
0295                with the least-significant bit of the byte.
0296              * Data elements other than Huffman codes are packed
0297                starting with the least-significant bit of the data
0298                element.
0299              * Huffman codes are packed starting with the most-
0300                significant bit of the code.
0301 
0302          In other words, if one were to print out the compressed data as
0303          a sequence of bytes, starting with the first byte at the
0304          *right* margin and proceeding to the *left*, with the most-
0305          significant bit of each byte on the left as usual, one would be
0306          able to parse the result from right to left, with fixed-width
0307          elements in the correct MSB-to-LSB order and Huffman codes in
0308          bit-reversed order (i.e., with the first bit of the code in the
0309          relative LSB position).
0310 
0311    3.2. Compressed block format
0312 
0313       3.2.1. Synopsis of prefix and Huffman coding
0314 
0315          Prefix coding represents symbols from an a priori known
0316          alphabet by bit sequences (codes), one code for each symbol, in
0317          a manner such that different symbols may be represented by bit
0318          sequences of different lengths, but a parser can always parse
0319          an encoded string unambiguously symbol-by-symbol.
0320 
0321          We define a prefix code in terms of a binary tree in which the
0322          two edges descending from each non-leaf node are labeled 0 and
0323          1 and in which the leaf nodes correspond one-for-one with (are
0324          labeled with) the symbols of the alphabet; then the code for a
0325          symbol is the sequence of 0's and 1's on the edges leading from
0326          the root to the leaf labeled with that symbol.  For example:
0327 
0328 
0329 
0330 
0331 
0332 
0333 
0334 
0335 
0336 
0337 
0338 Deutsch                      Informational                      [Page 6]
0339 
0340 RFC 1951      DEFLATE Compressed Data Format Specification      May 1996
0341 
0342 
0343                           /\              Symbol    Code
0344                          0  1             ------    ----
0345                         /    \                A      00
0346                        /\     B               B       1
0347                       0  1                    C     011
0348                      /    \                   D     010
0349                     A     /\
0350                          0  1
0351                         /    \
0352                        D      C
0353 
0354          A parser can decode the next symbol from an encoded input
0355          stream by walking down the tree from the root, at each step
0356          choosing the edge corresponding to the next input bit.
0357 
0358          Given an alphabet with known symbol frequencies, the Huffman
0359          algorithm allows the construction of an optimal prefix code
0360          (one which represents strings with those symbol frequencies
0361          using the fewest bits of any possible prefix codes for that
0362          alphabet).  Such a code is called a Huffman code.  (See
0363          reference [1] in Chapter 5, references for additional
0364          information on Huffman codes.)
0365 
0366          Note that in the "deflate" format, the Huffman codes for the
0367          various alphabets must not exceed certain maximum code lengths.
0368          This constraint complicates the algorithm for computing code
0369          lengths from symbol frequencies.  Again, see Chapter 5,
0370          references for details.
0371 
0372       3.2.2. Use of Huffman coding in the "deflate" format
0373 
0374          The Huffman codes used for each alphabet in the "deflate"
0375          format have two additional rules:
0376 
0377              * All codes of a given bit length have lexicographically
0378                consecutive values, in the same order as the symbols
0379                they represent;
0380 
0381              * Shorter codes lexicographically precede longer codes.
0382 
0383 
0384 
0385 
0386 
0387 
0388 
0389 
0390 
0391 
0392 
0393 
0394 Deutsch                      Informational                      [Page 7]
0395 
0396 RFC 1951      DEFLATE Compressed Data Format Specification      May 1996
0397 
0398 
0399          We could recode the example above to follow this rule as
0400          follows, assuming that the order of the alphabet is ABCD:
0401 
0402             Symbol  Code
0403             ------  ----
0404             A       10
0405             B       0
0406             C       110
0407             D       111
0408 
0409          I.e., 0 precedes 10 which precedes 11x, and 110 and 111 are
0410          lexicographically consecutive.
0411 
0412          Given this rule, we can define the Huffman code for an alphabet
0413          just by giving the bit lengths of the codes for each symbol of
0414          the alphabet in order; this is sufficient to determine the
0415          actual codes.  In our example, the code is completely defined
0416          by the sequence of bit lengths (2, 1, 3, 3).  The following
0417          algorithm generates the codes as integers, intended to be read
0418          from most- to least-significant bit.  The code lengths are
0419          initially in tree[I].Len; the codes are produced in
0420          tree[I].Code.
0421 
0422          1)  Count the number of codes for each code length.  Let
0423              bl_count[N] be the number of codes of length N, N >= 1.
0424 
0425          2)  Find the numerical value of the smallest code for each
0426              code length:
0427 
0428                 code = 0;
0429                 bl_count[0] = 0;
0430                 for (bits = 1; bits <= MAX_BITS; bits++) {
0431                     code = (code + bl_count[bits-1]) << 1;
0432                     next_code[bits] = code;
0433                 }
0434 
0435          3)  Assign numerical values to all codes, using consecutive
0436              values for all codes of the same length with the base
0437              values determined at step 2. Codes that are never used
0438              (which have a bit length of zero) must not be assigned a
0439              value.
0440 
0441                 for (n = 0;  n <= max_code; n++) {
0442                     len = tree[n].Len;
0443                     if (len != 0) {
0444                         tree[n].Code = next_code[len];
0445                         next_code[len]++;
0446                     }
0447 
0448 
0449 
0450 Deutsch                      Informational                      [Page 8]
0451 
0452 RFC 1951      DEFLATE Compressed Data Format Specification      May 1996
0453 
0454 
0455                 }
0456 
0457          Example:
0458 
0459          Consider the alphabet ABCDEFGH, with bit lengths (3, 3, 3, 3,
0460          3, 2, 4, 4).  After step 1, we have:
0461 
0462             N      bl_count[N]
0463             -      -----------
0464             2      1
0465             3      5
0466             4      2
0467 
0468          Step 2 computes the following next_code values:
0469 
0470             N      next_code[N]
0471             -      ------------
0472             1      0
0473             2      0
0474             3      2
0475             4      14
0476 
0477          Step 3 produces the following code values:
0478 
0479             Symbol Length   Code
0480             ------ ------   ----
0481             A       3        010
0482             B       3        011
0483             C       3        100
0484             D       3        101
0485             E       3        110
0486             F       2         00
0487             G       4       1110
0488             H       4       1111
0489 
0490       3.2.3. Details of block format
0491 
0492          Each block of compressed data begins with 3 header bits
0493          containing the following data:
0494 
0495             first bit       BFINAL
0496             next 2 bits     BTYPE
0497 
0498          Note that the header bits do not necessarily begin on a byte
0499          boundary, since a block does not necessarily occupy an integral
0500          number of bytes.
0501 
0502 
0503 
0504 
0505 
0506 Deutsch                      Informational                      [Page 9]
0507 
0508 RFC 1951      DEFLATE Compressed Data Format Specification      May 1996
0509 
0510 
0511          BFINAL is set if and only if this is the last block of the data
0512          set.
0513 
0514          BTYPE specifies how the data are compressed, as follows:
0515 
0516             00 - no compression
0517             01 - compressed with fixed Huffman codes
0518             10 - compressed with dynamic Huffman codes
0519             11 - reserved (error)
0520 
0521          The only difference between the two compressed cases is how the
0522          Huffman codes for the literal/length and distance alphabets are
0523          defined.
0524 
0525          In all cases, the decoding algorithm for the actual data is as
0526          follows:
0527 
0528             do
0529                read block header from input stream.
0530                if stored with no compression
0531                   skip any remaining bits in current partially
0532                      processed byte
0533                   read LEN and NLEN (see next section)
0534                   copy LEN bytes of data to output
0535                otherwise
0536                   if compressed with dynamic Huffman codes
0537                      read representation of code trees (see
0538                         subsection below)
0539                   loop (until end of block code recognized)
0540                      decode literal/length value from input stream
0541                      if value < 256
0542                         copy value (literal byte) to output stream
0543                      otherwise
0544                         if value = end of block (256)
0545                            break from loop
0546                         otherwise (value = 257..285)
0547                            decode distance from input stream
0548 
0549                            move backwards distance bytes in the output
0550                            stream, and copy length bytes from this
0551                            position to the output stream.
0552                   end loop
0553             while not last block
0554 
0555          Note that a duplicated string reference may refer to a string
0556          in a previous block; i.e., the backward distance may cross one
0557          or more block boundaries.  However a distance cannot refer past
0558          the beginning of the output stream.  (An application using a
0559 
0560 
0561 
0562 Deutsch                      Informational                     [Page 10]
0563 
0564 RFC 1951      DEFLATE Compressed Data Format Specification      May 1996
0565 
0566 
0567          preset dictionary might discard part of the output stream; a
0568          distance can refer to that part of the output stream anyway)
0569          Note also that the referenced string may overlap the current
0570          position; for example, if the last 2 bytes decoded have values
0571          X and Y, a string reference with <length = 5, distance = 2>
0572          adds X,Y,X,Y,X to the output stream.
0573 
0574          We now specify each compression method in turn.
0575 
0576       3.2.4. Non-compressed blocks (BTYPE=00)
0577 
0578          Any bits of input up to the next byte boundary are ignored.
0579          The rest of the block consists of the following information:
0580 
0581               0   1   2   3   4...
0582             +---+---+---+---+================================+
0583             |  LEN  | NLEN  |... LEN bytes of literal data...|
0584             +---+---+---+---+================================+
0585 
0586          LEN is the number of data bytes in the block.  NLEN is the
0587          one's complement of LEN.
0588 
0589       3.2.5. Compressed blocks (length and distance codes)
0590 
0591          As noted above, encoded data blocks in the "deflate" format
0592          consist of sequences of symbols drawn from three conceptually
0593          distinct alphabets: either literal bytes, from the alphabet of
0594          byte values (0..255), or <length, backward distance> pairs,
0595          where the length is drawn from (3..258) and the distance is
0596          drawn from (1..32,768).  In fact, the literal and length
0597          alphabets are merged into a single alphabet (0..285), where
0598          values 0..255 represent literal bytes, the value 256 indicates
0599          end-of-block, and values 257..285 represent length codes
0600          (possibly in conjunction with extra bits following the symbol
0601          code) as follows:
0602 
0603 
0604 
0605 
0606 
0607 
0608 
0609 
0610 
0611 
0612 
0613 
0614 
0615 
0616 
0617 
0618 Deutsch                      Informational                     [Page 11]
0619 
0620 RFC 1951      DEFLATE Compressed Data Format Specification      May 1996
0621 
0622 
0623                  Extra               Extra               Extra
0624             Code Bits Length(s) Code Bits Lengths   Code Bits Length(s)
0625             ---- ---- ------     ---- ---- -------   ---- ---- -------
0626              257   0     3       267   1   15,16     277   4   67-82
0627              258   0     4       268   1   17,18     278   4   83-98
0628              259   0     5       269   2   19-22     279   4   99-114
0629              260   0     6       270   2   23-26     280   4  115-130
0630              261   0     7       271   2   27-30     281   5  131-162
0631              262   0     8       272   2   31-34     282   5  163-194
0632              263   0     9       273   3   35-42     283   5  195-226
0633              264   0    10       274   3   43-50     284   5  227-257
0634              265   1  11,12      275   3   51-58     285   0    258
0635              266   1  13,14      276   3   59-66
0636 
0637          The extra bits should be interpreted as a machine integer
0638          stored with the most-significant bit first, e.g., bits 1110
0639          represent the value 14.
0640 
0641                   Extra           Extra               Extra
0642              Code Bits Dist  Code Bits   Dist     Code Bits Distance
0643              ---- ---- ----  ---- ----  ------    ---- ---- --------
0644                0   0    1     10   4     33-48    20    9   1025-1536
0645                1   0    2     11   4     49-64    21    9   1537-2048
0646                2   0    3     12   5     65-96    22   10   2049-3072
0647                3   0    4     13   5     97-128   23   10   3073-4096
0648                4   1   5,6    14   6    129-192   24   11   4097-6144
0649                5   1   7,8    15   6    193-256   25   11   6145-8192
0650                6   2   9-12   16   7    257-384   26   12  8193-12288
0651                7   2  13-16   17   7    385-512   27   12 12289-16384
0652                8   3  17-24   18   8    513-768   28   13 16385-24576
0653                9   3  25-32   19   8   769-1024   29   13 24577-32768
0654 
0655       3.2.6. Compression with fixed Huffman codes (BTYPE=01)
0656 
0657          The Huffman codes for the two alphabets are fixed, and are not
0658          represented explicitly in the data.  The Huffman code lengths
0659          for the literal/length alphabet are:
0660 
0661                    Lit Value    Bits        Codes
0662                    ---------    ----        -----
0663                      0 - 143     8          00110000 through
0664                                             10111111
0665                    144 - 255     9          110010000 through
0666                                             111111111
0667                    256 - 279     7          0000000 through
0668                                             0010111
0669                    280 - 287     8          11000000 through
0670                                             11000111
0671 
0672 
0673 
0674 Deutsch                      Informational                     [Page 12]
0675 
0676 RFC 1951      DEFLATE Compressed Data Format Specification      May 1996
0677 
0678 
0679          The code lengths are sufficient to generate the actual codes,
0680          as described above; we show the codes in the table for added
0681          clarity.  Literal/length values 286-287 will never actually
0682          occur in the compressed data, but participate in the code
0683          construction.
0684 
0685          Distance codes 0-31 are represented by (fixed-length) 5-bit
0686          codes, with possible additional bits as shown in the table
0687          shown in Paragraph 3.2.5, above.  Note that distance codes 30-
0688          31 will never actually occur in the compressed data.
0689 
0690       3.2.7. Compression with dynamic Huffman codes (BTYPE=10)
0691 
0692          The Huffman codes for the two alphabets appear in the block
0693          immediately after the header bits and before the actual
0694          compressed data, first the literal/length code and then the
0695          distance code.  Each code is defined by a sequence of code
0696          lengths, as discussed in Paragraph 3.2.2, above.  For even
0697          greater compactness, the code length sequences themselves are
0698          compressed using a Huffman code.  The alphabet for code lengths
0699          is as follows:
0700 
0701                0 - 15: Represent code lengths of 0 - 15
0702                    16: Copy the previous code length 3 - 6 times.
0703                        The next 2 bits indicate repeat length
0704                              (0 = 3, ... , 3 = 6)
0705                           Example:  Codes 8, 16 (+2 bits 11),
0706                                     16 (+2 bits 10) will expand to
0707                                     12 code lengths of 8 (1 + 6 + 5)
0708                    17: Repeat a code length of 0 for 3 - 10 times.
0709                        (3 bits of length)
0710                    18: Repeat a code length of 0 for 11 - 138 times
0711                        (7 bits of length)
0712 
0713          A code length of 0 indicates that the corresponding symbol in
0714          the literal/length or distance alphabet will not occur in the
0715          block, and should not participate in the Huffman code
0716          construction algorithm given earlier.  If only one distance
0717          code is used, it is encoded using one bit, not zero bits; in
0718          this case there is a single code length of one, with one unused
0719          code.  One distance code of zero bits means that there are no
0720          distance codes used at all (the data is all literals).
0721 
0722          We can now define the format of the block:
0723 
0724                5 Bits: HLIT, # of Literal/Length codes - 257 (257 - 286)
0725                5 Bits: HDIST, # of Distance codes - 1        (1 - 32)
0726                4 Bits: HCLEN, # of Code Length codes - 4     (4 - 19)
0727 
0728 
0729 
0730 Deutsch                      Informational                     [Page 13]
0731 
0732 RFC 1951      DEFLATE Compressed Data Format Specification      May 1996
0733 
0734 
0735                (HCLEN + 4) x 3 bits: code lengths for the code length
0736                   alphabet given just above, in the order: 16, 17, 18,
0737                   0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15
0738 
0739                   These code lengths are interpreted as 3-bit integers
0740                   (0-7); as above, a code length of 0 means the
0741                   corresponding symbol (literal/length or distance code
0742                   length) is not used.
0743 
0744                HLIT + 257 code lengths for the literal/length alphabet,
0745                   encoded using the code length Huffman code
0746 
0747                HDIST + 1 code lengths for the distance alphabet,
0748                   encoded using the code length Huffman code
0749 
0750                The actual compressed data of the block,
0751                   encoded using the literal/length and distance Huffman
0752                   codes
0753 
0754                The literal/length symbol 256 (end of data),
0755                   encoded using the literal/length Huffman code
0756 
0757          The code length repeat codes can cross from HLIT + 257 to the
0758          HDIST + 1 code lengths.  In other words, all code lengths form
0759          a single sequence of HLIT + HDIST + 258 values.
0760 
0761    3.3. Compliance
0762 
0763       A compressor may limit further the ranges of values specified in
0764       the previous section and still be compliant; for example, it may
0765       limit the range of backward pointers to some value smaller than
0766       32K.  Similarly, a compressor may limit the size of blocks so that
0767       a compressible block fits in memory.
0768 
0769       A compliant decompressor must accept the full range of possible
0770       values defined in the previous section, and must accept blocks of
0771       arbitrary size.
0772 
0773 4. Compression algorithm details
0774 
0775    While it is the intent of this document to define the "deflate"
0776    compressed data format without reference to any particular
0777    compression algorithm, the format is related to the compressed
0778    formats produced by LZ77 (Lempel-Ziv 1977, see reference [2] below);
0779    since many variations of LZ77 are patented, it is strongly
0780    recommended that the implementor of a compressor follow the general
0781    algorithm presented here, which is known not to be patented per se.
0782    The material in this section is not part of the definition of the
0783 
0784 
0785 
0786 Deutsch                      Informational                     [Page 14]
0787 
0788 RFC 1951      DEFLATE Compressed Data Format Specification      May 1996
0789 
0790 
0791    specification per se, and a compressor need not follow it in order to
0792    be compliant.
0793 
0794    The compressor terminates a block when it determines that starting a
0795    new block with fresh trees would be useful, or when the block size
0796    fills up the compressor's block buffer.
0797 
0798    The compressor uses a chained hash table to find duplicated strings,
0799    using a hash function that operates on 3-byte sequences.  At any
0800    given point during compression, let XYZ be the next 3 input bytes to
0801    be examined (not necessarily all different, of course).  First, the
0802    compressor examines the hash chain for XYZ.  If the chain is empty,
0803    the compressor simply writes out X as a literal byte and advances one
0804    byte in the input.  If the hash chain is not empty, indicating that
0805    the sequence XYZ (or, if we are unlucky, some other 3 bytes with the
0806    same hash function value) has occurred recently, the compressor
0807    compares all strings on the XYZ hash chain with the actual input data
0808    sequence starting at the current point, and selects the longest
0809    match.
0810 
0811    The compressor searches the hash chains starting with the most recent
0812    strings, to favor small distances and thus take advantage of the
0813    Huffman encoding.  The hash chains are singly linked. There are no
0814    deletions from the hash chains; the algorithm simply discards matches
0815    that are too old.  To avoid a worst-case situation, very long hash
0816    chains are arbitrarily truncated at a certain length, determined by a
0817    run-time parameter.
0818 
0819    To improve overall compression, the compressor optionally defers the
0820    selection of matches ("lazy matching"): after a match of length N has
0821    been found, the compressor searches for a longer match starting at
0822    the next input byte.  If it finds a longer match, it truncates the
0823    previous match to a length of one (thus producing a single literal
0824    byte) and then emits the longer match.  Otherwise, it emits the
0825    original match, and, as described above, advances N bytes before
0826    continuing.
0827 
0828    Run-time parameters also control this "lazy match" procedure.  If
0829    compression ratio is most important, the compressor attempts a
0830    complete second search regardless of the length of the first match.
0831    In the normal case, if the current match is "long enough", the
0832    compressor reduces the search for a longer match, thus speeding up
0833    the process.  If speed is most important, the compressor inserts new
0834    strings in the hash table only when no match was found, or when the
0835    match is not "too long".  This degrades the compression ratio but
0836    saves time since there are both fewer insertions and fewer searches.
0837 
0838 
0839 
0840 
0841 
0842 Deutsch                      Informational                     [Page 15]
0843 
0844 RFC 1951      DEFLATE Compressed Data Format Specification      May 1996
0845 
0846 
0847 5. References
0848 
0849    [1] Huffman, D. A., "A Method for the Construction of Minimum
0850        Redundancy Codes", Proceedings of the Institute of Radio
0851        Engineers, September 1952, Volume 40, Number 9, pp. 1098-1101.
0852 
0853    [2] Ziv J., Lempel A., "A Universal Algorithm for Sequential Data
0854        Compression", IEEE Transactions on Information Theory, Vol. 23,
0855        No. 3, pp. 337-343.
0856 
0857    [3] Gailly, J.-L., and Adler, M., ZLIB documentation and sources,
0858        available in ftp://ftp.uu.net/pub/archiving/zip/doc/
0859 
0860    [4] Gailly, J.-L., and Adler, M., GZIP documentation and sources,
0861        available as gzip-*.tar in ftp://prep.ai.mit.edu/pub/gnu/
0862 
0863    [5] Schwartz, E. S., and Kallick, B. "Generating a canonical prefix
0864        encoding." Comm. ACM, 7,3 (Mar. 1964), pp. 166-169.
0865 
0866    [6] Hirschberg and Lelewer, "Efficient decoding of prefix codes,"
0867        Comm. ACM, 33,4, April 1990, pp. 449-459.
0868 
0869 6. Security Considerations
0870 
0871    Any data compression method involves the reduction of redundancy in
0872    the data.  Consequently, any corruption of the data is likely to have
0873    severe effects and be difficult to correct.  Uncompressed text, on
0874    the other hand, will probably still be readable despite the presence
0875    of some corrupted bytes.
0876 
0877    It is recommended that systems using this data format provide some
0878    means of validating the integrity of the compressed data.  See
0879    reference [3], for example.
0880 
0881 7. Source code
0882 
0883    Source code for a C language implementation of a "deflate" compliant
0884    compressor and decompressor is available within the zlib package at
0885    ftp://ftp.uu.net/pub/archiving/zip/zlib/.
0886 
0887 8. Acknowledgements
0888 
0889    Trademarks cited in this document are the property of their
0890    respective owners.
0891 
0892    Phil Katz designed the deflate format.  Jean-Loup Gailly and Mark
0893    Adler wrote the related software described in this specification.
0894    Glenn Randers-Pehrson converted this document to RFC and HTML format.
0895 
0896 
0897 
0898 Deutsch                      Informational                     [Page 16]
0899 
0900 RFC 1951      DEFLATE Compressed Data Format Specification      May 1996
0901 
0902 
0903 9. Author's Address
0904 
0905    L. Peter Deutsch
0906    Aladdin Enterprises
0907    203 Santa Margarita Ave.
0908    Menlo Park, CA 94025
0909 
0910    Phone: (415) 322-0103 (AM only)
0911    FAX:   (415) 322-1734
0912    EMail: <ghost@aladdin.com>
0913 
0914    Questions about the technical content of this specification can be
0915    sent by email to:
0916 
0917    Jean-Loup Gailly <gzip@prep.ai.mit.edu> and
0918    Mark Adler <madler@alumni.caltech.edu>
0919 
0920    Editorial comments on this specification can be sent by email to:
0921 
0922    L. Peter Deutsch <ghost@aladdin.com> and
0923    Glenn Randers-Pehrson <randeg@alumni.rpi.edu>
0924 
0925 
0926 
0927 
0928 
0929 
0930 
0931 
0932 
0933 
0934 
0935 
0936 
0937 
0938 
0939 
0940 
0941 
0942 
0943 
0944 
0945 
0946 
0947 
0948 
0949 
0950 
0951 
0952 
0953 
0954 Deutsch                      Informational                     [Page 17]
0955