Back to home page

LXR

 
 

    


Warning, /cpukit/compression/zlib/doc/rfc1950.txt is written in an unsupported language. File is not indexed.

0001 
0002 
0003 
0004 
0005 
0006 
0007 Network Working Group                                         P. Deutsch
0008 Request for Comments: 1950                           Aladdin Enterprises
0009 Category: Informational                                      J-L. Gailly
0010                                                                 Info-ZIP
0011                                                                 May 1996
0012 
0013 
0014          ZLIB Compressed Data Format Specification version 3.3
0015 
0016 Status of This Memo
0017 
0018    This memo provides information for the Internet community.  This memo
0019    does not specify an Internet standard of any kind.  Distribution of
0020    this memo is unlimited.
0021 
0022 IESG Note:
0023 
0024    The IESG takes no position on the validity of any Intellectual
0025    Property Rights statements contained in this document.
0026 
0027 Notices
0028 
0029    Copyright (c) 1996 L. Peter Deutsch and Jean-Loup Gailly
0030 
0031    Permission is granted to copy and distribute this document for any
0032    purpose and without charge, including translations into other
0033    languages and incorporation into compilations, provided that the
0034    copyright notice and this notice are preserved, and that any
0035    substantive changes or deletions from the original are clearly
0036    marked.
0037 
0038    A pointer to the latest version of this and related documentation in
0039    HTML format can be found at the URL
0040    <ftp://ftp.uu.net/graphics/png/documents/zlib/zdoc-index.html>.
0041 
0042 Abstract
0043 
0044    This specification defines a lossless compressed data format.  The
0045    data can be produced or consumed, even for an arbitrarily long
0046    sequentially presented input data stream, using only an a priori
0047    bounded amount of intermediate storage.  The format presently uses
0048    the DEFLATE compression method but can be easily extended to use
0049    other compression methods.  It can be implemented readily in a manner
0050    not covered by patents.  This specification also defines the ADLER-32
0051    checksum (an extension and improvement of the Fletcher checksum),
0052    used for detection of data corruption, and provides an algorithm for
0053    computing it.
0054 
0055 
0056 
0057 
0058 Deutsch & Gailly             Informational                      [Page 1]
0059 
0060 RFC 1950       ZLIB Compressed Data Format Specification        May 1996
0061 
0062 
0063 Table of Contents
0064 
0065    1. Introduction ................................................... 2
0066       1.1. Purpose ................................................... 2
0067       1.2. Intended audience ......................................... 3
0068       1.3. Scope ..................................................... 3
0069       1.4. Compliance ................................................ 3
0070       1.5.  Definitions of terms and conventions used ................ 3
0071       1.6. Changes from previous versions ............................ 3
0072    2. Detailed specification ......................................... 3
0073       2.1. Overall conventions ....................................... 3
0074       2.2. Data format ............................................... 4
0075       2.3. Compliance ................................................ 7
0076    3. References ..................................................... 7
0077    4. Source code .................................................... 8
0078    5. Security Considerations ........................................ 8
0079    6. Acknowledgements ............................................... 8
0080    7. Authors' Addresses ............................................. 8
0081    8. Appendix: Rationale ............................................ 9
0082    9. Appendix: Sample code ..........................................10
0083 
0084 1. Introduction
0085 
0086    1.1. Purpose
0087 
0088       The purpose of this specification is to define a lossless
0089       compressed data format that:
0090 
0091           * Is independent of CPU type, operating system, file system,
0092             and character set, and hence can be used for interchange;
0093 
0094           * Can be produced or consumed, even for an arbitrarily long
0095             sequentially presented input data stream, using only an a
0096             priori bounded amount of intermediate storage, and hence can
0097             be used in data communications or similar structures such as
0098             Unix filters;
0099 
0100           * Can use a number of different compression methods;
0101 
0102           * Can be implemented readily in a manner not covered by
0103             patents, and hence can be practiced freely.
0104 
0105       The data format defined by this specification does not attempt to
0106       allow random access to compressed data.
0107 
0108 
0109 
0110 
0111 
0112 
0113 
0114 Deutsch & Gailly             Informational                      [Page 2]
0115 
0116 RFC 1950       ZLIB Compressed Data Format Specification        May 1996
0117 
0118 
0119    1.2. Intended audience
0120 
0121       This specification is intended for use by implementors of software
0122       to compress data into zlib format and/or decompress data from zlib
0123       format.
0124 
0125       The text of the specification assumes a basic background in
0126       programming at the level of bits and other primitive data
0127       representations.
0128 
0129    1.3. Scope
0130 
0131       The specification specifies a compressed data format that can be
0132       used for in-memory compression of a sequence of arbitrary bytes.
0133 
0134    1.4. Compliance
0135 
0136       Unless otherwise indicated below, a compliant decompressor must be
0137       able to accept and decompress any data set that conforms to all
0138       the specifications presented here; a compliant compressor must
0139       produce data sets that conform to all the specifications presented
0140       here.
0141 
0142    1.5.  Definitions of terms and conventions used
0143 
0144       byte: 8 bits stored or transmitted as a unit (same as an octet).
0145       (For this specification, a byte is exactly 8 bits, even on
0146       machines which store a character on a number of bits different
0147       from 8.) See below, for the numbering of bits within a byte.
0148 
0149    1.6. Changes from previous versions
0150 
0151       Version 3.1 was the first public release of this specification.
0152       In version 3.2, some terminology was changed and the Adler-32
0153       sample code was rewritten for clarity.  In version 3.3, the
0154       support for a preset dictionary was introduced, and the
0155       specification was converted to RFC style.
0156 
0157 2. Detailed specification
0158 
0159    2.1. Overall conventions
0160 
0161       In the diagrams below, a box like this:
0162 
0163          +---+
0164          |   | <-- the vertical bars might be missing
0165          +---+
0166 
0167 
0168 
0169 
0170 Deutsch & Gailly             Informational                      [Page 3]
0171 
0172 RFC 1950       ZLIB Compressed Data Format Specification        May 1996
0173 
0174 
0175       represents one byte; a box like this:
0176 
0177          +==============+
0178          |              |
0179          +==============+
0180 
0181       represents a variable number of bytes.
0182 
0183       Bytes stored within a computer do not have a "bit order", since
0184       they are always treated as a unit.  However, a byte considered as
0185       an integer between 0 and 255 does have a most- and least-
0186       significant bit, and since we write numbers with the most-
0187       significant digit on the left, we also write bytes with the most-
0188       significant bit on the left.  In the diagrams below, we number the
0189       bits of a byte so that bit 0 is the least-significant bit, i.e.,
0190       the bits are numbered:
0191 
0192          +--------+
0193          |76543210|
0194          +--------+
0195 
0196       Within a computer, a number may occupy multiple bytes.  All
0197       multi-byte numbers in the format described here are stored with
0198       the MOST-significant byte first (at the lower memory address).
0199       For example, the decimal number 520 is stored as:
0200 
0201              0     1
0202          +--------+--------+
0203          |00000010|00001000|
0204          +--------+--------+
0205           ^        ^
0206           |        |
0207           |        + less significant byte = 8
0208           + more significant byte = 2 x 256
0209 
0210    2.2. Data format
0211 
0212       A zlib stream has the following structure:
0213 
0214            0   1
0215          +---+---+
0216          |CMF|FLG|   (more-->)
0217          +---+---+
0218 
0219 
0220 
0221 
0222 
0223 
0224 
0225 
0226 Deutsch & Gailly             Informational                      [Page 4]
0227 
0228 RFC 1950       ZLIB Compressed Data Format Specification        May 1996
0229 
0230 
0231       (if FLG.FDICT set)
0232 
0233            0   1   2   3
0234          +---+---+---+---+
0235          |     DICTID    |   (more-->)
0236          +---+---+---+---+
0237 
0238          +=====================+---+---+---+---+
0239          |...compressed data...|    ADLER32    |
0240          +=====================+---+---+---+---+
0241 
0242       Any data which may appear after ADLER32 are not part of the zlib
0243       stream.
0244 
0245       CMF (Compression Method and flags)
0246          This byte is divided into a 4-bit compression method and a 4-
0247          bit information field depending on the compression method.
0248 
0249             bits 0 to 3  CM     Compression method
0250             bits 4 to 7  CINFO  Compression info
0251 
0252       CM (Compression method)
0253          This identifies the compression method used in the file. CM = 8
0254          denotes the "deflate" compression method with a window size up
0255          to 32K.  This is the method used by gzip and PNG (see
0256          references [1] and [2] in Chapter 3, below, for the reference
0257          documents).  CM = 15 is reserved.  It might be used in a future
0258          version of this specification to indicate the presence of an
0259          extra field before the compressed data.
0260 
0261       CINFO (Compression info)
0262          For CM = 8, CINFO is the base-2 logarithm of the LZ77 window
0263          size, minus eight (CINFO=7 indicates a 32K window size). Values
0264          of CINFO above 7 are not allowed in this version of the
0265          specification.  CINFO is not defined in this specification for
0266          CM not equal to 8.
0267 
0268       FLG (FLaGs)
0269          This flag byte is divided as follows:
0270 
0271             bits 0 to 4  FCHECK  (check bits for CMF and FLG)
0272             bit  5       FDICT   (preset dictionary)
0273             bits 6 to 7  FLEVEL  (compression level)
0274 
0275          The FCHECK value must be such that CMF and FLG, when viewed as
0276          a 16-bit unsigned integer stored in MSB order (CMF*256 + FLG),
0277          is a multiple of 31.
0278 
0279 
0280 
0281 
0282 Deutsch & Gailly             Informational                      [Page 5]
0283 
0284 RFC 1950       ZLIB Compressed Data Format Specification        May 1996
0285 
0286 
0287       FDICT (Preset dictionary)
0288          If FDICT is set, a DICT dictionary identifier is present
0289          immediately after the FLG byte. The dictionary is a sequence of
0290          bytes which are initially fed to the compressor without
0291          producing any compressed output. DICT is the Adler-32 checksum
0292          of this sequence of bytes (see the definition of ADLER32
0293          below).  The decompressor can use this identifier to determine
0294          which dictionary has been used by the compressor.
0295 
0296       FLEVEL (Compression level)
0297          These flags are available for use by specific compression
0298          methods.  The "deflate" method (CM = 8) sets these flags as
0299          follows:
0300 
0301             0 - compressor used fastest algorithm
0302             1 - compressor used fast algorithm
0303             2 - compressor used default algorithm
0304             3 - compressor used maximum compression, slowest algorithm
0305 
0306          The information in FLEVEL is not needed for decompression; it
0307          is there to indicate if recompression might be worthwhile.
0308 
0309       compressed data
0310          For compression method 8, the compressed data is stored in the
0311          deflate compressed data format as described in the document
0312          "DEFLATE Compressed Data Format Specification" by L. Peter
0313          Deutsch. (See reference [3] in Chapter 3, below)
0314 
0315          Other compressed data formats are not specified in this version
0316          of the zlib specification.
0317 
0318       ADLER32 (Adler-32 checksum)
0319          This contains a checksum value of the uncompressed data
0320          (excluding any dictionary data) computed according to Adler-32
0321          algorithm. This algorithm is a 32-bit extension and improvement
0322          of the Fletcher algorithm, used in the ITU-T X.224 / ISO 8073
0323          standard. See references [4] and [5] in Chapter 3, below)
0324 
0325          Adler-32 is composed of two sums accumulated per byte: s1 is
0326          the sum of all bytes, s2 is the sum of all s1 values. Both sums
0327          are done modulo 65521. s1 is initialized to 1, s2 to zero.  The
0328          Adler-32 checksum is stored as s2*65536 + s1 in most-
0329          significant-byte first (network) order.
0330 
0331 
0332 
0333 
0334 
0335 
0336 
0337 
0338 Deutsch & Gailly             Informational                      [Page 6]
0339 
0340 RFC 1950       ZLIB Compressed Data Format Specification        May 1996
0341 
0342 
0343    2.3. Compliance
0344 
0345       A compliant compressor must produce streams with correct CMF, FLG
0346       and ADLER32, but need not support preset dictionaries.  When the
0347       zlib data format is used as part of another standard data format,
0348       the compressor may use only preset dictionaries that are specified
0349       by this other data format.  If this other format does not use the
0350       preset dictionary feature, the compressor must not set the FDICT
0351       flag.
0352 
0353       A compliant decompressor must check CMF, FLG, and ADLER32, and
0354       provide an error indication if any of these have incorrect values.
0355       A compliant decompressor must give an error indication if CM is
0356       not one of the values defined in this specification (only the
0357       value 8 is permitted in this version), since another value could
0358       indicate the presence of new features that would cause subsequent
0359       data to be interpreted incorrectly.  A compliant decompressor must
0360       give an error indication if FDICT is set and DICTID is not the
0361       identifier of a known preset dictionary.  A decompressor may
0362       ignore FLEVEL and still be compliant.  When the zlib data format
0363       is being used as a part of another standard format, a compliant
0364       decompressor must support all the preset dictionaries specified by
0365       the other format. When the other format does not use the preset
0366       dictionary feature, a compliant decompressor must reject any
0367       stream in which the FDICT flag is set.
0368 
0369 3. References
0370 
0371    [1] Deutsch, L.P.,"GZIP Compressed Data Format Specification",
0372        available in ftp://ftp.uu.net/pub/archiving/zip/doc/
0373 
0374    [2] Thomas Boutell, "PNG (Portable Network Graphics) specification",
0375        available in ftp://ftp.uu.net/graphics/png/documents/
0376 
0377    [3] Deutsch, L.P.,"DEFLATE Compressed Data Format Specification",
0378        available in ftp://ftp.uu.net/pub/archiving/zip/doc/
0379 
0380    [4] Fletcher, J. G., "An Arithmetic Checksum for Serial
0381        Transmissions," IEEE Transactions on Communications, Vol. COM-30,
0382        No. 1, January 1982, pp. 247-252.
0383 
0384    [5] ITU-T Recommendation X.224, Annex D, "Checksum Algorithms,"
0385        November, 1993, pp. 144, 145. (Available from
0386        gopher://info.itu.ch). ITU-T X.244 is also the same as ISO 8073.
0387 
0388 
0389 
0390 
0391 
0392 
0393 
0394 Deutsch & Gailly             Informational                      [Page 7]
0395 
0396 RFC 1950       ZLIB Compressed Data Format Specification        May 1996
0397 
0398 
0399 4. Source code
0400 
0401    Source code for a C language implementation of a "zlib" compliant
0402    library is available at ftp://ftp.uu.net/pub/archiving/zip/zlib/.
0403 
0404 5. Security Considerations
0405 
0406    A decoder that fails to check the ADLER32 checksum value may be
0407    subject to undetected data corruption.
0408 
0409 6. Acknowledgements
0410 
0411    Trademarks cited in this document are the property of their
0412    respective owners.
0413 
0414    Jean-Loup Gailly and Mark Adler designed the zlib format and wrote
0415    the related software described in this specification.  Glenn
0416    Randers-Pehrson converted this document to RFC and HTML format.
0417 
0418 7. Authors' Addresses
0419 
0420    L. Peter Deutsch
0421    Aladdin Enterprises
0422    203 Santa Margarita Ave.
0423    Menlo Park, CA 94025
0424 
0425    Phone: (415) 322-0103 (AM only)
0426    FAX:   (415) 322-1734
0427    EMail: <ghost@aladdin.com>
0428 
0429 
0430    Jean-Loup Gailly
0431 
0432    EMail: <gzip@prep.ai.mit.edu>
0433 
0434    Questions about the technical content of this specification can be
0435    sent by email to
0436 
0437    Jean-Loup Gailly <gzip@prep.ai.mit.edu> and
0438    Mark Adler <madler@alumni.caltech.edu>
0439 
0440    Editorial comments on this specification can be sent by email to
0441 
0442    L. Peter Deutsch <ghost@aladdin.com> and
0443    Glenn Randers-Pehrson <randeg@alumni.rpi.edu>
0444 
0445 
0446 
0447 
0448 
0449 
0450 Deutsch & Gailly             Informational                      [Page 8]
0451 
0452 RFC 1950       ZLIB Compressed Data Format Specification        May 1996
0453 
0454 
0455 8. Appendix: Rationale
0456 
0457    8.1. Preset dictionaries
0458 
0459       A preset dictionary is specially useful to compress short input
0460       sequences. The compressor can take advantage of the dictionary
0461       context to encode the input in a more compact manner. The
0462       decompressor can be initialized with the appropriate context by
0463       virtually decompressing a compressed version of the dictionary
0464       without producing any output. However for certain compression
0465       algorithms such as the deflate algorithm this operation can be
0466       achieved without actually performing any decompression.
0467 
0468       The compressor and the decompressor must use exactly the same
0469       dictionary. The dictionary may be fixed or may be chosen among a
0470       certain number of predefined dictionaries, according to the kind
0471       of input data. The decompressor can determine which dictionary has
0472       been chosen by the compressor by checking the dictionary
0473       identifier. This document does not specify the contents of
0474       predefined dictionaries, since the optimal dictionaries are
0475       application specific. Standard data formats using this feature of
0476       the zlib specification must precisely define the allowed
0477       dictionaries.
0478 
0479    8.2. The Adler-32 algorithm
0480 
0481       The Adler-32 algorithm is much faster than the CRC32 algorithm yet
0482       still provides an extremely low probability of undetected errors.
0483 
0484       The modulo on unsigned long accumulators can be delayed for 5552
0485       bytes, so the modulo operation time is negligible.  If the bytes
0486       are a, b, c, the second sum is 3a + 2b + c + 3, and so is position
0487       and order sensitive, unlike the first sum, which is just a
0488       checksum.  That 65521 is prime is important to avoid a possible
0489       large class of two-byte errors that leave the check unchanged.
0490       (The Fletcher checksum uses 255, which is not prime and which also
0491       makes the Fletcher check insensitive to single byte changes 0 <->
0492       255.)
0493 
0494       The sum s1 is initialized to 1 instead of zero to make the length
0495       of the sequence part of s2, so that the length does not have to be
0496       checked separately. (Any sequence of zeroes has a Fletcher
0497       checksum of zero.)
0498 
0499 
0500 
0501 
0502 
0503 
0504 
0505 
0506 Deutsch & Gailly             Informational                      [Page 9]
0507 
0508 RFC 1950       ZLIB Compressed Data Format Specification        May 1996
0509 
0510 
0511 9. Appendix: Sample code
0512 
0513    The following C code computes the Adler-32 checksum of a data buffer.
0514    It is written for clarity, not for speed.  The sample code is in the
0515    ANSI C programming language. Non C users may find it easier to read
0516    with these hints:
0517 
0518       &      Bitwise AND operator.
0519       >>     Bitwise right shift operator. When applied to an
0520              unsigned quantity, as here, right shift inserts zero bit(s)
0521              at the left.
0522       <<     Bitwise left shift operator. Left shift inserts zero
0523              bit(s) at the right.
0524       ++     "n++" increments the variable n.
0525       %      modulo operator: a % b is the remainder of a divided by b.
0526 
0527       #define BASE 65521 /* largest prime smaller than 65536 */
0528 
0529       /*
0530          Update a running Adler-32 checksum with the bytes buf[0..len-1]
0531        and return the updated checksum. The Adler-32 checksum should be
0532        initialized to 1.
0533 
0534        Usage example:
0535 
0536          unsigned long adler = 1L;
0537 
0538          while (read_buffer(buffer, length) != EOF) {
0539            adler = update_adler32(adler, buffer, length);
0540          }
0541          if (adler != original_adler) error();
0542       */
0543       unsigned long update_adler32(unsigned long adler,
0544          unsigned char *buf, int len)
0545       {
0546         unsigned long s1 = adler & 0xffff;
0547         unsigned long s2 = (adler >> 16) & 0xffff;
0548         int n;
0549 
0550         for (n = 0; n < len; n++) {
0551           s1 = (s1 + buf[n]) % BASE;
0552           s2 = (s2 + s1)     % BASE;
0553         }
0554         return (s2 << 16) + s1;
0555       }
0556 
0557       /* Return the adler32 of the bytes buf[0..len-1] */
0558 
0559 
0560 
0561 
0562 Deutsch & Gailly             Informational                     [Page 10]
0563 
0564 RFC 1950       ZLIB Compressed Data Format Specification        May 1996
0565 
0566 
0567       unsigned long adler32(unsigned char *buf, int len)
0568       {
0569         return update_adler32(1L, buf, len);
0570       }
0571 
0572 
0573 
0574 
0575 
0576 
0577 
0578 
0579 
0580 
0581 
0582 
0583 
0584 
0585 
0586 
0587 
0588 
0589 
0590 
0591 
0592 
0593 
0594 
0595 
0596 
0597 
0598 
0599 
0600 
0601 
0602 
0603 
0604 
0605 
0606 
0607 
0608 
0609 
0610 
0611 
0612 
0613 
0614 
0615 
0616 
0617 
0618 Deutsch & Gailly             Informational                     [Page 11]
0619