0007 Network Working Group                                         P. Deutsch
0008 Request for Comments: 1952                           Aladdin Enterprises
0009 Category: Informational                                         May 1996
0010 
0011 
0012                GZIP file format specification version 4.3
0013 
0014 Status of This Memo
0015 
0016    This memo provides information for the Internet community.  This memo
0017    does not specify an Internet standard of any kind.  Distribution of
0018    this memo is unlimited.
0019 
0020 IESG Note:
0021 
0022    The IESG takes no position on the validity of any Intellectual
0023    Property Rights statements contained in this document.
0024 
0025 Notices
0026 
0027    Copyright (c) 1996 L. Peter Deutsch
0028 
0029    Permission is granted to copy and distribute this document for any
0030    purpose and without charge, including translations into other
0031    languages and incorporation into compilations, provided that the
0032    copyright notice and this notice are preserved, and that any
0033    substantive changes or deletions from the original are clearly
0034    marked.
0035 
0036    A pointer to the latest version of this and related documentation in
0037    HTML format can be found at the URL
0038    <ftp://ftp.uu.net/graphics/png/documents/zlib/zdoc-index.html>.
0039 
0040 Abstract
0041 
0042    This specification defines a lossless compressed data format that is
0043    compatible with the widely used GZIP utility.  The format includes a
0044    cyclic redundancy check value for detecting data corruption.  The
0045    format presently uses the DEFLATE method of compression but can be
0046    easily extended to use other compression methods.  The format can be
0047    implemented readily in a manner not covered by patents.
0048 
0058 Deutsch                      Informational                      [Page 1]
0059 
0060 RFC 1952             GZIP File Format Specification             May 1996
0061 
0062 
0063 Table of Contents
0064 
0065    1. Introduction
0066       1.1. Purpose
0067       1.2. Intended audience
0068       1.3. Scope
0069       1.4. Compliance
0070       1.5. Definitions of terms and conventions used
0071       1.6. Changes from previous versions
0072    2. Detailed specification
0073       2.1. Overall conventions
0074       2.2. File format
0075       2.3. Member format
0076           2.3.1. Member header and trailer
0077               2.3.1.1. Extra field
0078               2.3.1.2. Compliance
0079    3. References
0080    4. Security Considerations
0081    5. Acknowledgements
0082    6. Author's Address
0083    7. Appendix: Jean-Loup Gailly's gzip utility
0084    8. Appendix: Sample CRC Code
0085 
0086 1. Introduction
0087 
0088    1.1. Purpose
0089 
0090       The purpose of this specification is to define a lossless
0091       compressed data format that:
0092 
0093           * Is independent of CPU type, operating system, file system,
0094             and character set, and hence can be used for interchange;
0095           * Can compress or decompress a data stream (as opposed to a
0096             randomly accessible file) to produce another data stream,
0097             using only an a priori bounded amount of intermediate
0098             storage, and hence can be used in data communications or
0099             similar structures such as Unix filters;
0100           * Compresses data with efficiency comparable to the best
0101             currently available general-purpose compression methods,
0102             and in particular considerably better than the "compress"
0103             program;
0104           * Can be implemented readily in a manner not covered by
0105             patents, and hence can be practiced freely;
0106           * Is compatible with the file format produced by the current
0107             widely used gzip utility, in that conforming decompressors
0108             will be able to read data produced by the existing gzip
0109             compressor.
0110 
0119       The data format defined by this specification does not attempt to:
0120 
0121           * Provide random access to compressed data;
0122           * Compress specialized data (e.g., raster graphics) as well as
0123             the best currently available specialized algorithms.
0124 
0125    1.2. Intended audience
0126 
0127       This specification is intended for use by implementors of software
0128       to compress data into gzip format and/or decompress data from gzip
0129       format.
0130 
0131       The text of the specification assumes a basic background in
0132       programming at the level of bits and other primitive data
0133       representations.
0134 
0135    1.3. Scope
0136 
0137       The specification specifies a compression method and a file format
0138       (the latter assuming only that a file can store a sequence of
0139       arbitrary bytes).  It does not specify any particular interface to
0140       a file system or anything about character sets or encodings
0141       (except for file names and comments, which are optional).
0142 
0143    1.4. Compliance
0144 
0145       Unless otherwise indicated below, a compliant decompressor must be
0146       able to accept and decompress any file that conforms to all the
0147       specifications presented here; a compliant compressor must produce
0148       files that conform to all the specifications presented here.  The
0149       material in the appendices is not part of the specification per se
0150       and is not relevant to compliance.
0151 
0152    1.5. Definitions of terms and conventions used
0153 
0154       byte: 8 bits stored or transmitted as a unit (same as an octet).
0155       (For this specification, a byte is exactly 8 bits, even on
0156       machines which store a character on a number of bits different
0157       from 8.)  See below for the numbering of bits within a byte.
0158 
0159    1.6. Changes from previous versions
0160 
0161       There have been no technical changes to the gzip format since
0162       version 4.1 of this specification.  In version 4.2, some
0163       terminology was changed, and the sample CRC code was rewritten for
0164       clarity and to eliminate the requirement for the caller to do pre-
0165       and post-conditioning.  Version 4.3 is a conversion of the
0166       specification to RFC style.
0167 
0175 2. Detailed specification
0176 
0177    2.1. Overall conventions
0178 
0179       In the diagrams below, a box like this:
0180 
0181          +---+
0182          |   | <-- the vertical bars might be missing
0183          +---+
0184 
0185       represents one byte; a box like this:
0186 
0187          +==============+
0188          |              |
0189          +==============+
0190 
0191       represents a variable number of bytes.
0192 
0193       Bytes stored within a computer do not have a "bit order", since
0194       they are always treated as a unit.  However, a byte considered as
0195       an integer between 0 and 255 does have a most- and least-
0196       significant bit, and since we write numbers with the most-
0197       significant digit on the left, we also write bytes with the most-
0198       significant bit on the left.  In the diagrams below, we number the
0199       bits of a byte so that bit 0 is the least-significant bit, i.e.,
0200       the bits are numbered:
0201 
0202          +--------+
0203          |76543210|
0204          +--------+
0205 
0206       This document does not address the issue of the order in which
0207       bits of a byte are transmitted on a bit-sequential medium, since
0208       the data format described here is byte- rather than bit-oriented.
0209 
0210       Within a computer, a number may occupy multiple bytes.  All
0211       multi-byte numbers in the format described here are stored with
0212       the least-significant byte first (at the lower memory address).
0213       For example, the decimal number 520 is stored as:
0214 
0215              0        1
0216          +--------+--------+
0217          |00001000|00000010|
0218          +--------+--------+
0219           ^        ^
0220           |        |
0221           |        + more significant byte = 2 x 256
0222           + less significant byte = 8
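
      The byte-order convention above can be sketched in C as follows.
      This is only an illustration: the helper names get_le16, get_le32
      and put_le32 are hypothetical and are not part of the gzip format
      or of any particular library.

```c
#include <stdint.h>

/* Read a 16-bit value stored least-significant byte first at p[0..1].
   (Hypothetical helper, for illustration only.) */
uint16_t get_le16(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Read a 32-bit value stored least-significant byte first at p[0..3]. */
uint32_t get_le32(const unsigned char *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Store a 32-bit value at p[0..3], least-significant byte first. */
void put_le32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)(v & 0xff);
    p[1] = (unsigned char)((v >> 8) & 0xff);
    p[2] = (unsigned char)((v >> 16) & 0xff);
    p[3] = (unsigned char)((v >> 24) & 0xff);
}
```

      Applied to the two bytes in the diagram (0x08 followed by 0x02),
      get_le16 yields 8 + 2 x 256 = 520.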
0223 
0231    2.2. File format
0232 
0233       A gzip file consists of a series of "members" (compressed data
0234       sets).  The format of each member is specified in the following
0235       section.  The members simply appear one after another in the file,
0236       with no additional information before, between, or after them.
0237 
0238    2.3. Member format
0239 
0240       Each member has the following structure:
0241 
0242          +---+---+---+---+---+---+---+---+---+---+
0243          |ID1|ID2|CM |FLG|     MTIME     |XFL|OS | (more-->)
0244          +---+---+---+---+---+---+---+---+---+---+
0245 
0246       (if FLG.FEXTRA set)
0247 
0248          +---+---+=================================+
0249          | XLEN  |...XLEN bytes of "extra field"...| (more-->)
0250          +---+---+=================================+
0251 
0252       (if FLG.FNAME set)
0253 
0254          +=========================================+
0255          |...original file name, zero-terminated...| (more-->)
0256          +=========================================+
0257 
0258       (if FLG.FCOMMENT set)
0259 
0260          +===================================+
0261          |...file comment, zero-terminated...| (more-->)
0262          +===================================+
0263 
0264       (if FLG.FHCRC set)
0265 
0266          +---+---+
0267          | CRC16 |
0268          +---+---+
0269 
0270          +=======================+
0271          |...compressed blocks...| (more-->)
0272          +=======================+
0273 
0274            0   1   2   3   4   5   6   7
0275          +---+---+---+---+---+---+---+---+
0276          |     CRC32     |     ISIZE     |
0277          +---+---+---+---+---+---+---+---+
0278 
0287       2.3.1. Member header and trailer
0288 
0289          ID1 (IDentification 1)
0290          ID2 (IDentification 2)
0291             These have the fixed values ID1 = 31 (0x1f, \037), ID2 = 139
0292             (0x8b, \213), to identify the file as being in gzip format.
0293 
0294          CM (Compression Method)
0295             This identifies the compression method used in the file.  CM
0296             = 0-7 are reserved.  CM = 8 denotes the "deflate"
0297             compression method, which is the one customarily used by
0298             gzip and which is documented elsewhere.
0299 
0300          FLG (FLaGs)
0301             This flag byte is divided into individual bits as follows:
0302 
0303                bit 0   FTEXT
0304                bit 1   FHCRC
0305                bit 2   FEXTRA
0306                bit 3   FNAME
0307                bit 4   FCOMMENT
0308                bit 5   reserved
0309                bit 6   reserved
0310                bit 7   reserved
0311 
0312             If FTEXT is set, the file is probably ASCII text.  This is
0313             an optional indication, which the compressor may set by
0314             checking a small amount of the input data to see whether any
0315             non-ASCII characters are present.  In case of doubt, FTEXT
0316             is cleared, indicating binary data. For systems which have
0317             different file formats for ASCII text and binary data, the
0318             decompressor can use FTEXT to choose the appropriate format.
0319             We deliberately do not specify the algorithm used to set
0320             this bit, since a compressor always has the option of
0321             leaving it cleared and a decompressor always has the option
0322             of ignoring it and letting some other program handle issues
0323             of data conversion.
0324 
0325             If FHCRC is set, a CRC16 for the gzip header is present,
0326             immediately before the compressed data. The CRC16 consists
0327             of the two least significant bytes of the CRC32 for all
0328             bytes of the gzip header up to and not including the CRC16.
0329             [The FHCRC bit was never set by versions of gzip up to
0330             1.2.4, even though it was documented with a different
0331             meaning in gzip 1.2.4.]
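
            A minimal sketch of the header check value described above,
            assuming the caller has the header bytes that precede the
            CRC16 in a buffer.  The names crc32_bytes and
            gz_header_crc16 are hypothetical.  The bitwise CRC routine
            is a compact stand-in for the table-driven sample code in
            the appendix and is meant to compute the same CRC-32.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 using the polynomial constant 0xedb88320 from the
   sample code in the appendix; short but slower than the table form.
   (Hypothetical helper name.) */
uint32_t crc32_bytes(const unsigned char *buf, size_t len)
{
    uint32_t c = 0xffffffffUL;          /* pre-conditioning */
    size_t n;
    int k;

    for (n = 0; n < len; n++) {
        c ^= buf[n];
        for (k = 0; k < 8; k++)
            c = (c & 1) ? (0xedb88320UL ^ (c >> 1)) : (c >> 1);
    }
    return c ^ 0xffffffffUL;            /* post-conditioning */
}

/* CRC16 header check: the two least significant bytes of the CRC-32
   of hdr[0..hdr_len-1], i.e. all header bytes before the CRC16. */
uint16_t gz_header_crc16(const unsigned char *hdr, size_t hdr_len)
{
    return (uint16_t)(crc32_bytes(hdr, hdr_len) & 0xffff);
}
```

            In the file, the resulting 16-bit value is stored least-
            significant byte first, like every other multi-byte number
            in the format.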
0332 
0333             If FEXTRA is set, optional extra fields are present, as
0334             described in a following section.
0335 
0343             If FNAME is set, an original file name is present,
0344             terminated by a zero byte.  The name must consist of ISO
0345             8859-1 (LATIN-1) characters; on operating systems using
0346             EBCDIC or any other character set for file names, the name
0347             must be translated to the ISO LATIN-1 character set.  This
0348             is the original name of the file being compressed, with any
0349             directory components removed, and, if the file being
0350             compressed is on a file system with case insensitive names,
0351             forced to lower case. There is no original file name if the
0352             data was compressed from a source other than a named file;
0353             for example, if the source was stdin on a Unix system, there
0354             is no file name.
0355 
0356             If FCOMMENT is set, a zero-terminated file comment is
0357             present.  This comment is not interpreted; it is only
0358             intended for human consumption.  The comment must consist of
0359             ISO 8859-1 (LATIN-1) characters.  Line breaks should be
0360             denoted by a single line feed character (10 decimal).
0361 
0362             Reserved FLG bits must be zero.
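
            The FLG bit assignments can be written down as masks.  This
            is a sketch only; the GZ_* constant names and
            gz_flags_valid are hypothetical, chosen for illustration.

```c
/* FLG bit masks, following the bit numbering given for FLG.
   (Hypothetical names; the format only defines the bit positions.) */
enum {
    GZ_FTEXT    = 1 << 0,
    GZ_FHCRC    = 1 << 1,
    GZ_FEXTRA   = 1 << 2,
    GZ_FNAME    = 1 << 3,
    GZ_FCOMMENT = 1 << 4,
    GZ_RESERVED = 0xe0      /* bits 5, 6 and 7 */
};

/* Return 1 if no reserved bit is set in flg, 0 otherwise. */
int gz_flags_valid(unsigned char flg)
{
    return (flg & GZ_RESERVED) == 0;
}
```

            For example, a header with both an extra field and a file
            name would carry FLG = GZ_FEXTRA | GZ_FNAME = 12.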
0363 
0364          MTIME (Modification TIME)
0365             This gives the most recent modification time of the original
0366             file being compressed.  The time is in Unix format, i.e.,
0367             seconds since 00:00:00 GMT, Jan. 1, 1970.  (Note that this
0368             may cause problems for MS-DOS and other systems that use
0369             local rather than Universal time.)  If the compressed data
0370             did not come from a file, MTIME is set to the time at which
0371             compression started.  MTIME = 0 means no time stamp is
0372             available.
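
            One possible way to present MTIME to a user is sketched
            below.  It assumes a C library whose time_t counts seconds
            since the Unix epoch (true on POSIX systems, but not
            guaranteed by ANSI C); gz_format_mtime is a hypothetical
            name.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Format MTIME as "YYYY-MM-DD HH:MM:SS" (Universal time) into out.
   Returns 1 on success, 0 if MTIME is 0 (no time stamp available)
   or the value cannot be formatted.
   Assumption: time_t counts seconds since 00:00:00 GMT, Jan. 1, 1970. */
int gz_format_mtime(uint32_t mtime, char *out, size_t out_size)
{
    time_t t;
    struct tm *utc;

    if (mtime == 0)
        return 0;                       /* no time stamp available */
    t = (time_t)mtime;
    utc = gmtime(&t);
    if (utc == NULL)
        return 0;
    return strftime(out, out_size, "%Y-%m-%d %H:%M:%S", utc) != 0;
}
```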
0373 
0374          XFL (eXtra FLags)
0375             These flags are available for use by specific compression
0376             methods.  The "deflate" method (CM = 8) sets these flags as
0377             follows:
0378 
0379                XFL = 2 - compressor used maximum compression,
0380                          slowest algorithm
0381                XFL = 4 - compressor used fastest algorithm
0382 
0383          OS (Operating System)
0384             This identifies the type of file system on which compression
0385             took place.  This may be useful in determining end-of-line
0386             convention for text files.  The currently defined values are
0387             as follows:
0388 
0399                  0 - FAT filesystem (MS-DOS, OS/2, NT/Win32)
0400                  1 - Amiga
0401                  2 - VMS (or OpenVMS)
0402                  3 - Unix
0403                  4 - VM/CMS
0404                  5 - Atari TOS
0405                  6 - HPFS filesystem (OS/2, NT)
0406                  7 - Macintosh
0407                  8 - Z-System
0408                  9 - CP/M
0409                 10 - TOPS-20
0410                 11 - NTFS filesystem (NT)
0411                 12 - QDOS
0412                 13 - Acorn RISCOS
0413                255 - unknown
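
            The table above maps directly onto a lookup array.  In this
            sketch, gz_os_name is a hypothetical name, and reporting
            every value not listed above as "unknown" is an assumption
            of the sketch rather than a rule of the format.

```c
#include <stddef.h>

/* Return a descriptive name for an OS byte.  Values outside the
   defined range 0..13 are reported as "unknown" (an assumption made
   here; only 255 is defined as unknown by the table). */
const char *gz_os_name(unsigned char os)
{
    static const char *const names[] = {
        "FAT filesystem (MS-DOS, OS/2, NT/Win32)",
        "Amiga",
        "VMS (or OpenVMS)",
        "Unix",
        "VM/CMS",
        "Atari TOS",
        "HPFS filesystem (OS/2, NT)",
        "Macintosh",
        "Z-System",
        "CP/M",
        "TOPS-20",
        "NTFS filesystem (NT)",
        "QDOS",
        "Acorn RISCOS"
    };

    if (os < sizeof names / sizeof names[0])
        return names[os];
    return "unknown";
}
```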
0414 
0415          XLEN (eXtra LENgth)
0416             If FLG.FEXTRA is set, this gives the length of the optional
0417             extra field.  See below for details.
0418 
0419          CRC32 (CRC-32)
0420             This contains a Cyclic Redundancy Check value of the
0421             uncompressed data computed according to the CRC-32 algorithm
0422             used in the ISO 3309 standard and in section 8.1.1.6.2 of
0423             ITU-T recommendation V.42.  (See http://www.iso.ch for
0424             ordering ISO documents. See gopher://info.itu.ch for an
0425             online version of ITU-T V.42.)
0426 
0427          ISIZE (Input SIZE)
0428             This contains the size of the original (uncompressed) input
0429             data modulo 2^32.
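
            A sketch of decoding the 8-byte trailer and checking ISIZE,
            assuming the decompressor has counted the uncompressed
            bytes it produced.  The names gz_trailer, gz_read_trailer
            and gz_isize_matches are hypothetical.

```c
#include <stdint.h>

/* Decoded member trailer (hypothetical type, for illustration). */
struct gz_trailer {
    uint32_t crc32;     /* CRC-32 of the uncompressed data */
    uint32_t isize;     /* uncompressed size modulo 2^32 */
};

/* Decode the trailer at p[0..7]: CRC32 in bytes 0..3, ISIZE in bytes
   4..7, each stored least-significant byte first. */
struct gz_trailer gz_read_trailer(const unsigned char *p)
{
    struct gz_trailer t;

    t.crc32 = (uint32_t)p[0] | ((uint32_t)p[1] << 8)
            | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
    t.isize = (uint32_t)p[4] | ((uint32_t)p[5] << 8)
            | ((uint32_t)p[6] << 16) | ((uint32_t)p[7] << 24);
    return t;
}

/* Return 1 if ISIZE is consistent with total_len uncompressed bytes.
   Only the low 32 bits are compared, so inputs of 4 GiB or more wrap. */
int gz_isize_matches(uint32_t isize, uint64_t total_len)
{
    return isize == (uint32_t)(total_len & 0xffffffffULL);
}
```

            Because ISIZE is taken modulo 2^32, it cannot by itself
            distinguish a 5-byte input from one of 2^32 + 5 bytes.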
0430 
0431       2.3.1.1. Extra field
0432 
0433          If the FLG.FEXTRA bit is set, an "extra field" is present in
0434          the header, with total length XLEN bytes.  It consists of a
0435          series of subfields, each of the form:
0436 
0437             +---+---+---+---+==================================+
0438             |SI1|SI2|  LEN  |... LEN bytes of subfield data ...|
0439             +---+---+---+---+==================================+
0440 
0441          SI1 and SI2 provide a subfield ID, typically two ASCII letters
0442          with some mnemonic value.  Jean-Loup Gailly
0443          <gzip@prep.ai.mit.edu> is maintaining a registry of subfield
0444          IDs; please send him any subfield ID you wish to use.  Subfield
0445          IDs with SI2 = 0 are reserved for future use.  The following
0446          IDs are currently defined:
0447 
0455             SI1         SI2         Data
0456             ----------  ----------  ----
0457             0x41 ('A')  0x70 ('p')  Apollo file type information
0458 
0459          LEN gives the length of the subfield data, excluding the 4
0460          initial bytes.
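
         Walking the subfields might look like the sketch below.  It
         assumes the XLEN bytes of the extra field are already in
         memory; gz_find_subfield is a hypothetical name, and returning
         NULL for a subfield that overruns the extra field is a choice
         made for this sketch.

```c
#include <stddef.h>

/* Search extra[0..xlen-1] for the subfield with ID (si1, si2).
   On success, store the subfield data length in *len_out and return a
   pointer to the data (just after the 4-byte SI1/SI2/LEN prefix).
   Return NULL if the ID is absent or a subfield overruns the field. */
const unsigned char *gz_find_subfield(const unsigned char *extra,
                                      size_t xlen,
                                      unsigned char si1,
                                      unsigned char si2,
                                      size_t *len_out)
{
    size_t pos = 0;

    while (xlen - pos >= 4) {
        /* LEN is a 2-byte number, least-significant byte first. */
        size_t len = (size_t)extra[pos + 2]
                   | ((size_t)extra[pos + 3] << 8);

        if (len > xlen - pos - 4)
            return NULL;            /* subfield overruns the field */
        if (extra[pos] == si1 && extra[pos + 1] == si2) {
            *len_out = len;
            return extra + pos + 4;
        }
        pos += 4 + len;             /* skip prefix and data */
    }
    return NULL;
}
```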
0461 
0462       2.3.1.2. Compliance
0463 
0464          A compliant compressor must produce files with correct ID1,
0465          ID2, CM, CRC32, and ISIZE, but may set all the other fields in
0466          the fixed-length part of the header to default values (255 for
0467          OS, 0 for all others).  The compressor must set all reserved
0468          bits to zero.
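
         As an illustration of the defaults just described, a
         compressor could emit the following fixed-length header.  The
         function name gz_write_min_header is hypothetical.

```c
#include <stddef.h>

/* Fill out[0..9] with a minimal header for a "deflate" member, using
   the default values permitted for a compliant compressor.
   Returns the number of bytes written (always 10). */
size_t gz_write_min_header(unsigned char out[10])
{
    out[0] = 0x1f;      /* ID1 */
    out[1] = 0x8b;      /* ID2 */
    out[2] = 8;         /* CM = 8, "deflate" */
    out[3] = 0;         /* FLG: no optional fields, reserved bits zero */
    out[4] = 0;         /* MTIME = 0: no time stamp available */
    out[5] = 0;
    out[6] = 0;
    out[7] = 0;
    out[8] = 0;         /* XFL */
    out[9] = 255;       /* OS = 255: unknown */
    return 10;
}
```

         The compressed blocks follow immediately, then the CRC32 and
         ISIZE trailer, which must always be correct.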
0469 
0470          A compliant decompressor must check ID1, ID2, and CM, and
0471          provide an error indication if any of these have incorrect
0472          values.  It must examine FEXTRA/XLEN, FNAME, FCOMMENT and FHCRC
0473          at least so it can skip over the optional fields if they are
0474          present.  It need not examine any other part of the header or
0475          trailer; in particular, a decompressor may ignore FTEXT and OS
0476          and always produce binary output, and still be compliant.  A
0477          compliant decompressor must give an error indication if any
0478          reserved bit is non-zero, since such a bit could indicate the
0479          presence of a new field that would cause subsequent data to be
0480          interpreted incorrectly.
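
         The decompressor checks above can be sketched as a single
         header-skipping routine.  This is an illustration under stated
         assumptions, not a reference implementation: the whole header
         is assumed to be in memory, the names gz_skip_header and
         GZ_ERR_* are hypothetical, and any CM other than 8 is treated
         as an error because "deflate" is the only method defined here.

```c
#include <stddef.h>

enum {
    GZ_OK = 0,
    GZ_ERR_TRUNCATED = -1,  /* header runs past the end of the buffer */
    GZ_ERR_MAGIC = -2,      /* ID1/ID2 mismatch */
    GZ_ERR_METHOD = -3,     /* CM is not 8 ("deflate") */
    GZ_ERR_RESERVED = -4    /* a reserved FLG bit is set */
};

/* Skip a NUL-terminated string starting at *pos; 0 on success. */
static int skip_zstring(const unsigned char *buf, size_t len, size_t *pos)
{
    while (*pos < len && buf[*pos] != 0)
        (*pos)++;
    if (*pos == len)
        return GZ_ERR_TRUNCATED;
    (*pos)++;                               /* consume the zero byte */
    return GZ_OK;
}

/* Validate the member header in buf[0..len-1].  On success, store the
   offset of the first compressed byte in *data_off and return GZ_OK. */
int gz_skip_header(const unsigned char *buf, size_t len, size_t *data_off)
{
    size_t pos = 10;                        /* fixed-length part */
    unsigned char flg;

    if (len < 10)
        return GZ_ERR_TRUNCATED;
    if (buf[0] != 0x1f || buf[1] != 0x8b)
        return GZ_ERR_MAGIC;
    if (buf[2] != 8)
        return GZ_ERR_METHOD;
    flg = buf[3];
    if (flg & 0xe0)
        return GZ_ERR_RESERVED;

    if (flg & 0x04) {                       /* FEXTRA: XLEN + data */
        size_t xlen;
        if (len - pos < 2)
            return GZ_ERR_TRUNCATED;
        xlen = (size_t)buf[pos] | ((size_t)buf[pos + 1] << 8);
        pos += 2;
        if (len - pos < xlen)
            return GZ_ERR_TRUNCATED;
        pos += xlen;
    }
    if ((flg & 0x08) && skip_zstring(buf, len, &pos) != GZ_OK)
        return GZ_ERR_TRUNCATED;            /* FNAME */
    if ((flg & 0x10) && skip_zstring(buf, len, &pos) != GZ_OK)
        return GZ_ERR_TRUNCATED;            /* FCOMMENT */
    if (flg & 0x02) {                       /* FHCRC: skip the CRC16 */
        if (len - pos < 2)
            return GZ_ERR_TRUNCATED;
        pos += 2;
    }
    *data_off = pos;
    return GZ_OK;
}
```

         The sketch ignores FTEXT, MTIME, XFL and OS, which a compliant
         decompressor is allowed to do, and it skips the CRC16 without
         verifying it.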
0481 
0482 3. References
0483 
0484    [1] "Information Processing - 8-bit single-byte coded graphic
0485        character sets - Part 1: Latin alphabet No.1" (ISO 8859-1:1987).
0486        The ISO 8859-1 (Latin-1) character set is a superset of 7-bit
0487        ASCII. Files defining this character set are available as
0488        iso_8859-1.* in ftp://ftp.uu.net/graphics/png/documents/
0489 
0490    [2] ISO 3309
0491 
0492    [3] ITU-T recommendation V.42
0493 
0494    [4] Deutsch, L.P., "DEFLATE Compressed Data Format Specification",
0495        available in ftp://ftp.uu.net/pub/archiving/zip/doc/
0496 
0497    [5] Gailly, J.-L., GZIP documentation, available as gzip-*.tar in
0498        ftp://prep.ai.mit.edu/pub/gnu/
0499 
0500    [6] Sarwate, D.V., "Computation of Cyclic Redundancy Checks via Table
0501        Look-Up", Communications of the ACM, 31(8), pp.1008-1013.
0502 
0511    [7] Schwaderer, W.D., "CRC Calculation", April 85 PC Tech Journal,
0512        pp.118-133.
0513 
0514    [8] ftp://ftp.adelaide.edu.au/pub/rocksoft/papers/crc_v3.txt,
0515        describing the CRC concept.
0516 
0517 4. Security Considerations
0518 
0519    Any data compression method involves the reduction of redundancy in
0520    the data.  Consequently, any corruption of the data is likely to have
0521    severe effects and be difficult to correct.  Uncompressed text, on
0522    the other hand, will probably still be readable despite the presence
0523    of some corrupted bytes.
0524 
0525    It is recommended that systems using this data format provide some
0526    means of validating the integrity of the compressed data, such as by
0527    setting and checking the CRC-32 check value.
0528 
0529 5. Acknowledgements
0530 
0531    Trademarks cited in this document are the property of their
0532    respective owners.
0533 
0534    Jean-Loup Gailly designed the gzip format and wrote, with Mark Adler,
0535    the related software described in this specification.  Glenn
0536    Randers-Pehrson converted this document to RFC and HTML format.
0537 
0538 6. Author's Address
0539 
0540    L. Peter Deutsch
0541    Aladdin Enterprises
0542    203 Santa Margarita Ave.
0543    Menlo Park, CA 94025
0544 
0545    Phone: (415) 322-0103 (AM only)
0546    FAX:   (415) 322-1734
0547    EMail: <ghost@aladdin.com>
0548 
0549    Questions about the technical content of this specification can be
0550    sent by email to:
0551 
0552    Jean-Loup Gailly <gzip@prep.ai.mit.edu> and
0553    Mark Adler <madler@alumni.caltech.edu>
0554 
0555    Editorial comments on this specification can be sent by email to:
0556 
0557    L. Peter Deutsch <ghost@aladdin.com> and
0558    Glenn Randers-Pehrson <randeg@alumni.rpi.edu>
0559 
0567 7. Appendix: Jean-Loup Gailly's gzip utility
0568 
0569    The most widely used implementation of gzip compression, and the
0570    original documentation on which this specification is based, were
0571    created by Jean-Loup Gailly <gzip@prep.ai.mit.edu>.  Since this
0572    implementation is a de facto standard, we mention some more of its
0573    features here.  Again, the material in this section is not part of
0574    the specification per se, and implementations need not follow it to
0575    be compliant.
0576 
0577    When compressing or decompressing a file, gzip preserves the
0578    protection, ownership, and modification time attributes on the local
0579    file system, since there is no provision for representing protection
0580    attributes in the gzip file format itself.  Since the file format
0581    includes a modification time, the gzip decompressor provides a
0582    command line switch that assigns the modification time from the file,
0583    rather than the local modification time of the compressed input, to
0584    the decompressed output.
0585 
0586 8. Appendix: Sample CRC Code
0587 
0588    The following sample code represents a practical implementation of
0589    the CRC (Cyclic Redundancy Check). (See also ISO 3309 and ITU-T V.42
0590    for a formal specification.)
0591 
0592    The sample code is in the ANSI C programming language. Non-C users
0593    may find it easier to read with these hints:
0594 
0595       &      Bitwise AND operator.
0596       ^      Bitwise exclusive-OR operator.
0597       >>     Bitwise right shift operator. When applied to an
0598              unsigned quantity, as here, right shift inserts zero
0599              bit(s) at the left.
0600       !      Logical NOT operator.
0601       ++     "n++" increments the variable n.
0602       0xNNN  0x introduces a hexadecimal (base 16) constant.
0603              Suffix L indicates a long value (at least 32 bits).
0604 
0605       /* Table of CRCs of all 8-bit messages. */
0606       unsigned long crc_table[256];
0607 
0608       /* Flag: has the table been computed? Initially false. */
0609       int crc_table_computed = 0;
0610 
0611       /* Make the table for a fast CRC. */
0612       void make_crc_table(void)
0613       {
0614         unsigned long c;
0615 
0616 
0617 
0618 Deutsch                      Informational                     [Page 11]
0619 
0620 RFC 1952             GZIP File Format Specification             May 1996
0621 
0622 
0623         int n, k;
0624         for (n = 0; n < 256; n++) {
0625           c = (unsigned long) n;
0626           for (k = 0; k < 8; k++) {
0627             if (c & 1) {
0628               c = 0xedb88320L ^ (c >> 1);
0629             } else {
0630               c = c >> 1;
0631             }
0632           }
0633           crc_table[n] = c;
0634         }
0635         crc_table_computed = 1;
0636       }
0637 
0638       /*
0639          Update a running crc with the bytes buf[0..len-1] and return
0640          the updated crc. The crc should be initialized to zero. Pre- and
0641          post-conditioning (one's complement) is performed within this
0642          function so it shouldn't be done by the caller. Usage example:
0643 
0644            unsigned long crc = 0L;
0645 
0646            while (read_buffer(buffer, length) != EOF) {
0647              crc = update_crc(crc, buffer, length);
0648            }
0649            if (crc != original_crc) error();
0650       */
0651       unsigned long update_crc(unsigned long crc,
0652                       unsigned char *buf, int len)
0653       {
0654         unsigned long c = crc ^ 0xffffffffL;
0655         int n;
0656 
0657         if (!crc_table_computed)
0658           make_crc_table();
0659         for (n = 0; n < len; n++) {
0660           c = crc_table[(c ^ buf[n]) & 0xff] ^ (c >> 8);
0661         }
0662         return c ^ 0xffffffffL;
0663       }
0664 
0665       /* Return the CRC of the bytes buf[0..len-1]. */
0666       unsigned long crc(unsigned char *buf, int len)
0667       {
0668         return update_crc(0L, buf, len);
0669       }