Network Working Group                                         P. Deutsch
Request for Comments: 1952                           Aladdin Enterprises
Category: Informational                                         May 1996


               GZIP file format specification version 4.3

Status of This Memo

   This memo provides information for the Internet community. This memo
   does not specify an Internet standard of any kind. Distribution of
   this memo is unlimited.

IESG Note:

   The IESG takes no position on the validity of any Intellectual
   Property Rights statements contained in this document.

Notices

   Copyright (c) 1996 L. Peter Deutsch

   Permission is granted to copy and distribute this document for any
   purpose and without charge, including translations into other
   languages and incorporation into compilations, provided that the
   copyright notice and this notice are preserved, and that any
   substantive changes or deletions from the original are clearly
   marked.

   A pointer to the latest version of this and related documentation in
   HTML format can be found at the URL
   <ftp://ftp.uu.net/graphics/png/documents/zlib/zdoc-index.html>.

Abstract

   This specification defines a lossless compressed data format that is
   compatible with the widely used GZIP utility. The format includes a
   cyclic redundancy check value for detecting data corruption. The
   format presently uses the DEFLATE method of compression but can be
   easily extended to use other compression methods. The format can be
   implemented readily in a manner not covered by patents.

Deutsch                      Informational                      [Page 1]

RFC 1952             GZIP File Format Specification             May 1996

Table of Contents

   1. Introduction
      1.1. Purpose
      1.2. Intended audience
      1.3. Scope
      1.4. Compliance
      1.5. Definitions of terms and conventions used
      1.6. Changes from previous versions
   2. Detailed specification
      2.1. Overall conventions
      2.2. File format
      2.3. Member format
         2.3.1. Member header and trailer
            2.3.1.1. Extra field
            2.3.1.2. Compliance
   3. References
   4. Security Considerations
   5. Acknowledgements
   6. Author's Address
   7. Appendix: Jean-Loup Gailly's gzip utility
   8. Appendix: Sample CRC Code

1. Introduction

   1.1. Purpose

      The purpose of this specification is to define a lossless
      compressed data format that:

          * Is independent of CPU type, operating system, file system,
            and character set, and hence can be used for interchange;
          * Can compress or decompress a data stream (as opposed to a
            randomly accessible file) to produce another data stream,
            using only an a priori bounded amount of intermediate
            storage, and hence can be used in data communications or
            similar structures such as Unix filters;
          * Compresses data with efficiency comparable to the best
            currently available general-purpose compression methods,
            and in particular considerably better than the "compress"
            program;
          * Can be implemented readily in a manner not covered by
            patents, and hence can be practiced freely;
          * Is compatible with the file format produced by the current
            widely used gzip utility, in that conforming decompressors
            will be able to read data produced by the existing gzip
            compressor.

      The data format defined by this specification does not attempt to:

          * Provide random access to compressed data;
          * Compress specialized data (e.g., raster graphics) as well as
            the best currently available specialized algorithms.

   1.2. Intended audience

      This specification is intended for use by implementors of software
      to compress data into gzip format and/or decompress data from gzip
      format.

      The text of the specification assumes a basic background in
      programming at the level of bits and other primitive data
      representations.

   1.3. Scope

      The specification specifies a compression method and a file format
      (the latter assuming only that a file can store a sequence of
      arbitrary bytes). It does not specify any particular interface to
      a file system or anything about character sets or encodings
      (except for file names and comments, which are optional).

   1.4. Compliance

      Unless otherwise indicated below, a compliant decompressor must be
      able to accept and decompress any file that conforms to all the
      specifications presented here; a compliant compressor must produce
      files that conform to all the specifications presented here. The
      material in the appendices is not part of the specification per se
      and is not relevant to compliance.

   1.5. Definitions of terms and conventions used

      byte: 8 bits stored or transmitted as a unit (same as an octet).
      (For this specification, a byte is exactly 8 bits, even on
      machines which store a character on a number of bits different
      from 8.) See below for the numbering of bits within a byte.

   1.6. Changes from previous versions

      There have been no technical changes to the gzip format since
      version 4.1 of this specification. In version 4.2, some
      terminology was changed, and the sample CRC code was rewritten for
      clarity and to eliminate the requirement for the caller to do pre-
      and post-conditioning. Version 4.3 is a conversion of the
      specification to RFC style.

2. Detailed specification

   2.1. Overall conventions

      In the diagrams below, a box like this:

         +---+
         |   | <-- the vertical bars might be missing
         +---+

      represents one byte; a box like this:

         +==============+
         |              |
         +==============+

      represents a variable number of bytes.

      Bytes stored within a computer do not have a "bit order", since
      they are always treated as a unit. However, a byte considered as
      an integer between 0 and 255 does have a most- and least-
      significant bit, and since we write numbers with the most-
      significant digit on the left, we also write bytes with the most-
      significant bit on the left. In the diagrams below, we number the
      bits of a byte so that bit 0 is the least-significant bit, i.e.,
      the bits are numbered:

         +--------+
         |76543210|
         +--------+

      This document does not address the issue of the order in which
      bits of a byte are transmitted on a bit-sequential medium, since
      the data format described here is byte- rather than bit-oriented.

      Within a computer, a number may occupy multiple bytes. All
      multi-byte numbers in the format described here are stored with
      the least-significant byte first (at the lower memory address).
      For example, the decimal number 520 is stored as:

             0        1
         +--------+--------+
         |00001000|00000010|
         +--------+--------+
          ^        ^
          |        |
          |        + more significant byte = 2 x 256
          + less significant byte = 8
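
As an illustration of the byte order just described, the following C sketch assembles 16-bit and 32-bit values from bytes stored least-significant byte first. The helper names read_le16 and read_le32 are hypothetical and are not part of this specification.

```c
#include <stdint.h>

/* Hypothetical helpers (names are not from the specification):
   assemble multi-byte numbers stored least-significant byte first. */
static uint16_t read_le16(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t read_le32(const unsigned char *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

Applied to the two bytes in the diagram (00001000 followed by 00000010), read_le16 yields 8 + 2 x 256 = 520.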

   2.2. File format

      A gzip file consists of a series of "members" (compressed data
      sets). The format of each member is specified in the following
      section. The members simply appear one after another in the file,
      with no additional information before, between, or after them.

   2.3. Member format

      Each member has the following structure:

         +---+---+---+---+---+---+---+---+---+---+
         |ID1|ID2|CM |FLG|     MTIME     |XFL|OS | (more-->)
         +---+---+---+---+---+---+---+---+---+---+

      (if FLG.FEXTRA set)

         +---+---+=================================+
         | XLEN  |...XLEN bytes of "extra field"...| (more-->)
         +---+---+=================================+

      (if FLG.FNAME set)

         +=========================================+
         |...original file name, zero-terminated...| (more-->)
         +=========================================+

      (if FLG.FCOMMENT set)

         +===================================+
         |...file comment, zero-terminated...| (more-->)
         +===================================+

      (if FLG.FHCRC set)

         +---+---+
         | CRC16 |
         +---+---+

         +=======================+
         |...compressed blocks...| (more-->)
         +=======================+

           0   1   2   3   4   5   6   7
         +---+---+---+---+---+---+---+---+
         |     CRC32     |     ISIZE     |
         +---+---+---+---+---+---+---+---+

      2.3.1. Member header and trailer

         ID1 (IDentification 1)
         ID2 (IDentification 2)
            These have the fixed values ID1 = 31 (0x1f, \037), ID2 = 139
            (0x8b, \213), to identify the file as being in gzip format.

         CM (Compression Method)
            This identifies the compression method used in the file. CM
            = 0-7 are reserved. CM = 8 denotes the "deflate"
            compression method, which is the one customarily used by
            gzip and which is documented elsewhere.

         FLG (FLaGs)
            This flag byte is divided into individual bits as follows:

               bit 0   FTEXT
               bit 1   FHCRC
               bit 2   FEXTRA
               bit 3   FNAME
               bit 4   FCOMMENT
               bit 5   reserved
               bit 6   reserved
               bit 7   reserved

            If FTEXT is set, the file is probably ASCII text. This is
            an optional indication, which the compressor may set by
            checking a small amount of the input data to see whether any
            non-ASCII characters are present. In case of doubt, FTEXT
            is cleared, indicating binary data. For systems which have
            different file formats for ASCII text and binary data, the
            decompressor can use FTEXT to choose the appropriate format.
            We deliberately do not specify the algorithm used to set
            this bit, since a compressor always has the option of
            leaving it cleared and a decompressor always has the option
            of ignoring it and letting some other program handle issues
            of data conversion.

            If FHCRC is set, a CRC16 for the gzip header is present,
            immediately before the compressed data. The CRC16 consists
            of the two least significant bytes of the CRC32 for all
            bytes of the gzip header up to and not including the CRC16.
            [The FHCRC bit was never set by versions of gzip up to
            1.2.4, even though it was documented with a different
            meaning in gzip 1.2.4.]
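
One way to derive the header CRC16 can be sketched in C as follows. This is an illustration under stated assumptions, not a normative implementation: the function names gz_crc32 and gz_header_crc16 are hypothetical, and the bit-at-a-time CRC-32 loop is a compact stand-in for a table-driven implementation, using the polynomial constant 0xedb88320.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-at-a-time CRC-32 (reflected polynomial 0xedb88320), with the
   one's-complement pre- and post-conditioning done internally.
   Hypothetical name; a stand-in for a table-driven version. */
static uint32_t gz_crc32(const unsigned char *buf, size_t len)
{
    uint32_t c = 0xffffffffu;
    size_t i;
    int k;

    for (i = 0; i < len; i++) {
        c ^= buf[i];
        for (k = 0; k < 8; k++)
            c = (c & 1) ? (0xedb88320u ^ (c >> 1)) : (c >> 1);
    }
    return c ^ 0xffffffffu;
}

/* CRC16 header check: the two least significant bytes of the CRC-32
   of all header bytes that precede the CRC16 field. */
static uint16_t gz_header_crc16(const unsigned char *hdr, size_t len)
{
    return (uint16_t)(gz_crc32(hdr, len) & 0xffffu);
}
```

Like every multi-byte number in the format, the resulting 16-bit value is stored least-significant byte first.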

            If FEXTRA is set, optional extra fields are present, as
            described in a following section.

            If FNAME is set, an original file name is present,
            terminated by a zero byte. The name must consist of ISO
            8859-1 (LATIN-1) characters; on operating systems using
            EBCDIC or any other character set for file names, the name
            must be translated to the ISO LATIN-1 character set. This
            is the original name of the file being compressed, with any
            directory components removed, and, if the file being
            compressed is on a file system with case insensitive names,
            forced to lower case. There is no original file name if the
            data was compressed from a source other than a named file;
            for example, if the source was stdin on a Unix system, there
            is no file name.

            If FCOMMENT is set, a zero-terminated file comment is
            present. This comment is not interpreted; it is only
            intended for human consumption. The comment must consist of
            ISO 8859-1 (LATIN-1) characters. Line breaks should be
            denoted by a single line feed character (10 decimal).

            Reserved FLG bits must be zero.
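
The FLG bit assignments above can be expressed as masks. The following C fragment is a minimal sketch; the GZ_* constant names and the two helper functions are hypothetical and chosen only for illustration.

```c
/* Bit masks for the FLG byte; bit 0 is the least-significant bit.
   The GZ_* names are illustrative, not defined by the specification. */
enum {
    GZ_FTEXT    = 1 << 0,
    GZ_FHCRC    = 1 << 1,
    GZ_FEXTRA   = 1 << 2,
    GZ_FNAME    = 1 << 3,
    GZ_FCOMMENT = 1 << 4,
    GZ_RESERVED = 0xE0      /* bits 5, 6 and 7 */
};

/* Returns nonzero if any reserved FLG bit is set. */
static int gz_flags_reserved_set(unsigned char flg)
{
    return (flg & GZ_RESERVED) != 0;
}

/* Returns nonzero if an original file name follows the header. */
static int gz_has_name(unsigned char flg)
{
    return (flg & GZ_FNAME) != 0;
}
```

For example, a FLG value of 0x08 has only FNAME set, while 0x20 has a reserved bit set.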

         MTIME (Modification TIME)
            This gives the most recent modification time of the original
            file being compressed. The time is in Unix format, i.e.,
            seconds since 00:00:00 GMT, Jan. 1, 1970. (Note that this
            may cause problems for MS-DOS and other systems that use
            local rather than Universal time.) If the compressed data
            did not come from a file, MTIME is set to the time at which
            compression started. MTIME = 0 means no time stamp is
            available.
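
A reader might interpret MTIME along the following lines. This C sketch assumes a platform whose time_t counts seconds since 00:00:00 GMT, Jan. 1, 1970 (true on POSIX systems, but not guaranteed by the C standard); the function name gz_format_mtime is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Formats MTIME as a Universal-time string in out.
   Returns 0 if no time stamp is available (MTIME == 0) or if the
   conversion fails, and 1 on success.
   Assumption: time_t counts seconds since the Unix epoch. */
static int gz_format_mtime(uint32_t mtime, char *out, size_t outlen)
{
    time_t t;
    struct tm *utc;

    if (mtime == 0)
        return 0;               /* MTIME = 0: no time stamp */
    t = (time_t)mtime;
    utc = gmtime(&t);
    if (utc == NULL)
        return 0;
    return strftime(out, outlen, "%Y-%m-%d %H:%M:%S", utc) != 0;
}
```

Treating MTIME = 0 as a distinct case keeps a missing time stamp from being reported as the epoch itself.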

         XFL (eXtra FLags)
            These flags are available for use by specific compression
            methods. The "deflate" method (CM = 8) sets these flags as
            follows:

               XFL = 2 - compressor used maximum compression,
                         slowest algorithm
               XFL = 4 - compressor used fastest algorithm

         OS (Operating System)
            This identifies the type of file system on which compression
            took place. This may be useful in determining end-of-line
            convention for text files. The currently defined values are
            as follows:

                 0 - FAT filesystem (MS-DOS, OS/2, NT/Win32)
                 1 - Amiga
                 2 - VMS (or OpenVMS)
                 3 - Unix
                 4 - VM/CMS
                 5 - Atari TOS
                 6 - HPFS filesystem (OS/2, NT)
                 7 - Macintosh
                 8 - Z-System
                 9 - CP/M
                10 - TOPS-20
                11 - NTFS filesystem (NT)
                12 - QDOS
                13 - Acorn RISCOS
               255 - unknown
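
For diagnostic output, the OS values listed above could be mapped to their names with a small lookup. The C sketch below is illustrative only; the function name gz_os_name is hypothetical, and values that the list does not define are reported as NULL.

```c
#include <stddef.h>

/* Hypothetical lookup: maps the OS header byte to the names in the
   list above.  Returns NULL for values the list does not define. */
static const char *gz_os_name(unsigned char os)
{
    static const char *const names[] = {
        "FAT filesystem (MS-DOS, OS/2, NT/Win32)", "Amiga",
        "VMS (or OpenVMS)", "Unix", "VM/CMS", "Atari TOS",
        "HPFS filesystem (OS/2, NT)", "Macintosh", "Z-System", "CP/M",
        "TOPS-20", "NTFS filesystem (NT)", "QDOS", "Acorn RISCOS"
    };

    if (os < sizeof names / sizeof names[0])
        return names[os];
    return os == 255 ? "unknown" : NULL;
}
```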

         XLEN (eXtra LENgth)
            If FLG.FEXTRA is set, this gives the length of the optional
            extra field. See below for details.

         CRC32 (CRC-32)
            This contains a Cyclic Redundancy Check value of the
            uncompressed data computed according to CRC-32 algorithm
            used in the ISO 3309 standard and in section 8.1.1.6.2 of
            ITU-T recommendation V.42. (See http://www.iso.ch for
            ordering ISO documents. See gopher://info.itu.ch for an
            online version of ITU-T V.42.)

         ISIZE (Input SIZE)
            This contains the size of the original (uncompressed) input
            data modulo 2^32.
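
The 8-byte trailer can be decoded as two 32-bit numbers stored least-significant byte first. The following C sketch shows one possible approach; the structure and function names are hypothetical, and read_le32 is defined here so that the fragment stands alone.

```c
#include <stdint.h>

/* Hypothetical container for the two trailer fields. */
struct gz_trailer {
    uint32_t crc32;   /* CRC-32 of the uncompressed data */
    uint32_t isize;   /* uncompressed size modulo 2^32 */
};

static uint32_t read_le32(const unsigned char *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Decodes the 8 trailer bytes that follow the compressed blocks. */
static struct gz_trailer gz_read_trailer(const unsigned char *p)
{
    struct gz_trailer t;

    t.crc32 = read_le32(p);
    t.isize = read_le32(p + 4);
    return t;
}

/* Compares an actual uncompressed length with ISIZE, modulo 2^32. */
static int gz_isize_matches(uint64_t actual_len, uint32_t isize)
{
    return (uint32_t)(actual_len & 0xffffffffu) == isize;
}
```

Because ISIZE is reduced modulo 2^32, an input of 2^32 + 9 bytes and an input of 9 bytes carry the same ISIZE, so the field alone cannot give the true size of very large inputs.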

         2.3.1.1. Extra field

            If the FLG.FEXTRA bit is set, an "extra field" is present in
            the header, with total length XLEN bytes. It consists of a
            series of subfields, each of the form:

               +---+---+---+---+==================================+
               |SI1|SI2|  LEN  |... LEN bytes of subfield data ...|
               +---+---+---+---+==================================+

            SI1 and SI2 provide a subfield ID, typically two ASCII letters
            with some mnemonic value. Jean-Loup Gailly
            <gzip@prep.ai.mit.edu> is maintaining a registry of subfield
            IDs; please send him any subfield ID you wish to use. Subfield
            IDs with SI2 = 0 are reserved for future use. The following
            IDs are currently defined:

               SI1         SI2         Data
               ----------  ----------  ----
               0x41 ('A')  0x70 ('P')  Apollo file type information

            LEN gives the length of the subfield data, excluding the 4
            initial bytes.
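
Walking the subfields of an extra field can be sketched in C as shown below. The function name gz_find_subfield and its interface are hypothetical; the sketch treats a subfield whose LEN runs past the end of the extra field as malformed, which is a choice of this example rather than a rule stated here.

```c
#include <stddef.h>

/* Searches an extra field of xlen bytes for the subfield with ID
   (si1, si2).  On success, stores the data length in *len and
   returns a pointer to the subfield data.  Returns NULL if the
   subfield is absent or a subfield overruns the extra field.
   Hypothetical helper, not defined by the specification. */
static const unsigned char *gz_find_subfield(const unsigned char *extra,
                                             size_t xlen,
                                             unsigned char si1,
                                             unsigned char si2,
                                             size_t *len)
{
    size_t pos = 0;

    while (xlen - pos >= 4) {
        /* LEN is stored least-significant byte first. */
        size_t sublen = (size_t)extra[pos + 2]
                      | ((size_t)extra[pos + 3] << 8);

        if (sublen > xlen - pos - 4)
            return NULL;                    /* truncated subfield */
        if (extra[pos] == si1 && extra[pos + 1] == si2) {
            *len = sublen;
            return extra + pos + 4;
        }
        pos += 4 + sublen;                  /* skip to next subfield */
    }
    return NULL;
}
```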

         2.3.1.2. Compliance

            A compliant compressor must produce files with correct ID1,
            ID2, CM, CRC32, and ISIZE, but may set all the other fields in
            the fixed-length part of the header to default values (255 for
            OS, 0 for all others). The compressor must set all reserved
            bits to zero.

            A compliant decompressor must check ID1, ID2, and CM, and
            provide an error indication if any of these have incorrect
            values. It must examine FEXTRA/XLEN, FNAME, FCOMMENT and FHCRC
            at least so it can skip over the optional fields if they are
            present. It need not examine any other part of the header or
            trailer; in particular, a decompressor may ignore FTEXT and OS
            and always produce binary output, and still be compliant. A
            compliant decompressor must give an error indication if any
            reserved bit is non-zero, since such a bit could indicate the
            presence of a new field that would cause subsequent data to be
            interpreted incorrectly.
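
The minimum checks required of a decompressor can be sketched in C as follows. The function name gz_header_length is hypothetical, and returning 0 as the error indication is a convention of this example (a valid header is at least 10 bytes long, so 0 is never a valid offset). The sketch accepts only CM = 8 and does not verify the optional CRC16.

```c
#include <stddef.h>

/* Validates ID1, ID2, CM and the reserved FLG bits of a member held
   in buf[0..n-1], then skips the optional fields.  Returns the
   offset of the compressed blocks, or 0 as an error indication.
   Hypothetical helper; only CM = 8 ("deflate") is accepted. */
static size_t gz_header_length(const unsigned char *buf, size_t n)
{
    size_t pos = 10;                 /* fixed-length part */
    unsigned char flg;

    if (n < 10)
        return 0;
    if (buf[0] != 0x1f || buf[1] != 0x8b || buf[2] != 8)
        return 0;                    /* bad ID1, ID2 or CM */
    flg = buf[3];
    if (flg & 0xE0)
        return 0;                    /* reserved bit is non-zero */

    if (flg & 0x04) {                /* FEXTRA: XLEN + extra field */
        size_t xlen;

        if (n - pos < 2)
            return 0;
        xlen = (size_t)buf[pos] | ((size_t)buf[pos + 1] << 8);
        pos += 2;
        if (n - pos < xlen)
            return 0;
        pos += xlen;
    }
    if (flg & 0x08) {                /* FNAME: zero-terminated */
        while (pos < n && buf[pos] != 0)
            pos++;
        if (pos == n)
            return 0;
        pos++;
    }
    if (flg & 0x10) {                /* FCOMMENT: zero-terminated */
        while (pos < n && buf[pos] != 0)
            pos++;
        if (pos == n)
            return 0;
        pos++;
    }
    if (flg & 0x02) {                /* FHCRC: skip the CRC16 */
        if (n - pos < 2)
            return 0;
        pos += 2;
    }
    return pos;
}
```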

3. References

   [1] "Information Processing - 8-bit single-byte coded graphic
       character sets - Part 1: Latin alphabet No.1" (ISO 8859-1:1987).
       The ISO 8859-1 (Latin-1) character set is a superset of 7-bit
       ASCII. Files defining this character set are available as
       iso_8859-1.* in ftp://ftp.uu.net/graphics/png/documents/

   [2] ISO 3309

   [3] ITU-T recommendation V.42

   [4] Deutsch, L.P., "DEFLATE Compressed Data Format Specification",
       available in ftp://ftp.uu.net/pub/archiving/zip/doc/

   [5] Gailly, J.-L., GZIP documentation, available as gzip-*.tar in
       ftp://prep.ai.mit.edu/pub/gnu/

   [6] Sarwate, D.V., "Computation of Cyclic Redundancy Checks via Table
       Look-Up", Communications of the ACM, 31(8), pp.1008-1013.

   [7] Schwaderer, W.D., "CRC Calculation", April 85 PC Tech Journal,
       pp.118-133.

   [8] ftp://ftp.adelaide.edu.au/pub/rocksoft/papers/crc_v3.txt,
       describing the CRC concept.

4. Security Considerations

   Any data compression method involves the reduction of redundancy in
   the data. Consequently, any corruption of the data is likely to have
   severe effects and be difficult to correct. Uncompressed text, on
   the other hand, will probably still be readable despite the presence
   of some corrupted bytes.

   It is recommended that systems using this data format provide some
   means of validating the integrity of the compressed data, such as by
   setting and checking the CRC-32 check value.

5. Acknowledgements

   Trademarks cited in this document are the property of their
   respective owners.

   Jean-Loup Gailly designed the gzip format and wrote, with Mark Adler,
   the related software described in this specification. Glenn
   Randers-Pehrson converted this document to RFC and HTML format.

6. Author's Address

   L. Peter Deutsch
   Aladdin Enterprises
   203 Santa Margarita Ave.
   Menlo Park, CA 94025

   Phone: (415) 322-0103 (AM only)
   FAX:   (415) 322-1734
   EMail: <ghost@aladdin.com>

   Questions about the technical content of this specification can be
   sent by email to:

   Jean-Loup Gailly <gzip@prep.ai.mit.edu> and
   Mark Adler <madler@alumni.caltech.edu>

   Editorial comments on this specification can be sent by email to:

   L. Peter Deutsch <ghost@aladdin.com> and
   Glenn Randers-Pehrson <randeg@alumni.rpi.edu>

7. Appendix: Jean-Loup Gailly's gzip utility

   The most widely used implementation of gzip compression, and the
   original documentation on which this specification is based, were
   created by Jean-Loup Gailly <gzip@prep.ai.mit.edu>. Since this
   implementation is a de facto standard, we mention some more of its
   features here. Again, the material in this section is not part of
   the specification per se, and implementations need not follow it to
   be compliant.

   When compressing or decompressing a file, gzip preserves the
   protection, ownership, and modification time attributes on the local
   file system, since there is no provision for representing protection
   attributes in the gzip file format itself. Since the file format
   includes a modification time, the gzip decompressor provides a
   command line switch that assigns the modification time from the file,
   rather than the local modification time of the compressed input, to
   the decompressed output.

8. Appendix: Sample CRC Code

   The following sample code represents a practical implementation of
   the CRC (Cyclic Redundancy Check). (See also ISO 3309 and ITU-T V.42
   for a formal specification.)

   The sample code is in the ANSI C programming language. Non-C users
   may find it easier to read with these hints:

      &      Bitwise AND operator.
      ^      Bitwise exclusive-OR operator.
      >>     Bitwise right shift operator. When applied to an
             unsigned quantity, as here, right shift inserts zero
             bit(s) at the left.
      !      Logical NOT operator.
      ++     "n++" increments the variable n.
      0xNNN  0x introduces a hexadecimal (base 16) constant.
             Suffix L indicates a long value (at least 32 bits).

      /* Table of CRCs of all 8-bit messages. */
      unsigned long crc_table[256];

      /* Flag: has the table been computed? Initially false. */
      int crc_table_computed = 0;

      /* Make the table for a fast CRC. */
      void make_crc_table(void)
      {
        unsigned long c;
        int n, k;

        for (n = 0; n < 256; n++) {
          c = (unsigned long) n;
          for (k = 0; k < 8; k++) {
            if (c & 1) {
              c = 0xedb88320L ^ (c >> 1);
            } else {
              c = c >> 1;
            }
          }
          crc_table[n] = c;
        }
        crc_table_computed = 1;
      }

      /*
         Update a running crc with the bytes buf[0..len-1] and return
       the updated crc. The crc should be initialized to zero. Pre- and
       post-conditioning (one's complement) is performed within this
       function so it shouldn't be done by the caller. Usage example:

         unsigned long crc = 0L;

         while (read_buffer(buffer, length) != EOF) {
           crc = update_crc(crc, buffer, length);
         }
         if (crc != original_crc) error();
      */
      unsigned long update_crc(unsigned long crc,
                      unsigned char *buf, int len)
      {
        unsigned long c = crc ^ 0xffffffffL;
        int n;

        if (!crc_table_computed)
          make_crc_table();
        for (n = 0; n < len; n++) {
          c = crc_table[(c ^ buf[n]) & 0xff] ^ (c >> 8);
        }
        return c ^ 0xffffffffL;
      }

      /* Return the CRC of the bytes buf[0..len-1]. */
      unsigned long crc(unsigned char *buf, int len)
      {
        return update_crc(0L, buf, len);
      }