zlib/doc/txtvsbin.txt

0001 A Fast Method for Identifying Plain Text Files
0002 ==============================================
0003
0004
0005 Introduction
0006 ------------
0007
0008 Given a file coming from an unknown source, it is sometimes desirable
0009 to find out whether the format of that file is plain text.  Although
0010 this may appear like a simple task, a fully accurate detection of the
0011 file type requires heavy-duty semantic analysis on the file contents.
0012 It is, however, possible to obtain satisfactory results by employing
0013 various heuristics.
0014
0015 Previous versions of PKZip and other zip-compatible compression tools
0016 were using a crude detection scheme: if more than 80% (4/5) of the bytes
0017 found in a certain buffer are within the range [7..127], the file is
0018 labeled as plain text, otherwise it is labeled as binary.  A prominent
0019 limitation of this scheme is the restriction to Latin-based alphabets.
0020 Other alphabets, like Greek, Cyrillic or Asian, make extensive use of
0021 the bytes within the range [128..255], and texts using these alphabets
0022 are most often misidentified by this scheme; in other words, the rate
0023 of false negatives is sometimes too high, which means that the recall
0024 is low.  Another weakness of this scheme is a reduced precision, due to
0025 the false positives that may occur when binary files containing large
0026 amounts of textual characters are misidentified as plain text.
0027
0028 In this article we propose a new, simple detection scheme that features
0029 a much increased precision and a near-100% recall.  This scheme is
0030 designed to work on ASCII, Unicode and other ASCII-derived alphabets,
0031 and it handles single-byte encodings (ISO-8859, MacRoman, KOI8, etc.)
0032 and variable-sized encodings (ISO-2022, UTF-8, etc.).  Wider encodings
0033 (UCS-2/UTF-16 and UCS-4/UTF-32) are not handled, however.
0034
0035
0036 The Algorithm
0037 -------------
0038
0039 The algorithm works by dividing the set of bytecodes [0..255] into three
0040 categories:
0041 - The allow list of textual bytecodes:
0042   9 (TAB), 10 (LF), 13 (CR), 32 (SPACE) to 255.
0043 - The gray list of tolerated bytecodes:
0044   7 (BEL), 8 (BS), 11 (VT), 12 (FF), 26 (SUB), 27 (ESC).
0045 - The block list of undesired, non-textual bytecodes:
0046   0 (NUL) to 6, 14 to 31.
0047
0048 If a file contains at least one byte that belongs to the allow list and
0049 no byte that belongs to the block list, then the file is categorized as
0050 plain text; otherwise, it is categorized as binary.  (The boundary case,
0051 when the file is empty, automatically falls into the latter category.)
0052
0053
0054 Rationale
0055 ---------
0056
0057 The idea behind this algorithm relies on two observations.
0058
0059 The first observation is that, although the full range of 7-bit codes
0060 [0..127] is properly specified by the ASCII standard, most control
0061 characters in the range [0..31] are not used in practice.  The only
0062 widely-used, almost universally-portable control codes are 9 (TAB),
0063 10 (LF) and 13 (CR).  There are a few more control codes that are
0064 recognized on a reduced range of platforms and text viewers/editors:
0065 7 (BEL), 8 (BS), 11 (VT), 12 (FF), 26 (SUB) and 27 (ESC); but these
0066 codes are rarely (if ever) used alone, without being accompanied by
0067 some printable text.  Even the newer, portable text formats such as
0068 XML avoid using control characters outside the list mentioned here.
0069
0070 The second observation is that most of the binary files tend to contain
0071 control characters, especially 0 (NUL).  Even though the older text
0072 detection schemes observe the presence of non-ASCII codes from the range
0073 [128..255], the precision rarely has to suffer if this upper range is
0074 labeled as textual, because the files that are genuinely binary tend to
0075 contain both control characters and codes from the upper range.  On the
0076 other hand, the upper range needs to be labeled as textual, because it
0077 is used by virtually all ASCII extensions.  In particular, this range is
0078 used for encoding non-Latin scripts.
0079
0080 Since there is no counting involved, other than simply observing the
0081 presence or the absence of some byte values, the algorithm produces
0082 consistent results, regardless what alphabet encoding is being used.
0083 (If counting were involved, it could be possible to obtain different
0084 results on a text encoded, say, using ISO-8859-16 versus UTF-8.)
0085
0086 There is an extra category of plain text files that are "polluted" with
0087 one or more block-listed codes, either by mistake or by peculiar design
0088 considerations.  In such cases, a scheme that tolerates a small fraction
0089 of block-listed codes would provide an increased recall (i.e. more true
0090 positives).  This, however, incurs a reduced precision overall, since
0091 false positives are more likely to appear in binary files that contain
0092 large chunks of textual data.  Furthermore, "polluted" plain text should
0093 be regarded as binary by general-purpose text detection schemes, because
0094 general-purpose text processing algorithms might not be applicable.
0095 Under this premise, it is safe to say that our detection method provides
0096 a near-100% recall.
0097
0098 Experiments have been run on many files coming from various platforms
0099 and applications.  We tried plain text files, system logs, source code,
0100 formatted office documents, compiled object code, etc.  The results
0101 confirm the optimistic assumptions about the capabilities of this
0102 algorithm.
0103
0104
0105 --
0106 Cosmin Truta
0107 Last updated: 2006-May-28