utf8proc
========
Please read the LICENSE file, which ships with this software.


*** QUICK START ***

To compile the C library, run "make c-library"; to compile the Ruby library,
run "make ruby-library"; and to compile the PostgreSQL extension, run
"make pgsql-library".

For Ruby you can also create a gem file by running "make ruby-gem".

"make all" builds everything, but it requires both Ruby and PostgreSQL to be
installed.


*** GENERAL INFORMATION ***

After successful compilation, the C library is found in this directory as
"libutf8proc.a" and "libutf8proc.so". The Ruby library consists of the
files "utf8proc.rb" and "utf8proc_native.so", which are found in the
subdirectory "ruby/". If you chose to create a gem file, it is placed in the
"ruby/gem" directory. The PostgreSQL extension is named "utf8proc_pgsql.so"
and resides in the "pgsql/" directory.

Both the Ruby library and the PostgreSQL extension are built as stand-alone
libraries and therefore do not depend on the dynamic version of the
C library, but this behaviour might change in future releases.

The supported Unicode version is 5.0.0.
Note: Version 4.1.0 of Unicode Standard Annex #29 was used, as
      version 5.0.0 was not yet available at the time of implementation.

For Unicode normalizations, the following options have to be used:
Normalization Form C:  STABLE, COMPOSE
Normalization Form D:  STABLE, DECOMPOSE
Normalization Form KC: STABLE, COMPOSE, COMPAT
Normalization Form KD: STABLE, DECOMPOSE, COMPAT
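
To make the effect of the four forms concrete, here is a minimal sketch that
uses Ruby's built-in String#unicode_normalize instead of utf8proc (an
assumption: Ruby 2.2 or later; utf8proc itself is not loaded). The comments
name the option set from the list above that should correspond to each form.

```ruby
# Illustration with Ruby's standard library, not with utf8proc itself.
composed   = "\u00E9"   # "é" as one precomposed code point
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT
ligature   = "\uFB01"   # LATIN SMALL LIGATURE FI (a compatibility character)

# Form C  (STABLE, COMPOSE): canonical decomposition, then composition
nfc = decomposed.unicode_normalize(:nfc)
# Form D  (STABLE, DECOMPOSE): canonical decomposition only
nfd = composed.unicode_normalize(:nfd)
# Form KC (STABLE, COMPOSE, COMPAT): compatibility mapping, then composition
nfkc = ligature.unicode_normalize(:nfkc)
# Form KD (STABLE, DECOMPOSE, COMPAT): compatibility decomposition only
nfkd = ligature.unicode_normalize(:nfkd)

puts nfc == composed    # true: the accent was composed into "é"
puts nfd == decomposed  # true: "é" was split into "e" plus the accent
puts nfkc               # fi (two plain letters)
puts nfkd               # fi (two plain letters)
```

The canonical forms (C and D) leave the ligature untouched; only the forms
that include COMPAT replace it with the two letters "f" and "i".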


*** C LIBRARY ***

The documentation for the C library is found in the utf8proc.h header file.
"utf8proc_map" is most likely the function you will use for mapping UTF-8
strings, unless you want to allocate memory yourself.


*** RUBY API ***

The Ruby library adds the methods "utf8map" and "utf8map!" to the String
class, and the method "utf8" to the Integer class.

The String#utf8map method does the same as the "utf8proc_map" C function.
Options for the mapping procedure are passed as symbols, e.g.:
"Hello".utf8map(:casefold) => "hello"

The descriptions of all options are found in the C header file
"utf8proc.h". Note that the corresponding symbols in Ruby are all
lowercase.

String#utf8map! is the destructive variant: the string itself is replaced
by the result.

There are shortcuts for the 4 normalization forms specified by Unicode:
String#utf8nfd,  String#utf8nfd!,
String#utf8nfc,  String#utf8nfc!,
String#utf8nfkd, String#utf8nfkd!,
String#utf8nfkc, String#utf8nfkc!

The method Integer#utf8 returns a UTF-8 string containing the Unicode
character for the given code point:
0x000A.utf8 => "\n"
0x2028.utf8 => "\342\200\250"
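
The byte sequences shown above can be cross-checked without utf8proc. The
following sketch relies on Ruby's built-in Array#pack with the "U" directive;
the helper name utf8_from_codepoint is hypothetical and not part of the
library.

```ruby
# Hypothetical helper, not part of utf8proc: encode one code point as UTF-8
# using Ruby's built-in Array#pack ("U" packs a UTF-8 character).
def utf8_from_codepoint(code_point)
  [code_point].pack("U")
end

line_feed      = utf8_from_codepoint(0x000A)
line_separator = utf8_from_codepoint(0x2028)

p line_feed  # "\n"

# Show each byte in octal, matching the notation used in the README.
puts line_separator.bytes.map { |byte| format("\\%o", byte) }.join
# prints \342\200\250
```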


*** POSTGRESQL API ***

For PostgreSQL, two SQL functions are supplied, named "unifold" and
"unistrip". These functions can be used to prepare index fields so that
they are folded in a way where string comparisons make more sense, e.g.
where "bathtub" == "bath<soft hyphen>tub"
or "Hello World" == "hello world".

CREATE TABLE people (
  id    serial8 primary key,
  name  text,
  CHECK (unifold(name) NOTNULL)
);
CREATE INDEX name_idx ON people (unifold(name));
SELECT * FROM people WHERE unifold(name) = unifold('John Doe');

The function "unistrip" removes character marks like accents or diaereses,
while "unifold" keeps them.
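
As a rough illustration of the difference between the two functions, the
sketch below approximates the described behaviour with Ruby's standard
library. It is not the extension's implementation: the helper names
fold_sketch and strip_sketch are hypothetical, and the exact set of mappings
applied by "unifold" and "unistrip" is an assumption based only on the
description above.

```ruby
# Approximation only; the real functions are implemented by utf8proc.
SOFT_HYPHEN = "\u00AD"

# Roughly what "unifold" is described to do: normalize, drop the soft
# hyphen, and fold case so that equivalent strings compare equal.
def fold_sketch(str)
  str.unicode_normalize(:nfkc).gsub(SOFT_HYPHEN, "").downcase(:fold)
end

# Roughly what "unistrip" is described to do: the same folding, plus
# removal of combining marks such as accents (\p{Mn} = nonspacing marks).
def strip_sketch(str)
  fold_sketch(str).unicode_normalize(:nfd).gsub(/\p{Mn}/, "").unicode_normalize(:nfc)
end

puts fold_sketch("bath\u00ADtub") == fold_sketch("bathtub")      # true
puts fold_sketch("Hello World") == fold_sketch("hello world")    # true
puts fold_sketch("Caf\u00E9")   # café (accent kept)
puts strip_sketch("Caf\u00E9")  # cafe (accent removed)
```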

NOTICE: The outputs of these functions can change between releases, as
        utf8proc does not follow a versioning stability policy. You have to
        rebuild your database indices if you upgrade to a newer version
        of utf8proc.


*** TODO ***

- detect stable code points and process segments independently in order to
  save memory
- do a quick check before normalizing strings to optimize speed
- support stream processing


*** CONTACT ***

If you find any bugs or experience difficulties in compiling this software,
please contact us:

Project page: http://www.public-software-group.org/utf8proc