utf8proc
========
Please read the LICENSE file, which is shipped with this software.


*** QUICK START ***

To compile the C library, call "make c-library"; to compile the Ruby
library, call "make ruby-library"; and to compile the PostgreSQL extension,
call "make pgsql-library".

For Ruby you can also create a gem file by calling "make ruby-gem".

"make all" builds everything, but requires both Ruby and PostgreSQL
installations.


*** GENERAL INFORMATION ***

The C library is found in this directory after successful compilation and
consists of the files "libutf8proc.a" and "libutf8proc.so". The Ruby library
consists of the files "utf8proc.rb" and "utf8proc_native.so", which are found
in the subdirectory "ruby/". If you choose to create a gem file, it is placed
in the "ruby/gem" directory. The PostgreSQL extension is named
"utf8proc_pgsql.so" and resides in the "pgsql/" directory.

Both the Ruby library and the PostgreSQL extension are built as stand-alone
libraries and therefore do not depend on the dynamic version of the
C library files, but this behaviour might change in future releases.

The supported Unicode version is 5.0.0.
Note: Version 4.1.0 of Unicode Standard Annex #29 was used, as
      version 5.0.0 was not yet available at the time of implementation.

For Unicode normalizations, the following options have to be used:
Normalization Form C:  STABLE, COMPOSE
Normalization Form D:  STABLE, DECOMPOSE
Normalization Form KC: STABLE, COMPOSE, COMPAT
Normalization Form KD: STABLE, DECOMPOSE, COMPAT


*** C LIBRARY ***

The documentation for the C library is found in the utf8proc.h header file.
"utf8proc_map" is most likely the function you will be using for mapping
UTF-8 strings, unless you want to allocate memory yourself.


*** RUBY API ***

The Ruby library adds the methods "utf8map" and "utf8map!" to the String
class, and the method "utf8" to the Integer class.

The String#utf8map method does the same as the "utf8proc_map" C function.
Options for the mapping procedure are passed as symbols, e.g.:
"Hello".utf8map(:casefold) => "hello"

The descriptions of all options are found in the C header file
"utf8proc.h". Please note that the corresponding symbols in Ruby are all
lowercase.

String#utf8map! is the destructive variant: the string itself is replaced
by the result.

There are shortcuts for the 4 normalization forms specified by Unicode:
String#utf8nfd,  String#utf8nfd!,
String#utf8nfc,  String#utf8nfc!,
String#utf8nfkd, String#utf8nfkd!,
String#utf8nfkc, String#utf8nfkc!

The method Integer#utf8 returns a UTF-8 string containing the Unicode
character given by the code point:
0x000A.utf8 => "\n"
0x2028.utf8 => "\342\200\250"
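
The byte sequences in the two examples above follow directly from the UTF-8
encoding rules. As a rough illustration, a code point can be encoded in C
along the following lines. This is a sketch only, not the library's own
code, and the function name "encode_utf8" is made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper for illustration only (not part of utf8proc).
 * Encodes one code point as UTF-8 into dst, which must have room for
 * 5 bytes (up to 4 encoded bytes plus a terminating NUL).
 * Returns the number of encoded bytes, or 0 if cp is a surrogate or
 * lies above U+10FFFF (dst is left untouched in that case). */
size_t encode_utf8(uint32_t cp, unsigned char *dst)
{
    size_t n;

    if (cp < 0x80) {                         /* 1 byte: 0xxxxxxx */
        dst[0] = (unsigned char)cp;
        n = 1;
    } else if (cp < 0x800) {                 /* 2 bytes: 110xxxxx 10xxxxxx */
        dst[0] = (unsigned char)(0xC0 | (cp >> 6));
        dst[1] = (unsigned char)(0x80 | (cp & 0x3F));
        n = 2;
    } else if (cp < 0x10000) {               /* 3 bytes: 1110xxxx + 2 cont. */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                        /* surrogates are not encodable */
        dst[0] = (unsigned char)(0xE0 | (cp >> 12));
        dst[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        dst[2] = (unsigned char)(0x80 | (cp & 0x3F));
        n = 3;
    } else if (cp <= 0x10FFFF) {             /* 4 bytes: 11110xxx + 3 cont. */
        dst[0] = (unsigned char)(0xF0 | (cp >> 18));
        dst[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        dst[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        dst[3] = (unsigned char)(0x80 | (cp & 0x3F));
        n = 4;
    } else {
        return 0;                            /* beyond the Unicode range */
    }

    dst[n] = '\0';
    return n;
}
```

With this sketch, encode_utf8(0x2028, buf) writes the three bytes
0xE2 0x80 0xA8, which is the string "\342\200\250" in octal notation.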


*** POSTGRESQL API ***

For PostgreSQL, two SQL functions are supplied, named "unifold" and
"unistrip". These functions can be used to prepare index fields so that
they are folded in a way where string comparisons make more sense, e.g.
where "bathtub" == "bath<soft hyphen>tub"
or "Hello World" == "hello world".

CREATE TABLE people (
  id    serial8 primary key,
  name  text,
  CHECK (unifold(name) NOTNULL)
);
CREATE INDEX name_idx ON people (unifold(name));
SELECT * FROM people WHERE unifold(name) = unifold('John Doe');
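
To make the idea of folding more concrete, here is a deliberately
simplified C sketch. It is not how "unifold" is implemented: it only
lowercases ASCII letters and drops soft hyphens (U+00AD, encoded in UTF-8
as the bytes 0xC2 0xAD), whereas utf8proc applies full Unicode mappings.
The function name "simple_fold" is made up for this example.

```c
#include <stddef.h>

/* Hypothetical, simplified folding for illustration only (not utf8proc).
 * Copies src to dst, lowercasing ASCII letters and removing soft hyphens
 * (U+00AD = bytes 0xC2 0xAD). All other bytes are copied unchanged.
 * dst must be at least as large as src, including the terminating NUL.
 * Returns the length of the result in bytes. */
size_t simple_fold(const char *src, char *dst)
{
    const unsigned char *s = (const unsigned char *)src;
    size_t n = 0;

    while (*s != '\0') {
        if (s[0] == 0xC2 && s[1] == 0xAD) {
            s += 2;                              /* skip the soft hyphen */
        } else if (*s >= 'A' && *s <= 'Z') {
            dst[n++] = (char)(*s + ('a' - 'A')); /* ASCII lowercase */
            s++;
        } else {
            dst[n++] = (char)*s;                 /* copy everything else */
            s++;
        }
    }

    dst[n] = '\0';
    return n;
}
```

Folding both "bathtub" and "bath\302\255tub" with this sketch yields the
same bytes, so the two strings compare equal afterwards, which is the
effect the indexed unifold(name) expression relies on.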

The function "unistrip" removes character marks like accents or diaereses,
while "unifold" keeps them.

NOTICE: The output of these functions can change between releases, as
        utf8proc does not follow a versioning stability policy. You have to
        rebuild your database indices if you upgrade to a newer version
        of utf8proc.


*** TODO ***

- detect stable code points and process segments independently in order to
  save memory
- do a quick check before normalizing strings to optimize speed
- support stream processing


*** CONTACT ***

If you find any bugs or experience difficulties in compiling this software,
please contact us:

Project page: http://www.public-software-group.org/utf8proc