The Unicode HOWTO: Making your programs Unicode aware

6. Making your programs Unicode aware

6.1 C/C++

The C `char' type is 8-bit and will stay 8-bit because it denotes the smallest addressable data unit. Various facilities are available:

For normal text handling

The ISO/ANSI C standard contains, in an amendment which was added in 1995, a "wide character" type `wchar_t', a set of functions like those found in <string.h> and <ctype.h> (declared in <wchar.h> and <wctype.h>, respectively), and a set of conversion functions between `char *' and `wchar_t *' (declared in <stdlib.h>).

Good references for this API are

the GNU libc-2.1 manual, chapters 4 "Character Handling" and 6 "Character Set Handling",
the manual pages man-mbswcs.tar.gz, now contained in ftp://ftp.win.tue.nl/pub/linux-local/manpages/man-pages-1.29.tar.gz,
the OpenGroup's introduction http://www.unix-systems.org/version2/whatsnew/login_mse.html,
the OpenGroup's Single Unix specification http://www.UNIX-systems.org/online.html,
the ISO/IEC 9899:1999 (ISO C 99) standard. The latest draft before it was adopted is called n2794; you can find it at ftp://ftp.csn.net/DMK/sc22wg14/review/ or http://java-tutor.com/docs/c/,
Clive Feather's introduction http://www.lysator.liu.se/c/na1.html,
the Dinkumware C library reference http://www.dinkumware.com/htm_cl/.

Advantages of using this API:

It's a vendor independent standard.
The functions do the right thing, depending on the user's locale. All a program needs to call is setlocale(LC_ALL,"");.
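As a minimal sketch of how these pieces fit together, the following C fragment sets the locale and converts a multibyte string to a wide string. The helper names `init_locale' and `to_wide' are made up for this illustration, and calling mbstowcs() with a NULL destination to obtain the length is a POSIX extension rather than ISO C behaviour.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/* Call once at program start so that the conversion functions follow
   the user's locale.  Returns 1 on success, 0 on failure.
   (Hypothetical helper name.) */
int init_locale(void)
{
    return setlocale(LC_ALL, "") != NULL;
}

/* Hypothetical helper: converts the multibyte string SRC, encoded in the
   current locale's character set, into a newly allocated wide string.
   Returns NULL if SRC is not valid in that encoding or if memory runs
   out.  The caller frees the result. */
wchar_t *to_wide(const char *src)
{
    size_t n;
    wchar_t *dst;

    assert(src != NULL);
    n = mbstowcs(NULL, src, 0);      /* POSIX: only count wide characters */
    if (n == (size_t)-1)
        return NULL;                 /* invalid multibyte sequence */
    dst = malloc((n + 1) * sizeof *dst);
    if (dst == NULL)
        return NULL;
    mbstowcs(dst, src, n + 1);       /* convert, including the terminator */
    return dst;
}
```

A program would call init_locale() first and then to_wide() on each input string; without the setlocale call, the conversion runs in the "C" locale, which is only guaranteed to handle ASCII.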

Drawbacks of this API:

Some of the functions are not multithread-safe, because they keep a hidden internal state between function calls.
There is no first-class locale datatype. Therefore this API cannot reasonably be used for anything that needs more than one locale or character set at the same time.
OS support for this API is poor on most OSes (see the portability survey below).

Portability notes

A `wchar_t' may or may not be encoded in Unicode; this is platform and sometimes also locale dependent. A multibyte sequence `char *' may or may not be encoded in UTF-8; this is platform and sometimes also locale dependent.

In detail, here is what the Single Unix specification says about the `wchar_t' type: All wide-character codes in a given process consist of an equal number of bits. This is in contrast to characters, which can consist of a variable number of bytes. The byte or byte sequence that represents a character can also be represented as a wide-character code. Wide-character codes thus provide a uniform size for manipulating text data. A wide-character code having all bits zero is the null wide-character code, and terminates wide-character strings. The wide-character value for each member of the Portable Character Set (i.e. ASCII) will equal its value when used as the lone character in an integer character constant. Wide-character codes for other characters are locale- and implementation-dependent. State shift bytes do not have a wide-character code representation.

One particular consequence is that in portable programs you shouldn't use non-ASCII characters in string literals. That means, even though you know the Unicode double quotation marks have the codes U+201C and U+201D, you shouldn't write a string literal L"\u201cHello\u201d, he said" or "\xe2\x80\x9cHello\xe2\x80\x9d, he said" in C programs. Instead, use GNU gettext, write it as gettext("'Hello', he said"), and create a message database en.po which translates "'Hello', he said" to "\u201cHello\u201d, he said".
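A rough sketch of the gettext approach in C might look as follows. The text domain name "hello" and the catalog directory "/usr/share/locale" are placeholders chosen for this illustration, as are the function names; a real program would use its own package name and installation path.

```c
#include <assert.h>
#include <libintl.h>
#include <locale.h>

/* Selects the user's locale and the message catalog.
   "hello" and "/usr/share/locale" are placeholder values. */
void init_messages(void)
{
    setlocale(LC_ALL, "");
    bindtextdomain("hello", "/usr/share/locale");
    textdomain("hello");
}

/* Returns the message in the user's language.  The source code contains
   only ASCII; the typographic quotation marks come from the catalog
   (for example the one compiled from en.po), if one is installed.
   Without a matching catalog, gettext returns its argument unchanged. */
const char *greeting(void)
{
    const char *msg = gettext("'Hello', he said");

    assert(msg != NULL);
    return msg;
}
```

The design point is that the encoding of the non-ASCII characters is decided by the catalog and the locale at run time, not by the compiler.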

Here is a survey of the portability of the ISO/ANSI C facilities on various Unix flavours.

GNU glibc-2.2.x

<wchar.h> and <wctype.h> exist.
Has wcs/mbs functions, fgetwc/fputwc/wprintf, everything.
Has five UTF-8 locales.
mbrtowc works.

GNU glibc-2.0.x, glibc-2.1.x

<wchar.h> and <wctype.h> exist.
Has wcs/mbs functions, but no fgetwc/fputwc/wprintf.
No UTF-8 locale.
mbrtowc returns EILSEQ for bytes >= 0x80.

AIX 4.3

<wchar.h> and <wctype.h> exist.
Has wcs/mbs functions, fgetwc/fputwc/wprintf, everything.
Has many UTF-8 locales, one for every country.
Needs -D_XOPEN_SOURCE=500 in order to define mbstate_t.
mbrtowc works.

Solaris 2.7

<wchar.h> and <wctype.h> exist.
Has wcs/mbs functions, fgetwc/fputwc/wprintf, everything.
Has the following UTF-8 locales: en_US.UTF-8, de.UTF-8, es.UTF-8, fr.UTF-8, it.UTF-8, sv.UTF-8.
mbrtowc returns -1/EILSEQ (instead of -2) for bytes >= 0x80.

OSF/1 4.0d

<wchar.h> and <wctype.h> exist.
Has wcs/mbs functions, fgetwc/fputwc/wprintf, everything.
Has an add-on universal.utf8@ucs4 locale, see "man 5 unicode".
mbrtowc does not know about UTF-8.

Irix 6.5

<wchar.h> and <wctype.h> exist.
Has wcs/mbs functions and fgetwc/fputwc, but not wprintf.
Has no multibyte locales.
Has only a dummy definition for mbstate_t.
Doesn't have mbrtowc.

HP-UX 11.00

<wchar.h> exists, <wctype.h> does not.
Has wcs/mbs functions and fgetwc/fputwc, but not wprintf.
Has a C.utf8 locale.
Doesn't have mbstate_t.
Doesn't have mbrtowc.

As a consequence, I recommend using the restartable and multithread-safe wcsr/mbsr functions, forgetting about those systems which don't have them (Irix, HP-UX, AIX), and using the UTF-8 locale plug-in libutf8_plug.so (see below) on those systems which permit you to compile programs which use these wcsr/mbsr functions (Linux, Solaris, OSF/1).
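To illustrate what "restartable" means in practice, here is a small sketch that counts the characters of a multibyte string with mbrtowc(). The conversion state lives in a caller-owned mbstate_t instead of hidden static storage, which is what makes the function usable from several threads. The name `count_chars' is invented for this example.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: counts the characters in the first LEN bytes of S,
   interpreted in the current locale's encoding.  Returns (size_t)-1 on an
   invalid sequence and (size_t)-2 if S ends in the middle of a character. */
size_t count_chars(const char *s, size_t len)
{
    mbstate_t state;
    size_t count = 0;

    assert(s != NULL);
    memset(&state, 0, sizeof state);     /* initial conversion state */
    while (len > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, s, len, &state);

        if (n == (size_t)-1 || n == (size_t)-2)
            return n;                    /* invalid or incomplete input */
        if (n == 0)
            n = 1;                       /* NUL character; assumed to
                                            occupy a single byte */
        s += n;
        len -= n;
        count++;
    }
    return count;
}
```

On a platform where mbrtowc() is broken for bytes >= 0x80, as noted in the survey above, this function would report (size_t)-1 for any non-ASCII input.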

Similar advice, given by Sun in http://www.sun.com/software/white-papers/wp-unicode/, section "Internationalized Applications with Unicode", is:

To properly internationalize an application, use the following guidelines:

Avoid direct access with Unicode. This is a task of the platform's internationalization framework.
Use the POSIX model for multibyte and wide-character interfaces.
Only call the APIs that the internationalization framework provides for language and cultural-specific operations.
Remain code-set independent.

If, for some reason, in some piece of code, you really have to assume that `wchar_t' is Unicode (for example, if you want to do special treatment of some Unicode characters), you should make that piece of code conditional upon the result of is_locale_utf8(). Otherwise you will mess up your program's behaviour in different locales or other platforms. The function is_locale_utf8 is declared in utf8locale.h and defined in utf8locale.c.
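For readers who do not have utf8locale.c at hand, one possible way to write such a check is to ask the C library for the locale's codeset name via the POSIX function nl_langinfo(CODESET). This is only a sketch under that assumption; the actual is_locale_utf8 in utf8locale.c may work differently, and the function names below are invented.

```c
#include <assert.h>
#include <ctype.h>
#include <langinfo.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if NAME spells "UTF-8" or "utf8"
   (case-insensitive, hyphens ignored), 0 otherwise. */
int codeset_name_is_utf8(const char *name)
{
    static const char want[] = "UTF8";
    size_t i = 0;

    assert(name != NULL);
    for (; *name != '\0'; name++) {
        if (*name == '-')
            continue;                    /* accept "UTF-8" and "UTF8" */
        if (want[i] == '\0' || toupper((unsigned char)*name) != want[i])
            return 0;
        i++;
    }
    return want[i] == '\0';
}

/* Returns 1 if the current locale (as set by setlocale) uses UTF-8.
   Assumes a system that provides nl_langinfo(CODESET). */
int locale_codeset_is_utf8(void)
{
    return codeset_name_is_utf8(nl_langinfo(CODESET));
}
```

Code that treats `wchar_t' values as Unicode code points would then be wrapped in `if (locale_codeset_is_utf8()) { ... }', with a locale-neutral fallback in the else branch.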

The libutf8 library

A portable implementation of the ISO/ANSI C API, which supports 8-bit locales and UTF-8 locales, can be found in libutf8-0.7.3.tar.gz.

Advantages:

Unicode UTF-8 support now, portably, even on OSes whose multibyte character support does not work or which don't have multibyte/wide character support at all.
The same binary works in all OS supported 8-bit locales and in UTF-8 locales.
When an OS vendor adds proper multibyte character support, you can take advantage of it by simply recompiling without the -DHAVE_LIBUTF8 compiler option.

The Plan9 way

The Plan9 operating system, a variant of Unix, uses UTF-8 as character encoding in all applications. Its wide character type is called `Rune', not `wchar_t'. Parts of its libraries, written by Rob Pike and Howard Trickey, are available at ftp://ftp.cdrom.com/pub/netlib/research/9libs/9libs-1.0.tar.gz. Another similar library, written by Alistair G. Crooks, is ftp://ftp.cdrom.com/pub/NetBSD/packages/distfiles/libutf-2.10.tar.gz. In particular, each of these libraries contains a UTF-8 aware regular expression matcher.

Drawback of this API:

UTF-8 is compiled in, not optional. Programs compiled in this universe lose support for the 8-bit encodings which are still frequently used in Europe.

For graphical user interface

The Qt-2.0 library http://www.troll.no/ contains a fully-Unicode QString class. You can use the member functions QString::utf8 and QString::fromUtf8 to convert to/from UTF-8 encoded text. The QString::ascii and QString::latin1 member functions should not be used any more.

For advanced text handling

The previously mentioned libraries implement Unicode aware versions of the ASCII concepts. Here are libraries which deal with Unicode concepts, such as titlecase (a third letter case, different from uppercase and lowercase), distinction between punctuation and symbols, canonical decomposition, combining classes, canonical ordering and the like.

ucdata-2.4

The ucdata library by Mark Leisher http://crl.nmsu.edu/~mleisher/ucdata.html deals with character properties, case conversion, decomposition, combining classes. The companion package ure-0.5 http://crl.nmsu.edu/~mleisher/ure-0.5.tar.gz is a Unicode regular expression matcher.

ustring

The ustring C++ library by Rodrigo Reyes http://ustring.charabia.net/ deals with character properties, case conversion, decomposition, combining classes, and includes a Unicode regular expression matcher.

ICU

International Components for Unicode http://oss.software.ibm.com/icu/. IBM's very comprehensive internationalization library featuring Unicode strings, resource bundles, number formatters, date/time formatters, message formatters, collation and more. Lots of supported locales. Portable to Unix and Win32, but compiles out of the box only on Linux libc6, not libc5.

libunicode

The GNOME libunicode library http://cvs.gnome.org/lxr/source/libunicode/ by Tom Tromey and others. It covers character set conversion, character properties, decomposition.

For conversion

Three kinds of conversion libraries, which support UTF-8 and a large number of 8-bit character sets, are available:

iconv

The iconv implementation by Ulrich Drepper, contained in the GNU glibc-2.2. ftp://ftp.gnu.org/pub/gnu/glibc/glibc-2.2.tar.gz. The iconv manpages are now contained in ftp://ftp.win.tue.nl/pub/linux-local/manpages/man-pages-1.29.tar.gz.

The portable iconv implementation by Bruno Haible. ftp://ftp.ilog.fr/pub/Users/haible/gnu/libiconv-1.5.1.tar.gz

The portable iconv implementation by Konstantin Chuguev. <joy@urc.ac.ru> ftp://ftp.urc.ac.ru/pub/local/OS/Unix/converters/iconv-0.4.tar.gz

Advantages:

iconv is POSIX standardized, so programs using iconv to convert from/to UTF-8 will also run under Solaris. However, the names for the character sets differ between platforms. For example, "EUC-JP" under glibc is "eucJP" under HP-UX. (The official IANA name for this character set is "EUC-JP", so it's clearly an HP-UX deficiency.)
On glibc-2.1 systems, no additional library is needed. On other systems, one of the two other iconv implementations can be used.
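As a sketch of the iconv API, the following function converts a NUL-terminated string from one character set to another. The function name `convert_charset' is made up for this example; the character set names passed in are platform dependent, as noted above, and on systems without iconv in the C library the program must be linked with a separate iconv library.

```c
#include <assert.h>
#include <iconv.h>
#include <string.h>

/* Hypothetical helper: converts the NUL-terminated string IN from the
   character set FROM to the character set TO, writing at most OUTSIZE
   bytes to OUT (no terminator is appended).  Returns the number of bytes
   written, or (size_t)-1 on error. */
size_t convert_charset(const char *from, const char *to,
                       const char *in, char *out, size_t outsize)
{
    iconv_t cd;
    char *inp = (char *)in;              /* iconv does not modify the input */
    char *outp = out;
    size_t inleft, outleft = outsize;
    size_t rc;

    assert(from != NULL && to != NULL && in != NULL && out != NULL);
    inleft = strlen(in);
    cd = iconv_open(to, from);           /* note: target encoding first */
    if (cd == (iconv_t)-1)
        return (size_t)-1;               /* conversion not supported */
    rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    if (rc != (size_t)-1)                /* flush any pending shift state */
        rc = iconv(cd, NULL, NULL, &outp, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;               /* invalid input or OUT too small */
    return outsize - outleft;
}
```

A call such as convert_charset("ISO-8859-1", "UTF-8", text, buf, sizeof buf) would convert Latin-1 text to UTF-8 on platforms that know both names.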

librecode

librecode by François Pinard ftp://ftp.gnu.org/pub/gnu/recode/recode-3.6.tar.gz.

Advantages:

Support for transliteration, i.e. conversion of non-ASCII characters to sequences of ASCII characters in order to preserve readability by humans, even when a lossless transformation is impossible.

Drawbacks:

Non-standard API.
Slow initialization.

ICU

International Components for Unicode 1.7 http://oss.software.ibm.com/icu/. IBM's internationalization library also has conversion facilities, declared in `ucnv.h'.

Advantages:

Comprehensive set of supported encodings.

Drawbacks:

Non-standard API.

Other approaches

libutf-8

libutf-8 by G. Adam Stanislav <adam@whizkidtech.net> contains a few functions for on-the-fly conversion from/to UTF-8 encoded `FILE*' streams. http://www.whizkidtech.net/i18n/libutf-8-1.0.tar.gz

Advantages:

Very small.

Drawbacks:

Non-standard API.
UTF-8 is compiled in, not optional. Programs compiled with this library lose support for the 8-bit encodings which are still frequently used in Europe.
Installation is nontrivial: Makefile needs tweaking, not autoconfiguring.

6.2 Java

Java has Unicode support built into the language. The type `char' denotes a Unicode character, and the `java.lang.String' class denotes a string built up from Unicode characters.

Java can display any Unicode characters through its windowing system AWT, provided that 1. you set the Java system property "user.language" appropriately, 2. the /usr/lib/java/lib/font.properties.language font set definitions are appropriate, and 3. the fonts specified in that file are installed. For example, in order to display text containing Japanese characters, you would install Japanese fonts and run "java -Duser.language=ja ...". You can combine font sets: In order to display Western European, Greek and Japanese characters simultaneously, you would combine the files "font.properties" (covers ISO-8859-1), "font.properties.el" (covers ISO-8859-7) and "font.properties.ja" into a single file. (This is untested.)

The interfaces java.io.DataInput and java.io.DataOutput have methods called `readUTF' and `writeUTF' respectively. But note that they don't use standard UTF-8; they use a modified UTF-8 encoding: the NUL character is encoded as the two-byte sequence 0xC0 0x80 instead of 0x00, so the encoded bytes never contain 0x00. `writeUTF' prefixes the encoded string with a two-byte length field. Because no 0x00 byte occurs inside the encoded string, the same encoding can also be stored with a terminating 0x00 byte (as the Java Native Interface does) and then manipulated with the C <string.h> functions like strlen() and strcpy().
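The special treatment of NUL is easiest to see in code. The following C sketch encodes a single 16-bit Java `char' value in modified UTF-8; the function name is invented for this illustration, and characters outside the 16-bit range (which Java represents as two surrogate `char' values) are not handled specially here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: encodes one 16-bit Java char CH in modified UTF-8.
   OUT must have room for 3 bytes.  Returns the number of bytes written. */
size_t encode_modified_utf8(unsigned int ch, unsigned char *out)
{
    assert(ch <= 0xFFFF && out != NULL);
    if (ch >= 0x01 && ch <= 0x7F) {      /* ASCII except NUL: one byte */
        out[0] = (unsigned char)ch;
        return 1;
    }
    if (ch <= 0x7FF) {                   /* includes ch == 0 -> C0 80 */
        out[0] = (unsigned char)(0xC0 | (ch >> 6));
        out[1] = (unsigned char)(0x80 | (ch & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (ch >> 12));
    out[1] = (unsigned char)(0x80 | ((ch >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (ch & 0x3F));
    return 3;
}
```

For every value other than 0 the result is identical to standard UTF-8, which is why the two encodings are often confused.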

6.3 Lisp

The Common Lisp standard specifies two character types: `base-char' and `character'. It's up to the implementation to support Unicode or not. The language also specifies a keyword argument `:external-format' to `open', as the natural place to specify a character set or encoding.

Among the free Common Lisp implementations, only CLISP http://clisp.cons.org/ supports Unicode. You need a CLISP version from March 2000 or newer. ftp://clisp.cons.org/pub/lisp/clisp/source/clispsrc.tar.gz. The types `base-char' and `character' are both equivalent to 16-bit Unicode. The functions char-width and string-width provide an API comparable to wcwidth() and wcswidth(). The encoding used for file or socket/pipe I/O can be specified through the `:external-format' argument. The encodings used for tty I/O and the default encoding for file/socket/pipe I/O are locale dependent.

Among the commercial Common Lisp implementations:

LispWorks http://www.xanalys.com/software_tools/products/ supports Unicode. The type `base-char' is equivalent to ISO-8859-1, and the type `simple-char' (subtype of `character') contains all Unicode characters. The encoding used for file I/O can be specified through the `:external-format' argument, for example '(:UTF-8). Limitations: Encodings cannot be used for socket I/O. The editor cannot edit UTF-8 encoded files.

Eclipse http://www.elwood.com/eclipse/eclipse.htm supports Unicode. See http://www.elwood.com/eclipse/char.htm. The type `base-char' is equivalent to ISO-8859-1, and the type `character' contains all Unicode characters. The encoding used for file I/O can be specified through a combination of the `:element-type' and `:external-format' arguments to `open'. Limitations: Character attribute functions are locale dependent. Source and compiled source files cannot contain Unicode string literals.

The commercial Common Lisp implementation Allegro CL, in version 6.0, has Unicode support. The types `base-char' and `character' are both equivalent to 16-bit Unicode. The encoding used for file I/O can be specified through the `:external-format' argument, for example :external-format :utf8. The default encoding is locale dependent. More details are at http://www.franz.com/support/documentation/6.0/doc/iacl.htm.

6.4 Ada95

Ada95 was designed for Unicode support and the Ada95 standard library features special ISO 10646-1 data types Wide_Character and Wide_String, as well as numerous associated procedures and functions. The GNU Ada95 compiler (gnat-3.11 or newer) supports UTF-8 as the external encoding of wide characters. This allows you to use UTF-8 in both source code and application I/O. To activate it in the application, use "WCEM=8" in the FORM string when opening a file, and use compiler option "-gnatW8" if the source code is in UTF-8. See the GNAT (ftp://cs.nyu.edu/pub/gnat/) and Ada95 (ftp://ftp.cnam.fr/pub/Ada/PAL/userdocs/docadalt/rm95/index.htm) reference manuals for details.

6.5 Python

Python 2.0 (http://www.python.org/2.0/, http://www.python.org/pipermail/python-announce-list/2000-October/000889.html, http://starship.python.net/crew/amk/python/writing/new-python/new-python.html) contains Unicode support. It has a new fundamental data type `unicode', representing a Unicode string, a module `unicodedata' for the character properties, and a set of converters for the most important encodings. See http://starship.python.net/crew/lemburg/unicode-proposal.txt, or the file Misc/unicode.txt in the distribution, for details.

6.6 JavaScript/ECMAscript

Since JavaScript version 1.3, strings are always Unicode. There is no character type, but you can use the \uXXXX notation for Unicode characters inside strings. No normalization is done internally, so it expects to receive Unicode Normalization Form C, which the W3C recommends. See http://developer.netscape.com/docs/manuals/communicator/jsref/js13.html#Unicode for details and http://developer.netscape.com/docs/javascript/e262-pdf.pdf for the complete ECMAscript specification.

6.7 Tcl

Tcl/Tk started using Unicode as its base character set with version 8.1. Its internal representation for strings is UTF-8. It supports the \uXXXX notation for Unicode characters. See http://dev.scriptics.com/doc/howto/i18n.html.

6.8 Perl

Perl 5.6 stores strings internally in UTF-8 format if you write

use utf8;

at the beginning of your script. length() then returns the number of characters in a string. For details, see the Perl-i18n FAQ at http://rf.net/~james/perli18n.html.

Support for other (non-8-bit) encodings is available through the iconv interface module http://cpan.perl.org/modules/by-module/Text/Text-Iconv-1.1.tar.gz.

6.9 Related reading

Tomohiro Kubota has written an introduction to internationalization http://www.debian.org/doc/manuals/intro-i18n/. The emphasis of his document is on writing software that runs in any locale, using the locale's encoding.
