CESU-8

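CESU-8 (Compatibility Encoding Scheme for UTF-16: 8-Bit) is a variant of UTF-8 described in Unicode Technical Report #26. A code point is first represented in UTF-16, and each resulting 16-bit code unit is then encoded separately as UTF-8. Code points in the Basic Multilingual Plane therefore encode exactly as in UTF-8, while a supplementary code point becomes a surrogate pair encoded as two three-byte sequences (six bytes) rather than one four-byte sequence. CESU-8 is similar to Java's Modified UTF-8 but lacks that format's special encoding of the NUL character (U+0000). Like Modified UTF-8, it can be decoded one UTF-16 code unit at a time. Because NUL gets no special treatment, it is encoded as a single zero byte, so the encoded string is not safe for NUL-terminated string handling if the original string contained NUL characters.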
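The two-step encoding described above can be sketched in Python. This is a minimal illustration, not a reference implementation: the function name cesu8_encode is hypothetical, and the sketch relies on Python's "surrogatepass" error handler to emit the three-byte form of each surrogate code unit.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode a string as CESU-8 (hypothetical helper, for illustration)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # BMP code point: one UTF-16 code unit, same bytes as UTF-8.
            out += ch.encode("utf-8", "surrogatepass")
        else:
            # Supplementary code point: split into a UTF-16 surrogate pair,
            # then encode each surrogate as its own three-byte sequence.
            offset = cp - 0x10000
            high = 0xD800 | (offset >> 10)
            low = 0xDC00 | (offset & 0x3FF)
            for unit in (high, low):
                out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


# U+10400 is F0 90 90 80 in UTF-8 but six bytes in CESU-8.
print(cesu8_encode("\U00010400").hex(" "))  # ed a0 81 ed b0 80
# NUL is not special-cased: it stays a single zero byte.
print(cesu8_encode("\x00"))  # b'\x00'
```

For text that contains only Basic Multilingual Plane characters, the output of this sketch is byte-for-byte identical to ordinary UTF-8.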

In practice, CESU-8 is often used to communicate with the Oracle database software, which in modern configurations apparently uses UTF-16 as its internal character representation. Oracle's "UTF-8" codec (actually CESU-8) rejects proper UTF-8 sequences for characters outside the Basic Multilingual Plane, but accepts and generates technically invalid UTF-8 sequences for code points in the surrogate range (U+D800..U+DFFF), as specified by CESU-8.
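The accept/reject behaviour described above can be illustrated with a small decoder sketch. It is not Oracle's implementation; the function name cesu8_decode is hypothetical, and the choice to reject unpaired surrogates is an assumption of this sketch.

```python
def cesu8_decode(data: bytes) -> str:
    """Decode CESU-8 bytes to a string (hypothetical helper, for illustration)."""
    # Any byte >= 0xF0 is either the lead byte of a four-byte UTF-8
    # sequence or never valid at all; neither may appear in CESU-8.
    if any(b >= 0xF0 for b in data):
        raise ValueError("four-byte UTF-8 sequence is not valid CESU-8")
    # Decode each one- to three-byte sequence to one UTF-16 code unit,
    # letting encoded surrogates through instead of rejecting them.
    units = data.decode("utf-8", "surrogatepass")
    # Round-trip through UTF-16 to join surrogate pairs into single code
    # points; an unpaired surrogate raises UnicodeDecodeError here
    # (an assumption of this sketch).
    return units.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


# The six-byte surrogate form is accepted...
print(cesu8_decode(bytes.fromhex("eda081edb080")) == "\U00010400")  # True
# ...while the proper four-byte UTF-8 form of the same character is rejected.
try:
    cesu8_decode("\U00010400".encode("utf-8"))
except ValueError as exc:
    print("rejected:", exc)
```

A strict UTF-8 decoder behaves the opposite way on both inputs, which is why the two encodings interoperate only for Basic Multilingual Plane text.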

From Wikipedia, the Free Encyclopedia.
All text is available under the terms of the GNU Free Documentation License. See Wikipedia Copyrights for details.
