The Unicode® Standard
A Technical Introduction
This is intended as a concise source of information about the Unicode® Standard. It is neither a comprehensive definition of, nor a technical guide to, the Unicode Standard. The authoritative source of information is The Unicode Standard, Version 2.0, Addison Wesley Longman, 1996. The book may be ordered from the Consortium by using the publications order form. It should be used with The Unicode Standard, Version 2.1, which is available on this web site and provides the necessary updates and additions.
The Unicode Standard is the universal character encoding standard used for representation of text for computer processing. It is fully compatible with the International Standard ISO/IEC 10646-1:1993, and contains all the same characters and code points as ISO/IEC 10646. The Unicode Standard also provides additional information about the characters and their use. Any implementation that is conformant to Unicode is also conformant to ISO/IEC 10646.
Unicode provides a consistent way of encoding multilingual plain text and brings order to a chaotic state of affairs that has made it difficult to exchange text files internationally. Computer users who deal with multilingual text -- business people, linguists, researchers, scientists, and others -- will find that the Unicode Standard greatly simplifies their work. Mathematicians and technicians, who regularly use mathematical symbols and other technical characters, will also find the Unicode Standard valuable.
The design of Unicode is based on the simplicity and consistency of ASCII, but goes far beyond ASCII's limited ability to encode only the Latin alphabet. The Unicode Standard provides the capacity to encode all of the characters used for the written languages of the world. It uses a 16-bit encoding that provides code points for more than 65,000 characters. To keep character coding simple and efficient, the Unicode Standard assigns each character a unique 16-bit value, and does not use complex modes or escape codes.
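The size of the 16-bit code space, and the one-character-one-value principle, can be illustrated in a few lines of Python (a modern tool, used here purely for illustration):

```python
# A 16-bit value distinguishes 2**16 = 65,536 possible code points.
print(2 ** 16)                    # 65536

# Each character is one unique 16-bit value, with no modes or escape codes;
# a few letters from different scripts, shown as code values:
for ch in "A\u03C0\u0449\u05D0":  # Latin, Greek, Cyrillic, Hebrew letters
    print(f"U+{ord(ch):04X}")
```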
While 65,000 code points are sufficient for encoding the characters used in the major languages of the world, the Unicode Standard and ISO 10646 provide an extension mechanism called UTF-16 that allows for encoding as many as a million more characters, without use of escape codes. This is sufficient for all known character encoding requirements, including full coverage of all historic scripts of the world.
What Characters Does the Unicode Standard Include?
The Unicode Standard defines codes for characters used in the major languages written today. Scripts include Latin, Greek, Cyrillic, Armenian, Hebrew, Arabic, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Georgian, Tibetan, Japanese Kana, the complete set of modern Korean Hangul, and a unified set of Chinese/Japanese/Korean (CJK) ideographs. Many more scripts and characters are to be added shortly, including Ethiopic, Canadian Syllabics, Cherokee, additional rare ideographs, Sinhala, Syriac, Burmese, Khmer, and Braille.
The Unicode Standard also includes punctuation marks, diacritics, mathematical symbols, technical symbols, arrows, dingbats, etc. It provides codes for diacritics, modifying marks such as the tilde (~) that are used in conjunction with base characters to encode accented or vocalized letters (ñ, for example). In all, the Unicode Standard provides codes for nearly 39,000 characters from the world's alphabets, ideograph sets, and symbol collections.
There are about 18,000 unused code values for future expansion in the basic 16-bit encoding, plus provision for another 917,504 code values through the UTF-16 extension mechanism. The Unicode Standard also reserves 6,400 code values for private use, which software and hardware developers can assign internally for their own characters and symbols. UTF-16 makes another 131,072 private use code values available, should 6,400 be insufficient for particular applications.
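The figures above follow from simple arithmetic on the UTF-16 surrogate ranges; a small Python check (the range boundaries are the standard's, the script is only an illustration):

```python
# UTF-16 encodes each extended character as a pair of 16-bit units:
# a high surrogate (U+D800..U+DBFF) followed by a low surrogate (U+DC00..U+DFFF).
high_surrogates = 0xDBFF - 0xD800 + 1      # 1,024 values
low_surrogates = 0xDFFF - 0xDC00 + 1       # 1,024 values
extended = high_surrogates * low_surrogates
print(extended)                             # 1048576 extended code values

# Two full 65,536-value planes of that space are set aside for private use
# (the 131,072 figure above), leaving the rest for future standardization:
print(2 * 0x10000)                          # 131072
print(extended - 2 * 0x10000)               # 917504
```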
Character encoding standards define not only the identity of each character and its numeric value, or code position, but also how this value is represented in bits. The Unicode Standard endorses two forms that correspond to ISO 10646 transformation formats, UTF-8 and UTF-16.
The ISO/IEC 10646 transformation formats UTF-8 and UTF-16 are essentially ways of turning the encoding into the actual bits that are used in implementation. UTF-16 assumes 16-bit characters and sets aside a certain range of them as an extension mechanism, allowing access to an additional million characters using 16-bit character pairs. The Unicode Standard, Version 2.0, has adopted this transformation format as defined in ISO/IEC 10646.
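In concrete terms, an extended character is split into a surrogate pair of two 16-bit units. A short Python sketch (it uses U+10400, a character assigned after Version 2.0, purely as an example):

```python
# Encode one extended character as UTF-16 (big-endian) and inspect the
# two 16-bit units of the resulting surrogate pair.
ch = "\U00010400"                         # an example character beyond U+FFFF
units = ch.encode("utf-16-be")
high = int.from_bytes(units[0:2], "big")  # high surrogate, U+D800..U+DBFF
low = int.from_bytes(units[2:4], "big")   # low surrogate, U+DC00..U+DFFF
print(f"U+{high:04X} U+{low:04X}")        # U+D801 U+DC00
```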
The other transformation format is known as UTF-8. This is a way of transforming all Unicode characters into a variable length encoding of bytes. It has the advantages that the Unicode characters corresponding to the familiar ASCII set end up having the same byte values as ASCII, and that Unicode characters transformed into UTF-8 can be used with much existing software without extensive software rewrites. The Unicode Consortium also endorses the use of UTF-8 as a way of implementing the Unicode Standard. Any Unicode character expressed in the 16-bit UTF-16 form can be converted to the UTF-8 form and back without loss of information.
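Both properties, ASCII transparency and lossless round-tripping, are easy to observe; a brief Python illustration:

```python
# ASCII characters keep their familiar single-byte values under UTF-8.
assert "Unicode".encode("utf-8") == b"Unicode"

# Non-ASCII characters become multi-byte sequences, and the transformation
# is lossless in both directions.
text = "na\u00efve \u4e2d\u6587"          # "naïve 中文"
encoded = text.encode("utf-8")
assert encoded.decode("utf-8") == text
print(len(text), len(encoded))            # 8 characters, 13 bytes
```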
To make it possible to successfully encode, process, and interpret text, a character set must:
- define the smallest useful elements of text to be encoded;
- assign a unique code to each element; and,
- provide basic rules for encoding and interpreting text so that programs can successfully read and process text.
These requirements are the basis for the design of the Unicode Standard.
Defining Elements of Text
Written languages are represented by textual elements that are used to create words and sentences. These elements may be letters such as "w" or "M"; characters such as those used in Japanese Hiragana to represent syllables; or ideographs such as those used in Chinese to represent full words or concepts.
The definition of text elements often changes depending on the process handling the text. For example, in historic Spanish language sorting, "ll" counts as a single text element. However, when Spanish words are typed, "ll" is two separate text elements: "l" and "l".
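The sorting half of this example can be sketched in code. The following Python fragment is a naive illustration of historic Spanish collation, not a real collation algorithm (production software would use a locale-aware collation library):

```python
# A naive sketch of historic Spanish collation, where "ll" sorts as a single
# letter after every plain "l".
def spanish_key(word):
    key = []
    i = 0
    while i < len(word):
        if word[i:i + 2] == "ll":
            key.append("l\uffff")   # force "ll" after any plain "l" sequence
            i += 2
        else:
            key.append(word[i])
            i += 1
    return key

words = ["luz", "llave", "lodo"]
print(sorted(words, key=spanish_key))  # ['lodo', 'luz', 'llave']
```

A plain `sorted(words)` would put "llave" first; the custom key reproduces the traditional l < ll ordering.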
To avoid deciding what is and is not a text element in different processes, the Unicode Standard defines code elements (commonly called "characters"). A code element is the fundamental unit that is useful for computer text processing. For the most part, code elements correspond to the most commonly used text elements. In the case of the Spanish "ll", the Unicode Standard defines each "l" as a separate code element; the task of combining the two for alphabetic sorting is left to the software processing the text. As another example, each upper- and lowercase letter in the English alphabet is a single code element.
Text Processing
Computer text handling involves processing and encoding. Consider, for example, a word processor user typing text at a keyboard. The computer's system software receives a message that the user pressed a key combination for "T", which it encodes as U+0054. The word processor stores the number in memory, and also passes it on to the display software responsible for putting the character on the screen. The display software, which may be a window manager or part of the word processor itself, uses the number as an index to find an image of a "T", which it draws on the monitor screen. The process continues as the user types in more characters.
The Unicode Standard directly addresses only the encoding and semantics of text. It addresses no other action performed on the text. For example, the word processor may check the typist's input after it has been encoded to look for misspelled words, and then beep if it finds any. Or it may insert line breaks when it counts a certain number of characters entered since the last line break. An important principle of the Unicode Standard is that it does not specify how to carry out these processes as long as the character encoding and decoding is performed properly.
Interpreting Characters and Rendering Glyphs
The difference between identifying a code value and rendering it on screen or paper is crucial to understanding the Unicode Standard's role in text processing. The character identified by a Unicode code value is an abstract entity, such as "LATIN CAPITAL LETTER A" or "BENGALI DIGIT FIVE." The mark made on screen or paper -- called a glyph -- is a visual representation of the character.
The Unicode Standard does not define glyph images. The standard defines how characters are interpreted, not how glyphs are rendered. The software or hardware rendering engine of a computer is responsible for the appearance of the characters on the screen. The Unicode Standard does not specify the size, shape, or orientation of on-screen characters.
Creating Composite Characters
Text elements may be encoded as composed character sequences; in presentation, the multiple characters are rendered together. For example, "â" is a composite character created by rendering "a" and "^" together. A composed character sequence is typically made up of a base letter, which occupies a single space, and one or more non-spacing marks, which are rendered in the same space as the base letter.
The Unicode Standard specifies the order of characters used to create a composite character. The base character comes first, followed by one or more non-spacing marks. If a text element is encoded with more than one non-spacing mark, the order in which the non-spacing marks are stored is not significant as long as the marks do not interact typographically; if they do interact, order is significant. The Unicode Standard specifies how such interacting non-spacing marks are applied to a base character.
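The non-interacting case can be observed with Python's unicodedata module (a modern tool, used here only as an illustration): two marks that attach to different sides of the base letter normalize to the same character regardless of storage order.

```python
import unicodedata

# COMBINING DOT BELOW (U+0323) attaches below the base letter, while
# COMBINING CIRCUMFLEX ACCENT (U+0302) attaches above it; they do not
# interact typographically, so either storage order denotes the same text.
s1 = "a\u0323\u0302"
s2 = "a\u0302\u0323"
print(unicodedata.normalize("NFC", s1) == unicodedata.normalize("NFC", s2))  # True
```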
Precomposed characters are another option for some composite characters. Each precomposed character is represented by a single code value rather than two or more code values which may combine during rendering. For example, the character "ü" can be encoded as the single code value U+00FC "ü" or as the base character U+0075 "u" followed by the non-spacing character U+0308 "¨". The Unicode Standard offers precomposed characters to retain compatibility with established standards such as Latin 1, which includes many precomposed characters such as "ü" and "ñ".
Precomposed characters may be decomposed for consistency or analysis. For example, a word processor importing a text file containing the precomposed character "ü" may decompose that character into a "u" followed by the non-spacing character "¨". Once the character has been decomposed, it may be easier for the word processor to work with the character because the word processor can now easily recognize the character as a "u" with modifications. This allows easy alphabetical sorting for languages where character modifiers do not affect alphabetical order. The Unicode Standard defines decomposition for all precomposed characters.
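A decomposition of this kind can be demonstrated with Python's unicodedata module (again a modern illustration; the decompositions themselves are defined by the standard):

```python
import unicodedata

precomposed = "\u00FC"                                  # "ü" as one code value
decomposed = unicodedata.normalize("NFD", precomposed)  # canonical decomposition
print([f"U+{ord(c):04X}" for c in decomposed])          # ['U+0075', 'U+0308']

# The decomposed form is easy to analyze as "u" plus a modifier, and the
# round trip back to the precomposed form loses nothing.
assert decomposed[0] == "u"
assert unicodedata.normalize("NFC", decomposed) == precomposed
```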
Principles of the Unicode Standard
The Unicode Standard was created by a team of computer professionals, linguists, and scholars to become a worldwide character standard, one easily used for text encoding everywhere. To that end, the Unicode Standard follows a set of fundamental principles:
- Logical order
- Full encoding
- Characters, not glyphs
- Dynamic composition
- Equivalent sequences
- Plain text
Duplicate encoding of characters is avoided by unifying characters within scripts across languages; characters that are equivalent in form are given a single code. Chinese/Japanese/Korean (CJK) consolidation is achieved by assigning a single code to each ideograph that is common to more than one of these languages, rather than a separate code each time the ideograph appears in a different language. (These three languages share many thousands of identical characters because their ideograph sets evolved from the same source.)
Characters are stored in logical order. The Unicode Standard includes characters to specify changes in direction when scripts of different directionality are mixed. For all scripts, Unicode text is in logical order within the memory representation, corresponding to the order in which text is typed on the keyboard. The Unicode Standard specifies an algorithm for the presentation of text of opposite directionality, for example, Arabic and English, as well as for runs of mixed-directionality text.
Assigning Character Codes
A single 16-bit number is assigned to each code element defined by the Unicode Standard. Each of these 16-bit numbers is called a code value and, when referred to in text, is listed in hexadecimal form following the prefix "U+". For example, the code value U+0041 is the hexadecimal number 0041 (equal to the decimal number 65). It represents the character "A" in the Unicode Standard.
Each character is also assigned a unique name that specifies it and no other. For example, U+0041 is assigned the character name "LATIN CAPITAL LETTER A." U+0A1B is assigned the character name "GURMUKHI LETTER CHA." These Unicode names are identical to the ISO/IEC 10646 names for the same characters.
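Both the code values and the character names can be inspected programmatically; a small Python example using the characters discussed above:

```python
import unicodedata

ch = "A"
print(f"U+{ord(ch):04X}")             # U+0041
print(ord(ch))                        # 65
print(unicodedata.name(ch))           # LATIN CAPITAL LETTER A
print(unicodedata.name("\u0A1B"))     # GURMUKHI LETTER CHA
```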
The Unicode Standard groups characters together by scripts in code blocks. A script is any system of related characters. The standard retains the order of characters in a source set where possible. When the characters of a script are traditionally arranged in a certain order -- alphabetic order, for example -- the Unicode Standard arranges them in its code space using the same order whenever possible. Code blocks vary greatly in size. For example, the Cyrillic code block does not exceed 256 code values, while the CJK code block has a range of thousands of code values.
Code elements are grouped logically throughout the range of code values, called the codespace. The coding starts at U+0000 with the standard ASCII characters, and continues with Greek, Cyrillic, Hebrew, Arabic, Indic and other scripts, followed by symbols and punctuation. The code space continues with Hiragana, Katakana, and Bopomofo. The unified Han ideographs are followed by the complete set of modern Hangul. The surrogate range of code values is reserved for future expansion with UTF-16. Towards the end of the codespace is a range of code values reserved for private use, followed by a range of compatibility characters. The compatibility characters are character variants that are encoded only to enable transcoding to earlier standards and to old implementations that made use of them.
A range of code values is reserved as user space. These code values have no universal meaning, and may be used for characters specific to a program or by a group of users for their own purposes. For example, a group of choreographers may design a set of characters for dance notation and encode the characters using code values in user space. A set of page-layout programs may use the same code values as control codes to position text on the page. The main point of user space is that the Unicode Standard assigns no meaning to these code values, and reserves them as user space, promising never to assign them meaning in the future.
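Modern libraries expose this reservation directly; for example, Python's unicodedata module (used here only as an illustration) reports the general category "Co" (private use) for these code values and no character name at all:

```python
import unicodedata

# U+E000 is the first code value of the basic private use range.
pua_char = "\uE000"
print(unicodedata.category(pua_char))         # Co (private use)
print(unicodedata.name(pua_char, "no name"))  # no name -- the standard assigns none
```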
Conformance to the Unicode Standard
The Unicode Standard specifies unambiguous requirements for conformance in terms of the principles and encoding architecture it embodies. At a minimum, a conforming implementation treats characters as 16-bit units and interprets them according to the semantics the standard assigns.
UTF-8 implementations of the Unicode Standard are conformant as long as they treat each UTF-8 encoding of a Unicode character (sequence of bytes) as if it were the corresponding 16-bit unit and otherwise interpret characters according to the Unicode specification. The full conformance requirements are available within The Unicode Standard, Version 2.0, Addison Wesley Longman, 1996.
Unicode and ISO/IEC 10646
The Unicode Standard is very closely aligned with the international standard ISO/IEC 10646-1:1993 (also known as the Universal Character Set, or UCS, for short). In 1991 a formal convergence of the two standards was negotiated between the Unicode Technical Committee and ISO/IEC JTC1/SC2/WG2, the ISO working group responsible for ISO/IEC 10646. Since that time, close cooperation and formal liaison between the committees have ensured that all additions to either standard are coordinated and kept in sync, so that the two standards maintain exactly the same character repertoire and encoding.
Version 2.0 of the Unicode Standard is code-for-code identical to ISO/IEC 10646-1:1993, plus its first seven published amendments. This code-for-code identity is true for all encoded characters in the two standards, including the East Asian (Han) ideographic characters.
The international standard ISO/IEC 10646 allows for two forms of use: a two-octet (byte) form known as UCS-2 and a four-octet form known as UCS-4. The Unicode Standard, as a profile of ISO/IEC 10646, chooses the two-octet form, which is equivalent to saying that characters are represented in 16 bits per character. When extended characters are used, Unicode is equivalent to UTF-16.
The fundamental source of information about Unicode is The Unicode Standard, Version 2.0, published by Addison Wesley Longman, 1996. It should be used with The Unicode Standard, Version 2.1, which is available on this web site and provides the necessary updates and additions. The book comes with a CD-ROM that contains character names and properties, as well as tables for mapping and transcoding. The Unicode Standard, Version 2.0 may be ordered from the Unicode Consortium by using the Publications Order Form. The Unicode Standard Updates and Errata are posted on this web site.