ISO 14651

ISO/IEC 14651:2016, Information technology -- International string ordering and comparison -- Method for comparing character strings and description of the common template tailorable ordering, is an ISO standard specifying an algorithm for comparing two strings. Such comparisons can be used to collate a set of strings. The standard also defines a datafile, the Common Tailorable Template (CTT), that specifies the comparison order. The comparison order is intended to be tailored for different languages, since different languages have incompatible ordering requirements; hence the CTT is regarded as a template rather than a default, though the empty tailoring, which changes no weighting, is appropriate in many cases. One such tailoring is the European ordering rules (EOR), which in turn are intended to be tailored further for individual European languages.

The Common Tailorable Template (CTT) datafile of this ISO standard is aligned with the Default Unicode Collation Element Table (DUCET) datafile of the Unicode Collation Algorithm (UCA) specified in Unicode Technical Standard #10.

The fourth edition of the standard was published on 2016-02-15, corrected on 2016-05-01, and covers Unicode up to and including version 8.0. An amendment, Amd.1:2017, was published in September 2017 and extends coverage to Unicode 9.0.

Arabic letter mark

The Arabic letter mark (ALM) is a non-printing character used in the computerized typesetting of bi-directional text containing mixed left-to-right scripts (such as Latin and Cyrillic) and right-to-left scripts (such as Persian, Arabic, Syriac and Hebrew).

Similar to the right-to-left mark (RLM), it is used to change the way adjacent characters are grouped with respect to text direction, with some differences in how it affects the bidirectional level resolution of nearby characters.
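
The difference between these marks is visible in their Unicode bidirectional classes, which can be inspected with Python's standard unicodedata module (a minimal sketch):

    import unicodedata

    # Each directional mark is invisible but carries a different
    # Bidi_Class, which is what drives the level resolution nearby.
    for name, ch in [("LRM", "\u200E"), ("RLM", "\u200F"), ("ALM", "\u061C")]:
        print(name, unicodedata.bidirectional(ch))
    # LRM L
    # RLM R
    # ALM AL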

Binary Ordered Compression for Unicode

Binary Ordered Compression for Unicode (BOCU) is a MIME-compatible Unicode compression scheme. BOCU-1 combines the wide applicability of UTF-8 with the compactness of the Standard Compression Scheme for Unicode (SCSU). This Unicode encoding is designed to be useful for compressing short strings, and it maintains code point order. BOCU-1 is specified in a Unicode Technical Note.

For comparison, SCSU was adopted as the standard Unicode compression scheme, with a byte-per-code-point ratio similar to language-specific code pages. SCSU has not been widely adopted, as it is not suitable for MIME "text" media types; for example, SCSU cannot be used directly in emails and similar protocols. SCSU also requires a complicated encoder design for good performance. Usually, zip, bzip2, and other industry-standard algorithms compress larger amounts of Unicode text more efficiently.

Both SCSU and BOCU-1 are IANA-registered charsets.
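
As a rough illustration of the idea BOCU-1 builds on (this is not the actual BOCU-1 algorithm, which also resets its state and remaps bytes for MIME safety), each code point can be encoded as a difference from its predecessor, so that text within a single script produces small values:

    # Toy difference coder -- illustrative only, not BOCU-1 itself.
    def toy_deltas(text: str) -> list[int]:
        prev = 0x40  # arbitrary initial state for the illustration
        deltas = []
        for ch in text:
            cp = ord(ch)
            deltas.append(cp - prev)
            prev = cp
        return deltas

    # After the first character of a Cyrillic word, the deltas stay small.
    print(toy_deltas("привет"))  # [1023, 1, -8, -6, 3, 13]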

Combining Grapheme Joiner

The combining grapheme joiner (CGJ), U+034F COMBINING GRAPHEME JOINER (HTML &#847;), is a Unicode character that has no visible glyph and is "default ignorable" by applications. Its name is a misnomer and does not describe its function: the character does not join graphemes. Its purpose is to separate characters that should not be considered a digraph.

For example, in a Hungarian language context, adjoining characters c and s would normally be considered equivalent to the cs digraph. If they are separated by the CGJ, they will be considered as two separate graphemes.

It is also needed for complex scripts. For example, the Hebrew cantillation accent Metheg is normally supposed to appear to the left of the vowel point, and by default most display systems will render it there even if it is typed before the vowel. But in some words in Biblical Hebrew the Metheg appears to the right of the vowel, and to tell the display engine to render it properly on the right, a CGJ must be typed between the Metheg and the vowel.

In the case of several consecutive combining diacritics, an intervening CGJ indicates that they should not be subject to canonical reordering.

Compare the zero-width non-joiner at U+200C, in the General Punctuation range, which prevents two adjacent characters from turning into a ligature.
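
The reordering-blocking behaviour can be observed with Python's standard unicodedata module (a minimal sketch; the combining marks chosen here are arbitrary):

    import unicodedata

    CGJ = "\u034F"  # COMBINING GRAPHEME JOINER, combining class 0

    # Canonical normalization reorders combining marks by combining class:
    s = "a\u0315\u0300"  # comma above right (ccc 232), grave accent (ccc 230)
    print([hex(ord(c)) for c in unicodedata.normalize("NFD", s)])
    # ['0x61', '0x300', '0x315'] -- the two marks were swapped

    # An intervening CGJ blocks the reordering:
    t = "a\u0315" + CGJ + "\u0300"
    print([hex(ord(c)) for c in unicodedata.normalize("NFD", t)])
    # ['0x61', '0x315', '0x34f', '0x300'] -- original order preserved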

Common Locale Data Repository

The Common Locale Data Repository Project, often abbreviated as CLDR, is a project of the Unicode Consortium to provide locale data in the XML format for use in computer applications. CLDR contains locale-specific information that an operating system will typically provide to applications. CLDR is written in LDML (Locale Data Markup Language). The information is currently used in International Components for Unicode, Apple's macOS, LibreOffice, MediaWiki, and IBM's AIX, among other applications and operating systems.

Among the types of data that CLDR includes are the following:

Translations for language names.

Translations for territory and country names.

Translations for currency names, including singular/plural modifications.

Translations for weekday, month, era, period of day, in full and abbreviated forms.

Translations for timezones and example cities (or similar) for timezones.

Translations for calendar fields.

Patterns for formatting/parsing dates or times of day.

Exemplar sets of characters used for writing the language.

Patterns for formatting/parsing numbers.

Rules for language-adapted collation.

Rules for formatting numbers in traditional numeral systems (like Roman numerals, Armenian numerals, …).

Rules for spelling out numbers as words.

Rules for transliteration between scripts. Much of this is based on BGN/PCGN romanization.

CLDR overlaps somewhat with ISO/IEC 15897 (POSIX locales). POSIX locale information can be derived from CLDR by using some of CLDR's conversion tools.
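
As a brief sketch of what consuming this data looks like, the Python library Babel (an assumption here; Babel ships a compiled copy of CLDR and is not part of CLDR itself) exposes several of the categories above:

    from babel import Locale
    from babel.numbers import format_decimal

    de = Locale("de")
    print(de.territories["US"])            # Vereinigte Staaten
    print(de.languages["en"])              # Englisch
    print(de.months["format"]["wide"][1])  # Januar

    # Number formatting patterns are CLDR data too:
    print(format_decimal(1234567.89, locale="de"))  # 1.234.567,89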

CLDR is maintained by the CLDR technical committee, which includes employees from IBM, Apple, Google, Microsoft, and some government-based organizations. The committee is currently chaired by John Emmons (IBM), with Mark Davis (Google) as vice-chair.

ConScript Unicode Registry

The ConScript Unicode Registry is a volunteer project to coordinate the assignment of code points in the Unicode Private Use Area for the encoding of artificial scripts including those for constructed languages. It was founded by John Cowan and is maintained by him and Michael Everson but has not been updated in several years. It has no formal connection with the Unicode Consortium.

The Under-ConScript Unicode Registry (UCSUR) is a clone of the CSUR that acts as a holding area for new scripts until they can be added to the dormant CSUR. It is run by Rebecca Bettencourt.

Enclosed Alphanumerics

Enclosed Alphanumerics is a Unicode block of typographic symbols consisting of alphanumeric characters enclosed in a circle or bracket (or another unclosed enclosure) or followed by a full stop. A further block of such characters (U+1F100–U+1F1FF), the Enclosed Alphanumeric Supplement, is encoded in the Supplementary Multilingual Plane and, as of Unicode 6.0, contains the set of Regional Indicator Symbols.
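
A few sample characters illustrate the circled, parenthesized and full-stop forms:

    # Sample characters from the Enclosed Alphanumerics block.
    print(chr(0x2460))  # ① CIRCLED DIGIT ONE
    print(chr(0x24B6))  # Ⓐ CIRCLED LATIN CAPITAL LETTER A
    print(chr(0x2474))  # ⑴ PARENTHESIZED DIGIT ONE
    print(chr(0x2488))  # ⒈ DIGIT ONE FULL STOP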

Ideographic Rapporteur Group

The Ideographic Rapporteur Group (IRG) is a subgroup of the ISO/IEC JTC 1/SC 2 working group WG2.

International Ideographs Core

International Ideographs Core (IICore) is a subset of up to ten thousand CJK Unified Ideographs characters, which can be implemented on devices with limited memories and capability that make it not feasible to implement the full ISO 10646/Unicode standard.

Left-to-right mark

The left-to-right mark (LRM) is a control character (an invisible formatting character) used in computerized typesetting (including word processing in a program like Microsoft Word) of text that contains a mixture of left-to-right text (such as English or Russian) and right-to-left text (such as Arabic, Persian or Hebrew). It is used to set the way adjacent characters are grouped with respect to text direction.
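
The effect can be sketched with the python-bidi package, a Python implementation of the bidirectional algorithm (its use here is an assumption; the exact visual result depends on the renderer):

    from bidi.algorithm import get_display

    LRM = "\u200E"
    hebrew = "\u05E9\u05DC\u05D5\u05DD"  # the Hebrew word "shalom"

    # In a right-to-left paragraph, the '!' after the Latin word "abc"
    # resolves to the paragraph direction and detaches from "abc" ...
    print(get_display("abc!" + hebrew, base_dir="R"))

    # ... unless an LRM after it supplies a left-to-right anchor.
    print(get_display("abc!" + LRM + hebrew, base_dir="R"))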

List of precomposed Latin characters in Unicode

This is a list of precomposed Latin characters in Unicode. Unicode typefaces (e.g. Fixedsys Excelsior) may be needed for these to display correctly.

Precomposed character

A precomposed character (alternatively composite character or decomposable character) is a Unicode entity that can also be defined as a sequence of one or more other characters. A precomposed character may typically represent a letter with a diacritical mark, such as é (Latin small letter e with acute accent). Technically, é (U+00E9) is a character that can be decomposed into an equivalent string of the base letter e (U+0065) and combining acute accent (U+0301). Similarly, ligatures are precompositions of their constituent letters or graphemes.
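
This decomposition can be demonstrated with Python's standard unicodedata module (a minimal sketch):

    import unicodedata

    composed = "\u00E9"  # é as a single precomposed code point
    decomposed = unicodedata.normalize("NFD", composed)
    print([hex(ord(c)) for c in decomposed])  # ['0x65', '0x301']

    # Raw code point comparison sees two different strings; normalizing
    # both to the same form makes the canonical equivalence usable.
    print(composed == decomposed)                                # False
    print(unicodedata.normalize("NFC", decomposed) == composed)  # True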

Precomposed characters are the legacy solution for representing many special letters in various character sets. In Unicode they are included primarily to aid computer systems with incomplete Unicode support, where equivalent decomposed characters may render incorrectly.

Right-to-left mark

The right-to-left mark (RLM) is a non-printing character used in the computerized typesetting of bi-directional text containing mixed left-to-right scripts (such as Latin and Cyrillic) and right-to-left scripts (such as Persian, Arabic, Urdu, Syriac and Hebrew).

RLM is used to change the way adjacent characters are grouped with respect to text direction. However, for the Arabic script, the Arabic letter mark may be a better choice.

UTF-1

UTF-1 is one way of transforming ISO 10646/Unicode into a stream of bytes. Its design does not provide self-synchronization, which makes searching for substrings and error recovery difficult. It reuses the ASCII printing characters for multi-byte encodings, making it unsuited for some uses (for instance Unix filenames cannot contain the byte value used for forward slash). UTF-1 is also slow to encode or decode due to its use of division and multiplication by a number which is not a power of 2. Due to these issues, it did not gain acceptance and was quickly replaced by UTF-8.
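
To illustrate the arithmetic involved (this is not a UTF-1 codec; the standard's byte transformation and range offsets are omitted), UTF-1's multi-byte forms carry the code point as base-190 digits, and extracting those digits requires division by 190, a non-power-of-two:

    # Illustrative sketch only: splitting a value into base-190 digits,
    # the kind of arithmetic UTF-1 needs (unlike UTF-8's bit shifts).
    def base190_digits(n: int) -> list[int]:
        digits = []
        while True:
            n, d = divmod(n, 190)  # division by a non-power-of-two
            digits.append(d)
            if n == 0:
                return digits[::-1]

    print(base190_digits(0x10FFFF))  # [30, 163, 141]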

Unicode and HTML for the Hebrew alphabet

The Hebrew alphabet is encoded in Unicode and can be referenced in HTML. The Unicode Hebrew block extends from U+0590 to U+05FF and from U+FB1D to U+FB4F; it includes letters, ligatures, combining diacritical marks (niqqud and cantillation marks) and punctuation. Numeric character references are provided for HTML; these can be used in many markup languages, and they are often used on web pages to create Hebrew glyphs that display correctly in the majority of web browsers.
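
As a small sketch, a hexadecimal numeric character reference for any Hebrew letter can be generated directly from its code point (the helper name here is made up for illustration):

    # Hypothetical helper: build a hexadecimal NCR from a character.
    def ncr(ch: str) -> str:
        return f"&#x{ord(ch):04X};"

    for ch in "\u05E9\u05DC\u05D5\u05DD":  # the word "shalom"
        print(ch, ncr(ch))
    # ש &#x05E9;
    # ל &#x05DC;
    # ו &#x05D5;
    # ם &#x05DD;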

Unicode collation algorithm

The Unicode collation algorithm (UCA) is an algorithm defined in Unicode Technical Standard #10, which specifies a customizable method of comparing two strings. These comparisons can then be used to collate or sort text in any writing system and language that can be represented with Unicode.

Unicode Technical Standard #10 also specifies the Default Unicode Collation Element Table (DUCET), a datafile that specifies the default collation ordering. The DUCET is customizable for different languages, and some such customizations can be found in the Common Locale Data Repository (CLDR).

An important open-source implementation of the UCA is included in International Components for Unicode (ICU). ICU supports tailoring, and the collation tailorings from CLDR are included in ICU. The effects of tailoring, and a large number of language-specific tailorings, can be seen in the online ICU Locale Explorer.
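
A short sketch of tailored UCA collation through ICU, assuming the PyICU bindings are installed:

    from icu import Collator, Locale

    words = ["zebra", "öl", "ångest", "apple"]

    # Swedish tailoring sorts å, ä, ö after z ...
    sv = Collator.createInstance(Locale("sv_SE"))
    print(sorted(words, key=sv.getSortKey))
    # ['apple', 'zebra', 'ångest', 'öl']

    # ... while German treats them as accented variants of a and o.
    de = Collator.createInstance(Locale("de_DE"))
    print(sorted(words, key=de.getSortKey))
    # ['ångest', 'apple', 'öl', 'zebra']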

Unicode equivalence

Unicode equivalence is the specification by the Unicode character encoding standard that some sequences of code points represent essentially the same character. This feature was introduced in the standard to allow compatibility with preexisting standard character sets, which often included similar or identical characters.

Unicode provides two such notions, canonical equivalence and compatibility. Code point sequences that are defined as canonically equivalent are assumed to have the same appearance and meaning when printed or displayed. For example, the code point U+006E (the Latin lowercase "n") followed by U+0303 (the combining tilde "◌̃") is defined by Unicode to be canonically equivalent to the single code point U+00F1 (the lowercase letter "ñ" of the Spanish alphabet). Therefore, those sequences should be displayed in the same manner, should be treated in the same way by applications such as alphabetizing names or searching, and may be substituted for each other. Similarly, each Hangul syllable block that is encoded as a single character may be equivalently encoded as a combination of a leading conjoining jamo, a vowel conjoining jamo, and, if appropriate, a trailing conjoining jamo.

Sequences that are defined as compatible are assumed to have possibly distinct appearances but the same meaning in some contexts. Thus, for example, the code point U+FB00 (the typographic ligature "ff") is defined to be compatible with, but not canonically equivalent to, the sequence U+0066 U+0066 (two Latin "f" letters). Compatible sequences may be treated the same way in some applications (such as sorting and indexing), but not in others, and may be substituted for each other in some situations but not in others. Sequences that are canonically equivalent are also compatible, but the opposite is not necessarily true.

The standard also defines a text normalization procedure, called Unicode normalization, that replaces equivalent sequences of characters so that any two texts that are equivalent will be reduced to the same sequence of code points, called the normalization form or normal form of the original text. For each of the two equivalence notions, Unicode defines two normal forms, one fully composed (where multiple code points are replaced by single points whenever possible), and one fully decomposed (where single points are split into multiple ones). Each of these four normal forms can be used in text processing.
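
The distinction between the forms can be observed with Python's standard unicodedata module (a brief sketch using the examples above):

    import unicodedata

    composed = "\u00F1"     # ñ as one code point
    decomposed = "n\u0303"  # n + combining tilde
    ligature = "\uFB00"     # typographic ligature ff

    # Canonical equivalence: NFC and NFD convert between the two forms.
    print(unicodedata.normalize("NFC", decomposed) == composed)  # True
    print(unicodedata.normalize("NFD", composed) == decomposed)  # True

    # Compatibility equivalence: only the K forms fold the ligature.
    print(unicodedata.normalize("NFC", ligature))   # 'ﬀ' (unchanged)
    print(unicodedata.normalize("NFKC", ligature))  # 'ff' (two letters)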

Word joiner

The word joiner (WJ) is a code point in Unicode used to indicate that word separation should not occur at a given position, for instance in scripts that do not use explicit spacing. It has been encoded since Unicode version 3.2 (released in 2002) as U+2060. The word joiner does not produce any space and prohibits a line break at its position.

The word joiner replaces the zero width no-break space (ZWNBSP), a deprecated use of the Unicode character at code point U+FEFF. Character U+FEFF is intended for use as a Byte Order Mark (BOM) at the start of a file. However, if encountered elsewhere, it should, according to Unicode, be treated as a "zero width no-break space". The dedicated use of U+FEFF for this purpose is deprecated as of Unicode 3.2, with the word joiner strongly preferred.
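
A sketch of that recommendation in Python (the function name is made up for illustration): keep a leading U+FEFF, which serves as a byte order mark, and replace any other occurrence with the word joiner:

    # Hypothetical helper: preserve a leading BOM, replace interior
    # U+FEFF (deprecated as zero width no-break space) with U+2060.
    def modernize_zwnbsp(text: str) -> str:
        if text.startswith("\uFEFF"):
            return "\uFEFF" + text[1:].replace("\uFEFF", "\u2060")
        return text.replace("\uFEFF", "\u2060")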

Z-variant

In Unicode, two glyphs are said to be Z-variants (often spelled zVariants) if they share the same etymology but have slightly different appearances and different Unicode code points. For example, the Unicode characters U+8AAA 說 and U+8AAC 説 are Z-variants. The notion of Z-variance is only applicable to the “CJKV scripts” — Chinese, Japanese, Korean and Vietnamese — and is a subtopic of Han unification.

Zero-width joiner

The zero-width joiner (ZWJ) is a non-printing character used in the computerized typesetting of some complex scripts such as the Arabic script or any Indic script. When placed between two characters that would otherwise not be connected, a ZWJ causes them to be printed in their connected forms.

In some cases, such as in certain Devanagari sequences, the ZWJ follows the second rather than the first character.

When a ZWJ is placed between two emoji characters, it can also result in a new form being shown, such as the family emoji, made up of two adult emoji and one or two child emoji.

The character's code point is U+200D ZERO WIDTH JOINER (HTML &#8205; · &zwj;). In the InScript keyboard layout for Indian languages, it is typed with the key combination Ctrl+Shift+1; however, many layouts use the ']' key for this character.
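
For instance, the family emoji mentioned above is a plain concatenation of person emoji with ZWJs between them (rendering as a single glyph depends on font support):

    ZWJ = "\u200D"

    # MAN + ZWJ + WOMAN + ZWJ + BOY: one visible family glyph where
    # supported, otherwise three separate emoji.
    family = "\U0001F468" + ZWJ + "\U0001F469" + ZWJ + "\U0001F466"
    print(family)
    print(len(family))  # 5 code points behind one glyph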
