Many of the character sets included as ISO/IEC 2022 encodings are 'double byte' encodings where two bytes correspond to a single character. This makes ISO-2022 a variable width encoding. But a specific implementation does not have to implement all of the standard; the conformance level and the supported character sets are defined by the implementation.
|Standard||ISO 2022, ECMA 35, JIS X 0202|
|Transforms / Encodes||US-ASCII and, depending on implementation:|
|Succeeded by||ISO 10646 (Unicode)|
Many languages or language families not based on the Latin alphabet such as Greek, Cyrillic, Arabic, or Hebrew have historically been represented on computers with different 8-bit extended ASCII encodings. Written East Asian languages, specifically Chinese, Japanese, and Korean, use far more characters than can be represented in an 8-bit computer byte and were first represented on computers with language-specific double byte encodings.
ISO/IEC 2022 was developed as a technique to attack both of these problems: to represent characters in multiple character sets within a single character encoding, and to represent large character sets.
A second requirement of ISO-2022 was that it should be compatible with 7-bit communication channels. So even though ISO-2022 is an 8-bit encoding, any 8-bit sequence can be re-encoded to use only 7 bits without loss and normally with only a small increase in size.
To represent multiple character sets, the ISO/IEC 2022 character encodings include escape sequences which indicate the character set for the characters that follow. The escape sequences are registered with ISO and follow the patterns defined within the standard. These character encodings require data to be processed sequentially in a forward direction, since the correct interpretation of the data depends on previously encountered escape sequences. Note, however, that other standards such as ISO-2022-JP may impose extra conditions, such as requiring that the current character set be reset to US-ASCII before the end of a line.
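The dependence on earlier escape sequences can be seen with Python's built-in `iso2022_jp` codec. The following is only a small illustration, not part of the standard; the sample string is arbitrary.

```python
# Illustration with Python's standard "iso2022_jp" codec.
text = "ABC日本語"
encoded = text.encode("iso2022_jp")

# The encoder inserts ESC $ B before the kanji (designating JIS X 0208 to G0)
# and ESC ( B afterwards (returning G0 to ASCII).
print(encoded)

# The same two bytes are read differently depending on the escape
# sequence that precedes them.
raw = b"F|"
as_ascii = raw.decode("iso2022_jp")                                # two ASCII characters
as_kanji = (b"\x1b$B" + raw + b"\x1b(B").decode("iso2022_jp")      # one two-byte character
print(as_ascii, as_kanji)
```

Because the meaning of `raw` depends on what came before it, a decoder cannot start in the middle of such a stream without knowing the current designation.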
To represent large character sets, ISO/IEC 2022 builds on ISO/IEC 646's property that a seven-bit byte normally defines 94 graphic (printable) characters (in addition to space and 33 control characters). Using two bytes, it is thus possible to represent up to 8836 (94×94) characters; and, using three bytes, up to 830584 (94×94×94) characters. Though the standard defines three-byte sets, no registered character set uses them (although EUC-TW's unregistered G2 does). For the two-byte character sets, the code point of each character is normally specified in so-called kuten (Japanese: 区点) form (sometimes called quwei (Chinese: 区位), especially when dealing with GB2312 and related standards), which specifies a zone (区, Japanese: ku, Chinese: qu), and the point (Japanese: 点 ten) or position (Chinese: 位 wei) of that character within the zone.
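The arithmetic above, and the relationship between kuten numbers and byte values, can be sketched as follows. This assumes the usual convention that zone and point numbers 1 to 94 map onto the byte values 0x21 to 0x7E; the function names are hypothetical.

```python
from typing import Tuple

GRAPHIC_PER_BYTE = 94                      # 0x21-0x7E in a 94-character set

two_byte_capacity = GRAPHIC_PER_BYTE ** 2    # 94 x 94
three_byte_capacity = GRAPHIC_PER_BYTE ** 3  # 94 x 94 x 94


def kuten_to_gl(ku: int, ten: int) -> bytes:
    """Convert a zone (ku) and point (ten), each 1-94, to two GL bytes.

    Assumption: zone/point n is stored as byte 0x20 + n.
    """
    if not (1 <= ku <= 94 and 1 <= ten <= 94):
        raise ValueError("ku and ten must each be in 1..94")
    return bytes([0x20 + ku, 0x20 + ten])


def gl_to_kuten(pair: bytes) -> Tuple[int, int]:
    """Inverse of kuten_to_gl for a two-byte GL code."""
    if len(pair) != 2 or any(not 0x21 <= b <= 0x7E for b in pair):
        raise ValueError("expected two bytes in 0x21..0x7E")
    return pair[0] - 0x20, pair[1] - 0x20


print(two_byte_capacity, three_byte_capacity)
print(kuten_to_gl(38, 92).hex())
```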
The escape sequences therefore not only declare which character set is being used but also, through the known properties of that character set, indicate whether a 94-, 96-, 8836-, or 830584-character (or some other sized) encoding is being dealt with.
In practice, the escape sequences declaring the national character sets may be absent if context or convention dictates that a certain national character set is to be used. For example, ISO-8859-1 states that no defining escape sequence is needed and RFC 1922, which defines ISO-2022-CN, allows ISO-2022 SHIFT characters to be used without explicit use of escape sequences.
The ISO-2022 definitions of the ISO-8859-X character sets are specific fixed combinations of the components that form ISO-2022. Specifically, the lower control characters (C0), the US-ASCII character set (in GL), and the upper control characters (C1) are standard, and the high characters (GR) are defined for each of the ISO-8859-X variants; for example, ISO-8859-1 is defined by the combination of ISO-IR-1, ISO-IR-6, ISO-IR-77 and ISO-IR-100, with no shifts or character changes allowed.
Although ISO/IEC 2022 character sets using control sequences are still in common use, particularly ISO-2022-JP, most modern e-mail applications are converting to the simpler Unicode transforms such as UTF-8. The encodings that do not use control sequences, such as the ISO-8859 sets, are still very common.
ISO/IEC 2022 coding specifies a two-layer mapping between character codes and displayed characters. Escape sequences allow any of a large registry of graphic character sets to be "designated" into one of four working sets, named G0 through G3, and shorter control sequences specify the working set that is "invoked" to interpret bytes in the stream.
Character codes from the 7-bit ASCII graphic range (0x20–0x7F), being on the left side of a character code table, are referred to as "GL" codes (with "GL" standing for "graphics left") while codes from the "high ASCII" range (0xA0–0xFF), if available, are referred to as the "GR" codes ("graphics right").
By default, GL codes specify G0 characters, and GR codes specify G1 characters, but this may be modified with control codes or by prior agreement:
|Code||Abbreviation||Name||Effect|
|0x0F||SI, LS0||Shift in, Locking shift zero||GL encodes G0 from now on|
|0x0E||SO, LS1||Shift out, Locking shift one||GL encodes G1 from now on|
|ESC 0x6E (n)||LS2||Locking shift two||GL encodes G2 from now on|
|ESC 0x6F (o)||LS3||Locking shift three||GL encodes G3 from now on|
|0x8E or ESC 0x4E (N)||SS2||Single shift two||GL encodes G2 for next character only|
|0x8F or ESC 0x4F (O)||SS3||Single shift three||GL encodes G3 for next character only|
|ESC 0x7E (~)||LS1R||Locking shift one right||GR encodes G1 from now on|
|ESC 0x7D (})||LS2R||Locking shift two right||GR encodes G2 from now on|
|ESC 0x7C (|)||LS3R||Locking shift three right||GR encodes G3 from now on|
Each of the four working sets may be a 94-character set or a 94ⁿ-character set. Additionally, G1 through G3 may be a 96- or 96ⁿ-character set. When one of the latter is invoked in the GL region, the space and delete characters (codes 0x20 and 0x7F) are not available.
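The invocation layer described above can be sketched as a small state tracker. This is a simplified illustration rather than a conforming decoder: it assumes single-byte sets, treats a single shift as applying to the next graphic byte whichever half it falls in, and skips any other two-byte escape sequence without interpreting it. The function name `trace_invocations` is hypothetical.

```python
ESC = 0x1B

# Single-byte locking shifts for GL: SI (LS0) and SO (LS1).
LOCKING_GL = {0x0F: 0, 0x0E: 1}
# ESC + byte locking shifts: which half changes, and to which working set.
ESC_LOCKING = {
    0x6E: ("GL", 2), 0x6F: ("GL", 3),
    0x7E: ("GR", 1), 0x7D: ("GR", 2), 0x7C: ("GR", 3),
}
# Single shifts, in their 7-bit (ESC N / ESC O) and 8-bit (C1) forms.
ESC_SINGLE = {0x4E: 2, 0x4F: 3}
C1_SINGLE = {0x8E: 2, 0x8F: 3}


def trace_invocations(data: bytes):
    """Return (byte, working-set index) for every graphic byte in data."""
    gl, gr = 0, 1            # defaults: GL -> G0, GR -> G1
    single = None            # pending single shift, if any
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b in LOCKING_GL:
            gl = LOCKING_GL[b]
        elif b in C1_SINGLE:
            single = C1_SINGLE[b]
        elif b == ESC and i + 1 < len(data):
            nxt = data[i + 1]
            if nxt in ESC_LOCKING:
                side, g = ESC_LOCKING[nxt]
                if side == "GL":
                    gl = g
                else:
                    gr = g
            elif nxt in ESC_SINGLE:
                single = ESC_SINGLE[nxt]
            i += 1           # consume the byte after ESC
        elif 0x20 <= b <= 0x7F or 0xA0 <= b <= 0xFF:
            if single is not None:
                out.append((b, single))
                single = None
            else:
                out.append((b, gl if b <= 0x7F else gr))
        i += 1
    return out


print(trace_invocations(b"a\x0eb\x0fc"))
```

A real decoder would additionally track which character set is designated to each of G0 through G3, and would consume as many bytes per character as the invoked set requires.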
There are additional (rarely used) features for switching control character sets, but this is a single-level lookup: the 0x00–0x1F range is the C0 control character set, the 0x80–0x9F range is the C1 control character set, and there are escape sequences which switch in various alternatives. It is required that any C0 character set include the ESC character at position 0x1B, so that further changes are possible.
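As a rough illustration of the code table layout described so far, the four regions of an 8-bit code can be told apart by byte value alone. The function name `code_region` is hypothetical.

```python
def code_region(byte: int) -> str:
    """Classify an 8-bit value into the C0, GL, C1 or GR region."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a value in 0..255")
    if byte <= 0x1F:
        return "C0"      # 0x00-0x1F: control characters, including ESC at 0x1B
    if byte <= 0x7F:
        return "GL"      # 0x20-0x7F: left-hand graphic area
    if byte <= 0x9F:
        return "C1"      # 0x80-0x9F: second control character area
    return "GR"          # 0xA0-0xFF: right-hand graphic area


print([code_region(b) for b in (0x1B, 0x41, 0x8E, 0xE9)])
```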
As seen in the SS2 and SS3 examples above, single control characters from the C1 control character set may be invoked using only 7 bits using the sequences ESC 0x40 (@) through ESC 0x5F (_). Additional control functions are assigned in the range ESC 0x60 (`) through ESC 0x7E (~). While this article describes escape sequences using the corresponding ASCII characters, they are actually defined in terms of byte values, and the graphic assigned to that byte value may be altered without affecting the control sequence.
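The correspondence between an 8-bit C1 control and its 7-bit escape form is a fixed offset of 0x40, which can be sketched as below. This minimal example converts only C1 controls; it does not re-encode GR graphic bytes, and the function names are hypothetical.

```python
ESC = 0x1B


def c1_to_7bit(data: bytes) -> bytes:
    """Replace each C1 byte (0x80-0x9F) by ESC followed by (byte - 0x40)."""
    out = bytearray()
    for b in data:
        if 0x80 <= b <= 0x9F:
            out += bytes([ESC, b - 0x40])
        else:
            out.append(b)
    return bytes(out)


def c1_from_7bit(data: bytes) -> bytes:
    """Replace each ESC 0x40-0x5F pair by the equivalent 8-bit C1 byte."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == ESC and i + 1 < len(data) and 0x40 <= data[i + 1] <= 0x5F:
            out.append(data[i + 1] + 0x40)
            i += 2
        else:
            out.append(data[i])
            i += 1
    return bytes(out)


print(c1_to_7bit(b"\x8e").hex(), c1_from_7bit(b"\x1bO").hex())
```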
Escape sequences to designate character sets take the form ESC I [I...] F, where there are one or more intermediate I bytes from the range 0x20–0x2F, and a final F byte from the range 0x40–0x7E. (The range 0x30–0x3F is reserved for private-use F bytes.) The I bytes identify the type of character set and the working set it is to be designated to, while the F byte identifies the character set itself.
|Escape sequence||Hexadecimal||Abbreviation||Name||Effect|
|ESC ! F||1B 21 F||CZD||C0-designate||F selects a C0 control character set to be used.|
|ESC " F||1B 22 F||C1D||C1-designate||F selects a C1 control character set to be used.|
|ESC % F||1B 25 F||DOCS||Designate other coding system||F selects an 8-bit code; use |
|ESC % / F||1B 25 2F F||DOCS||Designate other coding system||F selects an 8-bit code; there is no standard way to return. E.g. |
|ESC & F||1B 26 F||IRR||Identify revised registration||F, adjusted to the range 1-63, indicates which revision of the immediately-following registration is needed, so that old systems know that they are old.|
|ESC ( F||1B 28 F||GZD4||G0-designate 94-set||F selects a 94-character set to be used for G0.|
|ESC ) F||1B 29 F||G1D4||G1-designate 94-set||F selects a 94-character set to be used for G1.|
|ESC * F||1B 2A F||G2D4||G2-designate 94-set||F selects a 94-character set to be used for G2.|
|ESC + F||1B 2B F||G3D4||G3-designate 94-set||F selects a 94-character set to be used for G3.|
|ESC - F||1B 2D F||G1D6||G1-designate 96-set||F selects a 96-character set to be used for G1.|
|ESC . F||1B 2E F||G2D6||G2-designate 96-set||F selects a 96-character set to be used for G2.|
|ESC / F||1B 2F F||G3D6||G3-designate 96-set||F selects a 96-character set to be used for G3.|
|ESC $ F or ESC $ ( F||1B 24 F or 1B 24 28 F||GZDM4||G0-designate multibyte 94-set||F selects a 94ⁿ-character set to be used for G0.|
|ESC $ ) F||1B 24 29 F||G1DM4||G1-designate multibyte 94-set||F selects a 94ⁿ-character set to be used for G1.|
|ESC $ * F||1B 24 2A F||G2DM4||G2-designate multibyte 94-set||F selects a 94ⁿ-character set to be used for G2.|
|ESC $ + F||1B 24 2B F||G3DM4||G3-designate multibyte 94-set||F selects a 94ⁿ-character set to be used for G3.|
|ESC $ - F||1B 24 2D F||G1DM6||G1-designate multibyte 96-set||F selects a 96ⁿ-character set to be used for G1.|
|ESC $ . F||1B 24 2E F||G2DM6||G2-designate multibyte 96-set||F selects a 96ⁿ-character set to be used for G2.|
|ESC $ / F||1B 24 2F F||G3DM6||G3-designate multibyte 96-set||F selects a 96ⁿ-character set to be used for G3.|
Note that the registry of F bytes is independent for the different types. The 94-character graphic set designated by ESC ( A through ESC + A is not related in any way to the 96-character set designated by ESC - A through ESC / A. And neither of those is related to the 94ⁿ-character set designated by ESC $ ( A through ESC $ + A, and so on; the final bytes must be interpreted in context. (Indeed, without any intermediate bytes, ESC A is a way of specifying the C1 control code 0x81.)
Also note that C0 and C1 control character sets are independent; the C0 control character set designated by ESC ! A (which happens to be the NATS control set for newspaper text transmission) is not the same as the C1 control character set designated by ESC " A (the CCITT attribute control set for Videotex).
Additional I bytes may be added before the F byte to extend the F byte range. This is currently only used with 94-character sets, where codes of the form ESC ( ! F have been assigned. At the other extreme, no multibyte 96-sets have been registered, so the sequences above are strictly theoretical.
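The structure of the designation sequences in the table above can be sketched as a small parser. This is an illustration under several assumptions: it accepts final bytes 0x30–0x7E (including the private-use range), it does not consult the registry to check that a set is actually registered, it does not handle the DOCS and IRR sequences, and the names `Designation` and `parse_designation` are hypothetical.

```python
from typing import NamedTuple, Optional

ESC = 0x1B
G94 = {0x28: "G0", 0x29: "G1", 0x2A: "G2", 0x2B: "G3"}   # ( ) * +
G96 = {0x2D: "G1", 0x2E: "G2", 0x2F: "G3"}               # - . /


class Designation(NamedTuple):
    target: str            # "C0", "C1", or "G0".."G3"
    size: Optional[int]    # 94 or 96 for graphic sets, None for control sets
    multibyte: bool        # True for the ESC $ ... forms
    final: str             # final byte, preceded by any extra intermediates


def parse_designation(seq: bytes) -> Designation:
    if len(seq) < 3 or seq[0] != ESC:
        raise ValueError("expected ESC, intermediate byte(s) and a final byte")
    intermediates, final = seq[1:-1], seq[-1]
    if any(not 0x20 <= b <= 0x2F for b in intermediates):
        raise ValueError("intermediate bytes must be in 0x20-0x2F")
    if not 0x30 <= final <= 0x7E:
        raise ValueError("final byte must be in 0x30-0x7E")

    multibyte = intermediates[0] == 0x24                  # '$'
    rest = intermediates[1:] if multibyte else intermediates
    if multibyte and not rest:
        # ESC $ F is the older spelling of ESC $ ( F. It exists only for
        # F = @, A, B; that restriction is not checked here.
        rest = b"("
    selector, extra = rest[0], rest[1:]
    name = extra.decode("ascii") + chr(final)

    if selector in G94:
        return Designation(G94[selector], 94, multibyte, name)
    if selector in G96:
        return Designation(G96[selector], 96, multibyte, name)
    if not multibyte and selector == 0x21:                # '!'
        return Designation("C0", None, False, name)
    if not multibyte and selector == 0x22:                # '"'
        return Designation("C1", None, False, name)
    raise ValueError("not a character-set designation handled by this sketch")


print(parse_designation(b"\x1b$)A"))
```

Returning the target, size and multibyte flag separately from the final byte reflects the point made above: the same final byte identifies unrelated sets in each of these categories.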
Character encodings using ISO/IEC 2022 mechanism include:
The character after the ESC (for single-byte character sets) or ESC $ (for multi-byte character sets) specifies the type of character set and the working set to which it is designated. In the above examples, the character ( (0x28) designates a 94-character set to the G0 character set. This may be replaced by ), * or + (0x29–0x2B) to designate to the G1–G3 character sets.
Two of the codes above are 96-character codes, and in the above examples, the character - (0x2D) designates to the G1 character set. This may be replaced with . or / (0x2E or 0x2F) to designate to the G2 or G3 character sets. As mentioned earlier, a 96-character set may not be designated to the G0 set.
There are three special cases for multi-byte codes. The code sequences ESC $ @, ESC $ A, and ESC $ B were all registered before the ISO/IEC 2022 standard was finalized, so must be accepted as synonyms for the sequences ESC $ ( @ through ESC $ ( B to designate to the G0 character set. The latter form may also be used, and may be adapted by changing the ( character to designate to the G1 through G3 character sets.
The standard also defines a way to specify coding systems that do not follow its own structure. Of particular interest, the sequence ESC % G designates the UTF-8 coding system, which does not reserve the range 0x80–0x9F for control characters.
The C0 and C1 control code or control character sets define control codes for use in text by computer systems that use the ISO/IEC 2022 system of specifying control and graphic characters. Most character encodings, in addition to representing printable characters, also have characters such as these that represent additional information about the text, such as the position of a cursor, an instruction to start a new line, or a message that the text has been received.
The C0 set defines codes in the range 0x00–0x1F and the C1 set defines codes in the range 0x80–0x9F. The default C0 set was originally defined in ISO 646 (ASCII), while the default C1 set was originally defined in ECMA-48 (harmonized later with ISO 6429). While other C0 and C1 sets are available for specialized applications, they are rarely used.