ISO/IEC 2022

ISO/IEC 2022 Information technology—Character code structure and extension techniques, is an ISO standard (equivalent to the ECMA standard ECMA-35[1]) specifying

  • a technique for including multiple character sets in a single character encoding system, and
  • a technique for representing these character sets in both 7 and 8 bit systems using the same encoding.

Many of the character sets included as ISO/IEC 2022 encodings are 'double byte' encodings where two bytes correspond to a single character. This makes ISO-2022 a variable width encoding. But a specific implementation does not have to implement all of the standard; the conformance level and the supported character sets are defined by the implementation.

ISO 2022
Language(s): Various
Standard: ISO 2022, ECMA 35, JIS X 0202
Classification: Stateful encoding
Transforms / Encodes: US-ASCII and others, depending on implementation
Succeeded by: ISO 10646 (Unicode)

Introduction

Many languages or language families not based on the Latin alphabet such as Greek, Cyrillic, Arabic, or Hebrew have historically been represented on computers with different 8-bit extended ASCII encodings. Written East Asian languages, specifically Chinese, Japanese, and Korean, use far more characters than can be represented in an 8-bit computer byte and were first represented on computers with language-specific double byte encodings.

ISO/IEC 2022 was developed as a technique to attack both of these problems: to represent characters in multiple character sets within a single character encoding, and to represent large character sets.

A second requirement of ISO-2022 was that it should be compatible with 7-bit communication channels. So even though ISO-2022 is an 8-bit character set, any 8-bit sequence can be re-encoded to use only 7 bits without loss, normally with only a small increase in size.

To represent multiple character sets, the ISO/IEC 2022 character encodings include escape sequences which indicate the character set for the characters which follow. The escape sequences are registered with ISO and follow the patterns defined within the standard. These character encodings require data to be processed sequentially in a forward direction, since the correct interpretation of the data depends on previously encountered escape sequences. Note, however, that other standards such as ISO-2022-JP may impose extra conditions, for example that the current character set be reset to US-ASCII before the end of a line.

To represent large character sets, ISO/IEC 2022 builds on ISO/IEC 646's property that one seven-bit character will normally define 94 graphic (printable) characters (in addition to space and 33 control characters). Using two bytes, it is thus possible to represent up to 8836 (94×94) characters; and, using three bytes, up to 830584 (94×94×94) characters. Though the standard defines it, no registered character set uses three bytes (although EUC-TW's unregistered G2 set does). For the two-byte character sets, the code point of each character is normally specified in so-called kuten (Japanese: 区点) form (sometimes called quwei (Chinese: 区位), especially when dealing with GB2312 and related standards), which specifies a zone (区; Japanese: ku, Chinese: qu), and the point (Japanese: ten) or position (Chinese: wei) of that character within the zone.
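The arithmetic above can be sketched in a few lines of Python. The function names are hypothetical, and the mapping assumes the usual convention that a 1-based zone and point are each offset by 0x20 to give a byte in the 7-bit graphic range; it is an illustration, not text from the standard.

```python
def set_size(bytes_per_char: int) -> int:
    """Number of code points in a 94^n set with the given bytes per character."""
    return 94 ** bytes_per_char


def kuten_to_gl_bytes(zone: int, point: int) -> bytes:
    """Map a 1-based (zone, point) pair to two 7-bit bytes.

    Assumption: each coordinate is offset by 0x20, so valid bytes fall in
    0x21..0x7E and never collide with space (0x20) or delete (0x7F).
    """
    if not (1 <= zone <= 94 and 1 <= point <= 94):
        raise ValueError("zone and point must each be in 1..94")
    return bytes([0x20 + zone, 0x20 + point])


print(set_size(2), set_size(3))       # sizes of two- and three-byte sets
print(kuten_to_gl_bytes(16, 1).hex()) # zone 16, point 1
```

Keeping every byte inside 0x21–0x7E is what lets a multi-byte set travel over the same 7-bit channel as a single-byte one.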

The escape sequences therefore declare not only which character set is being used but also, through the known properties of that character set, whether a 94-, 96-, 8836-, or 830584-character (or some other sized) encoding is being dealt with.

In practice, the escape sequences declaring the national character sets may be absent if context or convention dictates that a certain national character set is to be used. For example, ISO-8859-1 states that no defining escape sequence is needed and RFC 1922, which defines ISO-2022-CN, allows ISO-2022 SHIFT characters to be used without explicit use of escape sequences.

The ISO-2022 definitions of the ISO-8859-X character sets are specific fixed combinations of the components that form ISO-2022. Specifically, the lower control characters (C0), the US-ASCII character set (in GL), and the upper control characters (C1) are standard, and the high characters (GR) are defined for each of the ISO-8859-X variants; for example, ISO-8859-1 is defined by the combination of ISO-IR-1, ISO-IR-6, ISO-IR-77 and ISO-IR-100, with no shifts or character changes allowed.

Although ISO/IEC 2022 character sets using control sequences are still in common use, particularly ISO-2022-JP, most modern e-mail applications are converting to the simpler Unicode transforms such as UTF-8. The encodings that don't use control sequences, such as the ISO-8859 sets, are still very common.

Code structure

ISO/IEC 2022 coding specifies a two-layer mapping between character codes and displayed characters. Escape sequences allow any of a large registry of graphic character sets to be "designated" into one of four working sets, named G0 through G3, and shorter control sequences specify the working set that is "invoked" to interpret bytes in the stream.

Character codes from the 7-bit ASCII graphic range (0x20–0x7F), being on the left side of a character code table, are referred to as "GL" codes (with "GL" standing for "graphics left") while codes from the "high ASCII" range (0xA0–0xFF), if available, are referred to as the "GR" codes ("graphics right").

By default, GL codes specify G0 characters, and GR codes specify G1 characters, but this may be modified with control codes or by prior agreement:

Code                   Abbr.     Name                            Effect
0x0F                   SI, LS0   Shift In, Locking shift zero    GL encodes G0 from now on
0x0E                   SO, LS1   Shift Out, Locking shift one    GL encodes G1 from now on
ESC 0x6E (n)           LS2       Locking shift two               GL encodes G2 from now on
ESC 0x6F (o)           LS3       Locking shift three             GL encodes G3 from now on
0x8E or ESC 0x4E (N)   SS2       Single shift two                GL encodes G2 for next character only
0x8F or ESC 0x4F (O)   SS3       Single shift three              GL encodes G3 for next character only
ESC 0x7E (~)           LS1R      Locking shift one right         GR encodes G1 from now on
ESC 0x7D (})           LS2R      Locking shift two right         GR encodes G2 from now on
ESC 0x7C (|)           LS3R      Locking shift three right       GR encodes G3 from now on
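As a rough illustration of how the shift functions in this table drive a decoder, the following Python sketch tracks which working set (G0 to G3) each graphic byte would be looked up in. It is a simplification under stated assumptions: the name `track_shifts` is hypothetical, only single-byte sets are considered, designation escape sequences are not handled (they raise an error), and a single shift is applied to the next GL byte only.

```python
ESC = 0x1B

# Final byte of the two-byte ESC form -> (region, working set index)
LOCKING_ESC = {
    0x6E: ("GL", 2),  # LS2
    0x6F: ("GL", 3),  # LS3
    0x7E: ("GR", 1),  # LS1R
    0x7D: ("GR", 2),  # LS2R
    0x7C: ("GR", 3),  # LS3R
}
SINGLE_ESC = {0x4E: 2, 0x4F: 3}  # 7-bit forms of SS2 and SS3
SINGLE_C1 = {0x8E: 2, 0x8F: 3}   # 8-bit forms of SS2 and SS3


def track_shifts(data: bytes) -> list[tuple[int, int]]:
    """Return (byte, working set index) for every graphic byte in data."""
    gl, gr = 0, 1      # defaults: GL encodes G0, GR encodes G1
    single = None      # pending single shift, if any
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b == 0x0F:                      # SI / LS0
            gl = 0
        elif b == 0x0E:                    # SO / LS1
            gl = 1
        elif b in SINGLE_C1:
            single = SINGLE_C1[b]
        elif b == ESC:
            if i + 1 >= len(data):
                raise ValueError("truncated escape sequence")
            final = data[i + 1]
            if final in LOCKING_ESC:
                region, g = LOCKING_ESC[final]
                if region == "GL":
                    gl = g
                else:
                    gr = g
            elif final in SINGLE_ESC:
                single = SINGLE_ESC[final]
            else:
                raise ValueError("escape sequence not handled by this sketch")
            i += 1                         # skip the final byte as well
        elif 0x20 <= b <= 0x7F:            # GL region
            out.append((b, gl if single is None else single))
            single = None
        elif b >= 0xA0:                    # GR region
            out.append((b, gr))
        i += 1
    return out


print(track_shifts(b"A\x0eB\x0fC"))
```

The state lives entirely in the two variables `gl` and `gr` plus the pending single shift, which is why the same byte can mean different characters at different points in a stream.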

Each of the four working sets may be a 94-character set or a 94^n-character set. Additionally, G1 through G3 may be a 96- or 96^n-character set. When one of the latter is invoked in the GL region, the space and delete characters (codes 0x20 and 0x7F) are not available.

There are additional (rarely used) features for switching control character sets, but this is a single-level lookup: the 0x00–0x1F range is the C0 control character set, the 0x80–0x9F range is the C1 control character set, and there are escape sequences which switch in various alternatives. It is required that any C0 character set include the ESC character at position 0x1B, so that further changes are possible.

As seen in the SS2 and SS3 examples above, single control characters from the C1 control character set may be invoked using only 7 bits using the sequences ESC 0x40 (@) through ESC 0x5F (_). Additional control functions are assigned in the range ESC 0x60 (`) through ESC 0x7E (~). While this article describes escape sequences using the corresponding ASCII characters, they are actually defined in terms of byte values, and the graphic assigned to that byte value may be altered without affecting the control sequence.
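The correspondence between an 8-bit C1 control code and its 7-bit escape form is a fixed offset of 0x40, as the SS2 (0x8E, ESC 0x4E) and SS3 (0x8F, ESC 0x4F) pairs show. A minimal Python sketch, with hypothetical function names:

```python
ESC = 0x1B


def c1_to_escape(c1: int) -> bytes:
    """Return the two-byte 7-bit form (ESC + final byte) of a C1 control code."""
    if not 0x80 <= c1 <= 0x9F:
        raise ValueError("not a C1 control code")
    return bytes([ESC, c1 - 0x40])


def escape_to_c1(seq: bytes) -> int:
    """Return the 8-bit C1 code for a two-byte ESC 0x40..0x5F sequence."""
    if len(seq) != 2 or seq[0] != ESC or not 0x40 <= seq[1] <= 0x5F:
        raise ValueError("not a 7-bit C1 escape sequence")
    return seq[1] + 0x40


print(c1_to_escape(0x8E))          # 7-bit form of SS2
print(hex(escape_to_c1(b"\x1bA")))  # C1 code reached by ESC A
```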

Escape sequences to designate character sets take the form ESC I [I...] F, where there are one or more intermediate I bytes from the range 0x20–0x2F, and a final F byte from the range 0x40–0x7E. (The range 0x30–0x3F is reserved for private-use F bytes.) The I bytes identify the type of character set and the working set it is to be designated to, while the F byte identifies the character set itself.

Code                  Hex                     Abbr.   Name                           Effect
ESC ! F               1B 21 F                 CZD     C0-designate                   F selects a C0 control character set to be used.
ESC " F               1B 22 F                 C1D     C1-designate                   F selects a C1 control character set to be used.
ESC % F               1B 25 F                 DOCS    Designate other coding system  F selects an 8-bit code; use ESC % @ to return to ISO/IEC 2022. E.g. ESC % G for UTF-8.
ESC % / F             1B 25 2F F              DOCS    Designate other coding system  F selects an 8-bit code; there is no standard way to return. E.g. ESC % / E for UCS-2.
ESC & F               1B 26 F                 IRR     Identify revised registration  F, adjusted to the range 1-63, indicates which revision of the immediately-following registration is needed, so that old systems know that they are old.
ESC ( F               1B 28 F                 GZD4    G0-designate 94-set            F selects a 94-character set to be used for G0.
ESC ) F               1B 29 F                 G1D4    G1-designate 94-set            F selects a 94-character set to be used for G1.
ESC * F               1B 2A F                 G2D4    G2-designate 94-set            F selects a 94-character set to be used for G2.
ESC + F               1B 2B F                 G3D4    G3-designate 94-set            F selects a 94-character set to be used for G3.
ESC - F               1B 2D F                 G1D6    G1-designate 96-set            F selects a 96-character set to be used for G1.
ESC . F               1B 2E F                 G2D6    G2-designate 96-set            F selects a 96-character set to be used for G2.
ESC / F               1B 2F F                 G3D6    G3-designate 96-set            F selects a 96-character set to be used for G3.
ESC $ F or ESC $ ( F  1B 24 F or 1B 24 28 F   GZDM4   G0-designate multibyte 94-set  F selects a 94^n-character set to be used for G0.
ESC $ ) F             1B 24 29 F              G1DM4   G1-designate multibyte 94-set  F selects a 94^n-character set to be used for G1.
ESC $ * F             1B 24 2A F              G2DM4   G2-designate multibyte 94-set  F selects a 94^n-character set to be used for G2.
ESC $ + F             1B 24 2B F              G3DM4   G3-designate multibyte 94-set  F selects a 94^n-character set to be used for G3.
ESC $ - F             1B 24 2D F              G1DM6   G1-designate multibyte 96-set  F selects a 96^n-character set to be used for G1.
ESC $ . F             1B 24 2E F              G2DM6   G2-designate multibyte 96-set  F selects a 96^n-character set to be used for G2.
ESC $ / F             1B 24 2F F              G3DM6   G3-designate multibyte 96-set  F selects a 96^n-character set to be used for G3.

Note that the registry of F bytes is independent for the different types. The 94-character graphic set designated by ESC ( A through ESC + A is not related in any way to the 96-character set designated by ESC - A through ESC / A. And neither of those is related to the 94n-character set designated by ESC $ ( A through ESC $ + A, and so on; the final bytes must be interpreted in context. (Indeed, without any intermediate bytes, ESC A is a way of specifying the C1 control code 0x81.)

Also note that C0 and C1 control character sets are independent; the C0 control character set designated by ESC ! A (which happens to be the NATS control set for newspaper text transmission) is not the same as the C1 control character set designated by ESC " A (the CCITT attribute control set for Videotex).

Additional I bytes may be added before the F byte to extend the F byte range. This is currently only used with 94-character sets, where codes of the form ESC ( ! F have been assigned. At the other extreme, no multibyte 96-sets have been registered, so the sequences above are strictly theoretical.
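The structure of the graphic-set designation sequences can be made concrete with a small classifier. The Python sketch below is illustrative only: the names `Designation` and `parse_designation` are hypothetical, control-set and DOCS sequences are deliberately rejected, and the accepted final-byte range (0x30–0x7E, with 0x30–0x3F being private use) is an assumption based on a reading of ECMA-35 rather than something this sketch verifies.

```python
from typing import NamedTuple


class Designation(NamedTuple):
    target: str      # working set: "G0" .. "G3"
    size: int        # 94 or 96 (per byte)
    multibyte: bool  # True for 94^n / 96^n sets
    ident: bytes     # extra intermediate bytes (if any) plus the final byte


_TARGETS_94 = {0x28: "G0", 0x29: "G1", 0x2A: "G2", 0x2B: "G3"}  # ( ) * +
_TARGETS_96 = {0x2D: "G1", 0x2E: "G2", 0x2F: "G3"}              # - . /


def parse_designation(seq: bytes) -> Designation:
    """Classify an escape sequence that designates a graphic character set."""
    if len(seq) < 3 or seq[0] != 0x1B:
        raise ValueError("not a graphic-set designation")
    inter, final = seq[1:-1], seq[-1]
    if not 0x30 <= final <= 0x7E or any(not 0x20 <= b <= 0x2F for b in inter):
        raise ValueError("malformed escape sequence")
    multibyte = inter[:1] == b"$"
    if multibyte:
        inter = inter[1:]
        if not inter:
            # Legacy short forms ESC $ @, ESC $ A, ESC $ B designate to G0.
            if final in b"@AB":
                return Designation("G0", 94, True, bytes([final]))
            raise ValueError("short multibyte form is only defined for @, A, B")
    kind, ident = inter[0], inter[1:] + bytes([final])
    if kind in _TARGETS_94:
        return Designation(_TARGETS_94[kind], 94, multibyte, ident)
    if kind in _TARGETS_96:
        return Designation(_TARGETS_96[kind], 96, multibyte, ident)
    raise ValueError("not a graphic-set designation")


print(parse_designation(b"\x1b$B"))
print(parse_designation(b"\x1b.A"))
```

Returning the set type alongside the final byte reflects the point that the same F byte names unrelated sets in the 94, 96 and multibyte registries.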

ISO/IEC 2022 character sets

Figure: Various ISO 2022 and other CJK encodings supported by Mozilla Firefox as of 2004. (This support has been reduced in later versions to avoid certain cross-site scripting attacks.)

Character encodings using the ISO/IEC 2022 mechanism include:

  • ISO-2022-JP. A widely used encoding for Japanese. Starts in ASCII and includes the following escape sequences:
    • ESC ( B to switch to ASCII (1 byte per character)
    • ESC ( J to switch to JIS X 0201-1976 (ISO/IEC 646:JP) Roman set (1 byte per character)
    • ESC $ @ to switch to JIS X 0208-1978 (2 bytes per character)
    • ESC $ B to switch to JIS X 0208-1983 (2 bytes per character)
  • ISO-2022-JP-1. The same as ISO-2022-JP with one additional escape sequence
  • ISO-2022-JP-2. A multilingual extension of ISO-2022-JP. The same as ISO-2022-JP-1 with the following additional escape sequences [2]
    • ESC $ A to switch to GB 2312-1980 (2 bytes per character)
    • ESC $ ( C to switch to KS X 1001-1992 (2 bytes per character)
    • ESC . A to switch to ISO/IEC 8859-1 high part, Extended Latin 1 set (1 byte per character) [designated to G2]
    • ESC . F to switch to ISO/IEC 8859-7 high part, Basic Greek set (1 byte per character) [designated to G2]
  • ISO-2022-JP-3. The same as ISO-2022-JP with three additional escape sequences
  • ISO-2022-JP-2004. The same as ISO-2022-JP-3 with one additional escape sequence
  • ISO-2022-KR. An encoding for Korean.
    • ESC $ ) C to switch to KS X 1001-1992,[3][4] previously named KS C 5601-1987 (2 bytes per character) [designated to G1]
  • ISO-2022-CN. An encoding for Chinese.
    • ESC $ ) A to switch to GB 2312-1980 (2 bytes per character) [designated to G1]
    • ESC $ ) G to switch to CNS 11643-1992 Plane 1 (2 bytes per character) [designated to G1]
    • ESC $ * H to switch to CNS 11643-1992 Plane 2 (2 bytes per character) [designated to G2]
  • ISO-2022-CN-EXT. The same as ISO-2022-CN with six additional escape sequences
    • ESC $ ) E to switch to ISO-IR-165 (2 bytes per character) [designated to G1]
    • ESC $ + I to switch to CNS 11643-1992 Plane 3 (2 bytes per character) [designated to G3]
    • ESC $ + J to switch to CNS 11643-1992 Plane 4 (2 bytes per character) [designated to G3]
    • ESC $ + K to switch to CNS 11643-1992 Plane 5 (2 bytes per character) [designated to G3]
    • ESC $ + L to switch to CNS 11643-1992 Plane 6 (2 bytes per character) [designated to G3]
    • ESC $ + M to switch to CNS 11643-1992 Plane 7 (2 bytes per character) [designated to G3]
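To see one of these profiles in action, the short Python example below uses the `iso2022_jp` codec from Python's standard library (chosen here purely for illustration; it is not mentioned in the text above). It shows the escape sequences appearing in the byte stream and that the output stays within 7 bits.

```python
ESC = b"\x1b"

text = "ISO 日本語"
encoded = text.encode("iso2022_jp")

# The stream starts in ASCII, switches to a two-byte set, then returns to ASCII.
print(encoded)
print(ESC + b"$B" in encoded)        # designation of JIS X 0208-1983
print(encoded.endswith(ESC + b"(B"))  # back to ASCII before the end
print(all(b < 0x80 for b in encoded)) # 7-bit clean

# Decoding has to process the bytes in order, tracking the current set.
print(encoded.decode("iso2022_jp") == text)
```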

The character after the ESC (for single-byte character sets) or ESC $ (for multi-byte character sets) specifies both the type of character set and the working set it is designated to. In the above examples, the character ( (0x28) designates a 94-character set to the G0 character set. This may be replaced by ), * or + (0x29–0x2B) to designate to the G1–G3 character sets.

Two of the codes above are 96-character codes. For these, the character - (0x2D) designates to the G1 character set; it may be replaced with . or / (0x2E or 0x2F) to designate to the G2 or G3 character sets, as in the ISO-2022-JP-2 examples above, which use . to designate to G2. As mentioned earlier, a 96-character set may not be designated to the G0 set.

There are three special cases for multi-byte codes. The code sequences ESC $ @, ESC $ A, and ESC $ B were all registered before the ISO/IEC 2022 standard was finalized, and so must be accepted as synonyms for the sequences ESC $ ( @ through ESC $ ( B to designate to the G0 character set. The latter form may also be used, and may be adapted by changing the ( character to designate to the G1 through G3 character sets.

The standard also defines a way to specify coding systems that do not follow its own structure. Of particular interest, the sequence ESC % G designates the UTF-8 coding system, which does not reserve the range 0x80–0x9F for control characters.

Comparison with other encodings

Advantages

  • As ISO/IEC 2022's entire range of 94-set graphical character encodings can be delegated to GL, the available glyphs are not significantly limited by an inability to represent GR and C1, such as in a system limited to 7-bit encodings. It accordingly enables the representation of a large set of characters in such a system. Generally, this 7-bit compatibility is an advantage only for backwards compatibility with older systems, since the vast majority of modern computers use 8 bits for each byte.
  • As compared to Unicode, ISO/IEC 2022 sidesteps Han unification by using sequence codes to switch between discrete encodings for different East Asian languages. This avoids the issues associated with unification, such as difficulty supporting multiple CJK languages with their associated character variants in a single document and font.

Disadvantages

  • Since ISO/IEC 2022 is a stateful encoding, a program cannot jump in the middle of a block of text to search, insert or delete characters. This makes manipulation of the text very cumbersome and slow when compared to non-stateful encodings. Any jump in the middle of the text may require a back up to the previous escape sequence before the bytes following the escape sequence can be interpreted.
  • Due to the stateful nature of ISO/IEC 2022, an identical and equivalent character may be encoded in different character sets, which may be delegated to any of G0 through G3, which may be accessed using single shifts or by using locking shifts to GL or GR. Consequently, characters can be represented in multiple ways, meaning that two visually identical and equivalent strings cannot be reliably compared for equality.
  • Some systems, like DICOM and several e-mail clients, use a variant of ISO-2022 in addition to supporting several other encodings.[5] This type of variation makes it difficult to portably transfer text between computer systems.
  • UTF-1, the multi-byte Unicode transformation format compatible with ISO/IEC 2022, has various disadvantages in comparison with UTF-8, and switching from or to other charsets, as supported by ISO/IEC 2022, is typically unnecessary in Unicode documents.
  • Because of its escape sequences, it is possible to construct attack byte sequences that round-trip from ISO/IEC 2022 to Unicode and back. Use of this encoding is thus treated as suspicious by malware protection suites.[6]
  • Concatenation can pose issues. Profiles such as ISO-2022-JP specify that the stream starts in the ASCII state and must end in the ASCII state.[7] This is necessary to ensure that characters in concatenated ISO-2022-JP and/or ASCII streams will be interpreted in the correct set. As a consequence, if a stream that ends in a multi-byte character is concatenated with one that starts with a multi-byte character, a pair of escape codes is generated, switching to ASCII and immediately away from it. However, as stipulated in Unicode Technical Report #36 ("Unicode Security Considerations"), pairs of ISO 2022 escape sequences with no characters between them should generate a replacement character ("�") to prevent them from being used to mask malicious sequences.[8] Implementing this measure, e.g. in Mozilla Thunderbird, has led to interoperability issues, with unexpected "�" characters being generated where two ISO-2022-JP streams have been concatenated.[9]
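Two of the disadvantages listed above, non-unique byte representations and the back-to-back escape pair produced by concatenation, can be observed directly. The sketch below uses Python's standard-library `iso2022_jp` codec as a convenient stand-in; note that this particular decoder does not apply the replacement-character rule from Unicode Technical Report #36, so behaviour in other implementations may differ.

```python
ESC = b"\x1b"

a = "あ".encode("iso2022_jp")
b = "い".encode("iso2022_jp")

joined = a + b                          # two streams concatenated
together = "あい".encode("iso2022_jp")  # the same text encoded in one pass

# Concatenation leaves "switch to ASCII" immediately followed by
# "switch to JIS X 0208", with no characters in between.
empty_pair = ESC + b"(B" + ESC + b"$B"
print(empty_pair in joined)
print(empty_pair in together)

# Different byte strings, same decoded text: byte-wise comparison is unreliable.
print(joined != together)
print(joined.decode("iso2022_jp") == together.decode("iso2022_jp"))
```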

References

  1. "Standard ECMA 35" (PDF).
  2. RFC 1554 - ISO-2022-JP-2: Multilingual Extension of ISO-2022-JP. Tools.ietf.org. Retrieved on 2014-05-20.
  3. "KS X 1001:1992" (PDF).
  4. "KS C 5601:1987" (PDF). 1988-10-01.
  5. "DICOM ISO 2022 variation".
  6. https://bugzilla.mozilla.org/show_bug.cgi?id=935453
  7. RFC 1468
  8. Davis, Mark; Suignard, Michel (2014-09-19). "3.6.2 Some Output For All Input". Unicode Technical Report #36: Unicode Security Considerations (revision 15). Unicode Consortium.
  9. Sivonen, Henri (2018-12-17). "(UNSUBMITTED DRAFT) No U+FFFD Generation for Zero-Length ASCII-State Content between ISO-2022-JP Escape Sequences" (PDF).
  • Lunde, Ken. CJKV Information Processing. Cambridge, Massachusetts: O'Reilly & Associates, 1998. ISBN 1-56592-224-7.

External links

RFCs
  • RFC 1468: description of ISO-2022-JP
  • RFC 2237: description of ISO-2022-JP-1
  • RFC 1554: description of ISO-2022-JP-2
  • RFC 1922: description of ISO-2022-CN and ISO-2022-CN-EXT
  • RFC 1557: description of ISO-2022-KR

C0 and C1 control codes

The C0 and C1 control code or control character sets define control codes for use in text by computer systems that use the ISO/IEC 2022 system of specifying control and graphic characters. Most character encodings, in addition to representing printable characters, also have characters such as these that represent additional information about the text, such as the position of a cursor, an instruction to start a new line, or a message that the text has been received.

The C0 set defines codes in the range 00–1F hex and the C1 set defines codes in the range 80–9F hex. The default C0 set was originally defined in ISO 646 (ASCII), while the default C1 set was originally defined in ECMA-48 (harmonized later with ISO 6429). While other C0 and C1 sets are available for specialized applications, they are rarely used.

Character encoding

Character encoding is used to represent a repertoire of characters by some kind of encoding system. Depending on the abstraction level and context, corresponding code points and the resulting code space may be regarded as bit patterns, octets, natural numbers, electrical pulses, etc. A character encoding is used in computation, data storage, and transmission of textual data. "Character set", "character map", "codeset" and "code page" are related, but not identical, terms.

Early character codes associated with the optical or electrical telegraph could only represent a subset of the characters used in written languages, sometimes restricted to upper case letters, numerals and some punctuation only. The low cost of digital representation of data in modern computer systems allows more elaborate character codes (such as Unicode) which represent most of the characters used in many written languages. Character encoding using internationally accepted standards permits worldwide interchange of text in electronic form.

Code page 1287

Code page 1287, also known as CP1287, DEC Greek (8-bit) and EL8DEC, is one of the code pages implemented for the VT220 terminals. It supports the Greek language.

Ecma International

Ecma is a standards organization for information and communication systems. It acquired its current name in 1994, when the European Computer Manufacturers Association (ECMA) changed its name to reflect the organization's global reach and activities. As a consequence, the name is no longer considered an acronym and no longer uses full capitalization.

The organization was founded in 1961 to standardize computer systems in Europe. Membership is open to large and small companies worldwide that produce, market or develop computer or communication systems, and have interest and experience in the areas addressed by the group's technical bodies. It is located in Geneva.

ISO-8859-8-I

ISO-8859-8-I is the IANA charset name for the character encoding ISO/IEC 8859-8 used together with the control codes from ISO/IEC 6429 for the C0 (00–1F hex) and C1 (80–9F) parts. The characters are in logical order.

Escape sequences (from ISO/IEC 6429 or ISO/IEC 2022) are not to be interpreted. Most applications only interpret the control codes for LF, CR, and HT. A few applications also interpret VT, FF, and NEL (in C1). Very few applications interpret the other C0 and C1 control codes.

ISO-8859-8 is sometimes in logical order (HTML, XML), and sometimes in visual (left-to-right) order (plain text without any markup).

Logical order for this charset requires bidi processing for display.

ISO/IEC 6937

ISO/IEC 6937:2001, Information technology — Coded graphic character set for text communication — Latin alphabet, is a multibyte extension of ASCII, or rather of ISO/IEC 646-IRV. It was developed in common with ITU-T (then CCITT) for telematic services under the name of T.51, and first became an ISO standard in 1983. Certain byte codes are used as lead bytes for letters with diacritics (accents). The value of the lead byte often indicates which diacritic the letter has, and the follow byte then has the ASCII value for the letter that the diacritic is on. Only certain combinations of lead byte and follow byte are allowed, and there are some exceptions to the lead byte interpretation for some follow bytes. No combining characters at all are encoded in ISO/IEC 6937, but some free-standing diacritics can be represented, often by letting the follow byte have the code for ASCII space.

ISO/IEC 6937's architects were Hugh McGregor Ross, Peter Fenwick, Bernard Marti and Loek Zeckendorf.

ISO6937/2 defines 327 characters found in modern European languages using the Latin alphabet. Non-Latin European characters, such as Cyrillic and Greek, are not included in the standard. Also, some diacritics used with the Latin alphabet like the Romanian comma are not included, using cedilla instead as no distinction between cedilla and comma below was made at the time.

IANA has registered the charset names ISO_6937-2-25 and ISO_6937-2-add for two (older) versions of this standard (plus control codes). But in practice this character encoding is unused on the Internet.

The ISO/IEC 2022 escape sequence to specify the right-hand side of the ISO/IEC 6937 character set is ESC - R (hex 1B 2D 52).

ISO/IEC 8859-12

ISO/IEC 8859-12 would have been part 12 of the ISO/IEC 8859 character encoding standard series.

ISO 8859-12 was originally proposed to support the Celtic languages. ISO 8859-12 was later slated for Latin/Devanagari, but this was abandoned in 1997, during the 12th meeting of ISO/IEC JTC 1/SC 2/WG 3 in Iraklion-Crete, Greece, 4 to 7 July 1997. The Celtic proposal was changed to ISO 8859-14.

ISO/IEC 8859-16

ISO/IEC 8859-16:2001, Information technology — 8-bit single-byte coded graphic character sets — Part 16: Latin alphabet No. 10, is part of the ISO/IEC 8859 series of ASCII-based standard character encodings, first edition published in 2001. It is informally referred to as Latin-10 or South-Eastern European. It was designed to cover Albanian, Croatian, Hungarian, Polish, Romanian, Serbian and Slovenian, but also French, German, Italian and Irish Gaelic (new orthography).

ISO-8859-16 is the IANA preferred charset name for this standard when supplemented with the C0 and C1 control codes from ISO/IEC 6429.

Microsoft has assigned code page 28606 a.k.a. Windows-28606 to ISO-8859-16.

ISO/IEC 8859-3

ISO/IEC 8859-3:1999, Information technology — 8-bit single-byte coded graphic character sets — Part 3: Latin alphabet No. 3, is part of the ISO/IEC 8859 series of ASCII-based standard character encodings, first edition published in 1988. It is informally referred to as Latin-3 or South European. It was designed to cover Turkish, Maltese and Esperanto, though the introduction of ISO/IEC 8859-9 superseded it for Turkish. The encoding remains popular with users of Esperanto, though use is waning as application support for Unicode becomes more common.

ISO-8859-3 is the IANA preferred charset name for this standard when supplemented with the C0 and C1 control codes from ISO/IEC 6429. Microsoft has assigned code page 28593 a.k.a. Windows-28593 to ISO-8859-3 in Windows. IBM has assigned code page 913 to ISO 8859-3.

ISO/IEC 8859-9

ISO/IEC 8859-9:1999, Information technology — 8-bit single-byte coded graphic character sets — Part 9: Latin alphabet No. 5, is part of the ISO/IEC 8859 series of ASCII-based standard character encodings, first edition published in 1989. It is informally referred to as Latin-5 or Turkish. It was designed to cover the Turkish language and to be of more use than the ISO/IEC 8859-3 encoding. It is identical to ISO/IEC 8859-1 except that six Icelandic characters are replaced with characters unique to the Turkish alphabet.

ISO-8859-9 is the IANA preferred charset name for this standard when supplemented with the C0 and C1 control codes from ISO/IEC 6429. In modern applications Unicode and UTF-8 are preferred. As of February 2016, 0.1% of all web pages used ISO-8859-9. Microsoft has assigned code page 28599 a.k.a. Windows-28599 to ISO-8859-9 in Windows. IBM has assigned code page 920 to ISO-8859-9.

ISO/IEC JTC 1/SC 2

ISO/IEC JTC 1/SC 2 Coded character sets is a standardization subcommittee of the Joint Technical Committee ISO/IEC JTC 1 of the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC), that develops and facilitates standards within the field of coded character sets. The international secretariat of ISO/IEC JTC 1/SC 2 is the Japanese Industrial Standards Committee (JISC), located in Japan.

Index of Japan-related articles (I)

This page lists Japan-related articles with romanized titles beginning with the letter I. For names of people, please list by surname (i.e., "Tarō Yamada" should be listed under "Y", not "T"). Please also ignore particles (e.g. "a", "an", "the") when listing articles (i.e., "A City with No People" should be listed under "City").

List of Ecma standards

This is a list of standards published by Ecma International, formerly the European Computer Manufacturers Association.

MARC-8

The MARC-8 charset is a MARC standard used in MARC-21 library records. The MARC formats are standards for the representation and communication of bibliographic and related information in machine-readable form, and they are frequently used in library database systems. The character encoding now known as MARC-8 was introduced in 1968 as part of the MARC format. Originally based on the Latin alphabet, from 1979 to 1983 the JACKPHY initiative expanded the repertoire to include Japanese, Arabic, Chinese, and Hebrew characters (among others), with the later addition of Cyrillic and Greek scripts. If a character is not representable in MARC-8 of a MARC-21 record, then UTF-8 must be used instead. UTF-8 has support for many more characters than MARC-8, which is rarely used outside library data.

Registration authority

Registration authorities exist for many standards organizations, such as ANNA (Association of National Numbering Agencies for ISIN), the Object Management Group, W3C, IEEE and others. In general, registration authorities all perform a similar function, in promoting the use of a particular standard through facilitating its use. This may be by applying the standard, where appropriate, or by verifying that a particular application satisfies the standard's tenets. Maintenance agencies, in contrast, may change an element in a standard based on set rules, such as the creation or change of a currency code when a currency is created or revalued (e.g. TRL to TRY for Turkish lira). The Object Management Group has an additional concept of certified provider, which is deemed an entity permitted to perform some functions on behalf of the registration authority, under specific processes and procedures documented within the standard for such a role.

An ISO registration authority (RA) is not authorized to update standards but provides a registration function to facilitate implementation of an International Standard (e.g. ISBN number for books). Frequently, facilitating the implementation of an ISO standard’s requirements is best suited, by its nature, to one entity, an RA. This, de facto, creates a monopoly situation, which is why care needs to be taken with respect to the functions carried out and the fees charged to avoid an abuse of such a situation. In most cases, there is a formal legal contract in place between the standards body, such as the ISO General Secretariat, and the selected registration authority.

ISO registration authorities differ from a maintenance agency. Maintenance agencies are authorized to update particular elements in an International Standard and as a matter of policy, the secretariats of MAs are assigned to bodies forming part of the ISO system (member bodies or organizations to which a member body delegates certain tasks in its country). The membership of MAs and their operating procedures are subject to approval by the Technical Management Board.

While registration authorities for a particular standard typically do not change, the position is not formally guaranteed and is subject to review and reassignment to a different firm or organization. In some cases, the concept of a registration authority may not exist for a standard at all.

By further example, the equivalent registration authority organization for Internet standards is the Internet Assigned Numbers Authority.

Shift Out and Shift In characters

Shift Out (SO) and Shift In (SI) are ASCII control characters 14 and 15, respectively (0x0E and 0x0F). These are sometimes also called "Control-N" and "Control-O".

The original meaning of those characters provided a way to shift a coloured ribbon, split longitudinally usually with red and black, up and down to the other colour in an electro-mechanical typewriter or teleprinter, such as the Teletype Model 38, to automate the same function of manual typewriters. Black was the conventional ambient default colour and so was shifted "in" or "out" with the other colour on the ribbon.

Later advancements in technology led to the use of this function for switching to a different font or character set and back. This was used, for instance, in the Russian character set known as KOI7, where SO starts printing Russian letters, and SI starts printing Latin letters again. SO/SI control characters are also used to display VT-100 pseudographics, and emoji (Japanese picture icons) on SoftBank Mobile. The ISO/IEC 2022 standard specifies their generalized usage.

Text editor

A text editor is a type of computer program that edits plain text. Such programs are sometimes known as "notepad" software, following the naming of Microsoft Notepad. Text editors are provided with operating systems and software development packages, and can be used to change files such as configuration files, documentation files and programming language source code.


This page is based on a Wikipedia article.
Text is available under the CC BY-SA 3.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.