The structure of EUC is based on the ISO-2022 standard, which specifies a way to represent character sets containing a maximum of 94 characters, or 8836 (94²) characters, or 830584 (94³) characters, as sequences of 7-bit codes. Only ISO-2022 compliant character sets can have EUC forms. Up to four coded character sets (referred to as G0, G1, G2, and G3 or as code sets 0, 1, 2, and 3) can be represented with the EUC scheme.
G0 is almost always an ISO-646 compliant coded character set such as US-ASCII, ISO 646:KR (KS X 1003) or ISO 646:JP (the lower half of JIS X 0201) that is invoked on GL (i.e. with the most significant bit cleared). One common deviation from US-ASCII is that 0x5C (backslash in US-ASCII) is often used to represent a Yen sign in EUC-JP (see below) and a Won sign in EUC-KR.
To get the EUC form of an ISO-2022 character, the most significant bit of each 7-bit byte of the original ISO 2022 codes is set (by adding 128 to each of these original 7-bit codes); this allows software to easily distinguish whether a particular byte in a character string belongs to the ISO-646 code or the ISO-2022 (EUC) code.
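As a minimal sketch of this step, the high bit can be set on each 7-bit byte as shown here. The helper name `to_euc` is hypothetical, and the results are checked only against Python's built-in `euc_jp` and `gb2312` codecs.

```python
def to_euc(seven_bit: bytes) -> bytes:
    """Set the most significant bit of every 7-bit byte (i.e. add 128)."""
    if any(b > 0x7F for b in seven_bit):
        raise ValueError("input must consist of 7-bit bytes only")
    return bytes(b | 0x80 for b in seven_bit)


# JIS X 0208 assigns hiragana "あ" the 7-bit code 0x24 0x22;
# GB 2312 assigns "啊" the 7-bit code 0x30 0x21.
jis_a = to_euc(b"\x24\x22")
gb_a = to_euc(b"\x30\x21")
print(jis_a.hex(), jis_a.decode("euc_jp"))  # a4a2 あ
print(gb_a.hex(), gb_a.decode("gb2312"))    # b0a1 啊
```

Because every byte of the result is 0x80 or above, none of them can be confused with a byte of the ISO-646 set.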
The most commonly used EUC codes are variable-width encodings with a character belonging to G0 (ISO-646 compliant coded character set) taking one byte and a character belonging to G1 (taken by a 94x94 coded character set) represented in two bytes. The EUC-CN form of GB2312 and EUC-KR are examples of such two-byte EUC codes. EUC-JP includes characters represented by up to three bytes whereas a single character in EUC-TW can take up to four bytes.
Modern applications are more likely to use UTF-8, which supports all of the glyphs of the EUC codes, and more, and is generally more portable with fewer vendor deviations and errors.
|MIME / IANA||GB2312|
|Language(s)||Simplified Chinese, English, Russian|
|Standard||GB 2312 (1980)|
|Classification||Extended ASCII, Variable-width encoding, CJK encoding, EUC|
|Extensions||748, GBK, GB18030, x-mac-chinesesimp|
|Transforms / Encodes||GB 2312|
|Succeeded by||GBK, GB18030|
EUC-CN is the usual way to use the GB2312 standard for simplified Chinese characters. Unlike the case of Japanese, the ISO-2022 form of GB2312 is not normally used, though a variant form called HZ was sometimes used on USENET. An ASCII character is represented in its usual encoding. A character from GB 2312 is represented by two bytes, both in the range 0xA1–0xFE.
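The two-byte layout can be illustrated by recovering the row and cell of each GB 2312 character from its EUC-CN bytes. This is a rough sketch: the function name `euc_cn_cells` is hypothetical, and the row/cell arithmetic (subtracting 0xA0 from each byte) is an assumption that follows from the 94x94 structure described earlier.

```python
def euc_cn_cells(data: bytes):
    """Yield ('ascii', byte) or ('gb2312', row, cell) for each character."""
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # Code set 0: one byte, usual ASCII encoding.
            yield ("ascii", b)
            i += 1
        elif (0xA1 <= b <= 0xFE and i + 1 < len(data)
              and 0xA1 <= data[i + 1] <= 0xFE):
            # Code set 1: two bytes, each 0xA0 + (row or cell number).
            yield ("gb2312", b - 0xA0, data[i + 1] - 0xA0)
            i += 2
        else:
            raise ValueError(f"invalid EUC-CN byte 0x{b:02X} at offset {i}")


print(list(euc_cn_cells("A啊".encode("gb2312"))))
# [('ascii', 65), ('gb2312', 16, 1)]
```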
An encoding related to EUC-CN is the "748" code used in the WITS typesetting system developed by Beijing's Founder Technology (now obsoleted by its newer FITS typesetting system). The 748 code contains all of GB2312, but is not ISO 2022–compliant and therefore not a true EUC code. (It uses an 8-bit lead byte but distinguishes between a second byte with its most significant bit set and one with its most significant bit cleared, and is therefore more similar in structure to Big5 and other non–ISO 2022–compliant DBCS encoding systems.) The non-GB2312 portion of the 748 code contains traditional and Hong Kong characters and other glyphs used in newspaper typesetting.
GBK is an extension to GB2312. It defines an extended form of the EUC-CN encoding capable of representing a larger array of CJK characters sourced largely from Unicode 1.1, including traditional Chinese characters and characters used only in Japanese. It is not, however, a true EUC code, because ASCII bytes may appear as trail bytes (and C1 bytes, not limited to the single shifts, may appear as lead or trail bytes), due to a larger encoding space being required.
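The structural difference can be observed with Python's built-in `gbk` and `gb2312` codecs. In this small sketch the helper `within_gr` is a hypothetical name, and the two sample characters are chosen only for illustration: "龙" is in GB2312, while its traditional form "龍" is only available through the GBK extension.

```python
def within_gr(data: bytes) -> bool:
    """True if every byte lies in 0xA1-0xFE, as EUC code set 1 requires."""
    return all(0xA1 <= b <= 0xFE for b in data)


for ch in ("龙", "龍"):
    encoded = ch.encode("gbk")
    # The GB2312 character stays inside the EUC range; the GBK-only one does not.
    print(ch, encoded.hex(), within_gr(encoded))
```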
The Unicode-based GB18030 character encoding defines an extension of GBK capable of encoding the entirety of Unicode. However, Unicode encoded as GB18030 is a variable-width encoding which may use up to four bytes per character, due to an even larger encoding space being required. Being an extension of GBK, it is a superset of EUC-CN but is not itself a true EUC code. Being a Unicode encoding, its repertoire is identical to that of other Unicode transformation formats such as UTF-8.
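A brief illustration using Python's built-in `gb18030` codec (the sample characters are arbitrary choices): ASCII stays one byte, a GB2312 character keeps its two EUC-CN bytes, and a character outside the Basic Multilingual Plane takes four bytes.

```python
samples = ["A", "啊", "😀"]

for ch in samples:
    encoded = ch.encode("gb18030")
    # Every Unicode character round-trips through GB18030.
    assert encoded.decode("gb18030") == ch
    print(repr(ch), len(encoded), encoded.hex())
```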
|MIME / IANA||EUC-JP|
|Alias(es)||Unixized JIS (UJIS), csEUCPkdFmtJapanese|
|Language(s)||Japanese, English, Russian|
|Classification||Extended ISO 646, Variable-width encoding, CJK encoding, EUC|
|Extends||US-ASCII or ISO 646:JP|
|Transforms / Encodes||JIS X 0208, JIS X 0212, JIS X 0201|
|Language(s)||Japanese, Ainu, English, Russian|
|Standard||JIS X 0213|
|Classification||Extended ASCII, Variable-width encoding, CJK encoding, EUC|
|Transforms / Encodes||JIS X 0213, JIS X 0201 (Kana)|
EUC-JP is a variable-width encoding used to represent the elements of three Japanese character set standards, namely JIS X 0208, JIS X 0212, and JIS X 0201. As of August 2018, 0.1% of all web pages use EUC-JP. Other names for this encoding include Unixized JIS (or UJIS) and AT&T JIS. It is called Code page 954 by IBM. Microsoft has two code page numbers for this encoding (51932 and 20932).
This encoding scheme allows the easy mixing of 7-bit ASCII and 8-bit Japanese without the need for the escape characters employed by ISO-2022-JP, which is based on the same character set standards, and without ASCII bytes appearing as trail bytes (unlike Shift JIS).
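The contrast between the three encodings can be seen by encoding one kanji with Python's built-in codecs. This is only an illustrative sketch: the character "表" is picked because its Shift JIS trail byte happens to be 0x5C, and `has_ascii_bytes` is a hypothetical helper name.

```python
def has_ascii_bytes(encoded: bytes) -> bool:
    """True if any byte of the encoded text falls in the 7-bit range."""
    return any(b < 0x80 for b in encoded)


text = "表"
sjis = text.encode("shift_jis")     # trail byte is 0x5C (backslash / yen)
eucjp = text.encode("euc_jp")       # both bytes have the high bit set
iso2022 = text.encode("iso2022_jp") # wrapped in ESC sequences

print(sjis.hex(), has_ascii_bytes(sjis))
print(eucjp.hex(), has_ascii_bytes(eucjp))
print(iso2022)
```

A byte-oriented search for a backslash would therefore find a false match in the Shift JIS form but not in the EUC-JP form.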
Compared to EUC-CN or EUC-KR, EUC-JP did not become as widely adopted on PC and Macintosh systems in Japan, which used Shift JIS or its extensions (Windows code page 932 on Microsoft Windows, and MacJapanese on classic Mac OS), although it became heavily used by Unix or Unix-like operating systems (except for HP-UX). Therefore, whether Japanese web sites use EUC-JP or Shift_JIS often depends on what OS the author uses.
Vendor extensions to EUC-JP were usually allocated within the individual code sets, as opposed to using invalid EUC sequences (as in popular extensions of EUC-CN and EUC-KR).
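Characters are encoded as follows: a character from US-ASCII or ISO 646:JP (code set 0) is represented by a single byte in its usual encoding. A character from JIS X 0208 (code set 1) is represented by two bytes, both in the range 0xA1–0xFE. A half-width katakana character from the upper half of JIS X 0201 (code set 2) is represented by two bytes: 0x8E followed by a byte in the range 0xA1–0xDF. A character from JIS X 0212 (code set 3) is represented by three bytes: 0x8F followed by two bytes in the range 0xA1–0xFE.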
EUC-KR code structure
|MIME / IANA||EUC-KR|
|Language(s)||Korean, English, Russian|
|Standard||KS X 2901 (KS C 5861)|
|Classification||Extended ISO 646, Variable-width encoding, CJK encoding, EUC|
|Extends||US-ASCII or ISO 646:KR|
|Extensions||Mac OS Korean, IBM-949, Unified Hangul Code (Windows-949)|
|Transforms / Encodes||KS X 1001|
|Succeeded by||Unified Hangul Code (web standards)|
EUC-KR is a variable-width encoding to represent Korean text using two coded character sets, KS X 1001 (formerly KS C 5601) and either ISO 646:KR (KS X 1003, formerly KS C 5636) or US-ASCII, depending on variant. KS X 2901 (formerly KS C 5861) stipulates the encoding and RFC 1557 dubbed it as EUC-KR.
A character drawn from KS X 1001 (G1, code set 1) is encoded as two bytes in GR (0xA1–0xFE) and a character from KS X 1003 or US-ASCII (G0, code set 0) takes one byte in GL (0x21–0x7E).
When used with ASCII, it is called Code page 970 by IBM. It is known as Code page 51949 by Microsoft. It is usually referred to as Wansung (Korean: 완성, romanized: Wanseong, lit. 'precomposed') in the Republic of Korea.
A common extension of EUC-KR is the Unified Hangul Code (통합형 한글 코드, Tonghabhyeong Hangeul Kodeu, or 통합 완성형, Tonghab Wansunghyung), which is the default Korean codepage on Microsoft Windows (code page 949, numbered 1363 by IBM). The W3C/WHATWG Encoding Standard used by HTML5 incorporates the Unified Hangul Code extensions into its definition of EUC-KR. Other EUC-KR compatible extensions include the Mac OS Korean encoding, used by the classic Mac OS. IBM's code page 949 is yet another, unrelated, EUC-KR extension. Similarly to the EUC-CN extensions described above, these extensions do not conform to the EUC structure.
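To make the structural point concrete, the following sketch checks whether a byte string fits the strict EUC-KR layout (one GL byte, or two GR bytes in 0xA1–0xFE). The function name `fits_euc_kr_structure` is hypothetical; Python's `cp949` codec is used as an implementation of the Unified Hangul Code, and the syllable "똠" is an example of a Hangul syllable that lies outside KS X 1001.

```python
def fits_euc_kr_structure(data: bytes) -> bool:
    """Check that data is a sequence of GL bytes and GR byte pairs."""
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            i += 1
        elif (0xA1 <= b <= 0xFE and i + 1 < len(data)
              and 0xA1 <= data[i + 1] <= 0xFE):
            i += 2
        else:
            return False
    return True


print(fits_euc_kr_structure("가".encode("cp949")))  # True: KS X 1001 character
print(fits_euc_kr_structure("똠".encode("cp949")))  # False: UHC extension area
```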
As of March 2019, 0.2% of all web pages use EUC-KR. Including extensions, it is the most widely used legacy character encoding in Korea on all three major platforms (Unix-like OS, Windows and macOS), but its use has been very slowly decreasing as UTF-8 gains popularity, especially on Linux and macOS.
As with most other encodings, UTF-8 is now preferred for new use, solving problems with consistency between platforms and vendors.
EUC-TW is a variable-width encoding that supports US-ASCII and 16 planes of CNS 11643, each of which is 94x94. It is a rarely used encoding for traditional Chinese characters as used in Taiwan. Big5 is much more common.
Note that plane 1 of CNS 11643 is encoded twice: as code set 1 and as part of code set 2.
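The double encoding of plane 1 can be sketched as follows. Python ships no EUC-TW codec, so this is a hand-written illustration: the function `euc_tw_bytes` and its parameters are hypothetical, and the plane byte (0xA0 plus the plane number, following the 0x8E single shift) is taken from the commonly documented EUC-TW layout rather than from the text above.

```python
def euc_tw_bytes(plane: int, row: int, cell: int,
                 use_code_set_2: bool = False) -> bytes:
    """Build the EUC-TW bytes for a CNS 11643 plane/row/cell position."""
    if not (1 <= plane <= 16 and 1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("plane must be 1-16, row and cell must be 1-94")
    pair = bytes([0xA0 + row, 0xA0 + cell])
    if plane == 1 and not use_code_set_2:
        # Code set 1: plane 1 as a plain two-byte sequence.
        return pair
    # Code set 2: single shift 0x8E, plane byte, then the two-byte position.
    return bytes([0x8E, 0xA0 + plane]) + pair


print(euc_tw_bytes(1, 36, 1).hex())                       # c4a1
print(euc_tw_bytes(1, 36, 1, use_code_set_2=True).hex())  # 8ea1c4a1
print(euc_tw_bytes(2, 1, 1).hex())                        # 8ea2a1a1
```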
UTF-8 is becoming more common than EUC-TW, as with most code pages.
The encodings described above (using bytes in 0x21–0x7E for code set 0, bytes in 0xA1–0xFE for code set 1, 0x8E followed by bytes in 0xA1–0xFE for code set 2 and 0x8F followed by bytes in 0xA1–0xFE for code set 3) are in a variable-width form referred to as the EUC packed format. This is the form usually labelled as EUC.
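The packed format lends itself to a simple lead-byte dispatch. The sketch below splits a byte string into characters and reports the code set of each; the function name `split_packed_euc` and the table `EUC_JP_GR_BYTES` are hypothetical, and the per-code-set byte counts are those of EUC-JP (other EUC encodings, such as EUC-TW, use different counts after 0x8E).

```python
# Number of GR bytes (0xA1-0xFE) per character in each code set, for EUC-JP.
EUC_JP_GR_BYTES = {1: 2, 2: 1, 3: 2}


def split_packed_euc(data: bytes, gr_bytes=None):
    """Return a list of (code_set, raw_bytes) for a packed-format EUC string."""
    gr_bytes = gr_bytes or EUC_JP_GR_BYTES
    chars = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            code_set, prefix, count = 0, 0, 1
        elif lead == 0x8E:                    # single shift 2
            code_set, prefix, count = 2, 1, gr_bytes[2]
        elif lead == 0x8F:                    # single shift 3
            code_set, prefix, count = 3, 1, gr_bytes[3]
        elif 0xA1 <= lead <= 0xFE:
            code_set, prefix, count = 1, 0, gr_bytes[1]
        else:
            raise ValueError(f"unexpected byte 0x{lead:02X} at offset {i}")
        end = i + prefix + count
        chunk = data[i:end]
        if len(chunk) != prefix + count:
            raise ValueError("truncated character at end of input")
        if code_set and not all(0xA1 <= b <= 0xFE for b in chunk[prefix:]):
            raise ValueError(f"invalid trail byte in character at offset {i}")
        chars.append((code_set, chunk))
        i = end
    return chars


sample = "Aあｱ".encode("euc_jp") + b"\x8f\xb0\xa1"
for code_set, raw in split_packed_euc(sample):
    print(code_set, raw.hex())
```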
Internal processing may make use of a fixed-length alternative form called the EUC complete two-byte format. This represents:
Initial bytes of 0x00 and 0x80 are used in cases where the code set uses only one byte. There is also a four-byte fixed length format. These fixed length forms are suited to internal processing and are not usually encountered in interchange.
EUC-JP is registered with the IANA in both formats, the packed format as "EUC-JP" or "csEUCPkdFmtJapanese" and the fixed width format as "csEUCFixWidJapanese". Only the packed format is included in the WHATWG Encoding Standard used by HTML5.
The C0 and C1 control code or control character sets define control codes for use in text by computer systems that use the ISO/IEC 2022 system of specifying control and graphic characters. Most character encodings, in addition to representing printable characters, also have characters such as these that represent additional information about the text, such as the position of a cursor, an instruction to start a new line, or a message that the text has been received.
The C0 set defines codes in the range 0x00–0x1F and the C1 set defines codes in the range 0x80–0x9F. The default C0 set was originally defined in ISO 646 (ASCII), while the default C1 set was originally defined in ECMA-48 (harmonized later with ISO 6429). While other C0 and C1 sets are available for specialized applications, they are rarely used.
CNS 11643
The CNS 11643 character set (Chinese National Standard 11643), also officially known as the "Chinese Standard Interchange Code" (中文標準交換碼), is officially the standard character set of the Republic of China.
(In practice, variants of Big5 are de facto standard.)
CNS 11643 is a superset of ASCII designed to conform to ISO 2022.
It contains 16 planes, so the maximum possible number of encodable characters is 16×94×94 = 141376.
Planes 12 to 15 (35344 code points) are specifically designated for user-defined characters.
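The figures above follow directly from the 94x94 plane size, as this short arithmetic check shows (the variable names are illustrative only):

```python
CELLS_PER_ROW = 94
ROWS_PER_PLANE = 94
PLANES = 16

per_plane = ROWS_PER_PLANE * CELLS_PER_ROW       # 8836 positions per plane
total = PLANES * per_plane                       # 141376 positions overall
user_defined = len(range(12, 16)) * per_plane    # planes 12-15: 35344 positions

print(per_plane, total, user_defined)
```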
Unlike CCCII, the encoding of variant characters in CNS 11643 is not related.
EUC-TW is a representation of CNS 11643 in Extended Unix Code (EUC) form.
DBCS
A double-byte character set (DBCS) is a character encoding in which either all characters (including control characters) are encoded in two bytes, or merely every graphic character not representable by an accompanying single-byte character set (SBCS) is encoded in two bytes (Han characters would generally comprise most of these two-byte characters). A DBCS supports national languages that contain a large number of unique characters or symbols (the maximum number of characters that can be represented with one byte is 256 characters, while two bytes can represent up to 65,536 characters). Examples of such languages include Japanese and Chinese. Korean Hangul does not contain as many characters, but KS X 1001 supports both Hangul and Hanja, and uses two bytes per character.
Extended ASCII
Extended ASCII (EASCII or high ASCII) character encodings are eight-bit or larger encodings that include the standard seven-bit ASCII characters, plus additional characters. Using the term "extended ASCII" on its own is sometimes criticized, because it can be mistakenly interpreted to mean that the ASCII standard has been updated to include more than 128 characters or that the term unambiguously identifies a single encoding, neither of which is the case.
There are many extended ASCII encodings (more than 220 DOS and Windows codepages). EBCDIC ("the other" major 8-bit character code) likewise developed many extended variants (more than 186 EBCDIC codepages) over the decades.
JIS X 0201
JIS X 0201, a Japanese Industrial Standard developed in 1969 (then called JIS C 6220 until the JIS category reform), was the first Japanese electronic character set to become widely used. It is either 7-bit encoding or 8-bit encoding, although 8-bit encoding is dominant for modern use. The full name of this standard is 7-bit and 8-bit coded character sets for information interchange (7ビット及び8ビットの情報交換用符号化文字集合).
The first 96 codes comprise an ISO 646 variant, mostly following ASCII with some differences, while the second 96 character codes represent the phonetic Japanese katakana signs. Since the encoding does not provide any way to express hiragana or kanji, it is only capable of expressing simplified written Japanese. Nevertheless, it is possible to express, at least phonetically, the full range of sounds in the language. In the 1980s, this was acceptable for media such as text mode computer terminals, telegrams, receipts or other electronically handled data.
JIS X 0201 was supplanted by subsequent encodings such as Shift JIS (which combines this standard and JIS X 0208) and later Unicode.
JIS X 0208
JIS X 0208 is a 2-byte character set specified as a Japanese Industrial Standard, containing 6879 graphic characters suitable for writing text, place names, personal names, and so forth in the Japanese language. The official title of the current standard is 7-bit and 8-bit double byte coded KANJI sets for information interchange (7ビット及び8ビットの2バイト情報交換用符号化漢字集合, Nana-Bitto Oyobi Hachi-Bitto no Ni-Baito Jōhō Kōkan'yō Fugōka Kanji Shūgō). It was originally established as JIS C 6226 in 1978, and has been revised in 1983, 1990, and 1997. It is also called Code page 952 by IBM. The 1978 version is also called Code page 955 by IBM.
Japanese language and computers
In relation to the Japanese language and computers many adaptation issues arise, some unique to Japanese and others common to languages which have a very large number of characters. The number of characters needed in order to write English is very small, and thus it is possible to use only one byte (2⁸ = 256 possible values) to encode one English character. However, the number of characters in Japanese is much more than 256 and thus cannot be encoded using a single byte; Japanese is therefore encoded using two or more bytes, in a so-called "double byte" or "multi-byte" encoding. Problems that arise relate to transliteration and romanization, character encoding, and input of Japanese text.
Variable-width encoding
A variable-width encoding is a type of character encoding scheme in which codes of differing lengths are used to encode a character set (a repertoire of symbols) for representation in a computer. Most common variable-width encodings are multibyte encodings, which use varying numbers of bytes (octets) to encode different characters.
(Some authors, notably in Microsoft documentation, use the term multibyte character set, which is a misnomer, because representation size is an attribute of the encoding, not of the character set.)
Early variable-width encodings using less than a byte per character were sometimes used to pack English text into fewer bytes in adventure games for early microcomputers. However, disks (which, unlike tapes, allowed random access, so text could be loaded on demand), increases in computer memory, and general-purpose compression algorithms have rendered such tricks largely obsolete.
Multibyte encodings are usually the result of a need to increase the number of characters which can be encoded without breaking backward compatibility with an existing constraint. For example, with one byte (8 bits) per character, one can encode 256 possible characters. To encode more than 256 characters, the obvious choice would be to use two or more bytes per encoding unit: two bytes (16 bits) would allow 65,536 possible characters. Such a change, however, would break compatibility with existing systems and therefore might not be feasible at all.
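As one illustration of a multibyte encoding that keeps compatibility with a one-byte predecessor, the sketch below uses UTF-8 through Python's built-in codec. The sample characters are arbitrary; the point is only that ASCII text keeps its single-byte form while other characters take more bytes.

```python
samples = ["A", "é", "あ", "😀"]

# Byte length of each character in UTF-8.
lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
print(lengths)  # {'A': 1, 'é': 2, 'あ': 3, '😀': 4}

# Pure ASCII text is byte-for-byte identical in ASCII and UTF-8.
assert "EUC".encode("ascii") == "EUC".encode("utf-8")
```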