CCSID

A CCSID (coded character set identifier) is a 16-bit number that represents a particular encoding of a specific code page. For example, Unicode is a code page that has several encoding forms, like UTF-8, UTF-16 and UTF-32.

Difference between a code page and a CCSID

The terms code page and CCSID are often used interchangeably, even though they are not synonymous. A code page may be only part of what makes up a CCSID. The following definitions from IBM help to illustrate this point:

  • A glyph is the actual physical pattern of pixels or ink that shows up on a display or printout.
  • A character is a concept that covers all glyphs associated with a certain symbol. For instance, an "F" rendered in bold, in italics, underlined, in color, or in a different font is a different glyph each time, but all of these glyphs represent the same character. The various modifiers (bold, italic, underline, color, and font) do not change the F's essential F-ness.
  • A character set contains the characters necessary to allow a particular human to carry on a meaningful interaction with the computer. It does not specify how those characters are represented in a computer.[1] This level is the first one to separate characters into various alphabets (Latin, Arabic, Hebrew, Cyrillic, and so on) or ideographic groups (e.g., Chinese, Korean). It corresponds to a "character repertoire" in the Unicode encoding model.
  • A code page represents a particular assignment of code point values to characters.[1] It corresponds to a "coded character set" in the Unicode encoding model. A code point for a character is the computer's internal representation of that character in a given code page.[1] Many characters are represented by different code points in different code pages. Certain character sets can be adequately represented with single-byte code pages (which have a maximum 256 code points, hence a maximum of 256 characters), but many require more than that. Examples include JIS X 0208 and Unicode.
  • An encoding scheme is the byte format of a code page. It maps code point values to sequences of one or more byte values in a computer.[2] For example, UTF-8 and UTF-16BE are two encodings of the same Unicode code page. In IBM's character data representation architecture (CDRA), this is typically represented with an ESID (encoding scheme identifier).[3] EUC and ISO-2022 are other examples of encoding schemes.
  • A coded character set identifier (CCSID) contains all of the information necessary to assign and preserve the meaning and rendering of characters through various stages of processing and interchange. This information always includes at least one code page, but may include multiple code pages of differing byte-lengths. The CCSID also has an associated encoding scheme that governs how various code points are to be handled. This mechanism allows a program to recognize bidirectional orientation, character shaping (mainly of Arabic characters), and other complex encoding information.
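The difference between a code page (an assignment of code points) and an encoding scheme (a byte format) can be illustrated with a minimal Python sketch. It assumes that Python's built-in codecs `latin-1`, `cp037` and `cp500` are acceptable stand-ins for single-byte code pages; these are Python's codec names, not CDRA identifiers.

```python
# One character, three single-byte code pages: the code point differs.
code_points = {
    codec: "[".encode(codec).hex()
    for codec in ("latin-1", "cp037", "cp500")
}
print(code_points)

# One coded character set (Unicode), two encoding schemes: the bytes differ.
euro = "\u20ac"  # EURO SIGN, code point U+20AC
utf8_bytes = euro.encode("utf-8")
utf16be_bytes = euro.encode("utf-16-be")
print(utf8_bytes.hex(), utf16be_bytes.hex())  # e282ac 20ac
```

The first part changes the code page while the character stays fixed; the second part keeps the code point fixed and changes only the byte format.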

Examples

The following examples show how some CCSIDs are made up of other CCSIDs.

CCSID 932
Character set   Code page   CCSID   Encoding scheme
1122            897         897     SBCS
370             301         301     DBCS

CCSID 942
Character set   Code page   CCSID   Encoding scheme
1172            1041        1041    SBCS
370             301         301     DBCS

CCSID 5028
Character set   Code page   CCSID   Encoding scheme
1170            897         4993    SBCS
370             301         301     DBCS

All three of these Shift-JIS variants are multi-byte character sets (MBCS). The double-byte character set (DBCS) portion is the same in each CCSID; only the single-byte character set (SBCS) portion differs. CCSID 932 uses the original code page 897, which is CCSID 897. CCSID 5028 uses an updated code page 897, identified as CCSID 4993. CCSID 942 uses a different SBCS from the other two, which is 1041.
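The composition shown in the tables can be modelled as plain data. The following is a sketch only: the `Component` class, the `MIXED_CCSIDS` mapping and the `part` helper are hypothetical names introduced for illustration, and the numbers are copied from the tables above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    character_set: int
    code_page: int
    ccsid: int
    encoding_scheme: str  # "SBCS" or "DBCS"


# Values copied from the three tables in this section.
MIXED_CCSIDS = {
    932: (Component(1122, 897, 897, "SBCS"), Component(370, 301, 301, "DBCS")),
    942: (Component(1172, 1041, 1041, "SBCS"), Component(370, 301, 301, "DBCS")),
    5028: (Component(1170, 897, 4993, "SBCS"), Component(370, 301, 301, "DBCS")),
}


def part(ccsid, scheme):
    """Return the component of a mixed CCSID with the given encoding scheme."""
    return next(c for c in MIXED_CCSIDS[ccsid] if c.encoding_scheme == scheme)


dbcs_parts = {part(c, "DBCS") for c in MIXED_CCSIDS}
sbcs_ccsids = {c: part(c, "SBCS").ccsid for c in MIXED_CCSIDS}
print(len(dbcs_parts))  # 1
print(sbcs_ccsids)      # {932: 897, 942: 1041, 5028: 4993}
```

Collecting the DBCS components into a set yields a single element because all three CCSIDs share it, while the SBCS component differs in each.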

Note also that CCSIDs 5028 and 4993 each differ by 4096 (1000 in hexadecimal) from their predecessor CCSIDs, 932 and 897, which share the same code page identifier. This is a common way that CDRA denotes an upgraded CCSID.
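The arithmetic can be checked with a short Python sketch. The helper `split_ccsid` is a hypothetical name, and it simply assumes that a CCSID was formed as a base value plus a multiple of 4096. That holds for the CCSIDs named in this section, but it is a convention and not a guarantee for every CCSID.

```python
STEP = 0x1000  # 4096, the offset described above


def split_ccsid(ccsid):
    """Return (base, revision), assuming ccsid == base + revision * 0x1000."""
    revision, base = divmod(ccsid, STEP)
    return base, revision


for ccsid in (932, 5028, 897, 4993):
    print(ccsid, split_ccsid(ccsid))
# 932 (932, 0)
# 5028 (932, 1)
# 897 (897, 0)
# 4993 (897, 1)
```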

There are a few reasons for this complexity:

  • Many of the CCSIDs are used in IBM databases, like DB2, where a database field only supports an SBCS, DBCS or MBCS string. CCSIDs allow programs to differentiate between which one is being used.
  • When characters are added or replaced, like the Euro currency sign introduction, one can know whether the stored strings support or do not support those character additions because a different CCSID is being used. This versioning is important for the integrity of the data.
  • It enables reuse of resources among similar CCSIDs.[4]

References

  1. ^ a b c "IBM Terminology—Terms C". IBM. Retrieved 2013-01-25.
  2. ^ "IBM Character Data Representation Architecture, Appendix A. Encoding Schemes". IBM. Retrieved 2013-01-25.
  3. ^ "IBM Character Data Representation Architecture, Chapter 3. CDRA Identifiers". section "Long-Form Identification". Retrieved 2013-01-25.
  4. ^ http://www.ibm.com/software/globalization/cdra/chapter7.html

Code page 875

IBM code page 875 (CCSIDs 875, 4971, 9067) is an EBCDIC code page with the full Greek character set, used on IBM mainframes. It superseded code page 423.

In CCSID 4971 (November 1998), the euro sign was added in position FC.

In CCSID 9067 (March 2005), the drachma sign and Greek ypogegrammeni were added in positions E2 and EC, respectively, to match the characters that were added to ISO-8859-7.

Code page 896

Code page 896, called Japan 7-Bit Katakana Extended, is IBM's code page for code-set G2 of EUC-JP, a 7-bit code page representing the Kana set (upper half) of JIS X 0201 and accompanying Code page 895 which corresponds to the lower half of that standard. It encodes half-width katakana.

The code page defines five extended characters in addition to the standard JIS X 0201 assignments; use of these characters is not permitted by the corresponding CCSID 896, but is permitted by the alternative CCSID 4992. Code page 896 is a 7-bit encoding and therefore does not use the high bit. When it is used as the right half of an 8-bit encoding, all values except 0x20 use encoding bytes 0x80 above those defined in the code page (i.e. with the high bit set).
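The relocation into the right half of an 8-bit encoding can be sketched in Python. Both function names are hypothetical, and `kana_to_unicode` covers only the standard JIS X 0201 kana positions (0x21 to 0x5F), which map to the Unicode halfwidth forms starting at U+FF61. The five extended characters of code page 896 are not modelled.

```python
def to_right_half(code: int) -> int:
    """Move a 7-bit code point into the right half of an 8-bit encoding.

    Per the text, every value except 0x20 is shifted up by 0x80.
    """
    if not 0 <= code <= 0x7F:
        raise ValueError("expected a 7-bit value")
    return code if code == 0x20 else code | 0x80


def kana_to_unicode(code: int) -> str:
    """Map a standard JIS X 0201 kana position (0x21-0x5F) to Unicode."""
    if not 0x21 <= code <= 0x5F:
        raise ValueError("outside the standard JIS X 0201 kana range")
    return chr(0xFF61 + (code - 0x21))


print(hex(to_right_half(0x31)))  # 0xb1
print(kana_to_unicode(0x31))     # ｱ (U+FF71)
```

As a cross-check, Python's `shift_jis` codec decodes the single byte 0xB1 to the same halfwidth katakana character.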

Code page 930

CCSID 930 (sometimes known as CP930 or codepage 930) is one of several Japanese EBCDIC code pages created by IBM for representation of Japanese text. It is commonly used on IBM z/OS and IBM System i operating system.

It encodes halfwidth Katakana, fullwidth Katakana, Hiragana and Kanji.

EBCDIC 002

IBM code page 2 (CCSID 2) is an EBCDIC code page used in IBM mainframes in the United States.

EBCDIC 012

IBM code page 12 (CCSID 12) is an EBCDIC code page used in IBM mainframes in Italy to support the Italian language.

EBCDIC 037

IBM code page 37 is an EBCDIC code page with the full Latin-1 character set used in IBM mainframes. It is used in some English- and Portuguese-speaking countries, including Australia, Brazil, Canada, New Zealand, Portugal, South Africa, and the United States.

CCSID 1140 is the Euro currency update of code page/CCSID 37. In that code page, the "¤" (currency sign) character at code point 9F is replaced with the "€" (Euro sign) character.
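The single-position difference can be checked with a short Python sketch. It assumes that Python's built-in `cp037` and `cp1140` codecs correspond to CCSIDs 37 and 1140.

```python
all_bytes = bytes(range(256))
old = all_bytes.decode("cp037")
new = all_bytes.decode("cp1140")

# Collect every byte position whose character differs between the two codecs.
differences = [(i, old[i], new[i]) for i in range(256) if old[i] != new[i]]
for position, before, after in differences:
    print(f"{position:02X}: U+{ord(before):04X} -> U+{ord(after):04X}")
# 9F: U+00A4 -> U+20AC
```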

EBCDIC 1025

IBM code page 1025 (CCSID 1025) is an EBCDIC code page with the full Cyrillic character set, used on IBM mainframes. It is a revision of EBCDIC 880 that covers the whole Cyrillic character set.

CCSID 1154 is the Euro currency update of code page/CCSID 1025. In that code page, the "§" (section sign) character at code point E1 is replaced with the "€" (Euro sign) character.

EBCDIC 1026

IBM code page 1026 (CCSID 1026) is an EBCDIC code page with full Latin-5-charset used in IBM mainframes.

CCSID 1155 is the Euro currency update of code page/CCSID 1026. In that code page, the "¤" (currency sign) character at code point 9F is replaced with the "€" (Euro sign) character.

EBCDIC 277

IBM code page 277 is an EBCDIC code page with the full Latin-1 character set used in IBM mainframes. It is used in Denmark and Norway.

CCSID 1142 is the Euro currency update of code page/CCSID 277. In that code page, the "¤" (currency) character at code point 5A is replaced with the "€" (Euro) character.

In this code page, characters 00–3F and FF are controls, 40 is space, 41 is no-break space, and CA is soft hyphen.

EBCDIC 278

IBM code page 278 (CCSID 278) is an EBCDIC code page with the full Latin-1 character set used in IBM mainframes. It is used in Finland and Sweden.

CCSID 1143 is the Euro currency update of code page/CCSID 278. In that code page, the "¤" (currency sign) character at code point 5A is replaced with the "€" (Euro sign) character.

EBCDIC 280

IBM code page 280 (CCSID 280) is an EBCDIC code page with full Latin-1-charset used in IBM mainframes. It is used in Italy.

CCSID 1144 is the Euro currency update of code page/CCSID 280. In that code page, the "¤" (currency sign) character at code point 9F is replaced with the "€" (Euro sign) character.

EBCDIC 284

IBM code page 284 (CCSID 284) is an EBCDIC code page with full Latin-1-charset used in IBM mainframes. It is used in Spain and Latin America.

CCSID 1145 is the Euro currency update of code page/CCSID 284. In that code page, the "¤" (currency sign) character at code point 9F is replaced with the "€" (Euro sign) character.

EBCDIC 285

IBM code page 285 is an EBCDIC code page with full Latin-1-charset used in IBM mainframes. It is used in Ireland and the United Kingdom.

CCSID 1146 is the Euro currency update of code page/CCSID 285. In that code page, the "¤" (currency sign) character at code point 9F is replaced with the "€" (Euro sign) character.

For other English-speaking countries, see EBCDIC code page 037.

EBCDIC 289

IBM code page 289 (CCSID 289) is an EBCDIC code page used on IBM mainframes in Spain to support the Spanish language.

EBCDIC 297

IBM code page 297 (CCSID 297) is an EBCDIC code page with full Latin-1-charset used in IBM mainframes. It is used in France.

CCSID 1147 is the Euro currency update of code page/CCSID 297. In that code page, the "¤" (currency sign) character at code point 9F is replaced with the "€" (Euro sign) character.

EBCDIC 424

IBM code page 424 is an EBCDIC code page that supports Hebrew used in IBM mainframes.

In CCSID 8616, the directional controls were added at positions DB, DE, DF, FB, FC, FD, and FE. In CCSID 12712, the euro sign was added at position 9C and the new sheqel sign was added at position 9E.

EBCDIC 500

IBM code page 500 (CCSID 500) is an EBCDIC code page with full Latin-1-charset used in IBM mainframes.

CCSID 1148 is the Euro currency update of code page/CCSID 500. In that code page, the "¤" (currency sign) character at code point 9F is replaced with the "€" (Euro sign) character.
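A Euro update of this kind amounts to changing one entry in a 256-entry decoding table. The sketch below builds such a table from Python's built-in `cp500` codec and swaps position 9F. The helper `with_replacement` is a hypothetical name, and the resulting table only illustrates the substitution described above; it is not an official CCSID 1148 mapping.

```python
import codecs


def with_replacement(base_codec: str, position: int, new_char: str) -> str:
    """Return a 256-character decoding table with one position replaced."""
    table = list(bytes(range(256)).decode(base_codec))
    table[position] = new_char
    return "".join(table)


euro_table = with_replacement("cp500", 0x9F, "\u20ac")
data = bytes([0x9F])

original = data.decode("cp500")
updated, _ = codecs.charmap_decode(data, "strict", euro_table)
print(original, updated)  # ¤ €
```

Every other byte decodes exactly as it does under `cp500`, which is why stored data must carry the right CCSID to tell the two interpretations of 9F apart.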

It superseded EBCDIC 256.

EBCDIC 870

IBM code page 870 (CCSID 870) is an EBCDIC code page with full Latin-2-charset used in IBM mainframes.

CCSID 1110 replaces the "˚" (ring above) character at code point 90 with the "°" (degree sign) character.

CCSID 1153 is the Euro currency update of code page/CCSID 870. In that code page, the "¤" (currency sign) character at code point 9F is replaced with the "€" (Euro sign) character.

EBCDIC 871

IBM code page 871 (CCSID 871) is an EBCDIC code page with full Latin-1-charset used in IBM mainframes. It is used in Iceland.

CCSID 1149 is the Euro currency update of code page/CCSID 871. In that code page, the "¤" (currency sign) character at code point 9F is replaced with the "€" (Euro sign) character.


This page is based on a Wikipedia article written by authors (here).
Text is available under the CC BY-SA 3.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.