Numeric character reference

A numeric character reference (NCR) is a common markup construct used in SGML and SGML-derived markup languages such as HTML and XML. It consists of a short sequence of characters that, in turn, represents a single character. Since WebSGML, XML and HTML 4, the code points of the Universal Character Set (UCS) of Unicode are used. NCRs are typically used to represent characters that are not directly encodable in a particular document (for example, because they are international characters that do not fit in the 8-bit character set being used, or because they have special syntactic meaning in the language). When the document is interpreted by a markup-aware reader, each NCR is treated as if it were the character it represents.

Examples

In SGML, HTML, and XML, the following are all valid numeric character references for the Greek capital letter Sigma:

Numerical character reference of U+03A3 Σ GREEK CAPITAL LETTER SIGMA
(3A3₁₆ = 931₁₀)
Unicode character Numerical base Numerical reference in markup Effect
U+03A3 Decimal &#931; Σ
U+03A3 Decimal &#0931; Σ
U+03A3 Hexadecimal &#x3A3; Σ
U+03A3 Hexadecimal &#x03A3; Σ
U+03A3 Hexadecimal &#x3a3; Σ
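
As an illustrative check rather than a definitive test, the equivalence of these spellings can be sketched with `html.unescape` from Python's standard library. The five reference strings are written out explicitly in the code, and the variable names are chosen only for this example.

```python
import html

# Five spellings of the same reference: leading zeros and the case of the
# hexadecimal digits do not change which code point is meant.
sigma_refs = ["&#931;", "&#0931;", "&#x3A3;", "&#x03A3;", "&#x3a3;"]

decoded = [html.unescape(ref) for ref in sigma_refs]

# Hexadecimal 3A3 and decimal 931 are the same number, the code point U+03A3.
same_code_point = int("3A3", 16) == 931

print(decoded, same_code_point)
```

Every entry of `decoded` is the single character Σ.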

In SGML, HTML, and XML, the following are all valid numeric character references for the Latin capital letter AE:

Numerical character reference of U+00C6 Æ LATIN CAPITAL LETTER AE
Unicode character Numerical base Numerical reference in markup Effect
U+00C6 Decimal &#198; Æ
U+00C6 Hexadecimal &#xC6; Æ

In SGML, HTML, and XML, the following are all valid numeric character references for the Latin small letter sharp s (ß):

Numerical character reference of U+00DF ß LATIN SMALL LETTER SHARP S
Unicode character Numerical base Numerical reference in markup Effect
U+00DF Decimal &#223; ß
U+00DF Hexadecimal &#xDF; ß

List of numeric character references for the printable ASCII characters:

Unicode character Character reference (decimal) Character reference (hexadecimal) Effect
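U+0020 &#32; &#x20; (space)
U+0021 &#33; &#x21; !
U+0022 &#34; &#x22; "
U+0023 &#35; &#x23; #
U+0024 &#36; &#x24; $
U+0025 &#37; &#x25; %
U+0026 &#38; &#x26; &
U+0027 &#39; &#x27; '
U+0028 &#40; &#x28; (
U+0029 &#41; &#x29; )
U+002A &#42; &#x2A; *
U+002B &#43; &#x2B; +
U+002C &#44; &#x2C; ,
U+002D &#45; &#x2D; -
U+002E &#46; &#x2E; .
U+002F &#47; &#x2F; /
U+0030 &#48; &#x30; 0
U+0031 &#49; &#x31; 1
U+0032 &#50; &#x32; 2
U+0033 &#51; &#x33; 3
U+0034 &#52; &#x34; 4
U+0035 &#53; &#x35; 5
U+0036 &#54; &#x36; 6
U+0037 &#55; &#x37; 7
U+0038 &#56; &#x38; 8
U+0039 &#57; &#x39; 9
U+003A &#58; &#x3A; :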
U+003B &#59; &#x3B; ;
U+003C &#60; &#x3C; <
U+003D &#61; &#x3D; =
U+003E &#62; &#x3E; >
U+003F &#63; &#x3F; ?
U+0040 &#64; &#x40; @
U+0041 &#65; &#x41; A
U+0042 &#66; &#x42; B
U+0043 &#67; &#x43; C
U+0044 &#68; &#x44; D
U+0045 &#69; &#x45; E
U+0046 &#70; &#x46; F
U+0047 &#71; &#x47; G
U+0048 &#72; &#x48; H
U+0049 &#73; &#x49; I
U+004A &#74; &#x4A; J
U+004B &#75; &#x4B; K
U+004C &#76; &#x4C; L
U+004D &#77; &#x4D; M
U+004E &#78; &#x4E; N
U+004F &#79; &#x4F; O
U+0050 &#80; &#x50; P
U+0051 &#81; &#x51; Q
U+0052 &#82; &#x52; R
U+0053 &#83; &#x53; S
U+0054 &#84; &#x54; T
U+0055 &#85; &#x55; U
U+0056 &#86; &#x56; V
U+0057 &#87; &#x57; W
U+0058 &#88; &#x58; X
U+0059 &#89; &#x59; Y
U+005A &#90; &#x5A; Z
U+005B &#91; &#x5B; [
U+005C &#92; &#x5C; \
U+005D &#93; &#x5D; ]
U+005E &#94; &#x5E; ^
U+005F &#95; &#x5F; _
U+0060 &#96; &#x60; `
U+0061 &#97; &#x61; a
U+0062 &#98; &#x62; b
U+0063 &#99; &#x63; c
U+0064 &#100; &#x64; d
U+0065 &#101; &#x65; e
U+0066 &#102; &#x66; f
U+0067 &#103; &#x67; g
U+0068 &#104; &#x68; h
U+0069 &#105; &#x69; i
U+006A &#106; &#x6A; j
U+006B &#107; &#x6B; k
U+006C &#108; &#x6C; l
U+006D &#109; &#x6D; m
U+006E &#110; &#x6E; n
U+006F &#111; &#x6F; o
U+0070 &#112; &#x70; p
U+0071 &#113; &#x71; q
U+0072 &#114; &#x72; r
U+0073 &#115; &#x73; s
U+0074 &#116; &#x74; t
U+0075 &#117; &#x75; u
U+0076 &#118; &#x76; v
U+0077 &#119; &#x77; w
U+0078 &#120; &#x78; x
U+0079 &#121; &#x79; y
U+007A &#122; &#x7A; z
U+007B &#123; &#x7B; {
U+007C &#124; &#x7C; |
U+007D &#125; &#x7D; }
U+007E &#126; &#x7E; ~
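
A list like this one is mechanical to produce. The following is a minimal sketch, with `ascii_ncr_rows` as a hypothetical helper name, that derives both reference forms from each code point in the printable ASCII range.

```python
def ascii_ncr_rows():
    """Yield (code point label, decimal NCR, hexadecimal NCR, character)."""
    for cp in range(0x20, 0x7F):  # U+0020 through U+007E inclusive
        yield (f"U+{cp:04X}", f"&#{cp};", f"&#x{cp:X};", chr(cp))


rows = list(ascii_ncr_rows())

# Show the first few rows in the same column order as the list above.
for row in rows[:3]:
    print(*row)
```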

Discussion

Markup languages are typically defined in terms of UCS or Unicode characters. That is, a document consists, at its most fundamental level of abstraction, of a sequence of characters, which are abstract units that exist independently of any encoding.

Ideally, when the characters of a document utilizing a markup language are encoded for storage or transmission over a network as a sequence of bits, the encoding that is used will be one that supports representing each and every character in the document, if not in the whole of Unicode, directly as a particular bit sequence.

Sometimes, though, for reasons of convenience or due to technical limitations, documents are encoded with an encoding that cannot represent some characters directly. For example, the widely used encodings based on ISO 8859 can only represent, at most, 256 unique characters as one 8-bit byte each.

In practice, documents are rarely allowed to use more than one encoding internally, so the onus is usually on the markup language to provide a means for document authors to express unencodable characters in terms of encodable ones. This is generally done through some kind of "escaping" mechanism.

The SGML-based markup languages allow document authors to use special sequences of characters from the ASCII range (the first 128 code points of Unicode) to represent, or reference, any Unicode character, regardless of whether the character being represented is directly available in the document's encoding. These special sequences are character references.
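
To make the idea concrete, here is a small sketch using Python's built-in `xmlcharrefreplace` error handler, which substitutes a decimal character reference for each character the target encoding cannot represent. The sample text is invented for the example.

```python
text = "Σ is sigma, Æ is ash"

# ASCII can represent neither Σ nor Æ, so both become references.
as_ascii = text.encode("ascii", errors="xmlcharrefreplace")

# ISO 8859-1 has Æ (byte 0xC6) but not Σ, so only Σ becomes a reference.
as_latin1 = text.encode("iso-8859-1", errors="xmlcharrefreplace")

print(as_ascii)
print(as_latin1)
```

The result in both cases contains only bytes that the chosen encoding can carry, while a markup-aware reader still recovers the original characters.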

Character references that are based on the referenced character's UCS or Unicode code point are called numeric character references. In HTML 4 and in all versions of XHTML and XML, the code point can be expressed either as a decimal (base 10) number or as a hexadecimal (base 16) number. The syntax is as follows:

Character U+0026 (ampersand), followed by character U+0023 (number sign), followed by one of the following choices:

  • one or more decimal digits zero (U+0030) through nine (U+0039); or
  • character U+0078 ("x") followed by one or more hexadecimal digits, which are zero (U+0030) through nine (U+0039), Latin capital letter A (U+0041) through F (U+0046), and Latin small letter a (U+0061) through f (U+0066);

all followed by character U+003B (semicolon). Older versions of HTML disallowed the hexadecimal syntax.
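
The syntax just described can be sketched as a regular expression. This is an illustration of the grammar as stated here, not a conforming HTML or XML parser; `NCR_PATTERN` and `decode_ncrs` are names made up for the example, and leaving out-of-range numbers untouched is an assumption of this sketch.

```python
import re

# "&#", then either decimal digits or "x" plus hexadecimal digits, then ";".
NCR_PATTERN = re.compile(r"&#(?:([0-9]+)|x([0-9A-Fa-f]+));")


def decode_ncrs(text: str) -> str:
    """Replace each numeric character reference with the character it names."""

    def replace(match: re.Match) -> str:
        decimal, hexadecimal = match.groups()
        if decimal is not None:
            code_point = int(decimal)
        else:
            code_point = int(hexadecimal, 16)
        # Assumption: numbers beyond the Unicode range are left as written.
        if code_point > 0x10FFFF:
            return match.group(0)
        return chr(code_point)

    return NCR_PATTERN.sub(replace, text)


print(decode_ncrs("&#931; &#x3a3; &#x3A3;"))
```

Anything that does not fit the pattern, such as a named reference like `&amp;`, passes through unchanged.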

The characters that make up a numeric character reference can be represented in every character encoding used in computing and telecommunications today, so there is no risk of the reference itself being unencodable.

There is another kind of character reference called a character entity reference, which allows a character to be referred to by a name instead of a number. (Naming a character creates a character entity.) HTML defines some character entities, but not many; all other characters can only be included by direct encoding or using NCRs.
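
As a rough illustration of the relationship between the two kinds of reference, Python's standard `html.entities` module exposes a table of HTML entity names and their code points. The sketch below looks up one name and shows that the named and numeric forms decode to the same character; the variable names are only for the example.

```python
import html
from html.entities import name2codepoint

# The entity name "Sigma" maps to code point 931 (U+03A3).
sigma_code_point = name2codepoint["Sigma"]

sigma_by_name = html.unescape("&Sigma;")
sigma_by_number = html.unescape(f"&#{sigma_code_point};")

print(sigma_code_point, sigma_by_name, sigma_by_number)
```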

Restrictions

The Universal Character Set defined by ISO 10646 is the "document character set" of SGML and HTML 4, so by default, any character in such a document, and any character referenced in such a document, must be in the UCS.

While the syntax of SGML does not prohibit references to invalid or unassigned code points, such as &#xFFFF;, SGML-derived markup languages such as HTML and XML can, and often do, restrict numeric character references to only those code points that are assigned to characters.

Restrictions may also apply for other reasons. For example, in HTML 4, &#12;, which is a reference to a non-printing "form feed" control character, is allowed because a form feed character is allowed. But in XML, the form feed character cannot be used, not even by reference. As another example, &#128;, which is a reference to another control character, is not allowed to be used or referenced in either HTML or XML. When used in HTML, however, it is usually not flagged as an error by web browsers, some of which interpret it, for compatibility reasons, as a reference to the character represented by code value 128 in the Windows-1252 encoding. This character, "€", has to be represented as &#8364; in standards-compliant HTML. As a further example, prior to the publication of XML 1.0 Second Edition on October 6, 2000, XML 1.0 was based on an older version of ISO 10646 and prohibited using characters above U+FFFD, except in character data, thus making a reference like &#65536; (U+10000) illegal. In XML 1.1 and newer editions of XML 1.0, such a reference is allowed, because the available character repertoire was explicitly extended.
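
Some of these restrictions can be observed with Python's standard library, as a sketch of how one particular toolchain behaves rather than a statement about every parser. `xml_accepts` is a hypothetical helper; the XML check relies on the bundled Expat-based parser, and `html.unescape` follows browser-style rules for the 128 to 159 range.

```python
import html
import xml.etree.ElementTree as ET


def xml_accepts(reference: str) -> bool:
    """Return True if an XML 1.0 parser accepts the reference as content."""
    try:
        ET.fromstring(f"<doc>{reference}</doc>")
        return True
    except ET.ParseError:
        return False


form_feed_ok = xml_accepts("&#12;")         # form feed is not an XML character
supplementary_ok = xml_accepts("&#65536;")  # U+10000, allowed in current XML 1.0

# Browser-compatible decoding treats &#128; as the Windows-1252 euro sign.
legacy_euro = html.unescape("&#128;")
standard_euro = html.unescape("&#8364;")

print(form_feed_ok, supplementary_ok, legacy_euro, standard_euro)
```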

Markup languages also place restrictions on where character references can occur.

Compatibility issues

In the initial versions of SGML and HTML, numeric character references were interpreted relative to the document character encoding, rather than Unicode. For Latin-script documents, numeric character references to characters between x80 and x9F in those documents will not be correct against Unicode, and must be recoded. HTML standards prior to HTML 4 only supported Western Latin script documents: the treatment of character references above #7F may vary between applications and national conventions.

For example, as mentioned above, the correct numeric character reference for the Euro sign "€" U+20AC when using Unicode is decimal &#8364; and hexadecimal &#x20AC;. However, with tools that support obsolete implementations of HTML, the reference &#128; (Euro in the Cp1252 code page) or &#164; (Euro in ISO/IEC 8859-15) may work.

As another example, if some text was originally created in the MacRoman character set, the left double quotation mark “ will be represented with code point xD2. This will not display properly in a system expecting a document encoded as UTF-8, ISO 8859-1, or CP1252, where this code point is occupied by the letter Ò. The correct numeric character reference for “ in HTML 4 and newer is &#x201C;, because U+201C is its UCS code. In some systems, the named character reference &ldquo; may also be available.
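
The byte values mentioned in these two examples can be checked against Python's bundled codecs. This is only a sketch; the codec names are Python's own spellings, and the variable names are invented for the example.

```python
# One byte, three readings: MacRoman gives the quotation mark, while
# ISO 8859-1 and Windows-1252 both give the letter Ò.
mac_byte = bytes([0xD2])
intended = mac_byte.decode("mac_roman")
misread_latin1 = mac_byte.decode("iso-8859-1")
misread_cp1252 = mac_byte.decode("cp1252")

# The encoding-independent spelling is the reference to the UCS code point.
quote_ncr = f"&#x{ord(intended):X};"

# The legacy euro positions: 128 in Windows-1252, 164 in ISO/IEC 8859-15.
euro_cp1252 = bytes([128]).decode("cp1252")
euro_8859_15 = bytes([164]).decode("iso8859_15")
euro_ncr = f"&#{ord(euro_cp1252)};"

print(intended, misread_latin1, quote_ncr, euro_ncr)
```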

See also

Alpha

Alpha (uppercase Α, lowercase α; Ancient Greek: ἄλφα, álpha, modern pronunciation álfa) is the first letter of the Greek alphabet. In the system of Greek numerals, it has a value of 1.

It was derived from the Phoenician and Hebrew letter aleph, meaning an ox or leader. Letters that arose from alpha include the Latin A and the Cyrillic letter А.

In English, the noun "alpha" is used as a synonym for "beginning", or "first" (in a series), reflecting its Greek roots.

Ayin

Ayin (also ayn or ain; transliterated ⟨ʿ⟩) is the sixteenth letter of the Semitic abjads, including Phoenician ʿayin, Hebrew ʿayin ע, Aramaic ʿē, Syriac ʿē ܥ, and Arabic ʿayn ع (where it is sixteenth in abjadi order only). The letter represents or is used to represent a voiced pharyngeal fricative (/ʕ/) or a similarly articulated consonant. In some Semitic languages and dialects, the phonetic value of the letter has changed, or the phoneme has been lost altogether (thus, in Modern Hebrew it is reduced to a glottal stop or is omitted entirely).

The Phoenician letter is the origin of the Greek, Latin and Cyrillic letter O.

Beta

Beta (uppercase Β, lowercase β, or cursive ϐ; Ancient Greek: βῆτα, translit. bē̂ta or Greek: βήτα vita) is the second letter of the Greek alphabet. In the system of Greek numerals it has a value of 2. In Ancient Greek, beta represented the voiced bilabial plosive /b/. In Modern Greek, it represents the voiced labiodental fricative /v/. Letters that arose from beta include the Roman letter ⟨B⟩ and the Cyrillic letters ⟨Б⟩ and ⟨В⟩.

Chi (letter)

Chi (uppercase Χ, lowercase χ; Greek: χῖ) is the 22nd letter of the Greek alphabet.

Delta (letter)

Delta (uppercase Δ, lowercase δ or 𝛿; Greek: δέλτα délta, [ˈðelta]) is the fourth letter of the Greek alphabet. In the system of Greek numerals it has a value of 4. It was derived from the Phoenician letter dalet 𐤃. Letters that come from delta include Latin D and Cyrillic Д.

A river delta (originally, the Nile River delta) is so named because its shape approximates the triangular uppercase letter delta. Despite a popular legend, this use of the word delta was not coined by Herodotus.

Epsilon

Epsilon (uppercase Ε, lowercase ε or lunate ϵ; Greek: έψιλον) is the fifth letter of the Greek alphabet, corresponding phonetically to a mid front unrounded vowel /e/. In the system of Greek numerals it also has the value five. It was derived from the Phoenician letter He. Letters that arose from epsilon include the Roman E, Ë and Ɛ, and Cyrillic Е, È, Ё, Є and Э.

The name of the letter was originally εἶ (Ancient Greek: [êː]), but the name was changed to ἒ ψιλόν (e psilon "simple e") in the Middle Ages to distinguish the letter from the digraph αι, a former diphthong that had come to be pronounced the same as epsilon.

In essence, the uppercase form of epsilon looks identical to Latin E. The lowercase version has two typographical variants, both inherited from medieval Greek handwriting. One, the most common in modern typography and inherited from medieval minuscule, looks like a reversed "3". The other, also known as lunate or uncial epsilon and inherited from earlier uncial writing, looks like a semicircle crossed by a horizontal bar. While in normal typography these are just alternative font variants, they may have different meanings as mathematical symbols. Computer systems therefore offer distinct encodings for them. In Unicode, the character U+03F5 "Greek lunate epsilon symbol" (ϵ) is provided specifically for the lunate form. In TeX, \epsilon denotes the lunate form, while \varepsilon denotes the reversed-3 form.

There is also a Latin epsilon or "open e", which looks similar to the Greek lowercase epsilon. It is encoded in Unicode as U+025B ("Latin small-letter open e", ɛ) and U+0190 ("Latin capital-letter open e", Ɛ) and is used as an IPA phonetic symbol. The lunate or uncial epsilon has also provided inspiration for the euro sign (€).

The lunate epsilon (ϵ) is not to be confused with the set membership symbol (∈); nor should the Latin uppercase epsilon (Ɛ) be confused with the Greek uppercase sigma (Σ). The symbol ∈, first used in set theory and logic by Giuseppe Peano and now used in mathematics in general for set membership ("belongs to") did, however, evolve from the letter epsilon, since the symbol was originally used as an abbreviation for the Latin word "est". In addition, mathematicians often read the symbol ∈ as "element of", as in "1 is an element of the natural numbers" for 1 ∈ ℕ, for example. As late as 1960, ε itself was used for set membership, while its negation "does not belong to" (now ∉) was denoted by ε′ (epsilon prime). Only gradually did a fully separate, stylized symbol take the place of epsilon in this role. In a related context, Peano also introduced the use of a backwards epsilon, ϶, for the phrase "such that", although the abbreviation "s.t." is occasionally used in place of ϶ in informal cardinals.

Gamma

Gamma (uppercase Γ, lowercase γ; Greek: γάμμα gámma) is the third letter of the Greek alphabet. In the system of Greek numerals it has a value of 3. In Ancient Greek, the letter gamma represented a voiced velar stop /ɡ/. In Modern Greek, this letter represents either a voiced velar fricative or a voiced palatal fricative.

In the International Phonetic Alphabet and other modern Latin-alphabet based phonetic notations, it represents the voiced velar fricative.

Iota

Iota (uppercase Ι, lowercase ι; Greek: ιώτα) is the ninth letter of the Greek alphabet. It was derived from the Phoenician letter Yodh. Letters that arose from this letter include the Latin I and J, the Cyrillic І (І, і), Yi (Ї, ї), and Je (Ј, ј), and iotated letters (e.g. Yu (Ю, ю)).

In the system of Greek numerals, iota has a value of 10. Iota represents the sound [i]. In ancient Greek it occurred in both long [iː] and short [i] versions, but this distinction was lost in Koine Greek. Iota participated as the second element in falling diphthongs, with both long and short vowels as the first element. Where the first element was long, the iota was lost in pronunciation at an early date, and was written in polytonic orthography as iota subscript, in other words as a very small ι under the main vowel. Examples include ᾼ ᾳ ῌ ῃ ῼ ῳ. The former diphthongs became digraphs for simple vowels in Koine Greek.

The word is used in a common English phrase, "not one iota", meaning "not the slightest amount", in reference to a phrase in the New Testament (Matthew 5:18): "until heaven and earth pass away, not an iota, not a dot, (King James Version: '[not] one jot or one tittle') will pass from the Law until all is accomplished." This refers to iota, the smallest letter, or possibly Yodh, י, the smallest letter in the Hebrew alphabet.

The word 'jot' (or iot) derives from iota. The German, Portuguese and Spanish name for the letter J (Jot / jota) is derived from iota.

Kappa

Kappa (uppercase Κ, lowercase κ or cursive ϰ; Greek: κάππα, káppa) is the 10th letter of the Greek alphabet, used to represent the [k] sound in Ancient and Modern Greek. In the system of Greek numerals, Kʹ has a value of 20. It was derived from the Phoenician letter kaph. Letters that arose from kappa include the Roman K and Cyrillic К.

Greek proper names and placenames containing kappa are often written in English with "c" due to the Romans' transliterations into the Latin alphabet: Constantinople, Corinth, Crete. All formal modern romanizations of Greek now use the letter "k", however: Thessaloniki, Kalamata, Nikaia.

The cursive form ϰ is generally a simple font variant of lower-case kappa, but it is encoded separately in Unicode for occasions where it is used as a separate symbol in math and science. In mathematics, the kappa curve is named after this letter; the tangents of this curve were first calculated by Isaac Barrow in the 17th century.

List of hexagrams of the I Ching

This is a list of the 64 hexagrams of the I Ching, or Book of Changes, and their Unicode character codes.

This list is in King Wen order. (Cf. other hexagram sequences.)

NCR

NCR may refer to:

NCR Corporation, business technology company, previously National Cash Register

"No carbon required" carbonless copy paper

Napier City Rovers, a New Zealand association football club

A Nature Conservation Review, listing of British nature conservation sites

Naval Construction Regiment, unit of US Navy Seabees

Navin Ramgoolam, prime minister of Mauritius

Nodule-specific cysteine rich, an antimicrobial peptide produced by many root nodule forming plants

New California Republic, fictional government in game franchise Fallout

Not criminally responsible, insanity defense

Numeric character reference, mechanism for specifying Unicode characters

nCr or ⁿCᵣ, mathematics operation a.k.a. "from n choose r" or "combinations of n things, taken r at a time"

Nugget Casino Resort, a hotel and casino located in Sparks, Nevada

Omega

Omega (capital: Ω, lowercase: ω; Greek ὦ, later ὦ μέγα, Modern Greek ωμέγα) is the 24th and last letter of the Greek alphabet. In the Greek numeric system/Isopsephy (Gematria), it has a value of 800. The word literally means "great O" (ō mega, mega meaning "great"), as opposed to omicron, which means "little O" (o mikron, micron meaning "little"). In phonetic terms, the Ancient Greek Ω is a long open-mid o [ɔː], comparable to the vowel of British English raw. In Modern Greek, Ω represents the mid back rounded vowel /o̞/, the same sound as omicron. The letter omega is transcribed ō or simply o.

As the last letter of the Greek alphabet, Omega is often used to denote the last, the end, or the ultimate limit of a set, in contrast to alpha, the first letter of the Greek alphabet.

Pi (letter)

Pi (uppercase Π, lowercase π and ϖ; Greek: πι [pi]) is the sixteenth letter of the Greek alphabet, representing the sound [p]. In the system of Greek numerals it has a value of 80. It was derived from the Phoenician letter Pe. Letters that arose from pi include Cyrillic Pe (П, п), Coptic pi (Ⲡ, ⲡ), and Gothic pairthra (𐍀).

Psi (letter)

Psi (uppercase Ψ, lowercase ψ; Greek: ψι psi [ˈpsi]) is the 23rd letter of the Greek alphabet and has a numeric value of 700. In both Classical and Modern Greek, the letter indicates the combination /ps/ (as in the English word "lapse").

For Greek loanwords in Latin and modern languages with Latin alphabets, psi is usually transliterated as "ps".

The letter's origin is uncertain. It may or may not derive from the Phoenician alphabet. It appears in the 7th century BC, expressing /ps/ in the Eastern alphabets, but /kʰ/ in the Western alphabets (the sound expressed by Χ in the Eastern alphabets). In writing, the early letter appears in an angular shape.

There were early graphical variants that omitted the stem ("chickenfoot-shaped psi").

The Western letter (expressing /kʰ/, later /x/) was adopted into the Old Italic alphabets, and its shape is also continued into the Algiz rune of the Elder Futhark.

The classical Greek letter was adopted into the early Cyrillic alphabet as "Ѱ".

Rho

Rho (uppercase Ρ, lowercase ρ or ϱ; Greek: ῥῶ) is the 17th letter of the Greek alphabet. In the system of Greek numerals, it has a value of 100. It is derived from the Phoenician letter res. Its uppercase form uses the same glyph, Ρ, as the distinct Latin letter P; the two letters have different Unicode encodings.

Sigma

Sigma (uppercase Σ, lowercase σ, lowercase in word-final position ς; Greek: σίγμα) is the eighteenth letter of the Greek alphabet. In the system of Greek numerals, it has a value of 200. When used at the end of a word (when the word is not all caps), the final form (ς) is used, e.g. Ὀδυσσεύς (Odysseus); note the two sigmas in the center of the name, and the word-final sigma at the end.

Theta

Theta (uppercase Θ or ϴ, lowercase θ (which resembles digit 0 with horizontal line) or ϑ; Ancient Greek: θῆτα thē̂ta [tʰɛ̂ːta]; Modern: θήτα thī́ta [ˈθita]) is the eighth letter of the Greek alphabet, derived from the Phoenician letter Teth. In the system of Greek numerals it has the value 9.

Upsilon

Upsilon (uppercase Υ, lowercase υ; Greek: ύψιλον ýpsilon [ˈipsilon]) or ypsilon is the 20th letter of the Greek alphabet. In the system of Greek numerals, Υʹ has a value of 400. It is derived from the Phoenician waw.

Yus

Little yus (Ѧ ѧ) and big yus (Ѫ ѫ), or jus, are letters of the Cyrillic script representing two Common Slavonic nasal vowels in the early Cyrillic and Glagolitic alphabets. Each can occur in iotified form (Ѩ ѩ, Ѭ ѭ), formed as ligatures with the decimal i (І). Other yus letters are blended yus (Ꙛ ꙛ), closed little yus (Ꙙ ꙙ) and iotified closed little yus (Ꙝ ꙝ).

Phonetically, little yus represents a nasalized front vowel, possibly [ɛ̃], while big yus represents a nasalized back vowel, such as IPA [ɔ̃]. This is also suggested by the appearance of each as a 'stacked' digraph of 'Am' and 'om' respectively.

The names of the letters do not imply capitalization, as both little and big yus exist in majuscule and minuscule variants.


This page is based on a Wikipedia article written by authors (here).
Text is available under the CC BY-SA 3.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.