IETF language tag

An IETF BCP 47 language tag is a code to identify human languages. For example, the tag en stands for English; es-419 for Latin American Spanish; rm-sursilv for Sursilvan; gsw-u-sd-chzh for Zürich German; nan-Hant-TW for Min Nan Chinese as spoken in Taiwan using traditional Han characters. To distinguish language variants for countries, regions, writing systems etc., IETF language tags combine subtags from other standards such as ISO 639, ISO 15924, ISO 3166-1, and UN M.49. The tag structure has been standardized by the Internet Engineering Task Force (IETF) in Best Current Practice (BCP) 47; the subtags are maintained by the IANA Language Subtag Registry.[1][2][3][4] IETF language tags are used by computing standards such as HTTP,[5] HTML,[6] XML,[7] and PNG.[8]

History

IETF language tags were first defined in RFC 1766, edited by Harald Tveit Alvestrand, published in March 1995. The tags used ISO 639 two-letter language codes and ISO 3166 two-letter country codes, and allowed registration of whole tags that included variant or script subtags of three to eight letters.

In January 2001 this was updated by RFC 3066, which added the use of ISO 639-2 three-letter codes, permitted subtags with digits, and adopted the concept of language ranges from HTTP/1.1 to help with matching of language tags.

The next revision of the specification came in September 2006 with the publication of RFC 4646 (the main part of the specification), edited by Addison Phillips and Mark Davis, and RFC 4647 (which deals with matching behaviour). RFC 4646 introduced a more structured format for language tags, added the use of ISO 15924 four-letter script codes and UN M.49 three-digit geographical region codes, and replaced the old registry of tags with a new registry of subtags. The small number of previously defined tags that did not conform to the new structure were grandfathered in order to maintain compatibility with RFC 3066.

The current version of the specification, RFC 5646, was published in September 2009. The main purpose of this revision was to incorporate three-letter codes from ISO 639-3 and 639-5 into the Language Subtag Registry, in order to increase the interoperability between ISO 639 and BCP 47.[9]

Syntax of language tags

Each language tag is composed of one or more "subtags" separated by hyphens (-). Each subtag is composed of basic Latin letters or digits only.

With the exceptions of private-use language tags beginning with an x- prefix and grandfathered language tags (including those starting with an i- prefix and those previously registered in the old Language Tag Registry), subtags occur in the following order:

  • A single primary language subtag based on a two-letter language code from ISO 639-1 (2002) or a three-letter code from ISO 639-2 (1998), ISO 639-3 (2007) or ISO 639-5 (2008), or registered through the BCP 47 process and composed of five to eight letters;
  • Up to three optional extended language subtags composed of three letters each, separated by hyphens; (There is currently no extended language subtag registered in the Language Subtag Registry without an equivalent and preferred primary language subtag. This component of language tags is preserved for backwards compatibility and to allow for future parts of ISO 639.)
  • An optional script subtag, based on a four-letter script code from ISO 15924 (usually written in Title Case);
  • An optional region subtag based on a two-letter country code from ISO 3166-1 alpha-2 (usually written in upper case), or a three-digit code from UN M.49 for geographical regions;
  • Optional variant subtags, separated by hyphens, each composed of five to eight letters or digits, or of four characters starting with a digit; (Variant subtags are registered with IANA and not associated with any external standard.)
  • Optional extension subtags, separated by hyphens, each composed of a single character other than the letter x, followed by a hyphen and one or more subtags of two to eight characters each, separated by hyphens;
  • An optional private-use subtag, composed of the letter x and a hyphen followed by subtags of one to eight characters each, separated by hyphens.
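
The subtag order listed above can be sketched as a regular expression. The following Python example is a simplified illustration, not a full validator: the names LANGTAG and parse_language_tag are hypothetical, grandfathered tags and tags consisting only of private-use subtags are not handled, and no subtag is checked against the Language Subtag Registry.

```python
import re

# Simplified pattern for regular tags; well-formedness only, no registry lookup.
LANGTAG = re.compile(
    r"""^
    (?P<language>[a-z]{2,3}|[a-z]{5,8})                     # primary language
    (?P<extlang>(?:-[a-z]{3}){0,3})                         # extended language subtags
    (?:-(?P<script>[a-z]{4}))?                              # script
    (?:-(?P<region>[a-z]{2}|[0-9]{3}))?                     # region
    (?P<variants>(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*)  # variants
    (?P<extensions>(?:-[a-wy-z0-9](?:-[a-z0-9]{2,8})+)*)    # extensions
    (?:-(?P<private>x(?:-[a-z0-9]{1,8})+))?                 # private use
    $""",
    re.IGNORECASE | re.VERBOSE,
)


def parse_language_tag(tag):
    """Split a tag into its components, or return None if it is not well-formed."""
    m = LANGTAG.match(tag)
    if m is None:
        return None

    def split(group):
        # Repeated groups are captured as "-a-b"; turn them into ["a", "b"].
        return [s for s in m.group(group).split("-") if s]

    return {
        "language": m.group("language"),
        "extlang": split("extlang"),
        "script": m.group("script"),
        "region": m.group("region"),
        "variants": split("variants"),
        "extensions": m.group("extensions").lstrip("-") or None,
        "private": m.group("private"),
    }


print(parse_language_tag("nan-Hant-TW"))
print(parse_language_tag("gsw-u-sd-chzh"))
```

A tag that passes this pattern is only well-formed; whether it is valid depends on each subtag being present in the registry.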

Subtags are not case-sensitive, but the specification recommends using the same case as in the Language Subtag Registry, where region subtags are UPPERCASE, script subtags are Title Case, and all other subtags are lowercase. This capitalization follows the recommendations of the underlying ISO standards.
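
The recommended capitalization can be applied by position alone, without consulting the registry. The sketch below assumes the positional rule given in RFC 5646 (everything lowercase, except that two-letter subtags are uppercased and four-letter subtags are title-cased when they are neither first in the tag nor after a single-character subtag); the function name format_case is hypothetical.

```python
def format_case(tag):
    """Return the tag with the conventional capitalization applied."""
    result = []
    after_singleton = False
    for index, subtag in enumerate(tag.lower().split("-")):
        if index > 0 and not after_singleton and subtag.isalpha():
            if len(subtag) == 4:
                subtag = subtag.title()   # script, e.g. "hant" -> "Hant"
            elif len(subtag) == 2:
                subtag = subtag.upper()   # region, e.g. "tw" -> "TW"
        if len(subtag) == 1:
            # Everything after an extension or private-use singleton stays lowercase.
            after_singleton = True
        result.append(subtag)
    return "-".join(result)


print(format_case("NAN-hant-tw"))
print(format_case("GSW-U-SD-CHZH"))
```

Because matching is case-insensitive, this formatting affects readability only, not the meaning of the tag.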

Script and region subtags should be omitted when they add no distinguishing information to a language tag. For example, es is preferred over es-Latn, as Spanish is fully expected to be written in the Latin script; ja is preferred over ja-JP, as Japanese as used in Japan does not differ markedly from Japanese as used elsewhere.

Not all linguistic regions can be represented with a valid region subtag: the subnational regional dialects of a primary language are registered as variant subtags. For example, the valencia variant subtag for the Valencian dialect of Catalan is registered in the Language Subtag Registry with the prefix ca. As this dialect is spoken almost exclusively in Spain, the region subtag ES can normally be omitted.

IETF language tags have been used as locale identifiers in many applications. It may be necessary for these applications to establish their own strategy for defining, encoding and matching locales if the strategy described in RFC 4647 is not adequate.

The use, interpretation and matching of IETF language tags is currently defined in RFC 5646 and RFC 4647. The Language Subtag Registry lists all currently valid public subtags. Private-use subtags are not included in the Registry as they are implementation-dependent and subject to private agreements between third parties using them. These private agreements are out of scope of BCP 47.
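
As an illustration of matching, the following Python sketch follows the general shape of the "lookup" scheme in RFC 4647: each requested tag is progressively truncated from the end until it equals one of the available tags. The function name lookup and its parameters are hypothetical, and wildcard ranges are simply skipped.

```python
def lookup(priority_list, available, default=None):
    """Return the best available tag for an ordered list of requested tags."""
    by_key = {tag.lower(): tag for tag in available}  # comparison ignores case
    for language_range in priority_list:
        if language_range == "*":
            continue  # wildcard ranges are not handled in this sketch
        subtags = language_range.lower().split("-")
        while subtags:
            candidate = "-".join(subtags)
            if candidate in by_key:
                return by_key[candidate]
            subtags.pop()
            # A single-character subtag left at the end is removed as well.
            if subtags and len(subtags[-1]) == 1:
                subtags.pop()
    return default


print(lookup(["nan-Hant-TW", "en-GB"], ["en", "nan-Hant", "es-419"]))
```

Applications that use tags as locale identifiers may need a different fallback order, which is why such strategies are left to the implementation.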

Relation to other standards

Although some types of subtags are derived from ISO or UN core standards, they do not follow these standards absolutely, as this could lead to the meaning of language tags changing over time. In particular, a subtag derived from a code assigned by ISO 639, ISO 15924, ISO 3166, or UN M.49 remains a valid (though deprecated) subtag even if the code is withdrawn from the corresponding core standard. If the standard later assigns a new meaning to the withdrawn code, the corresponding subtag will still retain its old meaning.

This stability was introduced in RFC 4646.

ISO 639-3 and ISO 639-1

RFC 4646 defined the concept of an "extended language subtag" (sometimes referred to as extlang), although no such subtags were registered at that time.[10][11]

RFC 5645 and RFC 5646 added primary language subtags corresponding to ISO 639-3 codes for all languages that did not already exist in the Registry. In addition, codes for languages encompassed by certain macrolanguages were registered as extended language subtags. Sign languages were also registered as extlangs, with the prefix sgn. These languages may be represented either with the subtag for the encompassed language alone (cmn for Mandarin) or with a language-extlang combination (zh-cmn). The first option is preferred for most purposes. The second option is called "extlang form" and is new in RFC 5646.

Whole tags that were registered prior to RFC 4646 and are now classified as "grandfathered" or "redundant" (depending on whether they fit the new syntax) are deprecated in favor of the corresponding ISO 639-3–based language subtag, if one exists. To list a few examples, nan is preferred over zh-min-nan for Min Nan Chinese; hak is preferred over i-hak and zh-hakka for Hakka Chinese; and ase is preferred over sgn-US for American Sign Language.
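
The preferences described above can be sketched as a small replacement table. The Python example below is illustrative only: it hard-codes the examples from this section, whereas a real implementation would read the preferred values and extlang prefixes from the Language Subtag Registry. The names DEPRECATED_TAGS, EXTLANG_PREFIX and to_preferred are hypothetical.

```python
# Whole tags that are deprecated in favor of a single language subtag
# (keys are lowercase because tags are compared case-insensitively).
DEPRECATED_TAGS = {
    "zh-min-nan": "nan",
    "i-hak": "hak",
    "zh-hakka": "hak",
    "sgn-us": "ase",
}

# Extended language subtag -> the prefix it is registered with.
EXTLANG_PREFIX = {"cmn": "zh"}


def to_preferred(tag):
    """Replace a deprecated tag or an extlang form with the preferred form."""
    key = tag.lower()
    if key in DEPRECATED_TAGS:
        return DEPRECATED_TAGS[key]
    subtags = tag.split("-")
    if len(subtags) >= 2 and EXTLANG_PREFIX.get(subtags[1].lower()) == subtags[0].lower():
        # Extlang form such as zh-cmn: drop the prefix, keep the rest of the tag.
        return "-".join(subtags[1:])
    return tag


for example in ("zh-min-nan", "i-hak", "sgn-US", "zh-cmn-Hans", "en-GB"):
    print(example, "->", to_preferred(example))
```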

ISO 639-5 and ISO 639-2

ISO 639-5 defines language collections with alpha-3 codes in a different way than they were initially encoded in ISO 639-2 (including one code already present in ISO 639-1). Specifically, the language collections are now all defined in ISO 639-5 as inclusive, rather than some of them being defined exclusively. This means that language collections have a broader scope than before, in some cases where they could encompass languages that were already encoded separately within ISO 639-2.

For example, the ISO 639-2 code afa was previously associated with the name "Afro-Asiatic (Other)", excluding languages such as Arabic that already had their own code. In ISO 639-5, this collection is named "Afro-Asiatic languages" and includes all such languages. ISO 639-2 changed the exclusive names in 2009 to match the inclusive ISO 639-5 names.[12]

To avoid breaking implementations that may still depend on the older (exclusive) definition of these collections, ISO 639-5 defines a grouping type attribute for all collections that were already encoded in ISO 639-2 (such grouping type is not defined for the new collections added only in ISO 639-5).

BCP 47 defines a "Scope" property to identify subtags for language collections. However, it does not define any given collection as inclusive or exclusive, and does not use the ISO 639-5 grouping type attribute, although the description fields in the Language Subtag Registry for these subtags match the ISO 639-5 (inclusive) names. As a consequence, BCP 47 language tags that include a primary language subtag for a collection may be ambiguous as to whether the collection is intended to be inclusive or exclusive.

ISO 639-5 does not define precisely which languages are members of these collections; only the hierarchical classification of collections is defined, using the inclusive definition of these collections. Because of this, RFC 5646 does not recommend the use of subtags for language collections for most applications, although they are still preferred over subtags whose meaning is even less specific, such as "Multiple languages" and "Undetermined".

In contrast, the classification of individual languages within their macrolanguage is standardized, in both ISO 639-3 and the Language Subtag Registry.

ISO 15924, ISO/IEC 10646 and Unicode

Script subtags were first added to the Language Subtag Registry when RFC 4646 was published, from the list of codes defined in ISO 15924. They are encoded in the language tag after primary and extended language subtags, but before other types of subtag, including region and variant subtags.

Some primary language subtags are defined with a property named "Suppress-Script" which indicates the cases where a single script can usually be assumed by default for the language, even if it can be written with another script. When this is the case, it is preferable to omit the script subtag, to improve the likelihood of successful matching. A different script subtag can still be appended to make the distinction when necessary. For example, yi is preferred over yi-Hebr in most contexts, because the Hebrew script subtag is assumed for the Yiddish language.
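
A minimal Python sketch of how "Suppress-Script" might be applied is shown below. The table is a hand-copied assumption covering only two languages (a real implementation would load the property from the Language Subtag Registry), the function name drop_default_script is hypothetical, and tags with extended language subtags are not handled.

```python
# Language subtag -> script that is assumed by default for that language.
SUPPRESS_SCRIPT = {"yi": "Hebr", "es": "Latn"}


def drop_default_script(tag):
    """Remove the script subtag when it only repeats the language's default script."""
    subtags = tag.split("-")
    if len(subtags) >= 2:
        default = SUPPRESS_SCRIPT.get(subtags[0].lower())
        if default is not None and subtags[1].lower() == default.lower():
            del subtags[1]
    return "-".join(subtags)


print(drop_default_script("yi-Hebr"))   # the Hebrew script is implied for Yiddish
print(drop_default_script("yi-Latn"))   # a non-default script is kept
```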

As another example, zh-Hans-SG may be considered equivalent to zh-Hans, because the region code is probably not significant; the written form of Chinese used in Singapore uses the same simplified Chinese characters as in other countries where Chinese is written. However, the script subtag is maintained because it is significant.

Note that ISO 15924 includes some codes for script variants (for example, Hans and Hant for simplified and traditional forms of Chinese characters) that are unified within Unicode and ISO/IEC 10646. These script variants are most often encoded for bibliographic purposes, but are not always significant from a linguistic point of view (for example, Latf and Latg script codes for the Fraktur and Gaelic variants of the Latin script, which are mostly encoded with regular Latin letters in Unicode and ISO/IEC 10646). They may occasionally be useful in language tags to expose orthographic or semantic differences, with different analysis of letters, diacritics, and digraphs/trigraphs as default grapheme clusters, or differences in letter casing rules.

ISO 3166-1 and UN M.49

Two-letter region subtags are based on codes assigned, or "exceptionally reserved", in ISO 3166-1. If the ISO 3166 Maintenance Agency were to reassign a code that had previously been assigned to a different country, the existing BCP 47 subtag corresponding to that code would retain its meaning, and a new region subtag based on UN M.49 would be registered for the new country. UN M.49 is also the source for numeric region subtags for geographical regions, such as 005 for South America.

Region subtags are used to specify the variety of a language "as used in" a particular region. They are appropriate when the variety is regional in nature, and can be captured adequately by identifying the countries involved, as when distinguishing British English (en-GB) from American English (en-US). When the difference is one of script or script variety, as for simplified versus traditional Chinese characters, it should be expressed with a script subtag instead of a region subtag; in this example, zh-Hans and zh-Hant should be used instead of zh-CN and zh-HK.

When a distinct language subtag exists for a language that could be considered a regional variety, it is often preferable to use the more specific subtag instead of a language-region combination. For example, ar-DZ (Arabic as used in Algeria) may be better expressed as arq for Algerian Spoken Arabic.

Extensions

Extension subtags (not to be confused with extended language subtags) allow additional information to be attached to a language tag that does not necessarily serve to identify a language. One use for extensions is to encode locale information, such as calendar and currency.

Extension subtags are composed of multiple hyphen-separated character strings, starting with a single character (other than x), called a singleton. Each extension is described in its own IETF RFC, which identifies a Registration Authority to manage the data for that extension. IANA is responsible for allocating singletons.
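
Because every extension starts with a singleton, a tag can be split into its language part and its extensions by scanning for single-character subtags. The following Python sketch illustrates this; the function name split_extensions is hypothetical, and the input is assumed to be a well-formed, non-grandfathered tag.

```python
def split_extensions(tag):
    """Return (language part, {singleton: [subtags]}) for a well-formed tag."""
    subtags = tag.split("-")
    base = []
    extensions = {}
    current = None
    for index, subtag in enumerate(subtags):
        if index > 0 and len(subtag) == 1:
            singleton = subtag.lower()
            if singleton == "x":
                # Private use: everything after x belongs to it.
                extensions["x"] = subtags[index + 1:]
                break
            current = extensions.setdefault(singleton, [])
        elif current is None:
            base.append(subtag)
        else:
            current.append(subtag)
    return "-".join(base), extensions


print(split_extensions("gsw-u-sd-chzh"))
print(split_extensions("nan-Hant-TW"))
```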

Two extensions have been assigned as of January 2014.

Extension T (Transformed Content)

Extension T allows a language tag to include information on how the tagged data was transliterated, transcribed, or otherwise transformed. For example, the tag en-t-ja could be used for content in English that was translated from the original Japanese. Additional substrings could indicate that the translation was done mechanically, or in accordance with a published standard.

Extension T is described in RFC 6497, published in February 2012. The Registration Authority is the Unicode Consortium.

Extension U (Unicode Locale)

Extension U allows a wide variety of locale attributes found in the Common Locale Data Repository (CLDR) to be embedded in language tags. These attributes include country subdivisions, calendar and time zone data, collation order, currency, number system, and keyboard identification.
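
Within the U extension, locale attributes are written as keywords: a two-character key followed by zero or more longer type subtags. The Python sketch below extracts them from a tag under that assumption; the function name unicode_locale_keywords is hypothetical, and leading attributes (subtags before the first key) are ignored.

```python
def unicode_locale_keywords(tag):
    """Return the key/type pairs of the U extension as a dict (empty if absent)."""
    subtags = tag.lower().split("-")
    start = None
    for index, subtag in enumerate(subtags[1:], 1):
        if subtag == "x":
            break  # private use: nothing after it is an extension
        if subtag == "u":
            start = index + 1
            break
    if start is None:
        return {}

    keywords = {}
    key = None
    for subtag in subtags[start:]:
        if len(subtag) == 1:
            break  # the next singleton ends the U extension
        if len(subtag) == 2:
            key = subtag
            keywords[key] = []
        elif key is not None:
            keywords[key].append(subtag)
    # A key without a type is kept with an empty string as its value.
    return {k: "-".join(v) for k, v in keywords.items()}


print(unicode_locale_keywords("gsw-u-sd-chzh"))
```

For the tag gsw-u-sd-chzh, the key sd identifies a country subdivision and chzh is its value.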

Extension U is described in RFC 6067, published in December 2010. The Registration Authority is the Unicode Consortium.

References

  1. ^ "Language Subtag Registry". iana.org. Internet Assigned Numbers Authority. Retrieved 2018-12-05.
  2. ^ "Language subtag lookup app". r12a.github.io. Retrieved 28 July 2015.
  3. ^ "Language Tag Extensions Registry". iana.org. Internet Assigned Numbers Authority. Retrieved 2018-12-06.
  4. ^ "IANA — Protocol Registries". iana.org. Retrieved 28 July 2015.
  5. ^ "RFC 7231 - Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content". ietf.org. Retrieved 28 July 2015.
  6. ^ "Language information and text direction". w3.org. Retrieved 28 July 2015.
  7. ^ "Extensible Markup Language (XML) 1.0 (Fifth Edition)". w3.org. Retrieved 28 July 2015.
  8. ^ "Portable Network Graphics (PNG) Specification (Second Edition)". w3.org. Retrieved 28 July 2015.
  9. ^ Language Tag Registry Update charter Archived 2007-02-10 at the Wayback Machine
  10. ^ Addison Phillips, Mark Davis (2008). "Tags for Identifying Languages (old draft for the revision of RFC 4646, now obsolete and may disappear soon)". IETF WG LTRU. Retrieved 2008-06-23.
  11. ^ Doug Ewell (2008). "Update to the Language Subtag Registry (old draft for the revision of RFC 4645, now obsolete and may disappear soon)" (1MB). IETF WG LTRU. Retrieved 2008-06-23.
  12. ^ "ISO 639-2 Language Code List - Codes for the representation of names of languages (Library of Congress)". loc.gov. Retrieved 28 July 2015.

African Nova Scotian English

African Nova Scotian English (ANSE and ANSD) is a variety of the English language spoken by descendants of black immigrants from the United States who live in Nova Scotia, Canada. Members of these communities are collectively known as Black Nova Scotians. Though most African-American immigrants to Canada ended up in Ontario through the Underground Railroad, only the dialect of Nova Scotian blacks retains the influence of West African pidgin. In the 19th century, African Nova Scotian English would have been indistinguishable from English spoken in Jamaica or Suriname. However, it has been increasingly de-creolized since this time, due to interaction and influence from the white Nova Scotian population, who mostly hail from the British Isles. Desegregation of the province's school boards in 1964 further accelerated the process of de-creolization.

The language is a relative of African-American Vernacular English, with variations unique to the group's history in the area. There are noted differences in the dialects of those from Guysborough County (Black Loyalists), and those from North Preston (Black Refugees), the Guysborough group having been in the province three generations earlier. Howe & Walker (2000) use data from early recordings of African Nova Scotian English, Samaná English, and the recordings of former slaves to demonstrate that speech patterns were inherited from nonstandard colonial English. The dialect was extensively studied in 1992 by Shana Poplack and Sali Tagliamonte from the University of Ottawa. A commonality between African Nova Scotian English and African American Vernacular English is (r)-deletion. This rate of deletion is 57% among Black Nova Scotians, and 60% among African Americans in Philadelphia. Meanwhile, in the surrounding mostly white communities of Nova Scotia, (r)-deletion does not occur. The exception to this is the non-rhotic dialect of Lunenburg English.

Aluo language

Aluo (autonym: ɑ˥lo˧ pho˥; Naluo) is a Loloish language spoken by the Yi people of China. It is also known by its Nasu name Laka (also Gan Yi, Yala, Lila, Niluo).

Asante dialect

Ashanti, Asante, or Asante Twi, is spoken by over 2.8 million Ashanti people. Ashanti (or Ashanti Twi) is one of three literary dialects of the Akan language of West Africa, and the prestige dialect of that language. It is spoken in and around Kumasi, the capital of the Ashanti Region of Ghana.

The two dialects of Akuapem and Asante are known as Twi and are in many ways mutually intelligible. There are about 9 million Twi speakers, mainly in Ashanti. Akuapem Twi was the first dialect to be used for Bible translation, and became the prestige dialect as a result. In Ethnologue and ISO 639-3, Asante is analysed as a dialect of Twi. Twi in its turn is a language belonging to the macrolanguage of Akan. In Glottolog, Asante is found as a sub-dialect of Twi, which is in turn classified as a dialect of the Akan language.

Chakma language

Chakma language (autonym: 𑄌𑄋𑄴𑄟𑄳𑄦 𑄞𑄌𑄴, Changmha Bhach) is an Indo-Aryan language spoken by the Chakma and Daingnet people. Its better-known closest relatives are Assamese, Hajong, Bengali, Chittagonian, and Bishnupriya Manipuri of Manipur, Tanchangya, and Sylheti. It is spoken by nearly 310,000 people in southeast Bangladesh in Chittagong Hill Tracts, and another 300,000 in India in Assam and Tripura and 40,265 in Mizoram. It is written using the Chakma script, which is also called Ajhā pāṭh, sometimes romanised Ojhopath. Literacy in Chakma script is low.

It is officially recognised by neither the Bangladesh government nor the Indian government, the only two countries where local Chakma people live.

Although there were no Chakma language radio or television stations as of 2011, the language has a presence in social media and on YouTube. The Hill Education Chakma Script website provides tutorials, videos, e-books, and Chakma language forums.

In 2012, the Government of Tripura announced it would "introduce Chakma language in Chakma script in primary schools of Tripura. Imparting of education up to elementary stage in mother tongue is a national policy. To begin with Chakma language subjects in its own scripts will be introduced in 58 primary schools in Chakma concentrated areas."

"In preparation for the January 2014 education season, the national curriculum and textbook board has already started printing books in six languages ... Chakma, Kokborok (Tripura community), Marma, Santal, Sadri (Orao community) and Achik."

Mor Thengari (My Bicycle) was Bangladesh's first Chakma-language movie. However, it was banned in Bangladesh.

Colonia Tovar dialect

The Colonia Tovar dialect, or Alemán Coloniero, is a dialect spoken in Colonia Tovar, Venezuela, that belongs to the Low Alemannic branch of German.

Cuban Spanish

Cuban Spanish, also referred to colloquially as simply cubano, or even cubañol, is the variety of the Spanish language as it is spoken in Cuba. As a Caribbean language variety, Cuban Spanish shares a number of features with nearby varieties, including coda deletion, seseo, and /s/ debuccalization ("aspiration").

Equatoguinean Spanish

Equatoguinean Spanish (Spanish: Español ecuatoguineano) is the variety of Spanish spoken in Equatorial Guinea. This is the only Spanish variety that holds national official status in Sub-Saharan Africa. It is regulated by the Equatoguinean Academy of the Spanish Language and is spoken by about 90% of the population, estimated at 1,170,308 for the year 2010 (though population figures for this country are highly dubious), all of them second-language speakers.

European Portuguese

European Portuguese (Portuguese: português europeu, pronounced [puɾtuˈɣez ewɾuˈpew]), also known as Lusitanian Portuguese (português lusitano) and Portuguese of Portugal (português de Portugal) in Brazil, or even “Portuguese Portuguese”, refers to the Portuguese language spoken in Portugal. Standard Portuguese pronunciation, the prestige norm based on European Portuguese, is the reference for Portugal, the Portuguese-speaking African countries, East Timor and Macau. The word “European” was chosen to avoid the clash of “Portuguese Portuguese” (“português português”) as opposed to Brazilian Portuguese.

The language is the same in many countries, spoken with different accents. It is a Latin-based language, with Gaelic, Germanic, Greek and Arabic influence. It was earlier spoken in the Iberian Peninsula as Galician-Portuguese. With the formation of Portugal as a country in the 12th century, the language evolved into Portuguese. In the Spanish province of Galicia, on the northern border of Portugal, the native language is Galician. Portuguese and Galician are very similar, and native speakers can understand each other, as the two share a recent common ancestor. Portuguese and Spanish are different languages, although they share 89% of their lexicon.

Falkland Islands English

Falkland Islands English is mainly British in character. However, as a result of the isolation of the islands, the small population has developed and retains its own accent/dialect, which persists despite a large number of immigrants from the United Kingdom in recent years. In rural areas (i.e. anywhere outside Stanley), known as ‘Camp’ (from Spanish campo or ‘countryside’), the Falkland accent tends to be stronger. The dialect has resemblances to Australian, New Zealand, West Country and Norfolk dialects of English, as well as Lowland Scots.

Two notable Falkland Island terms are ‘kelper’ meaning a Falkland Islander, from the kelp surrounding the islands (sometimes used pejoratively in Argentina) and ‘smoko’, for a smoking break (as in Australia and New Zealand).

The word ‘yomp’ was used by the British armed forces during the Falklands War but is passing out of usage.

In recent years, a substantial Saint Helenian population has arrived, mainly to do low-paid work, and they too have a distinct form of English.

German Standard German

German Standard German, Standard German of Germany, or High German of Germany is the variety of Standard German that is written and spoken in Germany. It is the variety of German most commonly taught to foreigners.

It is not uniform, which means it has considerable regional variation. Anthony Fox asserts that British English is more standardized than German Standard German.

Guinean Portuguese

Guinean Portuguese (Portuguese: Português Guineense) is the variety of Portuguese spoken in Guinea-Bissau, where it is the official language.

Haitian French

Haitian French (French: français haïtien, Haitian Creole: fransè ayisyen) is the variety of French spoken in Haiti. Haitian French is close to standard French. It should be distinguished from Haitian Creole.

Languedocien dialect

Languedocien (French name) or Lengadocian (native name) is an Occitan dialect spoken in rural parts of southern France such as Languedoc, Rouergue, Quercy, Agenais and Southern Périgord. Due to its central position among the dialects of Occitan, it is often used as a basis for a Standard Occitan. About 10% of the population of Languedoc are fluent in the language (about 300,000), and another 20% (600,000) "have some understanding" of the language. All speak French as their first or second language.

Limousin dialect

Limousin (Occitan: Lemosin) is a dialect of the Occitan language, spoken in the three departments of Limousin, parts of Charente and the Dordogne in the southwest of France.

The first Occitan documents are in an early form of this dialect, particularly the Boecis, written around the year 1000.

Limousin is used primarily by people over age 50 in rural communities. All speakers speak French as a first or second language. Due to the French single language policy, it is not recognised by the government and might be disappearing. A revivalist movement around the Félibrige and the Institut d'Estudis Occitans is active in Limousin (as well as in other parts of Occitania).

Orokaiva language

Orokaiva is a Papuan language spoken in the "tail" of Papua New Guinea.

Provençal dialect

Provençal (Occitan: Provençau or Prouvençau [pʀuvenˈsaw]) is a variety of Occitan spoken by a minority of people in southern France, mostly in Provence. In the English-speaking world, the term Provençal has historically also been used to refer to all of Occitan, but is now mainly understood to refer to the variety spoken in Provence. Provençal is also the customary name given to the older version of the Occitan language used by the troubadours of medieval literature, while Old French or the langue d'oïl was limited to the northern areas of France. Thus the ISO 639-3 code for Old Occitan is [pro].

In 2007, all the ISO 639-3 codes for Occitan dialects, including [prv] for Provençal, were retired and merged into [oci] Occitan.

Sanapaná language

Sanapana (sanapana payvoma) is a language of the Paraguayan Chaco. Use is vigorous, and it is a language of instruction in primary schools.

Sanapaná people call themselves nenlhet; Enxet people call Sanapaná people saapa'ang; Guaná people call them kasnapan; and Enlhet people, kelya'mok.

Swabian German

Swabian (Schwäbisch) is one of the dialect groups of Alemannic German that belong to the High German dialect continuum. It is mainly spoken in Swabia which is located in central and southeastern Baden-Württemberg (including e.g. its capital Stuttgart and the Swabian Jura region) and the southwest of Bavaria (Bavarian Swabia). Furthermore, Swabian German dialects are spoken by Caucasus Germans in Transcaucasia. The dialects of the Danube Swabian population of Hungary, the former Yugoslavia and Romania are only nominally Swabian and can be traced back not only to Swabian but also to Frankonian, Bavarian and Hessian German dialects, with locally varying degrees of influence of the initial dialects.

Tarpia language

Tarpia is an Austronesian language spoken on the eastern north coast of Papua province, Indonesia.

See Sarmi languages for a comparison with related languages.

This page is based on a Wikipedia article written by authors (here).
Text is available under the CC BY-SA 3.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.