Non-breaking space

In word processing and digital typesetting, a non-breaking space (" "), also called no-break space, non-breakable space (NBSP), hard space, or fixed space,[note 1] is a space character that prevents an automatic line break at its position. In some formats, including HTML, it also prevents consecutive whitespace characters from collapsing into a single space.

In HTML, the common non-breaking space, which is the same width as the ordinary space character, is encoded as &nbsp; or &#160;. In Unicode, it is encoded as U+00A0.

Non-breaking space characters with other widths also exist.

Uses and variations

Despite having layout and uses similar to those of ordinary whitespace, the non-breaking space differs in contextual behavior.[1][2]

Non-breaking behavior

Text-processing software typically assumes that an automatic line break may be inserted anywhere a space character occurs; a non-breaking space prevents this from happening (provided the software recognizes the character). For example, if the text "100 km" will not quite fit at the end of a line, the software may insert a line break between "100" and "km". An editor who finds this behaviour undesirable may choose to use a non-breaking space between "100" and "km". This guarantees that the text "100 km" will not be broken: if it does not fit at the end of a line, it is moved in its entirety to the next line.
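The "100 km" case can be reproduced with a minimal Python sketch. It assumes the standard-library textwrap module as a stand-in for a text-processing program; textwrap breaks lines only at ASCII whitespace, so it leaves U+00A0 inside a word. The sample sentence and the width of 37 are illustrative choices, not part of any standard.

```python
import textwrap

NBSP = "\u00a0"  # U+00A0 NO-BREAK SPACE

# Illustrative sentence; "100" ends exactly where a 37-column line runs out.
plain = "The distance to the next town is 100 km by road"
joined = plain.replace("100 km", f"100{NBSP}km")

# With an ordinary space, the wrapper may split "100" from "km".
plain_lines = textwrap.wrap(plain, width=37)

# With a non-breaking space, "100 km" moves to the next line as one unit.
joined_lines = textwrap.wrap(joined, width=37)

print(plain_lines)
print(joined_lines)
```

Other wrapping engines may follow different rules (for example, the Unicode line breaking algorithm), so this shows the idea rather than the behavior of any particular word processor.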

Non-collapsing behavior

A second common application of non-breaking spaces is in plain text file formats such as SGML, HTML, TeX and LaTeX, whose rendering engines are programmed to treat sequences of whitespace characters (space, newline, tab, form feed, etc.) as if they were a single character (but this behavior can be overridden). Such "collapsing" of whitespace allows the author to neatly arrange the source text using line breaks, indentation and other forms of spacing without affecting the final typeset result.[3][4]

In contrast, non-breaking spaces are not merged with neighboring whitespace characters when displayed, and can therefore be used by an author to simply insert additional visible space in the resulting output without using spans styled with peculiar values of the CSS “white-space” property. Conversely, a non-breaking space placed indiscriminately next to a normal space (contrary to the use recommended in style guides) produces extraneous space in the output.
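The difference between collapsible whitespace and the non-breaking space can be illustrated with a simplified model. The function name collapse_whitespace is hypothetical, and the sketch only approximates what an HTML renderer does: it merges runs of space, tab, newline, carriage return and form feed into one space and leaves U+00A0 alone.

```python
import re

# Runs of collapsible whitespace. Python's \s would also match U+00A0,
# so the character class is spelled out explicitly.
_COLLAPSIBLE = re.compile(r"[ \t\n\r\f]+")


def collapse_whitespace(text: str) -> str:
    """Merge each run of collapsible whitespace into a single space."""
    return _COLLAPSIBLE.sub(" ", text)


source = "one   two\n\t three"
padded = "one\u00a0\u00a0\u00a0two"

print(repr(collapse_whitespace(source)))  # runs are merged
print(repr(collapse_whitespace(padded)))  # non-breaking spaces survive
```

A real renderer applies further rules (for instance at line starts and ends), which this sketch does not attempt to model.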

Width variation

Other non-breaking variants, defined in Unicode:

  • U+202F NARROW NO-BREAK SPACE (HTML &#8239; · NNBSP). It was introduced in Unicode 3.0 for Mongolian,[5] to separate a suffix[6] from the word stem without indicating a word boundary. It is also required for big punctuation in French, sometimes inaccurately referred to as "double punctuation" (before ;, ?, !, » and after «; today often also before :), in Russian (before em dashes [—]), and in German between multi-part abbreviations (e.g. "z. B.", "d. h.", "v. l. n. r.").[7] When used with Mongolian, its width is usually one third of the normal space; in other contexts, its width is about 70% of the normal space but may resemble that of the thin space (U+2009), at least with some fonts.[8] Also, starting from release 34 of the Unicode Common Locale Data Repository (CLDR), the NNBSP is used in numbers as the thousands group separator for the French locale.[9]
  • U+2007 FIGURE SPACE (HTML &#8199;). Produces a space equal in width to the figure (0–9) characters.
  • U+2060 WORD JOINER (HTML &#8288; · WJ). Encoded in Unicode since version 3.2, the word joiner does not produce any space and prohibits a line break at its position.
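The characters listed above can be inspected with Python's standard unicodedata module. The helper name describe is hypothetical; the names, general categories and decompositions it prints come from the Unicode Character Database bundled with the interpreter.

```python
import unicodedata


def describe(ch: str) -> tuple:
    """Return (code point, character name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))


for ch in ("\u00a0", "\u202f", "\u2007", "\u2060"):
    # Space-like no-break characters decompose to "<noBreak> 0020";
    # the word joiner has no decomposition, so "-" is printed instead.
    print(*describe(ch), unicodedata.decomposition(ch) or "-")
```

The three space variants are reported in category Zs (space separator), while the word joiner is a format character (Cf), which matches the statement that it produces no space.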

Encodings

Format | Representation of non-breaking space
Unicode and ISO/IEC 10646 | U+00A0 NO-BREAK SPACE
UTF-8 | C2 A0
ISO/IEC 8859 (1-16) / ECMA-94 | A0
Windows code pages: 1250, 1251, 1252, 1253, 1254, 1255, 1256, 1257, 1258 | A0
KOI8-R, KOI8-U | 9A
EBCDIC | 41 (RSP, Required Space)
DOS code pages: 437, 850, 851, 852, 853, 855, 856, 857, 858, 859, 860, 861, 862, 863, 864, 865, 866, 867, 869 | FF
HTML (including Wikitext) | &nbsp; (character entity reference); &#160; or &#xA0; (numeric character references)
TeX | ~ (tilde)
HP Roman-8, HP Roman-9 | A0
LICS | 9A
ASCII, ISO/IEC 646 | Not available
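Several rows of this table can be checked against the codecs that ship with Python. The codec names below (for example cp037 for one EBCDIC code page and koi8_r for KOI8-R) are Python's own identifiers and cover only a sample of the formats listed; this is a spot check, not an exhaustive verification.

```python
NBSP = "\u00a0"  # U+00A0 NO-BREAK SPACE

# A sample of Python codec names corresponding to rows of the table.
CODECS = ["utf-8", "latin-1", "cp1252", "koi8_r", "cp037", "cp437", "cp850"]

# Map each codec to the hexadecimal bytes it produces for the non-breaking space.
encoded = {name: NBSP.encode(name).hex().upper() for name in CODECS}

for name, hex_bytes in encoded.items():
    print(f"{name:8} {hex_bytes}")

# ASCII has no non-breaking space, so encoding fails.
try:
    NBSP.encode("ascii")
    ascii_available = True
except UnicodeEncodeError:
    ascii_available = False
print("ascii   ", "available" if ascii_available else "not available")
```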

Unicode defines several other no-break space characters; see Width variation above. Encoding remarks:

  • Word joiner, encoded in Unicode 3.2 and above as U+2060, and in HTML as &#8288; or &NoBreak;.
  • Byte order mark (BOM), U+FEFF, which may be interpreted as a "zero width no-break space", a deprecated alternative to word joiner.

Keyboard entry methods

It is rare for national or international standards on keyboard layouts to define an input method for the non-breaking space. An exception is the Finnish multilingual keyboard, accepted as the national standard SFS 5966 in 2008. According to the SFS setting, the non-breaking space can be entered with the key combination AltGr + Space.[10]

Typically, authors of keyboard drivers and application programs (e.g., word processors) have devised their own keyboard shortcuts for the non-breaking space. For example:

System/application | Entry method
Microsoft Windows | Alt+0160 or Alt+255 (doesn't always work)
macOS | Opt+Space
Linux or Unix using X11 | Compose, Space, Space or AltGr+Space
AmigaOS | Alt+Space
GNU Emacs | Ctrl+X 8 Space
Vim | Ctrl+K, Space, Space; or Ctrl+K, Shift+N, Shift+S
Dreamweaver, LibreOffice, Microsoft Word, OpenOffice.org (since 3.0) | Ctrl+Shift+Space
FrameMaker, LyX (non-Mac), OpenOffice.org (before 3.0), WordPerfect | Ctrl+Space
Mac Adobe InDesign | Opt+Cmd+X

Apart from this, applications and environments often have methods of entering Unicode characters directly via their code point, e.g. via the Alt Numpad input method. (Non-breaking space has code point 255 decimal (FF hex) in codepage 437 and codepage 850, and code point 160 decimal (A0 hex) in codepage 1252.)

Notes

  1. ^ The use of the term "fixed space" for no-break space is strongly discouraged, as it is confusable with the term "fixed-width space".

References

  1. ^ Elyaakoubi, Mohamed; Lazrek, Azzeddine (2010). "Justify Just or Just Justify". The Journal of Electronic Publishing. 13. doi:10.3998/3336451.0013.105.
  2. ^ "Special Characters". The Chicago Manual of Style Online.
  3. ^ "Structure", HTML 4.01, W3, 1999-12-24.
  4. ^ "Text", CSS 2.1, W3.
  5. ^ ISO/IEC 10646-1:1993/FDAM 29:1999(E)
  6. ^ Mongolian NNBSP-connected suffixes
  7. ^ Solbrig, Amelie (30 January 2008). "Zweisprachige Mikrotypografie" (PDF) (in German). Hochschule für Technik, Wirtschaft und Kultur Leipzig. p. 58 (PDF p. 113). Archived from the original (PDF) on 2016-03-11. Retrieved 10 June 2018. Alle Abkürzungen mit Binnenpunkten werden im Deutschen mit einem gFL [geschütztes flexibles Leerzeichen] spationiert. [...] Die englische Schreibweise sieht keine Abstände zwischen einzelnen Buchstaben vor. Nach einem Binnenpunkt folgt demnach ohne gFL sofort der nächste Buchstabe.
  8. ^ "Writing Systems and Punctuation" (PDF). The Unicode Standard 7.0. Unicode Inc. 2014. Retrieved 2014-11-02.
  9. ^ "CLDR Chart: Numbers".
  10. ^ Kotoistus (2006-12-28), Uusi näppäinasettelu [Status of the new keyboard layout] (presentation) (in Finnish and English), CSC – IT Center for Science, archived from the original on 2011-07-27. Drafts of the Finnish multilingual keyboard.

0W

0W (zero W) or 0-W may refer to:

  • 0W, zero west, or 0°W, coordinate of the prime meridian
  • 0W or ZW, or zero width, a non-printing character used in computer typesetting of some complex scripts:
      • Zero-width joiner
      • Zero-width non-joiner
      • Zero-width space
      • Zero-width non-breaking space
  • Zero waste, an environmental concept

A0

A0, A-0, A₀, or a₀ may refer to:

  • 101 A0 and 103 A0, two versions of the German Heinkel Tourist moped
  • A0 paper size, an international ISO 216 standard paper size (841 × 1189 mm), which results in an area very close to 1 m²
  • A0 highway (Zimbabwe), a highway which orbits Zimbabwe
  • A0, the lowest A note on a standard piano
  • A0, a climbing grade
  • A00, the Irregular chess openings code in the Encyclopaedia of Chess Openings
  • A-0 Geyser, a geyser in Yellowstone National Park
  • A-0 System, an early compiler-related tool developed for electronic computers
  • A0, the IATA airline designator for the French airline L'Avion
  • Characters of type A₀, an older term for algebraic Hecke characters
  • a₀, the accepted mathematical symbol for the Bohr radius
  • Haplogroup A00 and A0; see Y-chromosomal Adam and Haplogroup A (Y-DNA)
  • A0, a subdivision in stellar classification
  • A0, sometimes written as 0xA0, the hexadecimal representation of the non-breaking space in various character encoding standards

ArmSCII

ArmSCII or ARMSCII is a set of obsolete single-byte character encodings for the Armenian alphabet defined by Armenian national standard 166-9. ArmSCII is an acronym for Armenian Standard Code for Information Interchange, similar to ASCII for the American standard. It has been superseded by the Unicode standard.

However, these encodings are not widely used. The standard was published one year after the international standard ISO 10585, which defined another 7-bit encoding and from which the encoding and mapping to the UCS (Universal Coded Character Set, ISO/IEC 10646) and Unicode standards were derived a few years later, and the computer industry showed little support for adding ArmSCII.

Byte order mark

The byte order mark (BOM) is a Unicode character, U+FEFF BYTE ORDER MARK (BOM), whose appearance as a magic number at the start of a text stream can signal several things to a program reading the text:

  • The byte order, or endianness, of the text stream;
  • The fact that the text stream's encoding is Unicode, to a high level of confidence;
  • Which Unicode encoding the text stream is encoded as.

BOM use is optional. Its presence interferes with the use of UTF-8 by software that does not expect non-ASCII bytes at the start of a file but that could otherwise handle the text stream.

Unicode can be encoded in units of 8-bit, 16-bit, or 32-bit integers. For the 16- and 32-bit representations, a computer receiving text from arbitrary sources needs to know which byte order the integers are encoded in. The BOM is encoded in the same scheme as the rest of the document and becomes a non-character Unicode code point if its bytes are swapped. Hence, the process accessing the text can examine these first few bytes to determine the endianness, without requiring some contract or metadata outside of the text stream itself. Generally the receiving computer will swap the bytes to its own endianness, if necessary, and would no longer need the BOM for processing.

The byte sequence of the BOM differs per Unicode encoding (including ones outside the Unicode standard such as UTF-7), and none of the sequences is likely to appear at the start of text streams stored in other encodings. Therefore, placing an encoded BOM at the start of a text stream can indicate that the text is Unicode and identify the encoding scheme used. This use of the BOM character is called a "Unicode signature".
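Reading a Unicode signature can be sketched with the BOM constants from Python's standard codecs module. The function name sniff_bom and its return convention are hypothetical, and only UTF-8, UTF-16 and UTF-32 are covered. The UTF-32 signatures are tested first because the little-endian UTF-32 BOM (FF FE 00 00) begins with the little-endian UTF-16 BOM (FF FE).

```python
import codecs

# Longest signatures first, so UTF-32 LE is not mistaken for UTF-16 LE.
_SIGNATURES = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes) -> tuple:
    """Return (encoding name, BOM length), or (None, 0) if no BOM is found."""
    for bom, name in _SIGNATURES:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


sample = codecs.BOM_UTF16_BE + "100\u00a0km".encode("utf-16-be")
encoding, skip = sniff_bom(sample)
print(encoding, repr(sample[skip:].decode(encoding)))
```

A result of (None, 0) does not prove the text is not Unicode, since BOM use is optional. UTF-16 little-endian text whose first character after the BOM is U+0000 would also be misread as UTF-32 by this heuristic.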

Ellipsis

An ellipsis (plural ellipses; from the Ancient Greek: ἔλλειψις, élleipsis, 'omission' or 'falling short') is a series of dots (typically three, such as "…") that usually indicates an intentional omission of a word, sentence, or whole section from a text without altering its original meaning.

Opinions differ as to how to render ellipses in printed material. According to the Chicago Manual of Style, each dot should be separated from its neighbor by a non-breaking space. Such spaces should be omitted, however, according to the Associated Press. A third option, illustrated in the opening sentence of this article, is to use the precomposed Unicode character with code point U+2026, in which the gaps are not as wide as standard spaces.
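The three renderings can be written out as Python strings for comparison. The names chicago_ellipsis, AP_ELLIPSIS and PRECOMPOSED are hypothetical labels for this sketch; the style rules themselves are those described above.

```python
import unicodedata

NBSP = "\u00a0"  # U+00A0 NO-BREAK SPACE


def chicago_ellipsis(dots: int = 3) -> str:
    """Dots separated by non-breaking spaces, so the ellipsis cannot be split."""
    return NBSP.join("." * dots)


AP_ELLIPSIS = "..."      # dots with no spaces between them
PRECOMPOSED = "\u2026"   # single precomposed character

print(repr(chicago_ellipsis()))
print(repr(AP_ELLIPSIS))
print(repr(PRECOMPOSED), unicodedata.name(PRECOMPOSED))
```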

Em (typography)

An em is a unit in the field of typography, equal to the currently specified point size. For example, one em in a 16-point typeface is 16 points. Therefore, this unit is the same for all typefaces at a given point size.

The em dash (—) and em space ( ) are each one em wide.

Typographic measurements using this unit are frequently expressed in decimal notation (e.g., 0.7 em) or as fractions of 100 or 1000 (e.g., 70/100 em or 700/1000 em). The name em was originally a reference to the width of the capital M in the typeface and size being used, which was often the same as the point size.
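The arithmetic can be made explicit with a short sketch. The function name em_to_points is hypothetical, and exact fractions are used so that the decimal, per-hundred and per-thousand notations can be compared without rounding.

```python
from fractions import Fraction


def em_to_points(ems, point_size):
    """Convert a measurement in ems to points at the given point size."""
    return ems * point_size


# One em in a 16-point typeface is 16 points.
print(em_to_points(1, 16))

# 0.7 em, 70/100 em and 700/1000 em denote the same measurement.
as_hundredths = Fraction(70, 100)
as_thousandths = Fraction(700, 1000)
print(as_hundredths == as_thousandths)

# At 16 points this is 56/5 points, i.e. 11.2 points.
print(em_to_points(as_hundredths, 16))
```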

En (typography)

An en is a typographic unit, half of the width of an em. By definition, it is equivalent to half of the height of the font (e.g. in 16 point type it is 8 points). As its name suggests, it is also traditionally the width of an uppercase letter "N".

The en dash (–) and en space ( ) are each one en wide. In English, the en dash is commonly used for inclusive ranges (e.g., "pages 12–17" or "August 7, 1988 – November 26, 2005"), and increasingly to replace the long dash ("—", also called an em dash or em rule). (When using it to replace a long dash, spaces are needed on either side of it – like so.) This is standard practice in German, another Germanic language, where the hyphen is the only dash without spaces on either side. (And line breaks are not spaces per se.)

Figure space

A figure space is a typographic unit equal to the size of a single typographic figure (numeral or letter), minus leading. Its size can fluctuate somewhat depending on which font is being used. This is the preferred space to use in numbers. It has the same width as a digit and keeps the number together for the purpose of line breaking.

Hard space

In typesetting and text editors, the term hard space has several meanings, all related to a special way of representing the space between characters.

The most commonly used meaning is the same as non-breaking space, a special space character used by a word processor that forbids an automatic line break (line wrap) at its position.

In earlier days of text editors that worked with text mode CRT displays, when a paragraph had to be justified, this was achieved by means of inserting extra soft spaces at whitespaces. The soft spaces were so called because they could be "compressed" away during further editing. By contrast, ordinary spaces were called hard or incompressible spaces.

Also, in some older text editors, the hard spaces were both non-expandable—i.e., no soft spaces could be added to them—and non-breaking ones.

In many term programs and game parsers, a hard space was a special kind of field delimiter, against which a filename could be examined or listed, or a semantic thought or consideration could be interpreted.

In the Commodore directory system, a hard space usually terminated the spelling of a filename, and was replaced with a quotation mark when listed to the user.

ISO/IEC 8859-11

ISO/IEC 8859-11:2001, Information technology — 8-bit single-byte coded graphic character sets — Part 11: Latin/Thai alphabet, is part of the ISO/IEC 8859 series of ASCII-based standard character encodings, first edition published in 2001. It is informally referred to as Latin/Thai. It is nearly identical to the national Thai standard TIS-620 (1990). The sole difference is that ISO/IEC 8859-11 allocates non-breaking space to code 0xA0, while TIS-620 leaves it undefined. (In practice, this small distinction is usually ignored.)

ISO-8859-11 is not a main registered IANA charset name despite following the normal pattern for IANA charsets based on the ISO 8859 series. However, it is defined as an alias of the close equivalent TIS-620 (which lacks the non-breaking space), which can be used for ISO/IEC 8859-11 without problems, since the no-break space has a code that was unallocated in TIS-620. Microsoft has assigned code page 28601, a.k.a. Windows-28601, to ISO-8859-11 in Windows. A draft had the Thai letters in different spots.

As with all varieties of ISO/IEC 8859, the lower 128 codes are equivalent to ASCII. The additional characters, apart from no-break space, are found in Unicode in the same order, only shifted from 0xA1 to U+0E01 and so forth.
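The single-code difference can be observed with Python's bundled codecs, under the assumption that its iso8859_11 and tis_620 codecs follow the two standards as described here. The helper name decode_byte is hypothetical.

```python
def decode_byte(value: int, codec: str):
    """Decode one byte with the given codec; return None if it is unassigned."""
    try:
        return bytes([value]).decode(codec)
    except UnicodeDecodeError:
        return None


# 0xA0 is the no-break space in ISO/IEC 8859-11 but unassigned in TIS-620.
print(repr(decode_byte(0xA0, "iso8859_11")))
print(repr(decode_byte(0xA0, "tis_620")))

# Thai letters keep their order, shifted by a constant: 0xA1 maps to U+0E01.
first_thai = decode_byte(0xA1, "iso8859_11")
print(f"U+{ord(first_thai):04X}", hex(ord(first_thai) - 0xA1))
```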

The Microsoft Windows code page 874 as well as the code page used in the Thai version of the Apple Macintosh, MacThai, are extensions of TIS-620 — incompatible with each other, however.

Lontara script

The Lontara script is a Brahmic script traditionally used for the Bugis, Makassarese and Mandar languages of Sulawesi in Indonesia. It is also known as the Bugis script, as Lontara documents written in this language are the most numerous.

It was largely replaced by the Latin alphabet during the period of Dutch colonization, though it is still used today to a limited extent. The term Lontara is derived from the Malay name for palmyra palm, lontar, whose leaves are traditionally used for manuscripts. In Buginese, this script is called urupu sulapa eppa which means "four-cornered letters", referencing the Bugis-Makasar belief of the four elements that shaped the universe: fire, water, air and earth.

Non-printing character in word processors

Non-printing characters, or formatting marks, are characters used for laying out content in word processors that are not displayed when printing. It is also possible to customize their display on the monitor. The most common non-printing characters in word processors are the pilcrow, space, non-breaking space, tab character, etc.

Percent sign

The percent (per cent) sign (%) is the symbol used to indicate a percentage, a number or ratio as a fraction of 100. Related signs include the permille (per thousand) sign ‰ and the permyriad (per ten thousand) sign ‱ (also known as a basis point), which indicate that a number is divided by one thousand or ten thousand respectively. Higher proportions use parts-per notation.

Section sign

The section sign (§) is a typographical glyph for referencing individual numbered sections of a document, frequently used when referring to legal code. Encoded as Unicode U+00A7 § SECTION SIGN and HTML &sect;, it is also commonly called section symbol, section mark, double-s, silcrow, or alternatively paragraph mark in parts of Europe.

Thin space

In typography, a thin space is a space character that is usually 1/5 or 1/6 of an em in width. It is used to add a narrow space, such as between nested quotation marks or to separate glyphs that interfere with one another. It is not as narrow as the hair space.

In Unicode, thin space is encoded at U+2009 THIN SPACE (HTML &#8201; · &thinsp;). Unicode's U+202F NARROW NO-BREAK SPACE (HTML &#8239;) is a non-breaking space with a width similar to that of the thin space.

In LaTeX and Plain TeX, \thinspace produces a narrow, non-breaking space. Outside of math formulae in LaTeX, \, also produces a narrow, non-breaking space, but inside math formulas it produces a narrow, breakable space.

In some versions of Microsoft Word, the symbol dialog (often available via Insert > Symbol or Insert > Special Characters), has both the thin space and the narrow no-break space available for point-and-click insertion. In Word's Symbol dialog, under font = "(normal text)", they are found in subset = "General Punctuation", Unicode character 2009 and nearby. Other word processing programs have ways of producing a thin space.

The International System of Units uses the thin space as a thousands separator. Neither a point nor a comma should be used as both of these are reserved for use as decimal markers.
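Digit grouping with a thin space can be sketched in Python by reusing the built-in comma grouping and swapping the separator. The function name group_digits is hypothetical, and integers only are handled. Because U+2009 itself permits a line break, the sketch also accepts the narrow no-break space U+202F as an alternative separator of similar width.

```python
THIN_SPACE = "\u2009"  # U+2009 THIN SPACE
NNBSP = "\u202f"       # U+202F NARROW NO-BREAK SPACE


def group_digits(number: int, separator: str = THIN_SPACE) -> str:
    """Group the digits of an integer in threes using the given separator."""
    # Format with commas first, then replace each comma with the separator.
    return f"{number:,}".replace(",", separator)


print(repr(group_digits(1234567)))
print(repr(group_digits(1234567, NNBSP)))
```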

Trimming (computer programming)

In computer programming, trimming (trim) or stripping (strip) is a string manipulation in which leading and trailing whitespace is removed from a string.

For example, a string such as '  hello world  ' (enclosed by apostrophes) would be changed, after trimming, to 'hello world'.
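Trimming interacts with the non-breaking space, as a short Python sketch shows. It relies on str.strip, which with no argument removes every character Python classifies as whitespace (including U+00A0), while an explicit character set removes only the characters listed. The sample string is illustrative.

```python
NBSP = "\u00a0"  # U+00A0 NO-BREAK SPACE

padded = f" {NBSP}\thello world{NBSP} \n"

# Default trimming treats the non-breaking space as whitespace.
trimmed_all = padded.strip()

# Trimming only ASCII space, tab and newline stops at the non-breaking space.
trimmed_ascii = padded.strip(" \t\n")

print(repr(trimmed_all))
print(repr(trimmed_ascii))
```

Trim functions in other languages and libraries differ on whether U+00A0 counts as whitespace, so the behavior shown is specific to Python.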

VSCII

VSCII (Vietnamese Standard Code for Information Interchange), also known as TCVN 5712:1993 and ISO-IR-180, is a set of three Vietnamese national standard character encodings for using the Vietnamese language with computers. It should not be confused with the similarly named unofficial VISCII encoding.

Unicode and the Windows-1258 code page are now used for virtually all Vietnamese computer data, but legacy VSCII and VISCII files may need conversion.

Whitespace character

In computer programming, whitespace is any character or series of characters that represent horizontal or vertical space in typography. When rendered, a whitespace character does not correspond to a visible mark, but typically does occupy an area on a page. For example, the common whitespace symbol U+0020 SPACE (also ASCII 32) represents a blank space punctuation character in text, used as a word divider in Western scripts.

Word joiner

The word joiner (WJ) is a code point in Unicode used to indicate that word separation should not occur at a position, when using scripts that do not use explicit spacing. It has been encoded since Unicode version 3.2 (released in 2002) as U+2060. The word joiner does not produce any space and prohibits a line break at its position.

The word joiner replaces the zero width no-break space (ZWNBSP), a deprecated use of the Unicode character at code point U+FEFF. Character U+FEFF is intended for use as a Byte Order Mark (BOM) at the start of a file. However, if encountered elsewhere, it should, according to Unicode, be treated as a "zero width no-break space". The dedicated use of U+FEFF for this purpose is deprecated as of Unicode 3.2, with the word joiner strongly preferred.
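Migrating text from the deprecated usage can be sketched as follows. The function name modernize_zwnbsp is hypothetical, and the sketch assumes that a U+FEFF at the very start of the string is a byte order mark to be kept, while any later occurrence was meant as a zero width no-break space.

```python
BOM = "\ufeff"          # U+FEFF, byte order mark / deprecated ZWNBSP
WORD_JOINER = "\u2060"  # U+2060 WORD JOINER


def modernize_zwnbsp(text: str) -> str:
    """Replace U+FEFF with U+2060 everywhere except at the start of the text."""
    if text.startswith(BOM):
        head, rest = text[:1], text[1:]
    else:
        head, rest = "", text
    return head + rest.replace(BOM, WORD_JOINER)


print(repr(modernize_zwnbsp("\ufeffab\ufeffcd")))
```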

This page is based on a Wikipedia article written by its contributors.
Text is available under the CC BY-SA 3.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.