Binary Ordered Compression for Unicode (BOCU) is a MIME-compatible Unicode compression scheme. BOCU-1 combines the wide applicability of UTF-8 with the compactness of the Standard Compression Scheme for Unicode (SCSU). This Unicode encoding is designed to be useful for compressing short strings, and it maintains code point order. BOCU-1 is specified in a Unicode Technical Note.
For comparison, SCSU was adopted as a standard Unicode compression scheme with a byte/code point ratio similar to that of language-specific code pages. SCSU has not been widely adopted, as it is not suitable for MIME “text” media types; for example, it cannot be used directly in emails and similar protocols. SCSU also requires a complicated encoder design for good performance. For larger amounts of Unicode text, zip, bzip2, and other industry-standard algorithms usually compress more efficiently.
All numbers in this section are hexadecimal, and all ranges are inclusive.
Code points from U+0000 to U+0020 are encoded in BOCU-1 as the corresponding byte value. All other code points (that is, U+0021 to U+10FFFF) are encoded as a difference between the code point and a normalized version of the most recently encoded code point that was not an ASCII space (U+0020). The initial state is U+0040. The normalization mapping is as follows:
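|Code range||Normalized code point||Notes|
|U+0020||encoder state kept as is||Space|
|U+3040 to U+309F||U+3070||Hiragana|
|U+4E00 to U+9FA5||U+7711||CJK Unified Ideographs|
|U+AC00 to U+D7A3||U+C1D1||Hangul Syllables|
|U+xx00 to U+xx7F (excluding ranges above)||U+xx40||Middle of the 128-code-point block|
|U+xx80 to U+xxFF (excluding ranges above)||U+xxC0||Middle of the 128-code-point block|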
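The normalization step can be sketched as a small function. This is a minimal illustration, assuming the range boundaries and normalized values of the BOCU-1 specification as understood here (Hiragana, CJK Unified Ideographs, Hangul Syllables, and otherwise the middle of the 128-code-point block); the name `next_state` is hypothetical and not taken from any library.

```python
def next_state(prev: int, c: int) -> int:
    """Return the encoder state after code point c has been encoded.

    prev is the current state; the initial state is 0x40.
    """
    if c == 0x20:
        return prev                # space keeps the state as is
    if 0x3040 <= c <= 0x309F:
        return 0x3070              # Hiragana (assumed from the specification)
    if 0x4E00 <= c <= 0x9FA5:
        return 0x7711              # CJK Unified Ideographs (assumed)
    if 0xAC00 <= c <= 0xD7A3:
        return 0xC1D1              # Hangul Syllables (assumed)
    return (c & ~0x7F) + 0x40      # middle of the 128-code-point block


state = 0x40
for ch in "a é":
    state = next_state(state, ord(ch))
    print(f"U+{ord(ch):04X} -> state U+{state:04X}")
```

Small scripts therefore produce small differences: consecutive letters from the same 128-code-point block are always within 7F of the block's middle.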
The difference between the current code point and the normalized previous code point is encoded as follows:
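|Difference range||Byte sequence range|
|-10FF9F to -2DD0D||21 F0 58 D9 to 21 FF FF FF|
|-2DD0C to -2912||22 01 01 to 24 FF FF|
|-2911 to -41||25 01 to 4F FF|
|-40 to 3F||50 to CF|
|40 to 2910||D0 01 to FA FF|
|2911 to 2DD0B||FB 01 01 to FD FF FF|
|2DD0C to 10FFBF||FE 01 01 01 to FE 19 B4 54|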
Each byte range is lexicographically ordered with the following thirteen byte values excluded: 00 07 08 09 0A 0B 0C 0D 0E 0F 1A 1B 20. For example, the byte sequence FC 06 FF, coding for a difference of 1156B, is immediately followed by the byte sequence FC 10 01, coding for a difference of 1156C.
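The difference encoding can be sketched as follows. The thirteen excluded byte values come from the text above; the range boundaries and lead byte bases in `POSITIVE` and `NEGATIVE` are assumptions derived from the BOCU-1 specification as understood here, and all names are illustrative. The sketch treats the trail bytes as base-243 digits and lets the lead byte absorb the most significant part.

```python
# Byte values that never occur as trail bytes (the thirteen listed above).
EXCLUDED = {0x00, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D,
            0x0E, 0x0F, 0x1A, 0x1B, 0x20}
# The 243 remaining byte values in ascending order; the index is the digit.
TRAIL_BYTES = [b for b in range(256) if b not in EXCLUDED]

# (difference base, lead byte base, number of trail bytes); assumed values.
POSITIVE = [(0x2DD0C, 0xFE, 3), (0x2911, 0xFB, 2), (0x40, 0xD0, 1)]
NEGATIVE = [(-0x2DD0C, 0x22, 3), (-0x2911, 0x25, 2), (-0x40, 0x50, 1)]


def encode_diff(diff: int) -> bytes:
    """Encode one difference as a BOCU-1 byte sequence (sketch)."""
    if -0x40 <= diff <= 0x3F:
        return bytes([0x90 + diff])          # single byte 50..CF
    if diff > 0:
        base, lead, n = next(r for r in POSITIVE if diff >= r[0])
    else:
        base, lead, n = next(r for r in NEGATIVE if diff < r[0])
    m = diff - base
    trail = []
    for _ in range(n):
        m, digit = divmod(m, 243)            # floor division, also for m < 0
        trail.append(TRAIL_BYTES[digit])
    return bytes([lead + m] + trail[::-1])


print(encode_diff(0x1156B).hex(" "))         # fc 06 ff
print(encode_diff(0x1156C).hex(" "))         # fc 10 01
```

The two printed sequences reproduce the example from the text: the trail byte after 06 is 10 because 07 through 0F are excluded.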
Any ASCII input U+0000 to U+007F, excluding space U+0020, resets the encoder to U+0040. Because the above-mentioned values include the line end code points U+000D and U+000A, which are encoded as is (0D 0A), the encoder is in a known state at the beginning of each line. The corruption of a single byte therefore affects at most one line. For comparison, the corruption of a single byte in UTF-8 affects at most one code point, whereas in SCSU it can affect the entire document.
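The per-line reset can be observed with a compact encoder sketch. It assumes the normalization values and difference ranges of the BOCU-1 specification as understood here; `bocu1_encode` and the helper names are hypothetical, and the sketch does no validation (for example of surrogates).

```python
EXCLUDED = {0x00, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D,
            0x0E, 0x0F, 0x1A, 0x1B, 0x20}
TRAIL_BYTES = [b for b in range(256) if b not in EXCLUDED]
# (difference base, lead byte base, number of trail bytes); assumed values.
POSITIVE = [(0x2DD0C, 0xFE, 3), (0x2911, 0xFB, 2), (0x40, 0xD0, 1)]
NEGATIVE = [(-0x2DD0C, 0x22, 3), (-0x2911, 0x25, 2), (-0x40, 0x50, 1)]


def next_state(prev: int, c: int) -> int:
    if c == 0x20:
        return prev
    for lo, hi, mid in ((0x3040, 0x309F, 0x3070), (0x4E00, 0x9FA5, 0x7711),
                        (0xAC00, 0xD7A3, 0xC1D1)):
        if lo <= c <= hi:
            return mid
    return (c & ~0x7F) + 0x40


def encode_diff(diff: int) -> bytes:
    if -0x40 <= diff <= 0x3F:
        return bytes([0x90 + diff])
    rules = POSITIVE if diff > 0 else NEGATIVE
    base, lead, n = next(r for r in rules
                         if (diff >= r[0] if diff > 0 else diff < r[0]))
    m, trail = diff - base, []
    for _ in range(n):
        m, digit = divmod(m, 243)
        trail.append(TRAIL_BYTES[digit])
    return bytes([lead + m] + trail[::-1])


def bocu1_encode(text: str) -> bytes:
    out, prev = bytearray(), 0x40            # initial state U+0040
    for ch in text:
        c = ord(ch)
        # U+0000..U+0020 are written as is; everything else as a difference.
        out += bytes([c]) if c <= 0x20 else encode_diff(c - prev)
        prev = next_state(prev, c)
    return bytes(out)


line = bocu1_encode("abc")
document = bocu1_encode("日本語\r\nabc")
print(document.endswith(b"\r\n" + line))     # True
```

The second line is encoded exactly as it would be on its own, because CR and LF put the encoder back into the state U+0040.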
BOCU-1 offers similar robustness for input texts without the above-mentioned values by means of the special reset code 0xFF. When a decoder finds this octet, it resets its state to U+0040, as for a line end. The BOCU-1 specification does not recommend the use of 0xFF reset bytes, because it conflicts with other BOCU-1 design goals, notably the binary order.
The optional use of a signature U+FEFF at the beginning of BOCU-1 encoded texts, i.e. the BOCU-1 byte sequence FB EE 28, changes the initial state from U+0040 to U+FEC0. In other words, the signature cannot simply be stripped as in most other Unicode encoding schemes. Adding a reset byte after the signature (FB EE 28 FF) could avoid this effect, but the BOCU-1 specification does not recommend this practice.
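A decoder sketch makes the effect of the signature and of the reset byte visible. It assumes the lead byte ranges and normalization values of the BOCU-1 specification as understood here; `bocu1_decode` and `LEAD_RULES` are hypothetical names, and the function returns raw integers without validating them.

```python
EXCLUDED = {0x00, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D,
            0x0E, 0x0F, 0x1A, 0x1B, 0x20}
TRAIL_DIGIT = {b: i for i, b in
               enumerate(b for b in range(256) if b not in EXCLUDED)}

# (first lead, last lead, lead base, difference base, trail count); assumed.
LEAD_RULES = [
    (0x21, 0x21, 0x22, -0x2DD0C, 3), (0x22, 0x24, 0x25, -0x2911, 2),
    (0x25, 0x4F, 0x50, -0x40, 1), (0x50, 0xCF, 0x90, 0, 0),
    (0xD0, 0xFA, 0xD0, 0x40, 1), (0xFB, 0xFD, 0xFB, 0x2911, 2),
    (0xFE, 0xFE, 0xFE, 0x2DD0C, 3),
]


def next_state(prev: int, c: int) -> int:
    if c == 0x20:
        return prev
    for lo, hi, mid in ((0x3040, 0x309F, 0x3070), (0x4E00, 0x9FA5, 0x7711),
                        (0xAC00, 0xD7A3, 0xC1D1)):
        if lo <= c <= hi:
            return mid
    return (c & ~0x7F) + 0x40


def bocu1_decode(data: bytes) -> list:
    """Decode to a list of integers; no validation is attempted."""
    out, prev, i = [], 0x40, 0
    while i < len(data):
        b = data[i]
        i += 1
        if b == 0xFF:                        # reset byte in lead position
            prev = 0x40
            continue
        if b <= 0x20:
            c = b                            # encoded as is
        else:
            _, _, lead_base, diff_base, n = next(
                r for r in LEAD_RULES if r[0] <= b <= r[1])
            value = b - lead_base
            for t in data[i:i + n]:          # trail bytes are base-243 digits
                value = value * 243 + TRAIL_DIGIT[t]
            i += n
            c = prev + diff_base + value
        out.append(c)
        prev = next_state(prev, c)
    return out


with_sig = bytes.fromhex("FBEE28241E52")     # signature, then "a"
print([hex(c) for c in bocu1_decode(with_sig)])
print(bocu1_decode(with_sig[3:]) == [0x61])  # False: stripping breaks the text
print(bocu1_decode(bytes.fromhex("FBEE28FFB1")) == [0xFEFF, 0x61])
```

After the signature the letter "a" is coded relative to U+FEC0, so removing the three signature bytes leaves a sequence that no longer decodes to "a" unless a reset byte had been inserted.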
In theory, UTF-1 and UTF-8 could encode the original UCS-4 set with 31 bits up to 7FFFFFFF. BOCU-1 and UTF-16 can encode the modern Unicode set from U+0000 to U+10FFFF. Excluding the thirteen protected code points, which are encoded as single octets, BOCU-1 can use 243 octets (256 - 13) in multi-byte encodings. BOCU-1 needs at most four bytes per code point: a lead byte followed, in multi-byte sequences, by one to three trail bytes. The trail bytes encode the remaining difference in base 243 ("modulo 243"); the lead byte determines the number of trail bytes and an initial difference.
Note that the reset byte 0xFF is not protected and can occur as a trail byte.
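The arithmetic behind the four-byte limit can be checked in a few lines. The excluded byte values come from the text; the number of lead bytes per sequence length (43, 3 and 1 in each direction) is an assumption based on the BOCU-1 specification as understood here.

```python
EXCLUDED = {0x00, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D,
            0x0E, 0x0F, 0x1A, 0x1B, 0x20}
TRAIL_COUNT = 256 - len(EXCLUDED)        # byte values usable as trail bytes

# Assumed lead bytes per direction for 2-, 3- and 4-byte sequences.
LEADS = {2: 43, 3: 3, 4: 1}

reach = 0x3F                             # largest single-byte difference
reaches = {1: reach}
for length, leads in LEADS.items():
    # Each lead byte covers TRAIL_COUNT ** (length - 1) differences.
    reach += leads * TRAIL_COUNT ** (length - 1)
    reaches[length] = reach

print(TRAIL_COUNT)
print(0xFF in EXCLUDED)                  # False: FF may appear as a trail byte
for length, r in reaches.items():
    print(length, format(r, "X"))
print(reaches[4] >= 0x10FFFF - 0x40)     # largest difference ever needed
```

Under these assumptions the largest positive difference, from the initial state U+0040 to U+10FFFF, already fits in a four-byte sequence with a single lead byte value.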
The general BOCU algorithm is covered by United States Patent #6,737,994, which also mentions the specific BOCU-1 implementation. IBM, which employed both of the inventors of BOCU-1 at the time it was created, states in the Unicode Technical Note that implementers of a "fully compliant version of BOCU-1" must contact IBM to request a royalty-free license. BOCU-1 is the only Unicode compression scheme described on the Unicode Web site that is known to be encumbered with intellectual property restrictions.
By contrast, IBM also filed for a patent on UTF-EBCDIC, but it chose in that case to make the documentation and encoding scheme “freely available to anyone concerned towards making the transformation format as part of the UCS standards,” instead of requiring implementers to request a license.
BOCU may refer to:
Binary Ordered Compression for Unicode
Borough Operational Command Unit/Basic Command Unit
Ashiko, called the bocu in Cuba

Comparison of Unicode encodings
This article compares Unicode encodings. Two situations are considered: 8-bit-clean environments, and environments that forbid use of byte values that have the high bit set. Originally such prohibitions were to allow for links that used only seven data bits, but they remain in the standards and so software must generate messages that comply with the restrictions. Standard Compression Scheme for Unicode and Binary Ordered Compression for Unicode are excluded from the comparison tables because it is difficult to simply quantify their size.

Standard Compression Scheme for Unicode
The Standard Compression Scheme for Unicode (SCSU) is a Unicode Technical Standard for reducing the number of bytes needed to represent Unicode text, especially if that text uses mostly characters from one or a small number of per-language character blocks. It does so by dynamically mapping values in the range 128–255 to offsets within particular blocks of 128 characters. The initial conditions of the encoder mean that existing strings in ASCII and ISO-8859-1 that do not contain C0 control codes other than NULL, TAB, CR, and LF can be treated as SCSU strings. Since most alphabets reside in blocks of contiguous Unicode code points, texts that use small alphabets and either ASCII punctuation or punctuation that fits within the window for the main alphabet can be encoded at one byte per character (plus setup overhead, which for common languages is often only 1 byte); most other punctuation can be encoded at 2 bytes per symbol through non-locking shifts. SCSU can also switch to UTF-16 internally to handle non-alphabetic languages.
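The compatibility claim for ASCII and ISO-8859-1 can be expressed as a simple check. This is a sketch based only on the condition stated above (no C0 control codes other than NULL, TAB, CR, and LF); the function name is hypothetical, and it does not implement SCSU itself.

```python
# C0 control codes that the text above names as harmless in SCSU.
PASS_THROUGH_C0 = {0x00, 0x09, 0x0A, 0x0D}   # NULL, TAB, LF, CR


def latin1_is_also_scsu(data: bytes) -> bool:
    """True if an ISO-8859-1 byte string can be read as SCSU unchanged."""
    return all(b >= 0x20 or b in PASS_THROUGH_C0 for b in data)


print(latin1_is_also_scsu("café\tau lait\r\n".encode("latin-1")))
print(latin1_is_also_scsu(b"\x1b[0mplain"))  # contains ESC, a C0 control
```

Byte strings that fail the check would need re-encoding, since an SCSU decoder would interpret the other C0 byte values as commands rather than as characters.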
Symbian OS, an operating system for mobile phones and other mobile devices, uses SCSU to serialise strings.
Reuters, the organization that floated the first draft of SCSU, is believed to use SCSU internally.
SQL Server 2008 R2 uses SCSU to compress Unicode values stored in nchar(n) and nvarchar(n) columns, achieving space savings between 15% and 50%, depending on the language of the data.