In the IEEE 754-2008 standard, the 16-bit base 2 format is referred to as binary16. It is intended for storage of floating-point values in applications where higher precision is not essential for performing arithmetic computations.
Although implementations of the IEEE half-precision floating-point format are relatively new, several earlier 16-bit floating-point formats existed, including those of Hitachi's HD61810 DSP of 1982, Scott's WIF, and the 3dfx Voodoo Graphics processor.
Nvidia and Microsoft defined the half datatype in the Cg language, released in early 2002, and implemented it in silicon in the GeForce FX, released in late 2002. ILM was searching for an image format that could handle a wide dynamic range, but without the hard drive and memory cost of floating-point representations that are commonly used for floating-point computation (single and double precision). The hardware-accelerated programmable shading group led by John Airey at SGI (Silicon Graphics) invented the s10e5 data type in 1997 as part of the 'bali' design effort. This is described in a SIGGRAPH 2000 paper (see section 4.3) and further documented in US patent 7518615.
This format is used in several computer graphics environments including OpenEXR, JPEG XR, GIMP, OpenGL, Cg, and D3DX. The advantage over 8-bit or 16-bit binary integers is that the increased dynamic range allows for more detail to be preserved in highlights and shadows for images. The advantage over 32-bit single-precision binary formats is that it requires half the storage and bandwidth (at the expense of precision and range).
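The IEEE 754 standard specifies binary16 as having a 1-bit sign, a 5-bit exponent, and 11 bits of significand precision, of which 10 are explicitly stored.
The format is laid out, from the most significant bit to the least significant, as the sign bit (bit 15), the exponent (bits 14 to 10), and the significand (bits 9 to 0).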
The format is assumed to have an implicit lead bit with value 1 unless the exponent field is stored with all zeros. Thus only 10 bits of the significand appear in the memory format but the total precision is 11 bits. In IEEE 754 parlance, there are 10 bits of significand, but there are 11 bits of significand precision ($\log_{10}(2^{11}) \approx 3.311$ decimal digits, or 4 digits ± slightly less than 5 units in the last place).
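The field widths and the implicit lead bit can be illustrated with a minimal Python sketch. The helper names `split_binary16` and `full_significand` are hypothetical, chosen only for this illustration, and the sketch assumes the bit pattern is supplied as an unsigned 16-bit integer.

```python
import math

def split_binary16(bits: int) -> tuple:
    """Split a 16-bit pattern into (sign, stored exponent, stored significand)."""
    sign = (bits >> 15) & 0x1        # 1 bit
    exponent = (bits >> 10) & 0x1F   # 5 bits, still biased
    fraction = bits & 0x3FF          # 10 explicitly stored significand bits
    return sign, exponent, fraction

def full_significand(exponent: int, fraction: int) -> int:
    """Return the 11-bit integer significand with the implicit lead bit restored.

    The lead bit is 0 only when the exponent field is all zeros. An exponent
    field of all ones (infinity/NaN) is not meaningful here.
    """
    lead = 0 if exponent == 0 else 1
    return (lead << 10) | fraction

sign, exponent, fraction = split_binary16(0x3C01)  # 0 01111 0000000001
print(sign, exponent, fraction)
print(full_significand(exponent, fraction))
print(round(math.log10(2 ** 11), 3))  # decimal digits carried by 11 bits
```

For the pattern `0x3C01`, the stored significand is 1, and restoring the lead bit gives the 11-bit integer 1025.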
The half-precision binary floating-point exponent is encoded using an offset-binary representation, with the zero offset being 15; this offset is known as the exponent bias in the IEEE 754 standard.
Thus, as defined by the offset binary representation, in order to get the true exponent the offset of 15 has to be subtracted from the stored exponent.
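As a small worked illustration of the bias, the following Python sketch converts between stored and true exponents. The function names are hypothetical, and the sketch only covers stored values 1 to 30, on the assumption that the all-zeros and all-ones exponent fields are handled separately.

```python
BIAS = 15  # exponent bias for binary16

def true_exponent(stored: int) -> int:
    """Subtract the bias from a stored exponent in the range 1..30."""
    if not 1 <= stored <= 30:
        raise ValueError("stored exponents 0 and 31 are special cases")
    return stored - BIAS

def stored_exponent(true: int) -> int:
    """Add the bias to a true exponent in the range -14..15."""
    if not -14 <= true <= 15:
        raise ValueError("true exponent out of range for normal numbers")
    return true + BIAS

print(true_exponent(0b01111))  # stored 15 gives true exponent 0
print(true_exponent(0b00001))  # smallest normal exponent, -14
print(true_exponent(0b11110))  # largest normal exponent, 15
```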
The stored exponents $00000_2$ and $11111_2$ are interpreted specially.
|Exponent||Significand = zero||Significand ≠ zero||Equation|
|$00000_2$||zero, −0||subnormal numbers||$(-1)^{\text{signbit}} \times 2^{-14} \times 0.\text{significantbits}_2$|
|$00001_2, \ldots, 11110_2$||normalized value||normalized value||$(-1)^{\text{signbit}} \times 2^{\text{exponent}-15} \times 1.\text{significantbits}_2$|
|$11111_2$||±infinity||NaN (quiet, signalling)||(none)|
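The three cases in the table can be sketched as a decoder in Python. The function name `decode_binary16` is hypothetical. The cross-check at the end assumes Python 3.6 or later, where the `struct` module provides the `e` format code for binary16.

```python
import math
import struct

def decode_binary16(bits: int) -> float:
    """Decode an unsigned 16-bit pattern according to the table's three cases."""
    sign = -1.0 if (bits >> 15) & 1 else 1.0
    exponent = (bits >> 10) & 0x1F
    fraction = bits & 0x3FF
    if exponent == 0:
        # Zero or subnormal: no implicit lead bit, fixed exponent of -14.
        return sign * 2.0 ** -14 * (fraction / 1024)
    if exponent == 0x1F:
        # All-ones exponent: infinity if the significand is zero, else NaN.
        return sign * math.inf if fraction == 0 else math.nan
    # Normalized value: implicit lead bit of 1, biased exponent.
    return sign * 2.0 ** (exponent - 15) * (1 + fraction / 1024)

# Cross-check a few patterns against the standard library's conversion.
for pattern in (0x3C00, 0xC000, 0x7BFF, 0x0001, 0x8000, 0x7C00):
    expected = struct.unpack("<e", struct.pack("<H", pattern))[0]
    assert decode_binary16(pattern) == expected
```

The sketch does not distinguish quiet from signalling NaNs; both decode to a Python NaN.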
The minimum strictly positive (subnormal) value is $2^{-24} \approx 5.96 \times 10^{-8}$. The minimum positive normal value is $2^{-14} \approx 6.10 \times 10^{-5}$. The maximum representable value is $(2 - 2^{-10}) \times 2^{15} = 65504$.
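These three limits can be checked numerically. The sketch below evaluates the formulas in double precision and compares them with the corresponding bit patterns, assuming Python's `struct` module with the `e` format code (available since Python 3.6). The helper `from_bits` is a hypothetical name.

```python
import struct

def from_bits(bits: int) -> float:
    """Reinterpret an unsigned 16-bit pattern as a binary16 value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]

min_subnormal = 2.0 ** -24                  # 0 00000 0000000001
min_normal = 2.0 ** -14                     # 0 00001 0000000000
max_value = (2 - 2.0 ** -10) * 2.0 ** 15    # 0 11110 1111111111

assert from_bits(0x0001) == min_subnormal
assert from_bits(0x0400) == min_normal
assert from_bits(0x7BFF) == max_value == 65504.0
print(min_subnormal, min_normal, max_value)
```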
These examples are given in bit representation of the floating-point value. This includes the sign bit, (biased) exponent, and significand.
0 01111 0000000000 = 1
0 01111 0000000001 = $1 + 2^{-10}$ = 1.0009765625 (smallest number larger than 1)
1 10000 0000000000 = −2
0 11110 1111111111 = 65504 (max half precision)
0 00001 0000000000 = $2^{-14} \approx 6.10352 \times 10^{-5}$ (minimum positive normal)
0 00000 1111111111 = $2^{-14} - 2^{-24} \approx 6.09756 \times 10^{-5}$ (maximum subnormal)
0 00000 0000000001 = $2^{-24} \approx 5.96046 \times 10^{-8}$ (minimum positive subnormal)
0 00000 0000000000 = 0
1 00000 0000000000 = −0
0 11111 0000000000 = infinity
1 11111 0000000000 = −infinity
0 01101 0101010101 = 0.333251953125 ≈ 1/3
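Bit patterns like these can be reproduced with a short Python sketch. It relies on the `struct` module's `e` format code (Python 3.6 or later) to perform the conversion to binary16. The helper name `to_bit_string` is hypothetical.

```python
import struct

def to_bit_string(value: float) -> str:
    """Format a value's binary16 encoding as 'sign exponent significand'."""
    (bits,) = struct.unpack("<H", struct.pack("<e", value))
    s = f"{bits:016b}"
    return f"{s[0]} {s[1:6]} {s[6:]}"

for value in (1.0, -2.0, 65504.0, 2.0 ** -14, 2.0 ** -24,
              0.0, -0.0, float("inf"), 1 / 3):
    print(to_bit_string(value), "=", value)
```

Note that the last line prints the double-precision value of 1/3 next to the bit pattern of its rounded binary16 encoding, so the two are not exactly equal.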
By default, 1/3 rounds down, as it does for double precision, because of the odd number of bits in the significand. The bits beyond the rounding point are 0101..., which is less than 1/2 of a unit in the last place.
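The rounding of 1/3 can be worked through with exact rational arithmetic. The sketch below is an illustration only, not a general rounding routine. It assumes the value lies in the binade $[2^{-2}, 2^{-1})$, so that one unit in the last place is $2^{-2} \times 2^{-10} = 2^{-12}$.

```python
from fractions import Fraction

third = Fraction(1, 3)
ulp = Fraction(1, 2 ** 12)           # unit in the last place for [2^-2, 2^-1)

scaled = third / ulp                 # 4096/3, i.e. 1365 and one third
kept = scaled.numerator // scaled.denominator   # the 11 significand bits kept
discarded = scaled - kept            # bits beyond the rounding point, in ulps

print(bin(kept))                     # 11-bit significand including the lead bit
print(discarded)                     # fraction of an ulp that is dropped
assert discarded < Fraction(1, 2)    # below half an ulp, so round down

rounded = kept * ulp
print(float(rounded))
```

The discarded part is exactly 1/3 of a unit in the last place, which is the binary fraction 0.0101..., so round-to-nearest keeps the lower neighbour, 1365/4096 = 0.333251953125.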
ARM processors support (via a floating point control register bit) an "alternative half-precision" format, which does away with the special case for an exponent value of 31 ($11111_2$). It is almost identical to the IEEE format, but there is no encoding for infinity or NaNs; instead, an exponent of 31 encodes normalized numbers in the range 65536 to 131008.
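A decoder for this alternative format might look like the following Python sketch. The function name `decode_arm_alternative` is hypothetical. The sketch assumes that an all-zeros exponent field is still treated as zero or subnormal, as in the IEEE format, and that the only difference is the handling of exponent 31.

```python
def decode_arm_alternative(bits: int) -> float:
    """Decode an unsigned 16-bit pattern in the alternative half-precision format."""
    sign = -1.0 if (bits >> 15) & 1 else 1.0
    exponent = (bits >> 10) & 0x1F
    fraction = bits & 0x3FF
    if exponent == 0:
        # Assumed identical to IEEE: zero or subnormal.
        return sign * 2.0 ** -14 * (fraction / 1024)
    # Every other exponent, including 31, is an ordinary normalized value.
    return sign * 2.0 ** (exponent - 15) * (1 + fraction / 1024)

print(decode_arm_alternative(0x7C00))  # smallest value with exponent 31
print(decode_arm_alternative(0x7FFF))  # largest value with exponent 31
```

Under these assumptions the two patterns decode to 65536 and 131008, the endpoints of the range quoted above.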