Half-precision floating-point format

In computing, half precision is a binary floating-point computer number format that occupies 16 bits (two bytes in modern computers) in computer memory.

In the IEEE 754-2008 standard, the 16-bit base-2 format is referred to as binary16. It is intended for storage of floating-point values in applications where higher precision is not essential for performing arithmetic computations.

Although implementations of the IEEE half-precision floating-point format are relatively new, several earlier 16-bit floating-point formats have existed, including that of Hitachi's HD61810 DSP[1] of 1982, Scott's WIF,[2] and the 3dfx Voodoo Graphics processor.[3]

Nvidia and Microsoft defined the half datatype in the Cg language, released in early 2002, and implemented it in silicon in the GeForce FX, released in late 2002.[4] ILM was searching for an image format that could handle a wide dynamic range, but without the hard drive and memory cost of floating-point representations that are commonly used for floating-point computation (single and double precision).[5] The hardware-accelerated programmable shading group led by John Airey at SGI (Silicon Graphics) invented the s10e5 data type in 1997 as part of the 'bali' design effort. This is described in a SIGGRAPH 2000 paper[6] (see section 4.3) and further documented in US patent 7518615.[7]

This format is used in several computer graphics environments including OpenEXR, JPEG XR, GIMP, OpenGL, Cg, and D3DX. The advantage over 8-bit or 16-bit binary integers is that the increased dynamic range allows for more detail to be preserved in highlights and shadows for images. The advantage over 32-bit single-precision binary formats is that it requires half the storage and bandwidth (at the expense of precision and range).[5]

The F16C extension allows x86 processors to convert half-precision floats to and from single-precision floats.

IEEE 754 half-precision binary floating-point format: binary16

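The IEEE 754 standard specifies a binary16 as having the following format:

  • Sign bit: 1 bit
  • Exponent width: 5 bits
  • Significand precision: 11 bits (10 explicitly stored)

The format is laid out as follows, from the most significant to the least significant bit: the sign bit (bit 15), the exponent field (bits 14 to 10), and the stored significand (bits 9 to 0).

[Figure: IEEE 754r Half Floating Point Format]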

The format is assumed to have an implicit lead bit with value 1 unless the exponent field is stored with all zeros. Thus only 10 bits of the significand appear in the memory format but the total precision is 11 bits. In IEEE 754 parlance, there are 10 bits of significand, but there are 11 bits of significand precision (log₁₀(2^11) ≈ 3.311 decimal digits, or 4 digits ± slightly less than 5 units in the last place).

Exponent encoding

The half-precision binary floating-point exponent is encoded using an offset-binary representation with a zero offset of 15; this offset is known as the exponent bias in the IEEE 754 standard.

  • E_min = 00001₂ − 01111₂ = −14
  • E_max = 11110₂ − 01111₂ = 15
  • Exponent bias = 01111₂ = 15

Thus, as defined by the offset binary representation, in order to get the true exponent the offset of 15 has to be subtracted from the stored exponent.

The stored exponents 00000₂ and 11111₂ are interpreted specially.

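| Exponent            | Significand = zero | Significand ≠ zero      | Equation                                            |
|---------------------|--------------------|-------------------------|-----------------------------------------------------|
| 00000₂              | zero, −0           | subnormal numbers       | (−1)^signbit × 2^−14 × 0.significantbits₂           |
| 00001₂, ..., 11110₂ | normalized value   | normalized value        | (−1)^signbit × 2^(exponent−15) × 1.significantbits₂ |
| 11111₂              | ±infinity          | NaN (quiet, signalling) |                                                     |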

The minimum strictly positive (subnormal) value is 2^−24 ≈ 5.96 × 10^−8. The minimum positive normal value is 2^−14 ≈ 6.10 × 10^−5. The maximum representable value is (2 − 2^−10) × 2^15 = 65504.
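The encoding rules above can be turned into a small decoder. The following Python sketch is illustrative only: the names `decode_binary16` and `decode_with_struct` are assumptions made for this example, not part of any standard API. The second helper relies on the `e` (binary16) format code of Python's standard `struct` module and serves as a cross-check.

```python
import math
import struct

def decode_binary16(bits: int) -> float:
    """Decode a 16-bit pattern using the binary16 rules (hypothetical helper)."""
    sign = -1.0 if (bits >> 15) & 0x1 else 1.0
    exponent = (bits >> 10) & 0x1F      # 5-bit stored (biased) exponent
    fraction = bits & 0x3FF             # 10 explicitly stored significand bits

    if exponent == 0:
        # Zero or subnormal: no implicit leading 1, fixed exponent of -14.
        return sign * 2.0 ** -14 * (fraction / 1024)
    if exponent == 0x1F:
        # All-ones exponent: infinity if the fraction is zero, otherwise NaN.
        return sign * math.inf if fraction == 0 else math.nan
    # Normal number: implicit leading 1, exponent bias of 15.
    return sign * 2.0 ** (exponent - 15) * (1 + fraction / 1024)

def decode_with_struct(bits: int) -> float:
    """Reference decoding through the standard library's binary16 support."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]

if __name__ == "__main__":
    # Smallest subnormal, smallest normal, and largest finite value.
    for pattern in (0x0001, 0x0400, 0x7BFF):
        print(f"{pattern:#06x} -> {decode_binary16(pattern)!r}")
```

Because every binary16 value is exactly representable as a Python float (binary64), the two helpers should agree bit for bit on all finite patterns.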

Half precision examples

These examples are given in bit representation of the floating-point value. This includes the sign bit, (biased) exponent, and significand.

0 00000 0000000001₂ = 0001₁₆ = 2^−14 × 2^−10 = 2^−24 ≈ 0.000000059605
                              (smallest positive subnormal number)
0 00000 1111111111₂ = 03ff₁₆ = 2^−14 × (1 − 2^−10) ≈ 0.000060976
                              (largest subnormal number)
0 00001 0000000000₂ = 0400₁₆ = 2^−14 ≈ 0.000061035
                              (smallest positive normal number)
0 11110 1111111111₂ = 7bff₁₆ = 2^15 × (1 + (1 − 2^−10)) = 65504
                              (largest normal number)
0 01110 1111111111₂ = 3bff₁₆ = 1 − 2^−11 ≈ 0.99951
                              (largest number less than one)
0 01111 0000000000₂ = 3c00₁₆ = 1 (one)
0 01111 0000000001₂ = 3c01₁₆ = 1 + 2^−10 ≈ 1.001
                              (smallest number larger than one)
1 10000 0000000000₂ = c000₁₆ = −2

0 00000 0000000000₂ = 0000₁₆ = 0
1 00000 0000000000₂ = 8000₁₆ = −0

0 11111 0000000000₂ = 7c00₁₆ = infinity
1 11111 0000000000₂ = fc00₁₆ = −infinity

0 01101 0101010101₂ = 3555₁₆ = 0.333251953125 ≈ 1/3

By default, 1/3 rounds down, as it does for double precision, because of the odd number of bits in the significand: the bits beyond the rounding point are 0101..., which is less than 1/2 of a unit in the last place.
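The bit patterns listed above can be reproduced with Python's standard `struct` module, whose `e` format code packs and unpacks binary16 values. This is a minimal sketch; the helper names `to_bits` and `from_bits` are assumptions made for this example.

```python
import struct

def to_bits(value: float) -> int:
    """Round a Python float to binary16 and return its 16-bit pattern."""
    return struct.unpack("<H", struct.pack("<e", value))[0]

def from_bits(bits: int) -> float:
    """Interpret a 16-bit pattern as a binary16 value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]

if __name__ == "__main__":
    third = to_bits(1 / 3)
    # Show the hexadecimal pattern and the value it actually represents.
    print(f"1/3 -> {third:04x} -> {from_bits(third)!r}")
    for value in (1.0, -2.0, 65504.0, -0.0):
        print(f"{value!r} -> {to_bits(value):04x}")
```

Converting 1/3 this way yields the pattern 3555₁₆, whose exact value 0.333251953125 lies slightly below 1/3, consistent with the rounding direction described above.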

Precision limitations on decimal values in [0, 1]

  • Decimals between 2^−24 (minimum positive subnormal) and 2^−14 (the upper limit of the subnormal range): fixed interval 2^−24
  • Decimals between 2^−14 (minimum positive normal) and 2^−13: fixed interval 2^−24
  • Decimals between 2^−13 and 2^−12: fixed interval 2^−23
  • Decimals between 2^−12 and 2^−11: fixed interval 2^−22
  • Decimals between 2^−11 and 2^−10: fixed interval 2^−21
  • Decimals between 2^−10 and 2^−9: fixed interval 2^−20
  • Decimals between 2^−9 and 2^−8: fixed interval 2^−19
  • Decimals between 2^−8 and 2^−7: fixed interval 2^−18
  • Decimals between 2^−7 and 2^−6: fixed interval 2^−17
  • Decimals between 2^−6 and 2^−5: fixed interval 2^−16
  • Decimals between 2^−5 and 2^−4: fixed interval 2^−15
  • Decimals between 2^−4 and 2^−3: fixed interval 2^−14
  • Decimals between 2^−3 and 2^−2: fixed interval 2^−13
  • Decimals between 2^−2 and 2^−1: fixed interval 2^−12
  • Decimals between 2^−1 and 2^0: fixed interval 2^−11

Precision limitations on decimal values in [1, 2048]

  • Decimals between 1 and 2: fixed interval 2^−10 (1 + 2^−10 is the next float after 1)
  • Decimals between 2 and 4: fixed interval 2^−9
  • Decimals between 4 and 8: fixed interval 2^−8
  • Decimals between 8 and 16: fixed interval 2^−7
  • Decimals between 16 and 32: fixed interval 2^−6
  • Decimals between 32 and 64: fixed interval 2^−5
  • Decimals between 64 and 128: fixed interval 2^−4
  • Decimals between 128 and 256: fixed interval 2^−3
  • Decimals between 256 and 512: fixed interval 2^−2
  • Decimals between 512 and 1024: fixed interval 2^−1
  • Decimals between 1024 and 2048: fixed interval 2^0
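The intervals listed above all follow one rule: within a power-of-two range [2^e, 2^(e+1)) the spacing is 2^(e−10), and below 2^−14 it stays at 2^−24. A minimal Python sketch of that rule follows; the function name `binary16_spacing` is an assumption made for this example, and the function assumes its argument lies within the finite binary16 range.

```python
import math

def binary16_spacing(x: float) -> float:
    """Gap between adjacent binary16 values in the range containing |x|.

    Assumes 0 <= |x| <= 65504. At an exact power of two the result is the
    spacing toward larger magnitudes.
    """
    x = abs(x)
    if x < 2.0 ** -14:
        return 2.0 ** -24            # subnormal range: constant spacing
    exponent = math.frexp(x)[1] - 1  # floor(log2(x)) for x > 0
    return 2.0 ** (exponent - 10)    # 10 stored significand bits

if __name__ == "__main__":
    for value in (3e-5, 0.75, 1.0, 100.0, 1500.0):
        print(f"spacing near {value!r}: {binary16_spacing(value)!r}")
```

`math.frexp` returns a mantissa in [0.5, 1), which is why 1 is subtracted from its exponent to obtain the power-of-two range of the input.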

Precision limitations on integer values

  • Integers between 0 and 2048 can be exactly represented (and also between −2048 and 0)
  • Integers between 2048 and 4096 round to a multiple of 2 (even number)
  • Integers between 4096 and 8192 round to a multiple of 4
  • Integers between 8192 and 16384 round to a multiple of 8
  • Integers between 16384 and 32768 round to a multiple of 16
  • Integers between 32768 and 65519 round to a multiple of 32[8]
  • Integers equal to or above 65520 are rounded to "infinity" (if using round-to-even, or equal to or above 65536 if using round-to-zero, or strictly above 65504 if using round-to-infinity).
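The integer rounding behaviour listed above can be observed by round-tripping values through binary16 with Python's `struct` module, which rounds to nearest with ties to even. This is a sketch under stated assumptions: the helper name `round_to_binary16` is made up for this example, and an `OverflowError` from `struct.pack` (which CPython raises for finite inputs too large for the format) is reported here as infinity.

```python
import math
import struct

def round_to_binary16(value: float) -> float:
    """Round a value to the nearest binary16 number (ties to even)."""
    try:
        return struct.unpack("<e", struct.pack("<e", float(value)))[0]
    except OverflowError:
        # Assumption of this sketch: model overflow as a signed infinity.
        return math.copysign(math.inf, value)

if __name__ == "__main__":
    for n in (2047, 2049, 2051, 4097, 40000, 65519, 65520):
        print(f"{n} -> {round_to_binary16(n)!r}")
```

For example, 2049 lies exactly halfway between 2048 and 2050 and goes to 2048, whose significand is even, while 2051 goes to 2052 for the same reason.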

ARM alternative half-precision

ARM processors support (via a floating-point control register bit) an "alternative half-precision" format, which does away with the special case for an exponent value of 31 (11111₂).[9] It is almost identical to the IEEE format, but there is no encoding for infinity or NaNs; instead, an exponent of 31 encodes normalized numbers in the range 65536 to 131008.
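The difference from the IEEE format can be illustrated with a short decoder. The Python sketch below assumes, as the description above suggests, that every encoding other than exponent 31 follows the IEEE binary16 rules; the function name `decode_arm_alternative` is hypothetical and does not correspond to any ARM API.

```python
def decode_arm_alternative(bits: int) -> float:
    """Decode a 16-bit pattern under the alternative half-precision rules."""
    sign = -1.0 if (bits >> 15) & 0x1 else 1.0
    exponent = (bits >> 10) & 0x1F
    fraction = bits & 0x3FF
    if exponent == 0:
        # Zero and subnormals, assumed identical to IEEE binary16.
        return sign * 2.0 ** -14 * (fraction / 1024)
    # Exponents 1 to 31 are all treated as normal numbers: no infinity, no NaN.
    return sign * 2.0 ** (exponent - 15) * (1 + fraction / 1024)

if __name__ == "__main__":
    # 0x7c00 is +infinity in IEEE binary16; 0x7fff would be a NaN there.
    for pattern in (0x7BFF, 0x7C00, 0x7FFF):
        print(f"{pattern:#06x} -> {decode_arm_alternative(pattern)!r}")
```

Under these rules the patterns 7c00₁₆ and 7fff₁₆ decode to 65536 and 131008, the two ends of the extra range mentioned above.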

References

  1. "hitachi :: dataBooks :: HD61810 Digital Signal Processor Users Manual". Archive.org. Retrieved 2017-07-14.
  2. Scott, Thomas J. (March 1991). "Mathematics and Computer Science at Odds over Real Numbers". SIGCSE '91 Proceedings of the twenty-second SIGCSE technical symposium on Computer science education. 23 (1): 130–139.
  3. "/home/usr/bk/glide/docs2.3.1/GLIDEPGM.DOC". Gamers.org. Retrieved 2017-07-14.
  4. "vs_2_sw". Cg 3.1 Toolkit Documentation. Nvidia. Retrieved 2016-08-17.
  5. "OpenEXR". OpenEXR. Retrieved 2017-07-14.
  6. Mark S. Peercy; Marc Olano; John Airey; P. Jeffrey Ungar. "Interactive Multi-Pass Programmable Shading" (PDF). People.csail.mit.edu. Retrieved 2017-07-14.
  7. "Patent US7518615 - Display system having floating point rasterization and floating point ... - Google Patents". Google.com. Retrieved 2017-07-14.
  8. "Mediump float calculator". Half precision floating point calculator. Retrieved 2016-07-26.
  9. "Half-precision floating-point number support". RealView Compilation Tools Compiler User Guide. 2010-12-10. Retrieved 2015-05-05.

Bfloat16 floating-point format

The bfloat16 floating-point format is a computer number format occupying 16 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point. This format is a truncated (16-bit) version of the 32-bit IEEE 754 single-precision floating-point format (binary32) with the intent of accelerating machine learning and near-sensor computing. It preserves the approximate dynamic range of 32-bit floating-point numbers by retaining 8 exponent bits, but supports only an 8-bit precision rather than the 24-bit significand of the binary32 format. More so than single-precision 32-bit floating-point numbers, bfloat16 numbers are unsuitable for integer calculations, but this is not their intended use.
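Since bfloat16 is described as a truncated binary32, the conversion can be sketched by keeping only the upper 16 bits of the binary32 encoding. The Python example below is a minimal illustration with hypothetical helper names; it uses plain truncation to match the description above, whereas real hardware and libraries may round to nearest instead.

```python
import struct

def float32_to_bfloat16_bits(value: float) -> int:
    """Keep the upper 16 bits of the binary32 encoding (simple truncation)."""
    (bits32,) = struct.unpack(">I", struct.pack(">f", value))
    return bits32 >> 16

def bfloat16_bits_to_float(bits16: int) -> float:
    """Widen a bfloat16 pattern to binary32 by appending 16 zero bits."""
    return struct.unpack(">f", struct.pack(">I", (bits16 & 0xFFFF) << 16))[0]

if __name__ == "__main__":
    for value in (1.0, 257.0, 3.0e38):
        bits = float32_to_bfloat16_bits(value)
        print(f"{value!r} -> {bits:04x} -> {bfloat16_bits_to_float(bits)!r}")
```

The value 257 comes back as 256, which shows why the format is a poor fit for integer arithmetic, while a magnitude near 3 × 10^38 stays finite because the 8 exponent bits are retained.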

The bfloat16 format is used in upcoming Intel AI processors such as the Nervana NNP-L1000, in Xeon processors and Intel FPGAs, in Google Cloud TPUs, and in TensorFlow.


DisplayID is a VESA standard for metadata describing display device capabilities to the video source. It is designed to replace the E-EDID standard and the EDID structure v1.4.

The DisplayID standard was initially released in December 2007. Version 1.1 was released in March 2009 and was followed by version 1.2 in August 2011. Version 1.3 was released in June 2013, and the current version, 2.0, was released in September 2017.

DisplayID uses variable-length structures of up to 256 bytes each, which encompass all existing EDID extensions as well as new extensions for 3D displays, embedded displays, wide color gamut and HDR EOTF. The DisplayID format includes several blocks which describe logical parts of the display such as video interfaces, display device technology, timing details and manufacturer information. Data blocks are identified with a unique tag. The length of each block can be variable or fixed to a specific number of bytes. Only the base data block is mandatory, while all extension blocks are optional. This variable structure is based on CEA EDID Extension Block Version 3, first defined in CEA-861-B.

The DisplayID standard is freely available and is royalty-free to implement.


F16, F 16, F-16 or FI6 may refer to:

General Dynamics F-16 Fighting Falcon, a 1974 American multirole fighter jet aircraft

FI6 (antibody), an antibody that targets influenza A viruses

F 16 Uppsala, a Swedish air force base

Formula 16, a 5 metre catamaran

Volvo F16, a truck

the ICD-10 code for mental and behavioural disorders due to use of hallucinogens

Eric Fehr, a Canadian ice hockey player and author who wears #16 for the Toronto Maple Leafs of the National Hockey League

Half-precision floating-point format, a 16-bit computer number format


In computing, minifloats are floating-point values represented with very few bits. Predictably, they are not well suited for general-purpose numerical calculations. They are used for special purposes, most often in computer graphics, where iterations are small and precision has aesthetic effects. Additionally, they are frequently encountered as a pedagogical tool in computer-science courses to demonstrate the properties and structures of floating-point arithmetic and IEEE 754 numbers.

Minifloats with 16 bits are half-precision numbers (as opposed to single and double precision). There are also minifloats with 8 bits or even fewer.

Minifloats can be designed following the principles of the IEEE 754 standard. In this case they must obey the (not explicitly written) rules for the frontier between subnormal and normal numbers and must have special patterns for infinity and NaN. Normalized numbers are stored with a biased exponent. The new revision of the standard, IEEE 754-2008, has 16-bit binary minifloats.

The Radeon R300 and R420 GPUs used an "fp24" floating-point format with 7 bits of exponent and 16 bits (+1 implicit) of mantissa.

"Full Precision" in Direct3D 9.0 is a proprietary 24-bit floating-point format. Microsoft's D3D9 (Shader Model 2.0) graphics API initially supported both FP24 (as in ATI's R300 chip) and FP32 (as in Nvidia's NV30 chip) as "Full Precision", as well as FP16 as "Partial Precision" for vertex and pixel shader calculations performed by the graphics hardware.

In computer graphics minifloats are sometimes used to represent only integral values. If at the same time subnormal values should exist, the least subnormal number has to be 1. This statement can be used to calculate the bias value. The following example demonstrates the calculation, as well as the underlying principles.
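As one way to illustrate the bias calculation, the Python sketch below assumes a hypothetical unsigned minifloat with 4 exponent bits and 3 significand bits that follows the IEEE-style rules for subnormal and normal numbers. The field widths and helper names are chosen for illustration and do not come from the text above. Requiring the least subnormal, 2^(1 − bias) × 2^(−m), to equal 1 gives bias = 1 − m for m stored significand bits.

```python
def integral_bias(man_bits: int) -> int:
    """Bias that makes the least subnormal equal to 1.

    Least subnormal = 2**(1 - bias) * 2**(-man_bits) = 1, so bias = 1 - man_bits.
    """
    return 1 - man_bits

def decode_minifloat(bits: int, exp_bits: int, man_bits: int, bias: int) -> float:
    """Decode an unsigned IEEE-style minifloat pattern (finite values only)."""
    exponent = (bits >> man_bits) & ((1 << exp_bits) - 1)
    fraction = bits & ((1 << man_bits) - 1)
    if exponent == 0:
        # Subnormal: no implicit leading 1, exponent fixed at 1 - bias.
        return 2.0 ** (1 - bias) * fraction / (1 << man_bits)
    # Normal: implicit leading 1, stored exponent minus bias.
    return 2.0 ** (exponent - bias) * (1 + fraction / (1 << man_bits))

if __name__ == "__main__":
    EXP_BITS, MAN_BITS = 4, 3          # assumed widths for this illustration
    bias = integral_bias(MAN_BITS)
    # Patterns whose exponent field is all ones are left out (infinity/NaN).
    last_finite = ((1 << EXP_BITS) - 1) << MAN_BITS
    values = [decode_minifloat(b, EXP_BITS, MAN_BITS, bias) for b in range(last_finite)]
    print("bias:", bias)
    print("first values:", values[:10])
    print("largest finite value:", values[-1])
```

With these assumed widths the bias comes out negative, and every finite pattern decodes to an integer.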

Precision (computer science)

In computer science, the precision of a numerical quantity is a measure of the detail in which the quantity is expressed. This is usually measured in bits, but sometimes in decimal digits. It is related to precision in mathematics, which describes the number of digits that are used to express a value.

Some of the standardized precision formats are

Half-precision floating-point format

Single-precision floating-point format

Double-precision floating-point format

Quadruple-precision floating-point format

Octuple-precision floating-point format

Of these, octuple-precision format is rarely used. The single- and double-precision formats are most widely used and supported on nearly all platforms. The use of half-precision format has been increasing especially in the field of machine learning since many machine learning algorithms are inherently error-tolerant.


This page is based on a Wikipedia article written by its authors.
Text is available under the CC BY-SA 3.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.