Double-precision floating-point format is a computer number format, usually occupying 64 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point.
Floating point is used to represent fractional values, or when a wider range is needed than is provided by fixed point (of the same bit width), even if at the cost of precision. Double precision may be chosen when the range or precision of single precision would be insufficient.
In the IEEE 754-2008 standard, the 64-bit base-2 format is officially referred to as binary64; it was called double in IEEE 754-1985. IEEE 754 specifies additional floating-point formats, including 32-bit base-2 single precision and, more recently, base-10 representations.
One of the first programming languages to provide single- and double-precision floating-point data types was Fortran. Before the widespread adoption of IEEE 754-1985, the representation and properties of floating-point data types depended on the computer manufacturer and computer model, and upon decisions made by programming-language implementers. E.g., GW-BASIC's double-precision data type was the 64-bit MBF floating-point format.
Double-precision binary floating-point is a commonly used format on PCs, due to its wider range over single-precision floating point, in spite of its performance and bandwidth cost. As with single-precision floating-point format, it lacks precision on integer numbers when compared with an integer format of the same size. It is commonly known simply as double. The IEEE 754 standard specifies a binary64 as having a 1-bit sign, an 11-bit exponent, and 53 bits of significand precision (52 of them explicitly stored).
The sign bit determines the sign of the number (including when this number is zero, which is signed).
The exponent field can be interpreted as either an 11-bit signed integer from −1024 to 1023 (two's complement) or an 11-bit unsigned integer from 0 to 2047, which is the accepted biased form in the IEEE 754 binary64 definition. If the unsigned integer format is used, the exponent value used in the arithmetic is the exponent shifted by a bias; for the IEEE 754 binary64 case, an exponent value of 1023 represents the actual zero (i.e. for 2^{e − 1023} to be one, e must be 1023). Exponents range from −1022 to +1023 because exponents of −1023 (all 0s) and +1024 (all 1s) are reserved for special numbers.
The 53-bit significand precision gives from 15 to 17 significant decimal digits precision (2^{−53} ≈ 1.11 × 10^{−16}). If a decimal string with at most 15 significant digits is converted to IEEE 754 double-precision representation, and then converted back to a decimal string with the same number of digits, the final result should match the original string. If an IEEE 754 double-precision number is converted to a decimal string with at least 17 significant digits, and then converted back to double-precision representation, the final result must match the original number.^{[1]}
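The two round-trip guarantees can be checked with a minimal Python sketch. It assumes that Python's `float` is an IEEE 754 binary64 (true on common platforms, but not required by the language); the helper names `roundtrip_decimal` and `roundtrip_double` are illustrative only.

```python
def roundtrip_decimal(text: str, digits: int = 15) -> str:
    """Decimal string -> double -> decimal string with `digits` significant digits."""
    return format(float(text), f".{digits}g")


def roundtrip_double(x: float, digits: int = 17) -> float:
    """Double -> decimal string with `digits` significant digits -> double."""
    return float(format(x, f".{digits}g"))


# A 15-digit decimal string survives the trip through a double.
print(roundtrip_decimal("0.123456789012345"))   # 0.123456789012345

# 17 significant digits are enough to recover the same double.
print(format(0.1, ".17g"))                      # 0.10000000000000001
print(roundtrip_double(0.1) == 0.1)             # True

# 16 digits are not always safe: 2^53 + 1 is not representable.
print(roundtrip_decimal("9007199254740993", digits=16))  # 9007199254740992
```

Note that the `g` format drops trailing zeros, so the string comparison is only meaningful for inputs written without them.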
The format is written with the significand having an implicit integer bit of value 1 (except for special data, see the exponent encoding below). With the 52 bits of the fraction significand appearing in the memory format, the total precision is therefore 53 bits (approximately 16 decimal digits, 53 log_{10}(2) ≈ 15.955). The bits are laid out as follows, from most to least significant: the sign bit (bit 63), the exponent (bits 62 to 52), and the fraction (bits 51 to 0).
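The real value assumed by a given 64-bit double-precision datum with a given biased exponent e and a 52-bit fraction b_{51}b_{50}...b_{0} is
(−1)^{sign} × (1.b_{51}b_{50}...b_{0})_{2} × 2^{e − 1023}
or
(−1)^{sign} × (1 + Σ_{i=1}^{52} b_{52−i} 2^{−i}) × 2^{e − 1023}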
Between 2^{52}=4,503,599,627,370,496 and 2^{53}=9,007,199,254,740,992 the representable numbers are exactly the integers. For the next range, from 2^{53} to 2^{54}, everything is multiplied by 2, so the representable numbers are the even ones, etc. Conversely, for the previous range from 2^{51} to 2^{52}, the spacing is 0.5, etc.
The spacing between adjacent representable numbers in the range from 2^{n} to 2^{n+1} is 2^{n−52}. Rounding to the nearest representable number changes a value by at most half that spacing, so the maximum relative rounding error (the machine epsilon) is 2^{−53}.
The 11-bit width of the exponent allows the representation of numbers between 10^{−308} and 10^{308}, with full 15–17 decimal digits precision. By compromising precision, the subnormal representation allows even smaller values, down to about 5 × 10^{−324}.
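The spacing rule can be observed directly. The following is a small sketch, assuming Python 3.9 or later (for `math.ulp`) and a `float` that is an IEEE 754 binary64. Note that `sys.float_info.epsilon` follows the other common definition of machine epsilon, the gap between 1 and the next double (2^{−52}), which is twice the rounding-error bound quoted above.

```python
import math
import sys

# math.ulp(x) is the gap between x and the next larger representable double.
print(math.ulp(2.0**52))   # 1.0  (exactly the integers)
print(math.ulp(2.0**53))   # 2.0  (only even integers)
print(math.ulp(2.0**51))   # 0.5

# Above 2^53, odd integers are rounded to a neighbouring even one.
print(2.0**53 + 1 == 2.0**53)          # True

# A relative change of 2^-53 is lost when added to 1; 2^-52 is not.
print(1.0 + 2.0**-53 == 1.0)           # True
print(1.0 + 2.0**-52 > 1.0)            # True

# Python's epsilon is the spacing at 1.0, i.e. 2^-52.
print(sys.float_info.epsilon == 2.0**-52)  # True
```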
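As an illustration of these limits, the sketch below reads them from `sys.float_info` and rebuilds them with `math.ldexp`. It assumes a Python `float` that is an IEEE 754 binary64; the variable names are illustrative.

```python
import math
import sys

largest = sys.float_info.max            # about 1.7976931348623157e308
smallest_normal = sys.float_info.min    # about 2.2250738585072014e-308
smallest_subnormal = math.ldexp(1.0, -1074)  # 2^-1074, about 5e-324

# The limits follow from the exponent range -1022..+1023 and 52 fraction bits.
print(largest == math.ldexp(2.0 - 2.0**-52, 1023))   # True
print(smallest_normal == math.ldexp(1.0, -1022))     # True
print(smallest_subnormal == 5e-324)                  # True

# Going past either end leaves the representable range.
print(largest * 2 == math.inf)          # True: overflow to infinity
print(smallest_subnormal / 2 == 0.0)    # True: underflow to zero
```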
The double-precision binary floating-point exponent is encoded using an offset-binary representation, with the zero offset being 1023; also known as exponent bias in the IEEE 754 standard. Examples of such representations would be:
e=00000000001_{2} =001_{16} =1: 2^{1−1023} = 2^{−1022} (smallest exponent for normal numbers)
e=01111111111_{2} =3ff_{16} =1023: 2^{1023−1023} = 2^{0} (zero offset)
e=10000000101_{2} =405_{16} =1029: 2^{1029−1023} = 2^{6}
e=11111111110_{2} =7fe_{16} =2046: 2^{2046−1023} = 2^{1023} (highest exponent)
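The same four exponent fields can be read back from actual doubles. This is a minimal sketch, assuming Python's `float` is an IEEE 754 binary64; `biased_exponent` and `BIAS` are names introduced here for illustration.

```python
import struct

BIAS = 1023  # exponent bias of binary64


def biased_exponent(x: float) -> int:
    """Return the 11-bit exponent field e of x as an unsigned integer."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    return (bits >> 52) & 0x7FF


# 2^-1022, 2^0, 2^6 and 2^1023 carry the exponent fields listed above.
for value in (2.0**-1022, 1.0, 64.0, 2.0**1023):
    e = biased_exponent(value)
    print(f"e={e:011b} = {e} -> 2^{e - BIAS}")
```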
The exponents 000_{16} and 7ff_{16} have a special meaning: 00000000000_{2} = 000_{16} is used to represent a signed zero (if F=0) and subnormals (if F≠0); and 11111111111_{2} = 7ff_{16} is used to represent ∞ (if F=0) and NaNs (if F≠0), where F is the fractional part of the significand. All bit patterns are valid encodings.
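Except for the above exceptions, the entire double-precision number is described by:
(−1)^{sign} × 2^{e − 1023} × 1.fraction
In the case of subnormals (e=0) the double-precision number is described by:
(−1)^{sign} × 2^{1 − 1023} × 0.fraction = (−1)^{sign} × 2^{−1022} × 0.fraction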
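Both cases can be sketched as a small decoder that splits a double into its three fields and rebuilds the exact value as a rational number. This assumes Python's `float` is an IEEE 754 binary64; `decode_double` is a hypothetical helper, not a standard function.

```python
import struct
from fractions import Fraction


def decode_double(x: float):
    """Return (sign, e, fraction, value); value is None for infinities and NaNs."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63                     # 1 bit
    e = (bits >> 52) & 0x7FF              # 11-bit biased exponent
    fraction = bits & ((1 << 52) - 1)     # 52-bit fraction field

    if e == 0x7FF:
        value = None                      # infinity (fraction == 0) or NaN
    elif e == 0:
        # Zero and subnormals: no implicit leading 1, fixed exponent 1 - 1023.
        value = Fraction(fraction, 2**52) * Fraction(2) ** (1 - 1023)
    else:
        # Normal numbers: implicit leading 1.
        value = (1 + Fraction(fraction, 2**52)) * Fraction(2) ** (e - 1023)

    if value is not None and sign:
        value = -value
    return sign, e, fraction, value


print(decode_double(23.0))    # sign 0, e 1027, fraction 0x7000000000000, value 23
print(decode_double(5e-324)[3] == Fraction(1, 2**1074))  # True
```

For every finite double the rebuilt value agrees with `Fraction(x)`, which gives an independent check of the two formulas.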
Although the ubiquitous x86 processors of today use little-endian storage for all types of data (integer, floating point, BCD), there are a number of hardware architectures where floating-point numbers are represented in big-endian form while integers are represented in little-endian form.^{[2]} There are ARM processors that have half little-endian, half big-endian floating-point representation for double-precision numbers: both 32-bit words are stored in little-endian like integer registers, but the most significant one first. Because there have been many floating-point formats with no "network" standard representation for them, the XDR standard uses big-endian IEEE 754 as its representation. It may therefore appear strange that the widespread IEEE 754 floating-point standard does not specify endianness.^{[3]} Theoretically, this means that even standard IEEE floating-point data written by one machine might not be readable by another. However, on modern standard computers (i.e., implementing IEEE 754), one may in practice safely assume that the endianness is the same for floating-point numbers as for integers, making the conversion straightforward regardless of data type. (Small embedded systems using special floating-point formats may be another matter, however.)
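When exchanging doubles between machines, the byte order can be stated explicitly instead of relying on the host. The sketch below uses Python's `struct` module, where `>` selects big-endian (the order XDR uses) and `<` selects little-endian; it assumes the usual IEEE 754 binary64 `float`.

```python
import struct

big = struct.pack(">d", 1.0)      # big-endian: sign and exponent bytes first
little = struct.pack("<d", 1.0)   # little-endian: same bytes, reversed

print(big.hex())      # 3ff0000000000000
print(little.hex())   # 000000000000f03f

# Reading back with the matching byte order recovers the value.
print(struct.unpack(">d", big)[0])      # 1.0
print(struct.unpack("<d", little)[0])   # 1.0
```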
0 01111111111 0000000000000000000000000000000000000000000000000000_{2} ≙ +2^{0}·1 = 1

0 01111111111 0000000000000000000000000000000000000000000000000001_{2} ≙ +2^{0}·(1 + 2^{−52}) ≈ 1.0000000000000002, the smallest number > 1

0 01111111111 0000000000000000000000000000000000000000000000000010_{2} ≙ +2^{0}·(1 + 2^{−51}) ≈ 1.0000000000000004

0 10000000000 0000000000000000000000000000000000000000000000000000_{2} ≙ +2^{1}·1 = 2

1 10000000000 0000000000000000000000000000000000000000000000000000_{2} ≙ −2^{1}·1 = −2

0 10000000000 1000000000000000000000000000000000000000000000000000_{2} ≙ +2^{1}·1.1_{2} = 11_{2} = 3

0 10000000001 0000000000000000000000000000000000000000000000000000_{2} ≙ +2^{2}·1 = 100_{2} = 4

0 10000000001 0100000000000000000000000000000000000000000000000000_{2} ≙ +2^{2}·1.01_{2} = 101_{2} = 5

0 10000000001 1000000000000000000000000000000000000000000000000000_{2} ≙ +2^{2}·1.1_{2} = 110_{2} = 6

0 10000000011 0111000000000000000000000000000000000000000000000000_{2} ≙ +2^{4}·1.0111_{2} = 10111_{2} = 23

0 00000000000 0000000000000000000000000000000000000000000000000001_{2} ≙ +2^{−1022}·2^{−52} = 2^{−1074} ≈ 4.9·10^{−324} (min. subnormal positive double)

0 00000000000 1111111111111111111111111111111111111111111111111111_{2} ≙ +2^{−1022}·(1 − 2^{−52}) ≈ 2.2250738585072009·10^{−308} (max. subnormal double)

0 00000000001 0000000000000000000000000000000000000000000000000000_{2} ≙ +2^{−1022}·1 ≈ 2.2250738585072014·10^{−308} (min. normal positive double)

0 11111111110 1111111111111111111111111111111111111111111111111111_{2} ≙ +2^{1023}·(1 + (1 − 2^{−52})) ≈ 1.7976931348623157·10^{308} (max. double)

0 00000000000 0000000000000000000000000000000000000000000000000000_{2} ≙ +0

1 00000000000 0000000000000000000000000000000000000000000000000000_{2} ≙ −0

0 11111111111 0000000000000000000000000000000000000000000000000000_{2} ≙ +∞ (positive infinity)

1 11111111111 0000000000000000000000000000000000000000000000000000_{2} ≙ −∞ (negative infinity)

0 11111111111 0000000000000000000000000000000000000000000000000001_{2} ≙ NaN (sNaN on most processors, such as x86 and ARM)

0 11111111111 1000000000000000000000000000000000000000000000000001_{2} ≙ NaN (qNaN on most processors, such as x86 and ARM)

0 11111111111 1111111111111111111111111111111111111111111111111111_{2} ≙ NaN (an alternative encoding)

0 01111111101 0101010101010101010101010101010101010101010101010101_{2} = 3fd5 5555 5555 5555_{16} ≙ +2^{−2}·(1 + 2^{−2} + 2^{−4} + ... + 2^{−52}) ≈ ^{1}/_{3}

0 10000000000 1001001000011111101101010100010001000010110100011000_{2} = 4009 21fb 5444 2d18_{16} ≈ pi
Encodings of qNaN and sNaN are not completely specified in IEEE 754 and depend on the processor. Most processors, such as the x86 family and the ARM family processors, use the most significant bit of the significand field to indicate a quiet NaN; this is what is recommended by IEEE 754. The PA-RISC processors use the bit to indicate a signaling NaN.
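The x86/ARM convention can be expressed as a test on the raw 64-bit pattern. The sketch below works on integers rather than on live floating-point values, since hardware may quiet a signaling NaN when it is loaded; `is_quiet_nan_pattern` and `from_bits` are hypothetical helpers, and the quiet-bit convention is an assumption that does not hold on PA-RISC.

```python
import math
import struct

QUIET_BIT = 1 << 51              # most significant bit of the fraction field
FRACTION_MASK = (1 << 52) - 1


def is_quiet_nan_pattern(bits: int) -> bool:
    """True if `bits` encodes a NaN whose top fraction bit is set."""
    e = (bits >> 52) & 0x7FF
    fraction = bits & FRACTION_MASK
    return e == 0x7FF and fraction != 0 and bool(bits & QUIET_BIT)


def from_bits(bits: int) -> float:
    """Reinterpret a 64-bit integer as a double."""
    return struct.unpack(">d", bits.to_bytes(8, "big"))[0]


print(is_quiet_nan_pattern(0x7FF0000000000001))  # False: signaling under this convention
print(is_quiet_nan_pattern(0x7FF8000000000001))  # True
print(is_quiet_nan_pattern(0x7FF0000000000000))  # False: this is +infinity, not a NaN
print(math.isnan(from_bits(0x7FF8000000000001)))  # True
```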
By default, ^{1}/_{3} rounds down, instead of up like single precision, because of the odd number of bits in the significand.
In more detail:
Given the hexadecimal representation 3FD5 5555 5555 5555_{16},
Sign = 0
Exponent = 3FD_{16} = 1021
Exponent Bias = 1023 (constant value; see above)
Fraction = 5 5555 5555 5555_{16}
Value = 2^{(Exponent − Exponent Bias)} × 1.Fraction (note that Fraction must not be converted to decimal here)
= 2^{−2} × (15 5555 5555 5555_{16} × 2^{−52})
= 2^{−54} × 15 5555 5555 5555_{16}
= 0.333333333333333314829616256247390992939472198486328125
≈ 1/3
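This calculation can be reproduced with exact arithmetic. The sketch below assumes Python's `float` is an IEEE 754 binary64 and uses only the standard library; `Decimal(x)` and `Fraction(x)` convert a float to its exact stored value.

```python
import struct
from decimal import Decimal
from fractions import Fraction

third = 1 / 3

# Bit pattern of the stored value.
print(struct.pack(">d", third).hex())   # 3fd5555555555555

# Exact decimal expansion of the stored value.
print(Decimal(third))
# 0.333333333333333314829616256247390992939472198486328125

# Value = 2^-54 x 15 5555 5555 5555 (hex): a leading 1 followed by thirteen 5s.
significand = int("1" + "5" * 13, 16)
print(Fraction(third) == Fraction(significand, 2**54))   # True

# The stored value is slightly below 1/3, i.e. it was rounded down.
print(Fraction(third) < Fraction(1, 3))                  # True
```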
Using double-precision floating-point variables and mathematical functions (e.g., sin, cos, atan2, log, exp and sqrt) is usually slower than working with their single-precision counterparts. One area of computing where this is a particular issue is parallel code running on GPUs. For example, when using NVIDIA's CUDA platform, on video cards designed for gaming, calculations with double precision take 3 to 24 times longer to complete than calculations using single precision.^{[4]}
Doubles are implemented in many programming languages in different ways, such as the following. On processors with only dynamic precision, such as x86 without SSE2 (or when SSE2 is not used, for compatibility purposes) and with extended precision used by default, software may have difficulty fulfilling some requirements.
C and C++ offer a wide variety of arithmetic types. Double precision is not required by the standards (except by the optional annex F of C99, covering IEEE 754 arithmetic), but on most systems, the double type corresponds to double precision. However, on 32-bit x86 with extended precision by default, some compilers may not conform to the C standard and/or the arithmetic may suffer from double rounding.^{[5]}
Common Lisp provides the types SHORT-FLOAT, SINGLE-FLOAT, DOUBLE-FLOAT and LONG-FLOAT. Most implementations provide SINGLE-FLOATs and DOUBLE-FLOATs, with the other types as appropriate synonyms. Common Lisp provides exceptions for catching floating-point underflows and overflows, and the inexact floating-point exception, as per IEEE 754. No infinities and NaNs are described in the ANSI standard; however, several implementations do provide these as extensions.
As specified by the ECMAScript standard, all arithmetic in JavaScript shall be done using double-precision floating-point arithmetic.^{[6]}