The IEEE Standard for Floating-Point Arithmetic (IEEE 754) is a technical standard for floating-point computation established in 1985 by the Institute of Electrical and Electronics Engineers (IEEE). The standard addressed many problems found in the diverse floating point implementations that made them difficult to use reliably and portably. Many hardware floating point units now use the IEEE 754 standard.
The standard defines:
The current version, IEEE 754-2008 published in August 2008, includes nearly all of the original IEEE 754-1985 standard and the IEEE Standard for Radix-Independent Floating-Point Arithmetic (IEEE 854-1987).
The current version, IEEE 754-2008 published in August 2008, is derived from and replaces IEEE 754-1985, the previous version, following a seven-year revision process, chaired by Dan Zuras and edited by Mike Cowlishaw.
The binary formats in the original standard are included in the new standard along with three new basic formats, one binary and two decimal. To conform to the current standard, an implementation must implement at least one of the basic formats as both an arithmetic format and an interchange format.
An IEEE 754 format is a "set of representations of numerical values and symbols". A format may also include how the set is encoded.
A format comprises:
The possible finite values that can be represented in a format are determined by the base b, the number of digits in the significand (precision p), and the maximum exponent parameter emax:
Hence (for the example parameters) the smallest non-zero positive number that can be represented is 1×10−101 and the largest is 9999999×1090 (9.999999×1096), and the full range of numbers is −9.999999×1096 through 9.999999×1096. The numbers −b1−emax and b1−emax (here, −1×10−95 and 1×10−95) are the smallest (in magnitude) normal numbers; non-zero numbers between these smallest numbers are called subnormal numbers.
Zero values are finite values with significand 0. These are signed zeros, the sign bit specifies if a zero is +0 (positive zero) or −0 (negative zero).
Some numbers may have several representations in the model that has just been described. For instance, if b=10 and p=7, −12.345 can be represented by −12345×10−3, −123450×10−4, and −1234500×10−5. However, for most operations, such as arithmetic operations, the result (value) does not depend on the representation of the inputs.
For the decimal formats, any representation is valid, and the set of these representations is called a cohort. When a result can have several representations, the standard specifies which member of the cohort is chosen.
For the binary formats, the representation is made unique by choosing the smallest representable exponent allowing the value to be represented exactly. Further, the exponent is not represented directly, but a bias is added so that the smallest representable exponent is represented as 1, with 0 used for subnormal numbers. For numbers with an exponent in the normal range (the exponent field being not all ones or all zeros), the leading bit of the significand will always be 1. Consequently, a leading 1 can be implied rather than explicitly present in the memory encoding, and under the standard the explicitly represented part of the significand will lie between 0 and 1. This rule is called leading bit convention, implicit bit convention, or hidden bit convention. This rule allows the memory format to have one more bit of precision. The leading bit convention cannot be used for the subnormal numbers: they have an exponent outside the normal exponent range and scale by the smallest represented exponent as used for the smallest normal numbers.
The standard defines five basic formats that are named for their numeric base and the number of bits used in their interchange encoding. There are three binary floating-point basic formats (encoded with 32, 64 or 128 bits) and two decimal floating-point basic formats (encoded with 64 or 128 bits). The binary32 and binary64 formats are the single and double formats of IEEE 754-1985. A conforming implementation must fully implement at least one of the basic formats.
The standard also defines interchange formats, which generalize these basic formats. For the binary ones, the leading bit convention is required. The following table summarizes the smallest interchange formats (including the basic ones).
|E min||E max||Notes|
|binary16||Half precision||2||11||3.31||5||4.51||24−1 = 15||−14||+15||not basic|
|binary32||Single precision||2||24||7.22||8||38.23||27−1 = 127||−126||+127|
|binary64||Double precision||2||53||15.95||11||307.95||210−1 = 1023||−1022||+1023|
|binary128||Quadruple precision||2||113||34.02||15||4931.77||214−1 = 16383||−16382||+16383|
|binary256||Octuple precision||2||237||71.34||19||78913.2||218−1 = 262143||−262142||+262143||not basic|
Note that in the table above, the minimum exponents listed are for normal numbers; the special subnormal number representation allows even smaller numbers to be represented (with some loss of precision). For example, the smallest positive number that can be represented in binary64 is 2−1074 (because 1074 = 1022 + 53 − 1).
Decimal digits is digits × log10 base, this gives an approximate precision in decimal.
Decimal E max is Emax × log10 base, this gives the maximum exponent in decimal.
As stated previously, the binary32 and binary64 formats are identical to the single and double formats respectively of IEEE 754-1985 and are two of the most common formats used today. The figure below shows the absolute precision for both the binary32 and binary64 formats in the range of 10−12 to 10+12. Such a figure can be used to select an appropriate format given the expected value of a number and the required precision.
The standard specifies extended and extendable precision formats, which are recommended for allowing a greater precision than that provided by the basic formats. An extended precision format extends a basic format by using more precision and more exponent range. An extendable precision format allows the user to specify the precision and exponent range. An implementation may use whatever internal representation it chooses for such formats; all that needs to be defined are its parameters (b, p, and emax). These parameters uniquely describe the set of finite numbers (combinations of sign, significand, and exponent for the given radix) that it can represent.
The standard does not require an implementation to support extended or extendable precision formats.
The standard recommends that languages provide a method of specifying p and emax for each supported base b.
The standard recommends that languages and implementations support an extended format which has a greater precision than the largest basic format supported for each radix b.
For an extended format with a precision between two basic formats the exponent range must be as great as that of the next wider basic format. So for instance a 64-bit extended precision binary number must have an 'emax' of at least 16383. The x87 80-bit extended format meets this requirement.
Interchange formats are intended for the exchange of floating-point data using a fixed-length bit-string for a given format.
For the exchange of binary floating-point numbers, interchange formats of length 16 bits, 32 bits, 64 bits, and any multiple of 32 bits ≥128 are defined. The 16-bit format is intended for the exchange or storage of small numbers (e.g., for graphics).
The encoding scheme for these binary interchange formats is the same as that of IEEE 754-1985: a sign bit, followed by w exponent bits that describe the exponent offset by a bias, and p−1 bits that describe the significand. The width of the exponent field for a k-bit format is computed as w = round(4 log2(k))−13. The existing 64- and 128-bit formats follow this rule, but the 16- and 32-bit formats have more exponent bits (5 and 8) than this formula would provide (3 and 7, respectively).
As with IEEE 754-1985, the biased-exponent field is filled with all 1 bits to indicate either infinity (trailing significand field = 0) or a NaN (trailing significand field ≠ 0). For NaNs, quiet NaNs and signaling NaNs are distinguished by using the most significant bit of the trailing significand field exclusively (the standard recommends 0 for signaling NaNs, 1 for quiet NaNs, so that a signaling NaNs can be quieted by changing only this bit to 1, while the reverse could yield the encoding of an infinity), and the payload is carried in the remaining bits.
For the exchange of decimal floating-point numbers, interchange formats of any multiple of 32 bits are defined.
The encoding scheme for the decimal interchange formats similarly encodes the sign, exponent, and significand, but two different bit-level representations are defined. Interchange is complicated by the fact that some external indicator of the representation in use is required. The two options allow the significand to be encoded as a compressed sequence of decimal digits (using densely packed decimal) or alternatively as a binary integer. The former is more convenient for direct hardware implementation of the standard, while the latter is more suited to software emulation on a binary computer. In either case the set of numbers (combinations of sign, significand, and exponent) that may be encoded is identical, and special values (±zero, ±infinity, quiet NaNs, and signaling NaNs) have identical binary representations.
The standard defines five rounding rules. The first two rules round to a nearest value; the others are called directed roundings:
|Mode / Example Value||+11.5||+12.5||−11.5||−12.5|
|to nearest, ties to even||+12.0||+12.0||−12.0||−12.0|
|to nearest, ties away from zero||+12.0||+13.0||−12.0||−13.0|
Required operations for a supported arithmetic format (including the basic formats) include:
The standard provides a predicate totalOrder which defines a total ordering for all floating numbers for each format. The predicate agrees with the normal comparison operations when they say one floating point number is less than another. The normal comparison operations however treat NaNs as unordered and compare −0 and +0 as equal. The totalOrder predicate will order these cases, and it also distinguishes between different representations of NaNs and between the same decimal floating point number encoded in different ways.
The standard defines five exceptions, each of which returns a default value and has a corresponding status flag that (except in certain cases of underflow) is raised when the exception occurs. No other exception handling is required, but additional non-default alternatives are recommended (see below).
The five possible exceptions are:
These are the same five exceptions as were defined in IEEE 754-1985, but the division by zero exception has been extended to operations other than the division.
Additionally, operations like quantize when either operand is infinite, or when the result does not fit the destination format, will also signal invalid operation exception.
The standard recommends optional exception handling in various forms, including presubstitution of user-defined default values, and traps (exceptions that change the flow of control in some way) and other exception handling models which interrupt the flow, such as try/catch. The traps and other exception mechanisms remain optional, as they were in IEEE 754-1985.
Clause 9 in the standard recommends fifty operations, that language standards should define. These are all optional (not required in order to conform to the standard).
Recommended arithmetic operations, which must round correctly:
The asinPi, acosPi and tanPi functions are not part of the standard because the feeling was that they were less necessary. The first two are mentioned in a paragraph, but this is regarded as an error.
The operations also include setting and accessing dynamic mode rounding direction, and implementation-defined vector reduction operations such as sum, scaled product, and dot product, whose accuracy is unspecified by the standard.
The standard recommends how language standards should specify the semantics of sequences of operations, and points out the subtleties of literal meanings and optimizations that change the value of a result. By contrast, the previous 1985 version of the standard left aspects of the language interface unspecified, which led to inconsistent behavior between compilers, or different optimization levels in a single compiler.
Programming languages should allow a user to specify a minimum precision for intermediate calculations of expressions for each radix. This is referred to as "preferredWidth" in the standard, and it should be possible to set this on a per block basis. Intermediate calculations within expressions should be calculated, and any temporaries saved, using the maximum of the width of the operands and the preferred width, if set. Thus, for instance, a compiler targeting x87 floating-point hardware should have a means of specifying that intermediate calculations must use the double-extended format. The stored value of a variable must always be used when evaluating subsequent expressions, rather than any precursor from before rounding and assigning to the variable.
The IEEE 754-1985 allowed many variations in implementations (such as the encoding of some values and the detection of certain exceptions). IEEE 754-2008 has strengthened up many of these, but a few variations still remain (especially for binary formats). The reproducibility clause recommends that language standards should provide a means to write reproducible programs (i.e., programs that will produce the same result in all implementations of a language), and describes what needs to be done to achieve reproducible results.
The standard requires operations to convert between basic formats and external character sequence formats. Conversions to and from a decimal character format are required for all formats. Conversion to an external character sequence must be such that conversion back using round to even will recover the original number. There is no requirement to preserve the payload of a NaN or signaling NaN, and conversion from the external character sequence may turn a signaling NaN into a quiet NaN.
The original binary value will be preserved by converting to decimal and back again using:
For other binary formats, the required number of decimal digits is
where p is the number of significant bits in the binary format, e.g. 237 bits for binary256.
(Note: as an implementation limit, correct rounding is only guaranteed for the number of decimal digits above plus 3 for the largest supported binary format. For instance, if binary32 is the largest supported binary format, then a conversion from a decimal external sequence with 12 decimal digits is guaranteed to be correctly rounded when converted to binary32; but conversion of a sequence of 13 decimal digits is not; however the standard recommends that implementations impose no such limit.)
When using a decimal floating-point format, the decimal representation will be preserved using: