The IEEE Standard for Floating-Point Arithmetic (IEEE 754) is a technical standard for floating-point arithmetic established in 1985 by the Institute of Electrical and Electronics Engineers (IEEE). The standard addressed many problems found in the diverse floating-point implementations that made them difficult to use reliably and portably. Many hardware floating-point units use the IEEE 754 standard.
The standard defines:
- arithmetic formats: sets of binary and decimal floating-point data, which consist of finite numbers (including signed zeros and subnormal numbers), infinities, and special "not a number" values (NaNs);
- interchange formats: encodings (bit strings) that may be used to exchange floating-point data in an efficient and compact form;
- rounding rules: properties to be satisfied when rounding numbers during arithmetic and conversions;
- operations: arithmetic and other operations on arithmetic formats;
- exception handling: indications of exceptional conditions (such as division by zero, overflow, etc.).
The current version, IEEE 754-2008, was published in August 2008 following a seven-year revision process chaired by Dan Zuras and edited by Mike Cowlishaw. It is derived from and replaces IEEE 754-1985, the previous version, and includes nearly all of the original standard plus the IEEE 854-1987 Standard for Radix-Independent Floating-Point Arithmetic.
The international standard ISO/IEC/IEEE 60559:2011 (with content identical to IEEE 754-2008) has been approved for adoption through JTC1/SC 25 under the ISO/IEEE PSDO Agreement^{[1]} and published.^{[2]}
The binary formats in the original standard are included in the new standard along with three new basic formats, one binary and two decimal. To conform to the current standard, an implementation must implement at least one of the basic formats as both an arithmetic format and an interchange format.
As of September 2015, the standard is being revised to incorporate clarifications and errata.^{[3]}^{[4]}
An IEEE 754 format is a "set of representations of numerical values and symbols". A format may also include how the set is encoded.^{[5]}
A floating-point format is specified by:
- a base (also called radix) b, which is either 2 (binary) or 10 (decimal) in IEEE 754;
- a precision p, the number of digits in the significand;
- an exponent range from emin to emax, with emin = 1 − emax for all IEEE 754 formats.
A format comprises:
- finite numbers, each described by three integers: s, a sign (zero or one); c, a significand (or coefficient) with 0 ≤ c ≤ b^{p} − 1; and q, an exponent with emin ≤ q + p − 1 ≤ emax. The value of such a number is (−1)^{s} × c × b^{q};
- two infinities, +∞ and −∞;
- two kinds of NaN ("not a number"), a quiet NaN (qNaN) and a signaling NaN (sNaN).
For example, if b = 10, p = 7 and emax = 96, then emin = −95, the significand satisfies 0 ≤ c ≤ 9,999,999, and the exponent satisfies −101 ≤ q ≤ 90. Consequently, the smallest nonzero positive number that can be represented is 1×10^{−101}, and the largest is 9999999×10^{90} (9.999999×10^{96}), so the full range of numbers is −9.999999×10^{96} through 9.999999×10^{96}. The numbers −b^{1−emax} and b^{1−emax} (here, −1×10^{−95} and 1×10^{−95}) are the smallest (in magnitude) normal numbers; nonzero numbers between these smallest numbers are called subnormal numbers.
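The derived quantities in this example can be computed directly from b, p, and emax. The sketch below does so in Python; the helper name `format_parameters` is illustrative, not part of the standard:

```python
# Derive the example format's parameters (b = 10, p = 7, emax = 96).
# Symbol names mirror the text; the helper itself is hypothetical.
def format_parameters(b, p, emax):
    emin = 1 - emax                 # IEEE 754 fixes emin = 1 - emax
    qmin = emin - (p - 1)           # smallest significand exponent q
    qmax = emax - (p - 1)           # largest significand exponent q
    c_max = b**p - 1                # largest integral significand
    smallest_positive = b**qmin     # 1 x b^qmin, a subnormal value
    largest = c_max * b**qmax       # 9999999 x 10^90 in this example
    return emin, qmin, qmax, c_max, smallest_positive, largest

emin, qmin, qmax, c_max, tiny, huge = format_parameters(10, 7, 96)
print(emin, qmin, qmax)            # -95 -101 90
print(c_max)                       # 9999999
print(huge == 9999999 * 10**90)    # True
```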
Some numbers may have several possible exponential format representations. For instance, if b=10 and p=7, −12.345 can be represented by −12345×10^{−3}, −123450×10^{−4}, and −1234500×10^{−5}. However, for most operations, such as arithmetic operations, the result (value) does not depend on the representation of the inputs.
For the decimal formats, any representation is valid, and the set of these representations is called a cohort. When a result can have several representations, the standard specifies which member of the cohort is chosen.
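Python's decimal module, which follows the same decimal arithmetic model, preserves the exponent of a representation and so can illustrate a cohort (a sketch; these three values are the text's own example):

```python
from decimal import Decimal

# Three members of the same cohort for b = 10: equal in value,
# distinct in representation (the exponent differs).
a = Decimal((1, (1, 2, 3, 4, 5), -3))         # -12345 x 10^-3
b = Decimal((1, (1, 2, 3, 4, 5, 0), -4))      # -123450 x 10^-4
c = Decimal((1, (1, 2, 3, 4, 5, 0, 0), -5))   # -1234500 x 10^-5
print(a == b == c)                             # True: same value
print(a.as_tuple().exponent, b.as_tuple().exponent)   # -3 -4
```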
For the binary formats, the representation is made unique by choosing the smallest representable exponent allowing the value to be represented exactly. Further, the exponent is not represented directly, but a bias is added so that the smallest representable exponent is represented as 1, with 0 used for subnormal numbers. For numbers with an exponent in the normal range (the exponent field being not all ones or all zeros), the leading bit of the significand will always be 1. Consequently, a leading 1 can be implied rather than explicitly present in the memory encoding, and under the standard the explicitly represented part of the significand will lie between 0 and 1. This rule is called leading bit convention, implicit bit convention, or hidden bit convention. This rule allows the binary format to have an extra bit of precision. The leading bit convention cannot be used for the subnormal numbers as they have an exponent outside the normal exponent range and scale by the smallest represented exponent as used for the smallest normal numbers.
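The bias and the implicit leading bit can be made visible by decoding a binary32 value by hand (a minimal sketch for normal numbers only; the helper name is ours):

```python
import struct

# Decode a binary32 value field by field, showing the exponent bias
# (127) and the implicit leading significand bit.
def decode_binary32(x):
    bits, = struct.unpack('>I', struct.pack('>f', x))
    sign = bits >> 31
    e = (bits >> 23) & 0xFF          # biased exponent field (8 bits)
    frac = bits & 0x7FFFFF           # 23 explicitly stored bits
    assert 0 < e < 255               # normal range: not all 0s or all 1s
    significand = (1 << 23) | frac   # prepend the implicit leading 1
    return (-1)**sign * significand * 2.0**(e - 127 - 23)

print(decode_binary32(6.5))          # 6.5
```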
Due to the possibility of multiple encodings (at least in formats called interchange formats), a NaN may carry other information: a sign bit (which has no meaning, but may be used by some operations) and a payload, which is intended for diagnostic information indicating the source of the NaN (but the payload may have other uses, such as NaN-boxing^{[6]}^{[7]}^{[8]}).
The standard defines five basic formats that are named for their numeric base and the number of bits used in their interchange encoding. There are three binary floating-point basic formats (encoded with 32, 64 or 128 bits) and two decimal floating-point basic formats (encoded with 64 or 128 bits). The binary32 and binary64 formats are the single and double formats of IEEE 754-1985 respectively. A conforming implementation must fully implement at least one of the basic formats.
The standard also defines interchange formats, which generalize these basic formats.^{[9]} For the binary formats, the leading bit convention is required. The following table summarizes the smallest interchange formats (including the basic ones).
| Name | Common name | Base | Significand bits^{[b]} or digits | Decimal digits | Exponent bits | Decimal E max | Exponent bias^{[10]} | E min | E max | Notes |
|---|---|---|---|---|---|---|---|---|---|---|
| binary16 | Half precision | 2 | 11 | 3.31 | 5 | 4.51 | 2^{4}−1 = 15 | −14 | +15 | not basic |
| binary32 | Single precision | 2 | 24 | 7.22 | 8 | 38.23 | 2^{7}−1 = 127 | −126 | +127 | |
| binary64 | Double precision | 2 | 53 | 15.95 | 11 | 307.95 | 2^{10}−1 = 1023 | −1022 | +1023 | |
| binary128 | Quadruple precision | 2 | 113 | 34.02 | 15 | 4931.77 | 2^{14}−1 = 16383 | −16382 | +16383 | |
| binary256 | Octuple precision | 2 | 237 | 71.34 | 19 | 78913.2 | 2^{18}−1 = 262143 | −262142 | +262143 | not basic |
| decimal32 | | 10 | 7 | 7 | 7.58 | 96 | 101 | −95 | +96 | not basic |
| decimal64 | | 10 | 16 | 16 | 9.58 | 384 | 398 | −383 | +384 | |
| decimal128 | | 10 | 34 | 34 | 13.58 | 6144 | 6176 | −6143 | +6144 | |
Note that in the table above, the minimum exponents listed are for normal numbers; the special subnormal number representation allows even smaller numbers to be represented (with some loss of precision). For example, the smallest positive number that can be represented in binary64 is 2^{−1074} (because 1074 = 1022 + 53 − 1).
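The smallest subnormal can be checked directly in Python, whose float is a binary64 (a sketch; `math.ulp` requires Python 3.9+):

```python
import math
import sys

# The smallest positive binary64 value is the subnormal 2**-1074.
tiny = 2.0 ** -1074
print(tiny)                   # 5e-324
print(tiny / 2)               # 0.0: halving it underflows to zero
print(sys.float_info.min)     # 2.2250738585072014e-308, smallest *normal*
print(math.ulp(0.0) == tiny)  # True: it is also the spacing next to 0
```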
Decimal digits is digits × log_{10} base; this gives an approximate precision in decimal.
Decimal E max is E max × log_{10} base; this gives the maximum exponent in decimal.
As stated previously, the binary32 and binary64 formats are identical to the single and double formats respectively of IEEE 754-1985 and are two of the most common formats used today. The figure below shows the absolute precision for both the binary32 and binary64 formats in the range of 10^{−12} to 10^{+12}. Such a figure can be used to select an appropriate format given the expected value of a number and the required precision.
The standard specifies extended and extendable precision formats, which are recommended for allowing a greater precision than that provided by the basic formats.^{[11]} An extended precision format extends a basic format by using more precision and more exponent range. An extendable precision format allows the user to specify the precision and exponent range. An implementation may use whatever internal representation it chooses for such formats; all that needs to be defined are its parameters (b, p, and emax). These parameters uniquely describe the set of finite numbers (combinations of sign, significand, and exponent for the given radix) that it can represent.
The standard does not require an implementation to support extended or extendable precision formats.
The standard recommends that languages provide a method of specifying p and emax for each supported base b.^{[12]}
The standard recommends that languages and implementations support an extended format which has a greater precision than the largest basic format supported for each radix b.^{[13]}
For an extended format with a precision between two basic formats, the exponent range must be as great as that of the next wider basic format. So, for instance, an extended precision format based on the 64-bit binary format must have an emax of at least 16383. The x87 80-bit extended format meets this requirement.
Interchange formats are intended for the exchange of floating-point data using a fixed-length bit string for a given format.
For the exchange of binary floating-point numbers, interchange formats of length 16 bits, 32 bits, 64 bits, and any multiple of 32 bits ≥ 128 are defined. The 16-bit format is intended for the exchange or storage of small numbers (e.g., for graphics).
The encoding scheme for these binary interchange formats is the same as that of IEEE 754-1985: a sign bit, followed by w exponent bits that describe the exponent offset by a bias, and p−1 bits that describe the significand. The width of the exponent field for a k-bit format is computed as w = round(4 log_{2}(k)) − 13. The existing 64- and 128-bit formats follow this rule, but the 16- and 32-bit formats have more exponent bits (5 and 8, respectively) than this formula would provide (3 and 7, respectively).
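The exponent-width formula, together with the two stated exceptions, can be tabulated in a few lines (a sketch; the helper name is ours):

```python
import math

# w = round(4 * log2(k)) - 13 for a k-bit interchange format, with the
# 16- and 32-bit exceptions noted in the text hard-coded as overrides.
def exponent_bits(k):
    overrides = {16: 5, 32: 8}   # formats that deviate from the formula
    return overrides.get(k, round(4 * math.log2(k)) - 13)

for k in (16, 32, 64, 128, 256):
    print(k, exponent_bits(k))   # 16 5 / 32 8 / 64 11 / 128 15 / 256 19
```

The values for 64, 128, and 256 bits match the exponent-bit column of the format table above.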
As with IEEE 754-1985, the biased-exponent field is filled with all 1 bits to indicate either infinity (trailing significand field = 0) or a NaN (trailing significand field ≠ 0). For NaNs, quiet NaNs and signaling NaNs are distinguished by using the most significant bit of the trailing significand field exclusively (the standard recommends 0 for signaling NaNs and 1 for quiet NaNs, so that a signaling NaN can be quieted by changing only this bit to 1, while the reverse could yield the encoding of an infinity), and the payload is carried in the remaining bits.
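These fields can be inspected for the NaN produced by Python itself (a sketch; the quiet bit is 1 on the common x86 and ARM platforms, though the standard only recommends this):

```python
import struct

# Inspect the binary64 encoding of a NaN: the biased-exponent field is
# all ones, and the most significant trailing-significand bit is the
# quiet/signaling distinction.
bits, = struct.unpack('>Q', struct.pack('>d', float('nan')))
exp_field = (bits >> 52) & 0x7FF
quiet_bit = (bits >> 51) & 1
print(exp_field == 0x7FF)   # True: all ones signals infinity or NaN
print(quiet_bit)            # 1 on most platforms (a quiet NaN)
```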
For the exchange of decimal floatingpoint numbers, interchange formats of any multiple of 32 bits are defined.
The encoding scheme for the decimal interchange formats similarly encodes the sign, exponent, and significand, but two different bit-level representations are defined. Interchange is complicated by the fact that some external indicator of the representation in use is required. The two options allow the significand to be encoded as a compressed sequence of decimal digits (using densely packed decimal) or alternatively as a binary integer. The former is more convenient for direct hardware implementation of the standard, while the latter is more suited to software emulation on a binary computer. In either case the set of numbers (combinations of sign, significand, and exponent) that may be encoded is identical, and special values (±zero, ±infinity, quiet NaNs, and signaling NaNs) have identical binary representations.
The standard defines five rounding rules. The first two rules round to a nearest value; the others are called directed roundings:
| Mode / Example value | +11.5 | +12.5 | −11.5 | −12.5 |
|---|---|---|---|---|
| to nearest, ties to even | +12.0 | +12.0 | −12.0 | −12.0 |
| to nearest, ties away from zero | +12.0 | +13.0 | −12.0 | −13.0 |
| toward 0 | +11.0 | +12.0 | −11.0 | −12.0 |
| toward +∞ | +12.0 | +13.0 | −11.0 | −12.0 |
| toward −∞ | +11.0 | +12.0 | −12.0 | −13.0 |
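Python's decimal module exposes the same five rounding rules under its own names (ROUND_HALF_UP is "ties away from zero", ROUND_DOWN is "toward 0"), so the table can be reproduced directly:

```python
from decimal import (Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP,
                     ROUND_DOWN, ROUND_CEILING, ROUND_FLOOR)

# Re-create the rounding table row by row with decimal's rounding modes.
modes = [ROUND_HALF_EVEN, ROUND_HALF_UP, ROUND_DOWN,
         ROUND_CEILING, ROUND_FLOOR]
values = ['11.5', '12.5', '-11.5', '-12.5']
for mode in modes:
    row = [Decimal(v).quantize(Decimal('1'), rounding=mode) for v in values]
    print(mode, row)
```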
Required operations for a supported arithmetic format (including the basic formats) include:
- arithmetic operations (add, subtract, multiply, divide, square root, fused multiply–add, remainder);
- conversions (between formats, to and from strings, etc.);
- scaling and (for decimal) quantizing;
- copying and manipulating the sign (abs, negate, etc.);
- comparisons and total ordering;
- classification and testing for NaNs, etc.;
- testing and setting status flags.
The standard provides a predicate totalOrder which defines a total ordering for all floatingpoint data for each format. The predicate agrees with the normal comparison operations when one floatingpoint number is less than the other one. The normal comparison operations, however, treat NaNs as unordered and compare −0 and +0 as equal. The totalOrder predicate orders all floatingpoint data strictly and totally. When comparing two floatingpoint numbers, it acts as the ≤ operation, except that totalOrder(−0, +0) ∧ ¬ totalOrder(+0, −0), and different representations of the same floatingpoint number are ordered by their exponent multiplied by the sign bit. The ordering is then extended to the NaNs by ordering −qNaN < −sNaN < numbers < +sNaN < +qNaN, with ordering between two NaNs in the same class being based on the integer payload, multiplied by the sign bit, of those data.^{[21]}
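For binary64, a totalOrder-style predicate can be sketched with the standard bit trick of mapping each encoding to a monotone integer key (the helper names are ours; for binary formats each value has a unique representation, so the cohort-ordering clause does not arise):

```python
import struct

# Map a binary64 encoding to an integer key that sorts encodings in
# the total order: flip all bits of negative encodings (reversing
# their order), and set the top bit of non-negative ones.
def total_order_key(x):
    bits, = struct.unpack('>Q', struct.pack('>d', x))
    if bits >> 63:                        # sign bit set
        return bits ^ 0xFFFFFFFFFFFFFFFF
    return bits | (1 << 63)

def total_order(x, y):
    return total_order_key(x) <= total_order_key(y)

print(total_order(-0.0, 0.0), total_order(0.0, -0.0))  # True False
print(total_order(1.0, float('inf')))                  # True
print(total_order(float('inf'), float('nan')))         # True: +qNaN sorts last
```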
The standard defines five exceptions, each of which returns a default value and has a corresponding status flag that (except in certain cases of underflow) is raised when the exception occurs. No other exception handling is required, but additional nondefault alternatives are recommended (see below).
The five possible exceptions are:
- invalid operation (e.g., square root of a negative number), returning qNaN by default;
- division by zero (an operation on finite operands giving an exact infinite result, e.g., 1/0), returning ±∞ by default;
- overflow (a result is too large in magnitude to be represented), returning ±∞ by default for round-to-nearest;
- underflow (a result is very small, outside the normal range, and inexact), returning a subnormal value or zero by default;
- inexact (the exact result cannot be represented exactly), returning the correctly rounded result by default.
These are the same five exceptions as were defined in IEEE 754-1985, but the division by zero exception has been extended to operations other than division.
For decimal floating point, there are additional exceptions along with the above:^{[25]}^{[26]} clamped (the exponent of a result has been altered to fit the destination format) and rounded (the result's coefficient required rounding, i.e., digits were discarded).
Additionally, operations like quantize, when either operand is infinite or when the result does not fit the destination format, will also signal the invalid operation exception.^{[27]}
The standard recommends optional exception handling in various forms, including presubstitution of user-defined default values, and traps (exceptions that change the flow of control in some way) and other exception handling models which interrupt the flow, such as try/catch. The traps and other exception mechanisms remain optional, as they were in IEEE 754-1985.
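The flags-versus-traps distinction can be sketched with Python's decimal module, which follows the same model (note that decimal traps DivisionByZero by default, so the sketch disables the trap first):

```python
from decimal import Decimal, localcontext, DivisionByZero

# Default handling: the operation returns a default value (infinity)
# and raises a status flag. With the trap enabled, the same condition
# interrupts the flow as a Python exception instead.
with localcontext() as ctx:
    ctx.traps[DivisionByZero] = False
    print(Decimal(1) / Decimal(0))          # Infinity
    print(bool(ctx.flags[DivisionByZero]))  # True: the flag was raised

    ctx.traps[DivisionByZero] = True
    try:
        Decimal(1) / Decimal(0)
    except DivisionByZero:
        print('trapped')                    # trapped
```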
Clause 9 in the standard recommends fifty operations that language standards should define.^{[28]} These are all optional (not required in order to conform to the standard).
Recommended arithmetic operations, which must round correctly,^{[29]} include the exponential, logarithm, power, trigonometric, and hyperbolic functions and their inverses, together with variants such as expm1 and log1p and the sinPi and cosPi family.
The asinPi, acosPi and tanPi functions are not part of the standard because the feeling was that they were less necessary.^{[30]} The first two are mentioned in a paragraph, but this is regarded as an error.^{[31]}
The operations also include setting and accessing dynamic mode rounding direction,^{[32]} and implementation-defined vector reduction operations such as sum, scaled product, and dot product, whose accuracy is unspecified by the standard.^{[33]}
The standard recommends how language standards should specify the semantics of sequences of operations, and points out the subtleties of literal meanings and optimizations that change the value of a result. By contrast, the previous 1985 version of the standard left aspects of the language interface unspecified, which led to inconsistent behavior between compilers, or different optimization levels in a single compiler.
Programming languages should allow a user to specify a minimum precision for intermediate calculations of expressions for each radix. This is referred to as "preferredWidth" in the standard, and it should be possible to set this on a per-block basis. Intermediate calculations within expressions should be calculated, and any temporaries saved, using the maximum of the width of the operands and the preferred width, if set. Thus, for instance, a compiler targeting x87 floating-point hardware should have a means of specifying that intermediate calculations must use the double-extended format. The stored value of a variable must always be used when evaluating subsequent expressions, rather than any precursor from before rounding and assigning to the variable.
IEEE 754-1985 allowed many variations in implementations (such as the encoding of some values and the detection of certain exceptions). IEEE 754-2008 has tightened many of these, but a few variations still remain (especially for binary formats). The reproducibility clause recommends that language standards should provide a means to write reproducible programs (i.e., programs that will produce the same result in all implementations of a language), and describes what needs to be done to achieve reproducible results.
The standard requires operations to convert between basic formats and external character sequence formats.^{[34]} Conversions to and from a decimal character format are required for all formats. Conversion to an external character sequence must be such that conversion back using round to even will recover the original number. There is no requirement to preserve the payload of a NaN or signaling NaN, and conversion from the external character sequence may turn a signaling NaN into a quiet NaN.
The original binary value will be preserved by converting to decimal and back again using:^{[35]}
- 5 decimal digits for binary16,
- 9 decimal digits for binary32,
- 17 decimal digits for binary64,
- 36 decimal digits for binary128.
For other binary formats, the required number of decimal digits is 1 + ⌈p × log_{10}(2)⌉, where p is the number of significant bits in the binary format, e.g. 237 bits for binary256.
(Note: as an implementation limit, correct rounding is only guaranteed for the number of decimal digits above plus 3 for the largest supported binary format. For instance, if binary32 is the largest supported binary format, then a conversion from a decimal external sequence with 12 decimal digits is guaranteed to be correctly rounded when converted to binary32; but conversion of a sequence of 13 decimal digits is not; however the standard recommends that implementations impose no such limit.)
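The 17-digit round trip for binary64 can be demonstrated with Python floats (a sketch; `'.17g'` formats with 17 significant decimal digits):

```python
# Round-tripping a binary64 value through a 17-significant-digit
# decimal string always recovers the original value exactly.
samples = [0.1, 1 / 3, 2.0 ** -1074, 6.02214076e23]
for x in samples:
    s = format(x, '.17g')     # 17 significant decimal digits
    print(s, float(s) == x)   # each line ends with True
```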
When using a decimal floating-point format, the decimal representation will be preserved using:
- 7 decimal digits for decimal32,
- 16 decimal digits for decimal64,
- 34 decimal digits for decimal128.
Algorithms, with code, for correctly rounded conversion from binary to decimal and from decimal to binary are discussed in^{[36]} and, for testing, in.^{[37]}
The bfloat16 floating-point format is a computer number format occupying 16 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point. This format is a truncated (16-bit) version of the 32-bit IEEE 754 single-precision floating-point format (binary32) with the intent of accelerating machine learning and near-sensor computing. It preserves the approximate dynamic range of 32-bit floating-point numbers by retaining 8 exponent bits, but supports only an 8-bit precision rather than the 24-bit significand of the binary32 format. More so than single-precision 32-bit floating-point numbers, bfloat16 numbers are unsuitable for integer calculations, but this is not their intended use.
The bfloat16 format is utilized in upcoming Intel AI processors, such as Nervana NNP-L1000, Xeon processors, and Intel FPGAs, as well as in Google Cloud TPUs and TensorFlow.
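Because bfloat16 is a truncation of binary32, conversion can be sketched as dropping the low 16 bits (a round-toward-zero sketch; real hardware usually rounds to nearest, and the helper names are ours):

```python
import struct

# bfloat16 as a truncation of binary32: keep the sign bit, the full
# 8-bit exponent, and the top 7 significand bits.
def to_bfloat16_bits(x):
    bits, = struct.unpack('>I', struct.pack('>f', x))
    return bits >> 16                 # discard the low 16 significand bits

def from_bfloat16_bits(b16):
    return struct.unpack('>f', struct.pack('>I', b16 << 16))[0]

b = to_bfloat16_bits(3.14159265)
print(hex(b))                         # 0x4049
print(from_bfloat16_bits(b))          # 3.140625: only ~3 decimal digits survive
```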
Binary integer decimal

The IEEE 754-2008 standard includes an encoding format for decimal floating-point numbers in which the significand and the exponent (and the payloads of NaNs) can be encoded in two ways, referred to in the draft as binary encoding and decimal encoding. Both formats break a number down into a sign bit s, an exponent q (between qmin and qmax), and a p-digit significand c (between 0 and 10^{p}−1). The value encoded is (−1)^{s}×10^{q}×c. In both formats the range of possible values is identical, but they differ in how the significand c is represented. In the decimal encoding, it is encoded as a series of p decimal digits (using the densely packed decimal (DPD) encoding). This makes conversion to decimal form efficient, but requires a specialized decimal ALU to process. In the binary integer decimal (BID) encoding, it is encoded as a binary number.
C99

C99 (previously known as C9X) is an informal name for ISO/IEC 9899:1999, a past version of the C programming language standard. It extends the previous version (C90) with new features for the language and the standard library, and helps implementations make better use of available computer hardware, such as IEEE 754-1985 floating-point arithmetic, and compiler technology. The C11 version of the C programming language standard, published in 2011, replaces C99.
CBOR

CBOR (Concise Binary Object Representation) is a binary data serialization format loosely based on JSON. Like JSON, it allows the transmission of data objects that contain name–value pairs, but in a more concise manner. This increases processing and transfer speeds at the cost of human-readability. It is defined in IETF RFC 7049. Amongst other uses, it is the recommended data serialization layer for the CoAP Internet of Things protocol suite and the data format on which COSE messages are based. It is also used in the Client-to-Authenticator Protocol (CTAP) within the scope of the FIDO2 project.
Double-precision floating-point format

Double-precision floating-point format is a computer number format, usually occupying 64 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point.
Floating point is used to represent fractional values, or when a wider range is needed than is provided by fixed point (of the same bit width), even if at the cost of precision. Double precision may be chosen when the range or precision of single precision would be insufficient.
In the IEEE 754-2008 standard, the 64-bit base-2 format is officially referred to as binary64; it was called double in IEEE 754-1985. IEEE 754 specifies additional floating-point formats, including 32-bit base-2 single precision and, more recently, base-10 representations.
One of the first programming languages to provide single- and double-precision floating-point data types was Fortran. Before the widespread adoption of IEEE 754-1985, the representation and properties of floating-point data types depended on the computer manufacturer and computer model, and upon decisions made by programming-language implementers. E.g., GW-BASIC's double-precision data type was the 64-bit MBF floating-point format.
Extended precision

Extended precision refers to floating-point number formats that provide greater precision than the basic floating-point formats. Extended precision formats support a basic format by minimizing roundoff and overflow errors in intermediate values of expressions on the base format. In contrast to extended precision, arbitrary-precision arithmetic refers to implementations of much larger numeric types (with a storage count that usually is not a power of two) using special software (or, rarely, hardware).
Floating-point arithmetic

In computing, floating-point arithmetic (FP) is arithmetic using formulaic representation of real numbers as an approximation so as to support a trade-off between range and precision. For this reason, floating-point computation is often found in systems which include very small and very large real numbers, which require fast processing times. A number is, in general, represented approximately to a fixed number of significant digits (the significand) and scaled using an exponent in some fixed base; the base for the scaling is normally two, ten, or sixteen. A number that can be represented exactly is of the following form:

significand × base^{exponent}
where significand is an integer, base is an integer greater than or equal to two, and exponent is also an integer. For example: 1.2345 = 12345 × 10^{−4}, with significand 12345, base 10, and exponent −4.
The term floating point refers to the fact that a number's radix point (decimal point, or, more commonly in computers, binary point) can "float"; that is, it can be placed anywhere relative to the significant digits of the number. This position is indicated as the exponent component, and thus the floatingpoint representation can be thought of as a kind of scientific notation.
A floatingpoint system can be used to represent, with a fixed number of digits, numbers of different orders of magnitude: e.g. the distance between galaxies or the diameter of an atomic nucleus can be expressed with the same unit of length. The result of this dynamic range is that the numbers that can be represented are not uniformly spaced; the difference between two consecutive representable numbers grows with the chosen scale.
Over the years, a variety of floating-point representations have been used in computers. In 1985, the IEEE 754 Standard for Floating-Point Arithmetic was established, and since the 1990s, the most commonly encountered representations are those defined by the IEEE.
The speed of floating-point operations, commonly measured in terms of FLOPS, is an important characteristic of a computer system, especially for applications that involve intensive mathematical calculations.
A floating-point unit (FPU, colloquially a math coprocessor) is a part of a computer system specially designed to carry out operations on floating-point numbers.
Half-precision floating-point format

In computing, half precision is a binary floating-point computer number format that occupies 16 bits (two bytes in modern computers) in computer memory.
In the IEEE 754-2008 standard, the 16-bit base-2 format is referred to as binary16. It is intended for storage of floating-point values in applications where higher precision is not essential for performing arithmetic computations.
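The binary16 interchange encoding is directly accessible from Python via the struct module's `'e'` format character (a sketch; available since Python 3.6):

```python
import struct

# Round-trip a value through the two-byte binary16 encoding; packing
# rounds to the nearest representable half-precision value.
x = 1.0 / 3.0
half_bytes = struct.pack('<e', x)
print(len(half_bytes))               # 2
y = struct.unpack('<e', half_bytes)[0]
print(y)                             # 0.333251953125
print(abs(y - x) < 2 ** -11)         # True: within half precision's 11 bits
```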
Although implementations of the IEEE half-precision floating point are relatively new, several earlier 16-bit floating-point formats have existed, including that of Hitachi's HD61810 DSP of 1982, Scott's WIF, and the 3dfx Voodoo Graphics processor. Nvidia and Microsoft defined the half datatype in the Cg language, released in early 2002, and implemented it in silicon in the GeForce FX, released in late 2002. ILM was searching for an image format that could handle a wide dynamic range, but without the hard drive and memory cost of floating-point representations that are commonly used for floating-point computation (single and double precision). The hardware-accelerated programmable shading group led by John Airey at SGI (Silicon Graphics) invented the s10e5 data type in 1997 as part of the 'bali' design effort. This is described in a SIGGRAPH 2000 paper (see section 4.3) and further documented in US patent 7518615. This format is used in several computer graphics environments including OpenEXR, JPEG XR, GIMP, OpenGL, Cg, and D3DX. The advantage over 8-bit or 16-bit binary integers is that the increased dynamic range allows for more detail to be preserved in highlights and shadows for images. The advantage over 32-bit single-precision binary formats is that it requires half the storage and bandwidth (at the expense of precision and range). The F16C extension allows x86 processors to convert half-precision floats to and from single-precision floats.
IEEE 754-1985

IEEE 754-1985 was an industry standard for representing floating-point numbers in computers, officially adopted in 1985 and superseded in 2008 by IEEE 754-2008. During its 23 years, it was the most widely used format for floating-point computation. It was implemented in software, in the form of floating-point libraries, and in hardware, in the instructions of many CPUs and FPUs. The first integrated circuit to implement the draft of what was to become IEEE 754-1985 was the Intel 8087.
IEEE 754-1985 represents numbers in binary, providing definitions for four levels of precision, of which the two most commonly used are:
- single precision, occupying 32 bits, with roughly 7 decimal digits of precision;
- double precision, occupying 64 bits, with roughly 16 decimal digits of precision.
The standard also defines representations for positive and negative infinity, a "negative zero", five exceptions to handle invalid results like division by zero, special values called NaNs for representing those exceptions, denormal numbers to represent numbers smaller than shown above, and four rounding modes.
IEEE 754-2008 revision

IEEE 754-2008 (previously known as IEEE 754r) was published in August 2008 and is a significant revision to, and replaces, the IEEE 754-1985 floating-point standard. The revision extended the previous standard where it was necessary, added decimal arithmetic and formats, tightened up certain areas of the original standard which were left undefined, and merged in IEEE 854 (the radix-independent floating-point standard).
In a few cases, where stricter definitions of binary floating-point arithmetic might be performance-incompatible with some existing implementation, they were made optional.
IEEE 854-1987

IEEE Std 854-1987, the Standard for Radix-Independent Floating-Point Arithmetic, was the first Institute of Electrical and Electronics Engineers (IEEE) international standard for floating-point arithmetic with radix 2 or radix 10.
The standard was published in 1987 and was superseded, together with IEEE 754-1985, in 2008 by IEEE 754-2008 (the year of ratification appears after the dash). IEEE 854 did not specify any formats, whereas IEEE 754-1985 did. The current IEEE 754 standard specifies floating-point arithmetic for both radix 2 (binary) and radix 10 (decimal), and specifies two alternative formats for radix 10 floating-point values. IEEE 754-2008 also has many other updates to the IEEE floating-point standardisation.
ISO/IEC 10967

ISO/IEC 10967, Language independent arithmetic (LIA), is a series of standards on computer arithmetic. It is compatible with ISO/IEC/IEEE 60559:2011, better known as IEEE 754-2008, and much of its specification covers the IEEE 754 special values (though such values are not required by LIA itself, unless the parameter iec559 is true). It was developed by the working group ISO/IEC JTC1/SC22/WG11, which was disbanded in 2011. LIA currently consists of three parts:
Part 1: Integer and floating point arithmetic, second edition published 2012.
Part 2: Elementary numerical functions, first edition published 2001.
Part 3: Complex integer and floating point arithmetic and complex elementary numerical functions, first edition published 2006.
Minifloat

In computing, minifloats are floating-point values represented with very few bits. Predictably, they are not well suited for general-purpose numerical calculations. They are used for special purposes, most often in computer graphics, where iterations are small and precision has aesthetic effects. Additionally, they are frequently encountered as a pedagogical tool in computer-science courses to demonstrate the properties and structures of floating-point arithmetic and IEEE 754 numbers.
Minifloats with 16 bits are half-precision numbers (as opposed to single and double precision). There are also minifloats with 8 bits or even fewer.
Minifloats can be designed following the principles of the IEEE 754 standard. In this case they must obey the (not explicitly written) rules for the frontier between subnormal and normal numbers and must have special patterns for infinity and NaN. Normalized numbers are stored with a biased exponent. The new revision of the standard, IEEE 754-2008, has 16-bit binary minifloats.
The Radeon R300 and R420 GPUs used an "fp24" floatingpoint format with 7 bits of exponent and 16 bits (+1 implicit) of mantissa.
"Full Precision" in Direct3D 9.0 is a proprietary 24-bit floating-point format. Microsoft's D3D9 (Shader Model 2.0) graphics API initially supported both FP24 (as in ATI's R300 chip) and FP32 (as in Nvidia's NV30 chip) as "Full Precision", as well as FP16 as "Partial Precision" for vertex and pixel shader calculations performed by the graphics hardware.
In computer graphics minifloats are sometimes used to represent only integral values. If at the same time subnormal values should exist, the least subnormal number has to be 1. This statement can be used to calculate the bias value. The following example demonstrates the calculation, as well as the underlying principles.
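A decoder for a hypothetical 8-bit minifloat (1 sign bit, 4 exponent bits, 3 significand bits — this particular layout and its bias are our illustrative assumptions, not fixed by any standard) shows the derivation: requiring the least subnormal to equal 1 forces the bias, since least subnormal = 2^(1 − bias) × 2^(−3) = 2^(−2 − bias) = 1 gives bias = −2.

```python
# Hypothetical 1-4-3 minifloat whose least subnormal is 1, so every
# finite value it represents is an integer. From the constraint
# 2**(-2 - bias) == 1 we get BIAS = -2.
BIAS = -2

def decode(byte):
    sign = -1 if byte & 0x80 else 1
    e = (byte >> 3) & 0xF            # 4-bit exponent field
    m = byte & 0x7                   # 3-bit significand field
    if e == 0:                       # subnormal: no implicit leading 1
        return sign * m * 2 ** (1 - BIAS - 3)
    if e == 0xF:                     # all-ones exponent: inf and NaN
        return sign * float('inf') if m == 0 else float('nan')
    return sign * (8 + m) * 2 ** (e - BIAS - 3)   # normal: implicit 1

print(decode(0b0_0000_001))          # 1: the least subnormal
print(decode(0b0_0001_000))          # 8: the least normal, 1.0 * 2**3
```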
NaN

In computing, NaN, standing for not a number, is a numeric data type value representing an undefined or unrepresentable value, especially in floating-point arithmetic. Systematic use of NaNs was introduced by the IEEE 754 floating-point standard in 1985, along with the representation of other non-finite quantities such as infinities.
For example, 0/0 is undefined as a real number, and is therefore represented by NaN. The square root of a negative number is an imaginary number and cannot be represented as a real number, so it is represented by NaN. NaNs may also be used to represent missing values in computations. Two separate kinds of NaNs are provided, termed quiet NaNs and signaling NaNs. Quiet NaNs are used to propagate errors resulting from invalid operations or values. Signaling NaNs can support advanced features such as mixing numerical and symbolic computation or other extensions to basic floating-point arithmetic.
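The defining behaviors of quiet NaNs — propagation through arithmetic and unordered comparisons — can be observed in Python (a sketch; Python floats are binary64):

```python
import math

# NaN is unordered: every comparison involving it, including with
# itself, is false; and it propagates through arithmetic.
nan = float('nan')
print(nan == nan, nan < 1.0, nan > 1.0)   # False False False
print(math.isnan(nan + 1.0))              # True: NaN propagates
print(math.isnan(math.inf - math.inf))    # True: inf - inf is invalid
```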
Octuple-precision floating-point format

In computing, octuple precision is a binary floating-point-based computer number format that occupies 32 bytes (256 bits) in computer memory. This 256-bit octuple precision is for applications requiring results in higher than quadruple precision. This format is rarely (if ever) used and very few environments support it.
Quadruple-precision floating-point format

In computing, quadruple precision (or quad precision) is a binary floating-point–based computer number format that occupies 16 bytes (128 bits) with precision more than twice the 53-bit double precision.
This 128-bit quadruple precision is designed not only for applications requiring results in higher than double precision, but also, as a primary function, to allow the computation of double precision results more reliably and accurately by minimising overflow and round-off errors in intermediate calculations and scratch variables. William Kahan, primary architect of the original IEEE 754 floating-point standard, noted: "For now the 10-byte Extended format is a tolerable compromise between the value of extra-precise arithmetic and the price of implementing it to run fast; very soon two more bytes of precision will become tolerable, and ultimately a 16-byte format ... That kind of gradual evolution towards wider precision was already in view when IEEE Standard 754 for Floating-Point Arithmetic was framed." In IEEE 754-2008 the 128-bit base-2 format is officially referred to as binary128.
Signed zero

Signed zero is zero with an associated sign. In ordinary arithmetic, the number 0 does not have a sign, so that −0, +0 and 0 are identical. However, in computing, some number representations allow for the existence of two zeros, often denoted by −0 (negative zero) and +0 (positive zero), regarded as equal by the numerical comparison operations but with possibly different behaviors in particular operations. This occurs in the sign-and-magnitude and ones' complement signed number representations for integers, and in most floating-point number representations. The number 0 is usually encoded as +0, but can be represented by either +0 or −0.
The IEEE 754 standard for floating-point arithmetic (presently used by most computers and programming languages that support floating-point numbers) requires both +0 and −0. Real arithmetic with signed zeros can be considered a variant of the extended real number line such that 1/−0 = −∞ and 1/+0 = +∞; division is only undefined for ±0/±0 and ±∞/±∞.
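These properties can be checked in practice; a small Python sketch (Python raises ZeroDivisionError rather than returning ∞ for `1.0/0.0`, so the sign bit is exposed here via `math.copysign` and `math.atan2` instead):

```python
import math

neg_zero = -0.0
pos_zero = 0.0

# The two zeros compare equal under IEEE 754 numeric comparison:
print(neg_zero == pos_zero)          # True

# ...yet they carry different sign bits, which copysign exposes:
print(math.copysign(1.0, neg_zero))  # -1.0
print(math.copysign(1.0, pos_zero))  # 1.0

# Some operations are defined to distinguish the two zeros, e.g. atan2:
print(math.atan2(0.0, neg_zero))     # 3.141592653589793 (pi)
print(math.atan2(0.0, pos_zero))     # 0.0
```

The `atan2` case is exactly the kind of operation the article warns about: two inputs that compare equal numerically can still yield different results.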
Negatively signed zero echoes the mathematical analysis concept of approaching 0 from below as a one-sided limit, which may be denoted by x → 0⁻ or x ↑ 0. The notation "−0" may be used informally to denote a small negative number that has been rounded to zero. The concept of negative zero also has some theoretical applications in statistical mechanics and other disciplines.
It is claimed that the inclusion of signed zero in IEEE 754 makes it much easier to achieve numerical accuracy in some critical problems, in particular when computing with complex elementary functions. On the other hand, the concept of signed zero runs contrary to the general assumption made in most mathematical fields that negative zero is the same thing as zero. Representations that allow negative zero can be a source of errors in programs, if software developers do not take into account that while the two zero representations behave as equal under numeric comparisons, they yield different results in some operations.
Single-precision floating-point format

Single-precision floating-point format is a computer number format, usually occupying 32 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point.
A floating-point variable can represent a wider range of numbers than a fixed-point variable of the same bit width at the cost of precision. A signed 32-bit integer variable has a maximum value of 2^{31} − 1 = 2,147,483,647, whereas an IEEE 754 32-bit base-2 floating-point variable has a maximum value of (2 − 2^{−23}) × 2^{127} ≈ 3.4028235 × 10^{38}. All integers with 6 or fewer significant decimal digits, and any number that can be written as 2^{n} such that n is a whole number from −126 to 127, can be converted into an IEEE 754 floating-point value without loss of precision.
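These limits can be verified by round-tripping values through the binary32 format; a minimal Python sketch using the standard struct module:

```python
import struct

def to_float32(x):
    # Round-trip a Python float (binary64) through the binary32 format.
    return struct.unpack("<f", struct.pack("<f", x))[0]

# Largest finite binary32 value: (2 - 2**-23) * 2**127
max32 = (2 - 2**-23) * 2.0**127
print(to_float32(max32) == max32)            # True: exactly representable

# Integers survive the round trip exactly up to 2**24 = 16,777,216...
print(to_float32(16777216.0) == 16777216.0)  # True
# ...but 2**24 + 1 does not fit the 24-bit significand:
print(to_float32(16777217.0) == 16777217.0)  # False
```

The 24-bit significand (23 stored bits plus the implicit leading 1) is what bounds both the ~7 decimal digits of precision and the exact-integer range shown here.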
In the IEEE 754-2008 standard, the 32-bit base-2 format is officially referred to as binary32; it was called single in IEEE 754-1985. IEEE 754 specifies additional floating-point types, such as 64-bit base-2 double precision and, more recently, base-10 representations.
One of the first programming languages to provide single- and double-precision floating-point data types was Fortran. Before the widespread adoption of IEEE 754-1985, the representation and properties of floating-point data types depended on the computer manufacturer and computer model, and upon decisions made by programming-language designers. For example, GW-BASIC's single-precision data type was the 32-bit MBF floating-point format.
Single precision is termed REAL in Fortran, SINGLE-FLOAT in Common Lisp, float in C, C++, C#, and Java, Float in Haskell, and Single in Object Pascal (Delphi), Visual Basic, and MATLAB. However, float in Python, Ruby, PHP, and OCaml, and single in versions of Octave before 3.2, refer to double-precision numbers. In most implementations of PostScript, and in some embedded systems, the only supported precision is single.
William Kahan

William "Velvel" Morton Kahan (born June 5, 1933) is a Canadian mathematician and computer scientist who received the Turing Award in 1989 for "his fundamental contributions to numerical analysis", was named an ACM Fellow in 1994, and was inducted into the National Academy of Engineering in 2005.

Born to a Canadian Jewish family, he attended the University of Toronto, where he received his bachelor's degree in 1954, his master's degree in 1956, and his Ph.D. in 1958, all in the field of mathematics. Kahan is now emeritus professor of mathematics and of electrical engineering and computer sciences (EECS) at the University of California, Berkeley.
Kahan was the primary architect behind the IEEE 754-1985 standard for floating-point computation (and its radix-independent follow-on, IEEE 854). He has been called "The Father of Floating Point", since he was instrumental in creating the original IEEE 754 specification. Kahan continued his contributions to the IEEE 754 revision that led to the current IEEE 754 standard.
In the 1980s he developed the program "paranoia", a benchmark that tests for a wide range of potential floating-point bugs. It would go on to detect the infamous Pentium division bug, and continues to have important uses to this day. He also developed the Kahan summation algorithm, an important algorithm for minimizing the error introduced when adding a sequence of finite-precision floating-point numbers. He coined the term "The Table-Maker's Dilemma" for the unknown cost of correctly rounding transcendental functions to some preassigned number of digits.
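The Kahan summation algorithm mentioned above keeps a separate compensation term that recaptures the low-order bits lost in each addition; a minimal Python sketch:

```python
def kahan_sum(values):
    """Compensated (Kahan) summation: track the rounding error of each
    addition in a separate term so it is not lost permanently."""
    total = 0.0
    c = 0.0                  # running compensation for lost low-order bits
    for x in values:
        y = x - c            # apply the previous compensation first
        t = total + y        # total is large, y small: low bits of y are lost
        c = (t - total) - y  # (t - total) is what was actually added;
                             # subtracting y recovers the (negated) loss
        total = t
    return total

# Summing a million tiny values into a large one shows the difference.
# Naively, each 1e-16 is absorbed by 1.0 and the sum stays exactly 1.0:
vals = [1.0] + [1e-16] * 10**6
print(sum(vals))        # 1.0
print(kahan_sum(vals))  # close to the exact answer 1.0000000001
```

The key subtlety is that the algorithm relies on the compiler or interpreter *not* simplifying `(t - total) - y` to zero algebraically, which is why aggressive floating-point optimization flags can defeat it in compiled languages.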
The Davis–Kahan–Weinberger dilation theorem is one of the landmark results in the dilation theory of Hilbert space operators and has found applications in many different areas.

He is an outspoken advocate of better education of the general computing population about floating-point issues, and regularly denounces decisions in the design of computers and programming languages that may impair good floating-point computations.
When Hewlett–Packard introduced the original HP-35 pocket scientific calculator, its numerical accuracy in evaluating transcendental functions for some arguments was not optimal. Hewlett–Packard worked extensively with Kahan to enhance the accuracy of the algorithms, which led to major improvements. This was documented at the time in the Hewlett-Packard Journal.
He also contributed substantially to the design of the algorithms in the HP Voyager series, and wrote part of their intermediate and advanced manuals.
This page is based on a Wikipedia article written by its authors.
Text is available under the CC BYSA 3.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.