Significand

The significand (also mantissa or coefficient, sometimes also argument or fraction)[1] is part of a number in scientific notation or a floating-point number, consisting of its significant digits. Depending on the interpretation of the exponent, the significand may represent an integer or a fraction. The word mantissa seems to have been introduced by Arthur Burks in 1946,[2] writing for the Institute for Advanced Study at Princeton, although this use of the word is discouraged by the IEEE floating-point standard committee as well as by professionals such as William Kahan, the primary architect of the standard,[3] and Donald E. Knuth, author of The Art of Computer Programming.[4]

Example

The number 123.45 can be represented as a decimal floating-point number with the integer 12345 as the significand and a 10^−2 power term, also called the characteristic,[2][5][6] where −2 is the exponent (and 10 is the base). Its value is given by the following arithmetic:

123.45 = 12345 × 10^−2.

This same value can also be represented in normalized form with 1.2345 as the fractional coefficient, and +2 as the exponent (and 10 as the base):

123.45 = 1.2345 × 10^+2.

Schmid, however, called this representation with a significand ranging between 1.0 and 10 a modified normalized form.[5][6]

For base 2, this 1.xxxx form is also called a normalized significand.

Finally, the value can be represented in the format given by the Language Independent Arithmetic standard and several programming language standards, including Ada, C, Fortran and Modula-2, as

123.45 = 0.12345 × 10^+3.

Schmid called this representation with a significand ranging between 0.1 and 1.0 the true normalized form.[5][6]

This latter 0.xxxx form is called a normed significand.
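
As a quick check, the three forms above can be compared with a few lines of Python, using the standard decimal module so that the base-10 arithmetic is exact (binary floats would introduce their own rounding):

    from decimal import Decimal

    # The same value in the three conventions described above:
    integer_form    = Decimal(12345) * Decimal(10) ** -2     # 12345   x 10^-2
    normalized_form = Decimal("1.2345") * Decimal(10) ** 2   # 1.2345  x 10^+2
    normed_form     = Decimal("0.12345") * Decimal(10) ** 3  # 0.12345 x 10^+3

    assert integer_form == normalized_form == normed_form == Decimal("123.45")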

Significands and the hidden bit

For a normalized number, the most significant digit is always non-zero. When working in binary, this digit is necessarily 1 and therefore does not need to be stored explicitly; it is called the hidden bit. The significand is characterized by its width in (binary) digits, and depending on the context, the hidden bit may or may not be counted towards the width of the significand. For example, the same IEEE 754 double-precision format is commonly described as having either a 53-bit significand, including the hidden bit, or a 52-bit significand, excluding the hidden bit. IEEE 754 defines the precision p to be the number of digits in the significand, including any implicit leading bit (e.g., p = 53 for the double-precision format).
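
A small Python sketch (one possible way to inspect the fields, using the standard struct module) makes the hidden bit concrete for the double-precision format:

    import struct

    def double_fields(x: float):
        # Split a float (IEEE 754 binary64) into its stored bit fields.
        bits = struct.unpack(">Q", struct.pack(">d", x))[0]
        sign = bits >> 63
        biased_exponent = (bits >> 52) & 0x7FF   # 11 exponent bits
        fraction = bits & ((1 << 52) - 1)        # the 52 stored significand bits
        return sign, biased_exponent, fraction

    sign, e, frac = double_fields(1.5)
    # For a normal number, restoring the hidden bit gives the full
    # 53-bit significand:
    significand = (1 << 52) | frac
    print(sign, e - 1023, hex(significand))   # 0 0 0x18000000000000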

Use of "mantissa"

In American English, the original word for this seems to have been mantissa (Burks[2] et al.), and this usage remains common in computing and among computer scientists. However, the term significand was introduced by George Forsythe and Cleve Moler in 1967,[7][8][9][1] and the use of mantissa for this purpose is discouraged by the IEEE floating-point standard committee and by some professionals such as William Kahan[3] and Donald Knuth, because it conflicts with the pre-existing use of mantissa for the fractional part of a logarithm (see also common logarithm). For instance, Knuth adopts the third representation 0.12345 × 10^+3 in the example above and calls 0.12345 the fraction part of the number; he adds:[10] "it is an abuse of terminology to call the fraction part a mantissa, since this concept has quite a different meaning in connection with logarithms".

The confusion arises because scientific notation and floating-point representation are log-linear, not logarithmic. To multiply two numbers, given their logarithms, one just adds their characteristics (integer parts) and mantissas (fractional parts). By contrast, to multiply two floating-point numbers, one adds the exponents (which are logarithmic) and multiplies the significands (which are linear).
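
The distinction can be made concrete in Python with math.frexp and math.ldexp, which expose a float's binary significand and exponent (a minimal sketch of the idea, not how hardware actually implements multiplication):

    import math

    def fp_multiply(a: float, b: float) -> float:
        # a = sa * 2**ea and b = sb * 2**eb, with 0.5 <= |sa|, |sb| < 1
        sa, ea = math.frexp(a)
        sb, eb = math.frexp(b)
        # Add the (logarithmic) exponents, multiply the (linear) significands:
        return math.ldexp(sa * sb, ea + eb)

    assert fp_multiply(123.45, 6.789) == 123.45 * 6.789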

References

  1. ^ a b Savard, John J. G. (2018) [2005]. "Floating-Point Formats". quadibloc. A Note on Field Designations. Archived from the original on 2018-07-16. Retrieved 2018-07-16.
  2. ^ a b c Burks, Arthur Walter; Goldstine, Herman H.; von Neumann, John (1963) [1946]. "5.3.". In Taub, A. H. Preliminary discussion of the logical design of an electronic computing instrument (PDF). Collected Works of John von Neumann (Technical report, Institute for Advanced Study, Princeton, New Jersey, USA). 5. New York, USA: The Macmillan Company. p. 42. Retrieved 2016-02-07. Several of the digital computers being built or planned in this country and England are to contain a so-called "floating decimal point". This is a mechanism for expressing each word as a characteristic and a mantissa—e.g. 123.45 would be carried in the machine as (0.12345,03), where the 3 is the exponent of 10 associated with the number.
  3. ^ a b Kahan, William Morton (2002-04-19), Names for Standardized Floating-Point Formats (PDF), m is the significand or coefficient or (wrongly) mantissa
  4. ^ Knuth, Donald Ervin. The Art of Computer Programming, Vol. 2, Chapter 4.2. Addison-Wesley.
  5. ^ a b c Schmid, Hermann (1974). Decimal Computation (1 ed.). Binghamton, New York, USA: John Wiley & Sons, Inc. pp. 204–205. ISBN 0-471-76180-X. Retrieved 2016-01-03.
  6. ^ a b c Schmid, Hermann (1983) [1974]. Decimal Computation (1 (reprint) ed.). Malabar, Florida, USA: Robert E. Krieger Publishing Company. pp. 204–205. ISBN 0-89874-318-4. Retrieved 2016-01-03. (NB. At least some batches of this reprint edition were misprints with defective pages 115–146.)
  7. ^ Forsythe, George Elmer; Moler, Cleve Barry (September 1967). Computer Solution of Linear Algebraic Systems. Automatic Computation (1st ed.). New Jersey, USA: Prentice-Hall, Englewood Cliffs. ISBN 0-13-165779-8.
  8. ^ Sterbenz, Pat H. (1974-05-01). Floating-Point Computation. Prentice-Hall Series in Automatic Computation (1 ed.). Englewood Cliffs, New Jersey, USA: Prentice Hall. ISBN 0-13-322495-3.
  9. ^ Goldberg, David (March 1991). "What Every Computer Scientist Should Know About Floating-Point Arithmetic" (PDF). Computing Surveys. Xerox Palo Alto Research Center (PARC), Palo Alto, California, USA: Association for Computing Machinery, Inc. 23 (1): 7. Archived (PDF) from the original on 2016-07-13. Retrieved 2016-07-13. This term was introduced by Forsythe and Moler [1967], and has generally replaced the older term mantissa.
  10. ^ Knuth, Donald Ervin (1969). "4.2.1.A". The Art of Computer Programming. 2. Addison-Wesley.
Bfloat16 floating-point format

The bfloat16 floating-point format is a computer number format occupying 16 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point. This format is a truncated (16-bit) version of the 32-bit IEEE 754 single-precision floating-point format (binary32), with the intent of accelerating machine learning and near-sensor computing. It preserves the approximate dynamic range of 32-bit floating-point numbers by retaining 8 exponent bits, but supports only 8 bits of precision rather than the 24-bit significand of the binary32 format. Even more than single-precision 32-bit floating-point numbers, bfloat16 numbers are unsuitable for integer calculations, but this is not their intended use.
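
A rough Python sketch of the truncation, using the standard struct module (real converters typically round to nearest even rather than simply truncating):

    import struct

    def float_to_bfloat16_bits(x: float) -> int:
        # Keep the top 16 bits of binary32: sign, all 8 exponent bits,
        # and 7 stored significand bits.
        bits32 = struct.unpack(">I", struct.pack(">f", x))[0]
        return bits32 >> 16

    def bfloat16_bits_to_float(bits16: int) -> float:
        # Widen back to binary32 by appending 16 zero bits.
        return struct.unpack(">f", struct.pack(">I", bits16 << 16))[0]

    print(bfloat16_bits_to_float(float_to_bfloat16_bits(3.14159)))  # 3.140625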

The bfloat16 format is utilized in upcoming Intel AI processors such as the Nervana NNP-L1000, in Xeon processors and Intel FPGAs, and in Google Cloud TPUs and TensorFlow.

Binary integer decimal

The IEEE 754-2008 standard includes an encoding format for decimal floating-point numbers in which the significand and the exponent (and the payloads of NaNs) can be encoded in two ways, referred to in the draft as binary encoding and decimal encoding. Both formats break a number down into a sign bit s, an exponent q (between q_min and q_max), and a p-digit significand c (between 0 and 10^p − 1). The value encoded is (−1)^s × 10^q × c. In both formats the range of possible values is identical, but they differ in how the significand c is represented. In the decimal encoding, it is encoded as a series of p decimal digits (using the densely packed decimal (DPD) encoding). This makes conversion to decimal form efficient, but requires a specialized decimal ALU to process. In the binary integer decimal (BID) encoding, it is encoded as a binary number.
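
Whichever encoding carries the fields, the decoded value follows the same formula; a minimal Python sketch using the standard decimal module (with the bit-field extraction itself, the hard part of either encoding, omitted):

    from decimal import Decimal

    def decoded_value(s: int, q: int, c: int) -> Decimal:
        # value = (-1)^s * 10^q * c
        return (-1) ** s * Decimal(c).scaleb(q)

    print(decoded_value(0, -2, 12345))  # 123.45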

Computer number format

A computer number format is the internal representation of numeric values in digital computer and calculator hardware and software. Normally, numeric values are stored as groupings of bits, named for the number of bits that compose them. The encoding between numerical values and bit patterns is chosen for convenience of the operation of the computer; the bit format used by the computer's instruction set generally requires conversion for external use such as printing and display. Different types of processors may have different internal representations of numerical values. Different conventions are used for integer and real numbers. Most calculations are carried out with number formats that fit into a processor register, but some software systems allow representation of arbitrarily large numbers using multiple words of memory.

Decimal128 floating-point format

In computing, decimal128 is a decimal floating-point computer numbering format that occupies 16 bytes (128 bits) in computer memory.

It is intended for applications where it is necessary to emulate decimal rounding exactly, such as financial and tax computations.

Decimal128 supports 34 decimal digits of significand and an exponent range of −6143 to +6144, i.e. ±0.000000000000000000000000000000000×10^−6143 to ±9.999999999999999999999999999999999×10^6144. (Equivalently, ±0000000000000000000000000000000000×10^−6176 to ±9999999999999999999999999999999999×10^6111.) Therefore, decimal128 has the greatest range of values compared with other IEEE basic floating-point formats. Because the significand is not normalized, most values with less than 34 significant digits have multiple possible representations; 1×10^2 = 0.1×10^3 = 0.01×10^4, etc. Zero has 12288 possible representations (24576 if both signed zeros are included).
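
Python's decimal module implements this kind of decimal arithmetic in general (not decimal128 specifically), and shows how numerically equal values can carry distinct coefficient/exponent pairs:

    from decimal import Decimal

    # Three members of the same "cohort": equal in value, but with
    # different (coefficient, exponent) representations.
    a, b, c = Decimal("1E+2"), Decimal("10E+1"), Decimal("100")

    assert a == b == c
    print(a.as_tuple())  # DecimalTuple(sign=0, digits=(1,), exponent=2)
    print(b.as_tuple())  # DecimalTuple(sign=0, digits=(1, 0), exponent=1)
    print(c.as_tuple())  # DecimalTuple(sign=0, digits=(1, 0, 0), exponent=0)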

Decimal128 floating point is a relatively new decimal floating-point format, formally introduced in the 2008 version of IEEE 754 as well as with ISO/IEC/IEEE 60559:2011.

Decimal32 floating-point format

In computing, decimal32 is a decimal floating-point computer numbering format that occupies 4 bytes (32 bits) in computer memory.

It is intended for applications where it is necessary to emulate decimal rounding exactly, such as financial and tax computations. Like the binary16 format, it is intended for memory saving storage.

Decimal32 supports 7 decimal digits of significand and an exponent range of −95 to +96, i.e. ±0.000000×10^−95 to ±9.999999×10^96. (Equivalently, ±0000000×10^−101 to ±9999999×10^90.) Because the significand is not normalized (there is no implicit leading "1"), most values with less than 7 significant digits have multiple possible representations; 1×10^2 = 0.1×10^3 = 0.01×10^4, etc. Zero has 192 possible representations (384 when both signed zeros are included).

Decimal32 floating point is a relatively new decimal floating-point format, formally introduced in the 2008 version of IEEE 754 as well as with ISO/IEC/IEEE 60559:2011.

Decimal64 floating-point format

In computing, decimal64 is a decimal floating-point computer numbering format that occupies 8 bytes (64 bits) in computer memory.

It is intended for applications where it is necessary to emulate decimal rounding exactly, such as financial and tax computations.

Decimal64 supports 16 decimal digits of significand and an exponent range of −383 to +384, i.e. ±0.000000000000000×10^−383 to ±9.999999999999999×10^384. (Equivalently, ±0000000000000000×10^−398 to ±9999999999999999×10^369.) In contrast, the corresponding binary format, which is the most commonly used type, has an approximate range of ±0.000000000000001×10^−308 to ±1.797693134862315×10^308. Because the significand is not normalized, most values with less than 16 significant digits have multiple possible representations; 1×10^2 = 0.1×10^3 = 0.01×10^4, etc. Zero has 768 possible representations (1536 if both signed zeros are included).

Decimal64 floating point is a relatively new decimal floating-point format, formally introduced in the 2008 version of IEEE 754 as well as with ISO/IEC/IEEE 60559:2011.

Decimal floating point

Decimal floating-point (DFP) arithmetic refers to both a representation and operations on decimal floating-point numbers. Working directly with decimal (base-10) fractions can avoid the rounding errors that otherwise typically occur when converting between decimal fractions (common in human-entered data, such as measurements or financial information) and binary (base-2) fractions.

The advantage of decimal floating-point representation over decimal fixed-point and integer representation is that it supports a much wider range of values. For example, while a fixed-point representation that allocates 8 decimal digits and 2 decimal places can represent the numbers 123456.78, 8765.43, 123.00, and so on, a floating-point representation with 8 decimal digits could also represent 1.2345678, 1234567.8, 0.000012345678, 12345678000000000, and so on. This wider range can dramatically slow the accumulation of rounding errors during successive calculations; for example, the Kahan summation algorithm can be used in floating point to add many numbers with no asymptotic accumulation of rounding error.
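
The Kahan summation algorithm mentioned above is short enough to state directly; a straightforward Python version:

    def kahan_sum(values):
        total = 0.0
        compensation = 0.0           # running estimate of lost low-order bits
        for x in values:
            y = x - compensation     # correct the incoming term
            t = total + y            # low-order bits of y may be lost here
            compensation = (t - total) - y   # recover exactly what was lost
            total = t
        return total

    data = [1.0] + [1e-16] * 4
    print(sum(data))        # 1.0 -- the tiny terms vanish in naive summation
    print(kahan_sum(data))  # 1.0000000000000004 -- they are preserved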

Denormal number

In computer science, denormal numbers or denormalized numbers (now often called subnormal numbers) fill the underflow gap around zero in floating-point arithmetic. Any non-zero number with magnitude smaller than the smallest normal number is "subnormal".

In a normal floating-point value, there are no leading zeros in the significand; instead, leading zeros are removed by adjusting the exponent. So 0.0123 would be written as 1.23 × 10^−2. Denormal numbers are numbers where this representation would result in an exponent that is below the smallest representable exponent (the exponent usually having a limited range).

Such numbers are represented using leading zeros in the significand.

The significand (or mantissa) of an IEEE floating-point number is the part of a floating-point number that represents the significant digits. For a positive normalised number it can be represented as m_0.m_1m_2m_3...m_(p−2)m_(p−1) (where m represents a significant digit, and p is the precision) with non-zero m_0. Notice that for a binary radix, the leading binary digit is always 1. In a denormal number, since the exponent is the least that it can be, zero is the leading significand digit (0.m_1m_2m_3...m_(p−2)m_(p−1)), allowing the representation of numbers closer to zero than the smallest normal number. A floating-point number may be recognized as denormal whenever its exponent is the least value possible.

By filling the underflow gap like this, significant digits are lost, but not as abruptly as when using the flush to zero on underflow approach (discarding all significant digits when underflow is reached). Hence the production of a denormal number is sometimes called gradual underflow because it allows a calculation to lose precision slowly when the result is small.
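
Gradual underflow is easy to observe in any IEEE 754 environment; for example, in Python:

    import sys

    smallest_normal = sys.float_info.min   # 2.2250738585072014e-308

    # Halving the smallest normal does not flush to zero; it yields a
    # subnormal with one fewer bit of precision (gradual underflow).
    print(smallest_normal / 2)   # 1.1125369292536007e-308
    print(5e-324 / 2)            # 0.0 -- below the smallest subnormal (2**-1074)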

In IEEE 754-2008, denormal numbers are renamed subnormal numbers and are supported in both binary and decimal formats. In binary interchange formats, subnormal numbers are encoded with a biased exponent of 0, but are interpreted with the value of the smallest allowed exponent, which is one greater (i.e., as if it were encoded as a 1). In decimal interchange formats they require no special encoding because the format supports unnormalized numbers directly.

Mathematically speaking, the normalized floating-point numbers of a given sign are roughly logarithmically spaced, and as such any finite-sized normal float cannot include zero. The denormal floats are a linearly spaced set of values, which span the gap between the negative and positive normal floats.

Double-precision floating-point format

Double-precision floating-point format is a computer number format, usually occupying 64 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point.

Floating point is used to represent fractional values, or when a wider range is needed than is provided by fixed point (of the same bit width), even if at the cost of precision. Double precision may be chosen when the range or precision of single precision would be insufficient.

In the IEEE 754-2008 standard, the 64-bit base-2 format is officially referred to as binary64; it was called double in IEEE 754-1985. IEEE 754 specifies additional floating-point formats, including 32-bit base-2 single precision and, more recently, base-10 representations.

One of the first programming languages to provide single- and double-precision floating-point data types was Fortran. Before the widespread adoption of IEEE 754-1985, the representation and properties of floating-point data types depended on the computer manufacturer and computer model, and upon decisions made by programming-language implementers. E.g., GW-BASIC's double-precision data type was the 64-bit MBF floating-point format.

Extended precision

Extended precision refers to floating point number formats that provide greater precision than the basic floating point formats. Extended precision formats support a basic format by minimizing roundoff and overflow errors in intermediate values of expressions on the base format. In contrast to extended precision, arbitrary-precision arithmetic refers to implementations of much larger numeric types (with a storage count that usually is not a power of two) using special software (or, rarely, hardware).

Floating-point arithmetic

In computing, floating-point arithmetic (FP) is arithmetic using formulaic representation of real numbers as an approximation so as to support a trade-off between range and precision. For this reason, floating-point computation is often found in systems which include very small and very large real numbers, which require fast processing times. A number is, in general, represented approximately to a fixed number of significant digits (the significand) and scaled using an exponent in some fixed base; the base for the scaling is normally two, ten, or sixteen. A number that can be represented exactly is of the following form:

significand × base^exponent,

where significand is an integer, base is an integer greater than or equal to two, and exponent is also an integer. For example:

1.2345 = 12345 × 10^−4.
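
In Python, a float's exact stored value can be exposed as a ratio of integers, showing which decimal numbers have such a base-2 form and which do not:

    from fractions import Fraction

    print(Fraction(1.5))   # 3/2 -- 1.5 = 3 x 2^-1 is exactly representable
    # 0.1 has no finite base-2 form, so the nearest representable value is
    # stored instead:
    print(Fraction(0.1))   # 3602879701896397/36028797018963968, not 1/10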

The term floating point refers to the fact that a number's radix point (decimal point, or, more commonly in computers, binary point) can "float"; that is, it can be placed anywhere relative to the significant digits of the number. This position is indicated as the exponent component, and thus the floating-point representation can be thought of as a kind of scientific notation.

A floating-point system can be used to represent, with a fixed number of digits, numbers of different orders of magnitude: e.g. the distance between galaxies or the diameter of an atomic nucleus can be expressed with the same unit of length. The result of this dynamic range is that the numbers that can be represented are not uniformly spaced; the difference between two consecutive representable numbers grows with the chosen scale.
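
Python's math.ulp (available since Python 3.9) reports the gap between a double and the next representable value, illustrating this growing spacing:

    import math

    print(math.ulp(1.0))   # 2.220446049250313e-16
    print(math.ulp(1e6))   # 1.1641532182693481e-10
    print(math.ulp(1e15))  # 0.125
    print(math.ulp(1e18))  # 128.0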

Over the years, a variety of floating-point representations have been used in computers. In 1985, the IEEE 754 Standard for Floating-Point Arithmetic was established, and since the 1990s, the most commonly encountered representations are those defined by the IEEE.

The speed of floating-point operations, commonly measured in terms of FLOPS, is an important characteristic of a computer system, especially for applications that involve intensive mathematical calculations.

A floating-point unit (FPU, colloquially a math coprocessor) is a part of a computer system specially designed to carry out operations on floating-point numbers.

Half-precision floating-point format

In computing, half precision is a binary floating-point computer number format that occupies 16 bits (two bytes in modern computers) in computer memory.

In the IEEE 754-2008 standard, the 16-bit base-2 format is referred to as binary16. It is intended for storage of floating-point values in applications where higher precision is not essential for performing arithmetic computations.

Although implementations of the IEEE half-precision floating-point format are relatively new, several earlier 16-bit floating-point formats have existed, including that of Hitachi's HD61810 DSP of 1982, Scott's WIF, and the 3dfx Voodoo Graphics processor.

Nvidia and Microsoft defined the half datatype in the Cg language, released in early 2002, and implemented it in silicon in the GeForce FX, released in late 2002. ILM was searching for an image format that could handle a wide dynamic range, but without the hard drive and memory cost of the floating-point representations that are commonly used for floating-point computation (single and double precision). The hardware-accelerated programmable shading group led by John Airey at SGI (Silicon Graphics) invented the s10e5 data type in 1997 as part of the 'bali' design effort. This is described in a SIGGRAPH 2000 paper (see section 4.3) and further documented in US patent 7518615.

This format is used in several computer graphics environments, including OpenEXR, JPEG XR, GIMP, OpenGL, Cg, and D3DX. The advantage over 8-bit or 16-bit binary integers is that the increased dynamic range allows more detail to be preserved in highlights and shadows of images. The advantage over 32-bit single-precision binary formats is that it requires half the storage and bandwidth (at the expense of precision and range).

The F16C extension allows x86 processors to convert half-precision floats to and from single-precision floats.
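
Python's struct module can pack and unpack binary16 directly via the 'e' format code (Python 3.6+), which makes the format's limited precision easy to observe:

    import struct

    def to_half_and_back(x: float) -> float:
        # Round-trip through binary16 (11 bits of precision: 10 stored + hidden).
        return struct.unpack("<e", struct.pack("<e", x))[0]

    print(to_half_and_back(3.14159))  # 3.140625
    print(to_half_and_back(65504.0))  # 65504.0, the largest finite half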

IBM hexadecimal floating point

IBM System/360 computers, and subsequent machines based on that architecture (mainframes), support a hexadecimal floating-point format (HFP). In comparison to IEEE 754 floating point, the IBM floating-point format has a longer significand and a shorter exponent. All IBM floating-point formats have 7 bits of exponent with a bias of 64. The normalized range of representable numbers is from 16^−65 to 16^63 (approx. 5.39761 × 10^−79 to 7.237005 × 10^75).

The number is represented by the following formula: (−1)^sign × 0.significand × 16^(exponent−64).
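
A decoding sketch for the 32-bit (short) HFP format, assuming the word is already in hand as a Python integer:

    def decode_hfp32(word: int) -> float:
        # Sign bit, 7-bit exponent biased by 64, 24-bit fraction in [0, 1).
        sign = -1.0 if word >> 31 else 1.0
        exponent = (word >> 24) & 0x7F
        fraction = (word & 0xFFFFFF) / float(1 << 24)
        return sign * fraction * 16.0 ** (exponent - 64)

    print(decode_hfp32(0x41100000))  # 1.0 (fraction 1/16, exponent 65)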

IEEE 754

The IEEE Standard for Floating-Point Arithmetic (IEEE 754) is a technical standard for floating-point arithmetic established in 1985 by the Institute of Electrical and Electronics Engineers (IEEE). The standard addressed many problems found in the diverse floating-point implementations that made them difficult to use reliably and portably. Many hardware floating-point units use the IEEE 754 standard.

The standard defines:

arithmetic formats: sets of binary and decimal floating-point data, which consist of finite numbers (including signed zeros and subnormal numbers), infinities, and special "not a number" values (NaNs)

interchange formats: encodings (bit strings) that may be used to exchange floating-point data in an efficient and compact form

rounding rules: properties to be satisfied when rounding numbers during arithmetic and conversions

operations: arithmetic and other operations (such as trigonometric functions) on arithmetic formats

exception handling: indications of exceptional conditions (such as division by zero, overflow, etc.)

The current version, IEEE 754-2008, published in August 2008, includes nearly all of the original IEEE 754-1985 standard plus the IEEE 854-1987 Standard for Radix-Independent Floating-Point Arithmetic.

NaN

In computing, NaN, standing for not a number, is a numeric data type value representing an undefined or unrepresentable value, especially in floating-point arithmetic. Systematic use of NaNs was introduced by the IEEE 754 floating-point standard in 1985, along with the representation of other non-finite quantities such as infinities.

For example, 0/0 is undefined as a real number, and is therefore represented by NaN. The square root of a negative number is an imaginary number and cannot be represented as a real number, so it is represented by NaN. NaNs may also be used to represent missing values in computations.

Two separate kinds of NaNs are provided, termed quiet NaNs and signaling NaNs. Quiet NaNs are used to propagate errors resulting from invalid operations or values. Signaling NaNs can support advanced features such as mixing numerical and symbolic computation or other extensions to basic floating-point arithmetic.
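
The defining behaviours of quiet NaNs are easy to demonstrate in Python:

    import math

    nan = float("nan")                  # a quiet NaN
    print(nan == nan)                   # False: NaN compares unequal to itself
    print(math.isnan(nan))              # True: the reliable test
    print(nan + 1.0)                    # nan: quiet NaNs propagate
    print(float("inf") - float("inf"))  # nan: an invalid operation yields NaN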

Octuple-precision floating-point format

In computing, octuple precision is a binary floating-point-based computer number format that occupies 32 bytes (256 bits) in computer memory. This 256-bit octuple precision is for applications requiring results in higher than quadruple precision. This format is rarely (if ever) used and very few environments support it.

Quadruple-precision floating-point format

In computing, quadruple precision (or quad precision) is a binary floating point–based computer number format that occupies 16 bytes (128 bits) with precision more than twice the 53-bit double precision.

This 128-bit quadruple precision is designed not only for applications requiring results in higher than double precision, but also, as a primary function, to allow the computation of double-precision results more reliably and accurately by minimising overflow and round-off errors in intermediate calculations and scratch variables. William Kahan, primary architect of the original IEEE 754 floating-point standard, noted: "For now the 10-byte Extended format is a tolerable compromise between the value of extra-precise arithmetic and the price of implementing it to run fast; very soon two more bytes of precision will become tolerable, and ultimately a 16-byte format ... That kind of gradual evolution towards wider precision was already in view when IEEE Standard 754 for Floating-Point Arithmetic was framed."

In IEEE 754-2008 the 128-bit base-2 format is officially referred to as binary128.

Scientific notation

Scientific notation (also referred to as scientific form or standard index form, or standard form in the UK) is a way of expressing numbers that are too big or too small to be conveniently written in decimal form. It is commonly used by scientists, mathematicians and engineers, in part because it can simplify certain arithmetic operations. On scientific calculators it is usually known as "SCI" display mode.

In scientific notation, all numbers are written in the form

m × 10^n

(m times ten raised to the power of n), where the exponent n is an integer and the coefficient m is any real number. The integer n is called the order of magnitude and the real number m is called the significand or mantissa. However, the term "mantissa" may cause confusion because it is the name of the fractional part of the common logarithm. If the number is negative then a minus sign precedes m (as in ordinary decimal notation). In normalized notation, the exponent is chosen so that the absolute value of the coefficient is at least one but less than ten.
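
Python's "e" format specifier produces normalized scientific notation directly:

    print(f"{123.45:.4e}")      # 1.2345e+02
    print(f"{0.00012345:.4e}")  # 1.2345e-04
    print(f"{-123.45:.4e}")     # -1.2345e+02: the minus sign precedes m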

Decimal floating point is a computer arithmetic system closely related to scientific notation.

Single-precision floating-point format

Single-precision floating-point format is a computer number format, usually occupying 32 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point.

A floating-point variable can represent a wider range of numbers than a fixed-point variable of the same bit width at the cost of precision. A signed 32-bit integer variable has a maximum value of 2^31 − 1 = 2,147,483,647, whereas an IEEE 754 32-bit base-2 floating-point variable has a maximum value of (2 − 2^−23) × 2^127 ≈ 3.4028235 × 10^38. All integers with 6 or fewer significant decimal digits, and any number that can be written as 2^n such that n is a whole number from −126 to 127, can be converted into an IEEE 754 floating-point value without loss of precision.
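
Both the formula and the bit pattern for the largest finite binary32 value can be cross-checked in Python (using the standard struct module):

    import struct

    from_formula = (2 - 2**-23) * 2.0**127
    from_bits = struct.unpack(">f", struct.pack(">I", 0x7F7FFFFF))[0]

    assert from_formula == from_bits == 3.4028234663852886e+38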

In the IEEE 754-2008 standard, the 32-bit base-2 format is officially referred to as binary32; it was called single in IEEE 754-1985. IEEE 754 specifies additional floating-point types, such as 64-bit base-2 double precision and, more recently, base-10 representations.

One of the first programming languages to provide single- and double-precision floating-point data types was Fortran. Before the widespread adoption of IEEE 754-1985, the representation and properties of floating-point data types depended on the computer manufacturer and computer model, and upon decisions made by programming-language designers. E.g., GW-BASIC's single-precision data type was the 32-bit MBF floating-point format.

Single precision is termed REAL in Fortran, SINGLE-FLOAT in Common Lisp, float in C, C++, C#, Java, Float in Haskell, and Single in Object Pascal (Delphi), Visual Basic, and MATLAB. However, float in Python, Ruby, PHP, and OCaml and single in versions of Octave before 3.2 refer to double-precision numbers. In most implementations of PostScript, and some embedded systems, the only supported precision is single.
