Floating point
Encyclopedia : F : FL : FLO : Floating point
For the IEEE binary floating-point standard, see its page.
Floating-point is a way of representing a subset of the real numbers in terms of an exponent, which is an integer, and a significand, which is a string of some chosen number of digits. Floating-point representation is very common because of computers, but it has its roots in scientific notation.
- 1 Basics
- 2 Accuracy, and misconceptions thereof
- 3 Computer Representation
- 4 Overflow, underflow, and zero
- 5 Behavior of computer arithmetic
- 6 Exceptional values, and exceptions, under the IEEE standard
- 7 Peculiar properties of floating point arithmetic
- 8 Problems with floating-point
- 9 A few nice properties
- 10 IEEE standard
- 11 Example
- 12 See also
- 13 References
Basics
A floating-point representation requires, first of all, a choice of base or radix for the significand, and a choice of the number of digits in the significand. In this article, the base will be denoted by b, and the number of digits, or precision, by p. The significand is a number consisting of p digits in radix b, so each digit lies between 0 and b-1. A base of 2 (that is, binary representation) is nearly always used in computers, though some computers use b=16. A base of 10 (that is, decimal representation) is used in the familiar scientific notation.As an example, the revolution period of Jupiter's moon Io could be represented in scientific notation as 1.528535047 × 105 seconds. The string of digits "1528535047" is the significand, and the exponent is 5.
Now this could be represented as any of
- 1528.535047 × 102
- 1528535047. × 10-4
- .000001528535047 × 1011
- 1.528535047 × 105
- e=5 ; s=1528535047
- Some people (and some computer representations) may prefer a different convention for the presumed location of the point, such as to the left of the leftmost digit. This doesn't make any actual difference; it simply offsets the exponent.
The mathematical value of a floating-point number is
- s.ssssssss...sss × be
- 11.0010010000111111011010101000100010000101101000110000100011010011.....
- 11.0010010000111111011011
- e=1 ; s=110010010000111111011011
Accuracy, and misconceptions thereof
For any fixed precision p, the floating-point numbers can represent only a subset of the real numbers (This is so even if the possible exponents are any integer. In actual computer representations, there is a finite range for the exponent, and the total set of floating-point numbers is actually finite.) This incompleteness is the same as what people are used to when they express measurements to a certain number of decimal places, as in "The atomic weight of Hydrogen is 1.008".However, there is a common belief that
- Floating-point numbers are imprecise, because they are only approximations.
Double as well as single precision floating point representations, in fact, carry extreme amounts of precision (even if we accept them as approximations). However, repeated operations of multiplication, division, addition, and, most importantly of all, subtraction, accumulate very large amounts of error. It is these floating point arithmetic operations that are the real source of imprecision, and it is they that make the very large representations necessary and even insufficient.
For example, one million repeated multiplications of numbers which have 7 significant decimal digits (ie IEEE single precision) will, on average, produce a final result accurate to just 4. In the worst case, a result may be accurate to just 1 significant decimal digit. Worst of all, it will be impossible to tell just how accurate the result turned out to be! Given that a modern processor can compute in the magnitude of 10 billion floating point operations per second, even double-precision numbers may not be safe in regard to multiplication, division, and addition.
However, the worst arithmetic operation to give imprecision is subtraction between two numbers of near-equal value. For example, 1.234567e0 - 1.234888e0 results, in case of decimal representations, in -3.210000e-4. Each one of those zeroes represents a digit whose value is unknown. In just a single operation, the precision of arithmetic has been halfed!
Other operations, such as powers, square roots, trigonometric function, etc. carry various amounts of error and risk.
Vital in avoiding the pitfalls of floating point numbers is the choice of algorithm. Different algorithms, computing the same problem, may have various degrees of numerical instability.
Lastly, there is a very important caveat in trying to employ floating-point numbers in "infinitely exact" arithmetic (although they are as capable of it as integers). It stems from the fact that floating point numbers are in fact binary fractions, while we commonly use decimals. Thus, 0.5 multiplied by 0.5 will exactly equal 0.25, with no caveats or gotchas. Comparisons for equality may be safely made, just as one is accustomed to do with integers. However, 0.1*0.1 will not equal 0.01, primarily because there isn't actually a way of representing 0.1 or 0.01 exactly as a binary fraction. Nor will the 0.1 approximation multiplied by itself equal the 0.01 approximation. Note, this is like adding 1/3 (0.333) to itself and getting 0.666 instead of 0.667 (2/3). Thus, in most cases, if one is trying to perform comparisons of arithmetic results, even those which seemingly are safely bound by the precision of the format, one must take the absolute value of the difference between them and compare it against a small number. (ie, in C, instead of if(a==b), write if(abs(a-b)<0.00001)).
Banker's rounding
When the bits after the last available result bit are 1000000000..... exactly (that is, it is known that there are an infinite number of zeros after the initial 1), rounding upward or downward is equally accurate. In this case, the technique of Banker's rounding is usually used. The rounding is done in the direction that makes the significand even.Mantissa
The word "mantissa" is often used as a synonym for significand. Purists may not consider this usage to be correct, since the mantissa is traditionally defined as the fractional part of a logarithm, while the characteristic is the integer part. This terminology comes from the way logarithm tables were used before computers became commonplace. Log tables were actually tables of mantissas. Therefore, a mantissa is the logarithm of the significand. The term "mantissa" is nevertheless widely used to refer to the significand.Computer Representation
(This section describes some general issues, but mostly follows the IEEE standard.)
To represent a floating-point number in a computer datum, the exponent has to be encoded into a bit field. Since the exponent could be negative, one could use two's complement representation. Instead, what is commonly done is something very similar: a fixed constant is added to the exponent, with the intention that the result be a positive number capable of being packed into a fixed bit field. For the common 32 bit "single precision" or "float" format of the IEEE standard, this constant is 127, so the exponent is said to be represented in "excess 127" format. The result of this addition is placed in an 8 bit field.
Since the leftmost significand bit of a (normalized) floating-point number is always 1, that bit is not actually placed in the computer datum. The computer's arithmetic hardware acts as though that "1" had been provided. This is the "implicit bit" or "hidden bit" of the IEEE standard. Because of this, a 24 bit significand is actually placed in a 23 bit field.
Finally, a sign bit is required. This is one to indicate that the entire floating-point number is negative, or zero to indicate that it is positive. (In the past, some computers have used a kind of two's complement encoding for the entire number, rather than simple "sign/magnitude" format, but it didn't work very well.)
The entire floating-point number is packed into a 32 bit word, with the sign bit leftmost, followed by the exponent in excess 127 format in the next 8 bits, followed by the significand (without the hidden bit) in the rightmost 23 bits.
For the approximation to π, we have
- sign = 0 ; e=1 ; s=110010010000111111011011 (including the hidden bit)
- e+127 = 128 = 10000000 in (8 bit) binary
- final 32 bit result = 0 10000000 10010010000111111011011 = 40490FDB in hexadecimal
In the common 64 bit "double precision" or "double" format of the IEEE standard, the offset added to the exponent is 1023, and the result is placed into an 11 bit field. The precision is 53 bits. After removal of the hidden bit, 52 bits remain. The result comprises 1+11+52=64 bits. The approximation to π is
- sign = 0 ; e=1 ; s=11001001000011111101101010100010001000010110100011000 (including the hidden bit)
- e+1023 = 1024 = 10000000000 in (11 bit) binary
- final 64 bit result = 0 10000000000 1001001000011111101101010100010001000010110100011000
- ::= 400921FB54442D18 in hexadecimal
Overflow, underflow, and zero
The necessity to pack the offset exponent into a fixed-size bit field places limits on the exponent. For the standard 32 bit format, e+127 must fit into an 8 bit field, so -127 ≤ e ≤ 128. The values -127 and +128 are reserved for special meanings, so the actual range for normalized floating-point numbers is -126 ≤ e ≤ 127. This means that the smallest normalized number is
- e=-126 ; s=100000000000000000000000
- e=+127 ; s=111111111111111111111111
Any floating-point computation that gives a result (after rounding to a representable value) higher than the upper limit is said to overflow. Under the IEEE standard, such result is set to a special value "infinity", which has the appropriate sign bit, the reserved exponent +128, and a bit pattern in the significand (typically zero) to indicate that this is infinity. Such numbers are generally printed as "+INF" or "-INF".
Floating-point hardware is generally designed to handle operands of infinity in a reasonable way, such as
- (+INF) + (+7) = (+INF)
- (+INF) * (-2) = (-INF)
First, if the number is zero, it is represented by an exponent of -127 and a significand field of all zeros. This means that zero is represented in hexadecimal as 00000000.
Otherwise, if normalizing the number would lead to an exponent of -127 or less, it is only normalized until the exponent is -127. That is, instead of shifting the significand bits left until the leftmost bit is 1, they are shifted until the exponent reaches -127. For example, the smallest non-underflowing number is
- e=-126 ; s=1.00000000000000000000000 (about 1.18 × 10-38)
- e=-130 ; s=1.00000000000000000000000 (about 7.3 × 10-40)
- e=-127 ; s=0.00100000000000000000000
- 0 00000000 00010000000000000000000 = 00080000 in hexadecimal
The smallest possible (subnormal) nonzero number is
- 0 00000000 00000000000000000000001 = 00000001 in hexadecimal
- e=-127 ; s=0.0000000000000000000001
The handling of the number zero can be seen to be a completely ordinary case of a subnormal number.
The creation of denorms is often called "gradual underflow". As numbers get extremely small, significand bits are slowly sacrificed. The alternative is "sudden underflow", in which any number that can't be normalized is simply set to zero by the computer hardware. Gradual underflow is difficult for computer hardware to handle, so hardware often uses software to assist it, through interrupts. This creates a serious performance penalty, and sometimes sudden underflow is used instead. This is a source of controversy.
Behavior of computer arithmetic
As noted above, the standard behavior of computer hardware is to round the ideal (infinitely precise) result of an arithmetic operation to the nearest representable value, and give that representation as the result. In practice, there are other options. IEEE-754-compliant hardware allows one to set the rounding mode to any of- round to nearest (the default; by far the most common mode.)
- round up (toward +infinity. This means that negative results will round toward zero.)
- round down (toward -infinity. Negative results round away from zero.)
- round toward zero (always. This is sometimes called "chop" mode. It is similar to the common behavior of float-to-integer conversions, which convert -3.5 to -3.)
The mandated behavior for dealing with overflow and underflow is that the appropriate result is computed, taking the rounding mode into consideration, as though the exponent range were infinitely large. If that resulting exponent can't be packed into its field correctly, the overflow/underflow action described above is taken.
The arithmetical distance between two consecutive representable floating point numbers is called an "ULP", for Unit in the Last Place. For example, the numbers represented by 45670123 and 45670124 hexadecimal is one ULP. An ULP is about 10-7 in single precision, and 10-16 in double precision. The mandated behavior of IEEE-compliant hardware is that the result be within 1/2 of an ULP.
Exceptional values, and exceptions, under the IEEE standard
In addition to the "infinity" value that is produced when an overflow occurs, there is a special value "NaN" (standing for "not a number") that is produced by such things as taking the square root of a negative number, or the inverse sine of 3. (The inverse sine function is typically handled by a library routine, which explicitly produces this value when the argument is out of range.) NaN is encoded with the reserved exponent of 128 (or 1024), and a significand field that distinguishes it from infinity.The intention of the INF and NaN values is that, under the most common circumstances, they can just propagate from one operation to the next (any operation with NaN as an operand produces NaN as a result), and they only need to be attended to at a point that the programmer chooses.
In addition to the creation of exceptional values, there are "events" that may occur, though some of them are quite benign:
- An overflow occurs as described previously, producing an infinity.
- An underflow occurs as described previously, producing a denorm.
- A zerodivide occurs whenever a divisor is zero, producing an infinity of the appropriate sign. (The sign of zero is meaningful here.) Note that a very small but nonzero divisor can still cause an overflow and produce an infinity.
- An "operand error" occurs whenever a NaN has to be created. This occurs whenever any operand to an operation is a NaN, or some other obvious thing happens, such a sqrt(-2.0) or log(-1.0).
- An "inexact" event occurs whenever the rounding of a result changed that result from the true mathematical value. This occurs almost all the time, and is usually ignored. It is looked at only in the most exacting applications.
Peculiar properties of floating point arithmetic
The fact that the set of representable floating-point numbers is not the entire set of reals, and that floating-point arithmetic operations have to round each result into a representable item, means that the arithmetic performed by computers does not obey all of the expected properties.While floating point addition and multiplication are commutative, they are not associative. This means that in general for floating point numbers x, y, and z:
- [ (x + y) + z \neq x + (y + z) ]
- [ (x \cdot y) \cdot z \neq x \cdot (y \cdot z) ]
- [ x \cdot (y + z) \neq (x \cdot y) + (x \cdot z) ]
- 1.0 + (1e100 - 1e100) gives the result 1.0
- (1.0 + 1e100) - 1e100 gives the result 0.0.
- (The situation is actually more subtle than that. By "1e100", we mean the floating point number closest to it. The starting point for any floating point operation has to be a representable number.)
Problems with floating-point
Because of the inability to represent all of the reals, and the consequent rounding of every arithmetic result, naive use of floating point arithmetic can lead to many problems. A good understanding of numerical analysis is essential to the creation of robust floating point software. The subject is actually quite complicated, and the reader is referred to the references at the bottom of this article.In addition to careful design of programs, careful handling by the compiler is essential. Certain "optimizations" that compilers might make (for example, reordering operations) will lead to errors. There is much controversy about the failings of compilers and language designs in this area.
Floating point arithmetic is at its best when it is simply being used to measure real-world quantities over a wide range of scales (such as the orbital period of Io or the mass of the proton), and at its worst when it is expected to model quantities expressed as decimal strings that are expected to be exact. An example of the latter case is financial calculations. For this reason, financial software tends not to use a binary floating-point number representation. See: http://www2.hursley.ibm.com/decimal/ The "decimal" data type of the C# and Java programming languages, and the IEEE 854 standard, are designed to avoid the problems of binary floating point, and make the arithmetic always behave as expected when numbers are printed in decimal.
The general types of things that can cause trouble are:
- Non-representable numbers: common numbers, such as 0.1, are compromised right from the beginning.
- Conversions to integer are unforgiving: converting (63.0/9.0) to integer yields 7, but converting (0.63/0.09) may yield 6.
- Absorption: 1015 + 1 may be the same as 1015.
- Cancellation: subtraction of nearly equal operands may cause serious loss of accuracy.
- Limited exponent range: results might overflow, yielding infinity.
- Testing for safe division is problematical: Checking that the divisor is not zero does not guarantee that a division will not overflow and yield infinity.
- Comparison for exact equality of two numbers is problematical. Programmers often perform comparisons within some tolerance, but that doesn't necessarily make the problem go away.
A few nice properties
One can sometimes take advantage of a few nice properties:- Any integer strictly less than 224-1 can be exactly represented in the single precision format, and any integer strictly less than 253-1 can be exactly represented in the double precision format. Furthermore, any reasonable power of 2 times such a number can be represented.
- The bit representations are monotonic, as long as exceptional values are avoided and the signs are handled properly. Floating point numbers are equal if and only if their integer bit representations are equal. Comparisons for larger or smaller can be done with integer comparisons on the bit patterns, as long as the signs match. However, the actual floating point comparisons provided by hardware typically have much more sophistication in dealing with exceptional values.
- To a rough approximation, the bit representation of a floating point number is proportional to its base 2 logarithm, with an average error of about 3%. (This is because the exponent field is in the more significant part of the datum.) This can be exploited in some applications, such as volume ramping in digital sound processing.
IEEE standard
The IEEE has standardized the computer representation for binary floating-point numbers in IEEE 754. This standard is followed by almost all modern machines. Notable exceptions include IBM Mainframes, which support IBM's own format (in addition to IEEE 754 data types), and Cray vector machines, where the T90 series had an IEEE version, but the SV1 still uses Cray floating-point format.
The standard allows for many different precision levels, of which the 32 bit ("single") and 64 bit ("double") are by far the most common, since they are supported in common programming languages. Computer hardware (for example, the Intel Pentium series and the Motorola 68000 series) often provides an 80 bit format, with 15 exponent bits and 64 significand bits, with no hidden bit. Software vendors may also provide additional extended formats, such as the H-P "quad" format (1 sign bit, 15 exponent bits, and 113 significand bits, 1 of which is hidden.) There is controversy about the failure of programming languages to make these hardware formats available to programmers.
As of 2000, the IEEE 754 standard is currently under revision. See: IEEE 754r
Example
Archimedes approximated π by calculating the perimeters of polygons inscribed within and circumscribing a circle, starting with hexagons, and successively doubling the number of sides. This generates two recurrence formulae for the perimeters, in terms of the half-angle formulae for sines and tangents respectively. Both have two forms, mathematically equivalent, but not so computationally. Here are results via turbo Pascal using 80-bit floating point in an IBM-PC clone:Inscribed (sine) Circumscribed (tangent) a) s:=sqrt t:=[sqrt(1 + t*t) - 1]/t b) s:= s/sqrt t:=t/[1 + sqrt(1 + t*t)] Start: s = 1/2 t = 1/sqrt(3)First, form a) for both:
Stage Sides Inscribed (2I + C)/3 Circumscribed 1 6 3.00000000000000000 3.15470053837925153 3.46410161513775459 2 12 3.10582854123024915 3.14234913054465692 3.21539030917347248 3 24 3.13262861328123820 3.14163905621999229 3.15965994209750048 4 48 3.13935020304686720 3.14159554040838980 3.14608621513143498 5 96 3.14103195089050961 3.14159283380879585 3.14271459964536834 6 192 3.14145247228546194 3.14159266485024924 3.14187304997982385 7 384 3.14155760791185764 3.14159265429352108 3.14166274705684795 8 768 3.14158389214831805 3.14159265363377626 3.14161017660469268 9 1536 3.14159046322805003 3.14159265359253476 3.14159703432150421 10 3072 3.14159210599928383 3.14159265358995624 3.14159374877130105 11 6144 3.14159251669204966 3.14159265358971935 3.14159292738505874 12 12288 3.14159261936512077 3.14159265358966492 3.14159272203875323 13 24576 3.14159264503460936 3.14159265358907182 3.14159267069799674 14 49152 3.14159265144937596 3.14159265359313918 3.14159265788066561 15 98304 3.14159265303352628 3.14159265355780571 3.14159265460636456 16 196608 3.14159265332534345 3.14159265356534707 3.14159265404535431 17 393216 3.14159265332534345 3.14159265308533982 3.14159265260533258 18 786432 3.14159265332534345 3.14159265178665197 3.14159264870926902 19 1572864 3.14159265599338609 3.14159265130664472 3.14159264193316198 20 3145728 3.14159264532121550 3.14159264645056668 3.14159264870926902 21 6291456 3.14159264532121550 3.14159264419186433 3.14159264193316198 22 12582912 3.14159298683065634 3.14159264645055406 3.14159196569034949 23 25165824 3.14159503588652176 3.14159423790206850 3.14159264193316198 24 50331648 3.14159776795893004 3.14159219110183233 3.14158103738763692Second, form b) for both:
Stage Sides Inscribed (2I + C)/3 Circumscribed 1 6 3.00000000000000000 3.15470053837925153 3.46410161513775459 2 12 3.10582854123024915 3.14234913054465692 3.21539030917347248 3 24 3.13262861328123820 3.14163905621999229 3.15965994209750048 4 48 3.13935020304686721 3.14159554040838980 3.14608621513143497 5 96 3.14103195089050964 3.14159283380879586 3.14271459964536830 6 192 3.14145247228546208 3.14159266485024934 3.14187304997982387 7 384 3.14155760791185765 3.14159265429352127 3.14166274705684853 8 768 3.14158389214831841 3.14159265363377545 3.14161017660468954 9 1536 3.14159046322805010 3.14159265359254211 3.14159703432152615 10 3072 3.14159210599927155 3.14159265358996504 3.14159374877135203 11 6144 3.14159251669215745 3.14159265358980398 3.14159292738509703 12 12288 3.14159261936538396 3.14159265358979391 3.14159272203861382 13 24576 3.14159264503369090 3.14159265358979328 3.14159267070199805 14 49152 3.14159265145076765 3.14159265358979324 3.14159265786784442 15 98304 3.14159265305503684 3.14159265358979324 3.14159265465930603 16 196608 3.14159265345610414 3.14159265358979324 3.14159265385717144 17 393216 3.14159265355637096 3.14159265358979324 3.14159265365663779 18 786432 3.14159265358143767 3.14159265358979324 3.14159265360650438 19 1572864 3.14159265358770435 3.14159265358979324 3.14159265359397102 20 3145728 3.14159265358927102 3.14159265358979324 3.14159265359083769 21 6291456 3.14159265358966268 3.14159265358979324 3.14159265359005435 22 12582912 3.14159265358976060 3.14159265358979324 3.14159265358985852 23 25165824 3.14159265358978508 3.14159265358979324 3.14159265358980956 24 50331648 3.14159265358979120 3.14159265358979324 3.14159265358979732
4*atan(1) = 3.14159265358979324 Good digits: 3.14159265358979323846264338327950288
See also
- Significant digits
- Fixed-point arithmetic
- Computable number
- IEEE Floating Point Standard
- IBM Floating Point Architecture
- FLOPS
- −0 (number)
- half precision – single precision – double precision – quad precision– minifloat
- Scientific notation
References
- An edited reprint of the paper [What Every Computer Scientist Should Know About Floating-Point Arithmetic], by David Goldberg, published in the March, 1991 issue of Computing Surveys.
- David Bindel’s [Annotated Bibliography] on computer support for scientific computation.
- Donald Knuth. The Art of Computer Programming, Volume 2: Seminumerical Algorithms, Third Edition. Addison-Wesley, 1997. ISBN 0-201-89684-2. Section 4.2: Floating Point Arithmetic, pp.214–264.
- Kahan, William and Darcy, Joseph (2001). How Java’s floating-point hurts everyone everywhere. Retrieved Sep. 5, 2003 from [http://www.cs.berkeley.edu/~wkahan/JAVAhurt.pdf].
- [Introduction to Floating point calculations and IEEE 754 standard] by Jamil Khatib
- [Survey of Floating-Point Formats] This page gives a very brief summary of floating-point formats that have been used over the years.
From Wikipedia, the Free Encyclopedia. Original article here. Support Wikipedia by contributing or donating.
All text is available under the terms of the GNU Free Documentation License See Wikipedia Copyrights for details.
