Opentopia Directory Encyclopedia Tools

Floating point

Encyclopedia : F : FL : FLO : Floating point



 

For the IEEE binary floating-point standard, see its page.

Floating-point is a way of representing a subset of the real numbers in terms of an exponent, which is an integer, and a significand, which is a string of some chosen number of digits. Floating-point representation is very common because of computers, but it has its roots in scientific notation.

Basics

A floating-point representation requires, first of all, a choice of base or radix for the significand, and a choice of the number of digits in the significand. In this article, the base will be denoted by b, and the number of digits, or precision, by p. The significand is a number consisting of p digits in radix b, so each digit lies between 0 and b-1. A base of 2 (that is, binary representation) is nearly always used in computers, though some computers use b=16. A base of 10 (that is, decimal representation) is used in the familiar scientific notation.

As an example, the revolution period of Jupiter's moon Io could be represented in scientific notation as 1.528535047 × 105 seconds. The string of digits "1528535047" is the significand, and the exponent is 5.

Now this could be represented as any of

1528.535047 × 102
1528535047. × 10-4
.000001528535047 × 1011
The main benefit of scientific notation is that it makes representations like the last one unnecessary, by allowing the decimal point to be put in a convenient place. True floating-point notation uses a precise specification that the point is always just to the right of the leftmost digit of the significand, so the correct representation is
1.528535047 × 105
This, plus the requirement that the leftmost digit of the significand be nonzero, is called normalization. By doing this, one no longer needs to say where the point is; it is deduced from the exponent. In decimal floating-point notation with precision of 10, the revolution period of Io is simply
e=5 ; s=1528535047
Some people (and some computer representations) may prefer a different convention for the presumed location of the point, such as to the left of the leftmost digit. This doesn't make any actual difference; it simply offsets the exponent.
When a (nonzero) floating-point number is normalized, its leftmost digit is nonzero. The value of the significand obeys 1 ≤ s < b. (Zero can't be represented in normalized floating-point notation, and computers have to deal with it separately.)

The mathematical value of a floating-point number is

s.ssssssss...sss × be
In binary radix, the significand is a string of bits (1's and 0's) of length p, of which the leftmost bit is 1. The number π, represented in binary, is
11.0010010000111111011010101000100010000101101000110000100011010011.....
when rounded to a precision of 24, it is
11.0010010000111111011011
in binary floating-point, it is
e=1 ; s=110010010000111111011011
(There is no reason to write the exponent in any form other than a plain decimal number. When floating-point numbers are stored in a computer, a suitable binary encoding is used.)

Accuracy, and misconceptions thereof

For any fixed precision p, the floating-point numbers can represent only a subset of the real numbers (This is so even if the possible exponents are any integer. In actual computer representations, there is a finite range for the exponent, and the total set of floating-point numbers is actually finite.) This incompleteness is the same as what people are used to when they express measurements to a certain number of decimal places, as in "The atomic weight of Hydrogen is 1.008".

However, there is a common belief that

Floating-point numbers are imprecise, because they are only approximations.
In practice, it is the floating-point arithmetic operations, not the numbers themselves, that are imprecise. Floating point numbers, having approximately as many as 15 decimal significant figures, carry, on their own, far more precision than exists in the vast majority of real-world measurements or is relevant to nearly all applications. The point can also be made that floating point numbers carry the same infinite precision as plain integers (this, according to the previous writers of this page, being the main reason that people who claim FP numbers are imprecise are wrong). While this is technically true, it also happens to be of little relevance (since making actual use of their infinite precision is far less practical than it is for integers, for reasons which will be given shortly.)

Double as well as single precision floating point representations, in fact, carry extreme amounts of precision (even if we accept them as approximations). However, repeated operations of multiplication, division, addition, and, most importantly of all, subtraction, accumulate very large amounts of error. It is these floating point arithmetic operations that are the real source of imprecision, and it is they that make the very large representations necessary and even insufficient.

For example, one million repeated multiplications of numbers which have 7 significant decimal digits (ie IEEE single precision) will, on average, produce a final result accurate to just 4. In the worst case, a result may be accurate to just 1 significant decimal digit. Worst of all, it will be impossible to tell just how accurate the result turned out to be! Given that a modern processor can compute in the magnitude of 10 billion floating point operations per second, even double-precision numbers may not be safe in regard to multiplication, division, and addition.

However, the worst arithmetic operation to give imprecision is subtraction between two numbers of near-equal value. For example, 1.234567e0 - 1.234888e0 results, in case of decimal representations, in -3.210000e-4. Each one of those zeroes represents a digit whose value is unknown. In just a single operation, the precision of arithmetic has been halfed!

Other operations, such as powers, square roots, trigonometric function, etc. carry various amounts of error and risk.

Vital in avoiding the pitfalls of floating point numbers is the choice of algorithm. Different algorithms, computing the same problem, may have various degrees of numerical instability.

Lastly, there is a very important caveat in trying to employ floating-point numbers in "infinitely exact" arithmetic (although they are as capable of it as integers). It stems from the fact that floating point numbers are in fact binary fractions, while we commonly use decimals. Thus, 0.5 multiplied by 0.5 will exactly equal 0.25, with no caveats or gotchas. Comparisons for equality may be safely made, just as one is accustomed to do with integers. However, 0.1*0.1 will not equal 0.01, primarily because there isn't actually a way of representing 0.1 or 0.01 exactly as a binary fraction. Nor will the 0.1 approximation multiplied by itself equal the 0.01 approximation. Note, this is like adding 1/3 (0.333) to itself and getting 0.666 instead of 0.667 (2/3). Thus, in most cases, if one is trying to perform comparisons of arithmetic results, even those which seemingly are safely bound by the precision of the format, one must take the absolute value of the difference between them and compare it against a small number. (ie, in C, instead of if(a==b), write if(abs(a-b)<0.00001)).

Banker's rounding

When the bits after the last available result bit are 1000000000..... exactly (that is, it is known that there are an infinite number of zeros after the initial 1), rounding upward or downward is equally accurate. In this case, the technique of Banker's rounding is usually used. The rounding is done in the direction that makes the significand even.

Mantissa

The word "mantissa" is often used as a synonym for significand. Purists may not consider this usage to be correct, since the mantissa is traditionally defined as the fractional part of a logarithm, while the characteristic is the integer part. This terminology comes from the way logarithm tables were used before computers became commonplace. Log tables were actually tables of mantissas. Therefore, a mantissa is the logarithm of the significand. The term "mantissa" is nevertheless widely used to refer to the significand.

Computer Representation

(This section describes some general issues, but mostly follows the IEEE standard.)

To represent a floating-point number in a computer datum, the exponent has to be encoded into a bit field. Since the exponent could be negative, one could use two's complement representation. Instead, what is commonly done is something very similar: a fixed constant is added to the exponent, with the intention that the result be a positive number capable of being packed into a fixed bit field. For the common 32 bit "single precision" or "float" format of the IEEE standard, this constant is 127, so the exponent is said to be represented in "excess 127" format. The result of this addition is placed in an 8 bit field.

Since the leftmost significand bit of a (normalized) floating-point number is always 1, that bit is not actually placed in the computer datum. The computer's arithmetic hardware acts as though that "1" had been provided. This is the "implicit bit" or "hidden bit" of the IEEE standard. Because of this, a 24 bit significand is actually placed in a 23 bit field.

Finally, a sign bit is required. This is one to indicate that the entire floating-point number is negative, or zero to indicate that it is positive. (In the past, some computers have used a kind of two's complement encoding for the entire number, rather than simple "sign/magnitude" format, but it didn't work very well.)

The entire floating-point number is packed into a 32 bit word, with the sign bit leftmost, followed by the exponent in excess 127 format in the next 8 bits, followed by the significand (without the hidden bit) in the rightmost 23 bits.

For the approximation to π, we have

sign = 0 ; e=1 ; s=110010010000111111011011 (including the hidden bit)
e+127 = 128 = 10000000 in (8 bit) binary
final 32 bit result = 0 10000000 10010010000111111011011 = 40490FDB in hexadecimal
As noted above, this number is not really π, but is exactly 3.1415927410125732421875.

In the common 64 bit "double precision" or "double" format of the IEEE standard, the offset added to the exponent is 1023, and the result is placed into an 11 bit field. The precision is 53 bits. After removal of the hidden bit, 52 bits remain. The result comprises 1+11+52=64 bits. The approximation to π is

sign = 0 ; e=1 ; s=11001001000011111101101010100010001000010110100011000 (including the hidden bit)
e+1023 = 1024 = 10000000000 in (11 bit) binary
final 64 bit result = 0 10000000000 1001001000011111101101010100010001000010110100011000
::= 400921FB54442D18 in hexadecimal
This number is exactly 3.141592653589793115997963468544185161590576171875.

Overflow, underflow, and zero

The necessity to pack the offset exponent into a fixed-size bit field places limits on the exponent. For the standard 32 bit format, e+127 must fit into an 8 bit field, so -127 ≤ e ≤ 128. The values -127 and +128 are reserved for special meanings, so the actual range for normalized floating-point numbers is -126 ≤ e ≤ 127. This means that the smallest normalized number is

e=-126 ; s=100000000000000000000000
which is about 1.18 × 10-38, and is represented in hexadecimal as 00800000. The largest representable number is
e=+127 ; s=111111111111111111111111
which is about 3.4 × 1038, and is represented in hexadecimal as 7F7FFFFF. For double precision the range is about 2.2 × 10-308 to 1.8 × 10308.

Any floating-point computation that gives a result (after rounding to a representable value) higher than the upper limit is said to overflow. Under the IEEE standard, such result is set to a special value "infinity", which has the appropriate sign bit, the reserved exponent +128, and a bit pattern in the significand (typically zero) to indicate that this is infinity. Such numbers are generally printed as "+INF" or "-INF".

Floating-point hardware is generally designed to handle operands of infinity in a reasonable way, such as

(+INF) + (+7) = (+INF)
(+INF) * (-2) = (-INF)
A floating-point computation that (after rounding) gives a nonzero result lower than the lower limit is said to underflow. This could happen, for example, if 10-25 is multiplied by 10-25 in single precision. Under the IEEE standard, the reserved exponent -127 is used, and the significand is set as follows:

First, if the number is zero, it is represented by an exponent of -127 and a significand field of all zeros. This means that zero is represented in hexadecimal as 00000000.

Otherwise, if normalizing the number would lead to an exponent of -127 or less, it is only normalized until the exponent is -127. That is, instead of shifting the significand bits left until the leftmost bit is 1, they are shifted until the exponent reaches -127. For example, the smallest non-underflowing number is

e=-126 ; s=1.00000000000000000000000 (about 1.18 × 10-38)
A number 1/16th as large would be
e=-130 ; s=1.00000000000000000000000 (about 7.3 × 10-40)
If it is partially normalized, one gets
e=-127 ; s=0.00100000000000000000000
This does not have a leading bit of 1, so using the "hidden bit" mechanism won't work. What is done is to store the significand without removing the leftmost bit, since there is no guarantee that it is 1. This means that the precision is only 23 bits, not 24. The exponent of -127 is stored in the usual excess 127 format, that is, all zeros. The final representation is
0 00000000 00010000000000000000000 = 00080000 in hexadecimal
Whenever the exponent is -127 (that is, all zeros in the datum), the bits are interpreted in this special format. Such a number is said to be "denormalized" (a "denorm" for short), or, in more modern terminology, "subnormal".

The smallest possible (subnormal) nonzero number is

0 00000000 00000000000000000000001 = 00000001 in hexadecimal
e=-127 ; s=0.0000000000000000000001
Which is 2-149, or about 1.4 × 10-45

The handling of the number zero can be seen to be a completely ordinary case of a subnormal number.

The creation of denorms is often called "gradual underflow". As numbers get extremely small, significand bits are slowly sacrificed. The alternative is "sudden underflow", in which any number that can't be normalized is simply set to zero by the computer hardware. Gradual underflow is difficult for computer hardware to handle, so hardware often uses software to assist it, through interrupts. This creates a serious performance penalty, and sometimes sudden underflow is used instead. This is a source of controversy.

Behavior of computer arithmetic

As noted above, the standard behavior of computer hardware is to round the ideal (infinitely precise) result of an arithmetic operation to the nearest representable value, and give that representation as the result. In practice, there are other options. IEEE-754-compliant hardware allows one to set the rounding mode to any of In the default rounding mode the IEEE 754 standard mandates the round-to-nearest behavior described above for all fundamental algebraic operations, including square root. ("Library" functions such as cosine and log are not mandated.) This means that IEEE-compliant hardware's behavior is completely determined in all 32 or 64 bits.

The mandated behavior for dealing with overflow and underflow is that the appropriate result is computed, taking the rounding mode into consideration, as though the exponent range were infinitely large. If that resulting exponent can't be packed into its field correctly, the overflow/underflow action described above is taken.

The arithmetical distance between two consecutive representable floating point numbers is called an "ULP", for Unit in the Last Place. For example, the numbers represented by 45670123 and 45670124 hexadecimal is one ULP. An ULP is about 10-7 in single precision, and 10-16 in double precision. The mandated behavior of IEEE-compliant hardware is that the result be within 1/2 of an ULP.

Exceptional values, and exceptions, under the IEEE standard

In addition to the "infinity" value that is produced when an overflow occurs, there is a special value "NaN" (standing for "not a number") that is produced by such things as taking the square root of a negative number, or the inverse sine of 3. (The inverse sine function is typically handled by a library routine, which explicitly produces this value when the argument is out of range.) NaN is encoded with the reserved exponent of 128 (or 1024), and a significand field that distinguishes it from infinity.

The intention of the INF and NaN values is that, under the most common circumstances, they can just propagate from one operation to the next (any operation with NaN as an operand produces NaN as a result), and they only need to be attended to at a point that the programmer chooses.

In addition to the creation of exceptional values, there are "events" that may occur, though some of them are quite benign:

Computer hardware is typically able to raise exceptions ("traps") when these events occur. How this is done is very system-dependent. Usually all exceptions are masked (disabled). Sometimes overflow, zerodivide, and operand error are enabled.

Peculiar properties of floating point arithmetic

The fact that the set of representable floating-point numbers is not the entire set of reals, and that floating-point arithmetic operations have to round each result into a representable item, means that the arithmetic performed by computers does not obey all of the expected properties.

While floating point addition and multiplication are commutative, they are not associative. This means that in general for floating point numbers x, y, and z:

Floating point arithmetic is also not distributive. This means that in general: For example, with most floating-point implementations
1.0 + (1e100 - 1e100) gives the result 1.0
(1.0 + 1e100) - 1e100 gives the result 0.0.
The reason is that, in the first case, the initial subtraction gives zero, since that is the true result and is representable, and 1 can be added to it successfully. In the second case, the floating point approximation to 1e100, when 1 is added, results in the same approximation, so the subtraction yields zero.

(The situation is actually more subtle than that. By "1e100", we mean the floating point number closest to it. The starting point for any floating point operation has to be a representable number.)
In short, the order in which operations are carried out can change the output of a floating point calculation. This is important in numerical analysis since two mathematically equivalent formulas may not produce the same numerical output, and one may be substantially more accurate than the other.

Problems with floating-point

Because of the inability to represent all of the reals, and the consequent rounding of every arithmetic result, naive use of floating point arithmetic can lead to many problems. A good understanding of numerical analysis is essential to the creation of robust floating point software. The subject is actually quite complicated, and the reader is referred to the references at the bottom of this article.

In addition to careful design of programs, careful handling by the compiler is essential. Certain "optimizations" that compilers might make (for example, reordering operations) will lead to errors. There is much controversy about the failings of compilers and language designs in this area.

Floating point arithmetic is at its best when it is simply being used to measure real-world quantities over a wide range of scales (such as the orbital period of Io or the mass of the proton), and at its worst when it is expected to model quantities expressed as decimal strings that are expected to be exact. An example of the latter case is financial calculations. For this reason, financial software tends not to use a binary floating-point number representation. See: http://www2.hursley.ibm.com/decimal/ The "decimal" data type of the C# and Java programming languages, and the IEEE 854 standard, are designed to avoid the problems of binary floating point, and make the arithmetic always behave as expected when numbers are printed in decimal.

The general types of things that can cause trouble are:

Unlike the fixed-point counterpart, the application of dither in a floating point environment is nearly impossible. See external references for more information about the difficulty of applying dither and the rounding error problems in floating point systems.

A few nice properties

One can sometimes take advantage of a few nice properties:

IEEE standard

The IEEE has standardized the computer representation for binary floating-point numbers in IEEE 754. This standard is followed by almost all modern machines. Notable exceptions include IBM Mainframes, which support IBM's own format (in addition to IEEE 754 data types), and Cray vector machines, where the T90 series had an IEEE version, but the SV1 still uses Cray floating-point format.

The standard allows for many different precision levels, of which the 32 bit ("single") and 64 bit ("double") are by far the most common, since they are supported in common programming languages. Computer hardware (for example, the Intel Pentium series and the Motorola 68000 series) often provides an 80 bit format, with 15 exponent bits and 64 significand bits, with no hidden bit. Software vendors may also provide additional extended formats, such as the H-P "quad" format (1 sign bit, 15 exponent bits, and 113 significand bits, 1 of which is hidden.) There is controversy about the failure of programming languages to make these hardware formats available to programmers.

As of 2000, the IEEE 754 standard is currently under revision. See: IEEE 754r

Example

Archimedes approximated π by calculating the perimeters of polygons inscribed within and circumscribing a circle, starting with hexagons, and successively doubling the number of sides. This generates two recurrence formulae for the perimeters, in terms of the half-angle formulae for sines and tangents respectively. Both have two forms, mathematically equivalent, but not so computationally. Here are results via turbo Pascal using 80-bit floating point in an IBM-PC clone:
Inscribed (sine)              Circumscribed (tangent)
a)  s:=sqrt     t:=[sqrt(1 + t*t) - 1]/t
b)  s:= s/sqrt  t:=t/[1 + sqrt(1 + t*t)]
Start:     s = 1/2                           t = 1/sqrt(3)
First, form a) for both:

Stage    Sides       Inscribed           (2I + C)/3       Circumscribed
1        6 3.00000000000000000 3.15470053837925153 3.46410161513775459
2       12 3.10582854123024915 3.14234913054465692 3.21539030917347248
3       24 3.13262861328123820 3.14163905621999229 3.15965994209750048
4       48 3.13935020304686720 3.14159554040838980 3.14608621513143498
5       96 3.14103195089050961 3.14159283380879585 3.14271459964536834
6      192 3.14145247228546194 3.14159266485024924 3.14187304997982385
7      384 3.14155760791185764 3.14159265429352108 3.14166274705684795
8      768 3.14158389214831805 3.14159265363377626 3.14161017660469268
9     1536 3.14159046322805003 3.14159265359253476 3.14159703432150421
10     3072 3.14159210599928383 3.14159265358995624 3.14159374877130105
11     6144 3.14159251669204966 3.14159265358971935 3.14159292738505874
12    12288 3.14159261936512077 3.14159265358966492 3.14159272203875323
13    24576 3.14159264503460936 3.14159265358907182 3.14159267069799674
14    49152 3.14159265144937596 3.14159265359313918 3.14159265788066561
15    98304 3.14159265303352628 3.14159265355780571 3.14159265460636456
16   196608 3.14159265332534345 3.14159265356534707 3.14159265404535431
17   393216 3.14159265332534345 3.14159265308533982 3.14159265260533258
18   786432 3.14159265332534345 3.14159265178665197 3.14159264870926902
19  1572864 3.14159265599338609 3.14159265130664472 3.14159264193316198
20  3145728 3.14159264532121550 3.14159264645056668 3.14159264870926902
21  6291456 3.14159264532121550 3.14159264419186433 3.14159264193316198
22 12582912 3.14159298683065634 3.14159264645055406 3.14159196569034949
23 25165824 3.14159503588652176 3.14159423790206850 3.14159264193316198
24 50331648 3.14159776795893004 3.14159219110183233 3.14158103738763692
Second, form b) for both:
Stage    Sides       Inscribed           (2I + C)/3       Circumscribed
1        6 3.00000000000000000 3.15470053837925153 3.46410161513775459
2       12 3.10582854123024915 3.14234913054465692 3.21539030917347248
3       24 3.13262861328123820 3.14163905621999229 3.15965994209750048
4       48 3.13935020304686721 3.14159554040838980 3.14608621513143497
5       96 3.14103195089050964 3.14159283380879586 3.14271459964536830
6      192 3.14145247228546208 3.14159266485024934 3.14187304997982387
7      384 3.14155760791185765 3.14159265429352127 3.14166274705684853
8      768 3.14158389214831841 3.14159265363377545 3.14161017660468954
9     1536 3.14159046322805010 3.14159265359254211 3.14159703432152615
10     3072 3.14159210599927155 3.14159265358996504 3.14159374877135203
11     6144 3.14159251669215745 3.14159265358980398 3.14159292738509703
12    12288 3.14159261936538396 3.14159265358979391 3.14159272203861382
13    24576 3.14159264503369090 3.14159265358979328 3.14159267070199805
14    49152 3.14159265145076765 3.14159265358979324 3.14159265786784442
15    98304 3.14159265305503684 3.14159265358979324 3.14159265465930603
16   196608 3.14159265345610414 3.14159265358979324 3.14159265385717144
17   393216 3.14159265355637096 3.14159265358979324 3.14159265365663779
18   786432 3.14159265358143767 3.14159265358979324 3.14159265360650438
19  1572864 3.14159265358770435 3.14159265358979324 3.14159265359397102
20  3145728 3.14159265358927102 3.14159265358979324 3.14159265359083769
21  6291456 3.14159265358966268 3.14159265358979324 3.14159265359005435
22 12582912 3.14159265358976060 3.14159265358979324 3.14159265358985852
23 25165824 3.14159265358978508 3.14159265358979324 3.14159265358980956
24 50331648 3.14159265358979120 3.14159265358979324 3.14159265358979732
4*atan(1) = 3.14159265358979324
Good digits: 3.14159265358979323846264338327950288

See also

References

 


From Wikipedia, the Free Encyclopedia. Original article here. Support Wikipedia by contributing or donating.
All text is available under the terms of the GNU Free Documentation License See Wikipedia Copyrights for details.


Search Titles
0123456789
ABCDEFGHIJ
KLMNOPQRST
UVWXYZ?

E-mail this article to:

Personal Message: