Floatingpoint Numbers in IDL
TOPIC:
Exelis VIS Technical Support frequently gets questions about "floatingpoint errors" in IDL.
This help article discusses frequently asked questions regarding floatingpoint numbers with respect to their use in IDL.
More information about floating point numbers in IDL may be found in the IDL Online Help under the Contents section:
IDL Programming > Concepts > IDL Data Types
under the topic:
Precision of FloatingPoint Numbers
and also by following the link under this topic titled:
Accuracy and Floating Point Operations
Note also that most of the concepts explained in this help article may also be found in introductory Numerical Analysis textbooks.
DISCUSSION:
First, two important facts regarding floatingpoint numbers:
1) Floatingpoint processors in modern computers use BINARY arithmetic, not DECIMAL arithmetic. Consequently, floatingpoint numbers are stored in BINARY formats, not DECIMAL formats. Decimal numbers must undergo a conversion from base 10 to base 2, when entered into a computer program. Likewise, they must undergo another conversion, from base 2 to base 10, before they can be displayed.
2) Floatingpoint numbers are represented using standard floatingpoint formats, with a limited number of significant BINARY places, which translate to a limited number of significant DECIMAL places.
For these reasons, inaccuracies in floatingpoint number representations are guaranteed.
Floatingpoint inaccuracies are not unique to IDL. The limitations imposed by floatingpoint formats exist on all computer hardware and are evident in every computer programming language. In fact, IDL uses standard CLibrary routines, such as scanf() and printf(), for converting, formatting, and displaying numbers; inaccuracies observed in IDL are directly related to CLibrary routines.
Regarding Fact 1, consider the number 5 3/4, which has the "exact" decimal representation 5.75 (as opposed to the number 1/3, which would require an infinite number of significant decimal places: 0.333...forever...).
When the number 5.75 is converted to binary, the result is 101.11, which is also exact.
Not all decimal numbers have such a "nice" relationship to the binary number system. For example, the number 1/5 has the exact decimal representation 0.2, but an infinite number of significant binary places would be required to represent 0.2.
Regarding Fact 2, the floatingpoint formats in use on modern computer hardware include a singleprecision floatingpoint format and a doubleprecision floatingpoint format. The formats are defined by The Institute of Electrical and Electronics Engineers (IEEE). The singleprecision format supports the equivalent of approximately 7 significant decimal places. The doubleprecision format supports the equivalent of approximately 16 significant decimal places. The word "approximately" refers to the fact that each decimal place does not correspond, exactly, to an integral number of binary places. Therefore, the least significant decimal place associated with these formats may be "fuzzy." In general, it is best not to trust it, if it is displayed.
An example:
Using IDL and the decimal number 423.36, observe what happens when it is entered as a singleprecision floatingpoint value and is printed with an increasing number of significant DISPLAY places.
IDL> a = 423.36
IDL> ; Display 7 significant decimal places:
IDL> print, a, format='(f20.4)'
423.3600
IDL> ; Display 8 significant decimal places:
IDL> print, a, format='(f20.5)'
423.35999
IDL>
Asking for 8 significant decimal places to be DISPLAYED, reveals the limitations of the singleprecision format. "Under the hood," 423.36 can't be represented, exactly, in binary floatingpoint format. The number 423.36 can be DISPLAYED exactly, because IDL (which means the CLibrary printf() function) performs rounding (in binary, not decimal) to the requested number of display places. The rounding process fails, when there aren't enough significant binary places available.
Now, look at a doubleprecision version of the same number:
IDL> a = 423.36d0
IDL> print, a, format='(d20.13)' ; Display 16 significant decimal places.
423.3600000000000
IDL> print, a, format='(d20.14)' ; Display 17 significant decimal places.
423.36000000000001
IDL>
Requesting 17 significant decimal places reveals the limitations of the doubleprecision format; the least significant DISPLAYED place represents uninitialized data bits (a.k.a., "garbage").
A Detailed Look at the IEEE Singleprecision Floatingpoint Format
The IEEE floatingpoint formats are, essentially, the binary equivalent of the familiar Scientific Notation used in the decimal number system. The IEEE formats store a signed mantissa part, which contains the significant places of a number, and a signed exponent part, which represents the magnitude of a number. The mantissa bits are coefficients of powers of two and the exponent part is a binary exponent. Just as in Scientific Notation, where the exponent part represents the number of decimal places (and direction) by which the decimal point is to be shifted, so the IEEE exponent part represents the number of binary places (and direction) by which the binary point is to be shifted. As with Scientific Notation, the significant places are positioned so that the most significant place is in the units position (the position corresponding to the zeroth power of the base). This process is called normalization.
Except for the special case of the number 0.0, every other floatingpoint number will have at least one nonzero significant binary place. Therefore, after the normalization process, the units place will always be nonzero (one) for nonzero numbers. This fact allows for the concept of a "hidden" bit in the IEEE formats. An extra significant binary place is gained by treating the units place as a "virtual" onebit (a special IEEE "bitpattern" allows computer floatingpoint arithmetic units to identify the 0.0 case). Only the remaining significant places of the normalized mantissa, from the negativefirst power of two downwards, are actually stored in computer memory.
Another example:
IDL can be used to examine the actual bitpattern of any floatingpoint number. The singleprecision format can be revealed by copying the bitpattern into a variable of type LONG and printing it using the hexadecimal editing code. Using the previous example of the decimal number 5.75, we would expect to see the number 101.11 pop out of the binary floatingpoint representation.
IDL> a = 5.75
IDL> b = long(a, 0)
IDL> print, b, format='(z8)'
40b80000
IDL>
Convert each hexadecimal place to its binary equivalent:
0100 0000 1011 1000 0000 0000 0000 0000
Spread out the sign bit, the exponent part, and the mantissa part:
s

exponent

Mantissa

0

10000001

01110000000000000000000

The IEEE singleprecision floatingpoint format has a sign bit (set to zero for positive numbers and to one for negative numbers), an 8bit exponent part, and a 23bit mantissa part. The exponent part is stored with a bias value, so that both negative and positive exponents may be represented. The bias value is 01111111 (127 decimal). Exponent parts which are greater than, or equal to, the bias value represent positive powers of two. Otherwise, they represent negative powers of two.
Subtract the bias value from the biased exponent part:
10000001  01111111 = 00000010
OK., it is easier to understand the subtraction if it is done in decimal:
129  127 = 2
The exponent part is actually a positive 2.
Combine the "hidden" bit (units place) with the bits actually stored in the mantissa part:
1.0111
Since the exponent part is a positive 2, shift the binary point two places to the right:
101.11
This result agrees with our expectations.
IDL TYPECONVERSION FUNCTIONS
The behavior and purpose of the IDL typeconversion functions, such as the DOUBLE function, are often misunderstood. For example, using the DOUBLE function with a singleprecision argument will create a doubleprecision result. However, the result will contain only a singleprecision amount of significant places.
IDL> x = double(423.36)
IDL> print, x, format='(d20.4)' ; Display 7 significant decimal places.
423.3600
IDL> print, x, format='(d20.5)' ; Display 8 significant decimal places.
423.35999
The number 423.36, which is an argument to the DOUBLE function, is converted to a singleprecision floatingpoint value BEFORE it is passed to the DOUBLE function. Since 423.36 can't be represented, exactly, in binary floatingpoint format, truncation of significant places occurs BEFORE it is passed to the DOUBLE function.
To declare a floatingpoint number as a doubleprecision number, with a full doubleprecision mantissa, the "D" exponential notation may be used:
IDL> x = 423.36d0
This notation forces the conversion process, from decimal to binary, to generate as many significant binary places as are possible in the doubleprecision format.
The IDL typeconversion functions are useful in expressions where specific behavior is required. For example, to avoid truncation of fractional parts during integer division:
IDL> x = 5
IDL> y = 7
IDL> print, x / y
0
IDL> print, float(x) / float(y)
0.714286