Floating Point Numbers

Floating point numbers are represented by non-computers (humans) in scientific notation (** represents raising to a power)

: 4.01 X 10**8 = 401,000,000.0
: 4.01 X 10**-3 = 0.00401
: -4.01 X 10**8 = -401,000,000.0
: -4.01 X 10**-3 = -0.00401

From these examples, it is apparent that a floating point number is represented using

: 2 numbers - the exponent and the mantissa
: 2 signs - one for the exponent and one for the mantissa

The computer represents each of these signed numbers differently in a floating point number

: exponent and sign - excess 7FH notation
: mantissa and sign - signed magnitude

Floating Point Numbers Using Decimal Digits and Excess 49 Notation

For this section, decimal digits will be used along with excess 49 notation for the exponent, just to make the math a little easier. The format used by the computer is structurally similar, except that hex digits are used with excess 7FH notation.

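Eight digits are used to represent a floating point number: two for the exponent and six for the mantissa, with the decimal point assumed to follow the first mantissa digit. Excess 49 means that 49 is added to the true exponent before it is stored, so an exponent of 8 is stored as 57 and an exponent of -3 is stored as 46. The sign of the mantissa will be represented as + or -, but in the computer it is represented by a bit: 1 means negative, 0 means positive.

Here are the above examples in this eight-digit format, which mirrors the one used by the computer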

: 4.01 X 10**8 = +57401000
: 4.01 X 10**-3 = +46401000
: -4.01 X 10**8 = -57401000
: -4.01 X 10**-3 = -46401000
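The encoding above can be sketched in a few lines of Python. This is only an illustration of the eight-digit teaching format, not of anything the hardware does; the function name encode_excess49 is made up for this example, and the sketch assumes a nonzero input whose exponent lies between -49 and 50.

```python
from decimal import Decimal

def encode_excess49(value: str) -> str:
    """Encode a decimal string in the sign + 2 exponent digits + 6 mantissa digits format."""
    d = Decimal(value)
    sign = "-" if d < 0 else "+"
    d = abs(d)
    if d == 0:
        raise ValueError("zero has no normalized form in this sketch")
    exponent = d.adjusted()           # power of ten of the leading digit
    mantissa = d.scaleb(-exponent)    # shifted so that 1 <= mantissa < 10
    digits = int(mantissa * 100000)   # six digits; anything beyond is truncated
    return f"{sign}{exponent + 49:02d}{digits:06d}"

for text in ("4.01E8", "4.01E-3", "-4.01E8", "-4.01E-3"):
    print(text, "=", encode_excess49(text))
```

The loop prints the same four encodings listed above.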

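This representation makes it easy to compare numbers. If two numbers have the same sign, then the digits after the sign can be compared as ordinary integers to determine which number has the larger magnitude. For two positive numbers the larger magnitude is the larger number; for two negative numbers it is the smaller one.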
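A minimal sketch of this comparison, using the eight-digit strings from the examples. The helper name compare_same_sign is hypothetical, and the sketch deliberately refuses numbers of mixed sign.

```python
def compare_same_sign(a: str, b: str) -> int:
    """Return -1, 0 or 1 for a < b, a == b, a > b; both must carry the same sign."""
    if a[0] != b[0]:
        raise ValueError("this sketch only handles numbers with the same sign")
    ka, kb = int(a[1:]), int(b[1:])     # exponent digits first, then mantissa digits
    result = (ka > kb) - (ka < kb)      # ordinary integer comparison
    # For negative numbers the larger magnitude is the smaller number.
    return -result if a[0] == "-" else result

print(compare_same_sign("+57401000", "+46401000"))
print(compare_same_sign("-57401000", "-46401000"))
```

The comparison works because the exponent digits come before the mantissa digits, so a larger exponent always dominates.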

Actual representation in the computer

Things aren't quite as simple as the above paragraph would indicate. If the above format were followed, then 33 bits would be needed to represent a floating point number (1 bit for the sign, 4 bits for each hex digit). 33 is bad, 32 is good. So, how is the extra bit discarded? Through absolute trickery!

Actually, all of the precision of the above format is obtained, but it is accomplished using 32 bits instead of 33. The trick is to remember that in reality these numbers are stored in binary. Also, every number is always in NORMALIZED form, which means that it starts with a 1, not a 0. The exponent is always adjusted to eliminate any leading 0's from the mantissa. (Zero cannot be normalized, so a special bit pattern is reserved for it.) So this is where the extra bit is squeezed in (or out). If EVERY number begins with a 1, then why store it in memory? Why not just have the program place a 1 at the beginning of every mantissa?

Using this trick, the layout of a number in the computer is

1 bit for the sign, 8 bits for the exponent, 23 bits for the mantissa

However, since the leading 1 of the mantissa is implied rather than stored, the mantissa effectively has 24 bits of precision. Pretty sneaky.
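As a rough illustration of this layout, the following Python sketch pulls the three fields out of a 32-bit float using the standard struct module. The function name float32_fields is made up for this example.

```python
import struct

def float32_fields(x: float):
    """Return (sign, stored exponent, stored mantissa bits) of x as a 32-bit float."""
    # Pack as a big-endian 32-bit float, then reread the same bytes as an unsigned int.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, excess 7FH
    mantissa = bits & 0x7FFFFF        # 23 bits, leading 1 not stored
    return sign, exponent, mantissa

sign, exponent, mantissa = float32_fields(1.5)
print(sign, f"{exponent:02X}H", f"{mantissa:023b}")
```

For 1.5 (binary 1.1 * 2**0) the stored exponent is 7FH and only the single bit after the point appears in the mantissa field; the leading 1 is nowhere in the 32 bits.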

Representing Binary numbers with decimal places

In base 10, a number like 0.123 represents

1/10 + 2/100 + 3/1000

What is the significance of the denominators 10, 100, 1000? They are the powers of the base (base 10). So, what would the number 0.101 represent in binary?

	1/2 + 0/4 + 1/8 = 5/8

since the powers of two are 2, 4, 8.

There is another way to calculate this: count the number of places after the point and raise 2 to that power. Since there are three places in this example, the denominator is 2**3 = 8. Then read the digits after the point as a binary integer to get the numerator, in this case 101 = 5. So the final number is 5/8.

Here are some more examples

	101.1101 = 5 13/16
	-11101.11101 = -29 29/32
	0.001011 = 11/64
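The counting method can be checked with a short Python sketch built on the standard fractions module. The name binary_to_fraction is hypothetical; the sketch treats the whole string (integer part and fractional part together) as one binary integer and divides by 2 raised to the number of places after the point.

```python
from fractions import Fraction

def binary_to_fraction(text: str) -> Fraction:
    """Convert a binary string such as '101.1101' to an exact fraction."""
    text = text.replace(" ", "")
    negative = text.startswith("-")
    text = text.lstrip("+-")
    whole, _, frac = text.partition(".")
    numerator = int((whole + frac) or "0", 2)   # all digits read as one binary integer
    value = Fraction(numerator, 2 ** len(frac)) # denominator = 2**(places after the point)
    return -value if negative else value

for text in ("0.101", "101.1101", "-11101.11101", "0.001011"):
    print(text, "=", binary_to_fraction(text))
```

Fraction prints improper fractions, so 5 13/16 appears as 93/16 and -29 29/32 as -957/32.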

Examples of computer storage of floating point numbers

Example for 5 13/16

101.1101
normalized number = 1.011101 * 2**2

sign of mantissa = 0
mantissa = 011101 (leading 1 is not stored)
excess 7FH exponent = 81H = 10000001 in binary

Binary representation of number
0 10000001 01110100000000000000000

Regroup
0100 0000 1011 1010 0000 0000 0000 0000

Hex representation of number

40BA0000

Example for -29 29/32

-11101.11101
normalized number = 1.110111101 * 2**4

sign of mantissa = 1
mantissa = 110111101 (leading 1 is not stored)
excess 7FH exponent = 83H = 10000011 in binary

Binary representation of number
1 10000011 11011110100000000000000

Regroup
1100 0001 1110 1111 0100 0000 0000 0000

Hex representation of number

C1EF4000

Example for 11/64

0.001011
normalized number = 1.011 * 2**(-3)

sign of mantissa = 0
mantissa = 011 (leading 1 is not stored)
excess 7FH exponent = 7CH = 01111100 in binary

Binary representation of number
0 01111100 01100000000000000000000

Regroup
0011 1110 0011 0000 0000 0000 0000 0000

Hex representation of number

3E300000
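The steps used in these three examples (normalize, add 7FH to the exponent, drop the leading 1, pad the mantissa to 23 bits) can be sketched in Python and cross-checked against the machine's own encoding. The function encode_float32 is a hypothetical helper for this illustration; it assumes a nonzero value that fits in 23 stored bits without rounding, and it does not check the exponent range.

```python
import struct
from fractions import Fraction

def encode_float32(value: Fraction) -> str:
    """Hand-encode an exactly representable, nonzero value as 8 hex digits."""
    if value == 0:
        raise ValueError("zero cannot be normalized; not handled in this sketch")
    sign = 1 if value < 0 else 0
    m = abs(value)
    exponent = 0
    while m >= 2:                  # normalize: move the binary point left
        m /= 2
        exponent += 1
    while m < 1:                   # or move it right
        m *= 2
        exponent -= 1
    stored = (m - 1) * 2**23       # drop the leading 1, scale to 23 bits
    if stored.denominator != 1:
        raise ValueError("value would need rounding; not handled in this sketch")
    bits = (sign << 31) | ((exponent + 0x7F) << 23) | int(stored)
    return f"{bits:08X}"

for value in (Fraction(93, 16), Fraction(-957, 32), Fraction(11, 64)):
    by_hand = encode_float32(value)
    by_machine = struct.pack(">f", float(value)).hex().upper()
    print(value, by_hand, by_machine)
```

Both columns should agree with the hex values worked out by hand: 40BA0000, C1EF4000 and 3E300000.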

Standard Formats for floating point numbers

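There are two standard formats for floating point numbers according to IEEE (standard IEEE 754)

: The above description is for the IEEE Short Real Format (single precision) that uses 32 bits
: There is also the IEEE Long Real Format (double precision) that uses 64 bits: 1 bit for the sign, 11 bits for the exponent (excess 3FFH notation), and 52 bits for the mantissa. As with the short format, the leading 1 of the mantissa is not stored.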
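For comparison, here is a sketch that splits a 64-bit number into its fields, again using the standard struct module. The function name float64_fields is made up for this example; Python's own float type is normally this 64-bit format.

```python
import struct

def float64_fields(x: float):
    """Return (sign, stored exponent, stored mantissa bits) of x as a 64-bit float."""
    # Pack as a big-endian 64-bit float, then reread the same bytes as an unsigned int.
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63                    # 1 bit
    exponent = (bits >> 52) & 0x7FF      # 11 bits, excess 3FFH
    mantissa = bits & ((1 << 52) - 1)    # 52 bits, leading 1 not stored
    return sign, exponent, mantissa

sign, exponent, mantissa = float64_fields(5.8125)
print(sign, f"{exponent:03X}H", f"{mantissa:052b}")
```

For 5 13/16 the stored exponent is 2 + 3FFH = 401H, and the mantissa field begins with the same bits 011101 as in the 32-bit example, followed by zeros.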