8/14/2019 Floating-Point Format
CS220
April 11, 2007
Floating-Point Format
Scientific Notation
  Coefficient/mantissa, exponent
Decimal
  Example: $2.429843 \times 10^{5}$, $7.3434 \times 10^{-3}$
Binary
  Example: $1.0111 \times 2^{2}$ => 101.11 (in binary)
  $101.11_2 = 2^{2} \times 1 + 2^{1} \times 0 + 2^{0} \times 1 + 2^{-1} \times 1 + 2^{-2} \times 1$
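The place-value expansion above can be checked mechanically. The following is a minimal sketch; the helper name `binary_to_decimal` is hypothetical and not part of the slides.

```python
def binary_to_decimal(bits: str) -> float:
    """Evaluate a binary string with an optional radix point by place value."""
    whole, _, frac = bits.partition(".")
    value = 0.0
    # Digits left of the point carry weights 2^0, 2^1, ... (right to left).
    for i, digit in enumerate(reversed(whole)):
        value += int(digit) * 2.0 ** i
    # Digits right of the point carry weights 2^-1, 2^-2, ...
    for i, digit in enumerate(frac, start=1):
        value += int(digit) * 2.0 ** -i
    return value


print(binary_to_decimal("101.11"))            # 5.75
print(binary_to_decimal("1.0111") * 2 ** 2)   # 5.75
```

Both lines print the same value because multiplying by $2^{2}$ only moves the radix point two places to the right.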
IEEE 754 Floating-Point Format
Components
  Sign (1 negative, 0 positive)
  Significand/Coefficient/Mantissa/Fraction
    Normalized or Denormalized
  Exponent (positive unsigned, biased)
Exponent Bias
The stored exponent is offset from the actual exponent value.
Two's complement would make comparison harder.
Adding a bias puts the exponent within an unsigned range suitable for comparison; the bias is $2^{e-1} - 1$ (here e is the size of the exponent field in bits).
For single precision, an exponent in the range -126 to +127 is biased by adding 127 to get a value in the range 1 to 254.
  0 is reserved for denormalized numbers or zero
  255 is reserved for infinity or NaN
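As a small illustration of the bias formula, the sketch below computes the bias from the exponent field width and maps an actual exponent to its stored code. The function names are hypothetical.

```python
def exponent_bias(e_bits: int) -> int:
    """Bias for an exponent field of e_bits bits: 2^(e-1) - 1."""
    return 2 ** (e_bits - 1) - 1


def stored_exponent(actual: int, e_bits: int) -> int:
    """Return the biased (stored) code for a normalized number's exponent."""
    stored = actual + exponent_bias(e_bits)
    # Code 0 and the all-ones code are reserved for special values.
    if not 1 <= stored <= 2 ** e_bits - 2:
        raise ValueError("exponent outside the normalized range")
    return stored


print(exponent_bias(8))                                    # 127
print(stored_exponent(-126, 8), stored_exponent(127, 8))   # 1 254
```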
Comparison
Bit pattern   Unsigned   Ones' complement   Two's complement   Biased
00000000      0          +0                 0                  -127
11111111      255        -0                 -1                 128
10000000      128        -127               -128               1
01111111      127        127                127                0
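The four readings of each bit pattern can be reproduced with a short sketch. The function name `interpret` is hypothetical, and the bias is assumed to be $2^{n-1} - 1$ for an n-bit pattern, as in the exponent field.

```python
def interpret(bits: str) -> dict:
    """Read one bit string under four integer conventions."""
    width = len(bits)
    n = int(bits, 2)                 # plain unsigned value
    top = 1 << (width - 1)           # weight of the leading bit
    mask = (1 << width) - 1
    negative = n >= top
    return {
        "unsigned": n,
        # Ones' complement: a set leading bit means "invert all bits, negate".
        # 11111111 is negative zero, which a Python int can only show as 0.
        "ones": -((~n) & mask) if negative else n,
        "twos": n - (1 << width) if negative else n,
        "biased": n - (top - 1),
    }


patterns = ["00000000", "11111111", "10000000", "01111111"]
for p in patterns:
    print(p, interpret(p))

# Sorting by the raw unsigned value orders the biased values correctly,
# which is not true for the two's complement reading.
by_unsigned = sorted(patterns, key=lambda b: interpret(b)["unsigned"])
by_biased = sorted(patterns, key=lambda b: interpret(b)["biased"])
by_twos = sorted(patterns, key=lambda b: interpret(b)["twos"])
print(by_unsigned == by_biased)   # True
print(by_unsigned == by_twos)     # False
```

This is the property the bias is chosen for: hardware can compare stored exponents as ordinary unsigned integers.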
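Precision
Field widths are given as (sign, exponent, significand) bits.
32 bits, Single-Precision (1, 8, 23)
  ($1.18 \times 10^{-38}$ to $3.40 \times 10^{38}$)
64 bits, Double-Precision (1, 11, 52)
  ($2.23 \times 10^{-308}$ to $1.79 \times 10^{308}$)
80 bits, Double-Extended-Precision (1, 15, 64)
  Intel format, not IEEE standard
  ($3.37 \times 10^{-4932}$ to $1.18 \times 10^{4932}$)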
Single Precision
Exponent is biased by $2^{8-1} - 1 = 127$
Represents -126 to 127
In the pictured example, the sign is zero, the exponent is -3, and the significand is 1.01 (in binary, which is 1.25 in decimal). The represented number is therefore $+1.25 \times 2^{-3}$, which is +0.15625.
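To make the decoding concrete, here is a minimal sketch that splits a 32-bit word into its three fields. The word `0x3E200000` is an assumption: it is reconstructed from the values quoted in the text (sign 0, stored exponent -3 + 127 = 124, fraction .01). The function name is hypothetical, and only normalized encodings are handled.

```python
import struct


def decode_single(word: int):
    """Decode a normalized single-precision bit pattern into its parts."""
    sign = word >> 31
    stored_exp = (word >> 23) & 0xFF
    fraction = word & 0x7FFFFF
    significand = 1 + fraction / 2 ** 23   # implied leading 1
    exponent = stored_exp - 127            # remove the bias
    value = (-1) ** sign * significand * 2.0 ** exponent
    return sign, exponent, significand, value


word = 0x3E200000
print(decode_single(word))                                # (0, -3, 1.25, 0.15625)
# Cross-check against the platform's own float decoding.
print(struct.unpack(">f", struct.pack(">I", word))[0])    # 0.15625
```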
Single Precision Number Ranges
The smallest non-zero positive and largest non-zero negative numbers (represented by the denormalized value with all 0s in the Exp field and the binary value 1 in the Fraction field) are
  $\pm 2^{-149} \approx \pm 1.4012985 \times 10^{-45}$
The smallest non-zero positive and largest non-zero negative normalized numbers (represented with the binary value 1 in the Exp field and 0 in the Fraction field) are
  $\pm 2^{-126} \approx \pm 1.175494351 \times 10^{-38}$
The largest finite positive and smallest finite negative numbers (represented by the value with 254 in the Exp field and all 1s in the Fraction field) are
  $\pm (2^{128} - 2^{104}) \approx \pm 3.4028235 \times 10^{38}$
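These three limits can be checked by building the bit patterns described above and reading them back as floats. This is a sketch using Python's standard `struct` module; the helper name `single_from_bits` is hypothetical.

```python
import struct


def single_from_bits(word: int) -> float:
    """Reinterpret a 32-bit integer as an IEEE 754 single-precision value."""
    return struct.unpack(">f", struct.pack(">I", word))[0]


smallest_denormal = single_from_bits(0x00000001)  # Exp = 0,   Fraction = 1
smallest_normal = single_from_bits(0x00800000)    # Exp = 1,   Fraction = 0
largest_finite = single_from_bits(0x7F7FFFFF)     # Exp = 254, Fraction all 1s

print(smallest_denormal == 2.0 ** -149)               # True
print(smallest_normal == 2.0 ** -126)                 # True
print(largest_finite == 2.0 ** 128 - 2.0 ** 104)      # True
```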
Example
Encode the decimal number -118.625 using the IEEE 754 system.
First we need to get the sign, the exponent and the fraction. Because it is a negative number, the sign is "1".
Now, we write the number (without the sign; i.e. unsigned, no two's complement) using binary notation. The result is 1110110.101.
Next, let's move the radix point left, leaving only a 1 at its left: $1110110.101 = 1.110110101 \times 2^{6}$. This is a normalized floating-point number. The mantissa is the part at the right of the radix point, filled with 0s on the right until we get all 23 bits. That is 11011010100000000000000.
The exponent is 6, but we need to convert it to binary and bias it (so that all stored exponents are non-negative binary numbers). For the 32-bit IEEE 754 format, the bias is 127 and so 6 + 127 = 133. In binary, this is written as 10000101.
Putting the three fields together gives 1 10000101 11011010100000000000000.
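The encoding steps above can be sketched in a few lines. The function `encode_single` is a hypothetical helper that only handles non-zero values in the normalized range that are exactly representable; it does no rounding.

```python
import struct


def encode_single(value: float) -> str:
    """Return 'sign exponent fraction' bit fields for a normalized value."""
    if value == 0:
        raise ValueError("zero is a special case and is not handled here")
    sign = 1 if value < 0 else 0
    magnitude = abs(value)
    exponent = 0
    # Move the radix point until exactly one 1 remains to its left.
    while magnitude >= 2.0:
        magnitude /= 2.0
        exponent += 1
    while magnitude < 1.0:
        magnitude *= 2.0
        exponent -= 1
    fraction = int((magnitude - 1.0) * 2 ** 23)   # drop the implied leading 1
    stored_exp = exponent + 127                   # apply the bias
    return f"{sign:01b} {stored_exp:08b} {fraction:023b}"


fields = encode_single(-118.625)
print(fields)                               # 1 10000101 11011010100000000000000
# Cross-check against the platform's own encoding.
print(struct.pack(">f", -118.625).hex())    # c2ed4000
```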
Double Precision
Exponent is biased by $2^{11-1} - 1 = 1023$
Represents -1022 to 1023
Double Precision Number Ranges
The smallest non-zero positive and largest non-zero negative numbers (represented by the denormalized value with all 0s in the Exp field and the binary value 1 in the Fraction field) are
  $\pm 2^{-1074} \approx \pm 5 \times 10^{-324}$
The smallest non-zero positive and largest non-zero negative normalized numbers (represented by the value with the binary value 1 in the Exp field and 0 in the Fraction field) are
  $\pm 2^{-1022} \approx \pm 2.2250738585072014 \times 10^{-308}$
The largest finite positive and smallest finite negative numbers (represented by the value with 2046 in the Exp field and all 1s in the Fraction field) are
  $\pm (2^{1024} - 2^{971}) \approx \pm 1.7976931348623157 \times 10^{308}$
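A quick check of the double-precision limits, as a sketch that assumes Python's `float` is an IEEE 754 double (true for CPython on common platforms):

```python
import math
import sys

smallest_denormal = math.ldexp(1.0, -1074)        # 2^-1074
smallest_normal = math.ldexp(1.0, -1022)          # 2^-1022
# Computed with exact integers first, since 2.0 ** 1024 itself overflows.
largest_finite = float(2 ** 1024 - 2 ** 971)

print(smallest_denormal)                          # 5e-324
print(smallest_normal == sys.float_info.min)      # True
print(largest_finite == sys.float_info.max)       # True
```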
Special Cases
Zero is not directly representable in the straight format, due to the assumption of a leading 1 (we'd need to specify a true zero mantissa to yield a value of zero). Zero is a special value denoted with an exponent field of zero and a fraction field of zero.
If the exponent is all 0s, but the fraction is non-zero (else it would be interpreted as zero), then the value is a denormalized number, which does not have an assumed leading 1 before the binary point. Thus, this represents a number $(-1)^{s} \times 0.f \times 2^{-126}$, where s is the sign bit and f is the fraction. For double precision, denormalized numbers are of the form $(-1)^{s} \times 0.f \times 2^{-1022}$. From this you can interpret zero as a special type of denormalized number.
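The denormalized formula can be evaluated directly. The sketch below applies $(-1)^{s} \times 0.f \times 2^{-126}$ to single-precision words whose exponent field is all 0s; the function name is hypothetical.

```python
import math
import struct


def decode_denormal_single(word: int) -> float:
    """Evaluate (-1)^s * 0.f * 2^-126 for a word with an all-zero exponent."""
    sign = word >> 31
    stored_exp = (word >> 23) & 0xFF
    fraction = word & 0x7FFFFF
    if stored_exp != 0:
        raise ValueError("not a zero or denormalized encoding")
    # No implied leading 1: the significand is 0.f.
    return (-1) ** sign * (fraction / 2 ** 23) * 2.0 ** -126


print(decode_denormal_single(0x00000001) == 2.0 ** -149)   # True
print(decode_denormal_single(0x00000000))                  # 0.0
print(decode_denormal_single(0x80000000))                  # -0.0
```

With a fraction of zero the same formula yields plus or minus zero, which is the sense in which zero can be read as a special denormalized number.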
Special Cases (cont.)
The values +infinity and -infinity are denoted with an exponent of all 1s and a fraction of all 0s. The sign bit distinguishes between negative infinity and positive infinity. Being able to denote infinity as a specific value is useful because it allows operations to continue past overflow situations.
The value NaN (Not a Number) is used to represent a value that does not represent a real number. NaNs are represented by a bit pattern with an exponent of all 1s and a non-zero fraction.
Special Cases Summary
Type           Exponent             Mantissa
Zeroes         0                    0
  (Positive/Negative Zero depends on sign)
Denormalized   0                    non zero
Normalized     1 to $2^{e} - 2$     any
Infinities     $2^{e} - 1$          0
  (Positive/Negative Infinity depends on sign)
NaNs           $2^{e} - 1$          non zero
(here e is the size of the exponent field in bits)
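The summary table translates almost line for line into a classifier. This is a minimal sketch; `classify` is a hypothetical name, and the defaults assume the single-precision layout (8 exponent bits, 23 fraction bits).

```python
def classify(word: int, e_bits: int = 8, f_bits: int = 23) -> str:
    """Classify a bit pattern by its exponent and fraction fields."""
    all_ones = (1 << e_bits) - 1                 # 2^e - 1
    stored_exp = (word >> f_bits) & all_ones     # sign bit is masked away
    fraction = word & ((1 << f_bits) - 1)
    if stored_exp == 0:
        return "zero" if fraction == 0 else "denormalized"
    if stored_exp == all_ones:
        return "infinity" if fraction == 0 else "NaN"
    return "normalized"                          # 1 to 2^e - 2


for w in (0x80000000, 0x00000001, 0x3F800000, 0x7F800000, 0x7FC00000):
    print(f"{w:08X}", classify(w))
```

Passing `e_bits=11, f_bits=52` applies the same rules to 64-bit double-precision words.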
Special Operations
Overflow: exponent too large, producing an infinity.
Underflow: exponent too small, producing a denorm or zero.
Zero divide: a nonzero number is divided by zero, producing an infinity of the appropriate sign.
Operand error: such as division of zero by zero, or taking the square root of -1, producing a NaN.
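Overflow and underflow are easy to observe with ordinary double-precision arithmetic. The sketch below assumes Python's `float` is an IEEE 754 double. Note that Python itself raises `ZeroDivisionError` for division by zero and `ValueError` for `math.sqrt(-1)` rather than returning an infinity or NaN, so those two cases are not shown.

```python
import math
import sys

big = sys.float_info.max        # largest finite double
tiny = sys.float_info.min       # smallest normalized double

overflowed = big * 2.0          # exponent too large
denormal = tiny / 2.0           # below the normalized range
flushed = 5e-324 / 2.0          # below the smallest denormal

print(math.isinf(overflowed))   # True
print(0.0 < denormal < tiny)    # True
print(flushed)                  # 0.0
```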
Special Operations
Operation               Result
n ÷ Infinity            0
Infinity × Infinity     Infinity
nonzero ÷ 0             Infinity
Infinity + Infinity     Infinity
0 ÷ 0                   NaN
Infinity - Infinity     NaN
Infinity ÷ Infinity     NaN
Infinity × 0            NaN
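Most rows of this table can be reproduced with Python floats, as a rough check. The two rows that divide by zero are left out because Python raises `ZeroDivisionError` there instead of following the IEEE default result.

```python
import math

inf = math.inf

print(5.0 / inf)                # 0.0
print(inf * inf)                # inf
print(inf + inf)                # inf
print(math.isnan(inf - inf))    # True
print(math.isnan(inf / inf))    # True
print(math.isnan(inf * 0.0))    # True
```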
FPU
Coprocessor: supplements the functions of the primary processor.
Coprocessor examples: floating-point arithmetic, graphics, signal processing, string processing, or encryption.
FPU registers: eight 80-bit data registers, three 16-bit registers (control, status, and tag)
FPU Register Stack
Circular; the top of the stack is defined in the status register (bits 11-13)
%st(n): the n-th register counting from the top of the stack
Preset Values
FLD1     Push 1.0
FLDL2T   Push $\log_{2} 10$
FLDL2E   Push $\log_{2} e$
FLDPI    Push Pi
FLDLG2   Push $\log_{10} 2$
FLDLN2   Push $\ln 2$ ($\log_{e} 2$)
FLDZ     Push 0.0
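For reference, the numeric values of these constants can be approximated with Python's `math` module. This is only a sketch: the FPU pushes the constants in its own 80-bit format, while the values below are double-precision approximations, and the dictionary `presets` is a hypothetical name.

```python
import math

presets = {
    "FLD1": 1.0,
    "FLDL2T": math.log2(10),
    "FLDL2E": math.log2(math.e),
    "FLDPI": math.pi,
    "FLDLG2": math.log10(2),
    "FLDLN2": math.log(2),
    "FLDZ": 0.0,
}

for name, value in presets.items():
    print(f"{name:7s} {value:.15f}")
```

Two pairs are reciprocals of each other (FLDL2T with FLDLG2, and FLDL2E with FLDLN2), which is useful when converting between logarithm bases.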
Status Register
Indicates the operating condition of the FPU
Status Bit   Description
0            Invalid operation exception flag
1            Denormalized operand exception flag
2            Zero divide exception flag
3            Overflow exception flag
4            Underflow exception flag
5            Precision exception flag
6            Stack fault
7            Error summary status
8            Condition code bit 0 (C0)
9            Condition code bit 1 (C1)
10           Condition code bit 2 (C2)
11-13        Top of stack pointer
14           Condition code bit 3 (C3)
15           FPU busy flag
fstsw oldvalue
(There is no fldsw instruction; the status word can only be reloaded as part of the FPU environment, with fldenv.)
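Once the status word has been stored to memory, its fields can be picked apart with shifts and masks following the bit layout above. A minimal sketch; the function name and the sample value `0x3884` are hypothetical.

```python
EXCEPTION_FLAGS = [
    "invalid", "denormal", "zero_divide", "overflow",
    "underflow", "precision", "stack_fault", "error_summary",
]


def decode_status_word(sw: int) -> dict:
    """Split a 16-bit FPU status word into its fields."""
    return {
        # Bits 0-7: one flag per bit.
        "flags": [name for bit, name in enumerate(EXCEPTION_FLAGS)
                  if (sw >> bit) & 1],
        "C0": (sw >> 8) & 1,
        "C1": (sw >> 9) & 1,
        "C2": (sw >> 10) & 1,
        "top": (sw >> 11) & 0b111,    # bits 11-13: top of stack pointer
        "C3": (sw >> 14) & 1,
        "busy": (sw >> 15) & 1,
    }


print(decode_status_word(0x3884))
```

For the sample value this reports the zero divide and error summary flags set and a top-of-stack pointer of 7.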
Control Register
Controls the FPU functions, such as calculation precision and rounding method
Control Bit   Description
0             Invalid operation exception mask
1             Denormal operand exception mask
2             Zero divide exception mask
3             Overflow exception mask
4             Underflow exception mask
5             Precision exception mask
6-7           Reserved
8-9           Precision control
10-11         Rounding control
12            Infinity control
13-15         Reserved
fstcw oldvalue
fldcw newvalue
Control Register
Precision Control
  00 -- single-precision (24-bit significand)
  01 -- not used
  10 -- double-precision (53-bit significand)
  11 -- double-extended-precision (64-bit significand)
Rounding Control
  00 -- round to nearest
  01 -- round down (toward negative infinity)
  10 -- round up (toward positive infinity)
  11 -- round toward zero
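The precision and rounding fields can be read from, or patched into, a control word with simple bit arithmetic. The sketch below uses `0x037F`, the value the x87 FPU loads on initialization; the function names are hypothetical.

```python
PRECISION = {
    0b00: "single-precision",
    0b01: "not used",
    0b10: "double-precision",
    0b11: "double-extended-precision",
}
ROUNDING = {
    0b00: "round to nearest",
    0b01: "round down",
    0b10: "round up",
    0b11: "round toward zero",
}


def decode_control_word(cw: int) -> dict:
    """Extract exception masks, precision control and rounding control."""
    return {
        "masks": cw & 0x3F,                        # bits 0-5
        "precision": PRECISION[(cw >> 8) & 0b11],  # bits 8-9
        "rounding": ROUNDING[(cw >> 10) & 0b11],   # bits 10-11
    }


def set_rounding(cw: int, rc: int) -> int:
    """Return cw with the rounding control field replaced by rc."""
    return (cw & ~(0b11 << 10)) | ((rc & 0b11) << 10)


print(decode_control_word(0x037F))
print(hex(set_rounding(0x037F, 0b11)))   # 0xf7f
```

In an assembly program, the patched value would be what gets loaded back with fldcw after the old value was saved with fstcw.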
Tag Register
Identifies the values within the eight 80-bit FPU data registers (2 bits per register)
  A valid double-extended-precision value (code 00)
  A zero value (code 01)
  A special floating-point value (code 10)
  Nothing (empty) (code 11)
The tag word has no dedicated store or load instruction; it is saved and restored together with the rest of the FPU environment:
fstenv oldenv
fldenv newenv
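Because each register gets a 2-bit code, a 16-bit tag word decodes into eight entries. A minimal sketch, assuming the tag for physical register 0 sits in the lowest two bits; the function name and the second sample value are hypothetical.

```python
TAGS = {0b00: "valid", 0b01: "zero", 0b10: "special", 0b11: "empty"}


def decode_tag_word(tw: int) -> list:
    """Return the tag of each of the eight data registers, register 0 first."""
    return [TAGS[(tw >> (2 * i)) & 0b11] for i in range(8)]


print(decode_tag_word(0xFFFF))   # all eight registers empty
print(decode_tag_word(0xFFF4))   # register 0 valid, register 1 zero, rest empty
```

A tag word of `0xFFFF` corresponds to a freshly initialized FPU, where every data register is marked empty.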