Computer Arithmetic: Floating-point
CS 3339, Lecture 5
Apan Qasem, Texas State University
Spring 2015
*some slides adapted from P&H
Review
• Signed binary representation
  • sign-and-magnitude
  • 2's complement
• Overflow detection
• Implementation of binary multiplication and division in the microarchitecture
performance issues?
Representing Big (and Small) Numbers
• No matter what our word size is, there will always be a need to represent bigger numbers
MAXINT + 1, MININT - 1
• There is also a need to represent really small numbers
nanosecond: 0.000000001
• Also, there are numbers that are neither big nor small but still
might require a lot of space
PI = 3.141…
e = 2.718281…
• Floating-point values allow us to represent large, small, and real numbers
Note: the hardware and software issues here are different
Scientific Notation
• Idea of floating-point representation is not new
• Borrowed from convention used in scientific journals
• For example,
  0.000000000000000000000000000000000001
  would typically be written as 1.0 x 10^-35
  and
  602,214,150,000,000,000,000,000
  would typically be written as 6.0221415 x 10^23
Scientific Notation
• General format: (-1)^sign x F x 10^E
  F = fractional part
  E = exponent
• Fractional part usually normalized (e.g., 1.xxxx)
• For binary, it becomes
  (-1)^sign x F x 2^E
Binary Representation of Floating-point Values
• Allocate some bits for exponent, some for fraction, and one for sign
• Still have to fit everything in 32 bits (single precision)
• What is the trade-off?
Range vs. Precision
more bits in the fraction (F) gives more accuracy; more bits in the
exponent (E) increases the size of the numbers that can be represented
Floating-point Standard
• Defined by IEEE Std 754-1985
  • Revised in 2008
http://en.wikipedia.org/wiki/IEEE_754-2008
• Developed in response to divergence of representations
  • Portability issues for scientific code
• Now almost universally adopted
  • Enforced at the language level
• Two representations
  • Single precision (32-bit)
  • Double precision (64-bit)
IEEE Floating-point Format
Sign
• To store
  • 0 for non-negative and 1 for negative value
  • sign-and-magnitude representation
• To interpret
  • 0 implies non-negative and 1 implies negative

FP Storage: sign (1 bit), exponent (single: 8 bits, double: 11 bits), fraction (single: 23 bits, double: 52 bits)
FP Interpretation: (-1)^S x (1 + Fraction) x 2^(Exponent - Bias)
IEEE Floating-point Format
Fraction
• To store
  • Normalize fraction to a value in the range 1.0 ≤ x < 2.0
  • If value is 4.56 x 2^10 then normalized value is 1.14 x 2^12
  • Store only the bits past the decimal point (14)
  • All normalized values have 1 to the left of the point, so no need to
    represent it explicitly (hidden bit)
  • Significand is Fraction with the "1." restored
• To interpret
  • Extract bits from fraction and add 1

FP Storage: sign (1 bit), exponent (single: 8 bits, double: 11 bits), fraction (single: 23 bits, double: 52 bits)
FP Interpretation: (-1)^S x (1 + Fraction) x 2^(Exponent - Bias)
IEEE Floating-point Format
Exponent
• To store
  • add bias to exponent
  • If exponent is 3, store 3 + 127 = 130
  • excess representation ensures the stored exponent is unsigned
  • single: bias = 127; double: bias = 1023
• To interpret
  • Extract exponent bits and subtract bias
  • If exponent bits represent 17 then interpret as 17 - 127 = -110

FP Storage: sign (1 bit), exponent (single: 8 bits, double: 11 bits), fraction (single: 23 bits, double: 52 bits)
FP Interpretation: (-1)^S x (1 + Fraction) x 2^(Exponent - Bias)
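The storage fields and the interpretation formula above can be checked directly; a small Python sketch (using the standard struct module to get at the raw bits; as written, the formula only applies to normalized values):

```python
import struct

def decode_float32(x):
    """Unpack a Python float into IEEE 754 single-precision fields."""
    bits, = struct.unpack(">I", struct.pack(">f", x))  # raw 32-bit pattern
    sign = bits >> 31                  # 1 bit
    exponent = (bits >> 23) & 0xFF     # 8 bits, biased by 127
    fraction = bits & 0x7FFFFF         # 23 bits, hidden 1 not stored
    # FP Interpretation: (-1)^S x (1 + Fraction) x 2^(Exponent - Bias)
    value = (-1) ** sign * (1 + fraction / 2**23) * 2.0 ** (exponent - 127)
    return sign, exponent, fraction, value

# -0.4375 = -1.75 x 2^-2: sign = 1, stored exponent = -2 + 127 = 125
print(decode_float32(-0.4375))  # (1, 125, 6291456, -0.4375)
```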
Single-Precision Range
• Exponents 00000000 and 11111111 are reserved
• Smallest value
  • smallest stored exponent: 00000001
    actual exponent = 1 - 127 = -126
  • Fraction: 000…00  significand = 1.0
  ±1.0 x 2^-126 ≈ ±1.2 x 10^-38
• Largest value
  • largest stored exponent: 11111110
    actual exponent = 254 - 127 = +127
  • Fraction: 111…11  significand ≈ 2.0
  ±2.0 x 2^+127 ≈ ±3.4 x 10^+38
Double-Precision Range
• Exponents 0000…00 and 1111…11 reserved
• Smallest value
  • Exponent: 00000000001
    actual exponent = 1 - 1023 = -1022
  • Fraction: 000…00  significand = 1.0
  ±1.0 x 2^-1022 ≈ ±2.2 x 10^-308
• Largest value
  • Exponent: 11111111110
    actual exponent = 2046 - 1023 = +1023
  • Fraction: 111…11  significand ≈ 2.0
  ±2.0 x 2^+1023 ≈ ±1.8 x 10^+308
IEEE 754 FP Standard : Examples
• Smallest+: 0 00000001 1.00000000000000000000000 = 1.0 x 2^(1-127)
• Zero:      0 00000000 00000000000000000000000  = true 0
• Largest+:  0 11111110 1.11111111111111111111111 = (2 - 2^-23) x 2^(254-127)
• 1.0two x 2^-1 = ?
• 0.75ten x 2^4 = ?
FP Conversion
0.75 x 2^4
1. Convert to fraction with a power of 2 denominator
   0.75 x 2^4 = 3/4 x 2^4 = 3/2^2 x 2^4
2. Convert numerator to binary
   11two / 2^2 x 2^4
3. Determine place for the decimal (binary) point
   0.11two x 2^4
4. Normalize
   1.1two x 2^3
5. Adjust for IEEE standard
   1 + Fraction, so Fraction is 0.1
   Bias: 127 + 3 = 130, so Exponent is 130
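The five steps can be checked against what the hardware actually stores; a Python sketch (standard struct module; the field names are just local variables):

```python
import struct

# 0.75 x 2^4 = 1.1two x 2^3, so per the steps above we expect:
#   sign = 0, stored exponent = 3 + 127 = 130, fraction bits = 0.1two
bits, = struct.unpack(">I", struct.pack(">f", 0.75 * 2**4))
sign = bits >> 31
exponent = (bits >> 23) & 0xFF
fraction = bits & 0x7FFFFF
print(sign, exponent, bin(fraction))  # 0 130 0b10000000000000000000000
```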
IEEE 754 FP Standard : Examples
Smallest+: 0 00000001 1.00000000000000000000000 = 1.0 x 2^(1-127)
Zero:      0 00000000 00000000000000000000000  = true 0
Largest+:  0 11111110 1.11111111111111111111111 = (2 - 2^-23) x 2^(254-127)
1.0two x 2^-1 = 0 01111110 1.00000000000000000000000
0.75ten x 2^4 = 0 10000010 1.10000000000000000000000
Floating-point Addition in Base 10
To add F1 x 10^E1 and F2 x 10^E2:
1. Divide/Multiply one of the numbers to make the exponents the same
2. Factor out the exponent
3. Add the fractions

3.2 x 10^2 + 1.456 x 10^4
= 0.032 x 10^4 + 1.456 x 10^4
= (0.032 + 1.456) x 10^4 = 1.488 x 10^4
Floating-point Addition in Binary
Addition (and subtraction): (F1 x 2^E1) + (F2 x 2^E2) = F3 x 2^E3
Step 0: Restore the hidden bit in F1 and in F2
Step 1: Align fractions by right shifting F2 by (E1 - E2) positions
        (assuming E1 ≥ E2); divide by 2 ≈ right shift by 1,
        e.g., 8 x 2^3 = 8/2 x 2^4 = 4 x 2^4
Step 2: Add the resulting F2 to F1 to form F3
Step 3: Normalize F3 (so it is in the form 1.XXXXX…)
        If E1 and E2 have the same sign, check for overflow
        If E1 and E2 have different signs, check for underflow
        (number too small!)
Step 4: Round F3 and possibly normalize F3 again
        if F3 has more bits than we have room for
Step 5: Re-hide the most significant bit of F3 before storing the result
Floating Point Addition Example
Addition (assume 4 bits for fraction, 3 for exponent, bias = 4)
(0.5 = 1.0000two x 2^-1) + (-0.4375 = -1.1100two x 2^-2)

Step 0: Hidden bits restored in the representation above
Step 1: Right-shift the significand with the smaller exponent until its
        exponent matches the larger exponent (just once in this example)
        1.0000 x 2^-1 and -0.1110 x 2^-1
Step 2: Add significands
        1.0000 + (-0.1110) = 1.0000 - 0.1110 = 0.0010
        (can't do 2's complement math here)
Step 3: Normalize the sum, checking for exponent over/underflow
        0.0010 x 2^-1 = 0.0100 x 2^-2 = … = 1.0000 x 2^-4
Step 4: No need to round, 4 bits in fraction field
Step 5: Re-hide the hidden bit before storing
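The steps above can be sketched as a toy model in Python (unbounded integer significands, truncating shifts, no rounding or special cases; fp_add and FRAC_BITS are illustrative names, not a real API):

```python
FRAC_BITS = 4  # toy format from the example: 4 fraction bits

def fp_add(f1, e1, f2, e2):
    """Add (f1 x 2^e1) + (f2 x 2^e2) in a toy float format.

    f1, f2: signed significands with the hidden bit restored,
    scaled by 2**FRAC_BITS, e.g. 1.0000two -> 0b10000 = 16.
    Returns (f3, e3) with f3 normalized back to the 1.xxxx range.
    """
    # Step 1: align — shift the significand with the smaller exponent
    if e1 < e2:
        f1, e1, f2, e2 = f2, e2, f1, e1
    f2 >>= e1 - e2              # right shift = divide by 2 per position
    # Step 2: add significands (real HW uses sign-and-magnitude;
    # Python's signed ints make it a plain add here)
    f3, e3 = f1 + f2, e1
    # Step 3: normalize so that 1.0 <= |f3| < 2.0, adjusting the exponent
    while abs(f3) >= 2 ** (FRAC_BITS + 1):   # sum grew past 1.xxxx
        f3 >>= 1
        e3 += 1
    while 0 < abs(f3) < 2 ** FRAC_BITS:      # leading 1 fell below the point
        f3 <<= 1
        e3 -= 1
    return f3, e3

# 0.5 + (-0.4375) = 0.0625 = 1.0000two x 2^-4, as in the worked example
f3, e3 = fp_add(0b10000, -1, -0b11100, -2)
print(f3 / 2**FRAC_BITS * 2.0**e3)  # 0.0625
```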
Floating Point Multiplication
• Multiplication: (F1 x 2^E1) x (F2 x 2^E2) = F3 x 2^E3
Step 0: Restore the hidden bit in F1 and in F2
Step 1: Add the two exponents and adjust for bias
        E1 + E2 - bias = E3   (subtract bias twice, add it once)
        e.g., E1 = 5, E2 = 4, bias = 3: E3 = ((5 - 3) + (4 - 3)) + 3 = 6
        Determine the sign of the product
        (could we use a basic logic operation to determine the sign?)
Step 2: Multiply F1 by F2 to form a double-precision F3
Step 3: Normalize F3 (so it is in the form 1.XXXXX…)
        1-bit right shift of F3 and increment of E3
        Check for overflow/underflow
Step 4: Round F3 and possibly normalize F3 again
Step 5: Re-hide the most significant bit of F3 before storing the result
Floating Point Multiplication Example
Multiply (assume 32-bit single-precision values)
(0.5 = 1.0000two x 2^-1) x (-0.4375 = -1.1100two x 2^-2)

Step 0: Hidden bits restored in the representation above
Step 1: Add the exponents
        E1 = -1 + 127 = 126, E2 = -2 + 127 = 125
        E3 = (126 + 125) - 127 = 124  (actual exponent -3)
Step 2: Multiply the significands
        1.0000 x 1.1100 = 1.110000
Step 3: Normalize the product, checking for exponent over/underflow
        1.110000 x 2^-3 is already normalized
Step 4: No need to round
Step 5: Re-hide the hidden bit before storing
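Steps 0-5 can be sketched on raw single-precision fields in Python (illustrative names; rounding, overflow/underflow, zeros, and special values are all ignored):

```python
BIAS, FRAC_BITS = 127, 23  # IEEE single precision

def fp_mul_fields(s1, e1, f1, s2, e2, f2):
    """Multiply two values given as raw (sign, stored-exponent,
    fraction-bits) fields; returns the product's raw fields."""
    s3 = s1 ^ s2                         # sign: a basic logic op (XOR)
    e3 = e1 + e2 - BIAS                  # both inputs carry the bias; drop one
    # Step 0 + 2: restore hidden bits, form a double-width significand
    sig = (f1 | 1 << FRAC_BITS) * (f2 | 1 << FRAC_BITS)
    # Step 3: product of 1.x values is in [1, 4); shift right once if >= 2
    if sig >= 1 << (2 * FRAC_BITS + 1):
        sig >>= 1
        e3 += 1
    # Step 5: drop the double-width low bits and re-hide the leading 1
    f3 = (sig >> FRAC_BITS) & ((1 << FRAC_BITS) - 1)
    return s3, e3, f3

# 0.5 x (-0.4375), using the fields from the worked example above
print(fp_mul_fields(0, 126, 0, 1, 125, 0b11 << 21))  # (1, 124, 6291456)
```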
MIPS Floating Point Instructions
• MIPS has a separate Floating Point Register File ($f0, $f1, …, $f31) with special instructions to load into and store from them
lwc1 $f1,16($s2) #$f1 = Memory[$s2+16]
swc1 $f1,24($s4) #Memory[$s4+24] = $f1
• the registers are used in pairs for double-precision values
• supports IEEE 754 single- and double-precision operations
add.s $f2,$f4,$f6 #$f2 = $f4 + $f6
add.d $f2,$f4,$f6 #$f2||$f3 = $f4||$f5 + $f6||$f7
• Similarly, instructions for sub.s, sub.d, mul.s, mul.d, div.s, div.d
MIPS Floating Point Instructions
• Floating-point single-precision comparison operations
  c.x.s $f2,$f4  # if($f2 < $f4) cond=1; else cond=0
  where x may be eq, neq, lt, le, gt, ge
• Double-precision comparison operations
  c.x.d $f2,$f4  # if($f2||$f3 < $f4||$f5) cond=1; else cond=0
• Floating-point branch operations
  bc1t 25  # if(cond==1) go to PC+4+(25x4)
  bc1f 25  # if(cond==0) go to PC+4+(25x4)
Frequency of Common MIPS Instructions
only instructions with frequency >3% (SPECint) or >1% (SPECfp) included

        SPECint  SPECfp           SPECint  SPECfp
addu     5.2%    3.5%     add.d    0.0%   10.6%
addiu    9.0%    7.2%     sub.d    0.0%    4.9%
or       4.0%    1.2%     mul.d    0.0%   15.0%
sll      4.4%    1.9%     add.s    0.0%    1.5%
lui      3.3%    0.5%     sub.s    0.0%    1.8%
lw      18.6%    5.8%     mul.s    0.0%    2.4%
sw       7.6%    2.0%     l.d      0.0%   17.5%
lbu      3.7%    0.1%     s.d      0.0%    4.9%
lhu      1.3%    0.0%     l.s      0.0%    4.2%
beq      8.6%    2.2%     s.s      0.0%    1.1%
bne      8.4%    1.4%
slt      9.9%    2.3%
slti     3.1%    0.3%
sltu     3.4%    0.8%
Associativity
Associativity matters in floating-point operations: (x + y) + z may not
equal x + (y + z)
What are the implications?
Need to be especially careful in parallelizing FP applications:
incorrect results, even if there is no dependence!
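A concrete Python check of the point above (IEEE 754 double precision; the same effect appears in any language using IEEE arithmetic):

```python
# Floating-point addition is not associative: a large value can
# absorb a small one, so grouping changes the result.
x, y, z = 1.0e20, -1.0e20, 1.0

left = (x + y) + z    # 0.0 + 1.0 -> 1.0
right = x + (y + z)   # 1e20 + (-1e20) -> 0.0 (the 1.0 was absorbed)
print(left, right)    # 1.0 0.0
```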
x86 FP Architecture
• Originally based on 8087 FP coprocessor
  • 8 x 80-bit extended-precision registers
  • Used as a push-down stack
  • Registers indexed from TOS: ST(0), ST(1), …
• FP values are 32-bit or 64-bit in memory
  • Converted on load/store of memory operand
  • Integer operands can also be converted on load/store
• Very difficult to generate and optimize code
  • Result: poor FP performance
x86 FP Instructions
Optional variations
• I: integer operand
• P: pop operand from stack
• R: reverse operand order
• But not all combinations allowed

Data transfer       Arithmetic           Compare          Transcendental
FILD  mem/ST(i)     FIADDP  mem/ST(i)    FICOMP           FPATAN
FISTP mem/ST(i)     FISUBRP mem/ST(i)    FIUCOMP          F2XM1
FLDPI               FIMULP  mem/ST(i)    FSTSW AX/mem     FCOS
FLD1                FIDIVRP mem/ST(i)                     FPTAN
FLDZ                FSQRT                                  FPREM
                    FABS                                   FSIN
                    FRNDINT                                FYL2X
Streaming SIMD Extension 2 (SSE2)
• Adds 8 × 128-bit registers (XMM)
  • Extended to 16 registers in AMD64/EM64T
• Can be used for multiple FP operands
  • 2 × 64-bit double precision
  • 4 × 32-bit single precision
  • Instructions operate on them simultaneously
• Single-Instruction Multiple-Data (SIMD)
Accurate Arithmetic
• IEEE Standard 754 specifies additional rounding control
  • Extra bits of precision (guard, round, sticky)
  • Choice of rounding modes
  • Allows programmer to fine-tune numerical behavior of a computation
• Not all FP units implement all options
  • Most programming languages and FP libraries just use defaults

Trade-off between hardware complexity, performance, and market requirements
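A small Python illustration of the default behavior (round-to-nearest, ties-to-even; note Python's round() applies ties-to-even in decimal, which mirrors the FP rounding rule):

```python
# 0.1, 0.2, and 0.3 have no exact binary representation, so each
# literal and the sum below are rounded to the nearest double:
total = 0.1 + 0.2
print(total)          # 0.30000000000000004
print(total == 0.3)   # False

# ties-to-even: halfway cases round to the nearest even value
print(round(0.5), round(1.5), round(2.5))  # 0 2 2
```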
Who Cares About FP Accuracy?
• Important for scientific code
• But for everyday consumer use?
  • "My bank balance is off by 0.0002¢!"
• The Intel Pentium FDIV bug
• The market expects accuracy
  • whether they need it or not
Q: Why didn't Intel call the Pentium the 586?
A: Because they added 486 and 100 on the Pentium and got 585.999983605.