Rounding: the process of obtaining the best possible floating-point representation for a given real value.
ANSI/IEEE standard: when a value lies exactly midway between two adjacent floating-point numbers, round to the one whose significand has an LSB of 0 (of two adjacent floating-point numbers, the significand of one must end in 0 and the other in 1). This is called round-to-nearest-even.
For example, when rounding to an integer, 3.5 and 4.5 are both rounded to 4, the closest even integer, under round-to-nearest-even.
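A quick way to see the tie-breaking rule in action is Python's built-in `round()`, which resolves midway cases toward the even integer. The helper name `rtne_examples` is only an illustration and not part of the original material.

```python
# Minimal sketch: Python's built-in round() breaks ties toward the even integer.
def rtne_examples():
    # All of these inputs are exactly representable, so each one is a true tie.
    return [round(x) for x in (2.5, 3.5, 4.5, 5.5, -2.5, -3.5)]

print(rtne_examples())  # [2, 4, 4, 6, -2, -4]
```

Note that 3.5 and 4.5 both map to 4, matching the example in the text.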
• Other rounding methods
– Round inward (toward 0): choose the nearest value in the direction of 0, that is, the one with the smaller magnitude.
– Round upward (toward +∞): choose the larger of the two possible values.
– Round downward (toward −∞): choose the smaller of the two possible values.
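The directed modes can be tried out with Python's `decimal` module, whose `ROUND_DOWN`, `ROUND_CEILING`, and `ROUND_FLOOR` modes correspond to rounding inward, upward, and downward. This is a decimal analogy to the binary case, rounding to two fractional digits; the helper `round_with` is an assumed name used only for illustration.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_CEILING, ROUND_FLOOR

def round_with(value: str, mode: str) -> Decimal:
    """Round a decimal string to two fractional digits using the given mode."""
    return Decimal(value).quantize(Decimal("0.01"), rounding=mode)

for v in ("2.675", "-2.675"):
    print(v,
          "inward:", round_with(v, ROUND_DOWN),       # toward 0
          "upward:", round_with(v, ROUND_CEILING),    # toward +infinity
          "downward:", round_with(v, ROUND_FLOOR))    # toward -infinity
```

For a positive value, inward and downward agree; for a negative value, inward and upward agree.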
Example 12.1 Rounding to the nearest integer
a. Consider the rounded even integer corresponding to a real signed-magnitude number x as a function rtnei(x). Plot this round-to-nearest-even-integer function for x in the range [-4,4].
b. Repeat part a for the function rtni(x), that is, the round-to-nearest-integer function, where the midway values are always rounded up.
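Before plotting, it can help to tabulate both functions at the midway points, since that is the only place they differ. The sketch below assumes that "rounded up" means toward +∞; if the intended meaning is "up in magnitude", the negative midway values of `rtni` would change. The `floor(x + 0.5)` formula is adequate for the small range [-4, 4] but is not a robust general-purpose implementation.

```python
import math

def rtnei(x: float) -> int:
    """Round to nearest integer; midway values go to the even integer."""
    return round(x)

def rtni(x: float) -> int:
    """Round to nearest integer; midway values go up (assumed: toward +infinity)."""
    return math.floor(x + 0.5)

# Sample the range [-4, 4] in steps of 0.5 so that every midway value appears.
for k in range(-8, 9):
    x = k / 2
    print(f"{x:5.1f}  rtnei={rtnei(x):3d}  rtni={rtni(x):3d}")
```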
Example 12.2 Directed rounding
a. Consider the inward-directed round corresponding to a real signed-magnitude number x as a function ritni(x). Plot this round-inward-to-nearest-integer function for x in the range [-4,4].
b. Repeat part a for the round-upward-to-nearest-integer rutni(x).
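The two directed functions in this example map directly onto standard library calls, which gives an easy way to check a hand-drawn plot. This is a sketch using the example's function names; the sample points are arbitrary.

```python
import math

def ritni(x: float) -> int:
    """Round inward: drop the fraction, moving toward 0."""
    return math.trunc(x)

def rutni(x: float) -> int:
    """Round upward: smallest integer not less than x."""
    return math.ceil(x)

for x in (-3.7, -2.1, -0.5, 0.5, 2.1, 3.7):
    print(f"{x:5.1f}  ritni={ritni(x):3d}  rutni={rutni(x):3d}")
```

The two functions agree for negative non-integers and differ by 1 for positive non-integers.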
12.2 Special Values and Exceptions
• Five special values in the ANSI/IEEE floating-point standard
– ±0: biased exponent = 0, significand = 0 (no hidden 1)
– ±∞: biased exponent = 255 (short) or 2047 (long), significand = 0
– NaN: biased exponent = 255 (short) or 2047 (long), significand ≠ 0
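The encodings above can be checked by unpacking the bit fields of a short (32-bit) value. The following is a minimal sketch using Python's `struct` module; the function name `classify_single` and the returned labels are illustrative assumptions.

```python
import struct

def classify_single(x: float) -> str:
    """Classify x after encoding it in the 32-bit (short) format."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                  # 1 sign bit
    exponent = (bits >> 23) & 0xFF     # 8-bit biased exponent
    fraction = bits & 0x7FFFFF         # 23-bit significand field
    if exponent == 0 and fraction == 0:
        return "-0" if sign else "+0"
    if exponent == 255 and fraction == 0:
        return "-inf" if sign else "+inf"
    if exponent == 255:
        return "NaN"
    return "other"                     # any value that is not one of the five special ones

for v in (0.0, -0.0, float("inf"), float("-inf"), float("nan"), 1.5):
    print(v, classify_single(v))
```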
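12.3 Floating-Point Addition
Consider the addition of \(\pm 2^{e_1} s_1\) and \(\pm 2^{e_2} s_2\), where \(e_1 > e_2\):
\[ (\pm 2^{e_1} s_1) + (\pm 2^{e_2} s_2) = \pm 2^{e_1} \left( s_1 \pm \frac{s_2}{2^{e_1 - e_2}} \right) \]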
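The alignment step in the addition identity (shifting the significand of the operand with the smaller exponent right by e1 − e2 positions) can be sketched with exact rational arithmetic. This is an illustrative model, not a hardware adder: the sign is folded into the significand, no rounding is applied, and the name `fp_add` is an assumption.

```python
from fractions import Fraction

def fp_add(e1, s1, e2, s2):
    """Add 2**e1 * s1 and 2**e2 * s2; return (e, s) with 1 <= |s| < 2, or (0, 0)."""
    if e1 < e2:                               # make operand 1 the larger-exponent one
        e1, s1, e2, s2 = e2, s2, e1, s1
    # Alignment preshift: divide s2 by 2**(e1 - e2), then add significands.
    s = Fraction(s1) + Fraction(s2) / Fraction(2) ** (e1 - e2)
    e = e1
    if s == 0:
        return 0, Fraction(0)
    while abs(s) >= 2:                        # normalize after a carry-out
        s /= 2
        e += 1
    while abs(s) < 1:                         # normalize after cancellation
        s *= 2
        e -= 1
    return e, s

# 2**3 * 1.5 + 2**1 * 1.25 = 12 + 2.5 = 14.5 = 2**3 * 1.8125
print(fp_add(3, Fraction(3, 2), 1, Fraction(5, 4)))
```

A real adder would also round the aligned sum back to the available significand width, which is where computation error enters.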
12.4 Other Floating-point Operations
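Multiplication of \(\pm 2^{e_1} s_1\) and \(\pm 2^{e_2} s_2\):
\[ (\pm 2^{e_1} s_1) \times (\pm 2^{e_2} s_2) = \pm 2^{e_1 + e_2} (s_1 \times s_2) \]
Division of \(\pm 2^{e_1} s_1\) by \(\pm 2^{e_2} s_2\):
\[ (\pm 2^{e_1} s_1) / (\pm 2^{e_2} s_2) = \pm 2^{e_1 - e_2} (s_1 / s_2) \]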
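Unlike addition, multiplication and division need no alignment: exponents are added or subtracted and significands are multiplied or divided, followed by at most a one-position normalization. The sketch below models this with exact rationals and omits rounding; `normalize`, `fp_mul`, and `fp_div` are assumed names.

```python
from fractions import Fraction

def normalize(e, s):
    """Bring a nonzero significand s into the range 1 <= |s| < 2, adjusting e."""
    while abs(s) >= 2:
        s /= 2
        e += 1
    while abs(s) < 1:
        s *= 2
        e -= 1
    return e, s

def fp_mul(e1, s1, e2, s2):
    # Exponents add; the product of two significands in [1, 2) lies in [1, 4).
    return normalize(e1 + e2, Fraction(s1) * Fraction(s2))

def fp_div(e1, s1, e2, s2):
    # Exponents subtract; the quotient of two significands in [1, 2) lies in (1/2, 2).
    return normalize(e1 - e2, Fraction(s1) / Fraction(s2))

print(fp_mul(3, Fraction(3, 2), 1, Fraction(3, 2)))  # 12 * 3 = 36 = 2**5 * 1.125
print(fp_div(3, Fraction(1), 1, Fraction(3, 2)))     # 8 / 3 = 2**1 * (4/3)
```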
12.5 Floating-Point Instructions
Figure 12.7 The common floating-point instruction format for MiniMIPS and components for arithmetic instructions. The extension (ex) field distinguishes single (* = s) from double (* = d) operands.
10 floating-point arithmetic instructions: 5 operations (add, subtract, multiply, divide, negate), each in a single and a double version
add.s $f0,$f8,$f10 # set $f0 to ($f8)+($f10)
add.d $f0,$f8,$f10 # set $f0,$f1 to ($f8,$f9)+($f10,$f11)
Single operands can be in any of the floating-point registers. Double operands must be specified in even-numbered registers; each occupies an even/odd register pair.
Figure 12.8 Floating-point instructions for format conversion in MiniMIPS.
6 format conversion instructions: integer to single/double, single to double, double to single, and single/double to integer
cvt.s.w $f0,$f8 # set $f0 to single (integer $f8)
cvt.d.w $f0,$f8 # set $f0 to double (integer $f8)
cvt.d.s $f0,$f8 # set $f0 to double ($f8)
cvt.s.d $f0,$f8 # set $f0 to single ($f8,$f9)
cvt.w.s $f0,$f8 # set $f0 to integer ($f8)
cvt.w.d $f0,$f8 # set $f0 to integer ($f8,$f9)
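Conversions toward a narrower format can lose information, which is worth keeping in mind when reading the conversion instructions above. The sketch below imitates a double-to-single conversion in Python by packing a value into 32 bits and unpacking it again. It only models the numeric effect, not the MiniMIPS instructions themselves, and the helper name `to_single` is an assumption.

```python
import struct

def to_single(x: float) -> float:
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# Double -> single -> double: 0.1 does not survive the round trip unchanged.
print(to_single(0.1) == 0.1)          # False
print(to_single(0.5) == 0.5)          # True: 0.5 is exact in both formats

# Integer -> single: 2**24 + 1 needs 25 significant bits, one more than single offers.
n = 2**24 + 1
print(int(to_single(float(n))) == n)  # False
```

Going from single to double, by contrast, is always exact because every single-precision value is representable in the double format.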
Figure 12.9 Instructions for floating-point data movement in MiniMIPS.
6 data transfer instructions: load/store word to/from coprocessor1, move single/double from one FP register to another, move (copy) between FP registers and CPU general registers.
lwc1 $f8,40($s3) # load mem[40+($s3)] into $f8
swc1 $f8,A($s3) # store ($f8) into mem[A+($s3)]
mov.s $f0,$f8 # load $f0 with ($f8)
mov.d $f0,$f8 # load $f0,$f1 with ($f8,$f9)
mfc1 $t0,$f12 # load $t0 with ($f12)
mtc1 $f8,$t4 # load $f8 with ($t4)
Figure 12.10 Floating-point branch and comparison instructions in MiniMIPS.
2 branch and 6 comparison instructions. The FP unit has a flag that is set to true or false by one of 6 comparisons (equal, less than, or less than or equal, each for the single or double data type).
bc1t L # branch on FP flag true
bc1f L # branch on FP flag false
c.eq.* $f0,$f8 # if ($f0)=($f8), set flag to true
c.lt.* $f0,$f8 # if ($f0)<($f8), set flag to true
c.le.* $f0,$f8 # if ($f0)≤($f8), set flag to true
Table 12.1 The 30 MiniMIPS floating-point instructions; because the op field contains 17 for all but two of the instructions (49 for lwc1 and 50 for swc1), it is not shown.
12.6 Result Precision and Errors
• FP arithmetic can be quite dangerous and must be used with proper care, because results of FP computations are inexact.
• Why?
– Many real numbers do not have an exact binary representation within a finite word format. This is referred to as representation error.
– Even for values that are exactly representable, FP arithmetic can produce inexact results. For example, the product of two short FP numbers has a 48-bit significand that must be rounded to 23 bits (plus the hidden 1). This is called computation error.
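Both error sources can be measured exactly in Python by comparing float results against rational arithmetic. The sketch assumes Python floats are IEEE double-precision values (the long format), so the widths differ from the short-format numbers quoted above; the helper names are illustrative.

```python
from fractions import Fraction

def representation_error(x: float, exact: Fraction) -> Fraction:
    """Difference between the stored float x and the intended real value."""
    return Fraction(x) - exact

def product_error(x: float, y: float) -> Fraction:
    """Difference between the rounded float product and the exact product."""
    return Fraction(x * y) - Fraction(x) * Fraction(y)

# Representation error: 1/10 has no finite binary expansion, 1/2 does.
print(representation_error(0.1, Fraction(1, 10)) != 0)   # True
print(representation_error(0.5, Fraction(1, 2)) == 0)    # True

# Computation error: x is exactly representable, but x*x needs 61 significand bits.
x = 1 + 2.0 ** -30
print(product_error(x, x))                               # -1/2**60
```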
Example 12.4
The associative law of addition does not hold in general in FP arithmetic. For example, with
\(a = -2^{5} \times (1.10101011)_{\text{two}}\)
\(b = 2^{5} \times (1.10101110)_{\text{two}}\)
\(c = -2^{-2} \times (1.01100101)_{\text{two}}\)
is \((a+b)+c = a+(b+c)\)?
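The question can be answered by simulating an adder that keeps 8 fractional significand bits, the width of the operands in this example. The sketch below assumes round-to-nearest-even after every addition and an unbounded exponent range; `fp_round` and `fp_add` are illustrative names, not part of the example.

```python
from fractions import Fraction

FRACTION_BITS = 8   # operands in the example have 8 fractional significand bits

def fp_round(v: Fraction) -> Fraction:
    """Round v to a normalized significand with FRACTION_BITS fractional bits."""
    if v == 0:
        return v
    e, m = 0, abs(v)
    while m >= 2:
        m /= 2
        e += 1
    while m < 1:
        m *= 2
        e -= 1
    q = round(m * 2 ** FRACTION_BITS)          # Fraction rounds ties to even
    result = Fraction(q, 2 ** FRACTION_BITS) * Fraction(2) ** e
    return result if v > 0 else -result

def fp_add(x: Fraction, y: Fraction) -> Fraction:
    """Exact sum followed by one rounding, as in a single FP addition."""
    return fp_round(x + y)

a = -Fraction(0b110101011, 2 ** 8) * 2 ** 5    # -2^5  * 1.10101011
b = Fraction(0b110101110, 2 ** 8) * 2 ** 5     #  2^5  * 1.10101110
c = -Fraction(0b101100101, 2 ** 8) / 2 ** 2    # -2^-2 * 1.01100101

print(fp_add(fp_add(a, b), c))   # 27/1024, i.e. 2^-6 * 1.1011
print(fp_add(a, fp_add(b, c)))   # 0
```

Under these assumptions the two groupings disagree: in a + (b + c), the small operand c is largely lost when it is aligned to b, and the rounded sum then cancels a exactly.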
Figure 12.11 Algebraically equivalent computations may yield different results with floating-point arithmetic.
• Using guard digits to avoid excessive error. For example, in a 10-digit calculator, 1/3 is represented as 0.333 333 333 3; multiplying by 3 yields 0.999 999 999 9, not 1.
However, in a calculator with 2 guard digits, 1/3 is held internally as 0.333 333 333 333 but still displayed as 0.333 333 333 3; multiplying by 3 yields 0.999 999 999 999, which is displayed as 1 after rounding to 10 digits.
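The calculator behavior can be imitated with Python's `decimal` module by computing at one precision and rounding to another for display. This is a sketch of the idea rather than a model of any particular calculator; the function name `third_times_three` and its parameters are assumptions.

```python
from decimal import Decimal, Context, ROUND_HALF_EVEN

def third_times_three(working_digits: int, display_digits: int = 10) -> Decimal:
    """Compute (1/3)*3 with working_digits digits, then round for display."""
    work = Context(prec=working_digits, rounding=ROUND_HALF_EVEN)
    display = Context(prec=display_digits, rounding=ROUND_HALF_EVEN)
    third = work.divide(Decimal(1), Decimal(3))
    product = work.multiply(third, Decimal(3))
    return display.plus(product)    # plus() applies the display precision

print(third_times_three(10))   # no guard digits: 0.9999999999
print(third_times_three(12))   # 2 guard digits: 1.000000000
```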