
CDA 3101 Fall 2013

Introduction to Computer Organization

Floating Point

Theory, Notation, MIPS

30 September, 2 October 2013

Overview

• Floating point numbers
• Scientific notation
  – Decimal scientific notation
  – Binary scientific notation
• IEEE 754 FP Standard
• Floating point representation inside a computer
  – Greater range vs. precision
• Decimal to Floating Point conversion
• Type is not associated with data
• MIPS floating point instructions, registers

Computer Numbers

• Computers are made to deal with numbers
• What can we represent in n bits?
  – Unsigned integers: 0 to 2^n - 1
  – Signed integers: -2^(n-1) to 2^(n-1) - 1
• What about other numbers?
  – Very large numbers? (seconds/century) 3,155,760,000ten (3.15576ten x 10^9)
  – Very small numbers? (atomic diameter) 0.00000001ten (1.0ten x 10^-8)
  – Rationals (repeating pattern): 2/3 (0.666666666...)
  – Irrationals: 2^(1/2) (1.414213562373...)
  – Transcendentals: e (2.718...), π (3.141...)
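The integer ranges above can be checked with a short C sketch. The function names are made up for illustration, and the shifts assume 1 <= n <= 63 so that every result fits in a 64-bit type.

```c
#include <stdint.h>

/* Largest unsigned value representable in n bits: 2^n - 1 (1 <= n <= 63). */
uint64_t unsigned_max(int n) { return (UINT64_C(1) << n) - 1; }

/* Smallest two's-complement value in n bits: -2^(n-1). */
int64_t signed_min(int n) { return -(INT64_C(1) << (n - 1)); }

/* Largest two's-complement value in n bits: 2^(n-1) - 1. */
int64_t signed_max(int n) { return (INT64_C(1) << (n - 1)) - 1; }
```

For n = 8 these give 255, -128 and 127. None of these ranges covers fractions or very large magnitudes, which is the motivation for floating point.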

Scientific Notation

6.02 x 10^23
(mantissa: 6.02, with the decimal point after the 6; radix (base): 10; exponent: 23)

• Normalized form: no leading 0s (exactly one digit to left of decimal point)
• Alternatives to representing 1/1,000,000,000
  – Normalized: 1.0 x 10^-9
  – Not normalized: 0.1 x 10^-8, 10.0 x 10^-10

Binary Scientific Notation

1.0two x 2^-1
(mantissa: 1.0two, with the “binary point” after the 1; radix (base): 2; exponent: -1)

• Floating point arithmetic
  – Binary point is not fixed (as it is for integers)
  – Declare such a variable in C as float

Floating Point Representation

• Normal format: +1.xxxxxxxxxxtwo * 2^yyyytwo
• Multiple of Word Size (32 bits)

Bits     31       30-23      22-0
Field    S        Exponent   Significand
Width    1 bit    8 bits     23 bits

• S represents Sign
• Exponent represents y’s
• Significand represents x’s
• Represent numbers as small as 2.0 x 10^-38 to as large as 2.0 x 10^38
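The three fields can be pulled out of a C float with shifts and masks. This is a minimal sketch, assuming float is the 32-bit IEEE 754 single-precision format (true on mainstream platforms, but not guaranteed by the C standard). The function names are illustrative.

```c
#include <stdint.h>
#include <string.h>

/* Copy the 32 bits of a float into an integer without changing them. */
uint32_t float_bits(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Bit 31: sign. */
unsigned sign_field(float f) { return float_bits(f) >> 31; }

/* Bits 30-23: the 8-bit exponent field, exactly as stored. */
unsigned exponent_field(float f) { return (float_bits(f) >> 23) & 0xFFu; }

/* Bits 22-0: the 23-bit significand field. */
uint32_t significand_field(float f) { return float_bits(f) & 0x7FFFFFu; }
```

memcpy is used instead of a pointer cast because it is the portable way to reinterpret the bytes of one type as another in C.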

Overflow and Underflow

• Overflow
  – Result is too large (> 2.0 x 10^38)
  – Exponent larger than can be represented in 8-bit Exponent field
• Underflow
  – Result is too small (> 0, < 2.0 x 10^-38)
  – Negative exponent larger than can be represented in 8-bit Exponent field
• How to reduce chances of overflow or underflow?
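Both conditions can be observed from C. The sketch below assumes IEEE 754 single precision with the default behavior, where an overflowing result becomes infinity and a result far below the smallest representable magnitude becomes zero. The helper names are hypothetical.

```c
#include <float.h>

/* Returns 1 if x * y overflows single precision (magnitude beyond FLT_MAX). */
int product_overflows(float x, float y) {
    volatile float r = x * y;   /* volatile: force the result to be rounded to float */
    return r > FLT_MAX || r < -FLT_MAX;
}

/* Returns 1 if x * y underflows to zero although both inputs are nonzero. */
int product_underflows_to_zero(float x, float y) {
    volatile float r = x * y;
    return x != 0.0f && y != 0.0f && r == 0.0f;
}
```

For example, 1.0e30f * 1.0e30f overflows and 1.0e-30f * 1.0e-30f underflows to zero, while the same products are representable in double.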

Double Precision FP

• Use two words (64 bits)
• C variable declared as double
• Represent numbers almost as small as 2.0 x 10^-308 to almost as large as 2.0 x 10^308
• Primary advantage is greater accuracy (52 bits)

Word 1 (32 bits): bit 31 = S (1 bit); bits 30-20 = Exponent (11 bits); bits 19-0 = Significand (20 bits)
Word 2 (32 bits): Significand (cont’d)
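The limits a C implementation actually uses are published in <float.h>. The following sketch prints them; on an IEEE 754 platform the normalized float range is roughly 1.18e-38 to 3.40e38 and the double range roughly 2.23e-308 to 1.80e308, so the figures on the slides are round approximations. The function name is illustrative.

```c
#include <float.h>
#include <stdio.h>

/* Print the significand width and normalized range of float and double.
   The *_MANT_DIG values count the implicit leading 1 (24 and 53 for IEEE 754). */
void print_fp_limits(void) {
    printf("float : %d significand bits, %e .. %e\n",
           FLT_MANT_DIG, FLT_MIN, FLT_MAX);
    printf("double: %d significand bits, %e .. %e\n",
           DBL_MANT_DIG, DBL_MIN, DBL_MAX);
}
```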

Floating Point Representation

Normalized scientific notation: +1.xxxxtwo * 2^yyyytwo

Single Precision (32 bits): bit 31 = S; bits 30-23 = Exponent; bits 22-0 = Significand
Double Precision (two 32-bit words): bit 31 = S; bits 30-20 = Exponent; bits 19-0 = Significand, continued in the second word

• Exponent: biased notation
• Significand: sign-magnitude notation
• Bias: 127 (SP), 1023 (DP)

IEEE 754 FP Standard

• Used in almost all computers (since 1980)
  – Porting of FP programs
  – Quality of FP computer arithmetic
• Sign bit: 1 means negative, 0 means positive
• Significand:
  – Leading 1 implicit for normalized numbers
  – 1 + 23 bits single, 1 + 52 bits double
  – always true: 0 <= Significand < 1
• 0 has no leading 1
  – Reserve exponent value 0 just for number 0

(-1)^S * (1 + Significand) * 2^Exp

IEEE 754 Exponent

• Use FP numbers even without FP hardware
  – Sort records with FP numbers using integer compares
• Break FP number into 3 parts: compare signs, then compare exponents, then compare significands
• Faster (single comparison, ideally)
  – Highest order bit is sign (negative < positive)
  – Exponent next, so big exponent => bigger #
  – Significand last: exponents same => bigger significand means bigger #
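The field order makes integer comparison work directly for non-negative values, which the sketch below demonstrates. It assumes 32-bit IEEE 754 floats. The order_key function is an assumed extension that is not on the slides: because the format is sign-magnitude, negative patterns have to be inverted before an unsigned compare orders them correctly. NaNs are not handled.

```c
#include <stdint.h>
#include <string.h>

static uint32_t float_bits(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* For two non-negative floats, comparing the raw bit patterns as unsigned
   integers gives the same answer as comparing the floats themselves. */
int less_nonnegative(float a, float b) {
    return float_bits(a) < float_bits(b);
}

/* Assumed extension (not from the slides): a sort key that also orders
   negative values. Negative patterns are inverted and positive ones get
   the sign bit set, so unsigned key order matches numeric order. */
uint32_t order_key(float f) {
    uint32_t bits = float_bits(f);
    return (bits & 0x80000000u) ? ~bits : (bits | 0x80000000u);
}
```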

Biased Notation for Exponents

• Two’s complement does not work for exponent
• Most negative exponent: 00000001two
• Most positive exponent: 11111110two
• Bias: number added to real exponent
  – 127 for single precision
  – 1023 for double precision
• Example: 1.0 * 2^-1

(-1)^S * (1 + Significand) * 2^(Exponent - Bias)

0 0111 1110 000 0000 0000 0000 0000 0000
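The biased formula can be applied to a raw bit pattern in a few lines of C. This sketch covers only normalized single-precision patterns (exponent field 1 to 254). The names pow2 and decode_normalized are illustrative, and the power of two is computed by a loop so that no math library is needed.

```c
#include <stdint.h>

/* 2^e as a double, by repeated multiplication or division. */
static double pow2(int e) {
    double r = 1.0;
    for (; e > 0; e--) r *= 2.0;
    for (; e < 0; e++) r /= 2.0;
    return r;
}

/* Value of a normalized single-precision pattern:
   (-1)^S * (1 + Significand) * 2^(Exponent - 127). */
double decode_normalized(uint32_t bits) {
    int      s        = (int)(bits >> 31);
    int      exponent = (int)((bits >> 23) & 0xFFu);
    uint32_t fraction = bits & 0x7FFFFFu;
    double   signif   = 1.0 + fraction / 8388608.0;   /* 8388608 = 2^23 */
    return (s ? -1.0 : 1.0) * signif * pow2(exponent - 127);
}
```

The pattern above, 0 0111 1110 followed by 23 zero bits, is 0x3F000000 in hexadecimal and decodes to 0.5, that is 1.0 * 2^-1.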

Significand

• Method 1 (Fractions):
  – In decimal: 0.340ten => 340ten/1000ten => 34ten/100ten
  – In binary: 0.110two => 110two/1000two (6ten/8ten) => 11two/100two (3ten/4ten)
  – Helps understand the meaning of the significand
• Method 2 (Place Values):
  – Convert from scientific notation
  – In decimal: 1.6732 = (1 x 10^0) + (6 x 10^-1) + (7 x 10^-2) + (3 x 10^-3) + (2 x 10^-4)
  – In binary: 1.1001 = (1 x 2^0) + (1 x 2^-1) + (0 x 2^-2) + (0 x 2^-3) + (1 x 2^-4)
  – Interpretation of value in each position extends beyond the decimal/binary point
  – Good for quickly calculating significand value
  – Use this method for translating FP numbers
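Method 2 translates directly into a loop over the digits. The sketch below sums place values for a binary numeral given as text. The function name and the string input format are assumptions made for the example.

```c
/* Sum the place values of a binary numeral such as "1.1001".
   Digits left of the point weigh 2^0, 2^1, ...; digits right of it
   weigh 2^-1, 2^-2, ... */
double binary_to_double(const char *s) {
    double value = 0.0;
    /* Integer part: each new digit doubles what came before. */
    for (; *s == '0' || *s == '1'; s++)
        value = value * 2.0 + (*s - '0');
    if (*s == '.') {
        double weight = 0.5;                /* 2^-1 */
        for (s++; *s == '0' || *s == '1'; s++) {
            value += (*s - '0') * weight;
            weight /= 2.0;                  /* next place: 2^-2, 2^-3, ... */
        }
    }
    return value;
}
```

For "1.1001" the sum is 1 + 0.5 + 0.0625 = 1.5625.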

Binary to Decimal FP

0 0110 1000 101 0101 0100 0011 0100 0010

• Sign: 0 => positive
• Exponent:
  – 0110 1000two = 104ten
  – Bias adjustment: 104 - 127 = -23
• Significand:
  – 1 + 1x2^-1 + 0x2^-2 + 1x2^-3 + 0x2^-4 + 1x2^-5 + ...
    = 1 + 2^-1 + 2^-3 + 2^-5 + 2^-7 + 2^-9 + 2^-14 + 2^-15 + 2^-17 + 2^-22
    = 1.0 + 0.666115
• Represents: 1.666115 * 2^-23 ~ 1.986 * 10^-7

Decimal to Binary FP (1/2)

• Simple Case: If denominator is a power of 2 (2, 4, 8, 16, etc.), then it’s easy.
• Example: Binary FP representation of -0.75
  – -0.75 = -3/4
  – -11two/100two = -0.11two
  – Normalized to -1.1two x 2^-1
  – (-1)^S x (1 + Significand) x 2^(Exponent-127)
  – (-1)^1 x (1 + .100 0000 ... 0000) x 2^(126-127)

1 0111 1110 100 0000 0000 0000 0000 0000
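The hand-derived fields can be assembled in C and compared with what the hardware stores for -0.75f. This is a sketch assuming 32-bit IEEE 754 floats; pack_fields and float_bits are illustrative names.

```c
#include <stdint.h>
#include <string.h>

/* Assemble a single-precision pattern from its three fields. */
uint32_t pack_fields(unsigned sign, unsigned exponent, uint32_t significand) {
    return ((uint32_t)(sign & 1u) << 31)
         | ((uint32_t)(exponent & 0xFFu) << 23)
         | (significand & 0x7FFFFFu);
}

/* Raw bits of a float, for comparison with the hand-built pattern. */
uint32_t float_bits(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

With sign 1, exponent 126 and significand 0x400000 (the bits .100 0000 ... 0000), pack_fields returns 0xBF400000, the same pattern as float_bits(-0.75f).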

Decimal to Binary FP (2/2)

• Denominator is not a power of 2
  – Number cannot be represented precisely
  – Lots of significand bits for precision
  – Difficult part: get the significand
• Rational numbers have a repeating pattern
• Conversion
  – Write out binary number with repeating pattern.
  – Cut it off after correct number of bits (different for single vs. double precision).
  – Derive sign, exponent and significand fields.

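Fraction part by repeated doubling (the integer part of each product is the next bit):

0.33333333 x 2 = 0.66666666  => 0
0.66666666 x 2 = 1.33333332  => 1
0.33333332 x 2 = 0.66666664  => 0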
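The repeated doubling can be written as a short loop. This sketch takes the fraction as a C double, so the input is itself already a rounded binary value, but the first few dozen digits of a fraction such as 1/3 still show the repeating pattern. The function name and output buffer convention are assumptions.

```c
/* Write the first nbits binary digits of frac (0 <= frac < 1) into out,
   which must have room for nbits + 1 characters. */
void fraction_bits(double frac, int nbits, char *out) {
    for (int i = 0; i < nbits; i++) {
        frac *= 2.0;                /* shift the binary point right by one */
        if (frac >= 1.0) {          /* the integer part is the next digit */
            out[i] = '1';
            frac -= 1.0;
        } else {
            out[i] = '0';
        }
    }
    out[nbits] = '\0';
}
```

Calling fraction_bits(1.0 / 3.0, 8, buf) fills buf with "01010101", the repeating pattern that gets cut off to fit the significand.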

Decimal to Binary

-3.333333... = -11.0101010...two => -1.1010101...two x 2^1

1. Significand: 101 0101 0101 0101 0101 0101
2. Sign: negative => 1
3. Exponent: 1 + 127 = 128ten = 1000 0000two

1 1000 0000 101 0101 0101 0101 0101 0101

Types and Data

0011 0100 0101 0101 0100 0011 0100 0010

The same 32 bits can be read as:
  – 1.986 * 10^-7 (floating point number)
  – 878,003,010 (integer)
  – “4UCB” (ASCII characters)
  – ori $s5, $v0, 17218 (MIPS instruction)

• Data can be anything; operation of instruction that accesses operand determines its type!
• Power/danger of unrestricted addresses/pointers:
  – Use ASCII as FP, instructions as data, integers as instructions, ...
  – Security holes in programs
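The same point can be reproduced in C by handing one 32-bit pattern to different interpretations. This is a minimal sketch assuming 32-bit IEEE 754 floats and ASCII characters; the helper names are illustrative. The pattern above is 0x34554342 in hexadecimal.

```c
#include <stdint.h>
#include <string.h>

/* Read the 32-bit pattern as an IEEE 754 single-precision float. */
float as_float(uint32_t bits) {
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Read the same pattern as four ASCII characters, most significant byte first. */
void as_ascii(uint32_t bits, char out[5]) {
    out[0] = (char)(bits >> 24);
    out[1] = (char)((bits >> 16) & 0xFFu);
    out[2] = (char)((bits >> 8) & 0xFFu);
    out[3] = (char)(bits & 0xFFu);
    out[4] = '\0';
}
```

Nothing in the bits records which reading is intended; the choice is made entirely by the code that accesses them.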

IEEE 754 Special Values

Number line, left to right:
Negative Overflow | -(1-2^-24)*2^128 | Expressible Negative Numbers | -.5*2^-127 | Negative Underflow | 0 | Positive Underflow | .5*2^-127 | Expressible Positive Numbers | (1-2^-24)*2^128 | Positive Overflow

Special Value          Exponent     Significand
+/- 0                  0000 0000    0
Denormalized number    0000 0000    Nonzero
NaN                    1111 1111    Nonzero
+/- infinity           1111 1111    0
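The table maps onto a small classifier over the exponent and significand fields. This sketch works on raw single-precision bit patterns; the enum and function names are made up for the example.

```c
#include <stdint.h>

typedef enum {
    KIND_ZERO, KIND_DENORMALIZED, KIND_NORMALIZED, KIND_INFINITY, KIND_NAN
} fp_kind;

/* Classify a single-precision bit pattern using the special-value table. */
fp_kind classify(uint32_t bits) {
    unsigned exponent    = (bits >> 23) & 0xFFu;
    uint32_t significand = bits & 0x7FFFFFu;
    if (exponent == 0)                       /* 0000 0000 */
        return significand == 0 ? KIND_ZERO : KIND_DENORMALIZED;
    if (exponent == 0xFFu)                   /* 1111 1111 */
        return significand == 0 ? KIND_INFINITY : KIND_NAN;
    return KIND_NORMALIZED;                  /* every other exponent */
}
```

The sign bit is ignored, which is why both +0 and -0, and both infinities, land in the same class.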

Value: Not a Number

• What is the result of: sqrt(-4.0) or 0/0?
  – If infinity is not an error, these shouldn’t be either.
  – Called Not a Number (NaN)
  – Exponent = 255, Significand nonzero
• Applications
  – NaNs help with debugging
  – They contaminate: op(NaN, X) = NaN
  – Don’t use NaN
  – Ask math majors
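The contamination rule is easy to observe. The sketch below builds a NaN from a bit pattern with exponent 255 and a nonzero significand, assuming 32-bit IEEE 754 floats, and detects NaN with the self-comparison test (a compiler run with aggressive "fast math" options may break that test). The helper names are illustrative.

```c
#include <stdint.h>
#include <string.h>

/* Build a NaN from its bit pattern: exponent 255, nonzero significand. */
float make_nan(void) {
    uint32_t bits = 0x7FC00000u;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* A NaN is the only value that compares unequal to itself. */
int is_nan(float x) { return x != x; }
```

Any arithmetic on the result of make_nan(), such as adding 1.0f or multiplying by 0.0f, again yields a NaN.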

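Value: Denorms

• Problem: There’s a gap among representable FP numbers around 0
  – Smallest pos num: a = 1.0...0two * 2^-126 = 2^-126
  – 2nd smallest pos num: b = 1.0...01two * 2^-126 = 2^-126 + 2^-149
  – a - 0 = 2^-126
  – b - a = 2^-149
  (Number line: the gap between 0 and a is far wider than the spacing between a and b.)
• Solution:
  – Denormalized numbers: no leading 1
  – Smallest pos num: a = 2^-149
  – 2nd smallest num: b = 2^-148
  (Number line: the denormalized numbers fill the gap around 0 with evenly spaced values.)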
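The spacing around zero can be measured by stepping through adjacent bit patterns. This sketch assumes 32-bit IEEE 754 floats with denormalized numbers enabled (the default on mainstream hardware unless flush-to-zero is switched on). FLT_MIN is 2^-126 and FLT_EPSILON is 2^-23, so their product is 2^-149. The function names are illustrative.

```c
#include <float.h>
#include <stdint.h>
#include <string.h>

/* Interpret a 32-bit pattern as a float. */
static float from_bits(uint32_t bits) {
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Distance from the float with pattern `bits` to the next larger float. */
float gap_above(uint32_t bits) {
    return from_bits(bits + 1) - from_bits(bits);
}
```

With denormalized numbers, gap_above(0) (the step from zero to the smallest positive value) equals gap_above(0x00800000) (the step just above 2^-126): both are 2^-149, so there is no longer a wide gap next to zero.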

Rounding

• Math on real numbers => rounding
• Rounding also occurs when converting types
  – Double <=> single precision <=> integer
• Round towards +infinity
  – ALWAYS round “up”: 2.001 => 3; -2.001 => -2
• Round towards -infinity
  – ALWAYS round “down”: 1.999 => 1; -1.999 => -2
• Truncate
  – Just drop the last bits (round towards 0)
• Round to (nearest) even (default)
  – 2.5 => 2; 3.5 => 4
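Two of these modes show up in ordinary C conversions. The sketch below assumes IEEE 754 types in the default rounding mode: narrowing a double to float rounds to nearest with ties going to the even significand, and converting to int truncates. The function names are illustrative. The test values sit exactly halfway between two adjacent floats, the binary analogue of 2.5 and 3.5.

```c
#include <stdint.h>
#include <string.h>

/* Convert a double to float and return the float's bit pattern. */
uint32_t narrowed_bits(double d) {
    volatile float f = (float)d;   /* volatile: force the narrowing to happen */
    float copy = f;
    uint32_t bits;
    memcpy(&bits, &copy, sizeof bits);
    return bits;
}

/* Converting to an integer type in C truncates (rounds toward 0). */
int truncate_to_int(double d) { return (int)d; }
```

With h = 2^-24 (half the float spacing just above 1.0), 1.0 + h narrows down to 0x3F800000 and 1.0 + 3h narrows up to 0x3F800002; in both cases the tie goes to the pattern whose last significand bit is 0.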

FP Fallacy

• FP Add, subtract associative: FALSE!
  – x = -1.5 x 10^38, y = 1.5 x 10^38, and z = 1.0
  – x + (y + z) = -1.5x10^38 + (1.5x10^38 + 1.0) = -1.5x10^38 + (1.5x10^38) = 0.0
  – (x + y) + z = (-1.5x10^38 + 1.5x10^38) + 1.0 = (0.0) + 1.0 = 1.0
• Floating Point add, subtract are not associative!
  – Why? FP result approximates real result
  – 1.5 x 10^38 is so much larger than 1.0 that 1.5 x 10^38 + 1.0 in floating point representation is still 1.5 x 10^38
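The example runs as written in C single precision. In this sketch the partial sums are stored in volatile float variables so that each one is rounded to single precision before it is used. The function names are illustrative.

```c
/* Evaluate x + (y + z) in single precision. */
float sum_right_first(float x, float y, float z) {
    volatile float t = y + z;   /* volatile: round the partial sum to float */
    return x + t;
}

/* Evaluate (x + y) + z in single precision. */
float sum_left_first(float x, float y, float z) {
    volatile float t = x + y;
    return t + z;
}
```

With x = -1.5e38f, y = 1.5e38f and z = 1.0f, sum_right_first returns 0.0 and sum_left_first returns 1.0.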

Computational Errors with FP

FP Addition / Subtraction

• Much more difficult than with integers
• Can’t just add significands
• Algorithm
  – De-normalize to match exponents
  – Add (subtract) significands to get resulting one
  – Keep the same exponent
  – Normalize (possibly changing exponent)
• Note: If signs differ, just perform a subtract instead.
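The four steps can be sketched in C for the simplest case. This is a deliberately reduced model, not how the hardware is specified: it handles positive operands only, drops shifted-out bits instead of rounding, and ignores overflow and special values. The fp_parts type and fp_add name are assumptions for the example; a significand holds 24 bits with the leading 1 made explicit in bit 23.

```c
#include <stdint.h>

/* A positive value significand * 2^(exponent - 23). A normalized
   significand lies in [2^23, 2^24), so bit 23 is the leading 1. */
typedef struct { uint32_t significand; int exponent; } fp_parts;

fp_parts fp_add(fp_parts a, fp_parts b) {
    /* Make a the operand with the larger exponent. */
    if (a.exponent < b.exponent) { fp_parts t = a; a = b; b = t; }

    /* De-normalize b: shift its significand right until exponents match. */
    int shift = a.exponent - b.exponent;
    b.significand = (shift < 32) ? (b.significand >> shift) : 0;

    /* Add significands, keeping the shared exponent. */
    fp_parts sum = { a.significand + b.significand, a.exponent };

    /* Normalize: a carry out of bit 23 means shift right, bump exponent. */
    if (sum.significand >= (1u << 24)) {
        sum.significand >>= 1;
        sum.exponent += 1;
    }
    return sum;
}
```

For example, 1.5 * 2^0 is {0xC00000, 0} and 1.0 * 2^-1 is {0x800000, -1}; their sum comes back as {0x800000, 1}, which is 1.0 * 2^1 = 2.0.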

MIPS FP Architecture (1/2)

• Separate floating point instructions:
  – Single Precision: add.s, sub.s, mul.s, div.s
  – Double Precision: add.d, sub.d, mul.d, div.d
• These instructions are far more complicated than their integer counterparts
• Problems:
  – It’s inefficient to have different instructions take vastly differing amounts of time.
  – Generally, a particular piece of data will not change from FP to int, or vice versa, within a program.
  – Some programs do not do floating point calculations
  – It takes lots of hardware relative to integers to do FP fast

MIPS FP Architecture (2/2)

• 1990 Solution: separate chip that handles only FP.
• Coprocessor 1: FP chip
  – Contains 32 32-bit registers: $f0, $f1, …
  – Most registers specified in .s and .d instructions ($f)
  – Separate load and store: lwc1 and swc1 (“load word coprocessor 1”, “store …”)
  – Double Precision: by convention, even/odd pair contain one DP FP number: $f0/$f1, … , $f30/$f31
• 1990 Computer contains multiple separate chips:
  – Processor: handles all the normal stuff
  – Coprocessor 1: handles FP and only FP
• Move data between main processor and coprocessors:
  – mfc0, mtc0, mfc1, mtc1, etc.

Floating Point Hardware (FP Add)

C => MIPS

float f2c (float fahr) {
    return ((5.0 / 9.0) * (fahr - 32.0));
}

# Argument fahr arrives in $f12; the result is returned in $f0.
f2c:  lwc1  $f16, const5($gp)    # $f16 = 5.0
      lwc1  $f18, const9($gp)    # $f18 = 9.0
      div.s $f16, $f16, $f18     # $f16 = 5.0/9.0
      lwc1  $f20, const32($gp)   # $f20 = 32.0
      sub.s $f20, $f12, $f20     # $f20 = fahr - 32.0
      mul.s $f0, $f16, $f20      # $f0 = (5/9)*(fahr-32)
      jr    $ra                  # return

Conclusion

• Floating Point numbers approximate values that we want to use.
• IEEE 754 Floating Point Standard is most widely accepted attempt to standardize FP arithmetic
• New MIPS architectural elements
  – Registers ($f0-$f31)
  – Single Precision (32 bits, 2x10^-38 … 2x10^38)
    • add.s, sub.s, mul.s, div.s
  – Double Precision (64 bits, 2x10^-308 … 2x10^308)
    • add.d, sub.d, mul.d, div.d
• Type is not associated with data, bits have no meaning unless given in context (e.g., int vs. float)

Date post:	17-Jan-2016
Upload:	eleanore-roberts