Floating Point Arithmetic

TABLE OF CONTENTS
o History of Floating Point
o Defining Floating Point Arithmetic
o Floating Point Representation
o Floating Point Format
o Floating Point Precisions
o Floating Point Operation
o Special values
o Error Analysis
o Exception Handling
o FPU Data Register Stack

IA32 FLOATING POINT

History
o 8086: first processor to implement IEEE FP, using a separate 8087 FPU (floating point unit)
o 486: merged FPU and Integer Unit onto one chip

Summary
o Hardware to add, multiply, and divide
o Floating point data registers
o Various control & status registers

Floating Point Formats
o single precision (C float): 32 bits
o double precision (C double): 64 bits
o extended precision (C long double): 80 bits

DEFINING FLOATING POINT ARITHMETIC
o Representable numbers
  o Scientific notation: +/- d.d…d x r^exp
  o sign bit +/-
  o radix r (usually 2 or 10, sometimes 16)
  o significand d.d…d (how many base-r digits d?)
  o exponent exp (range?)
  o others?
o Operations:
  o arithmetic: +,-,x,/,...
    o how to round result to fit in format
  o comparison (<, =, >)
  o conversion between different formats
    o short to long FP numbers, FP to integer
  o exception handling
    o what to do for 0/0, 2*largest_number, etc.
  o binary/decimal conversion
    o for I/O, when radix not 10
o Language/library support for these operations

FLOATING POINT REPRESENTATION

o Floating point is a system for representing real numbers that supports a wide range of values.

o A floating-point number is one in which the radix point can be in any position relative to the significant digits.

o Example: A memory location set aside for a floating-point number can store 0.735, 62.3, or 1200.

IN COMPARISON WITH:
o Radix point – or radix character, the symbol used in numerical representations to separate the integer part of a number (to the left of the radix point) from its fractional part (to the right of the radix point). Radix point is a general term that applies to all number bases.

Ex: In base 10 (decimal): 13.625 (decimal point)
    In base 2 (binary): 1101.101 (binary point)

o Fixed point - a number in which the position of the decimal point is fixed. A fixed-point memory location can only accommodate a specific number of decimal places, usually 2 (for currency) or none (for integers). For example, amounts of money in U.S. currency can always be represented as numbers with exactly two digits to the right of the point (1.00, 123.45, 0.76, etc.).
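
The two ideas above can be tried out in a few lines. This is only a sketch: the helper name to_binary and the sample cent amounts are assumptions made for illustration, not part of the original slides. It converts 13.625 to its binary-point form and keeps a currency amount as an integer count of cents, which is one common way to store a fixed-point value with two decimal places.

```python
from fractions import Fraction

def to_binary(value: Fraction, max_frac_bits: int = 16) -> str:
    """Render a non-negative rational number in base 2 with a binary point."""
    integer_part = value.numerator // value.denominator
    frac = value - integer_part
    bits = []
    while frac and len(bits) < max_frac_bits:
        frac *= 2
        bit = frac.numerator // frac.denominator  # next bit right of the point
        bits.append(str(bit))
        frac -= bit
    return f"{integer_part:b}." + "".join(bits or ["0"])

binary_text = to_binary(Fraction("13.625"))

# Fixed point: keep money as an integer number of cents (two implied decimals).
price_cents = 12345   # represents 123.45
fee_cents = 76        # represents 0.76
total_cents = price_cents + fee_cents
total_text = f"{total_cents // 100}.{total_cents % 100:02d}"
```

Because the cent counts are integers, the sum is exact; the position of the point is fixed by convention rather than stored with the number.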


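o General form: ±S × B^(±E)

o Binary case: the fields are stored, from left to right, as

  Sign bit | Biased Exponent | Significand or Mantissa

where:
S is the significand (also called the fraction or mantissa).
E is the exponent.
B is the base (2 in the binary case).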

IEEE 754: FLOATING POINT IN MODERN COMPUTER

o The IEEE has standardized the computer representation of binary floating-point numbers in IEEE 754. This standard is followed by almost all modern machines.

IEEE 754: FLOATING POINT FORMAT
IEEE 754 format
– Defines single and double precision formats (32 and 64 bits)
– Standardizes formats across many different platforms
– Radix 2
– Single
  » Range 10^-38 to 10^+38
  » 8-bit exponent with 127 bias
  » 23-bit mantissa
– Double
  » Range 10^-308 to 10^+308
  » 11-bit exponent with 1023 bias
  » 52-bit mantissa

IEEE 754 FORMAT PARAMETERS

FLOATING POINT PRECISIONS
IEEE 754:
o 16-bit: Half (binary16)
o 32-bit: Single (binary32), decimal32
o 64-bit: Double (binary64), decimal64
o 128-bit: Quadruple (binary128), decimal128

Other:
o Minifloat
o Extended precision
o Arbitrary-precision

FLOATING POINT PRECISIONS
o Single precision, called "float" in the C language family, and "real" or "real*4" in Fortran. This is a binary format that occupies 32 bits (4 bytes) and its significand has a precision of 24 bits (about 7 decimal digits).

o Double precision, called "double" in the C language family, and "double precision" or "real*8" in Fortran. This is a binary format that occupies 64 bits (8 bytes) and its significand has a precision of 53 bits (about 16 decimal digits).

o The other basic formats are quadruple precision (128-bit) binary, as well as 64-bit (decimal64) and 128-bit (decimal128) decimal floating point.

PRECISION CONSIDERATIONS
o Guard Bits – prior to a floating-point operation, the exponent and significand of each operand are loaded into ALU registers. The registers are longer than the significand; the additional bits, called guard bits, pad out the right end of the significand with 0s.

o Rounding – another detail that affects the precision of the result is the rounding policy. The result of any operation on the significands is generally stored in a longer register, so it must be rounded when it is put back into the floating-point format.

THE USE OF GUARD BITS

THE STANDARD LISTS FOUR ALTERNATIVE APPROACHES:
o Round to nearest: The result is rounded to the nearest representable number.
o Round toward +∞: The result is rounded up toward plus infinity.
o Round toward -∞: The result is rounded down toward negative infinity.
o Round toward 0: The result is rounded toward zero.
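
The four directions are easiest to compare side by side. Python does not expose the rounding mode of binary hardware floats, so the sketch below uses the standard decimal module as a stand-in; the mapping of names to decimal constants and the helper round_all are assumptions for illustration. ROUND_HALF_EVEN is used for round to nearest because it breaks ties toward the even digit, as the IEEE 754 default does.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_CEILING, ROUND_FLOOR, ROUND_DOWN

# Assumed mapping from the four rounding directions to decimal-module constants.
MODES = {
    "nearest": ROUND_HALF_EVEN,
    "toward +inf": ROUND_CEILING,
    "toward -inf": ROUND_FLOOR,
    "toward 0": ROUND_DOWN,
}

def round_all(text, places="0.01"):
    """Round one decimal literal to two places under each of the four modes."""
    value = Decimal(text)
    return {name: str(value.quantize(Decimal(places), rounding=mode))
            for name, mode in MODES.items()}

positive = round_all("2.346")
negative = round_all("-2.346")
```

Note that the modes only disagree in a predictable way: toward +∞ and toward -∞ swap roles when the sign changes, while toward 0 always discards the extra digits.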

INTERNAL REPRESENTATION
Floating-point numbers are typically packed into a computer datum as the sign bit, the exponent field, and the significand (mantissa), from left to right. For the IEEE 754 binary formats they are apportioned as follows:

Format   Sign  Exponent  Significand (stored)  Total bits
Single   1     8         23                    32
Double   1     11        52                    64
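
The packing can be inspected from Python with the standard struct module. A minimal sketch, assuming the usual field widths (1/8/23 bits for single, 1/11/52 bits for double); the function names decode_single and decode_double are made up for this example.

```python
import struct

def decode_double(x: float):
    """Split a double into (sign, biased exponent, stored fraction bits)."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63
    biased_exp = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    return sign, biased_exp, fraction

def decode_single(x: float):
    """Same split for the 32-bit format (x is first rounded to single)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 31, (bits >> 23) & 0xFF, bits & ((1 << 23) - 1)

one = decode_double(1.0)
minus_two = decode_double(-2.0)
small = decode_single(0.15625)   # 0.15625 = 1.25 x 2^-3
```

For a normalized number the leading 1 of the significand is not stored, so 1.0 shows a fraction field of 0 and a biased exponent equal to the bias itself.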

IEEE STANDARD FOR BINARY FLOATING-POINT ARITHMETIC

IEEE 754 goes beyond the simple definition of a format to lay down specific practices and procedures so that floating-point arithmetic produces uniform, predictable results independent of the hardware platform.

FLOATING POINT OPERATION
EXPONENT OVERFLOW
o A positive exponent exceeds the maximum possible exponent value. In some systems, this may be designated as +∞ or -∞.

EXPONENT UNDERFLOW
o A negative exponent is less than the minimum possible exponent value (e.g., -200 is less than -127). This means that the number is too small to be represented, and it may be reported as 0.

SIGNIFICAND UNDERFLOW
o In the process of aligning significands, digits may flow off the right end of the significand. Some form of rounding is then required.

SIGNIFICAND OVERFLOW
o The addition of two significands of the same sign may result in a carry out of the most significant bit. This can be fixed by realignment: the significand is shifted right and the exponent is incremented.

FLOATING-POINT: ADDITION AND SUBTRACTION (Z ← X ± Y)
PHASE 1: ZERO CHECK
o Addition and subtraction are identical except for a sign change, so the process begins by changing the sign of the subtrahend if it is a subtract operation. Next, if either operand is 0, the other is reported as the result.

PHASE 2: SIGNIFICAND ALIGNMENT
o The next phase is to manipulate the numbers so that the two exponents are equal. The significand of the number with the smaller exponent is shifted right, and its exponent incremented, until the exponents match.

PHASE 3: ADDITION
o The two significands are added together, taking into account their signs. Because the signs may differ, the result may be 0. There is also the possibility of significand overflow by 1 digit. If so, the result is shifted right and the exponent is incremented. An exponent overflow could occur as a result; this would be reported and the operation halted.

PHASE 4: NORMALIZATION
o The final phase normalizes the result. Normalization consists of shifting significand digits left until the most significant digit (one bit, or 4 bits for a base-16 exponent) is nonzero. Each shift decrements the exponent.
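
The four phases can be traced on a toy format. This is a sketch under stated assumptions, not how any particular FPU is built: numbers are (sign, exponent, significand) triples with an 8-bit integer significand (the width P is an arbitrary choice), bits shifted out during alignment are simply dropped rather than rounded, and exponent overflow or underflow is not modelled because Python integers are unbounded.

```python
P = 8  # significand width in bits for this toy format (an assumption)

def fp_add(x, y, subtract=False):
    """Add or subtract triples whose value is (-1)**sign * significand * 2**exponent."""
    sx, ex, mx = x
    sy, ey, my = y
    # Phase 1: flip the sign of the subtrahend, then check for a zero operand.
    if subtract:
        sy ^= 1
    if mx == 0:
        return (sy, ey, my)
    if my == 0:
        return (sx, ex, mx)
    # Phase 2: align by shifting the operand with the smaller exponent right.
    if ex < ey:
        mx >>= ey - ex
        ex = ey
    elif ey < ex:
        my >>= ex - ey
        ey = ex
    # Phase 3: add the significands, taking their signs into account.
    total = (-mx if sx else mx) + (-my if sy else my)
    if total == 0:
        return (0, 0, 0)
    sign, mag, exp = (1 if total < 0 else 0), abs(total), ex
    if mag >> P:          # significand overflow by one bit
        mag >>= 1
        exp += 1
    # Phase 4: normalize by shifting left until the top bit is set.
    while mag < (1 << (P - 1)):
        mag <<= 1
        exp -= 1
    return (sign, exp, mag)

def value(t):
    """Convert a triple back to a Python float for checking."""
    s, e, m = t
    return (-1) ** s * m * 2.0 ** e

x = (0, -5, 0b11000000)   # 6.0
y = (0, -7, 0b10100000)   # 1.25
```

For example, fp_add(x, y) aligns y to exponent -5 before adding, and fp_add(x, x) exercises the significand-overflow path in phase 3.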

FLOATING-POINT ADDITION AND SUBTRACTION (Z ← X ± Y)

FLOATING-POINT: MULTIPLICATION (Z ← X * Y)

FLOATING-POINT: DIVISION (Z ← X/Y)

SPECIAL VALUES
o Signed zero
In the IEEE 754 standard, zero is signed, meaning that there exist both a "positive zero" (+0) and a "negative zero" (−0).

o Subnormal numbers
Subnormal values fill the underflow gap with values where the absolute distance between them is the same as for adjacent values just outside the underflow gap.

o Infinities
The infinities of the extended real number line can be represented in IEEE floating point data types, just like ordinary floating point values such as 1 or 1.5. They are not error values in any way, though they are often (but not always, as it depends on the rounding) used as replacement values when there is an overflow. Upon a divide by zero exception, a positive or negative infinity is returned as an exact result. An infinity can also be introduced as a numeral (like C's "INFINITY" macro, or "∞" if the programming language allows that syntax).

o NaNs
IEEE 754 specifies a special value called "Not a Number" (NaN) to be returned as the result of certain "invalid" operations, such as 0/0, ∞×0, or sqrt(−1). The representation of NaNs specified by the standard has some unspecified bits that could be used to encode the type of error, but there is no standard for that encoding. In theory, signaling NaNs could be used by a runtime system to extend the floating-point numbers with other special values, without slowing down the computations with ordinary values. Such extensions do not seem to be common, though.
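
Signed zero and the even spacing of subnormals can both be observed on ordinary Python floats, which are IEEE 754 doubles on common platforms. A minimal sketch; the helper next_up is a made-up name that steps to the next double by incrementing the bit pattern, and it assumes the platform does not flush subnormals to zero.

```python
import math
import struct
import sys

def next_up(x: float) -> float:
    """Next representable double above a non-negative finite x."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return struct.unpack(">d", struct.pack(">Q", bits + 1))[0]

# Signed zero: the two zeros compare equal but carry different sign bits.
zeros_equal = (0.0 == -0.0)
sign_of_neg_zero = math.copysign(1.0, -0.0)

# Subnormals: the spacing just below the smallest normal number
# equals the spacing just above it.
smallest_normal = sys.float_info.min      # 2**-1022
smallest_subnormal = next_up(0.0)         # 2**-1074
gap_above = next_up(smallest_normal) - smallest_normal
gap_below = smallest_normal - (smallest_normal - smallest_subnormal)
```

Without subnormals, the gap between 0 and the smallest normal number would be far wider than the gap between that number and its upper neighbour.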

IEEE FLOATING POINT ARITHMETIC STANDARD 754 - NAN (NOT A NUMBER)
o NAN: Sign bit, nonzero significand, maximum exponent
o Invalid Exception
  o occurs when exact result not a well-defined real number
  o 0/0
  o sqrt(-1)
  o infinity-infinity, infinity/infinity, 0*infinity
  o NAN + 3
  o NAN > 3?
  o Return a NAN in all these arithmetic cases (a comparison such as NAN > 3 yields false)
o Two kinds of NANs
  o Quiet - propagates without raising an exception
    o good for indicating missing data
    o Ex: max(3,NAN) = 3
  o Signaling - generates an exception when touched
    o good for detecting uninitialized data
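
Quiet NaN behaviour is visible in plain Python floats. One caveat, stated as a property of the language rather than of IEEE 754: Python raises ZeroDivisionError for 0.0/0.0 and ValueError for math.sqrt(-1) instead of returning a NaN, so the sketch below sticks to operations on infinities and existing NaNs.

```python
import math

nan = float("nan")
inf = math.inf

# Invalid operations on infinities, and arithmetic on an existing NaN, give NaN.
produced = [inf - inf, inf / inf, inf * 0.0, nan + 3]
all_nan = all(math.isnan(v) for v in produced)

# Comparisons involving NaN are unordered: every ordered test is False.
comparisons = (nan > 3, nan < 3, nan == nan)
not_equal_to_itself = (nan != nan)
```

The last line is the classic portable NaN test: a NaN is the only floating-point value that is not equal to itself.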

OPERATIONS THAT PRODUCE A QUIET NAN

IEEE FLOATING POINT ARITHMETIC STANDARD 754 - NORMALIZED NUMBERS
o Normalized Nonzero Representable Numbers: +- 1.d…d x 2^exp
  o Macheps = Machine epsilon = 2^(-#significand bits) = relative error in each operation
  o OV = overflow threshold = largest number
  o UN = underflow threshold = smallest normalized number
o +- Zero: +-, significand and exponent all zero
  o Why bother with -0 later

Format           # bits  # significand bits  macheps            # exponent bits  exponent range
---------------  ------  ------------------  -----------------  ---------------  -------------------------------
Single           32      23+1                2^-24 (~10^-7)     8                2^-126 - 2^127 (~10^+-38)
Double           64      52+1                2^-53 (~10^-16)    11               2^-1022 - 2^1023 (~10^+-308)
Double Extended  >=80    >=64                <=2^-64 (~10^-19)  >=15             2^-16382 - 2^16383 (~10^+-4932)
(Double Extended is 80 bits on Intel machines)
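
The double-precision row can be checked against Python's sys.float_info. One naming caveat: sys.float_info.epsilon is the gap between 1.0 and the next double (2^-52), which is twice the macheps used in the table (2^-53, the largest relative rounding error). The variable names below are chosen for this sketch.

```python
import sys

macheps = 2.0 ** -53                       # as defined in the table for double
spacing = sys.float_info.epsilon           # 2**-52, gap between 1.0 and the next double

absorbed = (1.0 + macheps == 1.0)          # exact tie, rounds to even: back to 1.0
visible = (1.0 + spacing > 1.0)            # one full gap is representable

overflow_threshold = sys.float_info.max    # OV
underflow_threshold = sys.float_info.min   # UN, smallest normalized double
```

Both conventions for "machine epsilon" appear in the literature, so it is worth checking which one a library or textbook means.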

IEEE FLOATING POINT ARITHMETIC STANDARD 754 - “DENORMS”
o Denormalized Numbers: +-0.d…d x 2^min_exp
  o sign bit, nonzero significand, minimum exponent
  o Fills in gap between UN and 0
o Underflow Exception
  o occurs when exact nonzero result is less than underflow threshold UN
  o Ex: UN/3
  o return a denorm, or zero
o Why bother?
  o Necessary so that following code never divides by zero
  o if (a != b) then x = a/(a-b)
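
The guard in the last bullet can be exercised with two doubles that sit just above the underflow threshold. The values of a and b are chosen for this sketch so that a - b falls below UN; with denorms the difference is still nonzero and the division is safe.

```python
import sys

UN = sys.float_info.min       # underflow threshold, 2**-1022
a = 1.25 * UN
b = UN
difference = a - b            # 2**-1024: below UN, so only a denorm can hold it
x = a / difference if a != b else None
```

On a system that flushed such results to zero, a != b would hold while a - b was 0, and the guarded division would still fail.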

IEEE FLOATING POINT ARITHMETIC STANDARD 754 - +- INFINITY
o +- Infinity: Sign bit, zero significand, maximum exponent
o Overflow Exception
  o occurs when exact finite result too large to represent accurately
  o Ex: 2*OV
  o return +- infinity
o Divide by zero Exception
  o return +- infinity = 1/+-0
  o sign of zero important! Example later…
o Also return +- infinity for
  o 3+infinity, 2*infinity, infinity*infinity
  o Result is exact, not an exception!
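
These cases can be reproduced with Python floats, with one caveat: Python itself raises ZeroDivisionError for a float division by zero instead of returning an infinity. The helper ieee_divide below is therefore a hypothetical wrapper, written for this sketch, that returns what the IEEE 754 default (non-trapping) behaviour would give.

```python
import math
import sys

OV = sys.float_info.max
overflowed = 2 * OV                        # exact result too large: +infinity
exact_cases = [3 + math.inf, 2 * math.inf, math.inf * math.inf]

def ieee_divide(x: float, y: float) -> float:
    """Division with IEEE 754 default results for a zero divisor (hypothetical helper)."""
    if y == 0.0:
        if x == 0.0 or math.isnan(x):
            return math.nan                # 0/0 is invalid, not divide-by-zero
        negative = math.copysign(1.0, x) != math.copysign(1.0, y)
        return -math.inf if negative else math.inf
    return x / y

plus = ieee_divide(1.0, 0.0)
minus = ieee_divide(1.0, -0.0)
```

The pair plus and minus shows why the sign of zero matters: the same numerator gives opposite infinities depending on which zero it is divided by.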

ERROR ANALYSIS
o Basic error formula
  o fl(a op b) = (a op b)*(1 + d) where
    o op one of +,-,*,/
    o |d| <= ε = machine epsilon = macheps
    o assuming no overflow, underflow, or divide by zero
o Example: adding 4 numbers
  fl(x1+x2+x3+x4) = {[(x1+x2)*(1+d1) + x3]*(1+d2) + x4}*(1+d3)
                  = x1*(1+d1)*(1+d2)*(1+d3) + x2*(1+d1)*(1+d2)*(1+d3)
                    + x3*(1+d2)*(1+d3) + x4*(1+d3)
                  = x1*(1+e1) + x2*(1+e2) + x3*(1+e3) + x4*(1+e4)
  where each |ei| <~ 3*macheps
  o get exact sum of slightly changed summands xi*(1+ei)
o Backward Error Analysis - algorithm called numerically stable if it gives the exact result for slightly changed inputs
o Numerical Stability is an algorithm design goal
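
The bound for the four-number sum can be checked numerically. The sketch below uses exact rational arithmetic from the standard fractions module as the reference; the function name backward_error_check and the sample inputs are assumptions for illustration. It relies on the consequence of the formula above that the total error is at most max|ei| times the sum of the |xi|, with |ei| <= (1 + macheps)^3 - 1, which is about 3*macheps.

```python
from fractions import Fraction

macheps = Fraction(1, 2 ** 53)   # double precision

def backward_error_check(xs):
    """Compare a left-to-right float sum with the exact rational sum."""
    computed = 0.0
    for x in xs:
        computed += x                      # each real addition commits one rounding d
    exact = sum(Fraction(x) for x in xs)   # Fraction(float) converts exactly
    error = abs(Fraction(computed) - exact)
    # n terms need n-1 rounded additions (0.0 + x1 is exact).
    growth = (1 + macheps) ** (len(xs) - 1) - 1
    bound = growth * sum(abs(Fraction(x)) for x in xs)
    return error, bound

error, bound = backward_error_check([0.1, 0.2, 0.3, 0.4])
```

The comparison is against the sum of the stored doubles, not of the decimal literals, since 0.1 and friends are already rounded when they are read in.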

EXCEPTION HANDLING
o What happens when the “exact value” is not a real number, or too small or too large to represent accurately?
o 5 Exceptions:
  o Overflow - exact result > OV, too large to represent
  o Underflow - exact result nonzero and < UN, too small to represent
  o Divide-by-zero - nonzero/0
  o Invalid - 0/0, sqrt(-1), …
  o Inexact - you made a rounding error (very common!)
o Possible responses
  o Stop with error message (unfriendly, not default)
  o Keep computing (default, but how?)

EXCEPTION HANDLING USER INTERFACE

o Each of the 5 exceptions has the following features
  o A sticky flag, which is set as soon as an exception occurs
  o The sticky flag can be reset and read by the user
    o reset overflow_flag and invalid_flag
    o perform a computation
    o test overflow_flag and invalid_flag to see if any exception occurred
  o An exception flag, which indicates whether a trap should occur
    o Not trapping is the default
    o Instead, continue computing returning a NAN, infinity or denorm
    o On a trap, there should be a user-writable exception handler with access to the parameters of the exceptional operation
    o Trapping or “precise interrupts” like this are rarely implemented for performance reasons.
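
The reset / compute / test pattern can be tried with Python's standard decimal module, whose contexts carry sticky flags and per-signal trap enables modelled on the same scheme. This is an analogy, not the hardware interface: the flags here belong to a decimal Context rather than to the FPU status register, and the precision and exponent limits are arbitrary values picked for this sketch.

```python
from decimal import Context, Decimal, DivisionByZero, InvalidOperation, Overflow

ctx = Context(prec=9, Emax=99, Emin=-99, traps=[])    # no traps: keep computing

ctx.clear_flags()                                      # reset the sticky flags
q = ctx.divide(Decimal(1), Decimal(0))                 # divide-by-zero gives Infinity
big = ctx.multiply(Decimal("9E99"), Decimal(10))       # overflow gives Infinity
ok = ctx.add(Decimal(1), Decimal(2))                   # a clean operation afterwards
flags_after = (bool(ctx.flags[DivisionByZero]),
               bool(ctx.flags[Overflow]),
               bool(ctx.flags[InvalidOperation]))      # test the flags at the end

# Enabling a trap turns the same condition into a raised exception.
trapping = Context(prec=9, traps=[DivisionByZero])
try:
    trapping.divide(Decimal(1), Decimal(0))
    trapped = False
except DivisionByZero:
    trapped = True
```

The flags stay set after the later clean addition, which is what "sticky" means: one test at the end covers the whole computation.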

FPU DATA REGISTER STACK
o FPU register format (extended precision)

  bit:    79 | 78 … 64 | 63 … 0
  field:  s  | exp     | frac

o FPU register stack
  o stack grows down
    o wraps around from R0 -> R7
  o FPU registers are typically referenced relative to top of stack
    o st(0) is top of stack (Top)
    o followed by st(1), st(2),…
  o push: decrement Top, load
  o pop: store, increment Top

[Figure: absolute view (registers R7 … R0) beside stack view (st(0) … st(7), with Top marking st(0))]

FPU INSTRUCTIONS
o Large number of floating point instructions and formats
  o ~50 basic instruction types
  o load, store, add, multiply
  o sin, cos, tan, arctan, and log!
o Sampling of instructions:

Instruction  Effect                      Description
fldz         push 0.0                    Load zero
flds S       push S                      Load single precision real
fmuls S      st(0) <- st(0)*S            Multiply
faddp        st(1) <- st(0)+st(1); pop   Add and pop
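
The effect column can be traced with a small simulator. This is a toy model written for illustration, not an emulator: the class FPUStack and its method names are assumptions, values are plain Python floats rather than 80-bit extended precision, flds does not round its operand to single precision, and stack overflow checks are omitted. It follows the x87 convention that a push moves Top downward (wrapping from R0 to R7), in line with "stack grows down".

```python
class FPUStack:
    """Toy model of the eight-register x87 data stack (illustrative only)."""

    def __init__(self):
        self.regs = [0.0] * 8    # absolute view: R0 .. R7
        self.top = 0             # index of the register acting as st(0)

    def st(self, i):
        """Stack view: st(i) is i registers above Top, modulo 8."""
        return self.regs[(self.top + i) % 8]

    def push(self, value):
        self.top = (self.top - 1) % 8     # grows down, wrapping R0 -> R7
        self.regs[self.top] = value

    def pop(self):
        value = self.regs[self.top]
        self.top = (self.top + 1) % 8
        return value

    def fldz(self):
        self.push(0.0)                    # push 0.0

    def flds(self, s):
        self.push(s)                      # push S

    def fmuls(self, s):
        self.regs[self.top] *= s          # st(0) <- st(0)*S

    def faddp(self):
        self.regs[(self.top + 1) % 8] = self.st(0) + self.st(1)
        self.pop()                        # st(1) <- st(0)+st(1); pop

fpu = FPUStack()
fpu.flds(1.5)     # st(0) = 1.5
fpu.flds(2.0)     # st(0) = 2.0, st(1) = 1.5
fpu.fmuls(4.0)    # st(0) = 8.0
fpu.faddp()       # st(0) = 9.5
```

After the sequence, the sum sits in the register that used to be st(1) and is now st(0), which is the usual way x87 code leaves a result on top of the stack.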

☺END

Date post:	24-Feb-2016