A Proposed Standard for Binary Floating-Point Arithmetic


Offered here for public comment, this proposed standard facilitates transportation of numerically oriented programs and encourages development of high-quality numerical software.

Draft 8.0 of IEEE Task P754

Introductory Comments by David Stevenson, Chairman, Floating-Point Working Group, Microprocessor Standards Committee, IEEE Computer Society

Few programmers care how their computer performs floating-point arithmetic. If they do, it is usually because they've had a divide-by-zero fault (even after inserting a test to ensure that $x \neq y$ before dividing by $x - y$) or some equally mysterious incident. Specifying a programming environment that minimizes such anomalies is one of the goals of this standardization effort. Overall, it attempts to facilitate the transportation of numerically oriented programs and to encourage the development of high-quality numerical software. These two goals are especially important in the microprocessor environment since component vendors are not likely to devote extensive resources to developing numerical software for the general community.

A number of rationales underlying the development of this proposal should be brought to the reader's attention. First, the working group responsible for this document was not restricted to the format or other conventions of an existing floating-point system; instead, the interests of the user community were placed above the goal of industrial continuity.* In fact, the segment of the computer industry that has shown the greatest interest in this work has been the semiconductor industry, which is currently introducing the second generation of floating-point units on IC chips. The second major rationale, based on the realization that most implementations would rely on software to supply the full functionality of the proposal, was that the document describe a programming environment, meaning both hardware and software. Indeed, one goal was to encourage hardware implementations that do not preclude an efficient implementation of the total desired functionality.

These rationales should be kept in mind as we review the proposal's major features and indicate why they were included. There are three major aspects of the proposal: the format of the data types, the arithmetic, and the exception handling.

*A new working group, IEEE Task 854, has been formed recently to permit development of a floating-point standard parameterized to accommodate different computer word formats and radices. Dr. W. J. Cody of Argonne National Laboratory will be chairman of this working group.

Formats. The basic format sizes for floating-point numbers, 32 bits and 64 bits, were selected for efficient calculation of array elements in byte-addressable memories. For the 32-bit format, precision was deemed the most important criterion, hence the choice of radix 2 instead of octal or hexadecimal. Other characteristics include not representing the leading significand bit in normalized numbers, a minimally acceptable exponent range which uses eight bits, and an exponent bias which allows the reciprocal of all normalized numbers to be represented without overflow.

For the 64-bit format, the main consideration was range; as a minimum, the desire was that the product of any two 32-bit numbers should not overflow the 64-bit format. The final choice of exponent range provides that a product of eight 32-bit terms cannot overflow the 64-bit format, a possible boon to users of optimizing compilers which reorder the sequence of arithmetic operations from that specified by the careful programmer.

The proposal also recommends the minimum requirements for extended-precision temporaries (quantities whose range and precision are greater than a basic format but do not require twice as many bits for representation). With their greater precision, extended-precision temporaries lessen the chance of a final result that has been contaminated by excessive roundoff error; with their greater range, they also lessen the chance of an intermediate overflow aborting a computation whose result would have been representable in a basic format. The motivation for supplying extended precision is to afford some benefits of a higher basic precision without incurring the time penalty usually associated with higher precision; however, the proposed standard requires only single precision (32-bit format) for conforming implementations.
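The range claims made for the two basic formats can be checked with exact rational arithmetic. The following is a minimal sketch, not part of the proposal: the helper names are hypothetical, and the field widths (8 and 23 bits for single, 11 and 52 bits for double) are taken from the format definitions in §3.1.

```python
from fractions import Fraction


def largest_finite(exp_bits: int, frac_bits: int) -> Fraction:
    """Largest finite value of a basic format: (2 - 2**-frac_bits) * 2**bias."""
    bias = (1 << (exp_bits - 1)) - 1
    return (2 - Fraction(1, 1 << frac_bits)) * Fraction(2) ** bias


def smallest_normalized(exp_bits: int) -> Fraction:
    """Smallest positive normalized value of a basic format: 2**(1 - bias)."""
    bias = (1 << (exp_bits - 1)) - 1
    return Fraction(2) ** (1 - bias)


MAX_SINGLE = largest_finite(8, 23)
MAX_DOUBLE = largest_finite(11, 52)
MIN_NORMAL_SINGLE = smallest_normalized(8)

# The reciprocal of the smallest normalized single does not overflow single.
reciprocal_fits = 1 / MIN_NORMAL_SINGLE <= MAX_SINGLE

# A product of eight single-format terms stays within the double range,
# while a ninth factor of the same size would exceed it.
product_of_eight_fits = MAX_SINGLE ** 8 <= MAX_DOUBLE
product_of_nine_fits = MAX_SINGLE ** 9 <= MAX_DOUBLE
```

Exact fractions are used here so that the check itself cannot overflow or round.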

0018-9162/81/0300-0051$00.75 © 1981 IEEE. Computer, March 1981.


Arithmetic. The proposed standard requires accurate computation of all arithmetic results to within half a unit in the last place of the destination format. Once the hardware is in place to achieve this goal (guard, round, and sticky bits, for example), it requires little more to achieve directed roundings that are useful in interval arithmetic, so the proposal requires these additional rounding capabilities (see Figure 1).

In addition to the four basic arithmetic operations, the proposed standard also requires remainder, square root, and conversions between binary and decimal representations. Remainder, square root, and conversions within a specified range must be as accurate as the basic arithmetic. Remainder is included because of its usefulness in argument reduction in computing elementary transcendental functions, square root because of its frequency in matrix algorithms; both are included because they can be supported, in a well-designed divide unit, to the required accuracy with little additional hardware. Conversion was included to ensure that accurate, reproducible results on different implementations would not be lost at the I/O interface.

[Figure 1: diagram of the significand bits, showing the least-significant bit followed by the guard bit, the round bit, and the sticky bit.]

Figure 1. Guard, round, and sticky bits ensure accurate unbiased rounding of computed results to within half a unit in the least-significant bit. Two bits are required for perfect rounding; the guard bit is the first bit beyond rounding precision, and the sticky bit is the logical OR of all bits thereafter. To accommodate post-normalization in some operations, the round bit is kept, beyond the guard bit, and the sticky bit is a logical OR of all bits beyond round.
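The two-bit rounding rule described in the Figure 1 caption can be illustrated on plain integers. This is a sketch under simplifying assumptions: the function name is hypothetical, the significand is an unsigned Python integer with the discarded bits at its low end, and the extra round bit kept for post-normalization is not modeled.

```python
def round_nearest_even(sig: int, extra_bits: int) -> int:
    """Drop `extra_bits` low-order bits of `sig`, rounding to nearest even."""
    if extra_bits == 0:
        return sig
    kept = sig >> extra_bits
    # Guard bit: the first bit beyond the rounding precision.
    guard = (sig >> (extra_bits - 1)) & 1
    # Sticky bit: logical OR of every bit after the guard bit.
    sticky = (sig & ((1 << (extra_bits - 1)) - 1)) != 0
    # Round up when more than half a unit is discarded, or on an exact tie
    # when the kept value is odd (so that the result becomes even).
    if guard and (sticky or (kept & 1)):
        kept += 1  # a carry out of the top bit would require renormalization
    return kept
```

For example, `round_nearest_even(0b1011100, 3)` is an exact tie with an odd kept value and rounds up to `0b1100`, while `round_nearest_even(0b1010100, 3)` is a tie with an even kept value and stays at `0b1010`.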

It should be remembered that this proposal specifies a programming environment; the supporting hardware need not directly implement these operations as single instructions. Finally, the insistence on accuracy is not only an end in itself (providing sharper error bounds), it also ensures a host of pleasant derivative features, such as the commutativity of addition and multiplication.

Exceptions. Operations that produce results beyond the range of normalized floating-point numbers are also treated in this proposal. In situations where a trap is not allowed, overflow and divide-by-zero generate infinities, and subsequent arithmetic involving these infinities produces results obeying traditional mathematical conventions regarding infinity. Operations that have no mathematical interpretation, such as zero divided by zero, will produce a not-a-number called a NaN. Such NaNs can be used to convey diagnostic information regarding their creation, or can be used as escape-mechanism pointers to nonstandard representations. Underflow is handled by introducing "denormalized" numbers: nonzero numbers that lie between the largest negative normalized number and the smallest positive normalized number, with constant spacing (see Figure 2). In many instances, denormalized numbers reduce potential underflow damage to no more than roundoff error.

One of the consequences of a floating-point system with NaNs is that comparisons do not obey the trichotomy rule: two items may compare not only as less than, equal, or greater than; they may also be unordered. This complicates the handling of branching conditions, but those responsible for this proposal felt that the additional complexity was unavoidable and that the additional functionality which gives rise to unordered comparisons was worth the logical expense.
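The four-way outcome of a comparison can be observed in any language whose floating-point type follows these conventions. The sketch below assumes Python floats are 64-bit binary floating-point numbers, which is true of common implementations but is not something the proposal itself states; `relation` is a hypothetical helper. Note that Python raises `ZeroDivisionError` for division by zero instead of returning an infinity, so the NaN here is produced from an invalid subtraction of infinities.

```python
import math


def relation(a: float, b: float) -> str:
    """Classify a comparison as less, equal, greater, or unordered."""
    if a < b:
        return "less"
    if a == b:
        return "equal"
    if a > b:
        return "greater"
    # None of the three usual relations holds: at least one operand is a NaN.
    return "unordered"


inf = math.inf
nan = inf - inf  # an operation with no mathematical interpretation gives NaN

# Arithmetic on infinities follows the usual mathematical conventions.
inf_plus_one = inf + 1.0
one_over_inf = 1.0 / inf
```

A branch written as `if not (a < b)` is therefore not equivalent to `if a >= b` once NaNs can appear, which is the added complexity the text refers to.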

Unordered comparisons are one of several exceptional operations that can arise, and a default action for a nontrapping environment is specified for each occasion; for trapping contexts, the result to be delivered to a user-specifiable trap handler is indicated.

[Figure 2: number line titled "Tiny binary floating-point numbers," with ticks at 0, m, 2m, 4m, and 8m; the interval between 0 and m is labeled "Gradual underflow" and "Denormalized numbers."]

Figure 2. Each vertical tick stands for a 4-bit significand binary floating-point number. The underflow threshold m is a power of 1/2 depending upon the allowed range of exponents; every floating-point number bigger than m, but none smaller, is representable as a normalized floating-point number. On some machines (IBM 7094, DEC PDP-10, DEC PDP-11, etc.) m is a normalized number, too; on others (HP-3000) it is not. Flushing underflows to zero introduces a gap between m and 0 much wider than between m and the next larger number. Gradual underflow fills that gap with denormalized numbers as densely packed between m and 0 as are normalized numbers between m and 2m. Doing so relegates underflow in most computations to a status comparable with roundoff among the normalized numbers.
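Gradual underflow is what makes the test "$x \neq y$ before dividing by $x - y$" from the opening paragraph reliable. The sketch below assumes Python floats are 64-bit binary floating-point numbers with denormalized values supported; `flush_to_zero` is a hypothetical stand-in for a machine without them.

```python
import math
import sys

m = sys.float_info.min             # smallest positive normalized double, 2**-1022
spacing = math.ldexp(1.0, -1074)   # constant spacing of the denormalized numbers

x = 1.5 * m
y = 1.25 * m
diff = x - y                       # 0.25 * m lies below m, so it is denormalized


def flush_to_zero(v: float) -> float:
    """Model a system that replaces every underflowed result with zero."""
    return 0.0 if abs(v) < m else v
```

Here `x != y` and `diff` is nonzero, so `1.0 / diff` cannot raise a divide-by-zero fault; under `flush_to_zero` the same difference would become `0.0` even though the operands differ.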

In this brief introduction, it is impossible to give an adequate account of several years' work by the many people involved in the effort of the working group (the minutes of the meetings and supporting documents run to hundreds of pages). An early collection of most of the ideas was made by Prof. William Kahan and Jerome Coonen, both of the University of California at Berkeley, and Prof. Harold Stone of the University of Massachusetts at Amherst; a much-revised version of that work appeared in the October 1979 ACM SIGNUM Newsletter, a special issue providing an extensive account of the proposal's features and the alternatives considered by the group.

The Floating-Point Working Group, IEEE Task P754, of the Microprocessor Standards Subcommittee, under the initial chairmanship of Richard H. Delp of Four-Phase Systems, recast that work many times. Coonen, Kahan, John F. Palmer of Intel, Tom Pittman of Itty Bitty Computers, and David Stevenson of Zilog were responsible for drafting this proposal. Other members of the working group who played major roles in its deliberations by presenting alternative proposals include Bob Fraley and Steve Walther of Hewlett-Packard Laboratories and Mary Payne, Dileep Bhandarkar, and William Strecker of Digital Equipment Corporation. Several working-group members will present a one-day tutorial on the proposed standard May 20, 1981, in conjunction with the Fifth Symposium on Computer Arithmetic in Ann Arbor, Michigan.

The IEEE Computer Society is publishing the draft standard, along with related material, to invite public comment prior to its submission to the IEEE Standards Board for adoption as an IEEE Standard. Comments should be sent to Stevenson by May 15, 1981, with copies to Mike Smolin and Steve Diamond, co-chairmen of the Microprocessor Standards Committee. Stevenson's address is Zilog, Inc., 10460 Bubb Road, Cupertino, CA 95014; (408) 446-4666, ext. 5476. Smolin and Diamond's address is Synertek, PO Box 552 MS-34, Santa Clara, CA 95052. If you would like to participate in this and other efforts of the Microprocessor Standards Committee, please contact one of the co-chairmen.

The Proposed Standard

Foreword

This standard is a product of the Floating-Point Working Group of the Microprocessor Standards Subcommittee of the IEEE Computer Society Computer Standards Committee. It is intended that the standard embody the essence of "An Implementation Guide to a Proposed Standard for Floating-Point Arithmetic" by Jerome T. Coonen.¹

1. Computer, Vol. 13, No. 1, January 1980, pp. 68-79. See also errata on p. 61 of this issue.

This standard defines a family of commercially feasible ways for new systems to perform binary floating-point arithmetic. The issues of retrofitting were not considered. The desiderata which guided the formulation of this standard included:

(a) Facilitate movement of existing programs from diverse computers to those which adhere to this standard.

(b) Enhance the capabilities and safety available to programmers who, though not expert in numerical methods, may well be attempting to produce numerically sophisticated programs. However, we recognize that utility and safety are sometimes antagonists.

(c) Encourage experts to develop and distribute robust and efficient numerical programs portable, via minor editing and recompilation, onto any computer which conforms to this standard and possesses adequate capacity. When restricted to a declared subset of the standard, these programs should produce identical results on all conforming systems.

(d) Provide direct support for
    * execution-time diagnosis of anomalies,
    * smoother handling of exceptions, and
    * interval arithmetic at a reasonable cost.

(e) Provide for development of
    * elementary functions like exp, cos, ...
    * very high precision (multiword) arithmetic, and
    * coupling of numerical and symbolic algebraic computation.

(f) Enable rather than preclude further refinements and extensions.

1. Scope

1.1. Implementation objectives. It is intended that an implementation of a floating-point system conforming to this standard can be realized entirely in software, entirely in hardware, or in any combination of hardware and software. It is the actual environment which the programmer or user of the system sees that conforms or fails to conform to this standard. Hardware components that require software support to conform shall not be said to conform apart from such software.

1.2. Inclusions. This standard specifies:

(a) floating-point number formats;
(b) the results for add, subtract, multiply, divide, square root, remainder, and compare;
(c) conversions between integers and floating-point numbers;
(d) conversions between different floating-point formats;
(e) conversion between basic format (see §3.1) floating-point numbers and decimal strings; and
(f) floating-point exceptions and their handling, including non-numbers (NaNs).

Preliminary-Subject to Revision


1.3. Exclusions. This standard does not specify:

(a) integer representation;
(b) interpretation of signs and fraction fields of NaNs;
(c) binary-decimal conversions to and from extended formats; or
(d) formats of decimal strings.

2. Definitions

2.1. User. The user of a floating-point system is considered to be any person, hardware, or program, not itself specified by this standard, having access to and controlling those operations of the programming environment specified in this standard.

2.2. Binary floating-point number. A bit-string characterized by three components, a sign, a signed exponent, and a significand. Its numerical value, if any, is the signed product of its significand and two raised to the power of its exponent. In this document a bit-string is not always distinguished from a number it may represent.

2.3. Exponent. That component of a binary floating-point number which normally signifies the power to which two is raised in determining the value of the represented number. Occasionally the exponent is called the signed or unbiased exponent.

2.4. Biased exponent. The sum of the exponent and a constant (bias) chosen to make the biased exponent's range non-negative.

2.5. Significand. That component of a binary floating-point number which consists of an explicit or implicit leading bit to the left of its binary point and a fraction field to the right of the binary point.

2.6. Fraction. That field of the significand that lies to the right of its implied binary point.

2.7. Normal zero. The exponent is the format's minimum, and the significand is zero. Normal zero may have either a positive or a negative sign. Only the extended formats have any unnormalized zeros (see §2.9).

2.8. Denormalized. The exponent is the format's minimum, the explicit or implicit leading bit is zero, and the number is not normal zero. To denormalize a binary floating-point number means to shift its significand right while incrementing its exponent, until it becomes a denormalized number.

2.9. Unnormalized. Occurs only in the extended format. The number's exponent is greater than the format's minimum, and the explicit leading bit is zero. If the significand is zero, this is an unnormalized zero.

2.10. Normalize. If the number is nonzero, shift its significand left while decrementing its exponent until the leading significand bit becomes one; the exponent is regarded as if its range were unlimited. If the significand is zero, the number becomes normal zero. Normalizing a number does not change its sign.

2.11. NaN. Not a number; a symbolic entity encoded in floating-point format. See §3 and §6.2.

2.12. Status flag. A variable which may take two states, set and clear. A program may clear or copy a flag. When set, a status flag may contain additional system-dependent information, possibly inaccessible to the program. The operations of this standard may, as a side effect, set some of the following flags: inexact result, underflow, overflow, divide by zero, and invalid operation.

2.13. Destination. Every unary or binary operation delivers its result to a destination, either explicitly designated by the user or implicitly supplied by the system (e.g., intermediate results in subexpressions or arguments for procedures). Some languages place the results of intermediate calculations in destinations beyond the programmer's control. Nonetheless, this standard defines the result of an operation in terms of that destination format as well as the operands' values.

2.14. Mode. A mode is a variable which a program may set, sense, save and restore, to control the execution of subsequent arithmetic operations. The default mode is that mode which a program can assume to be in effect unless an explicitly contrary statement is included either in the program or its specification. The standard entails the modes

(a) projective/affine, which concerns the interpretation of $\infty$,

(b) rounding direction, which concerns the direction of rounding errors,

and, in certain implementations,

(c) rounding precision, to shorten the precision of intermediate results.

Optionally, an implementor may provide the modes

(d) warning/normalizing, for handling underflowed values, and

(e) traps disabled/enabled, for handling exceptions.

2.15. Shall and should. In this standard, the use of the word "shall" signifies that which is obligatory in any conforming implementation; the use of the word "should" signifies that which is strongly recommended as being in keeping with the intent of the standard, despite architectural or other constraints beyond the scope of this standard that may on occasion render the recommendations impractical.

3. Formats

This standard defines four floating-point formats in two groups, basic and extended, each having two widths, single and double. The standard levels of implementation are distinguished by the combinations of formats supported.

[Figure 1: layout of the 32-bit word into the fields s, e, and f, with bit positions 0, 8, and 31 marked.]

Figure 1. Single-precision format.

[Figure 2: layout of the 64-bit word into the fields s, e, and f, with bit positions 0, 11, and 63 marked.]

Figure 2. Double-precision format.

3.1. Basic formats

3.1.1. Single. A 32-bit format for a binary floating-point number X is divided as shown in Figure 1. The component fields of X are the 1-bit sign s, the 8-bit biased exponent e, and the 23-bit fraction f. The value v of X is as follows:

(a) If $e = 255$ and $f \neq 0$, then $v = \mathrm{NaN}$.
(b) If $e = 255$ and $f = 0$, then $v = (-1)^s \infty$.
(c) If $0 < e < 255$, then $v = (-1)^s \, 2^{e-127} \, (1.f)$.
(d) If $e = 0$ and $f \neq 0$, then $v = (-1)^s \, 2^{-126} \, (0.f)$.
(e) If $e = 0$ and $f = 0$, then $v = (-1)^s \, 0$ (zero).

3.1.2. Double. A 64-bit format for a binary floating-point number X is divided as shown in Figure 2. The component fields of X are the 1-bit sign s, the 11-bit biased exponent e, and the 52-bit fraction f. The value v of X is as follows:

(a) If $e = 2047$ and $f \neq 0$, then $v = \mathrm{NaN}$.
(b) If $e = 2047$ and $f = 0$, then $v = (-1)^s \infty$.
(c) If $0 < e < 2047$, then $v = (-1)^s \, 2^{e-1023} \, (1.f)$.
(d) If $e = 0$ and $f \neq 0$, then $v = (-1)^s \, 2^{-1022} \, (0.f)$.
(e) If $e = 0$ and $f = 0$, then $v = (-1)^s \, 0$ (zero).
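The five cases listed for each basic format translate directly into a decoder. The following is a minimal sketch, not a definitive implementation: `decode` is a hypothetical helper that takes the bit pattern as an unsigned integer with the sign in the most significant position, and it returns a Python float on the assumption that Python floats can hold every single and double value exactly.

```python
import math
from fractions import Fraction


def decode(bits: int, exp_bits: int, frac_bits: int) -> float:
    """Value of a basic-format bit pattern, following cases (a) to (e)."""
    bias = (1 << (exp_bits - 1)) - 1          # 127 for single, 1023 for double
    e_max = (1 << exp_bits) - 1               # 255 for single, 2047 for double
    s = bits >> (exp_bits + frac_bits)
    e = (bits >> frac_bits) & e_max
    f = bits & ((1 << frac_bits) - 1)
    sign = -1 if s else 1
    if e == e_max:
        # Cases (a) and (b): NaN or a signed infinity.
        return math.nan if f else sign * math.inf
    if e == 0 and f == 0:
        # Case (e): signed zero.
        return math.copysign(0.0, sign)
    fraction = Fraction(f, 1 << frac_bits)
    if e == 0:
        # Case (d): denormalized, leading bit 0 and the minimum exponent.
        return float(sign * fraction * Fraction(2) ** (1 - bias))
    # Case (c): normalized, with the implicit leading bit equal to 1.
    return float(sign * (1 + fraction) * Fraction(2) ** (e - bias))
```

For example, `decode(0x3F800000, 8, 23)` gives `1.0` and `decode(0xC000000000000000, 11, 52)` gives `-2.0`.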

3.2. Extended formats

3.2.1. Single-extended. Extended is an implementation-dependent format. An extended binary floating-point number X has four components: a 1-bit sign s, an exponent e of specified range combined with an implementation-dependent bias, a 1-bit integer part j, and a fraction f with at least 31 bits. The exponent shall range between a minimum value $m \leq -1023$ and a maximum value $M \geq +1024$. The value v of X is as follows:

(a) If $e = M$ and $f \neq 0$, then $v = \mathrm{NaN}$.
(b) If $e = M$ and $f = 0$, then $v = (-1)^s \infty$.
(c) If $m < e < M$, then $v = (-1)^s \, 2^{e} \, (j.f)$.
(d) If $e = m$ and $j = f = 0$, then $v = (-1)^s \, 0$ (normal zero).
(e) If $e = m$ and $j$ or $f$ is nonzero, then $v = (-1)^s \, 2^{e'} \, (j.f)$, where $e' = m$ or $m + 1$ at the implementor's option.

3.2.2. Double-extended. The double-extended format is the same as single-extended described in §3.2.1, except that the exponent shall range between $m \leq -16383$ and $M \geq +16384$, and the fraction shall have at least 63 bits.

3.2.3. Exponent range. An implementation of this standard is not required to provide (and the user should not assume) that single-extended have greater range than double.

3.3. Combinations of formats. All implementations conforming to this standard shall support single. Implementations should support the extended format corresponding to the widest basic format supported, and need not support any other extended format.²

2. Only if upward compatibility and speed are important issues should a system supporting the double-extended format also support single-extended.

4. Rounding

Except for binary-decimal conversion, all operations specified in §5 and §7 shall be performed as if correct to infinite precision, then rounded according to the specifications in this section. Rounding takes a number regarded as infinitely precise and, if necessary, modifies it to fit in the destination's format while signalling that it is inexact (see §8.5).

4.1. Default rounding mode. An implementation of this standard shall provide round to nearest, with rounding to even in case of a tie, as the default rounding mode. When rounding to nearest, the result shall differ from the infinite precision exact result by at most one half in the least-significant-digit position; rounding to even means that if the difference is exactly half then the rounded result shall have an even last digit.
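Round to nearest with ties to even can be observed when an integer too wide for the significand is converted to a 64-bit float. The sketch below assumes CPython, whose integer-to-float conversion rounds correctly with ties to even; `to_double` is a hypothetical helper. Above $2^{53}$, adjacent doubles are two units apart, so every odd integer falls exactly halfway between two representable values.

```python
TWO_53 = 2 ** 53  # from here up to 2**54, adjacent doubles are 2 apart


def to_double(n: int) -> int:
    """Round an integer to the nearest double and return it as an integer."""
    return int(float(n))


# 2**53 + 1 is a tie between 2**53 and 2**53 + 2; the last significand bit
# of 2**53 is 0 (even), so the tie resolves downward.
tie_down = to_double(TWO_53 + 1)

# 2**53 + 3 is a tie between 2**53 + 2 (odd last bit) and 2**53 + 4 (even),
# so the tie resolves upward.
tie_up = to_double(TWO_53 + 3)
```

In both cases the result differs from the exact value by one, which is half of the spacing at this magnitude, matching the error bound stated in §4.1.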


4.2. Directed rounding modes. An implementation shall provide user-selectable positive- and negative-directed rounding (round toward $+\infty$ and round toward $-\infty$) and truncation (round toward 0) for all operations.

When rounding toward $+\infty$, the result shall be the format's value (possibly $+\infty$) closest to and no less than the infinitely precise result, except as specified in §8.3; analogously, when rounding toward $-\infty$, the result shall be the format's value (possibly $-\infty$) closest to and no greater than the infinitely precise result, except as specified in §8.3. When rounding toward 0, the result shall be the format's value closest to and no greater in magnitude than the infinitely precise result.

The rounding modes may affect the signs of zero sums (see §6.3).
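Python offers no portable control over the rounding direction of binary floats, so the sketch below uses the standard `decimal` module as a decimal analog of the directed modes. This is an illustration of the definitions only, not of the binary arithmetic the proposal specifies; the `divide` helper and the four-digit precision are assumptions made for the example.

```python
from decimal import (
    Context,
    Decimal,
    ROUND_CEILING,
    ROUND_DOWN,
    ROUND_FLOOR,
    ROUND_HALF_EVEN,
)


def divide(a: int, b: int, rounding: str, prec: int = 4) -> Decimal:
    """Divide a by b, rounding the quotient to `prec` significant digits."""
    ctx = Context(prec=prec, rounding=rounding)
    return ctx.divide(Decimal(a), Decimal(b))


# Rounding toward -infinity and +infinity brackets the exact quotient 1/3,
# which is the property interval arithmetic relies on.
lower = divide(1, 3, ROUND_FLOOR)
upper = divide(1, 3, ROUND_CEILING)

# For a negative quotient, truncation (toward 0) agrees with rounding
# toward +infinity and differs from rounding toward -infinity.
truncated = divide(-1, 3, ROUND_DOWN)
nearest = divide(-1, 3, ROUND_HALF_EVEN)
```

Here `lower` is `0.3333` and `upper` is `0.3334`, so the exact result is enclosed by the pair.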

4.3. Rounding precision. Normally a result is rounded to the precision of its destination. However, some hardware will always deliver results from single format operands to double or extended destinations. On such a system the user, which may be a high-level language compiler, shall be able to specify that a result be rounded instead to single precision, though it is stored in the double or extended format with its wider exponent range.³ Similarly, a system that delivers all results from double format operands to extended destinations shall permit the user to specify rounding to double precision. Note that to meet the specifications in §4.1, the result cannot suffer more than one rounding error.

3. Rounding precision control is intended to allow systems whose destinations are always double or extended to mimic systems with single and double destinations. However, use of precision control to combine double (or extended) operands to produce a single format result with just one rounding is considered nonstandard.


5. Operations

All conforming implementations of this standard shall provide add, subtract, multiply, divide, square root, remainder, floating-point format conversions, conversions between floating-point and integers, binary-decimal conversions, and comparisons.

When all operands are normalized, the operations shall be performed as if to infinite precision before rounding as specified in §4. §7 specifies the results when at least one of the operands is not normalized. §6 augments the specifications to cover signed zero, $\infty$, and NaN; §8 enumerates exceptions.

5.1. Arithmetic. An implementation shall provide add, subtract, multiply, divide, and remainder for any two operands of the same format, for each supported format; it should also provide the operations for operands of differing formats. The destination format (regardless of the rounding precision control of §4.3) shall be at least as wide as the operands' format. All results shall be rounded as specified in §4.

The remainder $r = x \;\mathrm{REM}\; y$ is defined regardless of the rounding mode by the following relation when $y \neq 0$:

$r = x - y \times n$

where $n$ is the integer nearest $x/y$; $n$ is even whenever $|n - x/y| = \tfrac{1}{2}$. Note that with this definition the remainder is exact. The result shall be normalized unless it underflows.
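The remainder defined here differs from the truncating remainder found in many languages, because $n$ is the nearest integer to $x/y$ and not the quotient truncated toward zero. The sketch below follows the definition literally with exact rationals; `ieee_rem` is a hypothetical name, and the comparison with `math.remainder` assumes Python 3.7 or later, where that function is documented as the IEEE 754-style remainder.

```python
import math
from fractions import Fraction


def ieee_rem(x: float, y: float) -> float:
    """r = x - y*n, with n the integer nearest x/y and ties going to even n."""
    quotient = Fraction(x) / Fraction(y)
    n = round(quotient)  # rounding a Fraction sends exact halves to the even integer
    # The difference is formed exactly; the sign of a zero result is not modeled.
    return float(Fraction(x) - Fraction(y) * n)


nearest_case = ieee_rem(5.0, 3.0)     # x/y is about 1.67, so n = 2 and r = -1
tie_to_even_up = ieee_rem(7.0, 2.0)   # x/y = 3.5 is a tie, so n = 4 and r = -1
tie_to_even_down = ieee_rem(5.0, 2.0) # x/y = 2.5 is a tie, so n = 2 and r = 1
truncating = math.fmod(5.0, 3.0)      # truncated quotient n = 1 gives r = 2
```

Because $|r|$ never exceeds $|y|/2$, this form of remainder suits the argument reduction mentioned in the introductory comments.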

5.2. Square root. The square root operation shall be provided in all supported formats and is defined for all normalized operands $\geq 0$; $\sqrt{-0} = -0$. The destination format shall be at least as wide as the operand's. The result shall be rounded as specified in §4.

5.3. Floating-point format conversions. It shall be possible to convert floating-point numbers between all supported formats. If the conversion is to a less wide precision, the result shall be rounded as specified in §4. If the conversion is to a wider precision, it shall be exact, although an invalid result exception may be raised as specified in §8.1.2.

Table 1. Decimal conversion ranges.

            DECIMAL TO BINARY        BINARY TO DECIMAL
FORMAT      MAX M        MAX N       MAX M        MAX N
SINGLE      10^? - 1     99          10^9 - 1     54
DOUBLE      10^19 - 1    999         10^17 - 1    341

Table 2. Correctly rounded decimal conversion range.

            DECIMAL TO BINARY        BINARY TO DECIMAL
FORMAT      MAX M        MAX N       MAX M        MAX N
SINGLE      10^? - 1     13          10^? - 1     13
DOUBLE      10^? - 1     27          10^17 - 1    27

5.4. Conversion between floating-point and integer. It shall be possible to round a floating-point number to an integer value in the same floating-point format, for all supported formats. The rounding shall be as specified in §4, with the understanding that in round to nearest mode, if the difference between the unrounded operand and the rounded result is exactly one half, the rounded result is even.

It shall be possible to convert between all supported floating-point formats and all supported integer formats. Conversion to integer shall be effected by rounding as specified in §4. Conversions between floating-point integers and integer formats shall be exact unless an exception arises as specified in §8.1.1.
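The halfway rule can be seen in Python, whose built-in round happens to round exact halves to the even integer. This is only an illustration: round returns a Python integer rather than a value in the operand's floating-point format, so it corresponds more closely to the second paragraph than to the first.

```python
# Operands that lie exactly halfway between two integers.
halfway = [0.5, 1.5, 2.5, 3.5, -0.5, -2.5]
rounded = [round(v) for v in halfway]     # each result is the even neighbor
print(rounded)

# A floating-point integer converts to an integer format exactly.
exact = int(float(2 ** 53)) == 2 ** 53
print(exact)
```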

5.5. Binary-decimal conversion. Conversion between decimal strings in at least one format and binary floating-point numbers in all supported basic formats shall be provided for numbers throughout the ranges specified in Table 1. The integers M and N in Tables 1 and 2 are such that the decimal strings have values M × 10^(±N). On input, trailing zeros shall be appended to or stripped from M (up to the limits specified in Table 1) in order to minimize N. When the destination is a decimal string, its least-significant digit should be located by format specifications for purposes of rounding.

Conversions shall be correctly rounded as specified in §4 for operands lying within the ranges specified in Table 2. Otherwise the error in the converted result shall not exceed by more than 0.47 units in the destination's least-significant digit the error that would be incurred by the rounding specifications of §4, provided that exponent over/underflow does not occur.

Conversions shall be monotonic. That is, increasing the value of a binary floating-point number shall not decrease its value when converted to a decimal string, and increasing the value of a decimal string shall not decrease its value when converted to a binary floating-point number.

When rounding to nearest, conversion from binary to decimal and back to binary shall be the identity as long as the decimal string is carried to the maximum precision specified in Table 1, namely nine digits for single and 17 for double.⁴

If decimal to binary conversion over/underflows, the response is as specified in §8. Over/underflow and NaNs and infinities encountered during binary to decimal conversion should be indicated to the user by appropriate strings.
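The round-trip property (nine digits for single, 17 for double) can be checked informally in Python. The sketch assumes that Python's format and float conversions are correctly rounded and that the struct module's "f" code rounds to single in round to nearest; the helper names are hypothetical.

```python
import struct

def to_single(x: float) -> float:
    """Round a double to single precision, returned as a double."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def double_round_trip(x: float) -> bool:
    """Binary -> 17 significant decimal digits -> binary."""
    return float(format(x, ".17g")) == x

def single_round_trip(x: float) -> bool:
    """Binary -> 9 significant decimal digits -> binary, in single precision."""
    s = to_single(x)
    return to_single(float(format(s, ".9g"))) == s

samples = [0.1, 1.0 / 3.0, 2.0 ** -20, 6.02214076e23, 1e-30]
doubles_ok = all(double_round_trip(v) for v in samples)
singles_ok = all(single_round_trip(v) for v in samples)
print(doubles_ok, singles_ok)
```

A handful of samples demonstrates the property but does not prove it; the guarantee rests on the digit counts given in the text.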

4. The properties specified for conversions are implied by error bounds that depend on the format (single or double) and the number of decimal digits involved; the 0.47 mentioned is a worst-case bound only. For a detailed discussion of these error bounds and economical conversion algorithms that exploit the extended format, see "Binary Decimal Conversion in KCS Arithmetic," by Jerome T. Coonen (to appear).

Preliminary-Subject to Revision    58    COMPUTER

5.6. Comparison. It shall be possible to compare floating-point numbers in all supported formats, including comparisons between operands of differing formats. Comparisons are exact and never overflow or underflow. Four mutually exclusive relations are possible: "less than," "equal," "greater than," and "unordered." The last case arises when at least one operand is NaN, or when ∞ in the projective mode is compared to anything other than ∞; every NaN shall compare "unordered" with everything, including itself. Comparisons shall ignore the sign of infinity in the projective mode (where +∞ = −∞), and shall ignore the sign of zero (so +0 = −0).

5.6.1. Condition codes. When the result of a comparison is reported via condition codes, the result shall be an encoding of one of the four relations listed in §5.6.

5.6.2. Predicates. When the result of a comparison is reported as an affirmation or negation of a predicate, the following implications shall determine that response:

(a) The relation "less than" affirms the predicates <, ≤, ≠, and denies the predicates =, ≥, >, unordered.

(b) The relation "equal" affirms the predicates =, ≤, ≥, and denies the predicates <, >, ≠, unordered.

(c) The relation "greater than" affirms the predicates >, ≥, ≠, and denies the predicates =, ≤, <, unordered.

(d) The relation "unordered" affirms the predicates ≠, unordered, and denies the predicates <, ≤, =, ≥, >.

In addition to the response specified above, an invalid operation exception (see §8.1) shall arise in a comparison just when two values whose relation is "unordered" are compared via a predicate involving <, ≤, ≥, >, or their negations, as specified in §8.1.1(h).
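The four relations and the predicate rules can be sketched in Python as follows. Two assumptions should be kept in mind: Python floats offer only affine infinities, so the projective case does not arise, and Python comparisons involving NaN simply return False or True without raising an invalid operation exception. The function name relation is hypothetical.

```python
import math

def relation(x: float, y: float) -> str:
    """Classify a pair of floats into one of the four mutually exclusive relations."""
    if math.isnan(x) or math.isnan(y):
        return "unordered"
    if x < y:
        return "less than"
    if x > y:
        return "greater than"
    return "equal"

nan = float("nan")
print(relation(1.0, 2.0), "|", relation(0.0, -0.0), "|", relation(nan, nan))

# "unordered" affirms only the predicate "not equal" among the usual six.
affirmed = {"!=": nan != 1.0}
denied = {"<": nan < 1.0, "<=": nan <= 1.0, "==": nan == 1.0,
          ">=": nan >= 1.0, ">": nan > 1.0}
print(affirmed, denied)
```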

6. Infinity, NaNs, and signed zero

6.1. Infinity arithmetic. Infinity arithmetic shall be construed as the limiting case of real arithmetic with operands of arbitrarily large magnitude, when such a limit exists. Infinity arithmetic shall be supported under two user-selectable modes, affine and projective, with projective being the default. In affine mode −∞ < (every finite number) < +∞, but in projective mode infinities compare "equal" regardless of sign and compare "unordered" with everything else. Consequently, the two modes are distinguished by exceptions in add, subtract, square root, and compare, as specified in §8.

Except for the invalid operations specified for ∞, arithmetic upon ∞ is always exact and therefore shall raise no exceptions. The three exceptions that do pertain to ∞ are raised only when

(a) ∞ is created from finite operands by overflow (§8.3) or division by zero (§8.2), with the corresponding trap disabled, or

(b) ∞ is an invalid operand (§8.1).
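For readers who wish to experiment, Python floats behave like the affine mode described above; the projective mode has no counterpart there. The following sketch, which assumes nothing beyond the standard math module, shows the ordering of the infinities and a few exact operations upon them.

```python
import math

inf = math.inf

# Affine ordering: -inf < (every finite number) < +inf.
ordering_holds = -inf < -1e308 < 1e308 < inf

# Arithmetic upon infinity is exact in these cases.
exact_cases = [
    inf + 1.0 == inf,
    inf * -2.0 == -inf,
    1.0 / inf == 0.0,
    -inf - 1e308 == -inf,
]
print(ordering_holds, exact_cases)
```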

6.2. Operations with NaNs. Two different kinds of NaN, trapping and nontrapping, shall be supported in all operations.

Trapping NaNs shall be reserved operands which precipitate an invalid operation exception (§8.1.1) or some other implementation-dependent exception for every operation listed in §5 that is performed upon them.⁵

Nontrapping NaNs shall obey the following rules; these NaNs should, by means left to the implementor's discretion, afford retrospective diagnostic information inherited from invalid or unavailable data and results.

For those operations specified to deliver floating-point results,

(a) every operation involving a trapping NaN or invalid operation (§8.1), if no trap occurs, shall set the invalid operation flag and deliver in place of its invalid result a nontrapping NaN;

(b) every operation involving one or two input NaNs, none of them trapping, shall raise no exception but deliver as a result either the same NaN (if operating upon just one) or one or the other of the input NaNs, according to an implementation-dependent precedence rule.

The operations not covered in this paragraph, namely those which do not deliver a floating-point result, are comparison (§5.6) and conversion to a format that has no NaNs (§5.4 and §5.5).
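Rule (b) can be observed with Python floats, whose NaNs behave as nontrapping NaNs in ordinary arithmetic. The sketch also shows one of the operations not covered by the rule: conversion to Python's integer type, a format that has no NaNs, which Python refuses by raising ValueError. That refusal is Python's own choice of response, shown here only as an example.

```python
import math

nan = float("nan")

# Arithmetic on a nontrapping NaN raises nothing and delivers a NaN.
results = [nan + 1.0, nan * 0.0, 2.0 - nan, nan / nan]
all_nan = all(math.isnan(r) for r in results)
print(all_nan)

# Conversion to a format without NaNs cannot deliver a NaN.
try:
    int(nan)
    conversion_refused = False
except ValueError:
    conversion_refused = True
print(conversion_refused)
```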

6.3. The sign bit. This standard says nothing about the sign of a NaN. Otherwise the sign of a product or quotient is the exclusive OR of the operands' signs; and the sign of a sum, or of a difference x − y regarded as a sum x + (−y), differs from at most one of the addends' signs. These rules shall apply even when operands or results are zero or infinite.

When the sum of two operands with opposite signs (or the difference of two operands with like signs) is exactly zero, either normal or unnormalized (see §7), the sign of that sum (or difference) shall be "+" in all rounding modes except round toward −∞, in which mode that sign shall be "−." However, x + x = x − (−x) retains the same sign as x even when x is zero.

A valid square root can have a negative sign only when the operand is −0.
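The sign rules for zero results can be tried out in Python, assuming its floats follow round to nearest (the other rounding modes are not reachable from the standard library). The helper name sign is hypothetical; it reads the sign bit through math.copysign because +0 and −0 compare equal.

```python
import math

def sign(x: float) -> str:
    """Return the sign bit of x as '+' or '-', distinguishing -0 from +0."""
    return "-" if math.copysign(1.0, x) < 0 else "+"

zero = 0.0
neg_zero = -zero
one = 1.0

signs = {
    "(+0) * (-3)": sign(zero * -3.0),            # exclusive OR of the operands' signs
    "1 - 1": sign(one - one),                    # exact zero difference, round to nearest
    "(+0) + (-0)": sign(zero + neg_zero),        # opposite signs, exact zero sum
    "(-0) + (-0)": sign(neg_zero + neg_zero),    # x + x keeps the sign of x
    "sqrt(-0)": sign(math.sqrt(neg_zero)),       # the only negative square root
}
print(signs)
```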

7. Unnormalized and denormalized arithmetic

The default⁶ mode of arithmetic, when at least one operand is not normalized, shall obey the following rules. Rounding and over/underflow handling are performed after the operations specified here and may modify the results. In the following specifications expon(x) refers to the unbiased exponent of x.

5. These NaNs afford arithmetic-like enhancements (such as complex-affine infinities, extremely wide range, etc.) that are not the subject of the standard. However, if there is no special trap designated and enabled for these NaNs, then the invalid operation exception is raised as specified in §8.1.1.

6. These default rules are analogous to those for normalized numbers, though they tend more toward excessive caution than optimal utility, and offer pipelined processors a faster but second-best alternative to providing an optional normalizing mode described later. More useful than these rules, but probably harder to implement, are the rules for significance arithmetic.

Preliminary-Subject to Revision    March 1981    59

(a) Add or subtract (z := x ± y): If at least one of the operands having exponent m, where m = max(expon(x), expon(y)), is normalized, then z shall be normalized before rounding. Otherwise expon(z) = m.

(b) Multiply (z := x × y): expon(z) = expon(x) + expon(y), with the same exceptions as noted in §7(c).

(c) Divide (z := x/y): expon(z) = expon(x) − expon(y) − 1 when y is normalized and nonzero, except that when only one of x and y is unnormalized and the other is normal 0 or ∞ the result z is the same (normal 0 or ∞ or invalid) as if the unnormalized operand were replaced by its normalized equivalent. Otherwise an exception shall be signalled as specified in §8.1.

(d) Remainder (z := x REM y): z shall be calculated as if x were first normalized.

(e) Square root is an invalid operation if its operand is not normalized.

(f) Conversion (z := x): expon(z) = expon(x).

(g) Integer conversion (z := IntegerPart(x)): If expon(x) > the number of fraction bits, then z shall be identically x. Otherwise z shall be normalized.

(h) Conversion of denormalized binary floating-point numbers to decimal forms representing values M × 10^N, where M and N are integers, should use leading zeros in the representation of M to indicate the degree to which the number is denormalized.

(i) Compare: comparisons shall be made as if both operands had first been normalized.

7.1. Normalizing mode. Another mode of arithmetic should be provided which normalizes all denormalized values before performing arithmetic with them, and hence precludes the creation of new unnormalized operands.⁷ This applies to all operations listed in §7. Unnormalized operands shall not be affected by this mode.

8. Exceptions

There are five types of exceptions that shall be detected. A trap under user control should be associated with each exception as specified in §9. The default response to an exception shall be to proceed without a trap. This standard specifies results to be delivered in both trapping and nontrapping situations. In some cases the result is different if a trap is enabled.

For each type of exception the implementation shall provide a status flag which shall be set on any occurrence of the corresponding exception when no trap occurs. It shall be reset only at the user's request. The user shall be able to test and to alter the status flags individually, and should further be able to save and restore all five at one time.

8.1. Invalid operation. There are two kinds of invalid operation exception. One, called invalid operand, arises if an operand is invalid for the operation to be performed. The other, called invalid result, arises if the result is invalid for the destination. The result to be delivered, when either kind of invalid operation exception occurs without a trap, shall be a nontrapping NaN (see §6.2).

8.1.1. Invalid operand. Invalid operation shall be signalled in the following cases:

(a) if any operand is a trapping NaN (see §6.2) and no other (implementation-dependent) trap is designated and enabled;

(b) addition or subtraction ∞ ± ∞ in projective mode and magnitude subtraction of infinities like (+∞) + (−∞) in affine mode;

(c) multiplication 0 × ∞;

(d) division 0/0, ∞/∞, or the divisor is not normalized and the dividend is finite and not normal zero;

(e) remainder x REM y, where y is zero or not normalized, or x is infinite;

(f) square root if the operand is less than zero, ∞ in the projective mode, or not normalized;

(g) conversion of a binary floating-point number to an integer or decimal format when overflow, infinity, or NaN precludes a faithful representation in that format and this cannot otherwise be signalled; and

(h) comparison via the predicates <, ≤, ≥, > or their negations, when the relation is "unordered."

A binary floating-point result to be delivered, when an invalid operation exception arises without a trap, shall be a nontrapping NaN.
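Several of the invalid operand cases can be reproduced in Python, which mixes the two styles of response: its arithmetic operators deliver a NaN silently, much like the untrapped response, while the math module raises ValueError, which plays a role loosely comparable to an enabled trap. This is a sketch of Python's behavior in affine arithmetic, not of any particular implementation of the proposal.

```python
import math

inf = math.inf

# Untrapped style: the operators deliver a NaN for cases (b), (c) and (d).
delivered_nan = [
    math.isnan(inf - inf),     # magnitude subtraction of infinities
    math.isnan(0.0 * inf),     # 0 x infinity
    math.isnan(inf / inf),     # infinity / infinity
]

# Trap-like style: the math module raises for cases (e) and (f).
attempts = {
    "1 REM 0": lambda: math.remainder(1.0, 0.0),
    "inf REM 1": lambda: math.remainder(inf, 1.0),
    "sqrt(-1)": lambda: math.sqrt(-1.0),
}
raised = {}
for label, attempt in attempts.items():
    try:
        attempt()
        raised[label] = False
    except ValueError:
        raised[label] = True
print(delivered_nan, raised)
```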

8.1.2. Invalid result. In any operation, when the result destined for a single or double format would be unnormalized but not denormalized, invalid operation shall be signalled.⁸ When an invalid result exception coincides with an overflow or inexact or trapped underflow exception, invalid result shall take precedence; but untrapped underflows cannot be invalid results because they are denormalized.

8.2. Division by zero. If the divisor is normal zero and the dividend is a finite nonzero number, then division by zero exception shall be signalled. The default result shall be a correctly signed ∞ (see §6.3).

7. In many computations the loss of significance due to denormalization is not consequential, and the invalid results (see §8.1.2) that arise in the warning mode are too pessimistic. Thus implementors are strongly encouraged to support the normalizing mode, except perhaps in pipelined array processors with an extended format for accumulation of intermediate sums. Use of the extended format, with its explicit leading significant bit, permits the calculation of intermediate products and quotients involving denormalized numbers without invalid result exceptions.

8. This can happen only in certain cases of the following operations:
(a) When an unnormalized extended is converted to a basic format.
(b) Except in the normalizing mode, when operations upon denormalized single format operands have double destinations.
(c) Except in the normalizing mode, when a denormalized number is magnified by multiplication or division and the destination's format is single or double.

8.3. Overflow. If a rounded result is finite and not an invalid result but its exponent is too large to represent in the target floating-point format, then overflow shall be signalled, unless the rounding mode is round toward +∞ or round toward −∞ and there is no trap on overflow. In the latter case overflows shall be rounded thus:

(a) Round toward −∞ carries normalized positive overflows to the format's largest number, and unnormalized positive overflows to the format's largest number's exponent without changing the significand.

(b) Round toward +∞ carries normalized negative overflows to the format's most negative number, and unnormalized negative overflows to the format's largest number's exponent without changing the significand.

In these two cases invalid result may arise, but only in conversions from extended to basic formats. All other cases of overflow without a trap shall yield ∞ with the appropriate sign.

If the overflow trap is enabled, overflow upon conversion shall deliver to the trap handler a result in the widest format supported but rounded to the destination's precision, except that, when the result of decimal to binary conversion lies outside that range, NaN shall be delivered. All other trapped overflows shall deliver to the trap handler a result with correctly rounded significand and modified exponent. The modified exponent is the correct exponent minus a bias adjust of 192 in the single format, 1536 in double, and 3 × 2^(n−2) in extended, where n is the number of bits in the exponent field.⁹
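The bias adjust arithmetic can be worked through in a few lines of Python. The sketch assumes exponent fields of 8 bits for single and 11 bits for double, widths that are not stated in this section but under which the extended-format formula reproduces the quoted constants 192 and 1536; the 15-bit extended width and all function names are purely hypothetical.

```python
# Assumed exponent field widths (not given in this section).
EXPONENT_BITS = {"single": 8, "double": 11, "extended (hypothetical)": 15}

def bias_adjust(n: int) -> int:
    """Bias adjust 3 x 2^(n-2) for an exponent field of n bits."""
    return 3 * 2 ** (n - 2)

def trapped_overflow_exponent(correct_exponent: int, n: int) -> int:
    """Exponent delivered to the trap handler on overflow: correct exponent minus the adjust."""
    return correct_exponent - bias_adjust(n)

adjusts = {name: bias_adjust(n) for name, n in EXPONENT_BITS.items()}
print(adjusts)

# A double result whose correct exponent is 1030 would be handed to the
# trap handler with exponent 1030 - 1536, near the middle of the range.
wrapped = trapped_overflow_exponent(1030, EXPONENT_BITS["double"])
print(wrapped)
```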

8.4. Underflow. Underflow occurs whenever

(a) a result which is not normal zero, when examined either before or after rounding at the implementor's option,¹⁰ is found to have too small an exponent to be represented in the destination's format without further denormalizing, or

(b) an extended format product or quotient with neither operand a normal zero, when examined either before or after rounding at the implementor's option, turns out to be indistinguishable from a normal zero. (Note that this cannot happen with normalized operands.)

When underflow occurs with no trap, the unrounded result shall be first denormalized, then rounded, then delivered to its destination; moreover the underflow flag shall be set to signal the event unless the rounding mode is round toward +∞ or −∞.

If the underflow trap is enabled, the result delivered to the trap handler shall be as specified for overflow in §8.3 except that the bias adjust is added rather than subtracted.
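The untrapped response, in which the result is denormalized and then rounded, can be observed with Python doubles on hardware that supports denormalized numbers. The sketch below shows a difference of two distinct normalized numbers that falls below the normalized range yet remains nonzero; Python exposes no underflow flag, so only the delivered values are shown.

```python
import sys

tiny = sys.float_info.min        # smallest normalized double
x = 1.5 * tiny
y = 1.25 * tiny

d = x - y                        # 0.25 * tiny, below the normalized range
difference_is_nonzero = d != 0.0
difference_is_denormalized = 0.0 < d < tiny
print(difference_is_nonzero, difference_is_denormalized)

# Halving the smallest denormalized number is a tie between it and zero;
# round to nearest delivers the even neighbor, which is zero.
smallest = 5e-324
rounds_to_zero = (smallest / 2) == 0.0
print(rounds_to_zero)
```

One consequence is that x − y is zero only when x = y, so a test for x ≠ y does protect a later division by x − y.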

9. The bias adjust is chosen to translate over/underflowed exponents as nearly as possible to the middle of the exponent range so that a trap handler can provide appropriate information for later reconstruction of the correct result.

10. To examine a number before rounding means to examine it as though it were first rounded toward zero.

8.5. Inexact. In the absence of an invalid operation exception, if the rounded result of an operation is not exact or if it overflows without a trap, then the inexact exception shall be signalled. The rounded or overflowed result shall be delivered to the destination.

9. Traps

A user should be able to request a trap on any of the five exceptions and to request that trap to be disabled. If the trap is disabled, then the corresponding exceptions shall be handled in the default manner specified in §8. If an exception is signalled for which the trap is enabled, then the execution of the program in which the exception occurred shall be suspended, a handling routine specified by the user shall be activated, and a result, if specified in §8, shall be delivered to the trap handler.

9.1. Trap handler. For each trap supported, the user shall be able to specify a trap handler having the capabilities of a subroutine that can return a value to be used in lieu of the exceptional operation's result; this result is undefined unless delivered by the trap handler. Similarly, that trapped exception's flag(s) may be undefined unless set or reset by the trap handler. When a system traps, the trap handler should be able to determine

(a) the type(s) of exception(s) that occurred on this operation,

(b) the kind of operation that was being performed,

(c) the destination's format,

(d) in overflow, underflow, inexact, and invalid result, the correctly rounded result including information that might not fit in the destination's format, and

(e) in invalid operand and divide by zero, the operand values.
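The interplay of status flags, default responses, and traps can be explored with Python's decimal module. This is an analogy only: decimal performs decimal rather than binary arithmetic and is not an implementation of this proposal, but its contexts offer sticky flags and user-enabled traps of a similar kind. The nine-digit precision is an arbitrary choice for the sketch.

```python
import decimal
from decimal import Decimal

# All traps disabled: every exception takes its default response.
ctx = decimal.Context(prec=9, traps=[])

third = ctx.divide(Decimal(1), Decimal(3))           # rounded, so inexact
inexact_seen = bool(ctx.flags[decimal.Inexact])

quotient = ctx.divide(Decimal(1), Decimal(0))        # default result: infinity
dbz_seen = bool(ctx.flags[decimal.DivisionByZero])
inexact_still_set = bool(ctx.flags[decimal.Inexact])  # flags stay set until reset

ctx.clear_flags()                                    # reset only at the user's request
flags_cleared = not any(ctx.flags.values())

# Enable one trap: the same operation now suspends normal execution.
ctx.traps[decimal.DivisionByZero] = True
try:
    ctx.divide(Decimal(1), Decimal(0))
    trapped = False
except decimal.DivisionByZero:
    trapped = True

print(third, quotient, inexact_seen, dbz_seen, inexact_still_set, flags_cleared, trapped)
```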

Appendix: Recommended functions and predicates

The following functions and predicates are recommended as aids to program portability across different systems, perhaps performing arithmetic very differently. They are described generically; that is, the types of the operands and results are inherent in the operands. Languages that require explicit typing will have corresponding families of functions and predicates.

(a) copysign(x, y) returns x with the sign of y. Hence, abs(x) = copysign(x, 1.0).

(b) −x is x with its sign reversed.

(c) scalb(x, N) returns the product of x and 2^N, for integral values N; this is accomplished by adding N to the exponent of x and then checking for exceptional conditions.

(d) logb(x) returns the unbiased exponent of x, a signed integer in the format of x, except that logb(0) is −∞, logb(∞) is +∞, and logb(NaN) is that NaN. When x is positive and finite, 1 ≤ scalb(x, −logb(x)) < 2 except when x is denormalized in the warning mode or unnormalized.


(e) nextafter(x, y) returns the next representable neighbor of x in the direction toward y. If x = y, or either x or y is ∞ in the projective mode or a NaN, then x is returned.

(f) finite(x) returns the value TRUE if −∞ < x < +∞ and returns FALSE otherwise.

(g) isnan(x), or equivalently x ≠ x, returns the value TRUE if x is a NaN and returns FALSE otherwise.

(h) x <> y is TRUE only when x < y or x > y, and is distinct from x ≠ y, which means NOT (x = y) and is never an invalid operation.

(i) unordered(x, y) returns the value TRUE if x is unordered with y and returns FALSE otherwise; this is never an invalid operation.
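Several of these functions have close counterparts in Python's math module, which allows the stated relations to be checked. In the sketch below, scalb and logb are hypothetical wrappers built on math.ldexp and math.frexp; this logb returns a Python integer rather than a value in the format of x, and it is meant only for positive, finite, normalized x.

```python
import math

def scalb(x: float, n: int) -> float:
    """Product of x and 2**n, via math.ldexp."""
    return math.ldexp(x, n)

def logb(x: float) -> int:
    """Unbiased exponent of a positive, finite, normalized x."""
    _, e = math.frexp(x)       # x = m * 2**e with 0.5 <= m < 1
    return e - 1

samples = [1.0, 0.75, 10.0, 6.02214076e23]
scaled = [scalb(x, -logb(x)) for x in samples]
in_range = all(1.0 <= s < 2.0 for s in scaled)       # 1 <= scalb(x, -logb(x)) < 2
print(scaled, in_range)

abs_via_copysign = math.copysign(-3.0, 1.0)          # abs(x) = copysign(x, 1.0)
finite_checks = (math.isfinite(1e308), math.isfinite(math.inf))
nan = float("nan")
isnan_checks = (math.isnan(nan), nan != nan)         # isnan(x) is equivalent to x != x
print(abs_via_copysign, finite_checks, isnan_checks)
```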

Preliminary-Subject to Revision

IEEE P754 voting committee members at time of adoption of the proposed draft

Andrew Allison, Los Altos Hills, California
William Ames, Hewlett-Packard Data Systems
Mike Arya, Cupertino, Calif.
Janis Baron, Intel
Dileep Bhandarkar, Digital Equipment Corporation
Joel Boney, Motorola
Jim Bunch, University of California, La Jolla
Ed Burdick, National Semiconductor
Paul Clemente, Prime Computer
W. J. Cody, Argonne National Laboratory
Jerome T. Coonen, University of California, Berkeley
Jim Crapuchettes, Menlo Computer Associates
Richard H. Delp, Four-Phase Systems
Alvin Despain, University of California, Berkeley
Tom Eggers, Digital Equipment Corporation
Dick Fateman, University of California, Berkeley
Don Feinberg, Digital Equipment Corporation
Stuart Feldman, Bell Laboratories
Eugene Fisher, Lawrence Livermore National Laboratory
Paul F. Flanagan, Analytical Mechanics
Gordon Force, Kylex
Lloyd Fosdick, University of Colorado
Robert Fraley, Hewlett-Packard Laboratories
Howard Fullmer, Parasitic Engineering
Daniel D. Gajski, University of Illinois, Urbana
David Gay, Massachusetts Institute of Technology
C. W. Gear, University of Illinois, Urbana
Martin Graham, University of California, Berkeley
David Gustavson, Stanford Linear Accelerator Center
Guy Haas, Datapoint
Chuck Hastings, Data General
David Hough, Apple Computer
John E. Howe, Intel
Thomas E. Hull, University of Toronto
Suren Irukulla, Prime Computer
Richard James III, Santa Clara, California
Paul S. Jensen, Lockheed Research Laboratory
William Kahan, University of California, Berkeley
Howard Kaikow, Nashua, New Hampshire
Dick Karpinski, University of California, San Francisco
Virginia Klema, Massachusetts Institute of Technology
Les Kohn, National Semiconductor
Dan Kuyper, Sperry Univac
M. Dundee Maples, M & E Associates
John Markiel, Westmont, New Jersey
Roy Martin, Apple Computer
Dean Miller, Motorola
Webb Miller, University of California, Santa Barbara
John C. Nash, Vanier, Ontario, Canada
Dan O'Dowd, National Semiconductor
Cash Olsen, Signetics
John F. Palmer, Intel
Beresford Parlett, University of California, Berkeley
Dave Patterson, University of California, Berkeley
Mary Payne, Digital Equipment Corporation
Tom Pittman, Itty Bitty Computers
Lewis Randall, Apple Computer
Robert Reid, Dunstable, Massachusetts
Christian Reinsch, Leibniz-Rech/Bay. Akad. Wiss.
Roger Stafford, Beckman Instruments
David Stevenson, Zilog
G. W. Stewart, University of Maryland
Robert G. Stewart, Stewart Research Enterprises
Harold Stone, University of Massachusetts
William D. Strecker, Digital Equipment Corporation
Robert Swarz, Digital Equipment Corporation
George Taylor, University of California, Berkeley
Dar-Sun Tsien, Intel
Greg Walker, Motorola
John Stephen Walther, Hewlett Packard Laboratories
P. C. Waterman, Burlington, Massachusetts
