
Antilogarithmic Converter

Date post: 04-Jun-2018
Upload: john-leons

    Published in IET Computers & Digital Techniques

    Received on 25th May 2011

    Revised on 15th January 2012

    doi: 10.1049/iet-cdt.2011.0089

    Special Issue: High-Performance Computing System

    Architectures: Design and Performance

    ISSN 1751-8601

    Decimal floating-point antilogarithmic converter based on selection by rounding: algorithm and architecture

    D. Chen, L. Han, S.-B. Ko

    Department of Electrical and Computer Engineering, University of Saskatchewan, Campus Drive 57, Saskatoon,

    SK S7N 5A9, Canada

    E-mail: [email protected]

    Abstract: This study presents the algorithm and architecture of a decimal floating-point (DFP) antilogarithmic converter based on the digit-recurrence algorithm with selection by rounding. The proposed approach can compute faithful DFP antilogarithmic results for any one of the three DFP formats specified in the IEEE 754-2008 standard. The proposed architecture is synthesised with an STM 90-nm standard cell library. The results show that the critical path delay and the number of clock cycles of the proposed Decimal64 antilogarithmic converter are 1.26 ns (28.0 FO4) and 19, respectively, and that the total hardware complexity is 29325 NAND2 gates. The delay estimation results show that the proposed architecture has a significantly lower latency than recently published high-performance decimal CORDIC implementations.

    1 Introduction

    Nowadays, there are many commercial demands for decimal floating-point (DFP) arithmetic operations, such as financial analysis, tax calculation, currency conversion, Internet-based applications and e-commerce [1]. This trend gives rise to further development of DFP arithmetic units that perform accurate computations with exact decimal operands. Owing to its significance, DFP arithmetic has been included in the specifications of the IEEE 754-2008 standard [2]. As the main part of a decimal microprocessor, the basic decimal arithmetic units, such as the decimal adder/subtracter, multiplier and divider, are attracting more and more researchers' attention. A complete survey of hardware designs for the basic decimal arithmetic units is given in [3]. Recently, hardware components for the basic DFP arithmetic units have been implemented in IBM's System z9 [4], POWER6 [5] and z10 [6] microprocessors.

    Transcendental functions, including logarithm, antilogarithm, exponential, reciprocal, sine, cosine, tangent, arctangent and so on, are useful arithmetic concepts in many areas of science and engineering, such as 3D computer graphics, scientific computation, artificial neural networks, digital signal processing and logarithmic number systems. Decimal transcendental function computation is also useful for specific applications, such as some computations used in financial applications in banks [7], scientific decimal calculators [8] and some pocket computers [9]. The decimal transcendental functions are specified in IEEE 754-2008 as recommended decimal arithmetic operations. Recently, Intel Corporation has provided the first software solution to compute the DFP transcendental functions using an existing and well-established binary floating-point (BFP) transcendental function mathematical library [10]. However, with stricter requirements on computational speed and accuracy in the future, hardware components may be included in high-end microprocessors to support decimal transcendental computation.

    Muller [11] presents both software- and hardware-oriented algorithms to compute transcendental functions, and discusses issues related to accurate BFP implementations of these functions. Hardware-oriented algorithms based on digit-recurrence with selection by rounding have been introduced for high-radix binary division and square-root [12–14], CORDIC [15], logarithm [16] and exponential [17], [33] operations. This method can efficiently decrease the cost of implementation, in particular the complexity of the selection function for redundant digits. We have presented in [17] a DFP logarithmic converter based on a radix-10 digit-recurrence algorithm with selection by rounding. In this paper, the same approach is applied to the DFP antilogarithmic converter in order to achieve faithful antilogarithmic results for DFP operands specified in IEEE 754-2008. The design described in this paper improves on our previous research presented in [18], and includes the following novelties: (i) using the redundant carry-save representation in the data-path; (ii) selecting redundant digits by rounding estimated residuals; (iii) retiming and balancing the delay of the proposed architecture; (iv) implementing novel subcomponents in the carry-save data-path; and (v) performing the normalisation, the parallel final addition and the rounding operation to produce DFP antilogarithmic results.

    This paper is organised as follows: Section 2 gives an overview of the DFP antilogarithm operation. In Section 3, the proposed algorithm and the error analysis for the DFP antilogarithm computation are presented. Section 4 describes the architecture of the proposed DFP antilogarithmic converter with details of its hardware implementation. In Section 5, we analyse the area-delay evaluation results of the proposed architecture, and then compare the performance of the proposed design with our previous design [18], the recent decimal CORDIC designs [19, 20] and the software implementation [10]. Section 6 gives conclusions.

    IET Comput. Digit. Tech., 2012, Vol. 6, Iss. 5, pp. 277–289

    doi: 10.1049/iet-cdt.2011.0089 © The Institution of Engineering and Technology 2012

    www.ietdl.org

    2 DFP antilogarithm operation

    IEEE 754-2008 specifies three DFP interchange formats: a 32-bit storage format called Decimal32, and 64-bit and 128-bit computational formats called Decimal64 and Decimal128, respectively. The value of a DFP operand v compliant with a DFP format is represented as follows

    $v = (-1)^{s} \times 10^{e} \times \mathrm{significand}$    (1)

    In (1), s is the 1-bit sign of the DFP operand. The real exponent, the (w + 2)-bit e, is calculated by subtracting an exponent bias from the value of the encoded exponent, where w is equal to 6, 8 and 12, respectively, in the three DFP formats. In IEEE 754-2008, emax and emin represent the maximum and minimum values of the real exponent, where emax is equal to +96, +384 and +6144, respectively, and emin is equal to −(emax − 1) for the three DFP formats. The significand is a q-digit non-normalised unsigned decimal fixed-point (DXP) number of the form d0, d1, d2, …, dq−1 with 0 ≤ di < 10, where q is equal to 7, 16 and 34, respectively, for the three DFP formats. In IEEE 754-2008, the decimal significand can be encoded with densely packed decimal (DPD) encoding [21] or binary integer decimal encoding [22]. In this paper, we choose the DFP format in DPD encoding so that the decimal significand of a DFP operand can be decoded to binary-coded decimal (BCD) representation in hardware.

    2.1 Exception handling

    A valid DFP antilogarithm operation is defined as

    $R = \mathrm{Antilog}_{10}(v) = 10^{v}$    (2)

    There are some exceptional cases that need to be dealt with during a DFP antilogarithm operation:

    If v is a NaN, the DFP antilogarithm operation returns NaN and signals the invalid operation exception.

    If v is a positive infinite operand, the antilogarithm operation simply returns +∞, and if v is a negative infinite operand, the antilogarithm operation simply returns +0 with no exception.

    If v is in the range (log10(|vmax|), +∞], the antilogarithm operation satisfies the condition of overflow and returns the maximum representable DFP operand or +∞, depending on the rounding mode.

    If v is in the range [−∞, log10(|vmin|)], the antilogarithm operation satisfies the condition of underflow and rounds the intermediate result down to zero or to the minimum representable DFP number, depending on the rounding mode.

    If v is in the range [log10(|vmin|), log10(|vmax|)], a normal DFP antilogarithm operation takes place. The rest of this paper details the computation on this interval.

    The DFP antilogarithm operation can be transformed into a decimal fixed-point (DXP) antilogarithm computation

    $R = 10^{(-1)^{s} \times 10^{e} \times d_0.d_1 d_2 \ldots d_{q-1}} = 10^{\pm v_{int}} \times 10^{\pm v_{frac}}$    (3)

    In (3), vint is an n-digit decimal integer in the range [emin − q + 1, emax], where n is equal to 3, 3 and 4 for the Decimal32, Decimal64 and Decimal128 formats, respectively. The sum of vint and k represents the real exponent of the DFP antilogarithmic result, where k is obtained by normalising the DXP result of 10^vfrac to the decimal significand of the DFP antilogarithmic result. Since a valid value of v can be very close to zero, the fraction vfrac can be represented by several leading zeros followed by the q-digit decimal significand, vfrac = ±0.00…00 d0 d1 … dq−1. Therefore vfrac is a decimal fraction in the range (−1, 1), which can be completely represented by at most (emin − q + 1) digits or at least (q − n) digits. Since the results of 10^vfrac are in the range (0.1, 10), a q-digit faithful DXP antilogarithmic result is enough to represent the q-digit decimal significand of the DFP antilogarithmic result. When v is very close to zero, the full vfrac cannot be processed in hardware because of the limited width of the data-path. Therefore vfrac is truncated to at least (q + 1) digits, which still guarantees a faithfully rounded DFP antilogarithmic result. In the following, we focus on the algorithm and the architecture of the (q + 1)-digit DXP antilogarithmic converter, which produces the q-digit faithful decimal significand of the DFP antilogarithmic result.
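
    The split of an operand into an integer part and a truncated fraction can be sketched in software. The following Python sketch is an illustration and not part of the paper: the function name `split_operand` and the use of the `decimal` module are assumptions, the operand is assumed to lie in the normal range (no exception handling), and the fraction is truncated to q + 1 fractional digits.

```python
from decimal import Decimal, ROUND_DOWN, localcontext


def split_operand(v: Decimal, q: int):
    """Split v into (v_int, v_frac) with v = v_int + v_frac and |v_frac| < 1.

    v_frac is truncated (towards zero) to q + 1 fractional digits.
    The name and interface are hypothetical, chosen for this sketch only.
    """
    with localcontext() as ctx:
        ctx.prec = 100  # wide enough that the operations below are exact
        v_int = v.to_integral_value(rounding=ROUND_DOWN)
        v_frac = v - v_int
        step = Decimal(1).scaleb(-(q + 1))  # 10^-(q+1)
        v_frac = v_frac.quantize(step, rounding=ROUND_DOWN)
    return int(v_int), v_frac
```

    For an operand as small as 1E-20 in Decimal64 (q = 16), the truncated fraction becomes zero, so the converter would return 10^0 = 1, which is still within 1 ulp of the exact result.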

    2.2 Rounding

    IEEE 754-2008 specifies five rounding modes [2]. A common requirement for the DFP antilogarithmic operation in IEEE 754-2008 is the ability to compute exactly rounded results (within 0.5 ulp of precision). To achieve exactly rounded results in any of the rounding modes, it must be determined whether the exact (infinite-precision) result is less than or greater than the midpoint between the two nearest DFP numbers. However, if the exact result is very close to the midpoint, exact rounding is difficult to perform unless the maximum length of the chain of nines or zeros after the rounding digit can be determined for every possible DFP result (the Table Maker's Dilemma [23]). On the other hand, providing additional guard digits before rounding can not only bring the error of the results much closer to a half-ulp, but also reduce the probability of incorrect rounding to near zero. In this paper, we mainly focus on delay optimisation of the proposed DFP antilogarithmic converter, so we design a digit-recurrence algorithm that achieves faithfully rounded results (within 1 ulp of precision) for the DFP antilogarithmic operation by using the roundTiesToEven mode.

    3 Algorithm

    A digit-recurrence algorithm to compute 10^vfrac is summarised as follows

    $\lim_{j \to \infty} \Big\{ v_{frac} - \sum_{i=1}^{j} \log_{10}(f_i) \Big\} = 0$    (4)

    If (4) is satisfied

    $\lim_{j \to \infty} \sum_{i=1}^{j} \log_{10}(f_i) = v_{frac}$    (5)

    Thus

    $10^{v_{frac}} = \prod_{j=1}^{\infty} f_j$    (6)

    fj is defined as fj = 1 + ej × 10^−j, by which vfrac is transformed to 0 through successive subtraction of log10(fj). This form of fj allows the use of a decimal shift-and-add implementation.

    According to (5) and (6), the corresponding recurrences for transforming vfrac and computing the antilogarithm are presented in (7) and (8), where j ≥ 1, L[1] = vfrac and E[1] = 1

    $L[j+1] = L[j] - \log_{10}(1 + e_j 10^{-j})$    (7)

    $E[j+1] = E[j] \times (1 + e_j 10^{-j})$    (8)

    The digits ej are selected so that L[j + 1] converges to 0. One digit of accuracy is therefore obtained in each iteration. After the last iteration of the recurrence, the results are

    $L[j+1] \approx 0$    (9)

    $E[j+1] \approx 10^{v_{frac}}$    (10)
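
    A short derivation, added here as a reading aid and not taken from the original text, shows why driving L[j] to zero yields the antilogarithm: the product E[j] × 10^L[j] is left unchanged by recurrences (7) and (8), whichever digit is selected.

```latex
% Invariant of recurrences (7) and (8)
E[j+1] \times 10^{L[j+1]}
  = E[j]\,(1 + e_j 10^{-j}) \times 10^{\,L[j] - \log_{10}(1 + e_j 10^{-j})}
  = E[j] \times 10^{L[j]}

% With E[1] = 1 and L[1] = v_{frac}:
E[j+1] \times 10^{L[j+1]} = 10^{v_{frac}}
  \quad\Longrightarrow\quad
E[j+1] = 10^{v_{frac}} \times 10^{-L[j+1]} \approx 10^{v_{frac}}
  \quad \text{when } L[j+1] \approx 0
```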

    To obtain a selection function for ej, a scaled remainder is defined in (11), where g is a scaling constant.

    $W[j] = 10^{j} \times L[j] \times g$    (11)

    Thus

    $L[j] = W[j] \times 10^{-j} \times g^{-1}$    (12)

    Substituting (12) into (7) gives

    $W[j+1] = 10\,W[j] - 10^{j+1} \times g \times \log_{10}(1 + e_j 10^{-j})$    (13)

    3.1 Selection by rounding

    The digit ej is selected by rounding the scaled residual to its integer part. To reduce the delay of the selection function, the rounding is performed on an estimate Ŵ[j], which is obtained by truncating W[j] to t fractional digits (that is, at the position 10^−t). The selection function is

    $e_j = \mathrm{round}(\hat{W}[j])$    (14)

    In (14), round indicates that if the digit of Ŵ[j] at the position 10^−1 is larger than or equal to 5, the digit ej is obtained by adding 1 to the integer part of Ŵ[j]; otherwise it is the integer part of Ŵ[j]. In this work, the selection by rounding is performed with the maximally redundant digit set ej ∈ {−9, −8, …, 0, …, 8, 9}. Since |ej| ≤ 9

    $-9.5 < \hat{W}[j] < 9.5$    (15)

    Since (15) must be satisfied, the range of W[j] is

    $-9.5 + \delta_t < W[j] < 9.5 + \delta_t$    (16)

    In (16), δt is the truncation error. Note that 0 ≤ δt < (10/9) × 10^−t for the sign-magnitude carry-save representation of W[j]. Therefore the bounds of W[j] − ej are

    $-0.5 < W[j] - e_j < 0.5 + \tfrac{10}{9} \times 10^{-t}$    (17)

    Equation (13) can be rewritten as

    $W[j+1] = 10\,(W[j] - e_j) - 10^{j+1} \times g \times \log_{10}(1 + e_j 10^{-j}) + 10\,e_j$    (18)

    To keep −9.5 < Ŵ[j + 1] < 9.5, we must keep

    $-9.5 + \tfrac{10}{9} \times 10^{-t} < W[j+1] < 9.5$    (19)

    According to (17), (18) and (19), the numerical analysis proceeds as follows

    $10^{j+1} \times g \times \log_{10}(1 + e_j 10^{-j}) - 10\,e_j > -4.5 + \tfrac{10}{9} \times 10^{-t+1}$    (20)

    $10^{j+1} \times g \times \log_{10}(1 + e_j 10^{-j}) - 10\,e_j < 4.5 - \tfrac{10}{9} \times 10^{-t}$    (21)
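
    Conditions (20) and (21) are easy to check numerically. The sketch below is an illustration rather than the authors' analysis: it assumes g = 2.3 and t = 1, works in double-precision floating point, and the helper names `selection_term` and `valid_digits` are hypothetical.

```python
import math

G = 2.3  # scaling constant g assumed in this sketch


def selection_term(j: int, e: int) -> float:
    """Left-hand side of (20) and (21): 10^(j+1) * g * log10(1 + e*10^-j) - 10*e."""
    # log1p keeps precision when e * 10^-j is tiny
    log10_term = math.log1p(e * 10.0 ** (-j)) / math.log(10.0)
    return 10.0 ** (j + 1) * G * log10_term - 10 * e


def valid_digits(j: int, t: int = 1) -> list:
    """Digits e in {-9, ..., 9} for which both (20) and (21) hold at iteration j."""
    lower = -4.5 + (10 / 9) * 10.0 ** (-t + 1)  # right-hand side of (20)
    upper = 4.5 - (10 / 9) * 10.0 ** (-t)       # right-hand side of (21)
    return [e for e in range(-9, 10) if lower < selection_term(j, e) < upper]
```

    Under these assumptions, `valid_digits(3)` returns the full digit set −9 to 9, whereas `valid_digits(2)` returns only −8 to 8, which is why the second digit needs a restricted range.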

    The numerical analysis shows that, when g = 2.3, conditions (20) and (21) are satisfied only if j ≥ 3 and t ≥ 1. Consequently, the selection by rounding is only valid for iterations j ≥ 3, and e1 and e2 could only be obtained from look-up tables. However, using two look-up tables for j = 1, 2 significantly increases the overall hardware cost. Therefore a restriction on e1 is defined so that e2 can be obtained by selection by rounding and one look-up table is saved. Since W[1] = 10 × 2.3 × vfrac, W[2] can be obtained as

    $W[2] = 230 \times v_{frac} - 10^{2} \times 2.3 \times \log_{10}(1 + e_1 10^{-1})$    (22)

    When j equals 2 and t equals 1, (20) and (21) are satisfied if e2 is in the range −8 ≤ e2 ≤ 8. Substituting −8 ≤ e2 ≤ 8 and t = 1 in (17) yields

    $-8.5 < W[2] < 8.5 + \tfrac{1}{9}$    (23)

    According to (22) and (23), we obtain

    $230 \times v_{frac} - 10^{2} \times 2.3 \times \log_{10}(1 + e_1 10^{-1}) < 8.5 + \tfrac{1}{9}$    (24)

    $230 \times v_{frac} - 10^{2} \times 2.3 \times \log_{10}(1 + e_1 10^{-1}) > -8.5$    (25)

    The numerical analysis of (24) and (25) shows that the decimal input operand vfrac must be restricted to the range −1.03 ≤ vfrac ≤ 0.31 so that e2 can be obtained by selection by rounding. Since vfrac is in the range −1 < vfrac < 1, a positive vfrac is first converted to a negative fraction by vfrac − 1, and its corresponding integer part vint is adjusted by vint + 1. Table 1 shows the selection of e1. Since a 1-digit e1 cannot produce continuous ranges in Table 1 that cover all negative vfrac, e1 is extended to two digits so that all negative vfrac are covered.
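
    The recurrence with selection by rounding can be modelled in software. The following Python sketch is a behavioural illustration under stated assumptions, not the hardware algorithm itself: it uses high-precision `decimal` arithmetic instead of a finite carry-save data-path, evaluates the logarithms exactly instead of reading look-up tables, takes g = 2.3 and t = 1, models truncation and rounding with floor operations, and uses hypothetical names throughout. The e1 values are those of Table 1.

```python
from decimal import Decimal, ROUND_FLOOR, localcontext

G = Decimal("2.3")  # scaling constant g

# Table 1 as (upper bound of |v_frac|, |e1|); every e1 is negative.
TABLE1 = [
    ("0.02", "0.0"), ("0.07", "1.0"), ("0.12", "2.0"), ("0.19", "3.0"),
    ("0.24", "4.0"), ("0.28", "4.5"), ("0.32", "5.0"), ("0.37", "5.5"),
    ("0.42", "6.0"), ("0.49", "6.5"), ("0.55", "7.0"), ("0.61", "7.4"),
    ("0.67", "7.7"), ("0.72", "8.0"), ("0.77", "8.2"), ("0.82", "8.4"),
    ("0.89", "8.6"), ("0.94", "8.8"), ("0.98", "8.9"), ("1.00", "9.0"),
]


def select_e1(v_frac: Decimal) -> Decimal:
    """Look up the 2-digit initial digit e1 for -1 < v_frac <= 0."""
    if not (-1 < v_frac <= 0):
        raise ValueError("v_frac must lie in (-1, 0]")
    for upper, magnitude in TABLE1:
        if -v_frac <= Decimal(upper):
            return -Decimal(magnitude)
    raise ValueError("no matching range")  # not reachable for valid input


def round_estimate(w: Decimal) -> int:
    """Truncate w to one fractional digit (t = 1), then round to an integer."""
    w_hat = (w * 10).to_integral_value(rounding=ROUND_FLOOR) / 10
    return int((w_hat + Decimal("0.5")).to_integral_value(rounding=ROUND_FLOOR))


def antilog_frac(v_frac: Decimal, q: int):
    """Return (E, digits) with E approximating 10^v_frac after q + 1 iterations."""
    with localcontext() as ctx:
        ctx.prec = 80  # far wider than any hardware data-path
        e1 = select_e1(v_frac)
        x = e1 / 10                                   # e1 * 10^-1
        w = 230 * v_frac - 100 * G * (1 + x).log10()  # W[2], equation (22)
        big_e = 1 + x                                 # E[2]
        digits = [e1]
        for j in range(2, q + 2):
            ej = round_estimate(w)                    # selection by rounding, (14)
            if abs(ej) > 9:
                raise ArithmeticError("residual left the range (-9.5, 9.5)")
            x = Decimal(ej).scaleb(-j)                # ej * 10^-j
            w = 10 * w - G * Decimal(10) ** (j + 1) * (1 + x).log10()  # (13)
            big_e = big_e * (1 + x)                   # (8)
            digits.append(ej)
        return +big_e, digits
```

    A call such as `antilog_frac(Decimal("-0.5"), 16)` returns the approximation of 10^−0.5 together with the 17 selected digits. Because this model keeps every intermediate value at 80 digits, it only exhibits the inherent error of the algorithm and not the quantisation effects analysed in Section 3.2.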

    3.2 Error analysis and evaluation

    The errors in the proposed antilogarithmic digit-recurrence algorithm arise in four ways. The first is the inherent error of the algorithm, εi, resulting from the difference between the antilogarithm results obtained from a finite number of iterations and the exact results obtained from infinite iterations. The second is the inexact input error, εv, produced by the difference between the antilogarithmic results of the truncated input and of the real input vfrac. The third is the quantisation error, εq, resulting from the finite precision of the intermediate values in the hardware implementation. The fourth is the final rounding error, εr, whose maximum value is 0.5 ulp (|εr| ≤ 0.5 × 10^−q). In order to achieve a q-digit decimal significand of the faithful DFP antilogarithmic result, the following condition must be satisfied

    $\varepsilon_t = \varepsilon_i + \varepsilon_v + \varepsilon_q \le 10^{-q}$    (26)

    3.2.1 Inherent error of algorithm: Since each DXP antilogarithmic result is obtained after (q + 1) iterations, εi can be defined as

    $\varepsilon_i = \prod_{j=1}^{\infty}(1 + e_j 10^{-j}) - \prod_{j=1}^{q+1}(1 + e_j 10^{-j})$    (27)

    Thus, (27) can be written as

    $\varepsilon_i = \prod_{j=1}^{\infty}(1 + e_j 10^{-j}) \times \left(1 - \frac{1}{\prod_{j=q+2}^{\infty}(1 + e_j 10^{-j})}\right)$    (28)

    In (28), since the proposed DXP antilogarithmic algorithm computes input values in the range (−1, 0), the exact antilogarithmic results, obtained after infinite iterations, are in the range (0.1, 1). To apply a static error analysis, we substitute the case ej = 9 or −9 and the maximum value of the exact antilogarithmic results into (28), which gives the maximum εi

    $\varepsilon_i \le 1 - \frac{1}{\prod_{j=q+2}^{\infty}(1 + 9 \times 10^{-j})}$    (29)

    In (29), it is obvious that

    $\prod_{j=q+2}^{\infty}(1 + 9 \times 10^{-j}) = e^{\sum_{j=q+2}^{\infty}\ln(1 + 9 \times 10^{-j})}$    (30)

    Since (30) is satisfied and

    $\sum_{j=q+2}^{\infty}\ln(1 + 9 \times 10^{-j}) < 9 \times (10^{-q-2} + 10^{-q-3} + \cdots)$    (31)

    we obtain

    $\prod_{j=q+2}^{\infty}(1 + 9 \times 10^{-j}) < e^{9 \times (10^{-q-2} + 10^{-q-3} + \cdots)}$    (32)

    Thus, the maximum absolute εi is

    $|\varepsilon_i| < 1 - \frac{1}{e^{9 \times (10^{-q-2} + 10^{-q-3} + \cdots)}} \approx 1 \times 10^{-q-1}$    (33)
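
    The last step in (33) follows from summing the geometric series in the exponent. The working below is added as a reading aid and is not part of the original derivation.

```latex
% Geometric series in the exponent of (32) and (33)
9 \times \sum_{k=q+2}^{\infty} 10^{-k}
  = 9 \times \frac{10^{-(q+2)}}{1 - 10^{-1}}
  = 10^{-(q+1)}

% First-order expansion of the bound
1 - e^{-10^{-(q+1)}}
  = 10^{-(q+1)} - \tfrac{1}{2}\,10^{-2(q+1)} + \cdots
  \approx 1 \times 10^{-q-1}
```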

    3.2.2 Inexact input error: If a DFP operand v is very close to zero, the full digit-width of vfrac = ±0.00…00 d0 d1 … dq−1 can be too long to be implemented. vfrac has to be truncated to at least (q + 1) digits in the DXP antilogarithmic operation; the truncated value is written v̂frac. Therefore the inexact input error can be defined as

    $\varepsilon_v = 10^{v_{frac}} - 10^{\hat{v}_{frac}}$    (34)

    The maximum εv is obtained when (i) vfrac consists of (q + 1) leading zeros followed by the q-digit decimal significand; and (ii) every decimal significand digit d0, d1, …, dq−1 equals 9

    $\varepsilon_v \le 10^{\pm 0.\underbrace{00\ldots00}_{q+1}\underbrace{99\ldots99}_{q}} - 10^{\pm 0.\underbrace{00\ldots00}_{q+1}}$    (35)

    Table 1 Selection of e1

    Range of vfrac     e1 (BCD)            Range of vfrac     e1 (BCD)
    [−0.00, −0.02]     −0.0 (00000000)     (−0.49, −0.55]     −7.0 (00110000)
    (−0.02, −0.07]     −1.0 (10010000)     (−0.55, −0.61]     −7.4 (00100110)
    (−0.07, −0.12]     −2.0 (10000000)     (−0.61, −0.67]     −7.7 (00100011)
    (−0.12, −0.19]     −3.0 (01110000)     (−0.67, −0.72]     −8.0 (00100000)
    (−0.19, −0.24]     −4.0 (01100000)     (−0.72, −0.77]     −8.2 (00011000)
    (−0.24, −0.28]     −4.5 (01010101)     (−0.77, −0.82]     −8.4 (00010110)
    (−0.28, −0.32]     −5.0 (01010000)     (−0.82, −0.89]     −8.6 (00010100)
    (−0.32, −0.37]     −5.5 (01000101)     (−0.89, −0.94]     −8.8 (00010010)
    (−0.37, −0.42]     −6.0 (01000000)     (−0.94, −0.98]     −8.9 (00010001)
    (−0.42, −0.49]     −6.5 (00110101)     (−0.98, −1.00)     −9.0 (00010000)


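    Equation (35) can be written as

    $\varepsilon_v \le \left(10^{\pm 0.\underbrace{99\ldots99}_{q}}\right)^{10^{-q-1}} - 1$    (36)

    Thus

    $\log_{10}(1 + \varepsilon_v) \le \pm 0.\underbrace{99\ldots99}_{q} \times 10^{-q-1}$    (37)

    According to the Taylor series expansion of the logarithm function log10(1 + x), we obtain

    $\left(\varepsilon_v - \frac{\varepsilon_v^{2}}{2} + \cdots\right) / \ln(10) < \varepsilon_v / \ln(10) \approx \pm 0.\underbrace{99\ldots99}_{q} \times 10^{-q-1}$    (38)

    Therefore the maximum absolute εv is

    $|\varepsilon_v| \le 2.303 \times 10^{-q-1}$    (39)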

    3.2.3 Quantisation error: Since only finite precision of the intermediate values is processed in the hardware implementation, a quantisation error is produced. In this paper, we define FD-digit as the minimal data-width of the fractional digits of each intermediate value. The DXP antilogarithmic results are obtained by (q + 1) successive multiplications

    $10^{v_{frac}} = \prod_{j=1}^{q+1}(1 + e_j 10^{-j})$    (40)

    The fractional digits of the intermediate multiplication results are held in carry-save representation, in which a carry may occur in the FD-th digit that is shifted out of the data-path in the first iteration. Therefore a truncation error of 10^−FD is produced from the first iteration onwards. After (q + 1) iterations, the maximum quantisation error, εq, can be represented as

    $\varepsilon_q = 10^{-FD} \times \left( \prod_{j=1}^{q+1}(1 + e_j 10^{-j}) + \cdots + \prod_{j=q+1}^{q+1}(1 + e_j 10^{-j}) + 1 \right)$    (41)

    By the same method as in (30), (31) and (32), each successive product in (41) satisfies

    $10^{-FD} \times \prod_{j=1}^{q+1}(1 + e_j 10^{-j}) < 10^{-FD} \times e^{\sum_{j=1}^{q+1} e_j 10^{-j}}$    (42)

    Thus, the maximal quantisation error, εq, satisfies

    $\varepsilon_q < 10^{-FD} \times \left( e^{\sum_{j=1}^{q+1} e_j 10^{-j}} + \cdots + e^{\sum_{j=q+1}^{q+1} e_j 10^{-j}} + 1 \right)$    (43)

    Considering the case ej = 9 or −9 in (43), we obtain the maximum absolute εq

    $|\varepsilon_q| < (q + 2) \times 10^{-FD}$    (44)

    3.2.4 Error evaluation: Having obtained εi, εv and εq in (33), (39) and (44), respectively, we obtain the maximum absolute error εt as

    $|\varepsilon_t| = |\varepsilon_i| + |\varepsilon_v| + |\varepsilon_q| \le 0.331 \times 10^{-q} + (q + 2) \times 10^{-FD}$    (45)

    We substitute the digit-width of the decimal significand of the three DFP formats, q = 7, 16 and 34, into (45). The results indicate that the maximum absolute errors |εt| obtained in the three DFP formats are smaller than 0.5 ulp, which satisfies condition (26). Thus, after the final rounding error is taken into account, the final rounded results meet the accuracy requirement of 1 ulp. Table 2 shows the error analysis for the three DFP interchange formats. It shows that the proposed algorithm guarantees q-digit accuracy for the DXP antilogarithm operation, and therefore a q-digit decimal significand of the faithful DFP antilogarithmic result, only when the minimal data-width of the fractional digits of each intermediate value (FD-digit) is at least (q + 2) digits for Decimal32 or (q + 3) digits for Decimal64 and Decimal128.
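
    The maximum errors listed in Table 2 can be reproduced directly from (45). The short Python sketch below is illustrative only; the function name `max_total_error` is hypothetical, and exact rational arithmetic is used so that the comparison with 0.5 ulp is not disturbed by floating-point rounding.

```python
from fractions import Fraction


def max_total_error(q: int, fd: int) -> Fraction:
    """Bound (45) expressed in units of 10^-q (one ulp of the q-digit result)."""
    inherent_plus_input = Fraction(331, 1000)          # 0.331, from (33) and (39)
    quantisation = (q + 2) * Fraction(10) ** (q - fd)  # (q + 2) * 10^-FD / 10^-q
    return inherent_plus_input + quantisation
```

    With FD = 9, 19 and 37 this gives 0.421, 0.349 and 0.367 for Decimal32, Decimal64 and Decimal128. Reducing FD to q + 2 = 18 for Decimal64 gives 0.511, which exceeds 0.5 ulp and explains the extra fractional digit in that format.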

    3.3 Guard digit of scaled residual

    Since the scaled residual W[j] is held with only finite precision in the hardware implementation, we need to analyse how many guard digits g are enough to prevent the rounding error of W[j], εw, from affecting the correct selection of the digits ej (in this section g denotes the number of guard digits, not the scaling constant of (11)). Since W[j] stays in the range (−9.5, 9.5), we define the digit-width of W[j] as (q + g + 3) digits, consisting of a three-digit integer part and a (q + g)-digit fraction part.

    The logarithm values −2.3 × log10(1 + ej × 10^−j) in (13) can be obtained by storing them in a look-up table. With an increasing number of iterations, however, the size of the table becomes prohibitively large. Therefore a method is needed that reduces the table size and the overall hardware requirement. The Taylor series expansion of the logarithm function log10(1 + x) is given in (46)

    $\log_{10}(1 + x) = \left(x - \frac{x^{2}}{2} + \cdots\right) / \ln(10)$    (46)

    After h iterations, the values of −2.3 × log10(1 + ej × 10^−j) do not need to be stored in the look-up table; the values −2.3 × ej × 10^−j/ln(10) are used as an approximation instead.

    Table 2 Error analysis of DFP antilogarithm for DFP interchange formats

    Format name                      Decimal32   Decimal64   Decimal128
    significand (q-digit)            7           16          34
    no. of iterations (q + 1)        8           17          35
    accuracy (q-digit)               7           16          34
    FD-digit                         9 (a)       19 (b)      37 (b)
    max. error |εt| (× 10^−q)        0.421       0.349       0.367

    (a) (q + 2)-digit
    (b) (q + 3)-digit

    In iterations j = 1 to j = q + 1, because (q + g + 3)-digit rounded values of −2.3 × log10(1 + ej × 10^−j) and −2.3 × ej × 10^−j/ln(10) are obtained from the look-up tables, a rounding error of ±0.5 × 10^(−q−g) is produced in each iteration. The maximum quantisation error, εwq, is

    $|\varepsilon_{wq}| \le \sum_{j=1}^{q+1} 0.5 \times 10^{-q-g}$    (47)

    In iterations j = h + 1 to j = q + 1, the value of −2.3 × log10(1 + ej × 10^−j) is approximated by the value of −2.3 × ej × 10^−j/ln(10). According to the series expansion of the logarithm function in (46), an approximation error, εwa, is therefore produced in each iteration

    $\varepsilon_{wa} = 2.3 \times \sum_{j=h+1}^{q+1} \left( -\frac{(e_j 10^{-j})^{2}}{2} + \frac{(e_j 10^{-j})^{3}}{3} - \cdots \right) / \ln(10)$    (48)

    We keep the term (ej × 10^−j)²/(2 ln(10)) to analyse εwa

    $|\varepsilon_{wa}| \approx 2.3 \times \sum_{j=h+1}^{q+1} \frac{(e_j 10^{-j})^{2}}{2} / \ln(10)$    (49)

    Considering the worst case (ej = 9 or −9), we obtain the maximum εwa

    $|\varepsilon_{wa}| \le 4.01 \times 10^{-2h-1}$    (50)

    Therefore, according to (13), after the (q + 1)th iteration, the truncation error of W[j], εw, is obtained as

    $|\varepsilon_w| \le 10^{q+1} \times (|\varepsilon_{wq}| + |\varepsilon_{wa}|) = (0.5q + 0.5) \times 10^{1-g} + 4.01 \times 10^{q-2h}$    (51)

    Since the digit ej is selected by rounding the scaled residual W[j] to its integer part in each iteration, εw must satisfy the condition εw < 1 in order to guarantee the correct selection of the digits ej. For the three DFP interchange formats, this condition is satisfied when h is equal to 4, 9 and 18 and g is equal to 2, 2 and 3, respectively.
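
    The choice of h and g can be checked against (51) with a few lines of Python. This is a sketch for illustration; the function name `residual_error_bound` is hypothetical and the constant 4.01 is taken from (50) as stated.

```python
from fractions import Fraction


def residual_error_bound(q: int, h: int, g: int) -> Fraction:
    """Right-hand side of (51) for q digits, h table iterations and g guard digits."""
    rounding_part = Fraction(q + 1, 2) * Fraction(10) ** (1 - g)
    approximation_part = Fraction(401, 100) * Fraction(10) ** (q - 2 * h)
    return rounding_part + approximation_part
```

    For the three formats the bound evaluates to 0.801, 0.8901 and 0.2151, all below 1. Lowering h or g by one in any format pushes the bound above 1, so under (51) the quoted values are the smallest that work.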

    4 Architecture

    Fig. 1 shows the top-level architecture of the proposed DFP antilogarithmic converter. Since issues such as exception handling and the packing and unpacking of the IEEE 754-2008 DFP format are straightforward, we only detail the architecture for the computation of the sign bit (Rsign), the real exponent (Rexp) and the decimal significand (Rsignificand) of the DFP antilogarithmic results in this paper. To represent signed decimal intermediate values, all variables in the architecture are represented in the 10's complement number system with BCD encoding. To speed up the execution of the recurrences, all intermediate values in the data-path are represented using the redundant decimal carry-save representation, where ssss and c represent a 1-digit sum and a 1-bit carry, respectively. As a consequence of this representation, the delays of the addition and the multiple operation in the recurrence are independent of the computational precision.
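
    The reason a carry-save addition has a delay independent of the word length is that no carry propagates further than one digit. The Python sketch below illustrates the idea on lists of digits; it is a behavioural model with hypothetical function names, not a description of the CSA cells used in the architecture.

```python
def carry_save_add(sum_digits, carry_bits, operand_digits):
    """Add a BCD operand to a decimal carry-save number, digit by digit.

    All lists hold the least significant digit first. Position i carries a
    sum digit (0..9) and a carry bit (0..1) of the same weight 10^i.
    """
    n = len(sum_digits)
    new_sum = [0] * n
    new_carry = [0] * n
    for i in range(n):
        total = sum_digits[i] + carry_bits[i] + operand_digits[i]  # at most 19
        new_sum[i] = total % 10
        if i + 1 < n:
            new_carry[i + 1] = total // 10  # carry out of the top digit is dropped
    return new_sum, new_carry


def carry_save_value(sum_digits, carry_bits):
    """Integer represented by a carry-save pair, modulo 10^n."""
    n = len(sum_digits)
    total = sum((s + c) * 10 ** i
                for i, (s, c) in enumerate(zip(sum_digits, carry_bits)))
    return total % 10 ** n
```

    Each output digit depends only on the three inputs at its own position, so every position can be computed in parallel. Applied to the (ms, mc) pair listed in Table 3, `carry_save_value` gives 8027448956924514840, which is the 10's complement form of 2.3 × vfrac for that example.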

    Fig. 1 Architecture of the proposed DFP antilogarithmic converter


    4.1 Data-path

    The data-path of the proposed architecture is pipelined and retimed into four stages in order to minimise and balance the critical path delay. The initial processing stage (stage 1) obtains the initial digit e1. The digit recurrence stage (stage 2) produces the remaining digits ej. The antilogarithm computation stage (stage 3) computes the (q + 3)-digit intermediate decimal significand of the DFP antilogarithmic result. Finally, the 1-bit Rsign, (w + 2)-bit Rexp and q-digit Rsignificand of the DFP antilogarithmic results are obtained in the final processing stage (stage 4). The cycle-based sequence of operations is summarised as follows:

    Stage 1, in the first clock cycle (in iteration j = 1): The 1-bit sign, the (w + 2)-bit real exponent and the q-digit non-normalised decimal significand are obtained as input operands from the input registers. The q-digit decimal significand and the (w + 2)-bit real exponent are processed in the range reduction logic to obtain the (q + 1)-digit DXP operand vfrac. Meanwhile, the value of vint is obtained in the vint generator and sent to stage 4 through a register. If the DFP operand is positive, the fraction part vfrac is adjusted to a negative fraction by vfrac − 1 in a 10's complement converter, and is then the input of the DXP antilogarithmic converter. Meanwhile, its corresponding integer part, vint, is adjusted by vint + 1 and sent to stage 4. The digit e1 is obtained from look-up table I based on the two most significant digits (MSDs) of vfrac. vfrac is multiplied by the 2-digit constant 2.3 in a multiple logic (Mult1) to obtain the (q + 3)-digit value m = 2.3 × vfrac in carry-save representation (ms, mc). The output m of Mult1 is shifted two digits to the left to obtain 10W[1] (W[1] = 10 × 2.3 × vfrac), and the corresponding value of −230 × log10(1 + e1 × 10^−1) is obtained from look-up table II. Then, the values of 10W[1] and −230 × log10(1 + e1 × 10^−1) are sent to stage 2 through registers.

    Stage 2, from the second to the (q + 1)th clock cycle (in iterations j = 2 to j = q + 1): In the second clock cycle, the residual W[j] is obtained by adding 10W[1] and −230 × log10(1 + e1 × 10^−1) in a decimal 3:2 CSA compressor. Then, the digit ej is obtained by rounding the 3-digit estimate Ŵ[j] in a rounding ej logic. This can be expressed by

    $(W_s[j], W_c[j]) = 10\,W[1] - 230 \times \log_{10}(1 + e_1 10^{-1})$

    $e_j = \mathrm{round}(\hat{W}[j])$

    W[j] in carry-save representation is shifted one digit to the left to obtain 10W[j], which is sent back to Mux2 for the next iteration. From iteration j = 2 to iteration j = h, the value of −2.3 × 10^(j+1) × log10(1 + ej × 10^−j) is obtained from look-up table II and sent back to Mux1 for the next iteration. This can be expressed by

    $(W_s[j+1], W_c[j+1]) = 10\,W[j] - 2.3 \times 10^{j+1} \times \log_{10}(1 + e_j 10^{-j})$

    $e_{j+1} = \mathrm{round}(\hat{W}[j+1])$

    From iteration j = h + 1 to iteration j = q + 1, the value of −2.3 × 10 × ej/ln(10) is obtained from look-up table III and sent back to Mux1. This can be expressed by

    $(W_s[j+1], W_c[j+1]) = 10\,W[j] - 2.3 \times 10 \times e_j / \ln(10)$

    $e_{j+1} = \mathrm{round}(\hat{W}[j+1])$

    After the (q + 1)th clock cycle, all the digits ej have been obtained by selection by rounding.

    Stage 3, from the second to the (q + 2)th clock cycle (in iterations j = 1 to j = q + 1): In the second clock cycle, the 2-digit e1 is concatenated with 9 and zeros, shifted one digit to the right in a barrel shifter to obtain e1 × 10^−1, and then selected by Mux4. Meanwhile, E[1] = 1 is selected by Mux5. The decimal significand result E[2] of the first iteration is obtained in the (q + 3)-digit decimal 4:2 CSA compressor. This can be expressed by

    $(E_s[2], E_c[2]) = 1 + e_1 10^{-1}$

    From the third to the (q + 2)th clock cycle, the intermediate value ej × E[j] output by a multiple logic (Mult2) is shifted j digits to the right in a barrel shifter to obtain ej × E[j] × 10^−j. The value of E[j] is selected by Mux5 for the computation of E[j + 1] in the next iteration. This can be expressed by

    $((e_j E[j] 10^{-j})_s, (e_j E[j] 10^{-j})_c) = e_j \times E[j] \times 10^{-j}$

    $(E_s[j+1], E_c[j+1]) = e_j E[j] 10^{-j} + E[j]$

    After the (q + 2)th clock cycle, the (q + 3)-digit decimal significand of the DFP antilogarithm result is obtained.

    Stage 4, in the (q + 3)th clock cycle: The sum and carry of the (q + 1) MSDs of the fractional part of Es[j] and Ec[j], Es and Ec, are added together in a q-digit decimal compound adder to obtain E. At the same time, E is rounded to the faithful decimal significand Rsignificand in a rounding logic, based on the increment inc at the rounding position. Since we consider the roundTiesToEven mode in this design, the rounding logic generates the increment inc based on

    $inc = \begin{cases} 1 & \text{if } r_d > 5 \text{ or } (r_d = 5 \text{ and } \mathrm{LSB}(L) = 1) \\ 0 & \text{if } r_d < 5 \text{ or } (r_d = 5 \text{ and } \mathrm{LSB}(L) = 0) \end{cases}$

    where rd represents the LSD of E[j]. The (w + 2)-bit exponent Rexp and the 1-bit sign Rsign are obtained in a sign & exponent generator.
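
    The increment rule can be written as a small function. In the Python sketch below the function and parameter names are hypothetical, and LSB(L) is assumed to mean the least significant bit of the last digit that is kept, which for a BCD digit indicates whether that digit is odd.

```python
def rounding_increment(rd: int, last_kept_digit: int) -> int:
    """roundTiesToEven increment from the rounding digit rd (0..9).

    last_kept_digit is the least significant digit that is kept; its LSB
    tells whether that digit is odd (assumed meaning of LSB(L)).
    """
    if rd > 5:
        return 1
    if rd == 5 and (last_kept_digit & 1) == 1:
        return 1
    return 0
```

    Only a single rounding digit is inspected, which is consistent with the goal of a faithful (1 ulp) rather than an exactly rounded result.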

    Table 3 shows some iterations of a 64-bit DFP antilogarithm operation executed in the proposed architecture.

    4.2 Hardware implementation

    The details of the hardware implementation for each stage of the proposed DFP antilogarithmic converter are presented in this section. In the rest of this paper, the symbols ⊕, ∧, ∨ and & represent the logical-XOR, logical-AND, logical-OR and logical-concatenation, respectively. The symbol $(A)^y_x$ refers to the yth bit in the xth digit position of a decimal number A, where the least significant bit (LSB) and the least significant digit (LSD) have the index 0. For example, $(W[j])^3_2$ is the third bit of the second digit in W[j]. The symbol $\overline{A}$ refers to the logic-NOT of a number A.


    4.2.1 Initial processing stage: Fig. 2 shows the details of the hardware implementation of the initial processing stage (stage 1).

    In stage 1: The range reduction logic consists of a decimal bi-directional barrel shifter, a 10's complement converter, a 9's complement converter, a trailing zeros detector and a 3-to-1 multiplexer. The $v_{int}$ generator consists of a leading-zero-counter (LZC), a decimal barrel shifter, an encoder, decimal add-one logic and a 3-to-1 multiplexer. The LZC is implemented based on [24], whereas the decimal barrel shifter is implemented by log2(q) levels of multiplexers. The trailing zeros detector is implemented by a prefix tree based on [24] to generate the control signals sel for the 3-to-1 multiplexer. Note that the overflow and underflow of the DFP antilogarithm operation can be detected by the $v_{int}$ generator, and the implementation of their detection is straightforward.

    Table 3 Example of a 64-bit DFP antilogarithm operation

    Input: v = (−1)^1 × 8576308882936892 × 10^−16; R = 10^(v_int + v_frac) = 10^0 × 10^(−0.8576308882936892); v_int = 0; v_frac = −0.85763088829368920; e_1 = −8.6; m = 2.3 × v_frac (m_s = 7926448956923414740, m_c = 0101000000001100100); h = 9; g = 2

    | Clock cycle | Residual quantity | Estimate | Value | Antilogarithm quantity | Value |
    |---|---|---|---|---|---|
    | 1st (first iteration) | Ws[1] | | 979.264489569234147400 | | |
    | | Wc[1] | | 001.010000000011001000 | | |
    | 2nd (second iteration) | 10Ws[1] | | 792.644895692341474000 | Es[1] | 1.000000000000000000 |
    | | 10Wc[1] | | 010.100000000110010000 | Ec[1] | 0.000000000000000000 |
    | | −2.3 log10(1 + e_1 10^−1) × 10^2 | | +196.390551794005254100 | (E[1] e_1 10^−1)s | 9.140000000000000000 |
    | | Ws[2] | 980 | 898.034346386456638100 | (E[1] e_1 10^−1)c | +0.000000000000000000 |
    | | Wc[2] | 011 | 101.101101100000100000 | Es[2] | 0.140000000000000000 |
    | | W[2] estimate −0.8, e_2 = −1 | | | Ec[2] | 0.000000000000000000 |
    | 3rd (third iteration) | 10Ws[2] | | 980.343463864566381000 | (E[2] e_2 10^−2)s | 9.887488888888888888 |
    | | 10Wc[2] | | 011.011011000001000000 | (E[2] e_2 10^−2)c | +0.111111111111111111 |
    | | −2.3 log10(1 + e_2 10^−2) × 10^3 | | +010.039052425635195000 | Es[3] | 9.038599999999999999 |
    | | Ws[3] | 013 | 001.383426289192476000 | Ec[3] | 1.100000000000000000 |
    | | Wc[3] | 000 | 000.010101001010100000 | | |
    | | W[3] estimate +1.3, e_3 = +1 | | | | |
    | ... | 10Ws[3] | | 013.834262891924760000 | | |
    | | 10Wc[3] | | 000.101010010101000000 | | |
    | | −2.3 log10(1 + e_3 10^−3) × 10^4 | | +990.026217975671270000 | | |
    | 10th (10th iteration) | Ws[10] | 021 | 002.199071104000000000 | (E[9] e_9 10^−9)s | 9.999999984725084930 |
    | | Wc[10] | 010 | 001.001011000000000000 | (E[9] e_9 10^−9)c | +0.000000011111110111 |
    | | W[10] estimate +3.1, e_10 = +3 | | | Es[10] | 0.028782483451741196 |
    | | | | | Ec[10] | 0.110011011010010000 |
    | 11th (11th iteration) | 10Ws[10] | | 021.990711040000000000 | | |
    | | 10Wc[10] | | 010.010110000000000000 | | |
    | | −2.3 × 10 × e_10 / ln(10) | | +970.033680748675623892 | | |
    | | Ws[11] | 019 | 001.933401788675623892 | | |
    | | Wc[11] | 001 | 000.101100000000000000 | | |
    | | W[11] estimate +2.0, e_11 = +2 | | | | |
    | ... | 10Ws[11] | | 019.334017886756238920 | | |
    | | 10Wc[11] | | 001.011000000000000000 | | |
    | | −2.3 × 10 × e_11 / ln(10) | | +980.022453832450415928 | | |
    | 17th (17th iteration) | Ws[17] | 036 | 993.644904860602385560 | Es[17] | 9.027693494806399868 |
    | | Wc[17] | 011 | 111.110100110111100000 | Ec[17] | 1.111110000100001100 |
    | | W[17] estimate +4.7, e_17 = +5 | | | | |
    | 18th | | | | (E[17] e_17 10^−17)s | 9.999999999999999906 |
    | | | | | (E[17] e_17 10^−17)c | +0.000000000000000100 |
    | | | | | Es[18] | 0.138682384895290964 |
    | | | | | Ec[18] | 0.000111110011110010 |
    | 19th | v_int = 0 | | | Es (last digit is rd) | .13868238489529096 |
    | | R_exp = v_int − 16 = −16 ("111110000") | | | Ec | .00011111001111001 |
    | | R_sign = 0 | | | E = Es + Ec + inc (compound addition) | .1387934949064010 |
    | | | | | R_significand | 1387934949064010 |

    The Estimate column lists the three digits (sign, integer, fraction MSD) of each residual word that are sent to the rounding logic. Rows marked "..." begin iterations whose remaining entries are not shown.


    The $e_1$ generator is implemented straightforwardly according to Table 1. The corresponding (q + g + 3)-digit $-230 \log_{10}(1 + e_1 10^{-1})$ is obtained from look-up table I. Since g is equal to 2, 2 and 3 for the three formats, the size of look-up table I is ($2^5 \times 48$)-bit, ($2^5 \times 84$)-bit and ($2^5 \times 160$)-bit for Decimal32, Decimal64 and Decimal128, respectively.

    The multiple logic (Mult1) is applied to compute $m = 2.3 \times v_{frac}$, where $v_{frac}$ is a (q + 1)-digit negative value in the range of $-1 < m \le 0$. Mult1 is implemented based on the partial product generation logic presented in [25]. The multiples are formed by adding two of an initial multiple set: −3m (obtained by adding −5m and 2m in a 3:2 CSA counter) and −20m (obtained by shifting 1-digit to the right of −2m). Both 2m and 5m can be generated with only a few logic delays. The Boolean equations for generating the double and quintuple of a BCD number are presented in [26]. To decrease the delay of the addition, two levels of decimal CSA adders are implemented to develop the multiples $m_s$ and $m_c$. The Boolean equation for computing the 1-digit decimal addition of BCD numbers is presented in [26]. The signals cin1 and cin2 are generated to supplement the LSD owing to the 9's complement conversion (−2m and −5m). The signals cin1 and cin2 are added in the LSD and the second LSD of the first level of CSA adders, respectively.

    4.2.2 Digit recurrence stage: Fig. 3 shows the details of the hardware implementation of the digit recurrence stage (stage 2).

    In stage 2: The 3:2 decimal CSA compressor, applied to obtain the residual ($W_s[j]$, $W_c[j]$), is implemented by one level of (q + g + 3)-digit 3:2 CSA counter. Then, the 1-digit sign ($S_s$, $S_c$), the 1-digit integer part ($I_s$, $I_c$) and the 1-digit fraction part ($f_s^{MSD}$, $f_c^{MSB}$) of the residual are sent to the rounding $e_j$ logic, which selects the digits $e_j$ by rounding the residual ($W_s[j]$, $W_c[j]$). The sign of the digit $e_j$ is obtained by the sign detector block, which is implemented based on the equation

    $$\mathrm{sign} = (S_s^0 \oplus S_c) \oplus (I_s^3 \wedge I_s^0 \wedge I_c)$$

    The 1-digit fraction ($f_s^{MSD}$, $f_c^{MSB}$) and the value of 5 are added together in the 1-digit decimal full adder to generate the signal carry, which determines the rounding operation. The signals carry and sign are sent to the selection generator to produce the control signal sel of a 4-to-1 multiplexer. The value of $|e_j|$ is obtained in four parallel full adders by adding the values 0, 1, 6 and 5 to the signals $f_s^{MSD}$ and $f_c^{MSB}$, respectively

    $$|e_j| = \begin{cases} I_s + I_c + 0 & \text{if sign} = 0 \wedge \text{carry} = 0 \\ I_s + I_c + 1 & \text{if sign} = 0 \wedge \text{carry} = 1 \\ I_s + I_c + 6 & \text{if sign} = 1 \wedge \text{carry} = 0 \\ I_s + I_c + 5 & \text{if sign} = 1 \wedge \text{carry} = 1 \end{cases}$$

    Thus, the digit $e_j$ is obtained by concatenating the 1-bit sign with the 1-digit $|e_j|$.
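A behavioural Python sketch of this selection is given below. It does not reproduce the gate-level sign detector or the four parallel adders. Instead it assumes that the two 3-digit estimates (sign digit, integer digit, fraction MSD) are added modulo 1000, interpreted in 10's complement as a value in tenths, and rounded by adding 5 to the fraction digit. The function name `select_digit` and this interpretation are assumptions made for illustration, checked only against the estimate columns of Table 3.

```python
def select_digit(ws_est: int, wc_est: int) -> int:
    """Select e_j from the 3-digit estimates of W_s[j] and W_c[j] (each read as 0..999)."""
    total = (ws_est + wc_est) % 1000                    # carry out of the sign digit is discarded
    tenths = total - 1000 if total >= 500 else total    # 10's complement to signed value, in tenths
    return (tenths + 5) // 10                           # add 5 to the fraction digit, keep the integer part


# Estimate pairs taken from Table 3 (leading zeros dropped for Python integer literals)
for ws, wc in [(980, 11), (13, 0), (21, 10), (19, 1), (36, 11)]:
    print(ws, wc, select_digit(ws, wc))
```

The five pairs give −1, +1, +3, +2 and +5, matching the digits e_2, e_3, e_10, e_11 and e_17 listed in Table 3.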

    Fig. 2 Details of hardware implementation of Stage 1 in Fig. 1

    Fig. 3 Details of hardware implementation of Stage 2 in Fig. 1


    The look-up table II stores all the (q + g + 3)-digit $-2.3 \times 10^{j+1} \times \log_{10}(1 + e_j 10^{-j})$, where j is in the range of $1 \le j \le h$. Since $e_j$ is in the range of $-9 \le e_j \le 9$, there are 18 different values (except for the value when $e_j = 0$) that need to be stored in the look-up table for each iteration. Since h is equal to 4, 9 and 18 and g is equal to 2, 2 and 3 for the three formats, the size of look-up table II is ($2^6 \times 48$)-bit, ($2^8 \times 84$)-bit and ($2^9 \times 160$)-bit for Decimal32, Decimal64 and Decimal128, respectively. In order to reduce the size and delay of look-up table II, the (q + g + 3)-digit $-2.3 \times 10^{j+1} \times \log_{10}(1 + e_j 10^{-j})$ can be efficiently reallocated into multiple tables. For Decimal64 shown in Fig. 3, the single look-up table II is split into two parts: the first part (TabII 1) stores all the values of $-2.3 \times 10^{j+1} \times \log_{10}(1 + e_j 10^{-j})$ when $2 \le j \le 9$ and $e_j = \pm 1$; the second part (TabII 2) stores the values when $2 \le j \le 9$ and either $2 \le e_j \le 9$ or $-9 \le e_j \le -2$. The sizes of TabII 1 and TabII 2 are ($2^4 \times 84$) and ($2^7 \times 84$) bits, respectively. Thus, the optimised size of look-up table II is reduced from 2.64 to 1.48 kB. Look-up table III stores 19 values of the (q + g + 3)-digit $-2.3 \times e_j / \ln(10) \times 10$; thus it is implemented by a $2^5 \times 84$-bit look-up table. Thus, the total optimised size of the look-up tables is about 2.14 kB for Decimal64. The implementations of the address generators that address look-up table II and look-up table III based on the values of j and $e_j$ are straightforward.
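The quoted table sizes can be cross-checked with a few lines of Python. The sketch assumes 4 bits per BCD digit, q = 16 significand digits for Decimal64 (7 and 34 for Decimal32 and Decimal128) and 1 kB = 1024 bytes; the helper names are hypothetical. Under these assumptions the computed sizes agree with the figures in the text to within about 0.02 kB.

```python
def word_bits(q: int, g: int) -> int:
    """Width in bits of one (q + g + 3)-digit BCD table entry."""
    return 4 * (q + g + 3)


def table_kb(entries: int, bits: int) -> float:
    """Size of a table with the given number of entries, in kB (1 kB = 1024 bytes assumed)."""
    return entries * bits / 8 / 1024


bits64 = word_bits(16, 2)                                      # Decimal64 entry width
tab1 = table_kb(2**5, bits64)                                  # look-up table I
tab2_single = table_kb(2**8, bits64)                           # look-up table II as one table
tab2_split = table_kb(2**4, bits64) + table_kb(2**7, bits64)   # TabII 1 + TabII 2
tab3 = table_kb(2**5, bits64)                                  # look-up table III
total = tab1 + tab2_split + tab3

print(bits64, round(tab2_single, 2), round(tab2_split, 2), round(total, 2))
```

The entry counts also follow from the text: 8 iterations × 2 digits = 16 entries for TabII 1 and 8 iterations × 16 digits = 128 entries for TabII 2.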

    4.2.3 Antilogarithm computation stage: Fig. 4 shows the details of the hardware implementation of the antilogarithm computation stage (stage 3).

    In stage 3: The multiple logic (Mult2) is applied to compute $E_s[j]e_j + E_c[j]e_j$, where $e_j$ is in the range of $-9 \le e_j \le 9$. The multiple $E_s[j]e_j$ is formed by adding two members of an initial multiple set {m, −m, 2m, −2m, 5m, −5m, 10m, −10m} selected by sel1 and sel2, which are generated by a recoder. The 4m logic can be generated by connecting two 2m logics in series. The signals cin1 and cin2 are generated by a recoder to supplement the LSD owing to the 10's complement conversion. The signals cin1 and cin2 are added in the LSD of the two levels of CSA adders, respectively. Since each digit of $E_c[j]$ is only zero or one, the signal $E_c[j]|e_j|$ can be obtained in a carry extend block, which can be implemented by a series of logical-AND gates. If the digit $e_j < 0$ (sign($e_j$) = 1), $E_c[j]e_j$ is obtained by the 9's complement conversion of $E_c[j]|e_j|$, and then the signal sign($e_j$) is supplemented in the LSD of the signal $(e_j E[j])_c$; otherwise, $E_c[j]e_j$ is directly obtained from the signal $E_c[j]|e_j|$.

    $$(E_c[j]e_j)^{3:0}_i = \begin{cases} E_c[j]_i \wedge e_j^{3:0} & \text{if sign}(e_j) = 0 \\ \text{9's com}(E_c[j]_i \wedge e_j^{3:0}) & \text{if sign}(e_j) = 1 \end{cases}$$

    Thus, $E[j]e_j$ ($(e_j E[j])_s$, $(e_j E[j])_c$) is obtained by adding $E_s[j]e_j$ and $E_c[j]e_j$ in a decimal CSA adder. Finally, $(E[j]e_j 10^{-j})_s$ and $(E[j]e_j 10^{-j})_c$ are obtained in a decimal barrel shifter. The 4:2 decimal CSA compressor, applied to add E[j] ($E_s[j]$, $E_c[j]$) and $E[j]e_j 10^{-j}$ ($(E[j]e_j 10^{-j})_s$, $(E[j]e_j 10^{-j})_c$) together, is implemented by two levels of (q + 3)-digit 3:2 CSA counters.
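The carry extend block described above can be sketched in Python. Because every digit of the carry word is 0 or 1, the product with $|e_j|$ is a per-digit AND-style selection, and a negative $e_j$ is handled by a 9's complement whose missing +1 is supplied separately at the LSD. Digit lists are most-significant-digit first, and the names `carry_times_digit` and `to_int` are hypothetical helpers, not identifiers from the paper.

```python
def carry_times_digit(ec_digits, e_j):
    """Multiply a carry word (digits 0 or 1, MSD first) by the signed digit e_j.

    For e_j < 0 the result is the 9's complement of Ec * |e_j|; adding 1 at the LSD
    (done elsewhere in the datapath) turns it into the 10's complement.
    """
    magnitude = abs(e_j)
    product = [magnitude if c else 0 for c in ec_digits]   # per-digit selection (AND gates)
    if e_j < 0:
        product = [9 - d for d in product]                 # 9's complement conversion
    return product


def to_int(digits):
    """Read an MSD-first digit list as an integer."""
    return int("".join(str(d) for d in digits))


neg = carry_times_digit([0, 1, 1], -3)
print(neg, (to_int(neg) + 1) % 1000)
```

In the example the carry word 011 multiplied by −3 gives the 9's complement 966, and adding the supplementary 1 gives 967, which is −33 modulo 1000.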

    4.2.4 Final processing stage: Fig. 5 shows the details of the hardware implementation for the final processing stage (stage 4).

    Fig. 4 Details of hardware implementation of Stage 3 in Fig. 1


    In stage 4: The decimal compound adder is implemented based on the conditional speculative method [27]. A prefix tree is implemented based on the binary Kogge–Stone network [28]. The additions $E_{s,i} + E_{c,i}$ and $E_{s,i} + E_{c,i} + 1$ can be implemented using three binary half adders and a binary full adder connected as a ripple carry chain. The logic for adding the value of 6 is used to correct the result to a valid BCD encoding, and it can be implemented using two binary half adders and two binary full adders. Since $R_{significand}$ may be rounded up to the value of 1 ($inc_{q+1} = 1$), $R_{significand}$ is directly set to one when this happens.
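The role of the +6 correction and of the two compound-adder outputs can be illustrated with a simple ripple-carry model in Python. This is a functional sketch only: it does not model the conditional speculative method of [27] or the Kogge–Stone prefix tree of [28], and the function names are illustrative assumptions. Digit lists are most-significant-digit first.

```python
def bcd_digit_add(a: int, b: int, carry_in: int = 0):
    """Add two BCD digits in binary; if the sum exceeds 9, add 6 and keep 4 bits."""
    binary_sum = a + b + carry_in
    if binary_sum > 9:
        return (binary_sum + 6) & 0xF, 1    # +6 skips the six unused 4-bit codes
    return binary_sum, 0


def bcd_add(a_digits, b_digits, carry_in=0):
    """Ripple-carry BCD addition of two equal-length digit lists (MSD first)."""
    result, carry = [], carry_in
    for a, b in zip(reversed(a_digits), reversed(b_digits)):
        digit, carry = bcd_digit_add(a, b, carry)
        result.append(digit)
    return result[::-1], carry


def compound_add(a_digits, b_digits):
    """Return (A + B, A + B + 1), the two results a compound adder makes available."""
    return bcd_add(a_digits, b_digits, 0), bcd_add(a_digits, b_digits, 1)


print(compound_add([1, 9, 9], [0, 0, 0]))
```

A compound adder of this kind lets the rounding logic pick the incremented result when inc = 1 without a second carry propagation.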

    A BCD to binary converter, which is used to convert the decimal value of $v_{int}$ to the (w + 2)-bit binary format, is implemented based on [29]. The (w + 2)-bit exponent $R_{exp}$ is selected based on the value of $inc_{q+1}$ in a 2-to-1 multiplexer

    $$R_{exp} = \begin{cases} v_{int} - 16 & \text{if } inc_{q+1} = 0 \\ v_{int} - 15 & \text{if } inc_{q+1} = 1 \end{cases}$$

    Since the DFP antilogarithm result is always positive, the sign bit ($R_{sign}$) is zero.

    5 Implementation and comparisons

    The proposed improved DFP antilogarithmic converter, which can compute operands in the Decimal32, Decimal64 and Decimal128 formats, was modelled in VHDL and then simulated using ModelSim. A comprehensive testbench, which includes special test cases (NaN, infinity, subnormal or zero operands), corner test cases and valid random DFP operands, was run to verify the correctness of the design. The proposed architectures were synthesised using Synopsys Design Compiler with the STM 90-nm CMOS standard cells library [30] under the typical condition (1.2 V VDD core voltage and 25 °C operating temperature). The clock, input signals and output signals are assumed to be ideal. Inputs and outputs of the proposed design are registered and the design is optimised for delay.

    The delay model is based on the logical effort method [31], which expresses the delay of the proposed architecture in a technology-independent unit, FO4 (the delay of an inverter of the minimum drive strength (1×) with a fanout of four 1× inverters). To measure the total hardware cost in terms of number of gates, the area of the proposed architecture is estimated as the number of equivalent 1× two-input NAND gates (NAND2). Note that 1 FO4 ≈ 45 ps and 1 NAND2 ≈ 4.4 μm² in the STM 90-nm CMOS standard cells library under the typical condition. Table 4 summarises the delay and the area (without the look-up tables) estimated using the area and delay evaluation model for the Decimal64 antilogarithmic converter. The synthesis report shows that the power consumption of the proposed DFP and DXP architectures is 26.1 and 16.9 mW, respectively. The worst path delay of each stage is highlighted in the corresponding figure by a dashed thick line. The evaluation results show that the critical path of the proposed architecture is located in stage 3 (highlighted in Fig. 4), and the details of the critical path in the Decimal64 implementation are reported in Table 5.

    The results of the proposed design are also compared with those of our previous design reported in [18]. The area and critical path delay of the previous 16-digit DXP antilogarithmic converter are about 140 732 μm² and 8.25 ns, respectively (synthesised with the TSMC 0.18 μm standard cell library in the typical condition; thus, 1 FO4 ≈ 75 ps and 1 NAND2 ≈ 10.0 μm²). The comparison results reported in Table 6 show that the improved DXP antilogarithmic converter based on the redundant data-path is about 3.91 times faster than the previous design based on the non-redundant data-path in terms of latency, at the expense of about 1.85 times the area.
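The technology normalisation used in this comparison amounts to two divisions, sketched below in Python. The helper names `to_fo4` and `to_nand2` are hypothetical; the numerical inputs are the values quoted in the text for the two cell libraries.

```python
def to_fo4(delay_ns: float, fo4_ps: float) -> float:
    """Express a delay in FO4 units for a library whose FO4 delay is fo4_ps picoseconds."""
    return delay_ns * 1000.0 / fo4_ps


def to_nand2(area_um2: float, nand2_um2: float) -> float:
    """Express an area as a number of equivalent two-input NAND gates."""
    return area_um2 / nand2_um2


print(round(to_fo4(1.26, 45), 1))       # proposed design, STM 90 nm
print(round(to_fo4(8.25, 75), 1))       # previous design [18], TSMC 0.18 um
print(round(to_nand2(140732, 10.0)))    # previous design [18] area in NAND2 gates
```

These conversions give 28.0 FO4, 110.0 FO4 and 14 073 NAND2, the values that appear for the two designs in Table 6.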

    Fig. 5 Details of hardware implementation of Stage 4 in Fig. 1

    Table 5 Details of critical path of the Decimal64 antilogarithmic converter

    | Block in the critical path | Reg | Mult2 | Mux | Shift | CSA 4:2 | Setup | Total |
    |---|---|---|---|---|---|---|---|
    | Delay (ns) | 0.07 | 0.53 | 0.06 | 0.23 | 0.29 | 0.08 | 1.26 |

    Table 4 Delay and area of Decimal64 antilogarithmic converter

    | Stage | Worst delay (FO4) | Area (NAND2) |
    |---|---|---|
    | initial processing stage (Fig. 2) | 25.3 | 7323 |
    | digit recurrence stage (Fig. 3) | 23.6 | 4667 |
    | antilog computation stage (Fig. 4) | 28.0 | 15 485 |
    | final processing stage (Fig. 5) | 18.6 | 1178 |
    | top-level control logic (FSM^a) | 3.5 | 672 |
    | total | 28.0^b | 29 325 |

    a FSM, finite-state machine
    b Critical path delay


    With respect to the existing works based on the CORDIC algorithm in [19, 20], it is quite difficult to compare the hardware performance of two different algorithms. In [19], only the number of clock cycles and the critical path delay are presented, which are 200 and 13 FO4 (taking the POWER6 processor as a reference), respectively. The comparison results reported in Table 6 show that the digit-recurrence approach proposed in this work is 4.89 times faster than the unit in [19] based on the CORDIC approach in terms of latency. Compared with the design in [20], which is an improved CORDIC fixed-point unit based on [19], the proposed architecture is 2.40 times faster, using 0.48 times the memory at the expense of 1.39 times the area.

    For further analysis, we compare the performance of the proposed architecture with the software approach reported in [10]. The software DFP transcendental function computation library is compiled by the Intel C++ Compiler (IA32 version 11.1) [32]. It takes about 1060 clock cycles to compute a Decimal64 exponential result on an Intel Core(TM) 2 Quad @ 2.66 GHz microprocessor. The comparison results reported in Table 6 show that the proposed hardware implementation is about 45.8 times faster than the software implementation.
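The latency figures behind these ratios are the product of cycle time and cycle count, as the following Python sketch shows using the values of Table 6. The dictionary keys and function names are illustrative. Note that the 2.40 ratio for [20] appears to be taken against the 16-digit DXP variant of the proposed design (504 FO4), whereas the other ratios use the DFP design (532 FO4) as the baseline; this reading is inferred from the table values.

```python
# (cycle time in FO4, number of cycles), taken from Table 6
designs = {
    "proposed (DFP)": (28.0, 19),
    "proposed (DXP)": (28.0, 18),
    "previous [18]": (110.0, 18),
    "CORDIC [20]": (34.62, 35),
    "CORDIC [19]": (13.0, 200),
    "software [10]": (23.0, 1060),
}


def latency_fo4(name: str) -> float:
    """Latency in FO4 = cycle time (FO4) x number of cycles."""
    cycle_time, cycles = designs[name]
    return cycle_time * cycles


def latency_ratio(name: str, baseline: str) -> float:
    """How many times longer the latency of `name` is than that of `baseline`."""
    return latency_fo4(name) / latency_fo4(baseline)


print(round(latency_ratio("CORDIC [19]", "proposed (DFP)"), 2))
print(round(latency_ratio("CORDIC [20]", "proposed (DXP)"), 2))
print(round(latency_ratio("software [10]", "proposed (DFP)"), 1))
```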

    6 Conclusions

    In this work, we presented a DFP antilogarithmic converter that is based on the digit-recurrence algorithm with selection by rounding. We developed the radix-10 algorithm, improved the architecture and implemented it with the STM 90-nm CMOS standard cells library. The implementation results show that the improved architecture is 3.91 times faster than our previous design [18] in terms of latency. To provide a reference for floating-point-unit designers when they consider a fast radix-10 implementation, we compared the proposed architecture with recent high-performance implementations based on the decimal CORDIC algorithm [19, 20]. Although a comparison between two different algorithms depends on many parameters, the design presented in this paper is 4.89 and 2.40 times faster in terms of latency than the units based on the CORDIC algorithm. In addition, compared with the software DFP transcendental function computation library [10], the proposed hardware implementation is about 45.8 times faster than the software implementation.

    7 Acknowledgments

    This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC). The authors would like to thank the anonymous reviewers for their valuable comments.

    8 References

    1 Cowlishaw, M.F.: 'Decimal floating-point: algorism for computers'. 16th IEEE Symp. on Computer Arithmetic (ARITH16), 2003, pp. 104–111

    2 IEEE Working Group of the Microprocessor Standards Subcommittee: 'IEEE 754-2008 Standard for Floating-Point Arithmetic' (August 2008)

    3 Wang, L.-K., Erle, M.A., Tsen, C., Schwarz, E.M., Schulte, M.J.: 'A survey of hardware designs for decimal arithmetic', J. IBM Res. Dev., 2010, 54, (3), pp. 8:1–8:15

    4 Duale, A.Y., Decker, M.H., Zipperer, H.-G., Aharoni, M., Bohizic, T.J.: 'Decimal floating-point in z9: an implementation and testing perspective', J. IBM Res. Dev., 2007, 51, (1/2), pp. 217–227

    5 Eisen, L., J.W.W. III, Tast, H.-W., et al.: 'IBM POWER6 accelerators: VMX and DFU', J. IBM Res. Dev., 2007, 51, (6), pp. 663–683

    6 Schwarz, E.M., Kapernick, J.S., Cowlishaw, M.F.: 'Decimal floating-point support on the IBM system z10 processor', J. IBM Res. Dev., 2009, 53, (1), pp. 4:1–4:10

    7 Harrison, J.: 'Presentation: decimal transcendentals via binary' (June 2009). http://www.ac.usc.es/arith19/sites/default/files/S7P2DecimalTranscendentalsViaBinary.pdf

    8 Kropa, J.C.: 'Calculator algorithms', Math. Mag., 1978, 51, (2), pp. 106–109

    9 Imbert, L., Muller, J.M., Rico, F.: 'A radix-10 BKM algorithm for computing transcendentals on pocket computers', J. VLSI Signal Process. Syst., 2000, 25, (2), pp. 179–186

    10 Harrison, J.: 'Decimal transcendentals via binary'. 19th IEEE Symp. on Computer Arithmetic (ARITH19), 2009, pp. 187–194

    11 Muller, J.M.: 'Elementary functions, algorithms and implementation' (Birkhauser Verlag, Boston, USA, 2005, 2nd edn.)

    12 Ercegovac, M.D., Lang, T., Montuschi, P.: 'Very high-radix division with prescaling and selection by rounding', IEEE Trans. Comput., 1994, 43, (8), pp. 909–918

    13 Lang, T., Montuschi, P.: 'Very-high radix square root with prescaling and rounding and a combined division/square root unit', IEEE Trans. Comput., 1999, 48, (8), pp. 827–841

    14 Antelo, E., Lang, T., Bruguera, J.D.: 'Computation of √(x/d) in a very-high radix combined division/square-root unit with scaling and selection by rounding', IEEE Trans. Comput., 1998, 47, (2), pp. 152–161

    15 Antelo, E., Lang, T., Bruguera, J.: 'High-radix CORDIC rotation based on selection by rounding', J. VLSI Signal Process. Syst., 2000, 25, (2), pp. 141–153

    16 Pineiro, A., Ercegovac, M.D., Bruguera, J.D.: 'High-radix logarithm with selection by rounding: algorithm and implementation', J. VLSI Signal Process. Syst., 2005, 40, (1), pp. 109–123

    17 Chen, D., Han, L., Choi, Y., Ko, S.: 'Improved decimal floating-point logarithmic converter based on selection by rounding', IEEE Trans. Comput., 2012, 61, (5), pp. 607–621

    18 Chen, D., Zhang, Y., Teng, D., Wahid, K., Lee, M.H., Ko, S.-B.: 'A new decimal antilogarithmic converter'. IEEE Symp. on Circuit and System (ISCAS09), 2009, pp. 445–448

    19 Vazquez, A., Villalba, J., Antelo, E., Zapata, E.L.: 'Redundant floating-point decimal CORDIC algorithm', IEEE Trans. Comput., PrePrint, 2012

    20 Kaivani, A., Jaberipur, G.: 'Decimal cordic rotation based on selection by rounding: algorithm and architecture', Comput. J., 2011, 54, (11), pp. 1798–1809

    Table 6 Comparison of the Decimal64 antilogarithmic converter with other designs

    | Works | Cycle time (FO4) | Cycles (No.) | Latency (FO4) | Ratio | Area (NAND2) | Ratio | ROM (kB) |
    |---|---|---|---|---|---|---|---|
    | proposed | 28.0 | 19 | 532.0 | 1.00 | 29 325 | 1.00 | 2.14 |
    | proposed^a | 28.0 | 18 | 504.0 | 0.95 | 26 197 | 0.89 | 2.14 |
    | previous [18]^a | 110.0 | 18 | 1980.0 | 3.72 | 14 073 | 0.48 | 2.14 |
    | CORDIC [20]^b | 34.62 | 35 | 1211.7 | 2.40 | 18 826 | 0.64 | 4.50 |
    | CORDIC [19] | 13.0 | 200 | 2600.0 | 4.89 | N/A | N/A | N/A |
    | software [10] | 23.0 | 1060 | 24 380 | 45.8 | N/A | N/A | N/A |

    Software library running on an Intel Core(TM) 2 Quad @ 2.66 GHz
    a 16-digit DXP antilogarithmic converter
    b 16-digit DXP CORDIC unit


    21 Cowlishaw, M.F.: 'Densely packed decimal encoding', J. IEE Comput. Digit. Tech., 2002, 149, (3), pp. 102–104

    22 Cornea, M., Harrison, J., Anderson, C., Tang, P.T.P., Schneider, E., Gvozdev, E.: 'A software implementation of the IEEE 754R decimal floating-point arithmetic using the binary encoding format', IEEE Trans. Comput., 2009, 58, (2), pp. 148–162

    23 Lefevre, V., Muller, J.M., Tisserand, A.: 'Toward correctly rounded transcendentals', IEEE Trans. Comput., 1998, 47, (11), pp. 1235–1243

    24 Oklobdzija, V.G.: 'An algorithmic and novel design of a leading zero detector circuit: comparison with logic synthesis', IEEE Trans. Very Large Scale Integr. (VLSI) Syst., 1994, 2, (1), pp. 124–128

    25 Lang, T., Nannarelli, A.: 'A radix-10 combinational multiplier'. IEEE Asilomar Conf. on Signals, Systems and Computers (ACSSC06), 2006, pp. 313–317

    26 Erle, M.A., Schulte, M.J.: 'Decimal multiplication via carry-save addition'. 14th IEEE Int. Conf. on Application-Specific Systems, Architectures, and Processors (ASAP03), 2003, pp. 348–358

    27 Vazquez, A., Antelo, E.: 'Conditional speculative decimal addition'. Seventh Conf. on Real Numbers and Computers (RNC 7), 2006, pp. 47–57

    28 Kogge, P.M., Stone, H.S.: 'A parallel algorithm for the efficient solution of a general class of recurrence equations', IEEE Trans. Comput., 1973, C-22, (8), pp. 786–793

    29 Deschamps, J.-P., Bioul, G.J.A., Sutter, G.D.: 'Synthesis of arithmetic circuits: FPGA, ASIC and embedded systems' (Wiley, 2006, 1st edn.)

    30 STMicroelectronics: '90 nm CMOS090 Design Platform', 2007

    31 Sutherland, I., Sproull, R., Harris, D.: 'Logical effort: designing fast CMOS circuits' (Morgan Kaufmann, 1999, 1st edn.)

    32 Intel Corporation: 'Using decimal floating-point with Intel C++ compiler', http://software.intel.com/en-us/articles/using-decimal-floating-point-with-intel-c-compiler, 2010

    33 Pineiro, A., Ercegovac, M.D., Bruguera, J.D.: 'Algorithm and architecture for logarithm, exponential, and powering computation', IEEE Trans. Comput., 2004, 53, (9), pp. 1085–1096
