Applying speculation techniques to implement functional units

Abstract¹— This paper justifies the use of carry estimation and prediction to increase the performance of functional units built by replicating full adders, while keeping the area penalty low. Adders and multipliers are the most representative modules in this group of functional units. These design techniques allow the implementation of modules with performance improvements ranging from 20% to 50% with an area overhead of only around 5%. These functional units are suitable for asynchronous circuits, but they can also be introduced in synchronous circuits by means of speculative techniques. The basic idea consists in estimating the carry out of some parts of the functional unit, allowing every part to operate independently and in parallel. These modules are then connected to build bigger ones. Simulation results show that for some applications it is possible to make predictions that are even more accurate than bit-based estimation. Predictions also have the advantage that they can be introduced in multiplier designs, whereas estimators cannot. These predictions are similar to the ones used for branch prediction in a processor.

Index Terms— functional units design, prediction techniques, high-performance.

I. INTRODUCTION

Addition is the key arithmetic operation in most digital circuits and processors. Therefore, their performance and other parameters, such as area and power consumption, are highly dependent on the features of their adders. Multipliers and other complex modules usually include a large number of adders. Although memory accesses are the main bottleneck in current processors [1], increasing adder performance becomes critical in the design of application-specific integrated circuits (ASICs).

Historically, there have been several proposals for implementing modules capable of executing additions. The first is the ripple carry adder (RCA), which is built by replicating a basic cell called the full adder; see figure 1 a). This is the simplest implementation of an adder. Its main problem is the length of the carry path, which limits its performance.
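As an illustration of why the carry path is the limiting factor, the following minimal Python sketch models a ripple carry adder at the bit level. The function names `full_adder` and `ripple_carry_add` are ours, chosen for illustration; the model is behavioural and says nothing about gate-level timing.

```python
def full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    """One full-adder cell: returns (sum_bit, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_carry_add(x: int, y: int, width: int, cin: int = 0) -> tuple[int, int]:
    """Add two width-bit operands by chaining full adders, LSB first.

    Each cell needs the carry of the previous one, so the carry
    path crosses all `width` cells in the worst case.
    """
    result = 0
    carry = cin
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry


print(ripple_carry_add(0b1110, 0b0010, width=4))  # (0, 1)
```

The loop-carried variable `carry` plays the role of the carry chain: no cell can finish before its less significant neighbour has produced its carry out.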

In order to reduce the carry path, the carry select adder [2-3] (CSA) was proposed; see figure 1 b). The basic idea is to divide the adder into several parts and replicate every most significant part, executing the most significant parts with carry in ‘0’ and ‘1’ simultaneously, and selecting the correct result once the real carry has been calculated. The final critical path will be roughly the size of the original adder divided by the number of fragments. Note the replication of the most significant modules.

¹ This work has been supported by the Spanish Government Research Grant TIN2005-5619.

The carry lookahead adder [4-5] (CLA) also reduces the carry path. It consists in dividing the adder into several modules and anticipating the carry in of every module by computing the carry out of the previous (less significant) module. With this technique the critical path becomes logarithmic with respect to the number of inputs; see figure 1 c).
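The carry select scheme described earlier can be sketched behaviourally as follows. This is a minimal two-fragment model under our own assumptions: the function name `carry_select_add` and the `split` parameter (width of the least significant fragment) are illustrative and do not come from the cited designs.

```python
def carry_select_add(x: int, y: int, width: int, split: int) -> tuple[int, int]:
    """Two-fragment carry select addition; returns (sum, carry_out)."""
    lo_mask = (1 << split) - 1
    hi_width = width - split
    hi_mask = (1 << hi_width) - 1

    # Least significant fragment: computed once, with carry in 0.
    lo = (x & lo_mask) + (y & lo_mask)
    lo_sum, lo_carry = lo & lo_mask, lo >> split

    # Most significant fragment: two replicated adders, one per carry in.
    hi_x, hi_y = (x >> split) & hi_mask, (y >> split) & hi_mask
    hi_if_0 = hi_x + hi_y
    hi_if_1 = hi_x + hi_y + 1

    # Multiplexer: the real carry of the low fragment selects the result.
    hi = hi_if_1 if lo_carry else hi_if_0
    return ((hi & hi_mask) << split) | lo_sum, hi >> hi_width


print(carry_select_add(0b10101010, 0b01110110, width=8, split=4))
```

The two candidate results `hi_if_0` and `hi_if_1` correspond to the replicated most significant module, which is the source of the area overhead of this technique.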

The use of these functional units in asynchronous designs presents some benefits in terms of delay-area product, lower power, improved electromagnetic emissions, superior adaptability, and elimination of the global clock signal [6-7]. The delay is reduced by including early completion logic. In spite of the area overhead of this logic, the delay-area product is also reduced. Besides, asynchronous designs lead to a more modular design style [7].

The paper is organized as follows. First we discuss the state of the art, and then we show an example to motivate our techniques. Afterwards we introduce our proposal and a possible improvement. Finally we present our experimental results and some conclusions.

Figure 1. a) 4-bit RCA, b) CSA with two fragments, c) 16-bit CLA with two levels of carry lookahead

Alberto A. Del Barrio, Maria C. Molina, Jose M. Mendias, Esther Andres, Roman Hermida, Francisco Tirado
Dpto. Arquitectura de Computadores y Automatica, Universidad Complutense de Madrid
{albertodbg, eandresp}@fdi.ucm.es, {cmolinap,mendias,rhermida,ptirado}@dacya.ucm.es

978-1-4244-2658-4/08/$25.00 ©2008 IEEE

II. PREVIOUS WORK

Previous adder implementations present different features. RCA is the simplest, the smallest, and the one that consumes the least power and energy. However, it is also the slowest. On the other hand, CSA and CLA are very fast, but they have a large overhead in terms of area, power and energy.

The estimated carry structure [8-9] (ESTC) uses both carry select and lookahead techniques. On the one hand, it divides the adder into several modules as the CSA does, but without replicating the most significant modules, because each one only performs its calculation with a single carry. On the other hand, this carry is estimated from the most significant bits of the previous module. If the estimation is correct, the execution time will be half that of an adder with the same width. If not, the execution time will be roughly the same as if the modules had been connected with the ripple carry technique. See figure 2 a). This estimation resembles the lookahead of the CLA, because in most cases we are only anticipating the carry. Note that this technique allows different kinds of adders to be used for the internal modules that compose the adder. For example, we can use two CLA modules and connect them with the ESTC technique.

The most interesting feature of this design appears when the carry estimation fails. In this case the result is corrected in the same most significant module, so there is no need to replicate it. The only hardware overhead is due to the carry estimation and the control and delay logic for generating the “Done” signal. See figure 2 b).

In this way, ESTC adders have an area and power consumption only slightly higher than those of the underlying adder, because the replicated most significant modules and their associated multiplexers are saved, while their average performance is only slightly lower than if they had been built with a pure carry select technique.

The behavior of ESTC adders is asynchronous. One of the advantages of asynchronous modules is that the execution time can be taken as the average delay, instead of the worst-case delay that applies to synchronous modules. Therefore, the objective is to make the average delay as close as possible to the best case, and so the percentage of hits in the carry estimation becomes critical. However, most circuits today are synchronous.

Estimating carries [8] consists in using the most significant bits of the previous module to decide the carry in of the following module. If we only consider the most significant bit of every operand, we have a 75% probability of guessing the real carry. That is, the carry is certainly ‘1’ if both bits are ‘1’ and certainly ‘0’ if both bits are ‘0’; if they are different, we will guess correctly in half of the cases. See figure 2 c). Wallace et al. applied the same reasoning [9] to increase the number of bits used for estimating the carry, reaching a hit rate of more than 95% with 4 bits per operand. See figure 2 d). The drawback is the increase in area, estimation delay and energy consumption caused by the additional hardware. Similar forwarding techniques have also been used in [17] for reducing pipeline delay.
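The hit rates quoted above can be checked by exhaustive enumeration. The sketch below is only an illustration under stated assumptions: operands are uniformly distributed, the module is 8 bits wide, and when every inspected bit pair differs the estimator falls back to a static guess of ‘0’. The names `estimate_carry` and `hit_rate` are ours.

```python
from itertools import product


def estimate_carry(x: int, y: int, width: int, k: int) -> int:
    """Estimate the carry out of a width-bit module from its k top bit pairs."""
    for i in range(width - 1, width - 1 - k, -1):
        a, b = (x >> i) & 1, (y >> i) & 1
        if a == b:
            # Both '1': carry generated. Both '0': carry killed.
            return a
    # All inspected pairs propagate: static guess (assumption: '0').
    return 0


def hit_rate(width: int, k: int) -> float:
    """Fraction of all operand pairs whose carry out is estimated correctly."""
    hits = 0
    for x, y in product(range(1 << width), repeat=2):
        true_carry = (x + y) >> width
        hits += estimate_carry(x, y, width, k) == true_carry
    return hits / (1 << (2 * width))


for k in (1, 2, 3, 4):
    print(k, round(hit_rate(8, k), 4))
```

Under these assumptions, one bit per operand gives a hit rate of about 75% and four bits per operand give more than 95%, in line with the figures quoted in the text. Real workloads are not uniformly distributed, which is precisely what the prediction techniques studied later try to exploit.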

The goal of our work is to study other ways of reaching high hit rates while reducing the area and delay overhead. Our idea is to overcome the dataflow limits [10] of the carry path, but in a synchronous context.

Figure 2. a) ESTC scheme design, b) delay control design, c) probability of the estimated carry (CE) being equal to the true carry (CT), and carry estimation implementation, d) prediction rate versus number of bits used, and fail rate formula using n bits: $\%\mathrm{fail} = (1/2^{n}) \times 100$

III. MOTIVATIONAL EXAMPLE

It is well known that data tends to be repeated throughout programs (data locality). In [12-13] Lipasti et al. introduce this concept and study it for the load/store instructions of a processor.

As explained before, predictive adders are not well suited to processors, because memory accesses are the bottleneck and determine the cycle time, but the idea is the same. Nevertheless, note that if the content of the memory is repeated, the additions that calculate the effective address will also be repeated, or at least the address range will usually be the same, so the carry that interconnects the adder modules will very likely be the same.

There are some applications, such as DSP ones [14], whose data is more likely to be repeated in the near future, because it always ranges between two well-defined bounds. For example, ADPCM [15] is an algorithm for voice compression and decompression. In a common telephone conversation there will be many silences, which are coded as zero values, so every addition involved will almost surely produce a zero carry out.

In order to illustrate our idea, let us consider the examples presented in table I, which compare a Wallace estimation with 1 and 2 bits against a 1-bit predictor [16] like those used in the BTB for branch instructions. Based on the profiling technique [11], we calculate how many times the carry out is guessed.

The 1-bit Wallace estimation uses the most significant bit of every operand, the 2-bit Wallace estimation uses the two most significant bits, and the 1-bit predictor keeps the last carry out. Note that for the first addition we suppose the initial prediction is ‘0’.

In table I a) we have a random sequence of addition fragments. The objective is to predict the carry out of every fragment. Note that the 1-bit predictor fails the first time the carry changes to ‘1’ and afterwards always guesses the carry out, while the Wallace estimators keep failing until the most significant bits present a combination that allows them to deduce the correct carry. Obviously, the 1-bit Wallace estimator is less accurate than the 2-bit one and fails more often.
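The behaviour described for table I a) can be reproduced with a few lines of Python. The sketch below is illustrative only: the helper names `wallace_estimate` and `run_sequence` are ours, and the estimator is assumed to answer ‘0’ whenever the inspected bit pairs do not determine the carry.

```python
def wallace_estimate(x: int, y: int, width: int, k: int) -> int:
    """Carry estimate from the k most significant bit pairs of a fragment."""
    for i in range(width - 1, width - 1 - k, -1):
        a, b = (x >> i) & 1, (y >> i) & 1
        if a == b:
            return a  # carry generated (1,1) or killed (0,0)
    return 0  # undetermined: assumed static guess of '0'


def run_sequence(pairs, width=4):
    """Return one row (W1 CE, W2 CE, predictor CE, CT) per addition."""
    last_carry = 0  # initial 1-bit prediction is '0', as stated in the text
    rows = []
    for x, y in pairs:
        ct = (x + y) >> width
        rows.append((wallace_estimate(x, y, width, 1),
                     wallace_estimate(x, y, width, 2),
                     last_carry,
                     ct))
        last_carry = ct  # the 1-bit predictor keeps the last true carry
    return rows


sequence = [(0b1110, 0b0001), (0b1110, 0b0010), (0b1110, 0b0011),
            (0b1110, 0b0100), (0b1110, 0b0101)]
for row in run_sequence(sequence):
    print(row)
```

For this sequence the 1-bit predictor misses only once (the second addition), the 2-bit estimator misses twice and the 1-bit estimator misses four times, which matches the CE and CT columns of table I a).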

In table I b) we have a sequence of 16-bit additions from the mix module of the ADPCM, obtained after simulating the decoder algorithm for the reference input example clinton.adpcm. If we implement the required adder with two similar 8-bit modules, the Wallace estimators will use the 7th and 6th bits of the least significant part (columns 2 and 3) to decide the estimated carry. In this case, the 1-bit predictor only fails on the first addition and guesses the carry for the following ones, while the Wallace estimators produce more errors when we are adding a range of values that always produces the value ‘1’ for the carry out signal.

Note the similarity between the operands in the case shown in table I b). This value locality justifies the prediction techniques. Note also that the Wallace techniques are not 100% predictive. They only make a true prediction when they cannot deduce the correct carry from the most significant bits, and in these cases they always predict a ‘0’ value, like a static zero prediction. In the other cases they really know the carry because of the operand values. So, on the one hand, we can consider the Wallace technique as a special kind of forwarding or lookahead whose delay, area and percentage of sure (not predictive) carry hits grow as we increase the number of bits used for anticipating the carry, and on the other hand as an improved version of a static zero prediction.

IV. PROPOSED TECHNIQUE

The proposed prediction technique consists in using branch predictors, instead of the Wallace estimators, inside functional units to increase their performance without compromising their area. Large prediction structures obviously produce large hardware overheads, which has motivated our decision to use simple predictors.

Table I. a) Estimated carry (CE) and real carry out (CT) for a random sequence of addition fragments, b) CE and CT for an addition sequence in the mix ADPCM module, supposing the adder is divided into two modules

a)
Example          Wallace 1 bit    Wallace 2 bits    1 bit predictor
                 CE   CT          CE   CT           CE   CT
1110 + 0001      0    0           0    0            0    0
1110 + 0010      0    1           0    1            0    1
1110 + 0011      0    1           0    1            1    1
1110 + 0100      0    1           1    1            1    1
1110 + 0101      0    1           1    1            1    1

b)
Example                                  7th-6th bits     Wallace 1 bit    Wallace 2 bits    1 bit predictor
                                         1st op  2nd op   CE   CT          CE   CT           CE   CT
0000001010101010 + 0011111101110110      10      11       1    1           1    1            0    1
0000001010101111 + 0000000101000001      10      01       0    1           0    1            1    1
0000001010101101 + 0011111101110011      10      11       1    1           1    1            1    1
0000001010110010 + 0000000100010010      10      01       0    1           0    1            1    1
0000001010101111 + 0011111101110001      10      11       1    1           1    1            1    1

Predicting a carry means deciding whether a bit is ‘0’ or ‘1’. This prediction is quite similar to deciding whether a branch is taken or not taken. We have considered the following predictors [16]:

1) One bit predictor. This is the simplest one. It can be implemented with just a D-latch. The predicted carry is the last real carry.

2) Two bit predictor or bimodal predictor. It consists of a finite state machine with four states. In a branch context, states “00” and “01” mean strongly not taken and not taken, while “10” and “11” mean taken and strongly taken. If we change our point of view to carry out prediction, we predict a ‘0’ value for states “00” and “01”, and a ‘1’ value for the other states.

3) History predictor. This predictor decides the following carry out from a fixed number of previous carries, after applying some function to these bits. In our case we have chosen 3 bits of carry history, and the decision function is the majority function. For example, if the last three true carries (CT) are ‘0’, ‘1’, ‘0’ we predict a ‘0’.

4) Contextual predictor. This predictor decides the next carry out according to carry patterns and some history bits. In our case we have used two bits of carry history. For example, if the last two carries are ‘0’ and ‘0’, the pattern is “00”. Therefore, the next carry out is predicted to be the last real carry that followed the pattern “00”.

Figure 3 shows the truth tables of every predictor, and figure 4 illustrates the controller schemes for the prediction techniques.
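The four predictors listed above can be sketched in software as follows. This is a behavioural illustration, not the hardware of figures 3 and 4: the class names, the all-zero initial states, and the saturating-counter transitions of the two bit predictor are our assumptions, since the text only specifies the output of each state.

```python
class OneBitPredictor:
    """Predicts the last real carry (a single latch)."""
    def __init__(self):
        self.last = 0
    def predict(self):
        return self.last
    def update(self, carry):
        self.last = carry


class TwoBitPredictor:
    """Four states; 0 and 1 predict '0', 2 and 3 predict '1'."""
    def __init__(self):
        self.state = 0
    def predict(self):
        return 1 if self.state >= 2 else 0
    def update(self, carry):
        # Assumed saturating-counter transitions.
        self.state = min(3, self.state + 1) if carry else max(0, self.state - 1)


class HistoryPredictor:
    """Majority vote over the last three real carries."""
    def __init__(self):
        self.history = [0, 0, 0]
    def predict(self):
        return 1 if sum(self.history) >= 2 else 0
    def update(self, carry):
        self.history = self.history[1:] + [carry]


class ContextualPredictor:
    """For each 2-bit carry pattern, remembers the carry that followed it last."""
    def __init__(self):
        self.history = (0, 0)
        self.table = {(a, b): 0 for a in (0, 1) for b in (0, 1)}
    def predict(self):
        return self.table[self.history]
    def update(self, carry):
        self.table[self.history] = carry
        self.history = (self.history[1], carry)


def hit_rate(predictor, carries):
    """Fraction of carries in the sequence that the predictor guesses."""
    hits = 0
    for c in carries:
        hits += predictor.predict() == c
        predictor.update(c)
    return hits / len(carries)


carries = [0, 1, 1, 1, 1]
for cls in (OneBitPredictor, TwoBitPredictor, HistoryPredictor, ContextualPredictor):
    print(cls.__name__, hit_rate(cls(), carries))
```

On a very short sequence such as this one the simplest predictor adapts fastest; the richer predictors need a warm-up period, so comparisons are only meaningful on long traces like the ones used in the experimental section.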

The problem with estimating or predicting carries arises when the estimation or prediction fails: in order to incorporate these modules into a synchronous context we would have to consider the worst-case delay. To overcome this limitation we have to change the concept of operation delay, that is, the same kind of operation can take a different number of cycles, each of which can be shorter than the cycle needed in the conventional case. For example, we can divide an addition into two halves that operate independently, so the cycle time will be roughly 50% of the original. If we guess the carry, the addition lasts one cycle; if not, we change the prediction and use the next cycle to calculate the correct result with the correct carry. The same idea can be applied to different types of operations and designs.

Therefore, the only area overhead is the logic that decides whether the carry prediction has hit or failed, and in this way we do not need delay elements for generating the “Done” signal as in figure 2 a).

V. IMPROVEMENT OF THE HIT RATE

The technique of predicting carries is very accurate, but we can improve the hit rate by applying hybrid techniques, as happens with common branch predictors in processors.

The key question behind hybrid predictors is why we should predict a carry when we can calculate it easily. In the aforementioned Wallace techniques we have seen that one bit per operand is enough to know the carry in 50% of the cases. Therefore we are going to combine this Wallace estimation with the prediction techniques. See figure 5 a).

Figure 3. Truth tables: a) 1-bit predictor, b) 2-bit predictor, c) 3-bit history predictor + majority logic decision function, d) 2-bit contextual predictor

a)
Last carry   Pred
0            0
1            1

b)
State   Pred
00      0
01      0
10      1
11      1

c)
Carry history   Pred      Carry history   Pred
000             0         100             0
001             0         101             1
010             0         110             1
011             1         111             1

d)
Pattern   Last carry
00        1
01        0
10        1
11        1
(example: history bits “00” give prediction ‘1’)

Figure 4. Controller scheme for prediction techniques (signal labels: Prediction, CT, Cin)

Figure 5. a) General hybrid prediction scheme (blocks: Estimation, Prediction, Control), b) example for a 16-bit adder made of two 8-bit modules, with hybrid prediction with a 1-bit Wallace estimator and a 1-bit predictor implemented with a D-latch (signal labels: X7, Y7, D)

Figure 6. n-bit ripple carry adder with prediction technique (blocks: n/2-bit MS part, n/2-bit LS part, D/E/Q latch; signal labels: Pred, Ctrue, Cout)

If the most significant bits of the operands are equal, the carry is known with complete certainty to be ‘0’ or ‘1’. When these bits are different, more hardware would be needed to calculate the carry out, so we use the predictor value instead; see figure 5 b). Note that the Wallace estimator is only used when both bits are equal, so the estimator can be reduced to a forwarding of the bit used for estimation.
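The hybrid selection just described can be sketched as follows. It is a minimal model under our own assumptions: the function names are illustrative, the predictor is the 1-bit one, and its latch is assumed to store the last true carry after every addition with an initial value of ‘0’.

```python
def hybrid_carry(x_msb: int, y_msb: int, predicted: int) -> int:
    """Forward the MSB when both operands agree; otherwise use the predictor."""
    return x_msb if x_msb == y_msb else predicted


def hybrid_hit_rate(pairs, width: int) -> float:
    """Hit rate of the hybrid scheme over a sequence of fragment additions."""
    last_carry = 0  # assumed initial latch contents
    hits = 0
    for x, y in pairs:
        ct = (x + y) >> width
        ce = hybrid_carry((x >> (width - 1)) & 1, (y >> (width - 1)) & 1, last_carry)
        hits += ce == ct
        last_carry = ct  # assumption: latch refreshed with every true carry
    return hits / len(pairs)


sequence = [(0b1110, 0b0001), (0b1110, 0b0010), (0b1110, 0b0011),
            (0b1110, 0b0100), (0b1110, 0b0101)]
print(hybrid_hit_rate(sequence, width=4))  # 0.8
```

When the two most significant bits agree, `hybrid_carry` simply forwards one of them, which is why no real estimator logic is needed beyond an equality check and a multiplexer.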

VI. DESIGNING MODULES

Using the aforementioned ideas, in this section the designs of both a predictive adder and a predictive multiplier are presented.

A. Predictive Adder

The design of this adder is simple. We have two smaller adders, and the most significant module takes its carry in from the predictor. In our case we have chosen a 1-bit predictor implemented with a gated D-latch. See figure 6.

Therefore, if we guess the carry, the predictive adder is twice as fast as a non-predictive one, because for an n-bit adder we reduce the critical path from n to n/2.

The prediction is updated when it differs from the true carry. In that case the failure of the prediction is signaled. Note that this scheme is general: it is possible to use any kind of adder and any kind of predictor. For example, consider an N-bit pure CLA adder, whose delay is O(log(N)). Now consider an N-bit predictive adder whose basic modules are N/2-bit pure CLA adders. With the predictive scheme one forwarding level is saved, and with it the corresponding area overhead.
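A behavioural model of this predictive adder, including the one-cycle penalty on a misprediction, might look as follows. It is a sketch under our own assumptions: the class name `PredictiveAdder`, the initial latch value of ‘0’, and the convention of returning a cycle count are illustrative and not taken from figure 6.

```python
class PredictiveAdder:
    """Two half-width adders joined by a 1-bit carry predictor."""

    def __init__(self, width: int):
        assert width % 2 == 0
        self.half = width // 2
        self.mask = (1 << self.half) - 1
        self.pred = 0  # gated D-latch contents (assumed initial value)

    def add(self, x: int, y: int) -> tuple[int, int, int]:
        """Return (sum, carry_out, short_cycles_used)."""
        lo = (x & self.mask) + (y & self.mask)
        c_true = lo >> self.half

        hi_x = (x >> self.half) & self.mask
        hi_y = (y >> self.half) & self.mask
        hi = hi_x + hi_y + self.pred  # speculative upper half
        cycles = 1

        if self.pred != c_true:
            # Miss: update the latch and redo the upper half in a second cycle.
            self.pred = c_true
            hi = hi_x + hi_y + self.pred
            cycles = 2

        total = ((hi & self.mask) << self.half) | (lo & self.mask)
        return total, hi >> self.half, cycles


adder = PredictiveAdder(8)
print(adder.add(0x0F, 0x01))  # (16, 0, 2): first carry '1' is mispredicted
print(adder.add(0x0F, 0x02))  # (17, 0, 1): the latch now predicts '1'
```

Each short cycle is roughly half the length of the conventional one, so an addition that hits completes in about half the original time and one that misses takes about the same time as the conventional adder.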

B. Predictive Multiplier

The basic scheme of a multiplier is built with several full adders that propagate internal carries. In this case there are many carries to predict, and the use of many predictors reduces the probability of guessing all the carries. That is, the probability of executing a product in one short cycle is very low.

For example, consider a ripple carry multiplier such as the one in figure 7 a). Now, suppose we have a probability of 0.95 of guessing a carry. Then, the probability of guessing all the carries is $0.95^{3}$, that is, less than 0.86. This is not bad at all, but now consider a multiplier bigger than a 4x4 one. The probability of an absolute hit drops very quickly as the multiplier size increases. In real cases there will probably be many zeros, so the hit rate will surely be higher, but the area overhead will not disappear.

Supposing the probability of guessing every carry is the same, p, for an m×n ripple carry multiplier the probability of guessing all the carries is $p^{n-1}$. With (n-1) predictors the number of full adders in the critical path decreases from (m+n-1) to (m/2+n-1).

Therefore we must consider another structure that uses fewer predictors. We have chosen the carry save structure or Braun multiplier [19], which reduces the number of critical paths, so that only two predictors, placed in the middle of the last stage of full adders as in figure 7 b), are enough to obtain the same performance. Note that this scheme is not valid with the estimation technique [8-9], because the combinational logic for deciding the carries would join the broken paths again.

Therefore, with a Braun scheme for an m×n multiplier we have a probability of $p^{2}$ of guessing all the carries and, furthermore, the area overhead is considerably smaller than in the case of the ripple carry scheme. In fact the probability is slightly higher, because when both predictions fail and the two carries in are different the effect is the same as a hit: we are adding ‘1’. So this case is also a hit.
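The two probabilities can be compared numerically with a short sketch. The function names are ours, every carry is assumed to be guessed independently with the same probability p, and the Braun figure is the lower bound $p^{2}$ that ignores the extra hit case mentioned above.

```python
def p_all_hits_ripple(p: float, n: int) -> float:
    """Probability of guessing all n-1 carries of an m x n ripple carry multiplier."""
    return p ** (n - 1)


def p_all_hits_braun(p: float) -> float:
    """Lower bound for a Braun multiplier, which needs only two predictors."""
    return p ** 2


def ripple_critical_path(m: int, n: int, predicted: bool) -> float:
    """Full adders in the critical path, with and without prediction."""
    return m / 2 + n - 1 if predicted else m + n - 1


p = 0.95
for n in (4, 8, 16, 32):
    print(n, round(p_all_hits_ripple(p, n), 4), round(p_all_hits_braun(p), 4))
```

With p = 0.95 the ripple carry figure for a 4x4 multiplier is below 0.86 and falls below 0.5 for a 16x16 one, whereas the Braun bound stays at about 0.90 regardless of the size.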

Figure 7. a), b) Ripple carry and carry save multipliers with predictors, c) predictor design for a carry save multiplier (two D/E/Q latches; signal labels: Ctrue1, Ctrue2, Pred1, Pred2, hit)

See figure 7 c) for the implementation of the predictors. We have two gated D-latches, one for every prediction, and some combinational logic that controls the enables and computes the hit signal.

Note that the Braun multiplier only works for positive numbers. However, the Baugh-Wooley [20] scheme is valid for both positive and negative numbers, and its structure is almost the same as that of the Braun multiplier (it is also built following a carry-save structure).

VII. EXPERIMENTAL RESULTS

In order to compare the efficiency of the different estimation/prediction schemes and of our designs, we have carried out several experiments.

Firstly, we have collected the data from a set of additions obtained by simulating the ADPCM decoder [15] and the Discrete Cosine Transform (DCT) in JPEG2000 [18], and we have calculated the hit percentage for the aforementioned estimators/predictors. For the ADPCM we have used 16-bit adders composed of two 8-bit modules, and for JPEG2000 32-bit adders built from two 16-bit modules.

The results for the different estimators/predictors are shown in table II a). With these percentages we can assert that predictors are as accurate as a 2-bit Wallace estimator, and hybrid predictors as accurate as a 3-bit Wallace estimator. Note that the hybrid predictors have been built with a 1-bit Wallace estimator and the associated pure predictor.

Figure 8 shows the comparison between the hit percentages for four different input photos in the JPEG2000 algorithm. The results demonstrate the efficiency of the estimation/prediction techniques independently of the inputs. The hit percentage is over 97%, reaching 99.5% in some cases.

Table II b) shows the hit percentages reached for the ADPCM products, supposing that the fmult1 and mix1 multipliers are implemented with a 6x6 and a 12x8 predictive multiplier, respectively. Table II c) shows the hit percentages for the DCT products in JPEG2000, supposing that 16x16 predictive multipliers are used. We have made this study for 4 different inputs, reaching in every case a hit percentage of almost 99% or higher.

Considering tables II b) and c) jointly, it can be observed that hit percentages become higher as the multiplier size increases. This is because big multipliers only really use their most significant part (that is, the most significant bits of the operands are ‘1’) in a few cases. In most cases the most significant bits are ‘0’ for ADPCM and JPEG2000, and therefore the carries are easily predictable. Therefore, in some cases it may be desirable to increase the multiplier size slightly in order to increase the hit percentage, according to simulations.

Table II. a) Prediction percentage for the ADPCM decoder and JPEG2000 addition sets, b) prediction percentage for the ADPCM products, c) prediction percentage for JPEG2000 products with 4 different inputs

a)
                     ADPCM    JPEG
1 bit Wallace        91.8     98.9
2 bits Wallace       94.2     98.9
3 bits Wallace       96.4     99.0
1 bit predictor      94.3     97.8
2 bits predictor     94.8     98.8
2 bits Contextual    94.2     97.8
3 bits History       94.6     98.9
Hybrid 1 BP          96.2     98.1
Hybrid 2 BP          96.1     98.6
Hybrid 2 Bcont       96.0     98.1
Hybrid 3BHist        96.0     98.9
Avg                  95.0     98.5

b)
Example    %Hit
fmult1     88.81
mix1       99.25
Average    94.03

c)
Input      %Hit
Photo 1    99.24
Photo 2    99.83
Photo 3    98.85
Photo 4    99.56
Average    99.37

Table III. RCA and Braun multiplier delay and area with and without predictive techniques (delay in ps, area in μm², overhead in %)

Example              Delay    Delay Pred   Overh     Area      Area Pred   Overh
RCA (8)              3486     2050         -41.19    6564      7332        11.70
RCA (16)             6931     3773         -45.56    12966     13619       5.04
RCA (32)             13822    7218         -47.78    25376     26421       4.12
Braun Mul (8x8)      8385     6890         -17.83    43371     45004       3.77
Braun Mul (16x16)    17790    14026        -21.16    175511    179446      2.24
Braun Mul (32x32)    36833    28523        -22.56    683898    703250      2.83

Figure 8. Prediction percentage for 4 different inputs to the JPEG2000 algorithm (bar chart titled “Hit percentage in JPEG2000”; horizontal axis: Photo 1, Photo 2, Photo 3, Photo 4, Avg; vertical axis: 95.5 to 100.0; series: 1 bit Wallace, 2 bits Wallace, 3 bits Wallace, 1 bit predictor, 2 bits predictor, 2 bits Contextual, 3 bits History, Hybrid 1 BP, Hybrid 2 BP, Hybrid 2 Bcont, Hybrid 3BHist, Avg)

Figure 9. Performance gain with predictive techniques, where $T_{pred}$ and $T_{conv}$ are the execution times with and without prediction, and $CT_{pred}$ and $CT_{conv}$ the cycle times with and without prediction:

$T_{pred} = 9\ \text{cycles} \times (CT_{pred} + (1 - 0.9) \times CT_{pred}) = 9.9\,CT_{pred}$
$T_{conv} = 9\ \text{cycles} \times CT_{conv} = 9\,CT_{conv}$
$CT_{pred} = 0.8\,CT_{conv}$
$\text{Gain} = 1 - T_{pred}/T_{conv} = 0.12$

For measuring the delay reduction and area overhead of the predictive techniques, we have synthesized an RCA and a Braun multiplier with the commercial tool Synopsys Design Compiler. The target library used in both cases is VTVTLIB25 by Virginia Tech, based on a 0.25 μm TSMC technology. We have used a 1-bit predictor for the adder and two 1-bit predictors for the Braun multiplier, following figures 6 and 7 c). The results are shown in table III. Columns 4 and 7 show the delay and area overhead, respectively. Delay is measured in ps and area in μm². Both designs have been synthesized for several sizes, shown in brackets. In the RCA implementation the delay gain with prediction is greater than 40% and increases with the size of the adder, approaching the theoretical limit of a 50% gain. The same happens with the Braun multiplier, but in this case the initial gain is 18%, approaching the theoretical limit of a 25% delay gain. Note that for an n×n Braun multiplier we reduce the critical path from (2n-1) to (n/2+n-1).

In terms of area, the decrease in the overhead percentage as the input size grows is quite logical, because we always use one predictor for adders and two for multipliers, whatever the operator dimensions.
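The theoretical limits mentioned for the delay gain follow from the critical path lengths given above, as this small sketch shows. It counts full adders only, which is a rough proxy for delay, and the function names are ours; the measured values are the delays of table III.

```python
def rca_reduction(n: int) -> float:
    """Critical path reduction of an n-bit RCA split into two halves."""
    return 1 - (n / 2) / n


def braun_reduction(n: int) -> float:
    """Critical path reduction of an n x n Braun multiplier with two predictors."""
    return 1 - (n / 2 + n - 1) / (2 * n - 1)


# Measured delays in ps from table III: (conventional, predictive).
measured_braun = {8: (8385, 6890), 16: (17790, 14026), 32: (36833, 28523)}

for n, (conv, pred) in measured_braun.items():
    print(n, round(braun_reduction(n), 3), round(1 - pred / conv, 3))
```

The full-adder count tends to a 25% reduction for large n, and the synthesized multipliers move towards the same value as their size grows.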

Finally, the floating point product algorithm is considered, according to the IEEE 754 standard [21]. Its critical module, in terms of both delay and area, is a 24x24 multiplier. In a synchronous context, the cycle time is determined by the multiplier delay. Using the commercial tool Synopsys Behavioral Compiler we get a latency of 9 cycles. Using a predictive multiplier and supposing a hit probability of 0.95 for each carry, we have a probability of an absolute hit slightly higher than 0.9. Let CTpred and CTconv be the cycle times of the predictive FP product and the non-predictive one, respectively. In each cycle we then spend CTpred plus a penalty of one extra cycle when the prediction fails. If CTpred is 80% of CTconv, the performance gain will be around 12%. See figure 9. For greater hit probabilities we would obtain greater improvements, as shown in the previous examples.
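The arithmetic of figure 9 can be written as a small function, which makes it easy to try other hit probabilities and cycle time ratios. The function name and parameter names are ours; the model is the one stated in the text, where every cycle costs CTpred plus one extra CTpred with probability (1 - hit probability).

```python
def predicted_gain(cycles: int, hit_prob: float, ct_ratio: float) -> float:
    """Performance gain of the predictive design over the conventional one.

    cycles   -- latency of the operation in cycles (9 in the text)
    hit_prob -- probability of an absolute hit per cycle (0.9 in the text)
    ct_ratio -- CTpred / CTconv (0.8 in the text)
    """
    # Execution times expressed in units of CTconv.
    t_pred = cycles * (1 + (1 - hit_prob)) * ct_ratio
    t_conv = cycles
    return 1 - t_pred / t_conv


print(round(predicted_gain(9, 0.9, 0.8), 2))  # 0.12
```

Note that the number of cycles cancels out, so in this model the gain depends only on the hit probability and on the ratio between the two cycle times.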

VIII. CONCLUSIONS

This work justifies the use of estimation and prediction techniques in the implementation of functional units when designing applications with value locality, such as DSP ones. Via simulation it is possible to determine whether these implementations are going to achieve large improvements.

We have examined the Wallace proposals and introduced prediction into the process of calculating the carries in order to shorten the carry path, obtaining large performance improvements with low area overheads. Experiments show the accuracy of both techniques, which reach or exceed hit percentages of around 95% in most cases. However, only pure prediction techniques are valid for implementing multipliers with low area overheads and remarkable delay reductions. Besides, we have proposed a way of introducing prediction techniques in the synchronous context.

Prediction widens the possibilities for adder and multiplier implementations, because hybrid modules can be built that use predictive carries only for the submodules whose carry out predictability reaches high values, or that combine several kinds of estimators (only in the case of adders) and/or predictors according to the required area or the highest hit rates.

Finally, note that prediction can be applied in any design. The only point is to know what should be predicted. For example, in the case of CLA adders it may be better to predict some internal signal than the carry itself.

REFERENCES

[1] Wm. A. Wulf and Sally A. McKee. “Hitting the Memory Wall: Implications of the Obvious”. Computer Architecture News, 23(1), pp. 20-24, March 1995.

[2] R.B.Freeman, “Checked carry select adder”, IBM Tech. Disclosure Bull., 1970, 13, (6), pp. 1504-1505

[3] M. Uya, K. Kaneko, J. Yasui, “A CMOS floating point multiplier”, IEEE J. Solid-State Circuits, 1984, SC-19(5), pp. 697-702.

[4] M.M. Mano and C.R. Kime, “Logic and computer design fundamentals”, Prentice-Hall, 2001.

[5] J.Lim, D.G. Kim and S.I. Chae, “A 16-bit carry lookahead adder using reversible energy recovery logic”. IEEE Journal Solid-State Circuits, vol 34., no 6. June 1999.

[6] D.J. Kinniment. “An evaluation of asynchronous addition”, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., 1996, 4, (1), pp. 137-140.

[7] Jens Sparso. “Asynchronous circuit design”. Technical University of Denmark. 2006.

[8] W.F. Wallace, S.S. Dlay and O.R. Hinton. “Probabilistic carry state estimate for improved asynchronous adder performance”, IEE Proc. Comput. Digit. Tech., 2001, 148, (6), pp. 221-226.

[9] E.M. Ashmila, S. Dlay, O. Hinton. “Adder methodology and design using probabilistic multiple carry estimates”, IEE Proc. Comput. Digit. Tech., 2005, 152, (6), pp. 697-703.

[10] M.H. Lipasti and J.P. Shen. “Exceeding the Dataflow Limit via Value Prediction”. Proc. of the 29th Annual ACM/IEEE International Symposium on Microarchitecture. 1996.

[11] F. Gabbay and A. Mendelson. “Can Program Profiling Support Value Prediction?”. Proc. of the 30th Annual ACM/IEEE International Symposium on Microarchitecture. 1997.

[12] M.H. Lipasti, C.B. Wilkerson, J.P. Shen. “Value Locality and Load Value Prediction”. Proc. of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 138-147, 1996.

[13] K.M. Lepak, M.H. Lipasti. “On the Value Locality of Store Instructions”. Proc. of the 27th Annual International Symposium on Computer Architecture. 2000.

[14] B. Bishop, T.P. Kelliher, M.J. Irwin. “A Detailed Analysis of Mediabench”. IEEE Workshop on Signal Processing Systems, pp. 448-455. 1999.

[15] 40, 32, 24, 16 kbit/s Adaptive Differential Pulse Code Modulation (ADPCM). Recommendation G.726. ITU.

[16] J. L. Hennessy & D. A. Patterson, “Computer Architecture: A Quantitative Approach”. 2007.

[17] T. Liu, S. Lu, “Performance Improvement with Circuit-Level Speculation”, Proceedings of the 33rd Annual IEEE/ACM International Symposium on Microarchitecture, 10-13 December 2000, Monterey, California, USA.

[18] M. Charrier. “JPEG2000: A New Standard for Still Image Compression”. IEEE International Conference on Multimedia Computing and Systems. 1999.

[19] M. Davio,“Digital Systems with Algorithm Implementation”, John Wiley & Sons, 1983.

[20] W. Stallings, “Computer Organization and Architecture”. Prentice Hall. 1999.

[21] IEEE Standard for Binary Floating-Point Arithmetic. New York: ANSI/IEEE, Std. 754-1985, Aug. 1985.
