Hemani GALS


40.4

Lowering Power Consumption in Clock by Using Globally Asynchronous Locally Synchronous Design Style

A. Hemani¹, T. Meincke¹, S. Kumar⁴, A. Postula⁵, T. Olsson², P. Nilsson², J. Öberg¹, P. Ellervee¹, D. Lundqvist³

1. ESD Lab, Department of Electronics, KTH, Sweden
2. Lund University, Sweden
3. Ericsson Radio Systems AB, Stockholm, Sweden
4. Indian Institute of Technology, New Delhi, India
5. Department of CSEE, University of Queensland, Brisbane, Australia

ABSTRACT
Power consumption in the clock of large high-performance VLSIs can be reduced by adopting the Globally Asynchronous, Locally Synchronous (GALS) design style. GALS has small overheads for the global asynchronous communication and local clock generation. We propose methods to a) evaluate the benefits of GALS and account for its overheads, which can be used as the basis for partitioning the system into an optimal number/size of synchronous blocks, and b) automate the synthesis of the global asynchronous communication. Three realistic ASICs, ranging in complexity from 1 to 3 million gates, were used to evaluate GALS benefits and overheads. The results show an average power saving of about 70% in the clock with negligible overheads.

1 INTRODUCTION
VLSI systems have reached a point in their evolution where they are big enough, and are clocked at a high enough frequency, that the overhead of the clock in the form of power consumption has become unacceptable.

Fig. 1. Power breakdown in a high-performance CPU [9].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
DAC 99, New Orleans, Louisiana
© 1999 ACM 1-58113-092-9/99/0006..$5.00

This is confirmed by [12] for high-performance microprocessors, see Fig. 1.

We have performed similar experiments with three ASICs designed in 0.25 micron CMOS, ranging in complexity from approximately 1 million gates to 3 million gates and clocked at 250 MHz. Our results substantiate the findings of [12], as shown in section 4.

To contain the clock overhead, VLSI systems should be divided into smaller autonomous sub-systems, the extreme case being a fully asynchronous system. While this has been tried successfully in isolated cases, e.g. [3], the design methodology for asynchronous systems is far from mature enough for widespread acceptance [1]. As a compromise, we propose a design style called Globally Asynchronous and Locally Synchronous, or GALS for short, see Fig. 2. The GALS architecture is composed of large synchronous blocks (SBs) which communicate with each other on an asynchronous basis. By eliminating the global clock, we eliminate a major source of power consumption and a design bottleneck. Additionally, as SBs operate asynchronously with respect to each other, the frequency at which each SB is clocked can be tailored to local needs, thus reducing the average frequency and the overall power consumption.
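Section 3.3 takes the four-phase handshake, with a request and an acknowledge signal, as the representative protocol for this asynchronous communication between SBs. As a minimal and purely illustrative sketch of that protocol, the Python model below steps one transfer through its four phases and logs every control-signal transition. The function and variable names are hypothetical, and the sequential model ignores real circuit timing and metastability.

```python
def four_phase_transfer(data, log):
    """Model one four-phase (return-to-zero) handshake between two SBs.

    Every transition on the two control wires is appended to `log` as a
    (signal_name, new_level) tuple. Returns the word seen by the receiver.
    """
    # Phase 1: the sender places the data on the bus and raises request.
    bus = data
    log.append(("req", 1))

    # Phase 2: the receiver latches the data and raises acknowledge.
    received = bus
    log.append(("ack", 1))

    # Phase 3: the sender sees acknowledge and lowers request.
    log.append(("req", 0))

    # Phase 4: the receiver sees request low and lowers acknowledge,
    # returning both wires to their idle level for the next transfer.
    log.append(("ack", 0))

    return received


if __name__ == "__main__":
    events = []
    word = four_phase_transfer(0x2A, events)
    print(word, events)
```

Each transfer toggles both control signals twice. This switching activity on the request and acknowledge wires is what the communication overhead analysis has to account for.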

Fig. 2. The GALS architecture. [Figure labels: Synchronous Block, Data, Handshake protocol]

Power consumption in VLSI systems is reduced by reducing voltage, switched capacitance and switching activity. While the first two have been mostly technology


and circuit oriented approaches, the third has great potential from the system to the logic design phase. Some recent work [13] reports using variable voltage as a parameter at system level to reduce power consumption. Gated clocks have recently become a popular way to reduce the switching activity of logic in redundant cycles [14], and even commercial logic synthesis tools offer limited support for this. Tiwari et al. [9] apply this idea to turn off the activity on the clock net as one way of reducing power consumption in the clock. They divide the design into clock regions and transform simple clock buffers into qualifying clock buffers to implement the clock gating. However, they also report timing and electrical issues that limit the practicality of this approach beyond 100 MHz. Xi et al. [6] report impressive results with a balanced buffer approach and an optimization algorithm for buffer and device sizing under process variations. Though skew can be minimised with this approach, we believe it will not scale well as clock frequencies go beyond GHz.

In contrast, the GALS approach is skew tolerant at the global level because it does not depend on a global clock reference for communication. At the local level of SBs, designing clocks with tolerable skew should be easier because the problem size is smaller, typically 10K to 100K gates. The smaller skew in GALS compared to the Globally Synchronous (GS) case would also ease the skew constraints under which logic is synthesised.

The GALS approach has its own overheads and demands on design methodology. The first overhead concerns the handshake protocol, which introduces global control signals; this is discussed in more detail in section 3.3. The second overhead is the cost of local clock generation, for which two schemes exist. One, which we have adopted in our work, uses independent local clock generators of the type described in [4]. The other approach is to have a low-power global clock reference signal. Both schemes are discussed in section 3.4. Further, to be adopted by the engineering community, a design methodology is needed that supports the designer in 1) partitioning the system into SBs such that the GALS benefits are maximised compared to the GALS overheads and 2) synthesis of the asynchronous interface among SBs. Section 2 discusses such a methodology.

This paper focuses on establishing a case for the GALS design style by taking three realistic designs and calculating the power savings that can be achieved with such a

style while accurately accounting for the overheads.

2 The GALS Methodology
The GALS methodology extends the established synchronous design methodology in two respects, as discussed in the previous section. One is to partition the system into an optimal number/size of SBs, and the other is to refine the communication among SBs for asynchronous communication. The GALS design methodology is shown in Fig. 3.

1. The starting point is a hierarchical description of a synchronous system.
2. The first step is to pre-partition such a description by cutting the hierarchy at the first level from the root. This strategy, though simple, serves the purpose of seeding the inner optimization loop.
3. In the communication refinement step, for every pair of communicating SBs, a communication requirement analysis is made to select the cheapest of the four possible communication modes: send and forget, strobe, handshake or FIFO. For such an analysis, each SB is characterised by parameters like clock period and min and max transitions for the I/Os. Such characterisation is done by static analysis. This refinement is based on our earlier work on hardware synthesis from SDL, detailed in [8]. This is a mature methodology that has been applied to industrial size problems, see [2].

Fig. 3. The GALS design methodology. [Figure labels: Communication Refinement, [incremental] Synthesis]

4. SBs are synthesised using behavioural and/or logic synthesis tools, followed by preliminary floorplanning of the SBs. Synthesis results from the first run are saved in a hierarchical way. As repartitioning will not drastically change the synthesis results, the results from the first run are used to calculate the gate and register count of SBs according to the new partitioning.
5. The next step is to evaluate the power savings and use this as an objective function in a partitioning strategy. At present, our work is focused on defining the objective function that can be used in one of the many well-established partitioning strategies based on simulated annealing, tabu search etc.

3 Method to Evaluate Power Savings

3.1. Analysis of Power savings and upperbound
The essential idea in the GALS power savings strategy is the elimination of the global clock. Thus, by partitioning the system into more blocks we eliminate more of the global clock. But this benefit does not increase monotonically, because by increasing the number of partitions we also increase the two overheads of GALS discussed earlier. To analyse the power savings in GALS, let us assume, without loss of generality, that the system has a total area $A$ and is partitioned into $N$ equal sized blocks. Further, let $P_{clock}(N)$ represent the power consumption in the clock of the GALS system with $N$ blocks.

$$P_{clock}(N) = k_1 \cdot N \cdot a \cdot \sqrt{a} \qquad \text{(EQ 1)}$$

where $a = A/N$ is the average area of an SB.

Power consumption in the clock is due to two factors: the load of sequential elements and the capacitance of the clock wire itself. The first factor is proportional to area and is accounted for by $N \cdot a$ in Eq 1. The second factor is proportional to the length of clock wire and is accounted for by $\sqrt{a}$ in Eq 1. The two factors are multiplied because for each sequential element driven, we need to overcome the capacitive load on a wire of length $\sqrt{a}$. The factor $k_1$ absorbs the constants of proportionality, technology and methodology factors.

As Globally Synchronous (GS) is a special case of GALS, the power consumption in the clock of GS is $P_{clock}(1)$. The power savings in GALS compared to GS can then be written as:

$$\frac{P_{clock}(1)}{P_{clock}(N)} = \frac{k_1 \cdot A \cdot \sqrt{A}}{k_1 \cdot N \cdot a \cdot \sqrt{a}} = \sqrt{N} \qquad \text{(EQ 2)}$$

since $N \cdot a = A$ and $\sqrt{A}/\sqrt{a} = \sqrt{N}$. In other words, $\sqrt{N}$ is then the upper bound on the power savings that can be achieved using GALS. Though Eq 2 suggests that the power savings will increase monotonically with $N$, the fact is that the GALS overheads at some point will sufficiently erode the GALS benefits. This will happen only for a very large number of blocks, since the overheads of GALS are very small.

3.2. Power consumption in the clock
Power consumption in the clock, as mentioned earlier, comes from two sources: one is the load seen at the clock terminals, i.e. the registers, and the second is the clock wire itself. Assuming an H-tree type clock distribution, power consumption in the clock is given by [7]:

$$P_{clk} = \Big( \underbrace{\cdots}_{\text{local wiring term, with } \alpha} + \tfrac{3}{2}\,(2^{h}-1)\,D\,c + c_{reg}\,N_{reg} \Big)\, f\, V_{dd}^{2} \qquad \text{(EQ 3)}$$

The first two terms account for local and global clock wiring and the third term accounts for the load due to registers. In Eq 3, $N_{reg}$ is the number of registers, $h$ is the depth of the H-tree, $D$ is the dimension of the chip, $c$ is the capacitance per unit length, $c_{reg}$ is the capacitance of a single register, $f$ is the clock frequency and $V_{dd}$ is the power supply voltage. $\alpha$ is an estimation factor to account for the algorithm used for local clock routing. Eq 3 is applicable to the GS case. For the GALS case, we use the same equation but assume each SB to be a chip and then sum up the power consumption for all SBs.

3.3. Communication overhead
The communication overhead in the GALS architecture comes from the activity on the control signals that implement the communication protocol among the SBs. The four-phase handshake protocol is a representative example. Four major factors contribute to the communication overhead: 1) The mode of communication resulting from the communication refinement step 3 in section 2; in this paper we uniformly assume handshake-based communication. 2) The frequency with which SBs communicate with other SBs, the worst case being that they communicate in every clock cycle. 3) The number of SBs that participate in communication, the worst case being that every SB communicates with every other. 4) The length of wires for control signals. The objective of the partitioning strategy and


the floorplanning step in the GALS methodology shown in Fig. 3 would be to minimise the external communication and place SBs that do communicate close to each other. As the data communication overhead would be the same for the GALS and GS cases, we are only interested in the communication overhead $P_{comm}$ due to the asynchronous protocol, which is given by Eq 4.

[equation body missing in source] (EQ 4)

where $B$ is the set of communicating SB pairs, $X_b$ is the set of communication instances for a particular pair $b$ of communicating SBs, $n_{sig}$ is the number of signals in the communication protocol (the handshake protocol has two signals, request and acknowledge), $C_w$ is the wire capacitance per unit length, $l_{b,i}$ is the wire length for the communication signal, $n_{reg}$ is the number of registers driven by the control signals, $C_{reg}$ is the register capacitance and $f_{b,i}$ is the frequency at which the SBs communicate.

3.4. Local clock generation and its overhead
In the GALS architecture, local clocks are required for the SBs. There are two possibilities: either to have an independent local clock generator, or to use a global reference clock signal with a frequency multiplier in the SBs to achieve the desired higher frequency.

When using the global reference clock signal, its power consumption can be kept low by adopting three measures: 1) the signal swing can be a fraction of Vdd, for instance a few hundred millivolts; 2) the signal is distributed at a much lower frequency than the highest frequency required by any SB, and a multiplier within each SB is then used to achieve the desired higher frequency; and 3) no effort is made to carefully design the geometry of the signal to minimise skew, because the GALS architecture is not affected by global clock skew.

Traditionally, an analog PLL is used for frequency multiplication. However, integrating a PLL in a noisy digital environment is difficult. In addition to noise issues, the PLL is also sensitive to process variations. Fully digital methods are more desirable for both low-voltage and low-power clock generation. For frequency multiplication, a ring oscillator controlled by a global clock signal can be used instead of a PLL or a DLL, as suggested in [4]. In this design, a burst of a predetermined number of oscillations is produced for each period of the global reference clock, followed by idle time as a safety margin.

Local clock generators based on ring oscillators have many advantages, such as robustness, small size and low power consumption. The basic ring oscillator consists of an odd number of inverters connected in a circular chain. This circuit has no stable operating point and will therefore oscillate. The frequency of the ring oscillator is determined by the propagation time through the chain of inverters. There are many ways to manipulate the frequency of the ring oscillator. The most straightforward method is to change the propagation delay by changing the number of inverters. However, using only this method, the oscillation will be set to a fixed frequency. Other ways are to use current-starved inverters [5], or a delay line of controllable capacitors [1]. There are advantages to reducing the local oscillator frequency using one of the suggested methods, instead of letting each SB operate fast and then be turned off when ready. One advantage is the possibility of making a low-power design, since the timing demands on the hardware in the SB will be less severe. Ring oscillators are low-cost solutions, and their power consumption in the GALS architecture can be estimated by Eq 5.

[equation body missing in source] (EQ 5)

where $B$ is the set of SBs, $C_{inv}$ is the capacitive load due to one inverter, $N_{inv,b}$ is the number of inverters used by the ring oscillator in SB number $b$ and $f_b$ is its frequency of operation.

In [1], dynamic voltage scaling of both the oscillators and the SBs, leading to an energy-efficient design, is proposed. Using a DC-DC converter, the power supply voltage at each SB containing a ring oscillator is scaled dynamically.

In [12], a local ring oscillator is used for clock generation by frequency multiplication in a digital 95 kHz Intermediate Frequency (IF) filter. This IF filter can be regarded as a typical SB in the GALS architecture. In this design, the clock generator including the local clock buffer occupies only 3% of the total chip area. The clock power consumption in this example is about 12% of the total power consumption in the SB.
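To make the bookkeeping of this section concrete, the following Python sketch evaluates Eq 1 together with assumed forms of Eq 4 and Eq 5. The bodies of Eq 4 and Eq 5 are not legible in this copy of the paper, so the expressions used here are an assumption: the usual dynamic power form (capacitance times frequency times Vdd squared), summed over the sets named in the symbol definitions. All function names and numeric values are hypothetical placeholders, except the 9-inverter oscillator and the 250 MHz clock, which are quoted elsewhere in the paper.

```python
import math


def clock_power_model(n_blocks, total_area, k1=1.0):
    """Eq 1: P_clock(N) = k1 * N * a * sqrt(a), with a = A / N."""
    a = total_area / n_blocks
    return k1 * n_blocks * a * math.sqrt(a)


def comm_overhead_power(pairs, n_sig, c_wire, n_reg, c_reg, vdd):
    """ASSUMED form of Eq 4 (the original equation body is not legible):

        Vdd^2 * sum over pairs b, instances i of
            n_sig * (c_wire * l_bi + n_reg * c_reg) * f_bi

    `pairs` holds one list per communicating SB pair b; each list holds
    (wire_length, frequency) tuples, one per communication instance i.
    """
    total = 0.0
    for instances in pairs:
        for wire_length, freq in instances:
            load = c_wire * wire_length + n_reg * c_reg  # capacitance per signal
            total += n_sig * load * freq
    return total * vdd ** 2


def ring_oscillator_power(oscillators, c_inv, vdd):
    """ASSUMED form of Eq 5: Vdd^2 * sum over SBs b of N_inv_b * c_inv * f_b.

    `oscillators` holds one (n_inverters, frequency) tuple per SB.
    """
    return vdd ** 2 * sum(n_inv * c_inv * f_b for n_inv, f_b in oscillators)


if __name__ == "__main__":
    # Eq 1 reproduces the sqrt(N) ratio of Eq 2 for any area and k1.
    ratio = clock_power_model(1, 100.0) / clock_power_model(16, 100.0)
    print(f"P_clock(1) / P_clock(16) = {ratio:.1f}")

    # Hypothetical technology numbers, chosen only to exercise the functions.
    vdd, c_wire, c_reg, c_inv = 2.5, 2e-10, 5e-15, 5e-15  # V, F/m, F, F
    pairs = [[(1e-3, 250e6)], [(2e-3, 125e6)]]  # (metres, Hz) per instance
    p_comm = comm_overhead_power(pairs, n_sig=2, c_wire=c_wire,
                                 n_reg=2, c_reg=c_reg, vdd=vdd)
    p_osc = ring_oscillator_power([(9, 250e6), (9, 250e6)], c_inv, vdd)
    print(f"GALS overhead = {p_comm + p_osc:.3e} W")
```

Summing the two overhead terms and setting them against the clock power of a candidate partitioning gives the kind of objective function that a partitioning strategy could minimise.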


4 EXPERIMENTS & RESULTS
In the previous sections, we have analytically reasoned about the benefits and overheads of GALS. In this section, we present experiments to quantify that reasoning. We took three designs, with 1.125, 2.09 and 3.04 million gates. These designs are in the initial stage of development. The focus of the experiment was to evaluate the benefits of GALS and account for its overheads. For this reason, we have followed the methodology outlined in section 2 in spirit, manually doing some of the steps outlined there. The experimental steps were as follows:

1. The experiment started with hierarchical RTL descriptions/models of the designs in VHDL.
2. The description was cut at the first level of hierarchy, resulting in 24, 46 and 52 partitions for the three designs respectively. These partitions were treated as the initial set of SBs for the three designs.

Table 1: Design Data
[Surviving table fragments: column headings "Millions", "Initial SBs"; entries "7.5 X 1.5", "12.8 X 12.8"]

Table 2. Data used for the GALS experiment

3. The communication refinement step was skipped, because the existing tool works with SDL and we are in the process of adapting it for VHDL. All global communication is assumed to take place using a four-phase handshake protocol. All SBs are clocked at the same frequency; internally, some of them use gated clocks. This sets the value of the $n_{sig}$ and $n_{reg}$ parameters in Eq 4 to 2.
4. Each of these SBs that was synthesisable, i.e. not a model corresponding to a processor core or memory, was synthesised using logic synthesis tools.
5. Each design was floorplanned.

Fig. 4. Power consumption in clock as a function of partitioning. [Figure x-axis: partitions/design]

6. Keeping the floorplan fixed, we manually created multiple partitioning scenarios for each design. Starting with the initial set of SBs, we gradually increased the partition size by grouping neighbouring SBs into new larger SBs. These scenarios correspond to the design points shown in Fig. 4.

7. For each partitioning scenario, the design data together with the synthesis result was used to find:
   - the number of registers $N_{reg}$ and the dimension $D$ of each SB. These were then used in Eq 3 to calculate the power consumption in the clock, plotted in Fig. 4.
   - the set $B$ of participating SBs and the set $X_b$ of communication instances. For the length of wires, we used the Manhattan distance as an estimate. These correspond to the $B$, $X_b$ and $l_{b,i}$ parameters respectively in Eq 4.

8. For calculating the overhead due to ring oscillators, we estimated the number of inverters required to be 9 using equations in Ch. 3 of [5]. This was used in Eq 5 and added to Eq 4 to calculate the GALS overhead.

Fig. 4 shows how the power consumption reduces as the number of partitions increases. The gains are steep


in the beginning and then, as expected, they start to flatten as the law of diminishing returns sets in. One can also see that as the design size increases, the GALS benefit is more prominent. The GALS overhead is negligible compared to the clock power consumption. From this one could draw the conclusion that the GALS overhead will not start eroding the benefits until the number of partitions becomes quite large. For the designs used, the number of partitions would have to increase by an order of magnitude for the GALS overhead to be significant.

5 CONCLUSION & FUTURE WORK
The clock being the major source of power consumption and a design bottleneck has motivated this research into the GALS design style, which retains the benefits of synchronous design and avoids the problems due to the global clock. We have proposed a methodology that automates the task of communication refinement for implementing the asynchronous communication between the SBs of the GALS architecture. GALS benefits and overheads were reasoned about analytically and an upper bound on power savings has been derived, equalling $\sqrt{N}$, where $N$ is the number of SBs. Three large realistic designs were used to quantify the GALS benefits, overheads and power savings compared to the GS case. Results show 70% power savings in the clock compared to the GS case. GALS overheads were negligible, and in fact the data at hand shows that they would not become significant unless the number of partitions for these designs increases by an order of magnitude.

This paper was concerned mainly with power consumption and savings, but there are further consequences of the GALS architecture that deserve closer analysis. The GALS architecture allows SBs to run at different clock speeds. We are in the process of experimenting with other large designs where partitions have a natural need to run at grossly different speeds, and thus have the potential for exploiting this GALS characteristic.
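The potential benefit of letting SBs run at different clock speeds can be illustrated with the register-load term of Eq 3, which scales as the register capacitance times the number of registers times the clock frequency times Vdd squared. The Python sketch below compares clocking every SB at the fastest required frequency with clocking each SB at its own frequency. It is only an illustration under stated assumptions: the function names, register counts, frequencies and capacitance are hypothetical, and the wiring terms of Eq 3 are ignored.

```python
def register_clock_power(n_reg, c_reg, freq, vdd):
    """Register-load term of Eq 3: c_reg * N_reg * f * Vdd^2."""
    return c_reg * n_reg * freq * vdd ** 2


def compare_clocking(blocks, c_reg, vdd):
    """Return (power with one common frequency, power with per-SB frequencies).

    `blocks` is a list of (n_registers, required_frequency) tuples, one per SB.
    The common frequency is the highest frequency required by any SB.
    """
    f_max = max(freq for _, freq in blocks)
    common = sum(register_clock_power(n, c_reg, f_max, vdd) for n, _ in blocks)
    local = sum(register_clock_power(n, c_reg, f, vdd) for n, f in blocks)
    return common, local


if __name__ == "__main__":
    # Hypothetical SBs: equal register counts, one needing half the clock rate.
    blocks = [(1000, 250e6), (1000, 125e6)]
    common, local = compare_clocking(blocks, c_reg=1e-14, vdd=2.5)
    print(f"local / common = {local / common:.2f}")
```

In this made-up case the register clock load drops to three quarters of the common-clock value. The experiments reported in section 4 clock all SBs at the same frequency, so this effect is not part of the reported savings.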
Clock skew constraints are limited to the SB boundaries and are thus smaller than in the GS case, which results in more effective computation time, allowing cheaper and cooler logic to be used. On the other hand, the introduction of global protocol signals will increase the total area required, causing the average wire length to grow and thus increasing the performance and power penalty. The protocol exchange degrades the communication speed between blocks, adding more performance penalty, and may cause deadlock to occur. These issues will be addressed in our future research.

6 REFERENCES
1. S. Hauck, Asynchronous Design Methodologies: An Overview, Proceedings of the IEEE, Vol. 83, No. 1, pp. 69-93, January 1995.
2. W. Horn, Modelling of an ATM Multiplexer in a Network Terminal for a Mixed Hardware/Firmware Implementation, Master thesis, TRITA-ESD-1998-06, Department of Electronics, Royal Institute of Technology, Stockholm, Sweden, May 1998.
3. G. M. Jacobs, R. W. Brodersen, A Fully Asynchronous Digital Signal Processor Using Self-Timed Circuits, IEEE Journal of Solid-State Circuits, Vol. 25, No. 6, Dec. 1996.
4. P. Nilsson, M. Torkelson, A Monolithic Digital Clock-Generator for On-Chip Clocking of Custom DSPs, IEEE Journal of Solid-State Circuits, pp. 700-706, May 1996.
5. J. M. Rabaey, Digital Integrated Circuits, Prentice Hall, 1997.
6. J. M. Rabaey, M. Pedram, Low Power Design Methodologies, Ch 1, Kluwer Academic Publishers, 1996, ISBN 0-7923-9630-8.
7. J. M. Rabaey, M. Pedram, Low Power Design Methodologies, Ch 5, Kluwer Academic Publishers, 1996, ISBN 0-7923-9630-8.
8. B. Svantesson, S. Kumar, A. Hemani, A Methodology and Algorithms for Efficient Interprocess Communication Synthesis from System Description in SDL, in Proc. of VLSI Design 98, pp. 78-84, 7-8 Jan 1998, Chennai, India.
9. V. Tiwari et al., Reducing Power in High-performance Microprocessors, 35th DAC, June 98.
10. T. Hotta, K. Kurita and N. Kitamura, PLL-based BiCMOS on-chip clock generator for very high-speed microprocessors, IEEE Journal of Solid-State Circuits, 26: pp. 485-589, April 1991.
11. T. D. Burd and R. W. Brodersen, Processor Design for Portable Systems, Journal of VLSI Signal Processing, Kluwer Academic Publishers, Volume 13, Numbers 2/3, August/September 1996, pp. 203-222.
12. P. Nilsson and M. Torkelson, A Custom Digital Intermediate Frequency Filter for the American Mobile Telephone System, IEEE Journal of Solid-State Circuits, 32: pp. 806-815, June 1997.
13. Inki Hong et al., Power Optimisation of Variable Voltage Core-Based Systems, 35th DAC, June 98, pp. 176-181.
14. L. Benini and G. De Micheli, Transformations and Synthesis of FSMs for low power gated clock implementation, IEEE Trans. on CAD, Vol. 15, No. 6, June 1996.
