
GALS at ETH Zurich: Success or Failure?

Date post: 22-Nov-2023

Frank K. Gürkaynak, Stephan Oetiker, Hubert Kaeslin, Norbert Felber, Wolfgang Fichtner

Integrated Systems Laboratory, CH-8092 ETH Zurich

Abstract

The Integrated Systems Laboratory (IIS) of ETH Zurich (Swiss Federal Institute of Technology) has been active in Globally-Asynchronous Locally-Synchronous (GALS) research since 1998. During this time, a number of GALS circuits have been fabricated and tested successfully on silicon. From a hardware designer's point of view, this article summarizes the evolution from proof-of-concept designs over multi-point interconnects to applications that specifically take advantage of GALS operation to improve cryptographic security. In spite of the fact that they fail to address numerous idiosyncrasies of GALS (such as good partitioning into synchronous islands, port controller design, pausable clock generators, design for test, etc.), hierarchical design flows have been found to form a workable basis. What mainly prevents GALS from gaining wider acceptance is the initial effort required to come up with a design flow that is efficient and dependable.

1. Introduction

The technological improvement in microelectronic manufacturing is well documented. Developing circuits of ever-increasing complexity poses serious challenges to designers. In particular, as the number of clocked elements and the clock rates continue to increase, clock distribution has become a considerable challenge. This has led to a renewed interest in alternative design methodologies such as Globally-Asynchronous Locally-Synchronous (GALS) design.

First introduced by D. Chapiro in his Ph.D. thesis as early as 1984 [2], the name GALS basically suggests that the system consists of multiple functional blocks that communicate asynchronously. Neither the specific self-timed communication between the blocks nor the synchronization method used at block boundaries is strictly determined. Therefore, many different flavors of GALS have been presented in the literature. Several implementations [12, 1, 14] have used a local clock generator that can be paused to synchronize GALS modules during data transfers. Some recent implementations use synchronizers [16] or asynchronous FIFOs [3] for the same purpose. It is also possible to generate the local clock pulses of a GALS module directly from handshake signals [10, 9]. In the following, the term GALS will refer to the specific flavor developed by the Integrated Systems Laboratory (IIS) of ETH Zurich.

The initial work on GALS systems at IIS started with a project geared towards reducing the power consumption of wireless systems. The research later focussed more on the practical realization of GALS systems and, as a result, several chips were fabricated and measured. Despite all successfully implemented designs, and an apparent increase in interest in GALS systems, at least in the research community, continued industrial support for GALS projects could not be maintained.

In this paper, a summary of the activities of the IIS in the GALS field is presented. Section 2 describes all chips developed as part of GALS research and highlights the most important results obtained from each implementation. In section 3, issues of practical GALS implementations are discussed. The advantages and present problems of GALS systems are briefly explained in section 4, and we allow ourselves a final word in section 5.

2. The GALS Chips

A total of five different chips using three different manufacturing technologies have been designed as part of the GALS research by the IIS (table 1).

Implementing a given design in silicon has its own set of challenges. Most of the additional effort required is directed towards practical problems, and is not necessarily scientific in nature. However, the experience obtained from such designs helps improve the methodology. As an example, the problems experienced during our first chip (Fango) have led to using local clock generators that have programmable periods for all following GALS systems. Later, during the design of another chip (Acacia), this property of the local clock generator was exploited to improve the efficiency of cryptographic countermeasures.

Table 1. Timeline for GALS chips designed at the IIS.

Name        Year   Key Goal
Fango       1999   initial test design
Marilyn     2000   proof of concept
Oscar       2002   local clock generators
Shir-Khan   2003   multi-point interconnect
Acacia      2005   GALS-based application

Once a suitable self-timed library containing the port controllers and the local clock generator was developed, the remaining part of the design flow was based on standard design automation tools. Apart from the mutual exclusion element that was developed separately, all designs were implemented by using standard cell libraries.

2.1. Fango (1999)

The first GALS implementation realized at the IIS is Fango [14]. This chip implemented the Safer-64 cryptographic algorithm and contained 8 separate GALS modules. The chip also contained an extensive set of test structures, whereby individual GALS ports and data transfer channels could be measured separately. The design was implemented using a 6-Metal 0.25 µm CMOS process and occupied a core area of 0.820 mm² (figure 1).

The majority of the test structures were shown to be functional. However, several practical problems were detected in the design. Some local clock generators were found to be operating faster than expected, causing failures. Similarly, a timing problem between the local clock generator and the port controller caused the locally synchronous island to receive additional clock pulses.

2.2. Marilyn (2000)

With the lessons learned from Fango, a second chip was designed that implements Safer-SK128, a more advanced version of the previous algorithm with support for more operating modes. This chip, called Marilyn, contained 9 separate GALS modules and was implemented using a 5-Metal 0.25 µm CMOS technology.

Figure 1. Micrograph of Fango (0.25 µm 6-Metal technology)

Figure 2. Micrograph of Marilyn (0.25 µm 5-Metal technology).


In order to avoid the problems with Fango, the local clock generators used in Marilyn were designed to be programmable. In this way, the clock period of individual GALS modules can be determined during testing. This also allows the same local clock generator to be used for GALS modules with different timing requirements. This practice has been adopted in all of the following GALS designs.

The SAFER SK128 algorithm was implemented using standard synchronous design as well. This second design, called Merlin, was realized using the same 5-Metal 0.25 µm CMOS technology and enabled a direct comparison between GALS and synchronous design methodologies [13].

Table 2 compares the key parameters of both designs. As can be seen, the area overhead of GALS was quite high. To be able to observe different data transfer methods, the number of GALS modules was artificially kept high. This resulted in a rather fine-grained partitioning where the locally synchronous islands were much less complex than what would be expected in a practical GALS system. Marilyn further contains an additional test solution that essentially replicates the port controllers synchronously. This was designed as a fall-back solution and enabled the entire system to be clocked synchronously for debugging purposes. Both effects together inflated the overhead by a factor of nearly four.

A new problem was encountered with the programmable local clock generator used in Marilyn. The delay steps proved to be too coarse for fast modules. As a result, some GALS modules that were critical for system performance had to be clocked at a rate below their optimal frequency. The performance penalty seen in table 2 is a direct result of this.

On the other hand, the per-data-item energy consumption of the GALS solution Marilyn (555 mJ/Mb) is decidedly smaller than that of a clock-gated synchronous version (737 mJ/Mb), which further confirmed the potential of GALS for low-power applications. Marilyn is also significant as the first fully functional GALS-based system implementation in silicon.

Table 2. Comparison between SAFER SK128 implementations Marilyn (GALS) and Merlin (Synchronous).

                     Merlin   Marilyn   ∆ (%)
Area [mm²]            1.232     1.560     +21
Throughput [Mb/s]       303       232     -30
Energy [mJ/Mb]          737       555     -32
Max. Clock [MHz]        300       240     -25

2.3. Oscar (2002)

The minimum clock period of the local clock generator used in GALS must be designed to match the critical path of the locally synchronous island that it is connected to. In theory, it is possible to design local clock generators that include delay lines which closely match the critical path of individual locally synchronous islands. In practice, however, the exact value of the critical path of a given locally synchronous island can only be determined towards the end of the design cycle. Moreover, in a GALS system that consists of several different GALS modules, each local clock generator must be tuned individually.

Figure 3. Simplified block diagram of the local clock generator. Labels in the diagram: Arbitration Block, MUTEX (three instances, each with an Ri/Ai pair), C, Programmable Delay Line, LClk, DelayControl.

A solution to this problem is to use local clock generators that can be programmed to operate at different frequencies. In this way, a generic local clock oscillator that is able to generate the clock frequencies for all locally synchronous islands is designed once. With this approach, it is not necessary to know the exact value of the critical path of the locally synchronous islands before tape-out. The local clock generators can be adjusted after production to match the critical delay of the locally synchronous island.

Local clock generators with adjustable clock periods consist of a programmable delay line as shown in figure 3. The resolution with which the delay line can be programmed is crucial for systems that have locally synchronous islands with high clock speeds. As an exaggerated numerical example, consider a local clock generator with a period resolution of 0.5 ns per step, starting with a minimum period of 1.5 ns. If the critical path of the locally synchronous island is 2.05 ns (487 MHz), the shortest usable setting is 2.5 ns, so the system can be clocked at 400 MHz only. This would result in the block being clocked at a rate that is nearly 20% below its maximum. This was essentially the problem that plagued Marilyn.
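The numerical example above can be reproduced with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, delays are expressed in integer picoseconds to avoid rounding issues, and the model assumes an ideal, perfectly linear delay line.

```python
def achievable_period_ps(critical_path_ps, min_period_ps, step_ps):
    """Shortest programmable period that is not shorter than the critical path.

    Assumes an idealized delay line: period = min_period + k * step, k >= 0.
    """
    extra = max(0, critical_path_ps - min_period_ps)
    steps = -(-extra // step_ps)  # ceiling division on integers
    return min_period_ps + steps * step_ps


def frequency_loss(critical_path_ps, period_ps):
    """Fraction by which the clock rate falls below 1 / critical_path."""
    return 1.0 - critical_path_ps / period_ps


# Values from the example in the text: 0.5 ns steps, 1.5 ns minimum, 2.05 ns path.
period = achievable_period_ps(2050, 1500, 500)
loss = frequency_loss(2050, period)
print(period, "ps,", round(100 * loss, 1), "% below the maximum rate")
```

Calling the same function with a finer step (for example 100 ps) shows how quickly the loss shrinks, which is the motivation for the delay line architectures evaluated in Oscar.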

There are many alternatives in the literature [8, 6, 11] that describe how programmable delay lines with sub-gate-delay resolution can be realized. The Oscar test chip (figure 4) was designed using a 3-Metal 0.6 µm CMOS technology in an attempt to evaluate different delay line architectures [15]. The chip contains 24 different local clock generator architectures.

Figure 4. Micrograph of Oscar (0.6 µm 3-Metal technology).

If a standard digital placement and routing flow is employed for the local clock generator, standard cells used to construct the programmable delay line may be placed irregularly. This creates different interconnection parasitics between delay elements and degrades the linearity of the delay line. To improve the linearity of the programmable delay line, all delay elements were placed manually. Measurement results for selected local clock generators can be seen in table 3. The local clock generator used in Marilyn has also been implemented in Oscar for comparison purposes. Note that Oscar was implemented using an older 0.6 µm technology. This explains the comparatively low operating frequency and high current consumption.

As can be seen from table 3, the minimum period resolution of the local clock generator used in Marilyn is around 1 ns. This rather coarse delay step equals the propagation delay of four NAND gates. All other local clock generators presented in table 3 include an additional method to introduce 'fine delays'. For best results, the fine delay tuning range must span at least one coarse delay step. To obtain the sub-gate delay resolution of the local clock generator architectures named "negative skew" and "capacitive load", additional standard cells had to be designed. The delay line used in the "negative skew" clock generator requires a specialized inverter which has separate inputs for the NMOS and PMOS transistors as described in [11]. The fine delay of the "capacitive load" delay line is obtained by a custom-designed standard cell that can increase its internal capacitive load in eight steps.

Table 3. Measurement results for selected local clock generators from Oscar (0.6 µm 3-Metal technology).

                      Range     Res.     Current
                      [MHz]     [ps]     [mA]
Inverter Matrix 1x8   34-140    ~530     13.285
Phase Blender         40-140    ~150     16.900
Negative Skew         35-130    ~630     11.036
Capacitive Load       35-171    ~110     14.106
Marilyn Clock         30-140    ~1000    10.251

2.4. Shir-Khan (2003)

The modular approach of GALS-based design makes it especially attractive for implementing SoC systems. However, large-scale industrial SoC designs are often built around bus architectures. Earlier GALS systems designed at the IIS were realized using only point-to-point interconnections between GALS modules. The Shir-Khan chip was designed to test several GALS-compatible multi-point interconnect architectures [17]. A total of five different bus architectures were implemented in Shir-Khan:

Mogli (Modular GALS interconnect): Single-channel shared bus architecture.

Dual Mogli: Similar to the Mogli architecture, but uses separate data channels for command and response. This increases the overhead, but improves the throughput and the bus availability.

AMBA: Based on Mogli, but contains an AMBA/AHB compliant interface.

String (Self-timed ring for GALS): A ring-based architecture where data is transmitted from node to node.

Swing (Switching network for GALS): Switch-based interconnect that includes a crossbar switch to interconnect all modules.

Several GALS modules connected to each multi-point interconnect architecture are required to evaluate the performance. Instead of implementing a dedicated algorithm, Shir-Khan was designed as a specialized test bed. A four-bit micro-controller, capable of controlling four pairs of input and output ports, was designed to serve as the locally synchronous island for all GALS modules. Shir-Khan is a fairly large GALS design, containing 25 GALS modules distributed over a 7 x 4 grid which is clearly visible in the chip micrograph in figure 5.

Figure 5. Micrograph of Shir-Khan.

By using a synchronous configuration interface, each micro-controller could be programmed to generate representative data traffic patterns between GALS modules. After the configuration phase, the system was allowed to run in GALS mode. Each micro-controller contained a dedicated data memory for each of its ports. At the end of the operation, these data memories were read out using the same configuration interface. This process was fully automated.

To be able to support all bus architectures, additional port controllers had to be designed. While Marilyn used only four different port controllers, 57 different port controller configurations were required for Shir-Khan. In total, there were 181 port controller instantiations. A specialized configuration tool was developed to automatically generate the GALS modules from simple configuration scripts.

Table 4 compares the performance figures obtained from the bus architectures implemented in Shir-Khan to several selected synchronous architectures. While the throughput and latency figures are not spectacular, they are comparable to synchronous realizations. The important result obtained from Shir-Khan is that GALS-based systems can also be used to implement multi-point interconnects reliably. Moreover, it was shown that several stages of GALS design can be automated.

Table 4. Main parameters of implemented GALS multi-point architectures, compared to selected synchronous implementations (all in 0.25 µm technology).

               Transfer Rate        Latency
               (Mtransfers/sec)     (ns)
Mogli/AMBA     71                   15/31
String         107 per segment      8 per node
Swing          147 per initiator    8 per stage
PI-bus         50                   40
CoreConnect    100                  20

2.5. Acacia (2005)

Acacia is the first chip where GALS is not only used as a design methodology, but also to solve a problem from an application domain: security of cryptographic accelerators against side-channel attacks [7].

Designed using a 5-Metal 0.25 µm CMOS technology, Acacia, seen in figure 6, implements the popular Advanced Encryption Standard (AES) algorithm using three GALS modules. Combined with well-known countermeasures such as adding noise sources and inserting dummy operations, the independently clocked GALS modules make it very difficult for an attacker to determine when a specific operation is taking place. Attacks are made even more difficult by a specialized local clock generator that is able to change the clock period randomly for each clock cycle. To provide a fair comparison, the chip also contains a synchronous implementation of the AES algorithm without any countermeasures.
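The effect of a clock period that changes randomly from cycle to cycle can be illustrated with a small behavioral model. The sketch below is not the Acacia clock generator: the function name, the period range, the step size, and the use of a software pseudo-random generator are all assumptions chosen only to show why the time of a given cycle becomes hard to predict.

```python
import random


def random_clock_edges(n_cycles, min_period_ps, step_ps, n_settings, rng):
    """Rising-edge times when every cycle picks a new delay setting at random.

    Hypothetical model: period = min_period + setting * step,
    with setting drawn uniformly from 0 .. n_settings - 1 each cycle.
    """
    t = 0
    edges = []
    for _ in range(n_cycles):
        setting = rng.randrange(n_settings)
        t += min_period_ps + setting * step_ps
        edges.append(t)
    return edges


# Illustrative numbers only (not taken from the chip).
edges = random_clock_edges(10, 4000, 250, 8, random.Random(1))
earliest = 10 * 4000
latest = 10 * (4000 + 7 * 250)
print("cycle 10 occurs somewhere between", earliest, "and", latest, "ps")
```

In this model the uncertainty about when the n-th cycle occurs grows with n, which is what makes aligning power traces from repeated runs more difficult for an attacker.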

Figure 6. Micrograph of Acacia (0.25 µm 5-Metal technology). The left side of the chip is occupied by an unrelated design. Labels in the micrograph: Goliath, David (twice), Clockgen (three times), d2g, g2d, g2s, Synchronous Interface & Reference Design.

Introducing countermeasures against side-channel attacks invariably adds penalties to system parameters such as circuit area, operation speed, and power consumption. Acacia can be programmed to trade off throughput against security at run-time. Table 5 compares two different operation modes of Acacia against the synchronous reference implementation. The 'fast' GALS mode tries to compute the result as fast as possible, while the 'secure' mode applies a series of countermeasures to those parts of the operation that are more vulnerable. Even with countermeasures, the GALS-based Acacia achieves throughput and power figures within 20% of those from a fully synchronous version without any countermeasures. The large difference in area can be largely attributed to the implemented countermeasures; the GALS overhead is around 5%. This shows that it is indeed possible to design GALS systems that have similar or even better performance metrics than their synchronous counterparts.
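The encryption time and throughput rows of table 5 can be cross-checked from the I/O clock and cycle counts. The sketch below assumes that the encryption time is simply the number of I/O cycles divided by the I/O clock, and that one AES block of 128 bits is processed per encryption; the function names are hypothetical.

```python
AES_BLOCK_BITS = 128  # AES operates on 128-bit blocks


def encryption_time_ns(io_cycles, io_clock_mhz):
    """Time for one block, assuming time = cycles / clock (an assumption)."""
    return io_cycles * 1000.0 / io_clock_mhz


def throughput_mbps(time_ns):
    """Throughput in Mb/s for one 128-bit block per encryption."""
    return AES_BLOCK_BITS * 1000.0 / time_ns


fast_ns = encryption_time_ns(36, 50)    # GALS fast
secure_ns = encryption_time_ns(46, 50)  # GALS secure
print(fast_ns, secure_ns)
print(throughput_mbps(fast_ns), throughput_mbps(secure_ns), throughput_mbps(779.2))
```

Under these assumptions the two GALS columns are reproduced to within rounding. The synchronous column does not follow exactly from the nominal 150 MHz (117 cycles would give 780 ns rather than 779.2 ns), presumably because the listed clock value is rounded, so the measured time is used directly for its throughput.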

3. Practical GALS design

A synchronous circuit can be used as a starting point for a GALS system. In a process called GALSification, such a circuit gets converted into a GALS system. For an efficient GALS realization, the initial architecture must be designed with GALS in mind. This will be more apparent for larger SoCs that are expected to benefit significantly from GALS-based design. Present SoC designs require tens of clock domains. Several of these domains are introduced to enable system-wide communication protocols between blocks with different operating speeds. If such SoC circuits were designed with GALS in mind, a number of such clock domains would not be required. Moreover, especially for inter-module communication, inherently self-timed communication protocols would be favored over synchronous versions that are more difficult to implement in a GALS system.

Table 5. Measurement results for different operational modes of Acacia. The numbers for GALS modes represent peak performance.

                     GALS     GALS     Sync
                     fast     secure
I/O Clock [MHz]      50       50       150
I/O Cycles           36       46       117
Enc. Time [ns]       720.0    920.0    779.2
Throughput [Mb/s]    177.7    139.1    164.2
Energy [mJ/Mb]       1.232    1.261    0.976
Area [mm²]           1.129    1.129    0.584

The following is a set of problems that need to be addressed in a GALS design flow:

• Standard cell libraries need to be enhanced to include at least the MUTEX element. Additional standard cells for high-resolution delay lines in clock generators, Muller-C gates, or even transistor-level implementations of port controllers would also help to improve the design flow considerably.

• A method for designing the asynchronous port controllers is required. There are several tools developed by the asynchronous community, like Petrify [4], Minimalist [5] and 3D [18], which can be used for this purpose. The designer must be aware of the specific timing requirements of the asynchronous port controllers realized using these tools, and must ensure that all timing constraints are met throughout the design flow.

• An automated tool to find the optimal partitioning of a system into multiple GALS modules.

• A test methodology that is compatible with existing automatic test equipment and that yields comparable test coverage results.

• A tool suite that supports hierarchical placement and routing. This allows locally synchronous islands to be placed and routed first. The self-timed wrappers around the locally synchronous island can then be optimized without affecting the locally synchronous island.

Apart from automated partitioning, it has been our experience that all of these problems can be addressed reliably. Some of these problems, like timing verification and testability, are solved using brute-force methods that are only feasible because they are limited to very small portions of the overall design.

3.1. Self-timed Library

The GALS design methodology relies on the availability of a suitable pausable local clock generator and several port controllers, which are essentially asynchronous finite state machines. Currently, there is no standard definition of how port controllers should function. Depending on the asynchronous protocol used (4-phase, 2-phase), the communication direction (push-channel, pull-channel), and how the clock is paused (stop until transfer is made, stop only during transfer), many different port controllers can be designed. However, only a small set of port controllers (no more than 4) is sufficient for the majority of applications.
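The difference between the two handshake protocols mentioned above can be made concrete by listing the request/acknowledge levels they produce. The sketch below shows only the generic textbook protocols on a single req/ack pair; it is not a model of the IIS port controllers, and all function names are hypothetical.

```python
def four_phase_transfer():
    """(req, ack) levels for one four-phase (return-to-zero) transfer."""
    return [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]


def two_phase_transfers(n, req=0, ack=0):
    """(req, ack) levels for n two-phase (transition-signalling) transfers."""
    trace = [(req, ack)]
    for _ in range(n):
        req ^= 1  # any transition on req signals a request
        trace.append((req, ack))
        ack ^= 1  # any transition on ack signals the acknowledge
        trace.append((req, ack))
    return trace


def count_transitions(trace):
    """Number of signal edges in a trace of (req, ack) levels."""
    return sum((a[0] != b[0]) + (a[1] != b[1]) for a, b in zip(trace, trace[1:]))


print(count_transitions(four_phase_transfer()))     # edges per 4-phase transfer
print(count_transitions(two_phase_transfers(1)))    # edges per 2-phase transfer
```

The four-phase protocol needs twice as many signal edges per transfer, but the wires always return to a known idle level, which tends to simplify the state machine that implements it.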

Designing a self-timed library is essential for GALS implementations, yet most designers are not familiar with methods for constructing asynchronous finite state machines (AFSM). However, designing the self-timed library is a one-time investment. Typical AFSMs are process-independent and can even be re-used for different manufacturing technologies. The AFSM description needs to be mapped to standard cells of the manufacturing technology. Fortunately, most port controllers are relatively small and rarely exceed the complexity of 10 to 20 gate equivalents. Therefore the mapping process can be completed manually.

Depending on the AFSM description used, a number of timing constraints need to be fulfilled. Contrary to synchronous circuits, where timing failures are typically due to slow connections, self-timed circuits suffer from fast feedback signals. While some effort is required to verify that all timing constraints are met, violations can be easily resolved by inserting delay elements. It is therefore also possible to design robust, albeit slower, port controllers that will not violate timing constraints once they are instantiated as part of a GALS system.

3.2. Partitioning

Partitioning the design into GALS modules remains the most critical aspect of GALS. The partitioning has more influence on the performance of the system than all other factors combined. A well-defined methodology to determine the partitioning for GALS designs has yet to be developed. Present GALS systems are partitioned manually following several guidelines:

• A GALS module should consist of a single clock domain.

• GALS modules should be of considerable complexity. They should be large enough to justify the overhead of the self-timed wrapper; however, they should not be overly large, to avoid difficulties involved in distributing the clock across a large locally synchronous island.

• All communications with other modules have the potential to slow down the operation of a GALS module. The system should be partitioned in a way that minimizes the inter-block communication as much as possible. In particular, blocks that communicate every clock cycle (even unidirectionally) should be placed within the same GALS module.
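The last guideline lends itself to a simple mechanical check. As a minimal sketch (not a partitioning tool, and not something used in the chips described here), blocks that exchange data every clock cycle can be merged into the same candidate module with a union-find pass; the block names and the `every_cycle` flag are made up for illustration, and the first two guidelines are not modeled.

```python
def group_blocks(blocks, links):
    """Merge blocks that communicate every cycle into one candidate module.

    links: iterable of (block_a, block_b, every_cycle) tuples.
    Returns a sorted list of sorted block-name lists.
    """
    parent = {b: b for b in blocks}

    def find(x):
        # Follow parent links to the representative, halving paths on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, every_cycle in links:
        if every_cycle:
            parent[find(a)] = find(b)

    modules = {}
    for b in blocks:
        modules.setdefault(find(b), []).append(b)
    return sorted(sorted(m) for m in modules.values())


# Hypothetical design: only the first two links are active every cycle.
blocks = ["ctrl", "alu", "regs", "dma", "uart"]
links = [
    ("alu", "regs", True),
    ("ctrl", "alu", True),
    ("dma", "regs", False),
    ("uart", "ctrl", False),
]
modules = group_blocks(blocks, links)
print(modules)
```

The result is only a lower bound on module size; whether the remaining blocks are worth a self-timed wrapper of their own is still a manual decision.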

Partitioning a synchronous design in a way that is suitable for GALS may increase the latency of individual modules. There are two factors that contribute to this additional latency. Firstly, registers are used at the input and output of individual modules to alleviate timing constraints for back-end design. Secondly, in a GALS system, several status signals that would normally be globally accessible in a standard synchronous system are confined to the module they are generated in. Since GALS modules may all run at different speeds, a second GALS module will be unable to access or predict the present value of the status signals confined within the first GALS module, and cannot be 'prepared' in advance. The second GALS module will therefore have to react after the communication between both GALS modules has taken place, which may indirectly increase the latency.

3.3. Design Flow

Modern SoC designs are very complex systems that can only be reliably engineered by using extensive design automation software. A new design methodology, such as GALS, has no chance of being applied unless it can be supported by existing design tools.

A GALS system consists mainly of locally synchronous islands, which are designed using a standard synchronous design flow. There are only very few additional requirements on the design of a locally synchronous island that will become part of a GALS module:

• The locally synchronous islands need to support the specific handshake protocol imposed by the port controllers.

• All inputs and outputs of the locally synchronous island should be registered to alleviate the timing constraints as much as possible.

A GALS module is created by surrounding the locally synchronous island by appropriate port controllers and a local clock generator. Certain port controllers may require additional glue logic (like latches) at this level as well. Generating a GALS module from a locally synchronous island is not an involved task; however, for designs like Shir-Khan where a large number of GALS modules are used, it needs to be automated.

A hierarchical back-end design flow is used for GALS. State-of-the-art back-end design programs provide excellent support for such hierarchical design flows and can be easily adapted for GALS systems. All locally synchronous islands are placed and routed at the lowest level of the hierarchy. The self-timed wrapper is placed around the locally synchronous island at a second hierarchical level. At this level, the exact timing of all input and output ports of the locally synchronous island is known. If necessary, timing violations between port controller signals and the locally synchronous island are resolved at this level. The top level consists of interconnected GALS modules. This level is not very demanding since there are hardly any timing-critical global signals (such as a clock) that need to be distributed.

3.4. Testability

Circuits that employ self-timed components are often criticized for having poor testability. Similar concerns have been raised about GALS systems over the years. The early chips designed at IIS were mainly test vehicles that contained project-specific test solutions that were not applicable for general use. For the first time, a test solution that can be easily adapted to all GALS-based systems has been used for Acacia.

In a typical GALS system, only a very small percentage of the chip is comprised of self-timed circuits. As an example, the stuck-at fault dictionary of Acacia contains 154,604 faults. Of these, merely 182 (0.118%) are found within the self-timed port controllers. The majority of the faults are located within the locally synchronous islands and can be detected using standard stuck-at fault testing methods. However, it is necessary to provide access to standard test interfaces.

The local clock generator used in Acacia can be configured to run in a test mode, where an external synchronous clock is used instead of the locally generated clock. While system functionality cannot be maintained in this mode, scan-chain based test methods can be applied to test all locally synchronous islands. Such a test is sufficient to provide test coverage for most of the chip.

Faults within the following blocks cannot be detected using this method:

• Local clock generators
• Self-timed port controllers
• Glue logic within the self-timed wrappers
• Input and output registers of the locally synchronous islands

The local clock generators are designed with a special configuration port that allows them to be configured and monitored during operation. This interface can be used to provide a test solution, and all faults within the local clock generator can be detected reliably.

The remaining faults are detected using a functional approach. First, a list of all remaining faults is obtained. Then, for each remaining fault, the netlist of the design is modified to include that fault, and the system is fed with functional vectors that stimulate the chip to execute data transfers between all GALS modules. Faults that manifest themselves at the circuit outputs are marked as detected. A total fault coverage exceeding 99.8% has been achieved by using this method for Acacia.
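The functional fault-detection loop described above can be sketched on a toy example. This is a minimal illustration, not the flow used for Acacia: the three-gate netlist, the fault list, and the names `simulate` and `fault_coverage` are all assumptions made for this sketch, and the fault is injected during simulation rather than by editing a real netlist.

```python
from itertools import product

# Toy combinational netlist (hypothetical): net name -> (gate type, input nets).
# Entries are listed in topological order, so a single pass evaluates the circuit.
NETLIST = {
    "n1": ("AND", ("a", "b")),
    "n2": ("OR", ("b", "c")),
    "out": ("XOR", ("n1", "n2")),
}
GATES = {
    "AND": lambda x, y: x & y,
    "OR": lambda x, y: x | y,
    "XOR": lambda x, y: x ^ y,
}

def simulate(inputs, fault=None):
    """Evaluate the netlist; `fault` is an optional (net, stuck_value) pair."""
    values = dict(inputs)
    if fault and fault[0] in values:
        values[fault[0]] = fault[1]  # stuck-at fault on a primary input
    for net, (gate, (x, y)) in NETLIST.items():
        values[net] = GATES[gate](values[x], values[y])
        if fault and fault[0] == net:
            values[net] = fault[1]  # force the faulty net to its stuck value
    return values["out"]

def fault_coverage(faults, vectors):
    """Return (detected faults, coverage) for the given functional vectors."""
    detected = [
        f for f in faults
        if any(simulate(v, f) != simulate(v) for v in vectors)
    ]
    return detected, len(detected) / len(faults)

# Stuck-at-0 and stuck-at-1 on every net, exercised by all input combinations.
faults = [(net, v) for net in ("a", "b", "c", "n1", "n2", "out") for v in (0, 1)]
vectors = [dict(zip("abc", bits)) for bits in product((0, 1), repeat=3)]
detected, coverage = fault_coverage(faults, vectors)
print(f"{len(detected)}/{len(faults)} faults detected ({coverage:.1%})")
```

A fault counts as detected as soon as one vector makes the faulty output differ from the fault-free output, which mirrors the "manifest themselves at the circuit outputs" criterion in the text.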

4. The Future of GALS Implementations

Despite all the success of implementations on silicon, GALS remains a niche technology at best. There are different reasons why GALS has not been adopted by the industry. The main problem is that the advantages of using GALS do not outweigh the additional effort required to implement GALS systems by a significant margin.

One of the interesting opportunities for GALS is network-on-chip systems, which were conceived to make the design of large-scale SoCs easier. Similar to the implementation of the STRING interconnection system from Shir-Khan, the self-timed wrapper around a GALS module could be extended to support a network switch. The locally synchronous island would only be paused if data needs to be transferred to or from that island. Otherwise, the switch would be able to route data packets without slowing the locally synchronous island. This would also allow parts of the system to be clocked at different rates.
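The routing decision sketched in this paragraph can be modeled behaviorally. The following is a hypothetical illustration only: the `Packet` and `WrapperSwitch` classes and the integer module identifiers are assumptions, and no such switch is described as implemented in the text.

```python
from dataclasses import dataclass, field

@dataclass
class Packet:
    dest: int      # identifier of the target GALS module (assumed addressing scheme)
    payload: str

@dataclass
class WrapperSwitch:
    """Hypothetical switch inside the self-timed wrapper of one GALS module."""
    module_id: int
    pauses: int = 0                                  # times the local clock was paused
    delivered: list = field(default_factory=list)    # payloads handed to the island
    forwarded: list = field(default_factory=list)    # packets passed on to the network

    def handle(self, packet: Packet) -> None:
        if packet.dest == self.module_id:
            # Data is for the local island: pause its clock for the transfer.
            self.pauses += 1
            self.delivered.append(packet.payload)
        else:
            # Transit traffic: route it onward without disturbing the island.
            self.forwarded.append(packet)

switch = WrapperSwitch(module_id=2)
for pkt in [Packet(1, "x"), Packet(2, "y"), Packet(3, "z"), Packet(2, "w")]:
    switch.handle(pkt)
print(switch.pauses, switch.delivered, len(switch.forwarded))  # prints: 2 ['y', 'w'] 2
```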

Due to its specialized nature, GALS may also find applications in selected areas, such as secure system design.

4.1. Advantages of GALS

The most visible advantage of GALS systems remains their modularity. As long as locally synchronous islands are designed such that they can interface to port controllers, they can safely be interconnected to form a larger GALS system. The self-timed handshaking used by GALS for global communication is a natural interface protocol for system design. Absolute minimum requirements for a standard interface include:

1. agreed-on data (or message) formats,
2. agreed-on data transfer protocols,
3. flawless timing.

Note that GALS system operation addresses 2. and 3. for free.
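To illustrate how a handshake covers both the transfer protocol and the timing, the sketch below lists the signal states of one transfer. It assumes a four-phase (return-to-zero) handshake with `req` and `ack` wires, which is a common choice but is not stated in this section; the function name is hypothetical.

```python
def four_phase_transfer(data):
    """Yield (req, ack, bus) states for one assumed four-phase handshake transfer."""
    yield (0, 0, None)   # idle
    yield (1, 0, data)   # sender places data on the bus and raises req
    yield (1, 1, data)   # receiver latches the data and raises ack
    yield (0, 1, None)   # sender sees ack and lowers req
    yield (0, 0, None)   # receiver lowers ack; the channel is idle again

trace = list(four_phase_transfer(0xAB))
for req, ack, bus in trace:
    print(req, ack, bus)
```

Exactly one control signal changes between consecutive states, so each side only reacts to an event from the other side and no shared clock is needed.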

Furthermore, all modules can be designed to run at their optimum operating frequency. This simplifies the design when compared to synchronous methodologies, where all modules have to either agree on a global timing or require costly synchronizing elements at their interfaces. The back-end design is simplified significantly, and most of the global-timing related problems are eliminated.
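A small numerical sketch shows the effect of letting each module run at its own frequency. The module names and critical-path delays are invented for illustration and do not describe any of the chips discussed here.

```python
# Hypothetical critical-path delays per module, in nanoseconds.
critical_path_ns = {"crypto_core": 4.0, "interface": 10.0, "controller": 6.5}

# Single global clock: the slowest module dictates the period for everyone.
global_period_ns = max(critical_path_ns.values())
sync_freq_mhz = {name: 1000.0 / global_period_ns for name in critical_path_ns}

# GALS: each locally synchronous island uses a clock period matching its own delay.
gals_freq_mhz = {name: 1000.0 / delay for name, delay in critical_path_ns.items()}

for name in critical_path_ns:
    print(f"{name}: {sync_freq_mhz[name]:.0f} MHz global, "
          f"{gals_freq_mhz[name]:.0f} MHz local")
```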

GALS presents a new way to construct modular systems. As shown in Acacia, designers can utilize this to develop new solutions as well. A GALS-compatible system needs to be designed in a modular way from the outset, and has to support a handshake protocol. In return, all modules in the system will be able to run independently. Selected parts of the circuit will not need to be slowed down to allow timing constraints on other parts of the system to be met. As long as the throughput constraints of the overall system can be met, the latency of individual GALS modules becomes irrelevant. Unlike classic synchronous systems that are designed with worst-case performance in mind, GALS systems are designed for average-case performance.

GALS has certain qualities that make it attractive for low-power applications. Individual GALS modules could not only be clocked at a slower rate, but their supply could also be regulated on demand. Control circuitry for these dynamic voltage and frequency scaling techniques could use the existing handshake signals to determine the load of GALS modules locally. However, none of these ideas has been applied to a working GALS system yet.
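As the paragraph notes, this idea has not been applied to a working system, so the following is a purely speculative sketch of what such a local controller could compute. The operating-point table, the load thresholds, and the function name are all assumptions.

```python
# Hypothetical operating points, sorted from highest to lowest load threshold:
# (minimum load, clock in MHz, supply in volts).
OPERATING_POINTS = [(0.75, 200, 1.2), (0.25, 100, 1.0), (0.0, 50, 0.8)]

def select_operating_point(handshakes, window_cycles):
    """Pick a clock rate and supply from handshake activity in one observation window."""
    if window_cycles <= 0 or not 0 <= handshakes <= window_cycles:
        raise ValueError("invalid activity counts")
    load = handshakes / window_cycles  # fraction of cycles with a data transfer
    # The last table entry has threshold 0.0, so a match is always found.
    return next((freq, vdd) for min_load, freq, vdd in OPERATING_POINTS
                if load >= min_load)

print(select_operating_point(90, 100))  # busy module
print(select_operating_point(5, 100))   # mostly idle module
```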

4.2. Problems Facing GALS

The main problem facing GALS is that an immediate 10x improvement in key figures of merit cannot be demonstrated. Industry in particular is under immense pressure to deliver working solutions within a very short time frame, and is very reluctant to invest time and energy in new technologies as long as these do not result in large benefits.

On the practical side, partitioning remains one of the main challenges of GALS. There is no automated method for partitioning, yet the quality of the partitioning influences the performance of the GALS system significantly. Current partitioning methods rely on the skill and experience of the system designer.

As mentioned earlier, port controllers used in GALS implementations are not standardized. Anyone interested in designing a GALS circuit has to first select a specific implementation of the port controllers. For designers who are unfamiliar with self-timed design, the respective advantages and disadvantages of different alternatives may not be immediately clear. If a standard set of port controllers were defined, optimized self-timed libraries could be made readily available. In some cases it would even be possible to provide port controllers as standard cells for certain technologies along with the MUTEX. This would give a broader range of design engineers access to GALS.

5. A Final Word

Our experience shows that the GALS approach is indeed a relatively mature design methodology that can be safely applied to developing digital systems. As long as a system has been designed with GALS in mind, the GALS design flow has been proven to produce results comparable to what can be expected from established industrial design flows. Once everything is in place, the design effort is similar to that of a synchronous design of the same complexity. All major steps of the design flow can be performed using industry-standard EDA tools augmented by a couple of extra synthesis scripts and command files.

The major challenge is more of an intellectual nature. Understanding asynchronous state machines, designing adjustable oscillators, testing self-timed circuits, etc., all ask for expertise and for a mindset that go well beyond those required for regular HDL synthesis with standard cells. As long as there exist conceptually simpler ways to piece together large chips from multiple synchronous islands, industry shies away from the complications and imponderabilities brought about by an unfamiliar design style and unproven flows. To be sure, this attitude is well founded on a background of current market pressures. GALS, and other asynchronous design methodologies as well, for that matter, are thus bound to remain confined to the research community until something changes in this picture.

References

[1] D. S. Bormann and P. Y. Cheung. Asynchronous Wrapper for Heterogeneous Systems. In Proc. International Conf. Computer Design (ICCD), Oct. 1997.

[2] D. M. Chapiro. Globally-Asynchronous Locally-Synchronous Systems. PhD thesis, Stanford University, Oct. 1984.

[3] A. Chattopadhyay and Z. Zilic. GALDS: A Complete Framework for Designing Multiclock ASICs and SoCs. IEEE Transactions on VLSI Systems, 13(6):641–654, June 2005.

[4] J. Cortadella, M. Kishinevsky, A. Kondratyev, L. Lavagno, and A. Yakovlev. Petrify: a Tool for Manipulating Concurrent Specifications and Synthesis of Asynchronous Controllers. IEICE Transactions on Information and Systems, E80-D(3):315–325, Mar. 1997.

[5] R. M. Fuhrer, S. M. Nowick, M. Theobald, N. K. Jha, B. Lin, and L. Plana. Minimalist: An Environment for the Synthesis, Verification and Testability of Burst-mode Asynchronous Machines. Technical Report TR CUCS-020-99, Columbia University, NY, July 1999.

[6] B. Garlepp, K. Donnelly, J. Kim, P. Chau, J. Zerbe, C. Huang, C. Tran, C. Portmann, D. Stark, C. Yiu-Fai, T. Lee, and M. Horowitz. A Portable Digital DLL for High-Speed CMOS Interface Circuits. IEEE Journal of Solid-State Circuits, 34(5):632–644, May 1999.

[7] F. K. Gürkaynak, S. Oetiker, H. Kaeslin, N. Felber, and W. Fichtner. Improving DPA Security by Using Globally-Asynchronous Locally-Synchronous Systems. In Proc. European Solid-State Circuits Conference (ESSCIRC), pages 407–411, Sept. 2005.

[8] T.-Y. Hsu, C.-C. Wang, and C.-Y. Lee. Design and Analysis of a Portable High-Speed Clock Generator. IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, 48(4):367–375, Apr. 2001.

[9] J. Kessels, A. Peeters, P. Wielage, and S.-J. Kim. Clock Synchronization through Handshake Signalling. In Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems, pages 59–68, Apr. 2002.

[10] M. Krstic, E. Grass, and C. Stahl. Request-driven GALS Technique for Wireless Communication System. In Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems, pages 76–85, Mar. 2005.

[11] S.-J. Lee, B. Kim, and K. Lee. A Novel High-Speed Ring Oscillator for Multiphase Clock Generation Using Negative Skewed Delay Scheme. IEEE Journal of Solid-State Circuits, 32(2):289–291, Feb. 1997.

[12] S. Moore, G. Taylor, R. Mullins, and P. Robinson. Point to Point GALS Interconnect. In Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems, pages 69–75, Apr. 2002.

[13] J. Muttersbach, T. Villiger, and W. Fichtner. Practical Design of Globally-Asynchronous Locally-Synchronous Systems. In Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems, pages 52–59, Apr. 2000.

[14] J. Muttersbach, T. Villiger, H. Kaeslin, N. Felber, and W. Fichtner. Globally-Asynchronous Locally-Synchronous Architectures to Simplify the Design of On-CHIP Systems. In Proc. 12th International ASIC/SOC Conference, pages 317–321, Sept. 1999.

[15] S. Oetiker, T. Villiger, F. K. Gürkaynak, H. Kaeslin, N. Felber, and W. Fichtner. High Resolution Clock Generators for Globally-Asynchronous Locally-Synchronous Designs. In Handouts of the Second ACiD-WG Workshop of the European Commission’s Fifth Framework Programme, Munich, Germany, Jan. 2002.

[16] S. F. Smith. An Asynchronous GALS Interface with Applications. In Proc. IEEE Workshop on Microelectronics and Electron Devices, pages 41–44, 2004.

[17] T. Villiger, H. Kaeslin, F. K. Gürkaynak, S. Oetiker, and W. Fichtner. Self-Timed Ring for Globally-Asynchronous Locally-Synchronous Systems. In Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems, pages 141–150, May 2003.

[18] K. Y. Yun and D. L. Dill. Automatic Synthesis of Extended Burst-Mode Circuits: Part II (Automatic Synthesis). IEEE Transactions on Computer-Aided Design, 18(2):118–132, Feb. 1999.
