GALAXY: GALS InterfAce for CompleX Digital SYstem Integration

Confid. Level: Public

Date: 01/12/2010

Issue: 1

D33_IHP_R001_5_System_Integration.doc

Deliverable – D33

System Integration based on GALS Design Flow

Grant Agreement No: 214364

Project acronym: GALAXY

Project title: GALS InterfAce for CompleX Digital System Integration

Funding Scheme: STREP

Date of latest version of Annex I against which the assessment will be made: 22.03.2010

Contractual Date of Delivery to the EC: 30. Nov 10

Actual Date of Delivery to the EC: 1. Dec 10

Author(s): Milos Krstic, Xin Fan (IHP), Milos Stanisavljevic (EPFL)

Participant(s): IHP

Work Package: WP8

Security: Public

Nature: Report

Version: 1

Total number of pages: 35

Abstract:

This is a report that evaluates GALS design flow and compares the efforts needed to introduce GALS technique in comparison to the classical synchronous flow. Here we will focus on our experiences during the design of the synchronous/GALS Moonrake chip and evaluate effectiveness of GALS introduction in respect to system integration.

Keyword list: GALS, asynchronous, design flow

Function Responsibility Date Signature

Written by:

M. Krstic 07.10.2010

Checked by:

Approved by:

Reserved to EC

Approved by:

CHANGE RECORDS

ISSUE   DATE        CHANGE RECORD                 AUTHOR
1       7-Oct-10    First version of the report   M. Krstic
2       12-Nov-10   Advanced draft                X. Fan
3       20-Nov-10   EPFL content added            M. Stanisavljevic
4       1-Dec-10    Final checks                  M. Krstic

BIBLIOGRAPHIC RECORD

Project Number: 214364

Project Title: GALAXY

Deliverable Type: Report

Deliverable Number: D33

Contractual Date of Delivery: 30. Nov 2010

Actual Date of Delivery: 1. Dec 2010

Title of Deliverable: System Integration based on GALS Design Flow

Work package contributing to the Deliverable:

WP8

Authors: M. Krstic, X. Fan, M. Stanisavljevic

Abstract This is a report that evaluates GALS design flow and compares the efforts needed to introduce GALS technique in comparison to the classical synchronous flow. Here we will focus on our experiences during the design of the synchronous/GALS Moonrake chip and evaluate effectiveness of GALS introduction in respect to system integration.

Keywords GALS, asynchronous, design flow

Confidentiality Level Public

Name of Client: EC

Distribution List: GALAXY, EC

Authorised by: Milos Krstic

Version: 1

Document ID: D33

Total Number of Pages: 35

Contact Details: [email protected]

TABLE OF CONTENTS

1 INTRODUCTION
2 REFERENCES
  2.1 ACRONYMS
  2.2 REFERENCE DOCUMENTS
3 GALS DESIGN FLOW
  3.1 DESIGN SPECIFICATION
  3.2 BEHAVIOUR LEVEL CODING
  3.3 ASYNCHRONOUS-SYNCHRONOUS CO-SIMULATION
  3.4 SYNTHESIS
  3.5 POST-SYNTHESIS SIMULATION
  3.6 BACK-END LAYOUT
  3.7 POST-LAYOUT SIMULATION
  3.8 SIGN-OFF AND TAPE-OUT
  3.9 GALS DESIGN FOR TESTABILITY
4 EVALUATING DESIGN ASPECTS OF MOONRAKE CHIP
  4.1 SYSTEM ARCHITECTURE
  4.2 OFDM TRANSMITTER DATAFLOW
  4.3 GALS OFDM TRANSMITTER SYSTEM PARTITIONING
  4.4 GALS SYSTEM AT THE BEHAVIOURAL LEVEL
    4.4.1 RTL Design of Synchronous Modules
    4.4.2 Behaviour Design of Asynchronous Wrappers
  4.5 SYNTHESIS
    4.5.1 Synchronous Functional Modules Synthesis
    4.5.2 Asynchronous Wrapper Synthesis
  4.6 DESIGN FOR TESTABILITY
  4.7 BACK-END DESIGN
    4.7.1 Asynchronous Wrappers Layout
    4.7.2 Top-Level Layout
    4.7.3 Physical Parameters of Moonrake Chip
  4.8 GALS VS. SYNCHRONOUS DESIGN COMPARISON
    4.8.1 Efforts for GALS design
    4.8.2 Benefits from GALS design

5 COMPARING SYSTEM INTEGRATION WITH GALS AND SYNCHRONOUS APPROACH
  5.1 GALS DESIGN FLOW
    5.1.1 Design Specification
    5.1.2 Behaviour Level Coding
    5.1.3 Asynchronous-Synchronous Co-simulation
    5.1.4 Synthesis
    5.1.5 Back-End Layout
    5.1.6 Final Considerations
    5.1.7 Evaluating the Effort for Example Designs
  5.2 CLOCK SKEW IN SYNCHRONOUS SYSTEMS AND GALS SYSTEMS ADVANTAGES
  5.3 MULTISYNCHRONOUS VS. GALS DESIGN
6 CONCLUSIONS

LIST OF FIGURES

Figure 1: GALS Design flow
Figure 2: Architecture of Moonrake chip
Figure 3: OFDM transmitter dataflow
Figure 4: Topology of GALS OFDM transmitter
Figure 5: STG of asynchronous I/O port controllers
Figure 6: Clock dependency in GALS OFDM transmitter
Figure 7: Synthesized netlist of I/O port controllers using Petrify
Figure 8: Floorplan of Moonrake chip
Figure 9: Final layout of Moonrake chip
Figure 10: Clock skew histogram under supply voltage fluctuation
Figure 11: Clock skew histogram under gate length fluctuation

LIST OF TABLES

Table 1: Power Estimation and System partitioning of GALS OFDM Transmitter
Table 2: Interconnect Parameters
Table 3: Interconnect Parameters

1 INTRODUCTION

In this report we summarize the system integration aspects of the GALS technique. The GALS technique was introduced in the 1980s, and the first chips were developed during the last decade. However, until now there has been no study of how effectively the GALS approach deals with the problems of system integration. In particular, this effectiveness must be analyzed in terms of the cost (time, effort, resources) invested in GALS methods during design development and the benefits that this investment brings in comparison to the standard synchronous approach.

In this context, we first analyze the GALS design flow and the additional or different steps that must be implemented in comparison to the synchronous approach. We then describe the experience of designing the GALS chip (the Moonrake chip) in the GALAXY project. Finally, we describe the system integration aspects and the benefits that are introduced with GALS technology.

2 REFERENCES

2.1 ACRONYMS

GALS Globally Asynchronous Locally Synchronous

LS Locally Synchronous

MUTEX Mutual Exclusion

NoC Network on Chip

STA Static Timing Analysis

2.2 REFERENCE DOCUMENTS

Ref. Document Title

[BEI06] E. Beigne, P. Vivet, Design of On-chip and Off-chip Interfaces for a GALS NoC Architecture, Proceedings of 12th IEEE International Symposium on Asynchronous Circuits and Systems (ASYNC'06), Grenoble, France, pp. 172-181, March 2006.

[BEI08] E. Beigne, et al., An asynchronous power aware and adaptive NoC based circuit, 2008 IEEE Symposium on VLSI Circuits, 18-20 June 2008, pp. 190-191, DOI: 10.1109/VLSIC.2008.4586002

[FAN09] X. Fan, M. Krstić, E. Grass, Analysis and Optimization of Pausible Clocking based GALS Design, In Proc. of XXVII IEEE International Conference on Computer Design (ICCD) 2009, Resort at Squaw Creek, Lake Tahoe, California, pp 358-365, "Best Paper" award

[GU06] F. Gürkaynak, GALS System Design: Side Channel Attack Secure Cryptographic Accelerators, PhD thesis, Hartung-Gorre Verlag, 2006

[GUO06] F. K. Gurkaynak, S. Oetiker, H. Kaeslin, N. Felber and W. Fichtner: "GALS at ETH Zurich: Success or Failure ?", Proceedings of the Twelfth IEEE International Symposium on Asynchronous Circuits and Systems, Grenoble France, pp. 159-168, March 13-15, 2006.

[G07] E. Grass, F. Herzel, M. Piz, K. Schmalz, Y. Sun, S. Glisic, M. Krstić, K. Tittelbach, M. Ehrig, W. Winkler, C. Scheytt, R. Kraemer, 60 GHz SiGe-BiCMOS Radio for OFDM Transmission, In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), New Orleans, USA, 27 - 30 May 2007, pp. 1979-1982.

[HA05] M. Hashimoto, T. Yamamoto, and H.Onodera. Statistical Analysis of Clock Skew Variation in H-Tree Structure. In Proceedings of the 6th International Symposium on Quality of Electronic Design (ISQED), 2005, pp. 402-407.

[Krstic] M. Krstic, “Request Driven GALS Architecture”, PhD Thesis, BTU Cottbus, Germany, 2006.

[LEM10] R. Lemaire, Y. Thonnart, Magali, a Reconfigurable Digital Baseband for 4G Telecom Applications based on an Asynchronous NoC, The 4th ACM/IEEE International Symposium on Networks-on-Chip - Grenoble, France, May 3-6, 2010

[LIN04] Lines, A., Asynchronous interconnect for synchronous SoC design, IEEE Micro, 2004, 24, (1), pp. 32–41

[IAF] IAF Basic FFP board, datasheet, http://www.iaf-bs.de/products/ffp-basic.de.html

[MEH01] V. Mehrotra and D. Boning, Technology scaling impact of variation on clock skew and interconnect delay, in Proc. Int. Interconnect Technology Conf., 2001, pp. 122–124.

[OVG02] Stephan Oetiker, Thomas Villiger, Frank K. Gürkaynak, Hubert Kaeslin, Norbert Felber, and Wolfgang Fichtner, High Resolution Clock Generators for Globally-Asynchronous Locally-Synchronous Designs, Handouts of the Second ACiD-WG Workshop of the European Commission’s Fifth Framework Programme, Munich, Germany, January 2002.

[PIZ07] Maxim Piz, Eckhard Grass, A synchronization scheme for OFDM-based 60 GHz WPANs, In Proceedings of IEEE 18th International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC 2007), 2007, pp. 1-5.

[SF01] J. Sparsø, S. Furber, R. van Leuken, R. Nouta, and A. de Graaf, Principles of Asynchronous Circuit Design: A Systems Perspective. Kluwer Academic Publishers, Boston, 2001.

[ZHU03] Q. K. Zhu, High-Speed Clock Network Design, Kluwer Academic Publishers, 2003.

3 GALS DESIGN FLOW

The GALS design flow applied in the GALAXY project is illustrated in Figure 1. The remainder of this section describes the design flow in some detail.

[Figure 1: GALS Design flow. Flowchart stages, with the languages and tools named at each stage: design specification; asynchronous RTL coding (Petri net, STG, BALSA) and synchronous RTL coding (VHDL, Verilog, SystemC); async-sync co-simulation (GalaxyIDE, ASIP); asynchronous logic synthesis (async SC lib; Petrify, BALSA, 3D) and synchronous logic synthesis (SC lib; DC compiler, etc.); post-synthesis simulation (Modelsim, etc.); back-end with standard back-end tools (labels: Verilog netlist, async macro); sign-off (STA, DRC, LVS, ERC) and tape-out (Calibre, Assura, etc.).]

3.1 DESIGN SPECIFICATION

A basic specification is required in GALS design; it defines the performance targets and the area/power constraints. System partitioning should be addressed in the design specification, taking into consideration both the implementation benefits and the overheads introduced by GALS design.

A critical issue in today's digital VLSI design is robust clock network distribution. A synchronous design requires a globally distributed clock tree with high fan-out and low skew, which is considered a major challenge for back-end design and often consumes significant power and area. By removing the global clock signal, a GALS design makes clock tree distribution more efficient and less costly. On the other hand, the additional overheads introduced by the GALS design methodology have to be accounted for in two respects. First, communication across different clock domains is accomplished by interface circuits, which can be, for example, asynchronous FIFOs, pausible local clocking schemes, or simple two-stage flip-flop synchronizers. Whatever kind of interface circuit is applied, it introduces additional power and area consumption. Second, the interface circuits have to spend additional clock cycles resolving metastability for safe data transfer, which introduces communication latency in GALS systems. In principle, the more GALS blocks there are in the system, the simpler the local clock tree distribution in each GALS block, but the higher the power, area and latency overheads of the whole GALS design. The designer must trade off the benefits against the overheads when choosing the granularity of GALS system partitioning.
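The latency cost of the simple flip-flop interface mentioned above can be made concrete with the textbook synchronizer reliability estimate, MTBF = exp(t_r / tau) / (T_w * f_clk * f_data). The Python sketch below is illustrative only: the function name, the model of one extra clock period of resolution time per added stage, and all parameter values are assumptions chosen for the example, not figures from the GALAXY project or from any particular technology.

```python
import math


def synchronizer_mtbf(f_clk, f_data, tau, t_w, stages=2, t_overhead=0.0):
    """Estimate the mean time between synchronization failures in seconds.

    f_clk      -- receiving clock frequency in Hz
    f_data     -- rate of asynchronous data transitions in Hz
    tau        -- metastability resolution time constant in seconds (assumed)
    t_w        -- metastability window in seconds (assumed)
    stages     -- number of flip-flops in the synchronizer chain
    t_overhead -- setup plus clock-to-output time lost per stage, in seconds
    """
    t_clk = 1.0 / f_clk
    # Every stage after the first gives the sampled value roughly one more
    # clock period (minus flip-flop overhead) in which to resolve.
    t_resolve = (stages - 1) * (t_clk - t_overhead)
    return math.exp(t_resolve / tau) / (t_w * f_clk * f_data)


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    for n in (1, 2, 3):
        mtbf = synchronizer_mtbf(200e6, 50e6, tau=50e-12, t_w=20e-12, stages=n)
        print(f"{n} stage(s): latency {n} cycle(s), MTBF about {mtbf:.3g} s")
```

Under this model each added stage multiplies the MTBF by a large factor but costs one more clock cycle of latency per transfer, which is the trade-off the partitioning decision has to weigh.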

3.2 BEHAVIOUR LEVEL CODING

Behaviour level coding in GALS design consists of two parts. For synchronous functional modules, standard register-transfer-level (RTL) design is used, usually written in VHDL, Verilog or SystemC. However, asynchronous circuits, such as those involved in the pausible clocking scheme of asynchronous wrappers, need dedicated languages for behaviour level description. In the GALAXY project, signal transition graphs (STGs), which are based on Petri nets, have been applied to describe the behaviour of asynchronous port controllers.

RTL design of synchronous digital circuits is mature and well supported by commercial CAD tools. In GALS design, most of the circuits, including all the functional modules, are normally designed in a synchronous manner. The behaviour level design of GALS systems is therefore, in general, rather similar to that of traditional synchronous systems. Asynchronous design is used only for the interface circuits. Given the simple structure of these interface circuits, asynchronous design occupies only a very limited part of the overall system design effort. For such simple asynchronous circuits, the behaviour level design methods mentioned above are efficient and easy to apply.
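To illustrate what an STG description amounts to, the following Python sketch plays the Petri net token game for a generic four-phase handshake (req+, ack+, req-, ack-). It is a minimal example under stated assumptions: the place names p0 to p3 and the data layout are invented for illustration, and the net is not the STG of the GALAXY port controllers.

```python
# Each transition is a signal edge; it may fire when every input place
# holds a token. The four places form a ring, so the edges must alternate
# in the order req+, ack+, req-, ack-.
STG = {
    "req+": {"inputs": ["p0"], "outputs": ["p1"]},
    "ack+": {"inputs": ["p1"], "outputs": ["p2"]},
    "req-": {"inputs": ["p2"], "outputs": ["p3"]},
    "ack-": {"inputs": ["p3"], "outputs": ["p0"]},
}
INITIAL_MARKING = {"p0": 1, "p1": 0, "p2": 0, "p3": 0}


def enabled(marking):
    """Return the transitions that may fire in the given marking."""
    return [t for t, arcs in STG.items()
            if all(marking[p] > 0 for p in arcs["inputs"])]


def fire(marking, transition):
    """Fire one transition and return the new marking."""
    if transition not in enabled(marking):
        raise ValueError(f"{transition} is not enabled in {marking}")
    new_marking = dict(marking)
    for place in STG[transition]["inputs"]:
        new_marking[place] -= 1
    for place in STG[transition]["outputs"]:
        new_marking[place] += 1
    return new_marking


def run(trace):
    """Replay a sequence of signal edges from the initial marking."""
    marking = dict(INITIAL_MARKING)
    for transition in trace:
        marking = fire(marking, transition)
    return marking
```

Replaying a trace that violates the specified order, for example starting with ack+, raises an error, which is the kind of check a behaviour level simulation of the port controller performs.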

3.3 ASYNCHRONOUS-SYNCHRONOUS CO-SIMULATION

The behaviour level descriptions of both the synchronous functional modules and the asynchronous wrappers have to be integrated in a complete environment for system simulation. An asynchronous-synchronous co-simulation tool, GALAXYIDE, which uses the ASIP packaging format, has been developed in the GALAXY project. It supports behaviour level co-simulation using different languages such as VHDL/Verilog and STG/Petri net, and provides an effective way to cover the behaviour design of synchronous-asynchronous interface circuits at a very early design stage.

Including the asynchronous circuits in behaviour level simulations is crucial for GALS system design. Although the structure of asynchronous interface circuits is relatively simple, their behaviour can be very complicated. The output response of an asynchronous circuit depends not only on which transitions occur on its input signals but also on the order in which they occur. The same combination of input transitions arriving in a different order can lead to different outputs. Generally speaking, input signals arrive at GALS interface circuits at arbitrary times, so it is important to cover all the corner cases in the high-level behaviour simulations. ASIP provides a solution for performing behaviour-level asynchronous-synchronous co-simulation. Up to now, it is the only tool supporting system-level behaviour simulation of GALS designs.
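The order dependence described above can be shown with a small Python sketch that enumerates every interleaving of the events from two independent requesters and feeds each one to a toy arbiter. The arbiter model, the event names (r1+, r1-, r2+, r2-) and the function names are assumptions made for this example; they do not describe the GALAXY tools or circuits.

```python
from itertools import combinations


def interleavings(a, b):
    """Yield every merge of sequences a and b that keeps each one's order."""
    n = len(a) + len(b)
    for positions in combinations(range(n), len(a)):
        chosen = set(positions)
        it_a, it_b = iter(a), iter(b)
        yield tuple(next(it_a) if i in chosen else next(it_b)
                    for i in range(n))


def simulate_mutex(events):
    """Toy two-input arbiter: return the order in which grants are issued.

    Returns None if a requester withdraws before it has been granted,
    which a four-phase handshake does not allow.
    """
    owner, waiting, grants = None, [], []
    for event in events:
        who, edge = event[:-1], event[-1]
        if edge == "+":
            if owner is None:
                owner = who
                grants.append(who)
            else:
                waiting.append(who)
        else:
            if owner != who:
                return None
            owner = waiting.pop(0) if waiting else None
            if owner is not None:
                grants.append(owner)
    return tuple(grants)


def distinct_outcomes(a, b):
    """Collect the grant orders over all protocol-respecting interleavings."""
    results = (simulate_mutex(order) for order in interleavings(a, b))
    return {r for r in results if r is not None}
```

For two requesters that each raise and then release a request, the same four input transitions produce two different grant orders depending on arrival order, so a simulation that exercises only one ordering would miss half of the behaviour.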

3.4 SYNTHESIS

In GALS design, the asynchronous and synchronous behaviour descriptions are synthesized separately using different CAD tools. For synchronous RTL code, Synopsys Design Compiler (DC) is widely used for digital circuit synthesis with a standard cell library. Synthesis of asynchronous behaviour descriptions is much more challenging than its synchronous counterpart. Dedicated CAD tools, such as BALSA, TAST and HASTE for complicated asynchronous circuits, or Petrify, 3D and Minimalist for simple asynchronous logic, are used to guarantee the quasi-delay-insensitivity (QDI) of the gate-level circuits.

The Muller C-element and the mutual exclusion element (MUTEX) are widely used in asynchronous design; however, these cells are not included in most of today's synchronous standard cell libraries. Effort is therefore required to create an asynchronous standard cell library in the same technology as the synchronous standard cell library. Because only a few types of asynchronous cells are needed in GALS design, generating the asynchronous library is relatively simple.
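For readers unfamiliar with the Muller C-element, its behaviour can be summarized in a few lines of Python: the output follows the inputs when they agree and holds its previous value when they differ. This is a behavioural sketch only, with class and function names chosen for the example; it says nothing about how the cell is implemented in the asynchronous library.

```python
class CElement:
    """Behavioural model of a two-input Muller C-element."""

    def __init__(self, initial=0):
        self.out = initial

    def update(self, a, b):
        # Both inputs equal: the output takes their common value.
        # Inputs differ: the output keeps its previous state.
        if a == b:
            self.out = a
        return self.out


def trace(inputs, initial=0):
    """Apply a sequence of (a, b) input pairs and return the output history."""
    cell = CElement(initial)
    return [cell.update(a, b) for a, b in inputs]
```

The state-holding branch is what distinguishes the cell from an ordinary AND or OR gate and is the reason it has to be added to a synchronous standard cell library as a separate cell.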

The asynchronous parts of a GALS design need to be synthesized separately from the synchronous circuits. The synthesized netlist of an asynchronous circuit is sensitive to variations in its environment, such as input signal transition times and output fan-out loads. Setting the I/O constraints for asynchronous circuit synthesis correctly therefore requires a careful estimate of the environment. The synthesis of the synchronous modules in a GALS design, on the other hand, is similar to that in a synchronous design.

3.5 POST-SYNTHESIS SIMULATION

The netlists of both the synchronous and asynchronous circuits are integrated at the top level for post-synthesis simulation. The cell and interconnect delays are extracted and back-annotated into the netlist for accurate gate-level simulation. Note that, since the asynchronous circuits are synthesized independently, their delay information relies on the I/O constraints set during synthesis. In top-level post-synthesis simulation, a mismatch between the constraints used for synthesis and the real values of the driving and load signals changes the delays of the asynchronous circuits and can lead to errors.

3.6 BACK-END LAYOUT

Hierarchical layout is performed in GALS design. Because their performance and reliability are highly sensitive to interconnect delays, asynchronous circuits need to be implemented as soft or hard macros. In system-level layout, these asynchronous macros are merged with the other cells. Static timing analysis (STA) is performed at both macro level and system level for timing closure.

Depending on the design requirements, the asynchronous circuits can be laid out as soft macros or hard macros. For a soft macro, the timing information can be extracted and used for post-layout simulation. A hard macro is laid out at transistor level, so it is difficult to extract its delay information for simulation as is done for normal digital cells. However, a hard macro can be optimized with maximum effort for high performance and reliability.

3.7 POST-LAYOUT SIMULATION

Post-layout simulation is especially important for verifying the function of asynchronous circuits. For most asynchronous circuits the QDI property is preferred, where the circuits are robust to cell and wire delays under the assumption of isochronic forks. In post-synthesis simulation, however, there is no timing information for the wires. Only with the post-layout netlist can the interconnect delays in the asynchronous circuits be checked. When asynchronous circuits are implemented as macros in GALS systems, the interconnect delays should normally be rather small and can be safely neglected.
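One way to picture the post-layout check on interconnect delays is a script that compares the extracted delays on the branches of each fork against an allowed skew. The sketch below is a simplified illustration: the fork names, the delay values in picoseconds and the skew limit are hypothetical, and real flows would take these numbers from the extraction and timing tools.

```python
def fork_skews(branch_delays):
    """Return the skew (largest minus smallest branch delay) of each fork."""
    return {name: max(delays) - min(delays)
            for name, delays in branch_delays.items()}


def violating_forks(branch_delays, max_skew):
    """List the forks whose branch skew exceeds the allowed limit."""
    skews = fork_skews(branch_delays)
    return sorted(name for name, skew in skews.items() if skew > max_skew)


# Hypothetical extracted wire delays in picoseconds, one list per fork.
EXAMPLE_DELAYS_PS = {
    "req_fork": [20, 24],
    "ack_fork": [15, 90],
}
```

With the example data and a limit of 50 ps, only `ack_fork` would be flagged; inside a compact macro, as the text notes, such skews are normally expected to be small.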

3.8 SIGN-OFF AND TAPE-OUT

As in synchronous design, the GALS design is released for tape-out after passing DRC, LVS, ERC and final timing closure. The tools used are the standard commercial CAD tools that are also used for synchronous designs.

3.9 GALS DESIGN FOR TESTABILITY

In order to apply the GALS methodology, we need a clear path for testing systems structured in this way. As soon as a design using self-timed circuit techniques is mentioned, the first question that comes up is: "How are you going to test this circuit?" This section summarizes recommendations on test development for GALS. The text is taken directly from deliverable D16, which dealt with the test approach for GALS.

• Use the scan approach wherever it is feasible

Create one or more scan paths that together reach all latches and flip-flops of the synchronous parts. The scan path operates only when the circuit is in test mode and can be generated by tools for synchronous design. Automatic test pattern generation tools can therefore be applied in order to minimize the test time and maximize the fault coverage. An external clock is needed to shift the states through the scan path. A reasonable approach is for each LS block to have its own autonomous scan chain(s). However, it is also possible to combine different LS domains into a single scan chain. In any case, we can apply any of the scan methods that are already used for multi-clock systems. Asynchronous wrappers can be included in the scan chain by breaking their internal loops and making all sequential elements scannable. In this case, each asynchronous wrapper should normally belong to the scan chain of its respective locally synchronous block.

• Use test monitors for handshaking circuits

The control circuit of the asynchronous part can be tested in normal operation (functional mode). If the handshaking between modules gets stuck where that of a golden chip does not, the chip is faulty and therefore useless. A test pattern therefore has to be designed that activates all the asynchronous channels in the chip. Since it is very difficult to generate test patterns able to catch all possible dynamic faults that may appear in an asynchronous circuit in a nanometre process, it is very useful to implement test monitors that can follow the operation of the handshake circuits online.

• Combine testing of synchronous and asynchronous components with BIST

For large chips the scan-path method may take too much time. In those cases built-in self-test (BIST) capabilities provide a suitable solution, just as they do for synchronous designs. Each module can have its own BIST structures, and they can even work concurrently. Additionally, it is possible to have a global BIST that tests the complete functionality of the system. The BIST should achieve the best possible test coverage, and this can be checked with appropriate tools.

• Always perform a functional test

Although the main focus for classical synchronous designs is on structural testing, for GALS and asynchronous circuits it is very important to verify operation with respect to dynamic faults. Therefore, some form of functional system testing is needed, either through BIST or by testing the asynchronous channels. For circuits of limited complexity, as in the case of the asynchronous wrappers, such functional tests can also achieve very high test coverage for the stuck-at fault model while covering many dynamic faults as well.

• Use a multi-level approach for GALS testing

It is not necessary to base the test of the GALS system on a single test strategy applied to all system components at the same time and with an integrated DfT structure. It is highly advisable to exploit GALS modularity and to organize testing at different hierarchical levels as well. In particular, local test methods should be implemented at the locally synchronous level (possibly including the respective asynchronous wrapper). The local test methods applied should be based on proven, commercially available tools and methods. For example, the usual strategy would include scan test at the locally synchronous level. With this approach, the ATE has to be granted access to each locally synchronous block.

• Use commercial DFT tools as much as possible

There is a wide range of available DFT CAD tools for DFT insertion and ATPG test generation based on different fault models (for example TetraMAX). Those tools are today very effective and mature, and for state-of-the-art GALS systems they should be utilized as much as possible. DFT scripts should be optimized for GALS and asynchronous logic to enable successful testing. In principle, however, if the asynchronous logic is prepared for scan insertion and the loops are broken, commercial ATPG tools can be used directly. Use of academic DFT tools is also possible but usually difficult due to their immaturity, restrictions on system complexity, and lack of support and documentation.

• Enable use of the industrial hardware testers and ATE in general

The industry standard today is the use of complex hardware testers for manufacturing test. The market leaders, such as Verigy, Teradyne, Advantest, and LTX Credence, have developed a large set of testers optimized for SoC testing, memory testing, or mixed-design testing. However, most of those testers are cycle based and built around the synchronous paradigm. Direct introduction of event-based asynchronous logic to cycle-based testers is therefore usually not possible without careful evaluation of the test strategy, and the input and output signals of the CUT have to be synchronized and prepared for cycle-based sampling and strobing. An additional potential problem is the timing nondeterminism present in asynchronous, and potentially GALS, circuits due to the arbitration process. Hardware testers usually have difficulty dealing with non-deterministic signals, so non-deterministic behaviour of the IO ports has to be avoided. One elegant solution to this problem could be BIST at the system level, with synchronous and deterministic communication with the tester.

One important aspect of ATE support is the ability to access the local units of the GALS system. If we adopt the multi-level test approach for GALS systems, we must grant the tester access to each locally synchronous block. One way to do this is to implement JTAG ports able to access each block separately.

If the asynchronous components are included in the scan test structure, we can also test the system for dynamic faults by loading the test vectors with a low-frequency clock and then executing the test at the real throughput of the system.

• The test strategy for highly complex GALS systems and NoCs is in principle the same as described above

4 EVALUATING DESIGN ASPECTS OF MOONRAKE CHIP

4.1 SYSTEM ARCHITECTURE

In the framework of the German-funded WIGWAM project (http://www.wigwam-project.de/) we have developed a communication system that supports transmission rates up to 1 Gbps. One feasible option is to use the spectrum available in the 60 GHz range. To cope with multipath propagation, we decided to use an OFDM modulation scheme. To our knowledge, this is the first implemented OFDM system capable of handling such data rates.

In some respects, our OFDM-based transmission scheme is similar to the 802.11a standard. The same convolutional codes (171,133) are used with fixed transmission modes ranging from BPSK to 64-QAM. As in 802.11a, the interleaving is performed over one OFDM symbol. The basic OFDM PHY parameters are adapted to a bandwidth of 330 MHz and are summarized in [PIZ07]. Starting from an expected channel delay spread of up to about 20 ns, a guard time in the order of 150 ns or higher is required to support 64-QAM modulation. To avoid a substantial loss of data rate, the guard time should constitute only a small fraction of the OFDM symbol time (e.g. 20% in 802.11a). To lower the impact of the guard time on efficiency, one may increase the symbol duration by using a larger FFT. This approach is limited, however, because the resulting smaller subcarrier spacing comes at the price of higher phase noise sensitivity. A cyclic prefix of 160 ns with an FFT period of 640 ns (512 subcarriers in 400 MHz bandwidth), resulting in a symbol time of 800 ns, was chosen as a good compromise between phase noise sensitivity and maximum tolerable channel delay spread.
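The guard-time trade-off described above can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the cyclic prefix and FFT period quoted in the text; the function names are illustrative and not part of the Moonrake design.

```python
# Timing figures quoted in the text, in nanoseconds.
FFT_PERIOD_NS = 640.0      # useful part of the OFDM symbol
CYCLIC_PREFIX_NS = 160.0   # guard time prepended to each symbol


def symbol_time_ns(fft_period_ns, cyclic_prefix_ns):
    """Total OFDM symbol duration: useful part plus guard time."""
    return fft_period_ns + cyclic_prefix_ns


def guard_fraction(fft_period_ns, cyclic_prefix_ns):
    """Share of the symbol time spent on the guard interval."""
    return cyclic_prefix_ns / symbol_time_ns(fft_period_ns, cyclic_prefix_ns)


if __name__ == "__main__":
    total = symbol_time_ns(FFT_PERIOD_NS, CYCLIC_PREFIX_NS)
    share = guard_fraction(FFT_PERIOD_NS, CYCLIC_PREFIX_NS)
    print(f"symbol time: {total:.0f} ns, guard overhead: {share:.0%}")
```

With the quoted values the guard interval takes one fifth of the symbol time, the same proportion the text cites for 802.11a.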

To compare the design effort and performance of traditional synchronous design and GALS design, both a synchronous OFDM transmitter core and a GALS OFDM transmitter core have been implemented in the Moonrake chip. The top-level architecture of the chip is shown in Figure 2.

[Figure 2 block labels: Data from MAC, Input FIFO, BIST, Synchronous Tx, GALS Tx, LC 1 to LC 6, Output Stage, Data to DAC, PLL, JTAG, Clock control]

Figure 2: Architecture of Moonrake chip

4.2 OFDM TRANSMITTER DATAFLOW

Figure 3 presents the basic dataflow of the OFDM transmitter. In general, the signal processing is performed in a pipelined manner, and few iterative operations among different processing stages exist in the system. In some functions, such as forward error correction encoding, interleaving, and the 64-point FFT, parallel processing is required to achieve the target data throughput.

Figure 3: OFDM transmitter dataflow

4.3 GALS OFDM TRANSMITTER SYSTEM PARTITIONING

The following criteria for efficient GALS system partitioning were applied in the Moonrake chip design:

a. Few interconnects between GALS blocks. The more interface signals pass from one GALS block into another, the larger the overhead and the lower the reliability of synchronization. This means that GALS system partitioning should basically follow the functional partitioning.

b. Balanced area and power contributions from each GALS block. Considering the overhead caused by the synchronization circuits, a moderate number of GALS blocks is preferred. The partitioning should give each GALS block a roughly equal share of area or power consumption, depending on the application, to exploit the benefits of the GALS design style.

The primary goal of the Moonrake chip design is to exploit GALS design in low-EMI digital systems. Hence, balanced power consumption among the GALS blocks is preferred. Table 2 shows the power consumption of each functional module in the system, estimated using Synopsys PrimeTime (PT) based on the synthesized netlist of the synchronous design.

According to the power estimation of each functional module, system partitioning of the GALS OFDM transmitter is then performed, as also shown in Table 2. It can be clearly seen that (1) the system partitioning is based on the functional modules, and (2) the power consumption is approximately balanced across the GALS blocks.

Table 1: Power Estimation and System partitioning of GALS OFDM Transmitter

Block          Module                Power     Area
GALS Block 1   Input controller      0.1%      0.1%
               Symbol mapping        0.5%      1.0%
               Universal scrambler   0.0%      0.0%
               Middle controller     7.0%      12.8%
               FEC encoder [12:1]    0.09%     0.06%
               Output interface      0.1%      0.1%
               Pilot insertion       3.1%      5.1%
               Mapping [4:1]         0.08%     0.14%
               Total                 10.97%    19.3%
GALS Block 2   Interleave 1          8.7%      8.9%
               Interleave 2          8.7%      8.9%
               Total                 17.4%     17.8%
GALS Block 3   Interleave 3          8.7%      8.9%
               Interleave 4          8.7%      8.9%
               Total                 17.4%     17.8%
GALS Block 4   Interleave 5          8.7%      8.9%
               Interleave 6          8.7%      8.9%
               Total                 17.4%     17.8%
GALS Block 5   FFT_64P 1             4.9%      2.7%
               FFT_64P 2             4.3%      2.4%
               FFT_64P 3             4.3%      2.4%
               FFT_64P 4             4.3%      2.4%
               Total                 17.8%     9.9%
GALS Block 6   FFT_4P                11.3%     10.3%
               Backend               7.2%      6.7%
               Total                 18.5%     17%
GALS Block Average                   50mW      0.36mm2

Note: Synthesized using the IFX40lpsvt12 library at a working frequency of 200MHz

Total power consumption: 300mW (average) / 250W (peak)

Total area consumption: 2.14mm2 (synthesized)
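The balance of the partitioning can be cross-checked by summing the per-module power shares of each block. The following is a minimal sketch using the power percentages from the table above; the dictionary layout and function names are illustrative assumptions, not part of the original design flow.

```python
# Per-module power shares (percent of total power), taken from the table.
BLOCK_POWER = {
    "GALS Block 1": [0.1, 0.5, 0.0, 7.0, 0.09, 0.1, 3.1, 0.08],
    "GALS Block 2": [8.7, 8.7],
    "GALS Block 3": [8.7, 8.7],
    "GALS Block 4": [8.7, 8.7],
    "GALS Block 5": [4.9, 4.3, 4.3, 4.3],
    "GALS Block 6": [11.3, 7.2],
}


def block_totals(blocks):
    """Sum the module shares of each block, rounded to two decimals."""
    return {name: round(sum(shares), 2) for name, shares in blocks.items()}


def spread(totals):
    """Difference between the largest and smallest block share."""
    return max(totals.values()) - min(totals.values())


if __name__ == "__main__":
    totals = block_totals(BLOCK_POWER)
    for name, share in totals.items():
        print(f"{name}: {share:.2f}%")
    print(f"spread between blocks: {spread(totals):.2f} percentage points")
```

The sums reproduce the block totals listed in the table. Blocks 2 to 6 lie within about one percentage point of each other, while Block 1, which holds the control logic, is the lightest.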

4.4 GALS SYSTEM AT THE BEHAVIOURAL LEVEL

Figure 4 illustrates the topology of the GALS OFDM transmitter. A star-like topology is applied in the design. GALS_Block_1 works as the central management block; it includes most of the global control logic and schedules the working status of the other five GALS blocks.

The Moonrake chip uses GALS design based on pausible clocking. A local clock generator with programmable frequency is deployed in each GALS block and provides the clock pulses for all the locally synchronous modules inside the block.

Communication between GALS blocks is performed over channels, each controlled by a pair of asynchronous input and output port controllers. A total of 16 communication channels (32 input and output ports) are used among the 6 GALS blocks in the system.

Considering the pipelined dataflow inherent in the OFDM transmitter, all the GALS blocks are configured to run at the same frequency, with tiny frequency shifts caused by fabrication and environmental uncertainty. The design therefore effectively works as a metachronous GALS system.

Figure 4: Topology of GALS OFDM transmitter

4.4.1 RTL Design of Synchronous Modules

The RTL design of synchronous functional modules in GALS systems is basically the same as that of synchronous digital circuits. However, for pausible clocking based GALS design, the following design rules need to be taken into consideration in synchronous RTL coding:

a. All the input and output signals of GALS blocks need to be registered in flip-flops. Normally, only the output data has to be registered in synchronous RTL design, to minimize the uncertainty of the output delay with respect to the triggering clock edge. In pausible clocking based GALS design, however, the top-level input signals (signals from other GALS blocks or from the external environment) should also be registered in flip-flops before being used in the GALS block. This design style mainly guarantees sufficient timing margin for the interconnect delays of the top-level input signals, which are asserted by the asynchronous input port macros. A larger acceptable interconnect delay means fewer constraints on placement and routing in the back-end design.

b. An additional cycle needs to be left as synchronization timing margin for control signals crossing GALS blocks. In pausible clocking GALS design, an input signal can become valid either in the current clock cycle or in the immediately following one, depending on its arrival time and on the acknowledge window of the pausible clock generator, so a synchronization delay of at most one cycle can be introduced [FAN09]. As a consequence, any control signal transferred from another GALS block should be asserted one cycle before the status it reports actually occurs. For example, empty and full flags are usually generated to indicate the status of an asynchronous FIFO in synchronous RTL design. If these flags are transferred across GALS blocks, errors can be introduced: once a flag is received one cycle later than it was asserted, underflow or overflow can occur in the FIFO. Instead, almost_empty and almost_full flags should be used in GALS design; these are asserted while at least one data item or one free space still remains in the FIFO to be read or written.

The rules above are the design rules specific to the GALS design in the Moonrake chip. In comparison with the synchronous design, the GALS design introduces little additional design effort or implementation overhead in area and power consumption.
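The effect of rule b can be illustrated with a small cycle-based model. The following is a simplified sketch, not the Moonrake RTL: it assumes a producer that writes whenever the flag it sees is low, a consumer that never reads (the worst case for overflow), and a flag that is registered once and then optionally delayed by one extra cycle on the GALS crossing. The function and parameter names are hypothetical.

```python
def overflows(depth, threshold, cycles, extra_latency):
    """Return True if the producer ever writes into a full FIFO.

    threshold: number of free slots at or below which the flag is raised
               (0 models a plain 'full' flag, 1 models 'almost_full').
    extra_latency: additional cycles the flag needs to reach the producer
                   (0 models a synchronous design, 1 the pausible-clock
                   worst case described in the text).
    """
    occupancy = 0
    # Flag values in flight to the producer; one register stage is
    # always present, plus the optional crossing latency.
    in_flight = [False] * (1 + extra_latency)
    for _ in range(cycles):
        seen_flag = in_flight.pop(0)
        if not seen_flag:
            if occupancy == depth:
                return True        # write into a full FIFO
            occupancy += 1
        in_flight.append(depth - occupancy <= threshold)
    return False


if __name__ == "__main__":
    print("full flag, synchronous:", overflows(4, 0, 20, 0))
    print("full flag, GALS crossing:", overflows(4, 0, 20, 1))
    print("almost_full flag, GALS crossing:", overflows(4, 1, 20, 1))
```

In this model the plain full flag is safe only without the extra crossing latency, while the almost_full flag absorbs the additional cycle.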

4.4.2 Behaviour Design of Asynchronous Wrappers

In the Moonrake chip design, a set of asynchronous input and output (I/O) port controllers is used for signal synchronization across the GALS blocks. Since the I/O port controllers are relatively simple in behaviour, signal transition graphs (STGs) are used to describe them, as shown in Figure 5.

Both two-phase (transition-sensitive) and four-phase (level-sensitive) handshaking protocols are used in the I/O port controllers. To minimize the handshaking latency caused by long interconnects between the I/O port macros and the synchronous functional modules, the two-phase protocol is adopted on the port request/acknowledge pair (Rp/Ap) and the transfer enable/acknowledge pair (Te/Ta). The four-phase protocol, on the other hand, is applied on the internal request/acknowledge pair (Ri/Ai) to indicate the status of the local clock.

Figure 5: STG of asynchronous I/O port controllers
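The latency argument for the two-phase protocol on long interconnects can be made concrete by counting signal transitions per transfer. The following is a minimal sketch with generic `req` and `ack` signal names, which are assumptions for illustration and do not reproduce the STGs of Figure 5.

```python
def two_phase(n_transfers):
    """Transition signalling: every edge on req or ack carries meaning."""
    req = ack = 0
    events = []
    for _ in range(n_transfers):
        req ^= 1                      # sender toggles request
        events.append(("req", req))
        ack ^= 1                      # receiver toggles acknowledge
        events.append(("ack", ack))
    return events


def four_phase(n_transfers):
    """Level signalling: both wires return to zero after each transfer."""
    events = []
    for _ in range(n_transfers):
        events += [("req", 1), ("ack", 1), ("req", 0), ("ack", 0)]
    return events


if __name__ == "__main__":
    n = 3
    print("two-phase transitions:", len(two_phase(n)))
    print("four-phase transitions:", len(four_phase(n)))
```

Each two-phase transfer needs two wire transitions and each four-phase transfer needs four, so the two-phase protocol crosses a long interconnect half as often per transfer. The four-phase protocol leaves the wires at a defined level, which suits signalling a status such as that of the local clock.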

4.5 SYNTHESIS

Since the performance and reliability of asynchronous circuits are sensitive to interconnect delay, hierarchical synthesis is performed in the GALS OFDM transmitter design. The functional modules from all the GALS blocks are synthesized as a multi-clock synchronous design, while the local clock generators and asynchronous ports are synthesized separately.

4.5.1 Synchronous Functional Modules Synthesis

The synchronous functional modules from all the GALS blocks are synthesized as a synchronized core. Because each GALS block has its own local clock signal, this is in fact the synthesis of a multiple-clock design. The main point is therefore the definition and declaration of true paths and false paths between clock domains.

Figure 6 shows the clock dependency in the GALS OFDM transmitter. Besides the clock signals GALS_BLK[1-6]_CLK created by the local clock generators in the GALS blocks, two external clock signals, CLK_JTAG and CLK_EXTW, are used for programmable register configuration and final output data buffering, respectively.

The data latching registers of the input ports are integrated in the synchronized core. The signals ACK and GNT, asserted by the asynchronous wrapper, act as clocks that trigger the input port registers.

Figure 6: Clock dependency in GALS OFDM transmitter

4.5.2 Asynchronous Wrapper Synthesis

Since there is no global clock reference in asynchronous circuits, the switching of both combinational and sequential gates is triggered by local signal pulses. As a result, any glitch on the internal wires can lead to malfunction of the whole design. The synthesis of asynchronous circuits is therefore much more challenging than that of their synchronous counterparts.

The ideal design would be insensitive to both gate delays and wire delays, which is referred to as delay-insensitive (DI) design. Unfortunately, it has been proven that only rather simple logic can satisfy the DI property. Instead, quasi-delay-insensitive (QDI) circuits have been developed, in which the design is robust with respect to any gate and wire delays provided that the forks used are isochronic [SF01]. In the Moonrake chip design, the STGs of the asynchronous port controllers are synthesized using Petrify to guarantee the QDI property of the gate-level circuits.

Figure 7 illustrates the netlist of the I/O port controllers synthesized using Petrify. Both I/O port controllers have rather simple structures, introducing very little overhead in area and power consumption. To satisfy the isochronic property of the forks, which is required by QDI design, the layout of those structures becomes critical.

Figure 7: Synthesized netlist of I/O port controllers using Petrify

4.6 DESIGN FOR TESTABILITY

In the Moonrake chip we have used different techniques for DfT. On the one hand, scan chains are inserted in the synchronous OFDM transmitter design for testing. In total 17682 flip-flops are covered, separated into 9 scan chains for fast testing. The GALS part was not included in scan due to lack of time and resources. However, applying the scan approach to the GALS part is certainly feasible and more or less a routine matter. On the other hand, we have BIST logic that can perform a functional BIST of both the synchronous and the GALS transmitter. The BIST architecture is based on the classical concepts of LFSR and MISR registers. BIST is the main test technique used to enable testing of the system apart from the classical functional test. Additionally, we have made the Moonrake design adaptable and programmable by adding a JTAG interface, which is used for system settings (PLL, pausible clocking, clocking parameters etc.) and can be very useful during system testing and debugging.
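The LFSR and MISR concept mentioned above can be sketched in software as follows. This is an illustrative model only and does not describe the actual Moonrake BIST logic: the 16-bit register width, the feedback taps (bits 16, 14, 13 and 11), the seed, the pattern count and both circuit functions are assumptions chosen for the example.

```python
MASK = 0xFFFF   # assumed 16-bit registers
SEED = 0xACE1   # assumed non-zero LFSR seed


def lfsr_step(state):
    """One shift of a 16-bit Fibonacci LFSR with taps 16, 14, 13, 11."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return ((state >> 1) | (bit << 15)) & MASK


def misr_step(signature, response):
    """Shift the signature register and fold in one response word."""
    return lfsr_step(signature) ^ (response & MASK)


def run_bist(circuit, seed=SEED, n_patterns=256):
    """Apply pseudo-random patterns and compact the responses."""
    pattern, signature = seed, 0
    for _ in range(n_patterns):
        signature = misr_step(signature, circuit(pattern))
        pattern = lfsr_step(pattern)
    return signature


def good_circuit(x):
    """Hypothetical fault-free circuit under test."""
    return (3 * x + 1) & MASK


def faulty_circuit(x):
    """Same circuit with one corrupted response (for the seed pattern)."""
    return good_circuit(x) ^ (1 if x == SEED else 0)


if __name__ == "__main__":
    golden = run_bist(good_circuit)
    observed = run_bist(faulty_circuit)
    print("BIST pass" if observed == golden else "BIST fail")
```

The signature of the fault-free model serves as the golden reference, and any run whose compacted signature differs is reported as failing. Real MISRs can alias, that is, map a faulty response stream onto the golden signature, which is one reason the achieved coverage has to be checked with appropriate tools.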

4.7 BACK-END DESIGN

Hierarchical layout is performed in the design of the GALS OFDM transmitter. The layout of the pausible clock generators and of the asynchronous port controllers is done separately from that of the functional modules. In this way, the cells in the asynchronous wrappers are placed as tightly as possible and the interconnect delays are controlled and minimized, which improves the performance and reliability of the asynchronous wrappers. The laid-out asynchronous wrappers are then integrated as soft macros in the top-level layout together with the standard cells of the synchronous netlist of the functional modules. Commercial tools, such as Cadence SoC Encounter and Synopsys IC Compiler, are used in the design of the GALS OFDM transmitter, and no effort is required to develop CAD tools for GALS design layout.

4.7.1 Asynchronous Wrappers Layout

The layout of the asynchronous wrappers is relatively easy due to the simple structures of both the pausible local clock generators and the asynchronous I/O port controllers, as shown in the previous sections.

To achieve better performance after layout, maximum fanout and maximum transition time constraints are set for the layout of the asynchronous wrappers. In particular, maximum delay constraints are also set in the asynchronous port controllers. Some typical timing constraints used in the layout of the asynchronous wrappers are shown below.

set_max_fanout 4 PINTRANS2PHASE

set_max_transition 0.08 PINTRANS2PHASE

set_max_delay 0.25 -from [get_ports AI] -to [get_ports TA]

4.7.2 Top-Level Layout

In the top-level layout, the most important issue is minimizing the interconnect delays between local clock generators and I/O ports, between I/O ports and synchronous modules, and between the I/O port pairs that form communication channels. The placement of the soft macros is therefore of crucial importance for system performance. First, port macros have to be located as close as possible to the corresponding clock generator. Second, I/O ports that handshake with each other should also be close together. Third, the input ports have to be placed near the GALS blocks where the input data is consumed.

To guide the layout tool towards these placement criteria, a set of complex timing constraints has to be applied in the layout; some examples are listed below. The die microphotograph of the Moonrake chip after layout is shown in Figure 8. Post-layout simulations show that the design achieves all the performance targets.

set_max_delay 0.1 \
    -from [get_pins "tx_gals_sync_inst/GALSTX/u_GalsBlk?/u_gals_blk?_*trans?/RI"] \
    -to [get_pins "tx_gals_sync_inst/GALSTX/u_GalsBlk?/u_gals_blk?_lclkgen/REQ*"]

set_max_delay 0.1 \
    -from [get_pins "tx_gals_sync_inst/GALSTX/u_GalsBlk?/u_gals_blk?_lclkgen/ACK*"] \
    -to [get_pins "tx_gals_sync_inst/GALSTX/u_GalsBlk?/u_gals_blk?_*trans?/AI"]

set_max_delay 0.2 \
    -from [get_pins "tx_gals_sync_inst/GALSTX/u_GalsBlk?/u_gals_blk?_douttrans?/RP"] \
    -to [get_pins "tx_gals_sync_inst/GALSTX/u_GalsBlk?/u_gals_blk?_pintrans?/RP"]

set_max_delay 0.2 \
    -from [get_pins "tx_gals_sync_inst/GALSTX/u_GalsBlk?/u_gals_blk?_pintrans?/AP"] \
    -to [get_pins "tx_gals_sync_inst/GALSTX/u_GalsBlk?/u_gals_blk?_douttrans?/AP"]

set_max_delay 0.1 \
    -from [get_pins "tx_gals_sync_inst/GALSTX/u_GalsBlk?/u_gals_blk?_pintrans?/TA"]

4.7.3 Physical Parameters of Moonrake Chip

The basic physical parameters of the Moonrake chip layout are presented below. The floorplan of the chip is given in Figure 8 and the final layout in Figure 9. The Moonrake chip is one of the most complex GALS chips ever implemented. For comparison, the ALPIN chip from LETI was designed in 65 nm and measured 11.5 mm2 with pads, and the new MAGALI chip from LETI, also in 65 nm, measures 32 mm2 with pads.

• Bondlib 55 um pitch

• Area 4000 um * 2250 um = 9 mm2

• ~ 136 signal pins (GALS and Synchronous transmitter)

• ~ 29 signal pins (NoC test structure)

• ~ 70 power pins (35 VDD/VSS pairs for core and ring supply)

Figure 8: Floorplan of Moonrake chip

Figure 9: Final layout of Moonrake chip

4.8 GALS VS. SYNCHRONOUS DESIGN COMPARISON

Taking the Moonrake chip as an example, this section highlights the effort required to design the GALS OFDM transmitter and the benefits of the GALS design in comparison with its synchronous counterpart.

4.8.1 Efforts for GALS design

The first effort designers have to make for a GALS design is system partitioning. In synchronous design, partitioning is essentially based on function, while in GALS design some other constraints have to be considered. For example, in the Moonrake design, to minimize EMI noise, the GALS system partitioning takes into consideration the power consumption of each functional module. Detailed information on the data processing flow and the communication scheme among the functional modules has to be known at a very early design stage for an accurate estimation of the system power consumption. In the Moonrake chip design, approximately 2 months were needed for the data flow analysis, dynamic power estimation based on the synthesized netlist, and system partitioning of the GALS OFDM transmitter.

Based on the GALS system partitioning specification, the designers then need to handle the design of the interface circuits. Interface circuits are always among the key parts of GALS systems. Low power, area and latency overheads and high reliability are the primary goals of interface circuit design. In the Moonrake chip, asynchronous wrappers with a pausible clocking scheme are used for the GALS design. Both the pausible clock generators and the asynchronous I/O ports are rather simple in structure and cause little overhead in power and area. However, since the pausible clocking scheme can introduce up to one clock cycle of latency in a data transfer, particular design effort is required to guarantee that the functional modules tolerate this communication uncertainty. For example, if FIFO state signals are transferred across GALS blocks, then at least one cycle of timing margin should be left for the transfer latency uncertainty. Normally, this can easily be achieved by slightly modifying the FIFO control logic. Here we spent around one month making the GALS system robust to any data transfer uncertainty.

The synthesis of the GALS system is one of the most challenging tasks in GALS design. Depending on the partitioning granularity, a GALS design quite often has a large number of local clock signals. In particular, in the pausible clocking scheme some handshaking signals are also used as clocks to synchronize data in the asynchronous I/O ports. Clear declarations of the relationship between the local clock signals and the handshaking clock signals are therefore crucial for correct synthesis. To make the handshaking signals work as clock signals, careful definition of the synthesis constraints is necessary, including, for example, maximum fan-out load, maximum transition time, and maximum/minimum interconnect delays. In the Moonrake chip there are in total six local clock signals and more than sixty handshaking clock signals, and we estimate that 2 person-months are needed for the iterations of synthesis and post-synthesis simulation.

Layout of the GALS design is another major challenge. There is no complex global clock signal in the GALS design, and the clock tree distribution is relatively simple. Timing issues on clock trees, such as clock propagation delays and global clock skew, are no longer the main problems. However, GALS design is by no means easy in physical design. Because most of the gate switching activity in asynchronous circuits is triggered by the handshaking signals, additional design effort is needed to implement the asynchronous modules as soft or hard macros in order to control the interconnect delay strictly. The placement of all the soft/hard macros at the system level also has to be addressed carefully. To achieve better performance, the interconnect delays between asynchronous macros, as well as those between asynchronous macros and synchronous functional modules, should be minimized. This means that the soft/hard macros have to be placed as close as possible to their communication counterparts, which adds a constraint to the layout of a GALS design. As to routing, a critical issue is to make sure that all the handshaking clock signals trigger the flip-flops correctly. For the pausible clocking scheme, another issue is to avoid timing violations caused by mismatch between the local clock tree delays and the data synchronization delays. Depending on experience, performing this work requires an investment of 10-12 person-months.

In general, however, many of those tasks have to be performed anyway, regardless of the design style (synchronous or GALS). What matters is to estimate how much additional effort is needed just for introducing GALS, relative to the total effort. The basis for the Moonrake design was VHDL code implemented on FPGA, which was already available before the ASIC design started. However, migrating this code to a style applicable to the 40 nm library, together with logic synthesis and DfT, required intensive implementation effort. The design of the GALS components, on the other hand, required comparatively little effort, since their architecture was already known from previous projects. As stated in D24, we estimate the additional GALSification effort at approximately 15% of the time needed to perform the full front-end design. For the back-end, this overhead is 40-50% in the learning phase (the Moonrake case), but after gaining some experience with GALS design it can be reduced to 20-25%. On the other hand, GALS relaxes many system integration constraints, and for highly complex designs this additional time can actually turn into a benefit, since the standard synchronous approach cannot be applied in a straightforward way.

4.8.2 Benefits from GALS design

The best-known advantage of GALS design is the simplification of clock tree distribution. In today's digital VLSI design, the global clock signals can contribute more than one-third of the power consumption of the whole chip, and they require large design effort to implement within very strict limits on latency and skew. In GALS design, all the blocks have their own local clock signals, and communication between GALS blocks is accomplished by asynchronous interface circuits. The clock tree distribution is then confined to each GALS block and is relatively easier than in synchronous design. A simplified clock tree also contributes to the reduction of system power and area.

GALS design is suitable for the dynamic voltage and frequency scaling (DVFS) technique in SoC and NoC design. In static CMOS design, power is mainly consumed by the switching activity of the internal logic, and the switching power of a system is linearly proportional to the working frequency and proportional to the square of the supply voltage. Frequency and voltage scaling is therefore promising for significantly reducing the power consumption of CMOS designs. The GALS design methodology is naturally suited to DVFS: since communication between GALS blocks goes through asynchronous interface circuits, each GALS block can be configured to work at the optimal frequency and voltage for its own performance. GALS DVFS design has attracted considerable research attention in recent years in low-power SoC and NoC design.
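The proportionality stated above, switching power growing linearly with frequency and quadratically with supply voltage, can be turned into a quick estimate. The following is a minimal sketch; the scaling factors in the usage example are arbitrary illustrative values, not Moonrake measurements.

```python
def relative_switching_power(freq_scale, voltage_scale):
    """Switching power relative to nominal, using P ~ f * V**2.

    freq_scale and voltage_scale are ratios to the nominal operating
    point (1.0 means unchanged).
    """
    return freq_scale * voltage_scale ** 2


if __name__ == "__main__":
    # Illustrative operating point: half the clock frequency at 80%
    # of the nominal supply voltage.
    ratio = relative_switching_power(0.5, 0.8)
    print(f"switching power relative to nominal: {ratio:.2f}")
```

Because the voltage enters quadratically, lowering the supply of a block that can run slower saves more power than frequency scaling alone, which is what per-block DVFS in a GALS system exploits.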

GALS design also has good properties for low-noise digital design. In synchronous circuits, all switching activity is triggered by the rising and falling edges of the clock signals, which causes large, steep and periodic supply current surges in the power supply network. Due to the parasitic resistance of the on-chip power supply lines and the inductance of the bonding wires and lead frames, these current surges introduce large fluctuations of the on-chip supply voltage, known as simultaneous switching noise (SSN) in digital circuits. The voltage drop introduces delay uncertainty in the internal logic, which can lead to timing failures. Normally, a certain supply voltage margin (no more than 10%, for instance) is left in the power supply distribution and is taken into account in static timing analysis (STA). However, as the technology feature size shrinks, the integration density and transition speed of CMOS gates increase steadily, while the voltage margin is substantially reduced. All these technology trends exacerbate the SSN problem. In GALS design, each block is triggered by its local clock, so the switching activity of the whole system is spread more evenly over time. The switching current is therefore smoothed, which leads to smaller voltage fluctuations.
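The smoothing effect can be illustrated with a simple numerical sketch. The model below is an assumption made for illustration only: each block draws one triangular current pulse after each of its clock edges, all blocks have the same period, and the GALS case is represented by evenly staggered clock phases. Real local clocks are independent and the actual current profile depends on the design.

```python
def edge_current(since_edge_ns, width_ns, peak_ma):
    """Triangular current pulse drawn after a clock edge (assumed shape)."""
    x = since_edge_ns / width_ns
    if 0.0 <= x <= 1.0:
        return peak_ma * (1.0 - abs(2.0 * x - 1.0))
    return 0.0


def peak_supply_current(periods_ns, phases_ns, width_ns=1.0, peak_ma=10.0,
                        t_end_ns=100.0, step_ns=0.05):
    """Worst-case sum of all block currents over the simulated window."""
    worst = 0.0
    for i in range(int(t_end_ns / step_ns)):
        t = i * step_ns
        total = 0.0
        for period, phase in zip(periods_ns, phases_ns):
            since_edge = (t - phase) % period  # time since this block's last edge
            total += edge_current(since_edge, width_ns, peak_ma)
        worst = max(worst, total)
    return worst


# Four hypothetical blocks with a 10 ns clock period each.
sync_peak = peak_supply_current([10.0] * 4, [0.0] * 4)            # shared clock
gals_peak = peak_supply_current([10.0] * 4, [0.0, 2.5, 5.0, 7.5])  # staggered
print(f"synchronous peak: {sync_peak:.1f} mA, staggered peak: {gals_peak:.1f} mA")
```

In this idealized case the pulses of the staggered blocks never overlap, so the peak supply current equals that of a single block instead of the sum over all blocks.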

The switching current of the digital circuits is the largest contributor to the electromagnetic interference (EMI) emitted by integrated circuits. In the last decade, strict international standards have been set up for EMI radiation from ICs, and this area is receiving more and more attention. In the GALAXY project, considerable effort has been made to prove the efficiency of GALS design for EMI reduction. Clock frequency modulation is a promising method for attenuating the power density at the higher harmonic frequencies of the clock signals, and it has been applied in the GALS OFDM transmitter design.

According to the synthesis results, the GALS OFDM transmitter occupies 41.4% of the area of the Moonrake chip, and the synchronous OFDM transmitter occupies 41.1%. They are also similar in terms of power consumption. The EMI performance of the Moonrake chip is currently being measured at IHP.

5 COMPARING SYSTEM INTEGRATION WITH GALS AND SYNCHRONOUS APPROACH

One of the main motivations for including the GALS concept in the design of high-complexity nanoscale digital systems is the need for effective system integration and an accelerated design process. The advantages of GALS become visible only for high-complexity designs. For simpler designs, introducing GALS will only make the design process more complicated. The main indications for introducing GALS are the following:

• The design is of very high complexity, and the standard design flow has problems with timing closure and CTS (clock tree synthesis).

• The design contains sub-blocks running at different frequencies/phases, so interconnections are needed to integrate mutually asynchronous blocks.

• The implemented design must have low-EMI properties.

If any of these three indications applies, one should consider adopting GALS design.

The most important advantages of GALS design compared to the synchronous approach (better EMI characteristics, power reduction and process variability mitigation) have already been discussed in D24. Here, we compare the two approaches from the point of view of system integration. Design effort is considered specifically from the point of view of design flow implementation, and the important differences between the GALS and the synchronous approach are emphasized. In addition, we show some further benefits of the GALS approach related to clock skew reduction and potential processing throughput. Finally, a short analysis of the potential gains and drawbacks related to the number of partitioned clock domains is provided.

5.1 GALS DESIGN FLOW

Defining a design flow for Globally Asynchronous Locally Synchronous systems is a difficult task. In general, designs that have asynchronous parts are relatively difficult to implement. A major problem is the lack of support for asynchronous logic design in commercial EDA tools. The design flow should therefore be based on existing tools, combined with additional scripts. The design flow proposed for developing our GALS system combines the standard synchronous design flow with specific asynchronous synthesis tools. Most of the tools, however, are taken from the purely synchronous world. The reason is simple: in general, the asynchronous part of the GALS circuitry is very small in comparison with the synchronous part. Accordingly, simple asynchronous circuits can be generated with asynchronous synthesis tools and then embedded in complex synchronous blocks.

We compare the GALS and synchronous integration aspects by following the phases of the GALS design flow already presented in Section 3 and shown in Figure 1. The comparison can be made directly on the GALS design flow, since it contains all the phases of the synchronous design flow.

5.1.1 Design Specification

The three main aspects of design specification are area, power and delay (throughput). Compared to synchronous design, a GALS system has additional aspects related to these specifications that have to be analyzed. System partitioning should also be addressed in the design specification, taking into consideration both the implementation benefits and the overheads introduced by GALS design.

A power and area overhead will certainly be introduced by the communication interface circuitry, regardless of the interface model used (asynchronous FIFOs, pausible local clocking schemes, or simple two-stage flip-flop synchronizers). However, this circuitry is rather small, and the overhead it introduces is usually a very small proportion of the total. In addition, with a pausible local clocking interface it is possible to use DVFS schemes effectively to reduce power, even below the level of the synchronous system. For details of a possible implementation, see D24.

Regarding delay overhead, the communication interface circuits have to spend additional clock cycles resolving metastability to guarantee safe data transfer, which introduces communication latency in GALS systems. However, as demonstrated with the Moonrake chip, GALS modules can achieve the same or an even higher operating frequency than the synchronous implementation, and thus the same or even better throughput.

5.1.2 Behaviour Level Coding

In GALS design, most of the circuits, including all the functional modules, are normally designed in a synchronous manner. The behaviour level design of GALS systems is therefore generally similar to that of traditional synchronous systems. Asynchronous design is used only for the interface circuits. Given the simple structure of the interface circuits, asynchronous design accounts for only a very limited part of the system design effort. For such simple asynchronous circuits, the behaviour level design methods mentioned above are efficient and easy to apply.

5.1.3 Asynchronous-Synchronous Co-simulation

The behaviour level descriptions of both the synchronous functional modules and the asynchronous wrappers have to be integrated in one complete environment for system simulation. GALAXYIDE, an asynchronous-synchronous co-simulation tool using the ASIP packaging format, has been developed in the GALAXY project. It supports behaviour level co-simulation using different languages such as VHDL/Verilog and STG/Petri nets, and provides an effective way to cover the behavioural design of synchronous-asynchronous interface circuits at a very early design stage.

5.1.4 Synthesis

The synthesis tools available on the market (apart from Haste, offered by Handshake Solutions) do not offer any support for asynchronous design. Analysis of critical paths is not possible. Furthermore, synchronous layout tools have special optimization procedures (for example, in-place optimization or timing-driven placement) that are not well suited to asynchronous components.

However, several asynchronous EDA tools have been developed and are available to users. They fall into two main categories. First, there are tools for the synthesis of hazard-free asynchronous controllers, such as Petrify, MINIMALIST and 3D. In general, these tools are a good basis for the synthesis of relatively simple asynchronous controllers. They do not support the design of large systems and do not integrate any simulation tool, but they are very useful when a relatively simple asynchronous component has to be generated. The second category of EDA tools offers a complete design framework for system description, simulation and synthesis. Examples are TANGRAM (recently offered commercially under the name HASTE), BALSA and TAST. The main obstacle to their application in the GALS area is that they target purely asynchronous designs rather than mixed synchronous-asynchronous designs. In the context of the GALAXY project, however, we have considered extending BALSA to the mixed synchronous-asynchronous systems that GALS design actually requires. GALAXY BALSA is therefore one important tool in our design flow. In addition, Petrify, 3D and MINIMALIST are used for simple asynchronous logic to guarantee the quasi-delay-insensitivity (QDI) of the gate-level circuits.

The Muller C-element and the mutual exclusion element (MUTEX) are widely used in asynchronous design; however, these cells are not included in most current synchronous standard cell libraries. Effort is therefore required to create an asynchronous standard cell library in the same technology as the synchronous standard cell library. Because only a few types of asynchronous cells are needed in GALS design, and these cells are relatively simple, generating the asynchronous library is relatively simple.
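For readers unfamiliar with these two cells, the following Python sketch gives a purely behavioural model of them. It is an illustration, not the library implementation used in the project: the class and method names are hypothetical, and the MUTEX model breaks simultaneous requests in favour of the first input, whereas a real MUTEX resolves such ties through a metastability filter.

```python
class CElement:
    """Muller C-element: output follows the inputs when they agree, else holds."""

    def __init__(self, initial=0):
        self.out = initial

    def update(self, a, b):
        if a == b:
            self.out = a
        return self.out


class Mutex:
    """Behavioural mutual exclusion element: at most one grant is high."""

    def __init__(self):
        self.owner = None  # index of the request that currently holds the grant

    def update(self, r1, r2):
        requests = (r1, r2)
        if self.owner is not None and not requests[self.owner]:
            self.owner = None  # grant is released when its request falls
        if self.owner is None:
            if r1:
                self.owner = 0  # modelling assumption: ties go to r1
            elif r2:
                self.owner = 1
        return (int(self.owner == 0), int(self.owner == 1))


c = CElement()
print([c.update(a, b) for a, b in [(1, 0), (1, 1), (0, 1), (0, 0)]])

m = Mutex()
print([m.update(r1, r2) for r1, r2 in [(1, 1), (0, 1), (1, 1), (1, 0)]])
```

The C-element output changes only when both inputs agree, and the MUTEX never raises both grants at the same time, which is the property the interface circuits rely on.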

The synthesis of the synchronous modules in a GALS design is similar to synthesis in a synchronous design. For the synthesis of the asynchronous circuits, however, the environment has to be estimated carefully in order to set the I/O constraints correctly.

5.1.5 Back-End Layout

Hierarchical layout is performed in GALS design. Depending on the design requirements, the asynchronous circuits can be laid out as soft macros or hard macros. For a soft macro, the timing information can be extracted and used for post-layout simulation. A hard macro is laid out at transistor level, so it is difficult to extract delay information for simulation in the same way as for normal digital cells. However, a hard macro can be optimized with maximum effort for high performance and reliability. Moreover, hard macros can be characterized in a similar way to standard cells, by specifying delay tables for input/output pin pairs. This process can also be automated if all macro blocks use a standard and relatively small number of pins.
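As an illustration of such a characterization, the following Python sketch stores one delay table per input/output pin pair and interpolates linearly over the output load. The pin names, loads and delay values are hypothetical placeholders, and the one-dimensional table is a simplification; standard cell characterization typically also indexes by input transition time.

```python
from bisect import bisect_right

# Hypothetical characterization data for one hard macro:
# (input pin, output pin) -> list of (output load in fF, delay in ps).
DELAY_TABLES = {
    ("REQ_IN", "ACK_OUT"): [(1.0, 120.0), (5.0, 160.0), (20.0, 310.0)],
    ("REQ_IN", "CLK_EN"): [(1.0, 90.0), (5.0, 125.0), (20.0, 260.0)],
}


def pin_to_pin_delay(in_pin, out_pin, load_ff):
    """Look up the delay of one pin pair, interpolating between table points."""
    table = DELAY_TABLES[(in_pin, out_pin)]
    loads = [load for load, _ in table]
    # Clamp to the characterized range instead of extrapolating.
    if load_ff <= loads[0]:
        return table[0][1]
    if load_ff >= loads[-1]:
        return table[-1][1]
    i = bisect_right(loads, load_ff)
    (l0, d0), (l1, d1) = table[i - 1], table[i]
    return d0 + (d1 - d0) * (load_ff - l0) / (l1 - l0)


print(pin_to_pin_delay("REQ_IN", "ACK_OUT", 3.0))
```

Because every macro exposes only a few pins, the number of pin pairs to characterize stays small, which is what makes automating the process practical.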

5.1.6 Final Considerations

Behavioural simulation, post-synthesis simulation and back-annotated simulation are performed using a standard VHDL/Verilog simulator. Verification (LVS, ERC, DRC) is performed using conventional tools.

Finally, we consider the part of the design flow that deals with the asynchronous components, shown on the left side of Figure 1. In addition to the standard approach, the asynchronous components have to be designed, verified, mapped, laid out and characterized. Since they are relatively simple (a couple of gates), this is not very complex and has to be done only once. Once this is done, the component design can be reused for every GALS design in the same technology.

5.1.7 Evaluating the Effort for Example Designs

So far, the integration effort has been evaluated theoretically. Since the members of the project consortium have broad experience in designing GALS systems, we can use this experience to evaluate the complexity of introducing GALS technology into system design. The following analysis is based on experience from the WLAN baseband chip designed at IHP in 2005 using the IHP 0.25 um CMOS process, the GALS FFT chip (GALAXY project) designed in 2009 using the IHP 0.13 um CMOS process, and the Moonrake 60 GHz transmitter GALS chip (GALAXY project) designed in 2010 using the 40 nm Infineon library.

WLAN GALS Processor (2005)

The GALS WLAN baseband chip was designed at IHP in 2005 using the IHP 0.25 um CMOS process. It includes the full baseband functionality. It is a relatively complex design for this technology, with an area of 45 mm2 and running at 80 MHz. The chip was a feasibility study of a request-driven GALS technique. Since a synchronous version of the same system had also been designed, the GALS version could be used to compare the design methodologies.

In principle, the GALS design process was faster than the usual synchronous one. Dealing with smaller design blocks was less difficult than dealing with the synchronous chip. The clocking of the synchronous chip was relatively complex, with three different clock domains (80, 40, 20 MHz), a clock divider and clock gates on-chip. With the GALS approach, challenges such as generating a global clock tree with an enormous number of leaves, the clock divider and the handling of clock gating simply disappeared. Clock skew within the smaller clock domains was significantly reduced (660 ps -> 486 ps), and even better results can be achieved with more stringent constraints. Since there is no global clock tree, timing closure of the complete design was achieved much more easily.

However, during the design of the GALS chip, several new design issues appeared. The main difficulties were the lack of tool support for asynchronous components and the immaturity of asynchronous tools. For example, due to limitations of the CAD tool, direct gate mapping of the generated logic equations was not possible, so many operations had to be performed manually. This degraded the performance of the final design and introduced additional delay into the design process. Furthermore, the wrapper evaluation and improvement was performed in parallel with the GALS chip design. These issues caused some iterations of the GALS design process.

Testing of the GALS chip was a problem when using a standard synchronous hardware tester. Special BIST logic was embedded in order to allow the use of a classical hardware tester. We had to perform a special calibration of the ring oscillators during testing in order to match results of the testing and simulation.

The implementation of the GALS system resulted in a hardware overhead of around 3% for the asynchronous wrappers. Power measurements of both chips showed only a marginal improvement of 1% for the GALS case. On the other hand, supply variation noise measurements showed a clear advantage for the GALS solution: the absolute maximum of the power spectrum of the GALS circuit is about 5 dB lower than the absolute maximum for the synchronous circuit. This was achieved without any special setup of the asynchronous modules, such as adjusting the clock phases of different blocks. The general conclusion is that GALS led to a certain simplification of the system integration process.

GALS FFT Processor (2009)

The GALS FFT chip was designed in 2009, in the framework of the GALAXY project, using the IHP 0.13 um CMOS process. The chip area was around 3 mm2, but less than 30% of it was actually used by the core. The main intent of this chip was to evaluate the possibilities for EMI reduction with GALS, and this was achieved, with a 13 dB reduction for the best GALS mode in comparison to the synchronous mode. However, due to the low complexity of this design, we had to spend a comparatively high effort on developing the asynchronous wrappers. From the system integration point of view, approximately 50% of the time was therefore spent on GALS introduction.
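To put the decibel figures quoted for the WLAN and FFT chips (about 5 dB and 13 dB) into perspective, the following Python sketch converts a reduction in dB into linear ratios. Whether the power or the amplitude interpretation applies depends on how the spectra were measured, which is not specified here, so both conversions are shown.

```python
def db_to_power_ratio(db):
    """Linear power ratio for a difference given in decibels."""
    return 10 ** (db / 10)


def db_to_amplitude_ratio(db):
    """Linear amplitude (voltage or current) ratio for a difference in decibels."""
    return 10 ** (db / 20)


for reduction_db in (5, 13):
    print(f"{reduction_db} dB: power ratio {db_to_power_ratio(reduction_db):.2f}, "
          f"amplitude ratio {db_to_amplitude_ratio(reduction_db):.2f}")
```

Read as a power ratio, 13 dB corresponds to a reduction of the spectral peak by a factor of roughly 20, and 5 dB to a factor of roughly 3.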

Moonrake OFDM Processor (2010)

The Moonrake 60 GHz transmitter GALS chip was designed in 2010, in the framework of the GALAXY project, using the 40 nm Infineon library. The design is highly complex, with a total area of 9 mm2 in this advanced scaled process, and is clocked effectively at 160 MHz.

Since the design process is still ongoing, at this point we can only evaluate the design effort for the front-end part. The basis for this baseband design was VHDL code implemented on an FPGA, which was already available before the ASIC design started. However, migrating this code to a style applicable to the 40 nm library, together with logic synthesis and DfT, required intensive implementation effort. In contrast, the effort for designing the GALS components was comparatively small, since their architecture was already known from previous projects. We estimate the additional GALSifying effort at approximately 15% of the time needed for the full front-end design.

In conclusion, for high-complexity designs the additional effort pays off through the relaxation of global timing and easier timing closure. GALS will also bring a certain level of EMI reduction, and potentially power reduction and robustness against process variations.

5.2 CLOCK SKEW IN SYNCHRONOUS SYSTEMS AND GALS SYSTEMS ADVANTAGES

A critical issue in current digital VLSI design is robust clock network distribution. Clock skew minimization is an important design task for ensuring the correct behaviour of sequential circuits, and much work has been done on designing zero-skew clock trees [Zhu03]. The factors that cause clock skew fall into three categories: design uncertainty, manufacturing variability and environmental change. The main source of design uncertainty is the error in capacitance estimation. Even if design uncertainty is reduced by detailed analysis and optimization, clock skew still occurs due to manufacturing and environmental variability. As the clock frequency increases, the clock skew specification becomes more severe, and hence the clock tree must be designed to be robust against this variability.

Recent high-performance microprocessor designs have adopted techniques that adjust clock skew after fabrication [Zhu03]. These techniques are effective for microprocessor design, but ASICs cannot use them for cost reasons. In clock design, limiting the signal transition time is an important factor in controlling clock skew and power dissipation.

We demonstrate here how severely clock skew increases with supply voltage and gate length variation, as well as the effect of technology scaling on it. The results in Figures 10 and 11, taken from [HA05], were obtained with the following parameters. The analyzed clock tree is a standard H clock tree with dimensions 10x10um. The clock skew is measured between the central and the most distant point. The technology used is 0.18um. The nominal transistor length is 200nm. The transistor length variation consists of random and spatial variations. For random variation, σ=11nm. For spatial variation, the maximum and minimum values are 220 and 180nm, respectively. The nominal supply voltage is 1.8V. The power supply noise is assumed to have random and spatial variations; the standard deviation is set to 0.1V, and the maximum and minimum voltages in spatial variation are 1.98 and 1.62V. λ is the ratio between output and input capacitance at every point of the clock tree.

Figure 10: Clock skew histogram under supply voltage fluctuation

Figure 11: Clock skew histogram under gate length fluctuation

The impact of various factors on clock skew is also analyzed with respect to technology scaling [MEH01]. The parameters used in the analysis are given in Table 2. The clock skew values are given in Table 3.

Table 2: Interconnect Parameters

Table 3: Clock Skew Values

As can be seen from the presented results, clock skew presents a serious threat to future synchronous design. With increasing chip sizes, supply noise and process variations, the increase in clock skew would be so large that conventional methods would no longer allow synchronous systems to be designed successfully. This is a huge advantage of GALS systems, since the size of a synchronous island can be reduced arbitrarily. Clock skew decreases linearly with the size of the synchronous island.

In addition, synchronous design requires a globally distributed clock tree with high fan-out and low skew, which is considered a major challenge for back-end design and often consumes significant power and area. By removing the global clock signals, GALS design allows the designer to make clock tree distribution more efficient and less costly.

5.3 MULTISYNCHRONOUS VS. GALS DESIGN

Large synchronous systems with large clock skew, synchronous systems with multiple clock domains, and systems in which additional effort is made to reduce EMI all have certain drawbacks: higher design complexity, area, delay and power penalties, and more design hazards. As a rule, these systems contain multiple paths and synchronizers. As the number of clock domains/blocks increases, the overhead increases drastically, and so does the design effort. At the same time, clock skew minimization and EMI reduction are much less effective than in GALS design (for some comparisons see D24).

On the other hand, GALS design provides an efficient way to minimize the effect of clock skew, simplifies clock tree distribution, significantly reduces EMI and provides a natural environment for multisynchronous systems, while having only a very small drawback in terms of area, power and latency when optimized for the proper partition size. Partitioning a GALS design is an important step, and the optimization has to be performed in the design specification phase of system integration.

Here we provide some guidelines for partitioning of GALS designs.

• Reducing the size of the partitions helps to minimize variability (as shown in D24). However, there are drawbacks: once the partition size becomes too small, the impact of variability starts to increase. The optimal point should be determined following the procedure from D24.

• Increasing the number of partitions also reduces EMI (as shown in D24). However, there is also an optimum point, and an analysis using GalsEmilator should be performed.

• Increasing the number of partitions increases the number of asynchronous communication blocks, which, despite being small, increase the total area and power. An analysis should be performed in the design specification stage to confirm the highest number of partitions that satisfies the area/power constraints.

• Increasing the number of partitions increases communication latency and therefore reduces system throughput. Even though this is the most complex constraint to analyse, an effort should be made to fulfil the design specification.

The designer must make a trade-off between the benefits and the overheads when choosing the partitioning granularity of a GALS system.
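The area/power check from the guidelines above can be sketched in a few lines of Python. The model is a deliberately simple assumption made for illustration: one asynchronous wrapper per partition with a fixed area and power cost, and all numbers hypothetical. It covers only the area and power constraints, not the variability, EMI or latency analyses described in D24.

```python
def max_partitions(core_area_mm2, wrapper_area_mm2, area_budget_mm2,
                   core_power_mw, wrapper_power_mw, power_budget_mw,
                   limit=64):
    """Highest partition count whose wrapper overhead fits both budgets.

    Assumes one wrapper per partition with a fixed cost (hypothetical model).
    Returns 0 if even a single partition exceeds a budget.
    """
    best = 0
    for n in range(1, limit + 1):
        area = core_area_mm2 + n * wrapper_area_mm2
        power = core_power_mw + n * wrapper_power_mw
        if area <= area_budget_mm2 and power <= power_budget_mw:
            best = n
    return best


# Hypothetical design: 9 mm2 core, 0.03 mm2 and 0.5 mW per wrapper.
n_max = max_partitions(9.0, 0.03, 9.32, 100.0, 0.5, 104.0)
print(f"at most {n_max} partitions fit the area and power budgets")
```

In this example the area budget alone would allow ten partitions, but the power budget limits the design to eight, which shows why both constraints have to be checked together.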

6 CONCLUSIONS

In this report we have described the possible effects on system integration when the GALS methodology is applied. We have briefly introduced the GALS design and test flow that was used in the GALAXY project. In this context we have tried to emphasize the differences between the standard synchronous flow and the GALS flow. In general, those differences are small and concern only the implementation of the asynchronous controllers, which normally occupy a very small portion of the overall chip area. In principle, the main investment in developing GALS systems is the initial one, which includes the extension of the standard cell libraries and the development of the asynchronous controllers.

We have demonstrated from the point of view of integration that, once a GALS design flow has been established, the additional effort to perform all the steps is relatively small compared to synchronous design. Such a flow has been successfully established during the course of this project and demonstrated on the Moonrake chip. One section is dedicated to the Moonrake example and the experience that we gained during the development of this complex system in 40 nm CMOS technology. There we have also analyzed the efforts and benefits for the development of this system. It turned out that the time spent on getting the GALS part to run correctly was small compared to the overall effort needed to make the system operational and to adapt the synchronous IP to the applied ASIC technology. For more complex systems, which are the actual target of GALS technology, the additional effort needed for GALS is expected to turn into a net saving of time and effort due to the improved system integration.

Additionally, we have demonstrated that clock skew presents a significant problem for large designs in modern technologies and that the GALS approach offers a viable solution. Partitioning was found to influence performance directly, and the right balance that minimizes latency, power and area has to be determined already in the design specification phase. An optimally partitioned GALS design with an already established GALS design flow not only provides better performance in terms of area, power, EMI and delay compared to multisynchronous designs, but also requires less design effort in system integration.
