Enhanced Synchronous Design Using
Asynchronous Techniques
by
Navid Toosizadeh
A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy
Graduate Department of Electrical and Computer Engineering
University of Toronto
Copyright © 2010 by Navid Toosizadeh
Abstract
Enhanced Synchronous Design Using Asynchronous Techniques
Navid Toosizadeh
Doctor of Philosophy
Graduate Department of Electrical and Computer Engineering
University of Toronto
2010
As semiconductor technology scales down, process variations become increasingly difficult to
control. To cope with this, more and more conservative delay and clock frequency estimations
are used during design, which result in overly large and leaky circuits. Also, the system runs
slower than it otherwise could, because its fixed clock is determined by worst-case analysis of
the circuit. On top of process variations, voltage and temperature variations also push
the designer towards even more conservative delay estimations.
On the other hand, asynchronous design style has potential advantages over synchronous
design including resilience to process variations, lower power consumption and higher perfor-
mance. Unfortunately, these advantages are usually hindered by the significant design effort
required to implement useful asynchronous circuits and also by the overhead of asynchronous
control logic.
Borrowing from asynchronous techniques, a new methodology is proposed to design syn-
chronous circuits that have some of the advantages of asynchronous circuits. Asynchronous logic
is used to generate the clock of a synchronous system. The resulting system automatically tunes
itself to deliver the best-possible performance under the prevailing process-voltage-temperature
(PVT) conditions. This methodology may be used to reduce the leakage power significantly
in deep nanometer technologies. It also helps in handling process variations. The results from
a 32-bit processor implemented in 90nm technology show a 10X leakage reduction compared to
the traditional synchronous design.
The proposed technique is expanded to adjust the speed of a pipeline according to the
current operations flowing in the pipeline as well as the current PVT conditions. The results
from a 32-bit processor in 90nm technology demonstrate a 2X speed improvement compared
to the conventional synchronous design. The proposed techniques only use synchronous design
tools and are compatible with design flows that are currently in use.
Acknowledgements
I would like to thank my thesis supervisor, Professor Safwat Zaky, for his guidance
and patience over the years. His extensive and broad knowledge has given me a solid foundation
to depend on, while his dedication to the scientific process has improved the quality of my work.
He has gone beyond the call of duty to help me become a better researcher. On a personal
level, I have learned a great deal from him. I consider him to be my mentor.
To Professor Jianwen Zhu, I extend my gratitude for the guidance he offered during the
course of my studies. As a member of my committee he has always provided practical advice
and shared experience with me. I am grateful for his contribution to my research.
I would like to thank Professor Zvonko Vranesic for his contribution as a member of my
committee. His feedback was invaluable in improving the quality of my research.
I also take this opportunity to thank Professor Roman Genov. His valuable feedback cer-
tainly improved the quality of my thesis.
I would like to acknowledge the financial support provided by the Natural Sciences and
Engineering Research Council of Canada, University of Toronto and the Government of Ontario.
To my many friends from SF 2206 and other graduate offices, I extend my gratitude for
their friendship and support during the course of my Ph.D. Many thanks to Kamran and Sogol.
To my brothers and sisters, I extend my heartfelt gratitude for their emotional support
while being away from my home country. Mahnaz, Saeed, Nima and Sharareh, I have always
felt your support. Many thanks to my family members Mohsen, Mojgan, Alireza, Keihan,
Hooman, Sadaf, Kiarash, Mandana, Hamid and my parents-in-law Mehdi and Sareh.
My parents, Hossein and Masoumeh have always encouraged me and provided the best
support for me. I am always grateful to them for their unconditional love.
Above all, though, I am eternally grateful to my best friend and wife, Reihaneh, who has
so greatly changed the course of my life. She has patiently supported my many busy evenings
and weekends while I completed my academic work. I hope I can provide her with the love and
happiness she has unselfishly given me.
Contents
List of Tables ix
List of Figures xi
1 Introduction 1
1.1 Thesis Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Research Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Thesis Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Background 7
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Challenges of Today’s Technologies . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3 Asynchronous Circuits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.1 History of asynchronous systems . . . . . . . . . . . . . . . . . . . . . . . 9
2.4 Asynchronous Styles and Building Blocks . . . . . . . . . . . . . . . . . . . . . . 10
2.4.1 Asynchronous handshake protocols . . . . . . . . . . . . . . . . . . . . . . 11
2.4.2 C-element . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4.3 Classes of asynchronous circuits . . . . . . . . . . . . . . . . . . . . . . . . 15
2.5 Asynchronous Pipelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.5.1 Micropipelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.5.2 MOUSETRAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.6 Globally Asynchronous Locally Synchronous Systems . . . . . . . . . . . . . . . . 18
2.7 Potential Asynchronous Design Advantages . . . . . . . . . . . . . . . . . . . . . 19
2.7.1 Avoiding clock skew . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.7.2 Lower electromagnetic noise . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.7.3 Lower power consumption . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.7.4 Resilience to process and environmental variations . . . . . . . . . . . . . 21
2.7.5 Modularity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.7.6 Higher performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.8 Asynchronous Design Applications . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.9 Difficulties in Asynchronous Design . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.10 Attacking Asynchronous Challenges . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.10.1 Desynchronization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.10.2 Methods using conventional HDLs . . . . . . . . . . . . . . . . . . . . . . 28
2.10.3 Methods using communicating sequential processes languages . . . . . . . 29
2.11 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3 Application of Concurrency in Asynchronous Design 33
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2.1 Handshake circuits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2.2 Sequencers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3 WAR Hazards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4 Using Edge-Triggering in Accumulator Circuits . . . . . . . . . . . . . . . . . . . 41
3.5 Introducing Concurrency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.6 System Timing Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.6.1 System examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.7 Experimental Methodology and Results . . . . . . . . . . . . . . . . . . . . . . . 53
3.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4 Enhanced Synchronous Design 58
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.2 PVT-aware Design Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.3 Solving Real-world Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5 Leakage Reduction 64
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.2 Review of Power Management Techniques . . . . . . . . . . . . . . . . . . . . . . 65
5.3 Proposed Idea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.4 The PVT-aware Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.5 Design Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.5.1 Clock generation issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.6 Case Study: PVT-aware DLX Microprocessor . . . . . . . . . . . . . . . . . . . . 73
5.6.1 Tuning delays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.6.2 Implementing the fixed-clock counterpart . . . . . . . . . . . . . . . . . . 74
5.7 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.7.1 Power and performance analysis . . . . . . . . . . . . . . . . . . . . . . . 76
5.7.2 Resilience to inter-chip PVT variations . . . . . . . . . . . . . . . . . . . . 78
5.7.3 Resilience to intra-chip PVT variations . . . . . . . . . . . . . . . . . . . 78
5.7.4 Suitability for voltage scaling . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.8.1 Design space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.8.2 Clock error detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.8.3 Expanding the PVT-aware approach . . . . . . . . . . . . . . . . . . . . . 82
5.9 Comparison to Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.10 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6 VariPipe: Variable-clock Synchronous Pipelines 88
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.2 VariPipe: The Idea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.3 Design Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.3.1 Creating delay profiles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.3.2 Simplifying delay profiles . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.3.3 Implementing the clock generation circuit . . . . . . . . . . . . . . . . . . 92
6.3.4 Variable delay implementation . . . . . . . . . . . . . . . . . . . . . . . . 93
6.4 Timing Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.5 Design Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.6 Case Study: VariPipe DLX Microprocessor . . . . . . . . . . . . . . . . . . . . . 96
6.6.1 Implementing the VariPipe DLX processor . . . . . . . . . . . . . . . . . 96
6.7 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.7.1 Performance analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.7.2 Energy consumption analysis . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.7.3 Area and energy overhead . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.7.4 Resilience to PVT variations . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.7.5 Reduction in electromagnetic noise . . . . . . . . . . . . . . . . . . . . . . 106
6.7.6 Suitability for voltage scaling . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.8 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
7 Conclusion and Future Work 111
7.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
7.1.1 Summary and Contributions . . . . . . . . . . . . . . . . . . . . . . . . . 111
7.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
A Previous Publications 115
B Balsa Code for Radix-4 Booth Multiplier 116
C Chip Layout of the PVT-aware Processor 119
Bibliography 121
List of Tables
3.1 Overlapped delays for different insertion points . . . . . . . . . . . . . . . . . . . 50
3.2 Delay values in ps for the accumulators inside the multiplier . . . . . . . . . . . . 51
3.3 Simulation results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.4 Area of storage and control elements . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.1 PVT corners . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.2 Post-synthesis power and area breakdown . . . . . . . . . . . . . . . . . . . . . . 67
5.3 Toolset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.4 Comparison of the clock period and the critical path . . . . . . . . . . . . . . . . 74
5.5 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.6 Power and performance results under typical PVT, temp = 25°C . . . . . . . . . 76
5.7 Post-layout area and leakage breakdown under typical PVT corner, temp = 25°C . 77
5.8 Power and performance results under worst-case PVT, temp = 125°C . . . . . . . 78
5.9 Clock period changes with intra-chip variations . . . . . . . . . . . . . . . . . . . 79
6.1 Toolset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.2 Operation Selection Table of Execution Unit . . . . . . . . . . . . . . . . . . . . 98
6.3 Operation Selection Table of Decoder . . . . . . . . . . . . . . . . . . . . . . . . 99
6.4 Post-layout delay profiles of Decoder and Execution unit . . . . . . . . . . . . . . 100
6.5 Simplified delay profiles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.6 PVT corners . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.7 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.8 Execution time reduction percentage using VariPipe . . . . . . . . . . . . . . . . 104
6.9 Energy consumption under the typical PVT corner . . . . . . . . . . . . . . . . . 105
List of Figures
2.1 Handshake data encoding schemes . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2 Handshake styles [1] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 C-element . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4 Micropipeline [2] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.5 MOUSETRAP [3] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.6 Epson’s flexible processor [4] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.7 Desynchronization method [5] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.8 Balsa code of a single-place buffer [6] . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.9 Handshake circuit of the single-place buffer [6] . . . . . . . . . . . . . . . . . . . 30
3.1 Sequencer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2 Comparison of S-sequencer and T-sequencer behavior . . . . . . . . . . . . . . . . 36
3.3 T-element . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4 An example of the WAR hazard . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.5 Edge-triggered-based variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.6 Five-stage FIFO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.7 Current synthesis of Dst← f(Dst, Src) in Balsa . . . . . . . . . . . . . . . . . . 41
3.8 Revised accumulator circuit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.9 Timing diagram for the circuit in Figure 3.8 . . . . . . . . . . . . . . . . . . . . . 43
3.10 Inserting a T-element in the Func channel . . . . . . . . . . . . . . . . . . . . . . 44
3.11 The T-isolator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.12 Timing diagram for the circuit in Figure 3.10 . . . . . . . . . . . . . . . . . . . . 45
3.13 Inserting a T-element in the Write channel . . . . . . . . . . . . . . . . . . . . . 47
3.14 Timing diagram for the circuit in Figure 3.13 . . . . . . . . . . . . . . . . . . . . 47
3.15 Inserting a T-element in the Act channel . . . . . . . . . . . . . . . . . . . . . . . 48
3.16 Timing diagram for the circuit in Figure 3.15 . . . . . . . . . . . . . . . . . . . . 49
3.17 Accumulation loops inside the multiplier . . . . . . . . . . . . . . . . . . . . . . . 52
3.18 Design flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.1 Clock generation design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.2 Clock generation loops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.3 Design Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.1 Clock generation circuit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.2 Clock generation circuit schematic . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.3 Proposed low-power PVT-aware design flow . . . . . . . . . . . . . . . . . . . . . 71
5.4 Performance of PVT-aware and fixed-clock DLX processors under all PVT corners 78
5.5 Design Space expansion using PVT-aware design . . . . . . . . . . . . . . . . . . 80
5.6 Dean’s clocking structure [7] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.7 Razor flip-flop for a pipeline stage [8]: a shadow latch controlled by a delayed clock augments each flip-flop
6.1 VariPipe technique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.2 Clock generation circuit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.3 Variable delay and toggle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.4 Reducing the switching power of delay element . . . . . . . . . . . . . . . . . . . 94
6.5 A simplified model of the clock generation circuit . . . . . . . . . . . . . . . . . . 94
6.6 Proposed VariPipe design flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.7 Verilog code of the Execution unit . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.8 Performance of VariPipe and fixed-clock DLX processors under all PVT corners . 103
6.9 Comparison of the clock power spectra . . . . . . . . . . . . . . . . . . . . . . . . 107
C.1 Chip layout of the PVT-aware processor . . . . . . . . . . . . . . . . . . . . . . . 120
Chapter 1
Introduction
Semiconductor technology scaling has introduced new challenges in the design and im-
plementation of synchronous systems such as coping with process variations. Process
variations push designers to use more conservative delay estimations, which result in
overly large, leaky and slow circuits. On top of process variations, voltage and tempera-
ture variations also push the designer towards even more conservative delay estimations.
On the other hand, asynchronous systems exhibit properties that can be of great bene-
fit to today’s implementations, including lower power consumption, better adaptability
to process and environmental variations and better performance. Asynchronous circuits
are capable of adjusting their speed to the present input data and to operating condi-
tions such as temperature and voltage, while a conventional synchronous design always
assumes the longest possible delay.
Despite the advantages of asynchronous design, building useful asynchronous circuits
is burdened by the difficulties in their design and implementation. In this dissertation,
first, a methodology to increase concurrency in the operation of asynchronous circuits
is suggested, which results in faster, smaller and less energy-consuming asynchronous
circuits.
However, real-world applications are mainly implemented as synchronous circuits, us-
ing well-established synchronous design tools. The main thesis of this work is that asyn-
chronous techniques can be applied to synchronous circuits, to considerable advantage. A
methodology is proposed to design synchronous circuits with asynchronous advantages.
The methodology enhances real-world designs and applications with asynchronous advan-
tages, making possible higher speed, smaller area, lower power consumption and better
resilience to process, environmental and operating variations.
1.1 Thesis Motivation
As technology scales to smaller feature sizes, process variations make the quality of fab-
ricated chips less predictable. Therefore, more and more conservative delay estimations
are used, which result in overly complex and large circuits with excessive power require-
ments. The variations in other parameters such as voltage and temperature also oblige
more conservative delay estimations. The clock frequency of traditionally-designed syn-
chronous systems is determined using the worst-case process-voltage-temperature (PVT)
analysis of the most critical path. Yet a system operates under typical PVT conditions
most of the time, and in many cases its critical path is not even exercised. Under these
conditions, the system could run faster than worst-case analysis allows. In
summary, the combination of PVT variations and the traditional synchronous design style
results in excessively large, power-hungry and slow circuits.
Today’s applications demand more functionality and performance, resulting in a con-
stant increase in power consumption. For example, more and more functions are added
to battery-limited handheld devices. How is it possible to meet all these requirements
with available technologies?
Asynchronous (clockless) design exhibits features that can help mitigate
today’s design challenges. These features are studied in detail in Chapter 2, along
with the definition of asynchrony and asynchronous design styles. Potential advantages
of asynchronous design over synchronous design include:
• Resilience to process and operating conditions
• Lower power consumption
• Higher performance
The features of asynchronous design are in line with the design challenges mentioned
earlier. The objective of this work is to find a way to put asynchronous advantages into
real-world use. In other words, to introduce a practical methodology to use asynchronous
advantages in solving the problems introduced earlier such as coping with process varia-
tions, lowering power consumption and improving performance.
Unfortunately, designing asynchronous circuits is difficult and the circuits synthesized
by asynchronous tools are large and slow. This is one of the reasons why asynchronous
design style has not been accepted as a mainstream approach by the industry. The
challenges in the design and implementation of asynchronous circuits are studied in the
next chapter.
1.2 Research Flow
The first step in the work presented in this dissertation was to study the available tools and
techniques to design asynchronous systems. The main goal was to understand some of
the limitations of asynchronous circuits and also to study the resulting products. There
are two main groups of asynchronous design tools:
1. Tools such as Balsa [9] and Haste (Tangram) [10–12] have been developed, which
use their own specific hardware description language (HDL) to describe purely asyn-
chronous systems. These tools convert the design into Verilog and then use third-
party tools to synthesize and lay out the circuit.
2. Tools that use conventional HDL (e.g., Verilog or VHDL) to describe a synchronous
design and then convert it into an asynchronous circuit (desynchronization). The
circuit is then synthesized and laid out using dominant Electronic Design Automa-
tion (EDA) tools.
First, Balsa was used to implement various test circuits. This experience led to several
observations. The main difference between an asynchronous circuit and a synchronous
one is the way timing is realized. Synchronous design uses a single global timing clock
signal. By contrast, asynchronous design uses local handshakes between modules to time
their communication and to transfer data. When a module has new data to transfer, it
sends a request signal and when the receiver is ready, data are transferred. This allows the
system to work at the highest possible rate. Different data require different processing
time; asynchronous systems can adjust their speed to the input data. Asynchronous
circuits can likewise adjust their speed to operating conditions such as voltage and
temperature. Hence, asynchronous operation is not bound by worst-case
assumptions, but determined by the average-case processing delay of the input data and
average-case operating conditions. This feature, although appealing, comes at a cost:
handshake control circuits are spread throughout an asynchronous circuit to manage its timing.
This leads to two main problems:
1. Control circuits take area and consume power, especially leakage power.
2. In many cases, the performance is limited by the handshake control circuitry. That
is, the delay of the control circuit is so significant that it may become the bottleneck
of the system.
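The request/acknowledge exchange and its cost can be made concrete with a small software model. The following Python sketch models a four-phase (return-to-zero) handshake channel; the `Channel` class and its methods are invented for this illustration and do not come from any asynchronous design tool:

```python
import threading

class Channel:
    """Toy model of a four-phase handshake channel (illustrative, not RTL)."""
    def __init__(self):
        self.cv = threading.Condition()
        self.req = False   # request wire, driven by the sender
        self.ack = False   # acknowledge wire, driven by the receiver
        self.data = None

    def send(self, value):
        with self.cv:
            self.data = value
            self.req = True                            # 1: raise req, data valid
            self.cv.notify_all()
            self.cv.wait_for(lambda: self.ack)         # 2: wait for ack to rise
            self.req = False                           # 3: drop req
            self.cv.notify_all()
            self.cv.wait_for(lambda: not self.ack)     # 4: wait for ack to drop

    def receive(self):
        with self.cv:
            self.cv.wait_for(lambda: self.req)         # wait for a request
            value = self.data                          # latch data while valid
            self.ack = True                            # acknowledge
            self.cv.notify_all()
            self.cv.wait_for(lambda: not self.req)     # wait for req to drop
            self.ack = False                           # complete return-to-zero
            self.cv.notify_all()
            return value

# Sender and receiver run concurrently with no global clock; transfers
# complete at whatever rate both sides allow.
ch = Channel()
results = []
t = threading.Thread(target=lambda: [results.append(ch.receive()) for _ in range(3)])
t.start()
for v in [10, 20, 30]:
    ch.send(v)
t.join()
print(results)  # -> [10, 20, 30]
```

Every transfer costs a full req/ack round trip, which is the software analogue of the overhead described above: the protocol adapts to whichever side is slower, but the handshake itself is never free.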
The desynchronization method generates faster and smaller circuits compared to
Balsa. However, the handshake control circuits are still large and slow. Balsa and
desynchronization are studied in the next chapter.
In this thesis, an optimization is proposed and tested using Balsa to increase the
concurrency in asynchronous circuits involving write-after-read (WAR) operations. It is
shown that handshakes can be overlapped to achieve a higher performance. However,
because the resulting circuit is still purely asynchronous, the control circuit limits the
performance significantly.
This experience made it clear that the main problem of asynchronous circuits is that
their advantages are burdened by the control circuit implementation. By contrast, syn-
chronous circuits have a minimal timing control mechanism. The main control signal in
synchronous circuits is the clock signal. Therefore, the synchronous design approach was
adopted as the starting point. Then, asynchronous design techniques were leveraged to
build the clock signal for the synchronous system. The combination is a new approach
to design and implement synchronous circuits with asynchronous advantages. The asyn-
chronous clock generation circuit introduces a very small area and power consumption
overhead. The proposed approach only uses dominant synchronous design synthesis,
timing and layout tools. Therefore, it can be used in many applications.
The resulting system using the proposed methodology is a synchronous circuit that
automatically tunes itself to deliver the best possible performance under the prevailing
process-voltage-temperature (PVT) conditions. The methodology relaxes the conservative assumptions
involved in the synchronous design process, resulting in significantly smaller and less leaky
circuits.
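The self-tuning behavior can be illustrated with a toy model. In the Python sketch below, the clock period is derived from a replica delay line that is assumed to experience the same PVT scaling as the datapath; the delay value, worst-case factor and margin are hypothetical numbers chosen for illustration, not figures from the processor described later:

```python
# Toy model of PVT-aware clocking: the clock is generated by a replica
# delay line that slows down and speeds up with the same PVT factor as
# the logic it times, so the period tracks conditions automatically.
# All numbers below are hypothetical.

CRITICAL_PATH_NS = 10.0   # assumed nominal critical-path delay
WORST_CASE_FACTOR = 1.4   # assumed slow-down at the worst PVT corner
REPLICA_MARGIN = 1.1      # small local margin for replica mismatch

def fixed_clock_period_ns():
    # conventional design: one period sized for the worst PVT corner
    return CRITICAL_PATH_NS * WORST_CASE_FACTOR

def pvt_aware_period_ns(pvt_factor):
    # the replica delay line sees the same PVT factor as the logic,
    # so the generated period stretches and shrinks with conditions
    return CRITICAL_PATH_NS * pvt_factor * REPLICA_MARGIN

# Under typical conditions (factor 1.0) the self-timed clock runs well
# below the fixed worst-case period; at the worst corner it stretches
# just enough to stay safe.
for pvt in (0.9, 1.0, 1.4):
    print(pvt, pvt_aware_period_ns(pvt), fixed_clock_period_ns())
```

With these assumed numbers, the PVT-aware period at typical conditions (11.0 ns) is well under the fixed worst-case period (14.0 ns), which is the source of the speed and leakage benefits claimed here.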
The results of using the suggested approach on a 32-bit microprocessor are promising,
demonstrating that the methodology may be used as an alternative to purely synchronous
or purely asynchronous approaches. Designers can use the proposed approach to achieve
significant power reductions or performance improvements in many high-speed applica-
tions. The designer does not need to worry as much about delays and variations because
on-chip circuitry automatically corrects for these variations. This should reduce design and
computer-aided design (CAD) complexity.
The proposed methodology is then expanded to implement variable-speed pipelines.
A variable clock period is generated that changes cycle-by-cycle according to the current
operations in the pipeline and the current process-voltage-temperature (PVT) conditions.
The resulting system is much faster than its fixed-clock counterpart and produces less
electromagnetic noise.
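A toy model gives a feel for the potential gain. The opcodes and delay profile below are hypothetical, and the model assumes one operation in flight per cycle, a simplification of a real pipeline in which each cycle's period must cover the operations in all stages:

```python
# Toy model of a variable-clock pipeline: each cycle's period is taken
# from a per-operation delay profile instead of one fixed worst-case
# period. Opcodes and delay values are hypothetical.

DELAY_PROFILE_NS = {"add": 4.0, "logic": 3.0, "load": 7.0, "mul": 10.0}
FIXED_PERIOD_NS = max(DELAY_PROFILE_NS.values())  # conventional worst-case clock

def run_time_ns(ops, variable_clock=True):
    """Total execution time for a stream of operations."""
    if variable_clock:
        # each cycle lasts only as long as the current operation needs
        return sum(DELAY_PROFILE_NS[op] for op in ops)
    # fixed clock: every cycle pays the worst-case period
    return len(ops) * FIXED_PERIOD_NS

program = ["add", "logic", "add", "load", "mul", "add"]
variable = run_time_ns(program, variable_clock=True)   # 32.0 ns
fixed = run_time_ns(program, variable_clock=False)     # 60.0 ns
print(f"speedup: {fixed / variable:.2f}x")
```

For this invented instruction mix, short operations no longer pay the multiplier's worst-case period, and the speedup is close to the 2X figure reported for the real processor; the actual gain depends entirely on the delay profile and the workload.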
1.3 Thesis Contributions
The main contributions of this thesis are:
1. A methodology to increase concurrency and enhance the asynchronous synthesis of
the circuits that involve write-after-read operations.
2. A low-overhead design methodology to implement synchronous circuits with asyn-
chronous advantages. The resulting circuit adjusts its speed to the prevailing PVT
conditions. The proposed methodology is accompanied by a complete design flow
using standard cells and dominant EDA tools. The methodology reduces the leak-
age power of high-speed applications in the deep nanometer regime and mitigates
process variations.
3. The proposed methodology is expanded to design variable-clock synchronous pipelines
that adjust their speed to the current operations in the pipeline as well as current
PVT conditions. The resulting system has a higher speed and lower electromagnetic
emissions. The overhead of the added clock generation circuit is significantly lower
than in previous work.
1.4 Thesis Organization
The remainder of this dissertation is organized as follows: Chapter 2 provides the re-
quired background. The first contribution of the thesis is explained in Chapter 3, where
several test circuits synthesized using Balsa are examined and optimized by the proposed
technique. Chapter 4 introduces the new design methodology to implement synchronous
circuits with asynchronous advantages, which is then used in Chapter 5 for leakage re-
duction. The proposed technique is expanded further in Chapter 6 to design pipelines
with a variable clock that adjusts to the current operations in the pipeline and current
PVT conditions. Chapter 7 presents concluding remarks and suggestions for future work.
Chapter 2
Background
2.1 Introduction
This chapter presents the background material that forms the basis for the research
presented in later chapters. This dissertation demonstrates that many challenges in the
design of synchronous systems in today’s nanometer regime are mitigated by employing
asynchronous techniques. First, these challenges are discussed and then, asynchronous
design style is reviewed.
The asynchronous design style is introduced, followed by the handshake protocols used to implement
asynchronous circuits. Next, the C-element, an important building block of asynchronous
circuits, is described. Asynchronous pipelines and globally asynchronous locally syn-
chronous (GALS) systems are explained because they are referred to later. Potential
asynchronous design style advantages are discussed, followed by asynchronous applica-
tion examples. These examples demonstrate the usefulness of asynchronous design in
the real world. Challenges of asynchronous design are explained next, followed by earlier
work that attacks some of the difficulties in the design and implementation of asynchronous
circuits.
2.2 Challenges of Today’s Technologies
Parameter variations include process variations due to manufacturing phenomena, volt-
age variations due to both manufacturing and runtime phenomena, and temperature
variations due to varying activity and power consumption levels. Process variations
manifest themselves as die-to-die, within-die and wafer-to-wafer variations. Tempera-
ture and voltage variations are dynamic and change with the circuit operation mode
and the activity. These parameters are collectively referred to as PVT (process-voltage-
temperature). Among all these variations, process variations are becoming increasingly
severe as technology scales down to smaller feature sizes [13].
The variations in process, voltage and temperature make it difficult to achieve the
desired performance and power consumption. First of all, the maximum clock frequency
of the system is determined by worst-case analysis of the critical path. This
limitation is addressed in Chapter 6, where a methodology to increase performance with-
out changing the design architecture is presented.
Another design challenge in today’s technologies is power consumption. With
technology scaling, more transistors are packed on the same chip, increasing the power
consumption. The supply voltage is decreased to reduce the power consumption and,
accordingly, the threshold voltage of transistors is decreased. A reduction in the threshold
voltage results in a significant increase in the leakage power. The subthreshold current
of CMOS transistors, which is the dominant leakage current component in sub-100nm
technologies, is given by (2.1), where I0 and η are constants, VT is the thermal voltage
and Vth is the threshold voltage. According to this equation, the subthreshold current
increases exponentially with a decrease in the threshold voltage. A comprehensive study
of leakage power physics may be found in [14].
Isubth = I0 e^(−Vth/(η VT)) (2.1)
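The exponential sensitivity in (2.1) can be illustrated with a short numerical sketch. The constants below (I0, η) are purely illustrative, not values from the thesis; VT is taken as roughly 26 mV at room temperature.

```python
import math

# Illustrative evaluation of (2.1): I_subth = I0 * exp(-Vth / (eta * VT)).
# I0 = 1e-6 A and eta = 1.5 are made-up demonstration constants.
def subthreshold_current(v_th, i0=1e-6, eta=1.5, v_t=0.026):
    return i0 * math.exp(-v_th / (eta * v_t))

# Lowering Vth from 0.4 V to 0.3 V multiplies leakage by exp(0.1 / 0.039),
# roughly a factor of 13, illustrating the exponential dependence.
ratio = subthreshold_current(0.3) / subthreshold_current(0.4)
```

This is why the threshold-voltage reductions that accompany supply scaling translate into a disproportionate leakage increase.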
Average dynamic power mainly depends on the switching activity of the circuit, the
size of the design, and the supply voltage as shown by (2.2) [15], where C is the load
capacitance of a CMOS gate, f is the clock frequency, α is the gate switching activity
and V is the supply voltage.
PDynamic = α C V² f (2.2)
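A direct evaluation of (2.2) shows the quadratic dependence on supply voltage; the parameter values below are illustrative, not taken from the thesis.

```python
# Average dynamic power per (2.2): P = alpha * C * V^2 * f.
def dynamic_power(alpha, c, v, f):
    return alpha * c * v ** 2 * f

# Halving the supply voltage quarters the dynamic power, all else equal.
p_full = dynamic_power(alpha=0.2, c=1e-12, v=1.0, f=1e9)
p_half = dynamic_power(alpha=0.2, c=1e-12, v=0.5, f=1e9)
```

This quadratic V dependence is the reason supply-voltage scaling is the first lever used against dynamic power, even though it forces the threshold-voltage reductions that worsen leakage.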
Several power reduction techniques are studied in Chapter 5, along with a new
methodology to reduce both leakage and dynamic power.
2.3 Asynchronous Circuits
A circuit is asynchronous when a clock is not used to time the operations of the circuit.
Instead, different modules inside the circuit send completion signals to each other to
indicate the completion of an operation and to request new data. The signals between
different modules are usually referred to as handshake signals. In contrast with a syn-
chronous circuit, the timing between modules is governed by the actual operations of the
system rather than by a predetermined timing source.
2.3.1 History of asynchronous systems
The field of asynchronous systems is both old and young. The 1952 ILLIAC at the
University of Illinois had both synchronous and asynchronous parts [16]. The 1960 PDP6
from DEC was asynchronous [17]. Among the pioneers in this field who have developed
much of the theoretical framework are Huffman [18] and Muller [19]. They introduced
Huffman’s circuits and Muller’s circuits, along with basic asynchronous modules. Molnar
at Washington University focused on metastability [20].
Even though asynchronous logic never disappeared entirely, once clocked techniques
offered an easy way to hide hazards and timing complications, clock-less logic was largely
forgotten until the end of the 1970s. The Caltech Conference on VLSI in 1979 contained a complete
session on self-timed circuits. Sutherland’s efforts, especially his award-winning paper on
micropipelines [2], kept the field of asynchronous logic alive.
The first synthesis method for asynchronous logic appeared in the mid-1980s with the
Caltech program-transformation approach [21]. Since that time, there have been other
synthesis systems such as Philips Tangram and the University of Manchester’s Balsa [22,
23]. New hardware description languages have also been developed. Some research on
asynchronous design has been done in the industry, such as the work by Epson. Epson
added asynchronous concepts to Verilog and called it Verilog+. This language has its
own compiler to deal with asynchronous designs [4].
The first single-chip asynchronous processor was designed at Caltech in 1988 [24].
It was followed by Amulet from the University of Manchester in 1993 [25]. There have
been other processors, such as TITAC from the Tokyo Institute of Technology [26].
MiniMIPS, a 32-bit MIPS R3000 microprocessor implemented by Caltech in 1997, was
probably the fastest asynchronous processor of its time [27]. Epson introduced its 8-bit
asynchronous microprocessor in 2005 which can be used in wearable devices [4].
In summary, the asynchronous world has been actively creating low-power, self-timed,
low-radiation, high-performance and adjustable systems in the academic and industrial
environments.
2.4 Asynchronous Styles and Building Blocks
The clock is not used to implement timing in asynchronous systems [28]. Local controllers
replace the global clock to time the operation of modules and their communications.
It is essential to understand how these communications work in order to implement real
asynchronous systems. In this section, asynchronous handshake protocols such as the
return-to-zero method are introduced, and different classes of asynchronous circuits
are presented. This is followed by a study of an important asynchronous element called
the C-element.
2.4.1 Asynchronous handshake protocols
In asynchronous systems, there is no clock to specify the rate at which data between
modules should be transferred. Instead, communicating modules use a group of signaling
wires called handshakes. These signals are local to each pair of communicating modules.
Handshakes are activated only when there are new data to transfer, whereas in
synchronous design, clock pulses are generated and distributed whether there are new
data or not.
A handshake channel is formed between two communicating modules by request and
acknowledge signals. The channel may be a push channel, in which the sender sends
data and activates the request signal to indicate the data are ready; the receiver then
accepts the data and activates the acknowledge signal to indicate the data have been
received. By contrast, in a pull channel, the receiver initiates the handshake by sending
a request signal to the sender, and the sender replies by sending data and activating the
acknowledge signal. Handshakes may be implemented in different styles and the data
can be encoded in different ways, as explained next.
2.4.1.1 Data encoding
There are two commonly used types of data encoding for handshaking: 1) bundled (single-
rail) data and 2) 1-of-N data encoding, both shown in Figure 2.1.
In the bundled data protocol, data and handshakes have separate lines. It is also
called the single-rail protocol, to distinguish it from the dual-rail protocol discussed
next.
In the 1-of-N encoding protocol, there are N data lines and one acknowledge signal.
These N data lines are used to transfer log2(N) bits of data. For example, 8 lines are
used to transfer 3 bits of data. Similarly, two data lines are used to transfer one bit
of data. The latter case is the dual-rail data protocol. The value of each bit, which is
either 0 or 1, is realized with two separate wires. Therefore, if the wire showing value 0
is activated, the corresponding data value is 0. Similarly, if the other one is activated,
the data value is 1. Otherwise, there are no valid data on the channel.
(a) Bundled (single-rail) data protocol
(b) 1-of-N data encoding
Figure 2.1: Handshake data encoding schemes
Other data encoding schemes have also been used, such as the level-encoded dual-rail
protocol. Similar to dual-rail encoding, there are two wires for each bit, namely data and
phase. The data value is equal to the value of the bit being encoded. When two
consecutive data tokens are the same, the phase wire changes value to distinguish
between them.
2.4.1.2 Handshaking styles
Handshake signals can be four-phase (return-to-zero) or two-phase (non-return-to-zero).
In the four-phase style, request and acknowledge signals are activated and deactivated
by signal levels. For example, a high on request or acknowledge signal shows that it is
activated (up phase). To be activated again, the signal must first return to zero (down
phase). This style is shown in Figure 2.2a.
In the two-phase protocol, request and acknowledge are edge-activated, as shown in
Figure 2.2b. Both rising and falling edges of handshake signals indicate new activity
on the related signal. There is no need for the handshake signal to return to zero
before the next activation; thus, this style is called non-return-to-zero. The falling
and rising activities on control signals are called events. The two-phase style can only be
used in bundled-data communication: in 1-of-N data encoding schemes, there is more
than one request signal, making the two-phase style impossible to use.
(a) Four-phase
(b) Two-phase
Figure 2.2: Handshake styles [1]
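The difference between the two styles can be made concrete by counting control-signal transitions per transfer, as the descriptions above imply: a four-phase transfer needs four transitions (req up, ack up, req down, ack down), while a two-phase transfer needs only two (one event on each of req and ack). This is a toy accounting model, not circuitry from the thesis.

```python
# Control-signal transitions required per data transfer, per protocol style.
TRANSITIONS_PER_TRANSFER = {"four-phase": 4, "two-phase": 2}

def handshake_transitions(style, transfers):
    """Total signal transitions to complete the given number of transfers."""
    return TRANSITIONS_PER_TRANSFER[style] * transfers
```

The factor-of-two difference in switching activity is the intuition behind the expectation, discussed below, that two-phase systems should be faster.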
2.4.1.3 Comparison of asynchronous handshake protocols
The bundled-data protocol is more popular than dual-rail because it consumes less area
and tends to be faster. However, matching the delay between data and control lines
requires extra effort; if this is not done properly, control signals race with data signals
and hazards occur. By contrast, in the dual-rail protocol, the presence of data itself
implies a request to the receiver, so no delay matching between data and control signals
is needed. This protocol is used to implement delay-insensitive circuits.
Two-phase systems should be faster because there are no return-to-zero phases. How-
ever, the circuits required to implement this protocol are more complicated and thus,
slower. Simple and fast circuit implementations have been developed for the dual-rail
protocol, using early evaluation to speed up handshakes. More details can be found
in [29] and [30].
2.4.2 C-element
The Muller C-element is a commonly used asynchronous logic component originally de-
signed by David E. Muller. The main feature of the C-element is hysteresis. It has
memory that keeps its output state until all of its inputs change to the same state. At
that point, the output becomes equal to the inputs. The output remains in this state
until all the inputs switch to the opposite state. There are other varieties of C-elements,
such as the asymmetric C-element, in which some inputs affect the operation only on the
rising or only on the falling transition. Figure 2.3 shows the symbol, gate-level and transistor-level
design of the C-element.
(a) Symbol (b) Gate level (c) Transistor level
Figure 2.3: C-element
It should be noted that the C-element is one form of rendezvous circuit, which is used
to indicate when the last of two or more signals has arrived at a particular stage [31].
The C-element provides the AND function for events in the two-phase protocol. Another form
of rendezvous circuit is GasP [32].
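The hysteresis described above can be captured in a few lines. This is a behavioral sketch of the C-element's truth behavior, not a circuit netlist: the output switches only when all inputs agree, and otherwise holds its previous value.

```python
class CElement:
    """Behavioral model of a Muller C-element with hysteresis."""

    def __init__(self, initial=0):
        self.out = initial

    def update(self, *inputs):
        if all(x == 1 for x in inputs):
            self.out = 1
        elif all(x == 0 for x in inputs):
            self.out = 0
        # Otherwise the inputs disagree and the output holds its state.
        return self.out
```

For example, once both inputs have driven the output high, a single input returning to zero leaves the output unchanged until the other input follows.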
2.4.3 Classes of asynchronous circuits
Asynchronous circuits are classified according to the restrictions on their design. More
stringent restrictions result in fewer potential hazards, whereas more relaxed restrictions
permit simpler and faster circuits.
One class of asynchronous circuits is Delay Insensitive (DI) circuits [33]. These circuits
operate correctly regardless of the delays on their gates and wires. Here, unbounded
delays are assumed. This is a hard restriction to satisfy and it has been proven that not
many useful DI circuits can be built. Circuits composed of only inverters and C-elements
can be DI.
Quasi-Delay-Insensitive (QDI) circuits are delay-insensitive except that isochronic
forks are permitted [34]. This means that a bounded skew is allowed between different
branches of a fork. This class of circuits has been used more than other classes in
real implementations.
A Speed-Independent (SI) circuit is a circuit that operates correctly regardless of
gate delays. Wire delays are neglected or assumed to be zero [35]. Self-timed circuits are
made up of elements that have their own local timings but at their interface, they are
all delay insensitive. Scalable-Delay-Insensitive (SDI) circuits are very similar to QDI
circuits. However, SDI assumes that the relative delay ratio between two components is
bounded. This helps in designing simpler and faster practical circuits in comparison to
QDI circuits [26].
2.5 Asynchronous Pipelines
Many large sequential circuits are organized as pipelines. Pipelines divide the total
work among different modules, which are kept busy processing different data from the
input queue. There are several asynchronous pipeline styles, such as micropipelines [2],
MOUSETRAP [29], QDI pipelines [36, 37], asP* [38], GasP [32], wave pipelines [39, 40],
and surfing [41]. Micropipelines are studied here because they are the basis for the clock generation circuits
proposed in Chapters 5 and 6. MOUSETRAP is also studied because it is a low-overhead
high-speed asynchronous pipeline design approach, comparable to the one proposed in
Chapter 6.
2.5.1 Micropipelines
Micropipelines were the first asynchronous pipelines, invented by Sutherland in 1989 [2].
A basic Micropipeline is shown in Figure 2.4. It is based on the two-phase handshaking
protocol. The C-elements are used to compare the state of each stage with the next
one. The difference in states means that the current stage is full and the next stage is
empty, and therefore data can be transferred. Cd (Capture done) and Pd (Pass done)
are simply delayed versions of the C (Capture) and P (Pass) signals. Storage elements in
micropipelines are event-controlled elements, composed of two side-by-side latches.
These latches are activated alternately to generate similar responses to rising
and falling events.
Figure 2.4: Micropipeline [2]
A drawback of micropipelines is that their event-driven storage elements are complex
and slow. However, their control circuit is elegant and is a reference for many other works
in the field [32].
2.5.2 MOUSETRAP
In 2001, Singh and Nowick introduced an asynchronous pipeline control mechanism called
MOUSETRAP, which stands for Minimal-Overhead Ultra-high-SpEed Transition-signaling
Asynchronous Pipeline [29]. The importance of MOUSETRAP is that 1) unlike mi-
cropipelines, it uses conventional latch storage elements, and 2) the control circuit overhead
is minimal, so high-performance pipelines may be realized. As shown in Figure 2.5,
the MOUSETRAP design is based on two-phase bundled-data protocol.
Figure 2.5: MOUSETRAP [3]
Each pipeline stage has a data latch and a latch controller. The data latch is a simple
transparent latch with a default state of being transparent, allowing new data to pass
through quickly. The latch controller enables and disables the data latch. It consists of
only one XNOR gate with two inputs: done from the current stage N and ack from stage
N + 1. Since the pipeline uses a two-phase protocol, the XNOR gate acts as a phase
converter: it converts the transitions on the done and ack wires into a level that
controls the corresponding latch.
The operation of MOUSETRAP is as follows. Assume that the latch in stage N is
transparent. A change in the reqN signal means that a new data item is present at the
stage. The data are latched in the data latch and, at the same time, the reqN signal is
latched by the control latch. As a result, the doneN signal changes state, causing the
output of the XNOR gate to become zero and closing the latch. The doneN signal is
delayed by a fixed delay equal to the worst-case logic processing of stage N. The resulting signal is
reqN+1. If the next stage is ready to accept new data, the new data and control are
latched causing the ackN signal to change state. This changes the output of the XNOR
gate of stage N and opens the data latch of that stage to accept new data.
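The latch controller described above is a single XNOR gate, which can be sketched as follows. This is an illustrative model of the control equation only, not the full pipeline: the stage-N latch is transparent when doneN and ackN+1 are in the same phase, and opaque when they differ.

```python
# MOUSETRAP latch controller for stage N: enable = XNOR(doneN, ackN+1).
def latch_enable(done_n, ack_next):
    return 1 if done_n == ack_next else 0

# Walking through the sequence described in the text:
en_idle = latch_enable(0, 0)   # idle: latch transparent
en_hold = latch_enable(1, 0)   # new data latched, doneN toggles: latch closes
en_open = latch_enable(1, 1)   # stage N+1 acknowledges: latch reopens
```

Because done and ack are transition (two-phase) signals, the XNOR performs exactly the phase conversion mentioned above: equal phases mean the stage's data have been consumed, so the latch may safely stay open.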
2.6 Globally Asynchronous Locally Synchronous Systems
A Globally Asynchronous Locally Synchronous (GALS) system is a mixture of syn-
chronous blocks and asynchronous interfaces. It utilizes advantages of synchrony and
asynchrony. GALS systems were first introduced by D. Chapiro in his PhD thesis in
1984 [42]. GALS is studied here because it can be used in combination with the design
approach presented in Chapters 5 and 6.
Each block in a GALS system is designed as synchronous and its interface is asyn-
chronous. Thus, different blocks interact asynchronously without using a global clock,
reducing clock skew problems. Therefore, GALS architectures are suitable for System
on Chip (SoC) and Network on Chip (NoC). They inherit some benefits of asynchronous
systems such as power and emission reduction and modularity. In short, synchronous
conventionality and asynchronous modularity are combined in GALS systems.
The advantages of GALS systems may not outweigh the additional effort required to
implement such systems. First of all, industries cannot directly utilize GALS because
most engineers are not familiar with asynchronous concepts. Also, the design process to
implement a GALS system is difficult. There is no known algorithm that can be used to
partition a system into different clock regions and automatic place and route tools are
not available for GALS [43].
2.7 Potential Asynchronous Design Advantages
In this section the advantages of asynchronous design are summarized and discussed.
There are many challenges in implementing applications as asynchronous systems. There-
fore, it is necessary to understand why asynchronous design should be considered. Asyn-
chronous circuits can be better than synchronous counterparts in many ways, which
include lower power, lower emission level, modularity, and better adaptability to process
variations. Real-world examples that demonstrate these advantages are presented in the
next section.
2.7.1 Avoiding clock skew
System on Chip (SoC) architectures are becoming increasingly complicated. Feature sizes
are decreasing and more functionality is added to the chip. At the same time, higher
frequencies are used. With more clocked modules at high frequencies, the circuits are
more sensitive to clock skew. The problem of clock skew becomes more serious as the
technology advances and the frequency increases.
Many efforts have been made to solve this problem. One of the proposed solutions
is to use clock-less systems. Asynchronous systems do not have a global clock and thus
do not suffer from clock skew.
A branch of asynchronous systems called Globally Asynchronous Locally Synchronous
(GALS) systems, introduced previously, may be used to avoid using a global clock. In-
stead, local clocks are used in the submodules of the system, which are interconnected
using asynchronous interfaces. Since clocks are confined to smaller areas, clock skew is
reduced. However, as mentioned before, compiling digital designs into an architecture
which is neither purely synchronous nor purely asynchronous is hard.
2.7.2 Lower electromagnetic noise
The noise generated by synchronous systems is highly concentrated in one frequency —
the global clock frequency. This can cause interference within the system, as well as with
adjacent systems that operate at a similar frequency.
With no global clock, each part of an asynchronous system operates at its own speed.
Consequently, the system operates in a much wider range of frequencies compared to
a synchronous system. As a result, the electromagnetic noise of an asynchronous sys-
tem is not concentrated in a single frequency. The lower electromagnetic emission of
asynchronous systems has been used in several applications, such as a pager [44].
2.7.3 Lower power consumption
Power comparison of asynchronous systems and their synchronous counterparts depends
on the application and the technology. Three main factors should be considered when
comparisons are made:
• Clock tree versus handshake controls: The clock tree consumes a large portion of the
power because it switches at the fastest rate in the circuit. The local handshakes
replacing the clock tree in asynchronous circuits may consume significant power.
However, when the circuit is idle, the power consumption of the handshake control
is minimized automatically.
• Area and leakage power: Generally, asynchronous circuits tend to be larger than
their synchronous rivals. The number of transistors needed to implement a specific
circuit asynchronously is often higher than in its synchronous counterpart due to
the handshake controllers. Larger area and more transistors increase the power
consumption, especially leakage power in deep submicron technologies.
The choice between asynchronous and synchronous implementation depends on the
application. In systems that quickly switch between idle and active modes such as burst
receivers and RFID, asynchronous design might be a better choice [45]. There are tech-
niques that reduce the power consumption of synchronous circuits during idle periods,
such as clock gating and turning off the clock oscillator. However, these techniques
are costly; for instance, clock gating reduces the maximum operating clock frequency.
Switching an oscillator on and off is also a very costly solution in terms of delay and
energy consumption because it takes time and energy for the system to return to its
normal running mode.
2.7.4 Resilience to process and environmental variations
An important property of asynchronous systems is their adaptability. Since asynchronous
systems can be designed as self-timed circuits, they can adapt to variations in voltage,
temperature, data rate and even process. Several researchers have studied this property
and have successfully shown that asynchronous circuits can adjust to wide variations in
operating parameters [4, 5, 46].
2.7.5 Modularity
Another feature of asynchronous systems is modularity. In synchronous implementations,
interface design is always time consuming. It is a hard part of the design process, partic-
ularly in big projects where different submodules may require different clock frequencies.
Asynchronous systems are clock-less and thus, connecting asynchronous modules is easier
and should require less effort.
2.7.6 Higher performance
Finally, an important area of comparison of asynchronous and synchronous systems is
performance. Several research papers have been published on this matter [47, 48]. It
is not easy to answer the question: are asynchronous systems faster than synchronous
ones?
The performance of a synchronous system is limited by its slowest module. In the
design of synchronous systems, the worst-case situation must be taken into account. The
critical path under worst-case process-voltage-temperature (PVT) conditions defines the
clock period. By contrast, the performance of an asynchronous system is based on average
delay. The delay of the longest circuit path currently in operation determines the speed.
Therefore, the performance of an asynchronous system is data dependent. In addition to
data, the speed of the system adjusts to current PVT conditions and not the worst-case
conditions. As a result, if the worst-case delay is longer than the average-case delay and
if the worst-case delay happens rarely, then, an asynchronous implementation is likely to
outperform its synchronous counterpart.
On the other hand, asynchronous circuits are slowed down by handshake delays. Hand-
shakes are mostly based on return-to-zero protocols, which are generically slow [2]. Non-
return-to-zero protocols are potentially faster; however, the circuits used to implement
them are bigger, and consequently their performance is not better than that of circuits
with return-to-zero protocols.
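The worst-case versus average-case argument above can be quantified with a toy model: a synchronous clock period must cover the slowest operation, while an asynchronous circuit takes only as long as each operation actually needs. The delay values below are illustrative, not measurements from the thesis.

```python
# Throughput advantage of average-case operation over a worst-case clock.
def async_speedup(op_delays):
    worst = max(op_delays)                     # synchronous clock period
    average = sum(op_delays) / len(op_delays)  # asynchronous mean latency
    return worst / average

# A rarely exercised slow path among mostly fast operations:
speedup = async_speedup([10.0, 6.0, 6.0, 6.0])
```

When the slow path dominates the distribution instead of appearing rarely, the ratio approaches 1 and the asynchronous advantage evaporates, matching the condition stated above.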
2.8 Asynchronous Design Applications
Several application examples of asynchronous design are presented in this section, taken
from both industrial and academic research. These applications show where asynchronous
design is strong and useful.
The implementation of a Radio Frequency IDentification (RFID) system demonstrates
that the adaptability of asynchronous design may be used in systems that must work
in different environments [45]. In this work, researchers implemented an active reader
and a passive tag for a contactless RFID system. Energy and data can be transmitted
between the reader and the tag over a range of 6 centimeters. When the distance of the
tag changes, the asynchronous receiver can adapt to the new signal power. Therefore,
different power configurations can be determined for the required data rate over the
channel. The implemented RFID is a high-speed low-power module that consumes power
only when it is active and can reach a data rate of 1.02 Mbps.
Another work is an ultra-low-power asynchronous processor [5]. This processor is
based on Atmel’s Advanced Virtual RISC (AVR) instruction set architecture. The im-
plemented processor can work in a wide range of voltages, including values very close to
the threshold voltage of a transistor. The adaptability of the processor to this wide range
of voltages allows it to consume different levels of power according to the required per-
formance. The processor is low-power and low-emission making it suitable for Wireless
Sensor Networks (WSN).
In 2005, Epson Company introduced its new flexible 8-bit asynchronous micropro-
cessor [4], which is shown in Figure 2.6. It is built on the Thin Film Transistor (TFT)
technology used for wearable devices. Due to the wide crystal variations in this tech-
nology, satisfying the restrictive constraints of synchronous design, such as a fixed clock
frequency, is not possible.
Figure 2.6: Epson’s flexible processor [4]
An ARM-compatible microprocessor for smartcard chips was built using the dual-rail
protocol. Dual-rail has better resistance to attacks based on power analysis and electro-
magnetic analysis than the single-rail protocol [49].
A study of a Viterbi decoder’s data path shows that the average-case delay of its data
path is only 84% of the worst-case delay. Since asynchronous circuits operate at average-
case speed, an asynchronous design was used to achieve better performance than the
synchronous implementation [48].
A group of researchers at Stanford University designed a locally clocked globally
asynchronous cache controller that is twice as fast as its synchronous counterpart.
They exploited the same average-case operation of asynchronous systems to achieve
this improvement [47].
Finally, the adjustability of asynchronous circuits to voltage and coupling variations
has been used in implementing a robust and high-throughput inter-chip communication
channel in [50].
Despite the advantages of asynchronous systems, building these systems is challenging.
These challenges are described in the next section.
2.9 Difficulties in Asynchronous Design
In this section, a general review of some difficulties in the design of asynchronous circuits
and systems is given, which can be categorized into four groups:
1. The lack of designers’ knowledge and experience in asynchronous design.
2. The lack of strong compilers and Electronic Design Automation (EDA) tools.
3. The lack of pre-designed asynchronous architectures (e.g., FPGAs).
4. Difficulties that are inherent to asynchrony.
First of all, asynchronous design methodologies are different from synchronous design
methodologies. Synchronous systems have been in the market for decades. Therefore,
many engineers are familiar with synchronous design flows, and synchronous design tech-
niques have been well developed. Engineers and companies are not willing to use asyn-
chronous design unless there is proof that asynchronous circuits can be useful
in solving real-world problems.
In comparison to synchronous tools, asynchronous design tools are not well developed.
There are a few languages and packages, such as Philips Haste and Epson Verilog+,
but these languages are not publicly available yet. There are some academic tools
such as Balsa. The problem is that there is no consensus on a design methodology
for asynchronous implementation. Each company or university uses its own language,
which makes it more difficult for synchronous designers to understand where to enter the
asynchronous world. Many synchronous designers know Verilog/VHDL; however, these
languages in their present form are not suitable for describing asynchronous systems.
The lack of pre-designed asynchronous architectures is another obstacle for utilizing
asynchronous design. Many digital designers are not necessarily experienced in ASIC
design. They describe their desired hardware using an HDL and then simply use com-
mercial tools to target programmable logic devices such as FPGAs. FPGAs are useful
for prototyping and mass production of many applications.
Unfortunately, a similar design flow does not exist for asynchronous design. There is
no commercial asynchronous FPGA on the market, although some experimental asyn-
chronous architectures have been built [51]. FPGAs have been used for proving an idea or
demonstrating the performance of asynchronous systems [52]. Although this practice can
be useful for developing ideas, it is not suitable for industrial use. Present FPGA com-
pilers do not understand asynchronous designs and lead to very poor place and route or
even an unsuccessful fitting. Developing FPGA architectures that support asynchronous
design and providing EDA tools for them are necessary stepping stones to broaden the
use of asynchronous systems.
There are also many difficulties that are inherent to asynchrony, such as hazards and
deadlocks. Hazards can cause asynchronous systems to malfunction. In any computation,
there might be glitches due to different gate and wire delays. In synchronous systems,
these glitches are rarely important because flip-flops only accept inputs at clock edges.
In asynchronous systems, glitches are indistinguishable from real data and hence may
lead to errors. A discussion of hazard categorization and avoidance techniques can be
found in [1].
Deadlocks can occur in systems where two or more competing actions are waiting
for one another to finish, and thus neither ever does. While there is no general solution
for deadlock prevention, knowledge of the system operation along with the designer’s
experience can help in reducing the possibility of deadlocks.
2.10 Attacking Asynchronous Challenges
Despite the difficulties in the design of asynchronous systems, researchers have proposed
several methodologies to simplify the design process, which are reviewed here. An asyn-
chronous design methodology or design flow determines the way a designer describes the
behavior of an asynchronous system. This description is then synthesized for a specific
technology. Asynchronous design methodologies can be grouped into three main categories:
1. Design methodologies that start with a synchronous design and convert it to an
asynchronous system (desynchronization).
2. Design methodologies that use conventional hardware description languages or an
amendment of these languages.
3. Design methodologies that use Communicating Sequential Processes languages.
In this section, these design methodologies are introduced briefly and their advantages
and disadvantages are discussed.
2.10.1 Desynchronization
In this method, the target system is first implemented as a synchronous system, and then
converted into an asynchronous one. This process includes three main steps [5, 46,53]:
1. Conversion of flip-flops to master-slave latches.
2. Generation of matched delays for the combinational logic.
3. Implementation of latch controllers to control handshakes.
(a) Synchronous circuit
(b) Desynchronized circuit
Figure 2.7: Desynchronization method [5]
An example of a desynchronized circuit is shown in Figure 2.7.
The main advantage of this methodology is simplicity. Designers can use this method
to generate an asynchronous circuit provided that the desynchronization tool is available.
As reported in [5], the design flow is fast and many of the asynchronous benefits are inherited.
In [46], a desynchronized DLX processor is compared against its synchronous counterpart.
The desynchronized processor is faster than its synchronous counterpart under typical
process-voltage-temperature (PVT) conditions.
The main disadvantage of the desynchronization method is the overhead of the com-
munication handshakes required between master and slave latches. As reported in [46],
the performance overhead due to desynchronization is 20% in a DLX microprocessor1.
This timing overhead stems from the delay between the fall of the slave enable signal and
the rise of the master enable signal. Also, the desynchronized processor is 13.44% larger
than the synchronous design and consumes more power.
1The desynchronized processor is 20% slower than its synchronous counterpart under worst-case PVT conditions.
2.10.2 Methods using conventional HDLs
Digital designers use Verilog and VHDL to describe their designs. The advantage of
design methodologies that use these languages is that designers are already familiar
with them.
A drawback of using these languages is that they were not originally designed to describe
asynchrony. Therefore, many basic concepts of asynchronous systems, such as handshakes,
are not supported.
In an experiment by the author, VHDL packages were written to model and implement
asynchronous handshakes. The difficulties encountered include:
• Using proper asynchronous HDL packages does not guarantee the delay insensitivity
of the design.
• Using HDLs without enough experience of asynchronous concepts can easily result
in hazards and deadlocks.
• A synthesizable handshaking package was used with the Altera QuartusII FPGA
compiler. Although the package could be synthesized, post-place-and-route simulations
showed that QuartusII does not understand asynchronous concepts and that a
significant amount of manual work is required to fix the handshakes.
Researchers have acknowledged these difficulties and tried to extend Verilog and
VHDL language structures to include asynchronous concepts. Besides language struc-
tures, a compiler is needed to compile the HDL amendment and synthesize the design into
asynchronous circuits. A group at Epson introduced Verilog+ to support asynchronous
data types and handshakes [4]. The Verilog+ code is translated back to Verilog, then
simulated using a conventional HDL simulator. The compiler to convert Verilog+ code
into Verilog is not complete yet and many steps of the synthesis are performed manually.
The design and synthesis tools are not publicly available.
Other researchers have tried to add packages to conventional HDLs in order to support
specific logic required to describe asynchronous circuits. For example, Null Convention
Logic (NCL) has been used for asynchronous design. The motivation for using NCL
is twofold:
1. It can be added as a package to VHDL.
2. It meets the quasi delay insensitive (QDI) requirements of asynchronous design.
NCL is a logic family with a set phase and a reset phase. In the set phase, the data
lines change from a space holder called NULL to a proper codeword. In the reset phase,
the data lines change back to NULL. NCL has been added as a package to VHDL and
conventional synchronous CAD tools were used to synthesize asynchronous designs [54–56].
2.10.3 Methods using Communicating Sequential Processes languages
Communicating Sequential Processes (CSP) languages are used for programming con-
current systems and designing asynchronous systems. Examples are Tangram and Balsa,
which were introduced by Philips and the University of Manchester respectively. They
support asynchronous concepts and thus, low-level coding is not required to implement
handshakes. Tangram and Balsa are good examples of high-level design languages that
support asynchrony. Balsa is publicly available and its latest version at the time of
writing was released in the summer of 2006 [57]. Some further details about Balsa are
provided here because it is used in the next chapter.
Balsa has a user-friendly interface and its syntax is well suited for asynchronous
design. For example, a handshake channel is implemented using a single line of code.
Special symbols are used to support the concepts of synchronization, sequentiality, and
concurrency. As an example, a Balsa description of a single-place buffer is presented in
Figure 2.8. This Balsa description builds an 8-bit single-place buffer. The circuit requests
a byte from the environment on its input channel i. When new data are available, the
circuit transfers the data to register x. Then, the circuit signals to the environment on its
Figure 2.8: Balsa code of a single-place buffer [6]
output channel o that new data are available and the environment reads the data when
it chooses. This operation is repeated because the handshakes are enclosed in a loop.
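As an illustration of this enclosed, repeating handshake behavior, the buffer's operation may be modeled as an ordered event trace in software. The Python sketch below is an abstraction introduced here for illustration only; the event names (`i.req`, `i.ack`, `o.req`, `o.ack`) are assumed labels and do not come from Balsa's synthesized output.

```python
def single_place_buffer(inputs):
    """Yield the handshake events for each byte moved through register x.

    Each iteration of the for-loop corresponds to one pass of the Balsa
    loop: an input handshake (i -> x) followed by an output handshake
    (o <- x).
    """
    x = None
    for data in inputs:
        # input handshake: request a byte from the environment on channel i
        yield ("i.req", None)
        x = data                  # environment supplies data; store it in x
        yield ("i.ack", x)
        # output handshake: offer the stored byte on channel o
        yield ("o.req", x)
        yield ("o.ack", x)        # environment has consumed the data

events = list(single_place_buffer([0x41, 0x42]))
```

Because the handshakes are enclosed in a loop, the same four-event pattern repeats for every data element passed through the buffer.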
The Balsa description of the buffer is then converted to the handshake circuit shown
in Figure 2.9. Balsa's synthesis is syntax-directed. That is, there exists a one-to-one
circuit representation for each line of the Balsa code.
Figure 2.9: Handshake circuit of the single-place buffer [6]
The filled circles in the figure are active ports that initiate a handshake. Hollow circles
are passive ports that wait for the handshake to start. A handshake channel exists
between two connected ports. Activation is similar to a reset signal that when released,
initiates the operation of the circuit. After activation, the Loop initiates a handshake with
the Sequencer. The Sequencer first issues a handshake to the left-hand Fetch component
(→) causing data to be moved to the Variable element (x). The Sequencer then hand-
shakes with the right-hand Fetch component causing data to be read from the Variable
element. When these operations are complete, the Sequencer completes its handshake
with the Loop element, which starts the cycle again.
The control elements in the circuit, such as the sequencer, have delays that limit
the performance of the system. To improve the performance, several optimizations and
techniques have been proposed. For instance, the technique presented in [58] overlaps
the delay of the handshakes in the channels of a sequencer, improving concurrency and
thus performance.
The handshake circuit is then mapped into the target technology using predefined
implementations of the handshake modules. The user may choose between four-phase
single-rail and dual-rail implementations. The synthesis backend is compatible with both
Xilinx and Cadence physical design tools. The synthesized netlist is a Verilog code in a
target technology. The developers of Balsa claim that adding a new target technology
to Balsa is easy. These features make Balsa an appropriate design tool for asynchronous
research.
Balsa has an integrated development environment (IDE) that includes an editor and a
simulation environment. It includes a graphical simulation environment and an automatic
testbench generator. Balsa has built-in utilities that help in understanding the operation
of the system under design. For example, it shows the handshake circuit after coding the
circuit operation in Balsa (similar to Figure 2.8 and Figure 2.9).
Using Balsa’s design environment helps in understanding the difficulties involved
in asynchronous design, such as deadlocks and hazards. Balsa helps in avoiding dead-
locks by recognizing obvious deadlocks and warning the designer about their existence.
However, it is the designer’s responsibility to avoid deadlocks. Further explanation of
Balsa and the suggested design flow employing it may be found in Chapter 3.
2.11 Concluding Remarks
In this chapter, several challenges in the design and implementation of synchronous sys-
tems are studied including PVT variations and performance and power requirements.
Asynchronous design is briefly reviewed. Several application examples are provided
demonstrating the potential advantages of asynchronous design over synchronous de-
sign. These advantages include better adaptability to PVT variations, lower power con-
sumption and better performance, which can be used to diminish some of the challenges
in today’s implementations. However, the features of asynchronous design have been
hindered by the design and implementation difficulties and by the delay and power con-
sumption of handshakes. In the next chapters, several design techniques to reduce these
difficulties are suggested.
Chapter 3
Application of Concurrency in
Asynchronous Design
3.1 Introduction
Many approaches have been proposed over the years for the synthesis of asynchronous
circuits. Syntax-directed compilers such as Balsa [9], Tangram [11,12] and other synthe-
sis tools for high-level languages such as SpecC [59] are able to synthesize a high-level
description into a gate-level netlist, without a need for the designer to become involved
in implementation details. The synthesis process guarantees correct sequencing of hand-
shake operations by using standardized control elements known as sequencers. The syn-
thesis process also assists the designer in avoiding timing hazards. This approach has
been used successfully in the implementation of large asynchronous systems, such as
ARM-compatible asynchronous processors [49,60].
This chapter examines the potential benefits and trade-offs involved in using edge-
triggered elements where write-after-read (WAR) hazards exist [61, 62]. A WAR hazard
exists in any digital system where a variable is first read and then new data are written
into it, as in first-in-first-out (FIFO) structures and in the implementation of accumu-
lation statements of the form Dst ← f(Dst, Src). If new data are written into the
destination variable before the previous data have been consumed, a WAR error oc-
curs. To avoid WAR errors in asynchronous circuits, variables are implemented as simple
latches and a non-concurrent sequencing mechanism is used to move data. In the case of
accumulation statements, this necessitates the use of master-slave configurations.
Master-slave configurations have also been used in asynchronous pipelines to improve
performance. In [63,64], it is demonstrated that to improve the performance of a pipeline,
the rate of data generation and data consumption should be balanced, which may be
achieved by adding extra buffers to the pipeline.
In this chapter, the focus is on introducing concurrency in the operation of an asyn-
chronous circuit. It is shown that in many cases, using an edge-triggered configuration
instead of a master-slave configuration may lead to increased concurrency in the circuit’s
operation. The proposed ideas are applied to single-rail circuits that use the four-phase
handshake protocol. Similar techniques can be used for dual-rail asynchronous circuits.
In previous studies [58, 65, 66], researchers have shown how to accelerate the down
phase of handshakes by using T-elements in the implementation of sequencers, but em-
phasized that T-elements cannot be used in the presence of WAR hazards [58]. The
WAR hazard should be resolved before using T-elements. The early close scheme and
the interlock scheme are two approaches that have been suggested by Plana and Nowick
to resolve the WAR hazard [67]. They are described in Section 3.3. The approach pre-
sented in this chapter is to use edge-triggering. It is shown that T-elements can be used
in the implementation of sequencers, even in the presence of WAR hazards, by exploiting
edge-triggered storage elements. T-elements serve to break the sequence of handshakes
between different components such as functions, variables and multiplexers, resulting in
a shorter down phase. Timing analysis is presented to quantify the achievable gain in
speed and explain the simulation results. Guidelines are also given to assist designers in
applying the proposed approach to other handshake circuits.
The degree of concurrency enabled by different types of sequencing elements is re-
viewed briefly below as it is a key factor in determining the speed of the synthesized
circuit. Subsequent sections present handshake structures that use edge-triggering to im-
plement WAR operations and describe the trade-offs involved. Special attention is paid
to the synthesis of accumulation statements because of their importance in all forms of
digital processing.
The synthesis techniques proposed are amenable to syntax-directed compilation. The
experimental results given have been derived by simulating circuits synthesized by Balsa
and implemented by Synopsys Design Compiler. Synopsys Power Compiler was used for
simulation-based power analysis.
3.2 Background
3.2.1 Handshake circuits
Handshake circuits are represented by diagrams composed of handshake components such
as variables, functions and controls, with active and passive ports connected via push
or pull channels. Active ports initiate a handshake by sending a request signal, and
passive ports acknowledge the handshake. In a push channel, the sender sends data and
communicates with the receiver by generating a request signal, whereas in a pull channel,
the receiver initiates the handshake by requesting data. Not all channels carry data. Some
channels are purely for control and are referred to as sync or activation channels. Further
details about handshake circuits and conventions can be found in [9, 12,58].
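The push/pull distinction above can be summarized as two event orderings. The sketch below is illustrative only (the signal names and list-of-events representation are assumptions, not any tool's semantics): in a push channel, valid data precedes the sender's request; in a pull channel, the receiver's request precedes the data, which arrives with the acknowledge.

```python
def push_transfer(data):
    # Push channel: the sender initiates. Data is valid on the bundled
    # wires before the request rises; the receiver latches and acknowledges.
    return [("data", data), ("req", 1), ("ack", 1), ("req", 0), ("ack", 0)]

def pull_transfer(data):
    # Pull channel: the receiver initiates by requesting data; the data
    # accompanies the sender's acknowledge.
    return [("req", 1), ("data", data), ("ack", 1), ("req", 0), ("ack", 0)]
```

Both traces end with the return-to-zero of request and acknowledge, which is the down phase of a four-phase handshake.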
3.2.2 Sequencers
Sequencers are components that control the timing of events in an asynchronous system.
Upon receiving an activation signal, Act, the two-step sequencer in Figure 3.1 triggers
a handshake on channel C1 followed by a handshake on channel C2. Two types of
sequencers have been proposed in the literature, which we refer to as S and T sequencers.
Their behavior is illustrated in Figure 3.2. An S-sequencer waits for the handshake on
channel C1 to be completed before starting the handshake on channel C2 [12]. A T-
Figure 3.1: Sequencer
(a) S-sequencer behavior (b) T-sequencer behavior
Figure 3.2: Comparison of S-sequencer and T-sequencer behavior
sequencer starts the handshake on channel C2 once the up phase of the handshake on
channel C1 is completed [58]. Since the activities on C2 proceed in parallel with the
down phase on C1, a T-sequencer introduces concurrency in the circuit’s operation.
The core components of the S and T-sequencers are the S and T-elements, respectively.
The T-element is shown in Figure 3.3. Following a request on Channel A, it initiates the
up phase of the handshake on Channel B by issuing B Req. When it receives B Ack,
it removes B Req and at the same time issues A Ack, thus allowing the two handshake
operations to proceed to completion concurrently. Implementation details of the S and
T-elements and the corresponding sequencers are available in [12,58].
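The concurrency difference between the two sequencer types can be captured in a simplified timing model. The sketch below is an abstraction for illustration: it ignores the internal delays of the sequencer elements themselves and considers only the up and down phase delays of the handshakes on channels C1 and C2.

```python
def s_sequencer_cycle(c1_up, c1_down, c2_up, c2_down):
    # S-sequencer: the handshake on C2 starts only after the handshake on
    # C1 (both phases) has completed, so all four delays are additive.
    return c1_up + c1_down + c2_up + c2_down

def t_sequencer_cycle(c1_up, c1_down, c2_up, c2_down):
    # T-sequencer: C2 starts once the up phase on C1 completes, so the
    # down phase on C1 proceeds in parallel with the handshake on C2.
    return c1_up + max(c1_down, c2_up + c2_down)
```

For any non-negative delays, the T-sequencer cycle is never longer than the S-sequencer cycle; the gain equals the portion of C1's down phase hidden behind the C2 handshake.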
(a) Symbol (b) Structure level (c) Behavior
Figure 3.3: T-element
3.3 WAR Hazards
Figure 3.4 illustrates a simple FIFO circuit. Each arrival of the activation signal, Act,
transfers data from V ar1 to V ar2 and from the input to V ar1. The circuit exhibits
a WAR hazard. The first command, C1, reads the content of variable V ar1 and saves
it in V ar2, and the second command, C2, writes new data into V ar1. If the data
input to V ar2 is still enabled, the new data would be incorrectly written into V ar2 as
well. Existing asynchronous synthesis methods realize variables with transparent latches
[5,12,22] and use an S-sequencer to guard against the WAR hazard [58]. An S-sequencer
ensures that the down phase on C1 has been completed, thus also ensuring that the input
to V ar2 has been disabled, before starting the C2 command. A T-sequencer cannot be
used in this case, because it would issue command C2 before C1 is completed, thus
leaving the WAR hazard unresolved. Note that the transferrer unit (→) consists solely
of port-to-port wire connections and has no timing implications.
Figure 3.4: An example of the WAR hazard
The sequential execution of handshakes imposed by the S-sequencer places an upper
limit on the speed of operation of a circuit. Provided some means for handling the WAR
hazard is available, the speed of operation can be increased by introducing concurrency.
In what follows, the early close scheme and the interlock scheme by Plana and Nowick [67]
to resolve the WAR hazard are reviewed. Then, the use of edge-triggering to avoid WAR
is studied.
The early close scheme adds an extra mechanism to the control circuit, which closes
the destination latch V ar2 before writing new data into V ar1. With some reasonable tim-
ing assumptions, this method makes it possible to use a T-sequencer. An 85% throughput
improvement has been achieved in a dual-rail implementation of an 8-stage shift regis-
ter [67], compared to the original design with S-sequencers. The power-delay product
was roughly the same.
In the interlock scheme, writing into the source latch V ar1 is stalled until V ar2 is
opaque again. This method guarantees correct operation, but the speed improvement
obtained by employing a T-sequencer is limited.
The approach proposed in this thesis to avoid the WAR hazard is to replace trans-
parent latches with edge-triggered flip-flops. An edge-triggered flip-flop occupies more
area and consumes more power than a latch. Thus, the choice between edge-triggered
flip-flops and normal latches amounts to a trade-off among the three parameters of speed,
area and power. In what follows we show that there are many situations in which the
trade-offs involved are strongly in favor of using edge-triggered circuits.
If the variables in the structure of Figure 3.4 are implemented as edge-triggered flip-
flops, the rising edge of the up phase of command C1 records the data into V ar2 then
immediately disables its inputs, making the down phase redundant. Therefore, the down
phase of C1 can be executed concurrently with the up phase of C2 without giving rise to
a WAR hazard. This means that a T-sequencer may be used instead of an S-sequencer.
Figure 3.5 shows a variable that employs edge-triggering. The write request signal
serves as the clock. The buffer element introduces some delay before generating the write
acknowledge signal to allow sufficient time for the data to be stored in the flip-flops. The
Figure 3.5: Edge-triggered-based variable
read request generates a read acknowledge immediately, because the output of the flip-
flop is always enabled. As with synchronous circuits, the designer must ensure that the
setup and hold times of the flip-flops are met.
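The behavior of this edge-triggered variable can be summarized in a small model. The class below is an illustrative sketch, not the actual implementation; the names and the 200 ps default acknowledge delay are assumptions chosen for the example.

```python
class EdgeTriggeredVar:
    """Illustrative model of the edge-triggered variable in Figure 3.5."""

    def __init__(self, ack_delay_ps=200):
        self.q = None                     # flip-flop contents
        self.ack_delay_ps = ack_delay_ps  # buffer delay before Write_Ack

    def write(self, data, t_req_ps):
        # The rising edge of Write_Req clocks the data in and immediately
        # disables the inputs, so no down phase is needed to protect the
        # stored value. Write_Ack is issued after the buffer delay.
        self.q = data
        return t_req_ps + self.ack_delay_ps

    def read(self, t_req_ps):
        # The flip-flop outputs are always enabled, so a read request is
        # acknowledged immediately with the stored data.
        return self.q, t_req_ps
```

Note that the model makes the key property explicit: the write is complete at the request's rising edge, which is what allows the down phase of the preceding command to proceed concurrently.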
To illustrate the trade-offs involved in the proposed approach, a 16-bit, 5-stage FIFO
circuit was synthesized as shown in Figure 3.6. It uses a bundled-data protocol and the
edge-triggered circuit of Figure 3.5 to implement the variables. The four-step sequencer
is implemented using three 2-step T-sequencers, as shown. Post-synthesis simulation
Figure 3.6: Five-stage FIFO
showed that the resulting concurrency led to a circuit 2.3 times faster than the cir-
cuit employing an S-sequencer and latches. This substantial increase is due in part to
the concurrency introduced by the T-sequencer. Also, the T-sequencer is a faster and
smaller circuit that presents lower electrical loads to its environment compared to the
S-sequencer. The increase in speed in this case comes at the expense of an increase in
area and power consumption. The circuit area increased by 52% and the power-delay
product by 3%. Implementation details are described in Section 3.7. Compared to the
results from the early close scheme approach, edge-triggering provides a better speed
improvement. However, the early close scheme uses latches, which are smaller than edge-triggered
flip-flops.
This simple example clearly shows that the combination of edge-triggered flip-flops
and T-sequencers has the potential for significant increases in speed, but careful attention
needs to be paid to circuit area and power consumption where these are important
considerations. In what follows, we show that gains in both speed and power are possible
in many commonly encountered circuit configurations.
3.4 Using Edge-Triggering in Accumulator Circuits
The case of an accumulation statement, Dst ← f(Dst, Src), offers opportunities for
performance enhancement, where improvements in both speed and area are possible.
An accumulator loop is often synthesized using transparent latches, as shown in Fig-
ure 3.7. This is the implementation generated automatically by the Balsa compiler [6]. To
Figure 3.7: Current synthesis of Dst← f(Dst, Src) in Balsa
avoid the WAR hazard, the Auxiliary-Dst and Dst variables constitute a master-slave
structure, and an S-sequencer must be used to ensure correct data transfers. Follow-
ing the arrival of an activation signal, command C1 causes the result of function f to
be stored in the auxiliary variable. When this step is completed and the input of the
auxiliary variable has been disabled, command C2 transfers the result to the destination
variable Dst.
Several variations on this basic configuration are discussed below using the four-phase
single-rail protocol. To assess the impact of the changes, two test circuits are used. The
first is a minimal system containing a single 16-bit accumulator. The system accepts
successive input data elements, activates the accumulation loop, then sends out the
accumulation results. The second system is an 8-bit by 8-bit radix-4 Booth multiplier.
In the ensuing discussion, the implementations based on the configuration of Figure 3.7
are used as a reference for comparison. Changes in speed, area and power-delay product
are expressed as a percentage of the corresponding parameters for the reference circuit.
Complete simulation details are presented in Section 3.7.
In some cases, it is possible to design handshake circuits that take advantage of the
auxiliary variable to perform useful tasks. For instance, the auxiliary variable may be
shared between different operations and routed to different destination variables [68].
In other cases, the master and slave variables may be replaced with one edge-triggered
register. In the latter case, the sequencer is no longer needed, and the accumulator circuit
may be simplified to the configuration in Figure 3.8. A request signal on the activation
Figure 3.8: Revised accumulator circuit
channel is forwarded to function f and it propagates from there to the read ports of Src
and Dst. When the values of Src and Dst have been read and processed, the function
unit sends an acknowledge signal to the transferrer module. This signal becomes the
request to the write port of variable Dst, and the signal’s rising edge clocks the data into
it.
Simulation results for the 16-bit accumulator test circuit using the configuration in
Figure 3.8 showed a 44% increase in speed, 9% decrease in area and 22% decrease in the
power-delay product compared to the reference circuit of Figure 3.7. In the case of the
multiplier, the circuit based on Figure 3.8 was 16% faster than that based on Figure 3.7
and had 7% less area and power-delay product.
It should be noted that these improvements are possible because edge-triggering is used.
In comparison, the early close scheme may be used to improve the performance
of the original accumulation configuration of Figure 3.7 by allowing the use of
a T-sequencer instead of the S-sequencer. However, it results in a larger circuit because
both the auxiliary latch and the sequencer are needed.
3.5 Introducing Concurrency
There are three channels in Figure 3.8, Act, Func and Write. The handshakes associated
with these channels occur sequentially, and hence, the delays involved are additive, as
illustrated in Figure 3.9. The total delay for a single accumulation operation is Fup +
Wup + E + Fdown + Wdown, where E is the delay of the down phase of the environment
that issues the activation signal, Fup is the up phase delay of the handshake on the Func
channel and Wup is the up phase delay of the Write channel. Similarly, Fdown and Wdown
are the down phase delays of the handshakes on channels Func and Write. The speed
of operation can be increased if it is possible to break this sequence of handshakes into
two parts that can be overlapped. This may be achieved by inserting a T-element in one
of the channels, provided, of course, that the integrity of the data transfer operation is
not compromised.
Figure 3.9: Timing diagram for the circuit in Figure 3.8
Figure 3.10 shows the case where the T-element is inserted in the Func channel.
Because this channel carries data, the T-element is augmented by a wired connection
on the data paths resulting in the element shown in Figure 3.11, which will be referred
to as a T-isolator. The operation of the circuit is illustrated by the timing diagram
in Figure 3.12. An activation request results in a request on the Inter channel then
the Func channel after a delay T through the T-isolator. When the function receives
the request and subsequently generates an acknowledgment, the T-isolator removes the
request and, at the same time, sends a request to the destination variable through the
transferrer. Thus, the down phase of the function proceeds in parallel with both the
Write operation at Dst and the completion of the handshake on the activation channel.
We have assumed for simplicity that the T-element has the same delay, T , for both the
up and down phases and for both of its ports.
Figure 3.10: Inserting a T-element in the Func channel
Figure 3.11: The T-isolator
The diagram of Figure 3.12a shows the case Fdown < Wup +E. The total delay for one
data operation in this case is 3T + Fup + Wup + E + Wdown. When Fdown is greater than
Figure 3.12: Timing diagram for the circuit in Figure 3.10: (a) when Fdown < (Wup + E); (b) when Fdown > (Wup + E)
Wup + E, Figure 3.12b, the delay is 3T + Fup + Fdown + Wdown. Compared to Figure 3.9,
the reduction in the total delay for one data operation is given by:
Reduction in delay =
    Fdown − 3T          if Fdown < (Wup + E)
    (Wup + E) − 3T      if Fdown > (Wup + E)          (3.1)
Equation (3.1) shows that a net reduction in delay will be achieved whenever the delay
resulting from the insertion of the T-element (3T ) is less than both Fdown and Wup + E.
Because of the early removal of the request signal to the function, the configuration in
Figure 3.10 is allowable only in cases in which the function is a combinational circuit that
will not change its output data until the data in either Src or Dst are changed. Variable
Src is controlled by the environment and is assumed not to change until the activation
handshake is completed. Variable Dst will not change until it receives the rising edge of
the write request.
Inserting the T-isolator in the Write channel yields the circuit in Figure 3.13, and
the corresponding timing diagram is given in Figure 3.14. In this case, the down phase of
the Write channel is overlapped with both the environment’s delay and the down phase
of the function. The resulting reduction in delay is given by (3.2).
Reduction in delay =
    Wdown − 3T          if Wdown < (E + Fdown)
    (E + Fdown) − 3T    if Wdown > (E + Fdown)        (3.2)
The third possibility is illustrated in Figure 3.15 and the timing diagram is in Fig-
ure 3.16. When an activation signal is received from the environment, one data transfer
takes place within the accumulator loop. Then, upon receiving an acknowledge signal
on the Mid channel, the T-element removes its request and simultaneously sends an ac-
knowledge signal to the outside environment. Thus, the completion of the handshake on
the activation channel proceeds in parallel with the down phase of the handshakes inside
Figure 3.13: Inserting a T-element in the Write channel
Figure 3.14: Timing diagram for the circuit in Figure 3.13: (a) when Wdown < (E + Fdown); (b) when Wdown > (E + Fdown)
Figure 3.15: Inserting a T-element in the Act channel
the accumulation loop. The reduction in delay in this case is given by (3.3).
Reduction in delay =
    E − 3T                  if E < (Fdown + Wdown)
    (Fdown + Wdown) − 3T    if E > (Fdown + Wdown)    (3.3)
It should be noted that the same delay (T ) has been used for different transactions
through the T-element to simplify the figures and expressions. The term 3T represents
the sum of all the delays introduced by the T-element during a complete data transfer
operation.
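Expressions (3.1)–(3.3) share a common form: in each case the reduction equals the smaller of the two overlapped delays minus the T-element overhead 3T. This observation can be written directly (a sketch for illustration; the function and parameter names are chosen here for readability and are not part of the synthesis flow):

```python
def delay_reduction(overlap_a_ps, overlap_b_ps, t_ps):
    # Common form of (3.1)-(3.3): the smaller of the two overlapped delays
    # bounds the achievable gain, and 3T is the total overhead introduced
    # by the T-element during one data transfer.
    return min(overlap_a_ps, overlap_b_ps) - 3 * t_ps

# Insertion on Func  : delay_reduction(F_down, W_up + E, T)
# Insertion on Write : delay_reduction(W_down, E + F_down, T)
# Insertion on Act   : delay_reduction(E, F_down + W_down, T)
```

A positive return value indicates a net speed-up; a negative value means the 3T overhead outweighs the overlap, so the insertion would slow the circuit down.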
3.6 System Timing Optimization
The benefit of introducing concurrency as well as the choice of the optimum location
for inserting a T-element can be assessed using expressions (3.1)–(3.3). The placements of the
T-elements in Figures 3.10, 3.13 and 3.15 differ in the two delays being overlapped, the
smaller of which determines the possible gain in speed. Clearly, the optimum location
for inserting the T-element is the location where that delay is as large as possible.
The three possible insertion points produce three pairs of overlapped delays as shown
Figure 3.16: Timing diagram for the circuit in Figure 3.15: (a) when E < (Fdown + Wdown); (b) when E > (Fdown + Wdown)
in Table 3.1. The optimum insertion point can be readily determined when the values of
these delays are known. In most cases, using a T-element to isolate the circuit element
having the longest delay yields the optimum or near optimum result, as the examples
below illustrate.
Table 3.1: Overlapped delays for different insertion points
Location Overlapped delays
Func Wup + E || Fdown
Write E + Fdown || Wdown
Act Fdown + Wdown || E
3.6.1 System examples
The proposed methodology has been examined using the bundled-data 16-bit accumu-
lator and the 8-bit by 8-bit multiplier test circuits. First, the accumulator was tested
inside a simple test circuit that generates repeated data requests. Handshake delays in
Balsa implementations depend on the data in the corresponding data channels, and as
such they vary from one transaction to the next. Variability is on the order of ± 5%.
The delay values obtained from simulations for one data transaction in the accumulator
test circuit are:
Fup = 1795 ps , Fdown = 1362 ps
E = 487 ps
Wup = 539 ps , Wdown = 584 ps
T ≃ 200 ps
The total time for the transaction in Figure 3.9 based on these values is 4767 ps. Exam-
ination of the overlapped delays in Table 3.1 shows that the largest possible reduction
in delay is obtained when the T-element is inserted on the Func channel. Using (3.1),
the delay reduction in this case is (Wup + E − 3T ) = 426 ps. The other scenarios of
T-element insertion increase the circuit delay. If a T-element is inserted on the Write
channel, the delay change is (Wdown − 3T ) = −16 ps according to (3.2). Similarly, the
insertion of a T-element in the Act channel changes the delay by (E − 3T ) = −113 ps
according to (3.3).
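As a sanity check, the selection rule of Table 3.1 and the delay values quoted above can be encoded in a few lines. This is an illustrative Python sketch, not part of the thesis tool flow: the reduction at each insertion point is the smaller of the two overlapped delays minus 3T.

```python
# Delay reduction for a T-element insertion point: the smaller of the
# two overlapped delays (Table 3.1) minus the 3T T-element overhead.
def reduction(overlap_a, overlap_b, t):
    return min(overlap_a, overlap_b) - 3 * t

# Accumulator delays in ps, from the simulation values quoted above.
Fup, Fdown = 1795, 1362
E = 487
Wup, Wdown = 539, 584
T = 200

reductions = {
    "Func":  reduction(Wup + E, Fdown, T),    # 426 ps, per (3.1)
    "Write": reduction(E + Fdown, Wdown, T),  # -16 ps, per (3.2)
    "Act":   reduction(Fdown + Wdown, E, T),  # -113 ps, per (3.3)
}

# Only the Func channel yields a positive reduction.
best = max(reductions, key=reductions.get)    # "Func"
```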
Simulation results for the case of a T-element inserted on the Func channel showed a
reduction in delay of 362 ps. The difference between the expected and observed delays is
a result of the changes in port loading when the circuit configuration changes. Also, the
value of T used in the analysis is an approximate average for all the T-element delays.
The delays of different transactions on the ports of the T-element are within ± 10 ps of
the 200 ps given above, with the delay from B_Ack+ to B_Req− in Figure 3.3c being
the largest.
Improvements in the overall performance are affected by both the accumulator loop
delay and the delays in its test environment. The time of 4767 ps of the handshake
operation in Figure 3.9 represents about 60% of the delay between successive data trans-
actions. The remainder of the delay is introduced by the test circuit that loads new data
in the Src variable and sends a new activation signal. Insertion of the T-element in the
accumulation loop resulted in an overall increase in speed of about 4%. The accumulator
circuit alone was 50% faster than the original reference circuit of Figure 3.7 and had a
23% lower power-delay product. Also, simulations confirmed that the insertion of the
T-element at locations other than the Func channel resulted in lower performance.
The multiplier was used to test the gain in performance in a larger circuit. There are
three accumulator loops, ACC1, ACC2 and ACC3, inside the multiplier, as shown in
Figure 3.17. The delay values for the accumulators are given in Table 3.2.
Table 3.2: Delay values in ps for the accumulators inside the multiplier
        ACC1   ACC2   ACC3
Fup     1292   2340    457
Fdown    917   1786    686
E       1699   2173    406
Wup      706   2110   1520
Wdown    798   1572   1325
T        200    200    200
[Figure: the three accumulation loops ACC1, ACC2 and ACC3 inside the multiplier, with their Act, Func and Write channels and the T-sequencer]
Figure 3.17: Accumulation loops inside the multiplier
The optimum place to insert a T-element for each accumulator and its environment
can be determined using Table 3.1. Because of the long delays in the environment of
ACC1 and ACC2, the optimum insertion points are on the activation channels Act1 and
Act2. The resulting reduction in delay is 1099 ps (20%) for ACC1 and 1573 ps (16%)
for ACC2.
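The same rule applied to the Act channels, with the ACC1 and ACC2 delays taken from Table 3.2, reproduces the quoted reductions (again an illustrative sketch, not thesis code):

```python
# Act-channel insertion overlaps (Fdown + Wdown) with E (Table 3.1);
# the reduction is the smaller of the two minus 3T.
def act_reduction(E, Fdown, Wdown, T=200):
    return min(Fdown + Wdown, E) - 3 * T

acc1 = act_reduction(E=1699, Fdown=917, Wdown=798)    # 1099 ps
acc2 = act_reduction(E=2173, Fdown=1786, Wdown=1572)  # 1573 ps
```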
The best place to insert a T-element for ACC3 is on the Write3 channel. However,
this accumulator is connected to port 1 of the T-sequencer, which means that its operation
is already overlapped with its environment. According to Figure 3.2, the T-sequencer
overlaps the down phase of Act3 with both the down phase of Act1 and the delay of
the subsystem issuing ACT . Thus, the down phase delay of Act3 is overlapped with
a larger delay and decreasing it is not useful. Furthermore, inserting a T-element in
channel Write3 increases the up phase delay of Act3 by 2T , thus reducing the overall
performance.
Insertion of T-elements on the activation channels of ACC1 and ACC2 resulted in a
10% speed improvement, at the expense of a slightly higher power-delay product. As in
the case of the simple accumulator, insertion of the T-elements at other locations resulted
in lower performance.
3.7 Experimental Methodology and Results
The synthesis and simulation flow used in testing the circuits presented in this chapter
is shown in Figure 3.18. Balsa was used to design and synthesize the test circuits, and
technology-dependent optimizations were performed using Synopsys. First, circuits were
described using the Balsa language and tested for functionality using Balsa’s simulation
tool. Then, Balsa converted the Balsa description of the design into a technology sup-
ported by Balsa. Balsa’s output is a Verilog netlist, which was converted to a generic
Verilog file using simple scripts. The generic file was fed to Synopsys along with appro-
priate scripts and the Synopsys design constraints (SDC) file to prevent it from omitting
delay elements. Synopsys Design Compiler synthesized the generic Verilog netlist to the
desired technology and optimized the circuit for speed and area.
Synopsys Design Compiler also creates a delay file in the standard delay format (SDF).
The delay file created by Design Compiler was fed to the simulator, ModelSim, along with
the gate-level netlist for post-synthesis simulations. ModelSim was also used to record
the switching activity of the circuit in the switching activity interchange format (SAIF).
Then, the netlist, SAIF file and technology libraries were fed to Synopsys Power Compiler
for average power analysis.
Table 3.3 summarizes the post-synthesis simulation results for various implementa-
tions of the circuits described earlier in the chapter — a 16-bit 5-stage FIFO, a 16-bit
accumulator and an 8-bit by 8-bit radix-4 Booth multiplier. They were all implemented in
180-nm TSMC technology. The Booth multiplier implementation code in Balsa is shown
as an example in Appendix B. The three accumulators in the multiplier implementation
correspond to the three accumulation statements, which are identified by comments in
the code.
[Design-flow diagram: Balsa description → Balsa simulation and Balsa synthesis (handshake circuits) → conversion to generic Verilog netlist → Synopsys Design Compiler with Synopsys design constraints (technology-dependent Verilog netlist, synthesized Verilog and SDF) → ModelSim timing/functional simulation (SAIF) → Synopsys Power Compiler (power analysis results)]
Figure 3.18: Design flow
For each test circuit, a reference configuration was chosen, as described before. The
average speed and power were obtained by generating 100 random data and the same
data were applied to all implementations. The power-delay product was calculated by
multiplying the average power of the system at the highest performance by the average
delay between two consecutive inputs. Thus, the power-delay product obtained was the
average required energy for processing each input datum.
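This energy calculation is a one-liner; the values below are illustrative placeholders, not measurements from Table 3.3. Average power in mW multiplied by average input-to-input delay in ns gives energy per datum in pJ.

```python
# Hypothetical values for illustration only (not from the thesis tables).
avg_power_mw = 2.5    # average power at the highest performance, mW
avg_delay_ns = 4.767  # average delay between consecutive inputs, ns

# mW * ns = 1e-3 W * 1e-9 s = 1e-12 J, i.e. pJ per processed datum.
energy_pj = avg_power_mw * avg_delay_ns
```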
Table 3.3: Simulation results
                                                                     Relative  Relative       Relative
Circuit       Name   Description                                     area      average speed  power-delay product
FIFO          FIFO1  Using transparent latches                       1         1              1
              FIFO2  Using edge-triggered flip-flops                 1.524     2.305          1.035
Accumulator   ACC1   Balsa implementation, using master-slave        1         1              1
                     latches (Figure 3.7)
              ACC2   Using edge-triggering as in Figure 3.8          0.912     1.439          0.782
              ACC3   ACC2 with optimum insertion of T-element        0.922     1.502          0.774
Radix-4       MUL1   Using master-slave latches as in Figure 3.7     1         1              1
Booth         MUL2   Using edge-triggering as in Figure 3.8          0.935     1.160          0.936
Multiplier    MUL3   MUL2 with optimum insertion of T-elements       0.945     1.274          0.954
              MUL4   MUL2 with non-optimum insertion of T-elements   0.945     1.235          0.959
The results in Table 3.3 show the trade-offs possible between latch-based and edge-triggered
implementations of WAR operations. In the case of a FIFO, edge-triggered flip-flops pro-
vide a much faster circuit at the expense of a larger area and more power. For accumulation
statements, edge-triggering wins over latch-based implementation in speed, area and energy
consumption. In handshake circuits with accumulation statements, it is possible to increase
concurrency and speed further by inserting T-elements at the optimum places found using Ta-
ble 3.1, as exemplified by ACC3 and MUL3. Multiplier MUL3 is 27% faster than the reference
circuit and has 5% less area and power-delay product. Multiplier MUL4 demonstrates the case
where T-elements are inserted in non-optimum places, Func1 and Func2, and as a result it is
slower than MUL3.
In order to generalize the results from the specific 8-bit and 16-bit examples, the sizes of S-
and T-elements and different types of latches and edge-triggered flip-flops are given in Table 3.4
for the technology used in the experiments. The storage elements used in the experiments are
highlighted in the table.
Table 3.4: Area of storage and control elements
Circuit                              Area (µm²)
S-element (two-step S-sequencer)     81.31
T-element (two-step T-sequencer)     52.80
Active-low-enable latch              42.28
Resettable active-low-enable latch   57.66
Resettable negative-edge flip-flop   73.04
Positive-edge flip-flop              57.66
Resettable positive-edge flip-flop   73.04
Balsa normally uses active-low-enable latches, and generates the control circuit correspond-
ingly. Hence, negative-edge-triggered flip-flops were used to replace the latches in order to keep
the control logic intact. The negative-edge-triggered flip-flop provided in the standard cell li-
brary used in this chapter is resettable. However, in most of our test circuits, the reset of the
flip-flop is tied high as it is not needed.
The increase in area in the edge-triggered implementation of the FIFO can be explained
by the data in Table 3.4. In the FIFO circuit, 16 non-resettable latches were replaced with 16
resettable flip-flops, and three 2-step S-sequencers were replaced with three 2-step T-sequencers
(see Figure 3.6). Thus, the total area was increased. In the case of the accumulator, the storage
elements had to be resettable to clear the accumulator initially. Therefore, 32 resettable latches
were replaced with 16 resettable flip-flops, and a two-step S-sequencer was replaced with a T-isolator
(see Figure 3.8 and Figure 3.10), which has the same size as a T-element. Since the area of two
resettable latches is larger than the area of a single resettable flip-flop, the area of the circuit
is reduced. The same type of analysis applies to the multiplier circuit, in which each pair of
non-resettable latches in the accumulation loops was replaced with a single flip-flop.
A resettable flip-flop is significantly larger than a non-resettable flip-flop, as shown
in the last two rows of Table 3.4. Smaller and thus lower-power circuits can be obtained
by using a standard cell library featuring non-resettable negative-edge flip-flops or by modifying
Balsa to generate control signals for positive-edge flip-flops. Another possible improvement is
to use faster C-elements. The C-elements in the architecture of T-elements were implemented
by logic gates as the test circuits were synthesized using standard cells available in the library.
Hence, better results would be expected if the T-element or C-element are part of the library
or if a custom layout design is used.
3.8 Conclusion
This chapter demonstrated that the use of edge-triggering in the synthesis of handshake circuits
is an effective means for avoiding the write-after-read hazard and offering a range of trade-offs
among speed, area and power. Significant speed improvements in three different test circuits
have been demonstrated. With edge-triggering it becomes possible to use T-elements to intro-
duce concurrency in the circuit’s operation, thus increasing speed further. The speed of a simple
16-bit accumulator circuit increased by 50%, accompanied by a reduction in the power-delay
product of 23%.
The introduction of concurrency using T-elements leads to a significant reduction in the
penalty associated with the down phase of 4-phase handshake circuits. Criteria for optimized
placement of the T-elements have been proposed and tested in a multiplier circuit implemented
using Balsa and Synopsys. In all cases, the proposed circuits are compatible with the require-
ments of syntax-directed compilation.
Chapter 4
Enhanced Synchronous Design
4.1 Introduction
Asynchronous circuits have unique features that can resolve or diminish many of the urgent
problems of today’s applications in the deep nanometer regime. Compared to synchronous
design, these features include the ability to adapt to process and environmental variations, potential
average-case rather than worst-case performance, lower noise and lower power consumption.
However, these advantages are not readily available to designers because of the difficulties
involved in the design of fully asynchronous systems.
Asynchronous circuits depend on fine-grained handshakes for timing and the overhead of
the control circuits to implement handshakes is significant. That is, in many cases, the
control circuits become the bottleneck of the system, resulting in degraded performance. The
examples in Chapter 3 demonstrate the overhead involved.
Other asynchronous methods such as desynchronization, though easier to design, also suffer
from asynchronous control overhead. In the desynchronization method, a master-slave configu-
ration replaces edge-triggered flip-flops to allow fully asynchronous operation between pipeline
stages. However, the delay between the fall of the slave enable signal and the rise of the master
enable signal results in a 20% performance overhead in a DLX microprocessor [46]. Also, the
desynchronized processor is 13.44% larger than the synchronous design and consumes more
power, especially leakage power.
A synchronous circuit depends on a single control signal, the clock, for the timing of its
operation. This often results in lower control overhead compared to asynchronous circuits.
However, in most synchronous systems, the clock signal has a fixed frequency determined by
the worst-case process-voltage-temperature (PVT) analysis of the most critical path. Hence,
performance is limited by worst-case parameters, even though the system may be operating
under more favorable conditions most of the time.
This thesis proposes a hybrid approach that combines the best features of both synchronous
and asynchronous systems. It is shown that the clock signal can be controlled dynamically using
asynchronous logic. While the main core of the system remains synchronous, the controlling
circuit that generates the clock signal is asynchronous logic. Thus, many beneficial features of
asynchronous design are brought to the synchronous environment. At the same time, the ease of
design and low control overhead of synchronous systems are retained. The resulting system will
be referred to as a PVT-aware self-tuning system. It is able to tune the timing of its operations
to produce the best-possible results under the prevailing PVT conditions. Accordingly, the
design approach is referred to as PVT-aware self-tuning design.
Although the proposed architecture benefits from its asynchronous nature, the use of asyn-
chronous design is limited to the clock generation circuit, with the rest of the system being a
synchronous circuit that can be designed, synthesized, and laid out using a well-established and
understood design flow. As will be shown, the whole system, including its asynchronous clock
generation part, can be implemented using conventional tools and standard cells. It does not
involve any asynchronous design overhead, except for a pre-designed clock generation circuit.
This chapter provides a high-level overview of the proposed PVT-aware design. Detailed
implementations, and the corresponding design flows are presented in Chapter 5 and Chap-
ter 6. Chapter 5 shows how the PVT-aware design approach can be used to reduce the power
consumption of a high-speed circuit. Chapter 6 builds on the proposed PVT-aware design
and adjusts the clock frequency of a pipelined system according to the operations taking place
in the pipeline, in addition to the PVT conditions. The objective is to increase the speed without
significantly increasing area and energy consumption.
4.2 PVT-aware Design Approach
The key component of a PVT-aware self-tuning system is its clock generation circuit, which is
shown in Figure 4.1. The chip area is divided into multiple regions and a PVT-aware completion
detection circuit is included in each region. After receiving a clock pulse from the clock pulse
generator at the center, the completion detection circuit introduces a delay that matches the
delay of the critical path of the system, then sends a completion signal to the clock pulse
generator. The clock pulse generator waits until it receives all completion signals before
issuing a new clock pulse. Thus, the period of each clock cycle is determined by the longest
delay introduced by the completion detection circuits.
[Figure: clock pulse generator (CPG) at the center; completion detectors CD1–CD4 in Regions 1–4 form Loops 1–4, exchanging clock pulses and completion signals]
CD ≡ Completion Detector, CPG ≡ Clock Pulse Generator
Figure 4.1: Clock generation design
The clock generation circuit is redrawn in Figure 4.2 for better clarity. A completion
detection circuit together with the clock pulse generator forms a loop whose delay should
match the delay of the critical path of its region. The delay of the loop is tuned using a
static timing analysis (STA) tool.

[Figure: the four clock generation loops through CD1–CD4 and the CPG]
CD ≡ Completion Detector, CPG ≡ Clock Pulse Generator
Figure 4.2: Clock generation loops
Multiple completion detection circuits are placed at different regions, which may be subject
to different temperatures and voltages at different times. Also, the fabrication process may
result in parameter variations from one area to another. As PVT conditions change, the delay
of the loop (composed of the completion detection circuit and the clock pulse generator) tracks
the changes in the delay of the corresponding critical path. Since the clock generation circuit
waits for all completion signals, the loop having the longest delay in each clock cycle determines
the clock period.
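The wait-for-all behavior described above can be captured in a toy model (illustrative Python with hypothetical loop delays in ps, not measured values): the period synthesized for each cycle is simply the longest completion-detector loop delay under the current PVT conditions.

```python
# The clock pulse generator waits for all completion signals, so the
# period of one cycle equals the longest completion-detector loop delay.
def clock_period(loop_delays_ps):
    return max(loop_delays_ps)

# Four regions; hypothetical delays. In the second scenario region 3 is
# momentarily hot and therefore slow, and the clock stretches automatically.
nominal = [950, 940, 960, 945]
hot_region3 = [950, 940, 1100, 945]

assert clock_period(nominal) == 960
assert clock_period(hot_region3) == 1100
```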
4.3 Solving Real-world Problems
As technology scales down to smaller feature sizes, it becomes more difficult to predict the
quality of the fabricated chips. In addition to process variations, voltage and temperature
variations should also be accounted for during design. In the traditional multi-corner static
timing analysis approach, many PVT corners, including the worst-case corner, are used
to test the timing of the design and to ensure a reasonable chip yield. Because this
approach designs for the worst-case scenario, the resulting chips are large, power-hungry
and slow. Another design approach is to use statistical static timing analysis (SSTA),
which, unlike traditional static timing analysis, produces timing results as a function of yield. The
designer can trade off performance for yield [69]. However, SSTA is very complicated and still
immature [70].
The main advantage of the proposed technique is its adjustability to PVT variations. It
features an on-chip clock generation which is subject to the same fabrication process as the rest
of the chip. During run-time, the on-chip clock generation circuit synthesizes a clock period
suiting the quality of the fabrication process and the prevailing voltage and temperature. This
mitigates the process variations problem. That is, the resulting chip tunes itself to produce
the best-possible results, given the inter-chip (die-to-die) and intra-chip (within-die) variations.
Because of this self-tuning mechanism, the designer does not need to worry as much about
delays and variations. Hence, complicated analyses such as SSTA for dealing with variations
become less necessary.
When a design is optimized and implemented, different constraints are used to guide op-
timization tools for trade-offs in speed, power and area. These optimizations are usually in
conflict with each other. As shown in Figure 4.3 [71], designing and optimizing for higher clock
frequencies results in larger and more complex circuits, which consume more power. In particular,
in deep nanometer technologies, high frequency requirements push the optimization tool to use
a large number of high-speed high-leakage cells, resulting in an increase in leakage power.
Most synchronous design flows are limited by the assumption of using a fixed clock frequency,
which is determined using the worst-case PVT conditions. The PVT-aware self-tuning design
exploits the fact that a system is subject to typical rather than worst-case PVT conditions
most of the time. Since the resulting system is PVT-aware, it can be implemented to be as fast
as its traditionally designed counterpart under typical conditions, but with significantly lower
power consumption and lower area as explained in Chapter 5. Alternatively, the PVT-aware
self-tuning mechanism can be used to implement a system, which is faster than its traditional
fixed-clock counterpart under typical conditions as explained in Chapter 6.
According to Figure 4.3, power reduction is desired at the high-frequency extreme of the
design space, while at mid-range clock frequencies the goal is higher throughput without
increasing area and energy consumption. Chapter 5 focuses on power reduction at the extreme end of the
design space with high clock frequencies and Chapter 6 focuses on improving the performance
of the systems with mid-range clock frequencies.
Figure 4.3: Design Space
Chapter 6 also adds a new feature to the proposed PVT-aware design approach. It demon-
strates a design technique to implement pipelines that automatically adjust their clock period
according to the operations currently performed in the pipeline. Therefore, the speed of the
system is not limited by the delay of the slowest possible operation.
Chapter 5
Leakage Reduction
5.1 Introduction
Power reduction is an important objective in the design of today’s high-performance systems,
particularly in portable devices. Meeting power requirements is getting more difficult in to-
day’s shrinking technologies because of leakage power. As technology scales to smaller feature
sizes, leakage power becomes a more substantial portion of the total power, due to two main
factors. First, the gate length and threshold voltage of transistors are reduced, resulting in a
substantial increase in the leakage power [14]. Secondly, process variations push the designers
to use more conservative delay estimations, which result in overly complex and leaky circuits.
Process variations make the quality of fabricated chips less predictable and hence, more and
more conservative delay and clock frequency estimations are used [72]. This is an undesired
over-engineering to ensure that a large percentage of the fabricated chips meet performance
requirements. For example, using multi-corner static timing analysis, a digital system is de-
signed to deliver the required performance under all PVT corners, including worst-case PVT.
To this end, many high-speed high-leakage cells are used, resulting in significant leakage power
consumption.
This chapter demonstrates that PVT-aware systems can be designed to have significantly
reduced leakage power consumption. The PVT-aware self-tuning mechanism presented in Chap-
ter 4 is used here to introduce a new low-power design approach. According to the proposed
approach, a digital system is designed to meet the required clock frequency under typical rather
than worst-case PVT conditions. The system adjusts its clock frequency automatically as PVT
conditions change, either inter-chip or intra-chip. To implement PVT-aware systems, the on-
chip clock generation circuit proposed in Chapter 4 is used. This chapter presents the gate-level
design of the clock generation circuit along with a simple standard-cell ASIC design flow to im-
plement PVT-aware systems. Systems equipped with the proposed self-tuning circuitry are
designed with less pessimistic assumptions and over-engineering. Hence, they are simpler sys-
tems that meet the timing requirements with a smaller number of high-leakage cells and thus,
significantly reduced leakage. In a case study of a DLX microprocessor, leakage power is re-
duced by 10X under typical PVT conditions and by 7X under worst-case PVT conditions using
the proposed approach. Other advantages include a reduction in dynamic power, resilience to
PVT variations, and suitability for voltage scaling.
Most of the previously proposed PVT-aware approaches focus on improving performance [3,
7,53,73] or on reducing dynamic power consumption [8]. The approach presented here focuses
on reducing leakage power consumption while retaining the required performance under typical
conditions. Comparison to previous work is presented in Section 5.9.
In subsequent sections, opportunities to reduce power are demonstrated and the PVT-aware
architecture is described, followed by a methodology to integrate the proposed architecture
into conventional digital design flows. As a case study, the proposed design methodology is
demonstrated using a free-license DLX microprocessor, and complete post-layout results in
90nm technology are presented.
5.2 Review of Power Management Techniques
In today’s technology nodes, leakage power is a significant contributor to the total power, as
the gate length and threshold voltage are scaled down. Several techniques can be applied at the
circuit level to reduce leakage power, including multi-threshold libraries, multiple and dynamic
supply voltages, power gating and variable body biasing [14,74].
Multi-threshold libraries feature different implementations for functions, including high-
voltage-threshold (HVT), standard-voltage-threshold (SVT) and low-voltage-threshold (LVT)
cells, which have different speed and leakage characteristics. Of them, the LVT cells are the
fastest and have the highest leakage. They are used by the synthesis and optimization tools in
critical paths. The SVT and HVT gates are used in less-critical paths to reduce leakage power.
Power gating and body biasing techniques may also be used to control leakage dynamically
in sections of the chip that are idle. An adaptive body biasing approach is presented in [75].
The chip area is divided into small blocks; each block has a replica of the critical path. The
delay of the replica is used as an indicator for body-biasing the transistors in that block. The
use of replicas leads to an excessive increase in area.
Dynamic power depends on the switching activity of the circuit. An effective way to reduce
dynamic power is to gate the clock input to the sections of the circuit that are not performing
a useful task [15]. Clock gating introduces extra area overhead, and thus, the granularity of
clock-gated blocks has to be selected carefully to avoid a large increase in the leakage power.
Dynamic voltage scaling is another technique that is used in many systems such as laptops
to deliver high throughput when required by increasing the input voltage to the system. During
idle periods, the input voltage is reduced to extend battery life [76]. The Razor project [8] uses
a PVT-aware mechanism to reduce dynamic power consumption by dynamic voltage scaling.
The approach described in this chapter reduces the leakage power and the dynamic power
during both active and sleep modes of the circuit. It can be combined with other power reduction
techniques to improve power properties further as the examples described later demonstrate.
The resulting PVT-aware system is amenable to dynamic voltage scaling as it automatically
adjusts the operating frequency to the input voltage level.
5.3 Proposed Idea
An IC foundry characterizes its technology under different PVT conditions, known as PVT
corners. The PVT corners for the 90nm technology used in the experiments are given in
Table 5.1 for a 1.0 V supply voltage. Unless otherwise stated, the conditions in Table 5.1 are
those referenced throughout this chapter.
Table 5.1: PVT corners
PVT corner   Process   Voltage   Temperature
Best         Fast      1.1 V     −40°C
Typical      Typical   1.0 V     25°C
Worst        Slow      0.9 V     125°C
For worst-case design, the best PVT corner is used for hold time check and the worst PVT
corner is used for setup time check. Critical paths are identified under the worst PVT corner.
Then, the synthesis and physical design tools are instructed to optimize the design to meet
the required performance under the worst-case conditions. Meeting timing requirements under
the worst PVT conditions is harder than meeting those requirements under typical conditions,
because circuit paths have longer delays.
The test vehicle used in this chapter is Hennessy and Patterson's 32-bit DLX pipelined
microprocessor [77] downloaded from opencores.org [78]. To illustrate the potential for design
optimization, the DLX processor was synthesized by Synopsys Design Compiler for a clock
frequency of 1 GHz under the worst PVT conditions (Design 1) and also under the typical PVT
conditions (Design 2). Both designs were constrained for the best area and power optimizations.
The design methodology used will be described in Sections 5.5 and 5.6.
Table 5.2: Post-synthesis power breakdown and area of the designs under typical PVT conditions (temp = 25°C)

              Average leakage power (mW) and number of cells               Average dynamic   Area
Design        HVT            SVT            LVT            Total           power (mW)        (µm²)
Design 1      0.006          0.026          1.347          1.379           34.672            115215.07
              (3641 cells)   (2377 cells)   (7133 cells)   (13151 cells)
Design 2      0.023          0              0              0.023           33.977            109930.12
              (12711 cells)                                (12711 cells)
The resulting number of cells used in the two designs from each of the three categories of
low, standard and high threshold cells is shown in Table 5.2. The optimization tool had to use a
mix of all three cell types in Design 1 to meet the performance constraint. For Design 2, it was
able to achieve the desired performance using HVT cells only. As a result, the leakage power
of Design 1 is substantially larger than that of Design 2. Also, its dynamic power is higher as
it is a larger circuit with more switching capacitances.
It should be noted that the power values in the table were obtained from an initial power
analysis to evaluate the two designs. Post-layout simulation-based power analysis will be pre-
sented in Section 5.6. The power values were obtained using the typical PVT corner, which
uses a temperature of 25◦C. Leakage power increases exponentially with the temperature and
becomes a more substantial portion of the total power. This will be addressed in the
following sections.
The lower power consumption of Design 2 motivated the idea to design a system that has
the desired performance level under typical PVT conditions, but is equipped with a PVT-
aware mechanism that adjusts the run-time speed to accommodate changes in PVT conditions.
The architecture to support the PVT-aware mechanism and the corresponding design flow are
explained next.
5.4 The PVT-aware Architecture
As explained in Chapter 4, to create a PVT-aware system, an on-chip clock generation circuit is
added as shown in Figure 5.1. The chip area is divided into multiple regions and a PVT-aware
completion detection circuit is included in each region. The number of regions depends on many
parameters such as the quality of the fabrication process and the size of the design.
After receiving a clock pulse from the clock pulse generator at the center, the completion
detection circuit introduces a delay that matches the delay of the critical path of the system, then
sends a completion signal to the clock pulse generator. When all completion signals are received,
the clock pulse generator generates a new clock pulse.
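The timing behavior of this handshake can be captured in a small behavioral model (a Python sketch, not the thesis hardware; the region delays and the CPG overhead below are hypothetical values): the clock pulse generator waits for the slowest region's completion signal, so the clock period automatically tracks the worst local PVT condition.

```python
# Behavioral sketch of the PVT-aware clock generation loop (not the thesis RTL).
# Each region's completion detector echoes the clock pulse after a matched
# delay; the clock pulse generator (CPG) fires again only when ALL regions
# have responded, so the period tracks the slowest region.

def next_clock_edge(t_pulse, region_delays, cpg_overhead):
    """Return the time of the next clock pulse given the current pulse time."""
    completion_times = [t_pulse + d for d in region_delays]
    return max(completion_times) + cpg_overhead

# Hypothetical matched delays (ns) for four regions under slightly
# different local PVT conditions, plus a small CPG overhead.
region_delays = [1.24, 1.31, 1.27, 1.25]
t, periods = 0.0, []
for _ in range(4):
    t_next = next_clock_edge(t, region_delays, cpg_overhead=0.05)
    periods.append(round(t_next - t, 3))
    t = t_next

print(periods)  # every period is set by the slowest region: 1.31 + 0.05 = 1.36 ns
```

The same model also shows why the period improves automatically under better PVT conditions: shrinking every entry of `region_delays` shrinks the period with no redesign.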
Figure 5.2 shows a schematic of the clock generation circuit, which is based on the two-
phase single-rail asynchronous design style [2, 12] and Dean’s dynamic clocking approach [7].
The completion detection circuit for each stage comprises a delay element and a toggle. When
the clock pulse emerges from the delay element, it is converted to a level by the toggle before it
is sent back to the C-element of the clock pulse generator. Initially, all toggle elements are reset
and so is the output of the C-element. After the reset is removed, all toggle elements change
state, causing the C-element to toggle its output, thus creating a clock pulse of width CPW at
the output of the XOR.

Figure 5.1: Clock generation circuit, CD ≡ Completion Detector, CPG ≡ Clock Pulse Generator

Figure 5.2: Clock generation circuit schematic
Each completion detection circuit delays the clock pulse by an amount matching the critical
path of the system under the prevailing PVT conditions in its region, then the toggle changes
state. When all completion detection circuits have toggled, the C-element toggles, creating a new
clock pulse.
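The C-element's role in this scheme can be captured in a few lines (a behavioral Python sketch, not the gate-level cell): its output follows the inputs only when they all agree, so in each phase it naturally waits for the slowest completion signal.

```python
class CElement:
    """Muller C-element: the output switches only when all inputs agree;
    otherwise it holds its previous value."""
    def __init__(self, out=0):
        self.out = out
    def update(self, inputs):
        if all(v == 1 for v in inputs):
            self.out = 1
        elif all(v == 0 for v in inputs):
            self.out = 0
        return self.out

c = CElement(out=0)
# After reset, all toggle outputs flip 0 -> 1; the C-element follows.
assert c.update([1, 1, 1, 1]) == 1
# If even one completion signal is still low, the output holds.
assert c.update([1, 0, 1, 1]) == 1
# When all toggles flip back to 0, the C-element toggles again,
# producing the next clock event at the XOR output.
assert c.update([0, 0, 0, 0]) == 0
```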
The PVT-aware architecture results in a variable clock period. Hence, special attention
should be paid to how it communicates with its environment to ensure correct data transfers.
The problem of transferring data between unsynchronized clock domains already exists in many
high-speed systems. As such, many approaches have been suggested to minimize metastability
and data loss when different clock domains are connected. They include multi-flop synchroniz-
ers, multiplexer recirculation techniques, use of first-in-first-out buffers between different clock
domains and handshake techniques [79,80]. Similar synchronization techniques may be applied
for inter-chip and intra-chip data transfers between a PVT-aware system and its environment.
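The first of those techniques, the multi-flop synchronizer, can be illustrated with a small behavioral sketch (Python; metastability itself cannot be represented in a two-valued model, so this shows only the re-timing and latency behavior): the asynchronous level is registered twice in the receiving clock domain, giving a possibly metastable first stage a full cycle to resolve before its value is used.

```python
class TwoFlopSynchronizer:
    """Behavioral two-flop synchronizer: the input level reaches the
    output one receiving-domain clock edge after the first flop captures it."""
    def __init__(self):
        self.ff1 = 0  # first stage: may go metastable in real hardware
        self.ff2 = 0  # second stage: presents a settled value to the logic
    def clock(self, async_in):
        # Both flops update on the same clock edge.
        self.ff2, self.ff1 = self.ff1, async_in
        return self.ff2

sync = TwoFlopSynchronizer()
outputs = [sync.clock(x) for x in [1, 1, 1, 0, 0]]
print(outputs)  # -> [0, 1, 1, 1, 0]: each level appears one edge after capture
```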
5.5 Design Flow
Figure 5.3 shows the design flow to integrate the suggested PVT-aware architecture into a
conventional standard-cell ASIC design flow [81]. A few extra steps are needed to add the
clock generation circuit to the top-level hardware description language (HDL), place the clock
generation elements appropriately and tune the delay elements.
The most important change from a conventional design flow is that the main core is syn-
thesized and laid out to meet the required clock period under typical rather than worst-case
PVT conditions. Thus, typical-case timing libraries are used for setup check and critical path
analysis during synthesis and layout, leading to much more favorable results.
5.5.1 Clock generation issues
• The first step to implement the suggested clock generation circuit of Figure 5.2 is to create
a library of delay elements in the target technology. A delay element can be implemented
as a chain of 2n inverters, where n = 1, 2, ..., N . Then, the delay of each delay element is
estimated using a static timing analysis (STA) tool. The result is a table of several delay elements and their corresponding delay values that can be used in the clock generation circuit.

Figure 5.3: Proposed low-power PVT-aware design flow, HDL ≡ Hardware Description Language, DRC ≡ Design Rule Check, CTS ≡ Clock Tree Synthesis, SI ≡ Signal Integrity, STA ≡ Static Timing Analysis, ECO ≡ Engineering Change Order
• Multiple completion detection circuits are implemented to match the delay of the critical
path. All of the completion detection circuits are designed to match the same critical path
delay. However, they are placed in different regions of the chip as shown in Figure 5.1, to
follow the regional operating PVT conditions.
• The delay element in Figure 5.2 has to be adjusted such that the delay of the loop
composed of the completion detection circuit and the clock pulse generator is equal to the
critical path of the system. The delay of the loop must be tested under different PVT
corners to ensure that it matches the critical path under all conditions. When adjusting
the delay elements, appropriate margins should be used, because different factors such as
crosstalk, inductance, IR drops, noise, etc. may affect the completion detection circuits
and the datapath elements differently.
• During synthesis, it is sufficient to insert delay elements that are approximately 25%
longer than the desired clock period. They are trimmed later, during the layout flow.
• The submodules of the clock generation circuit should be pre-placed during floorplanning
to avoid a random placement.
• After the layout is completed, the post-layout netlist, the standard delay format (SDF)
file and the standard parasitic exchange format (SPEF) file are exported to an STA tool
to test the delays.
• The loop of Figure 5.2 is examined for each completion detection circuit, using an STA
tool to check if the resulting clock period is appropriate.
• If the resulting clock period is not appropriate, the delays inside completion detection
circuits are tuned and a pass of engineering change order (ECO) is performed to fix the
layout.
• The clock pulse width determined by CPW in Figure 5.2 is tested under all PVT conditions
to ensure that the pulse width requirements of sequential elements are not violated.
• The reset signal to the system must be long enough to ensure that the delay elements get
successfully reset and all the gates and flip-flops become stable.
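The delay-library and margin steps above can be sketched together: build a table of delay estimates for 2n-inverter chains, then pick the element whose delay covers the critical path plus the 25% synthesis margin (a Python sketch; the per-inverter delay `T_INV` and the table size are hypothetical stand-ins for real STA data).

```python
# Sketch of delay-element selection (hypothetical numbers; real delays
# would come from STA of inverter chains in the target library).
T_INV = 0.031          # assumed delay of one inverter, ns

def delay_table(n_max):
    """Chains of 2n inverters, n = 1..n_max, and their estimated delays."""
    return {2 * n: 2 * n * T_INV for n in range(1, n_max + 1)}

def pick_element(table, target_delay):
    """Smallest chain whose delay meets or exceeds the target."""
    return min((length, d) for length, d in table.items() if d >= target_delay)

critical_path = 1.244                    # ns, from STA
synthesis_margin = 0.25                  # start 25% long; trim during layout
target = critical_path * (1 + synthesis_margin)
table = delay_table(n_max=30)
length, d = pick_element(table, target)
print(length, round(d, 3))   # 52-inverter chain, ~1.612 ns >= 1.555 ns target
```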
5.6 Case Study: PVT-aware DLX Microprocessor
The DLX processor introduced in Section 5.3 was used as a case study. It was implemented
both as a PVT-aware system and a conventional synchronous circuit. The PVT-aware design
flow of Figure 5.3 was implemented in 90nm technology using the toolset shown in Table 5.3. A
low-power design flow similar to that of Figure 5.3 with the same tools and optimizations but
without the PVT-aware implementation steps was realized for the conventional synchronous
design.
Table 5.3: Toolset
Objective Tool Version
Synthesis Design Compiler Y-2006.06-SP5
Timing and power analysis PrimeTime-PX Y-2006.06-SP3-1
Physical design SoC Encounter 5.2
Simulation ModelSim 6.3c
The shortest possible post-layout clock period of the DLX core was found to be 1.244 ns
under the worst PVT corner. Hence, the design flow of Figure 5.3 was used to implement
a PVT-aware DLX processor with the same clock period of 1.244 ns but under typical PVT
conditions. Also, the chip was divided into four regions similar to Figure 5.1, and a completion
detection circuit was placed in each quadrant.
5.6.1 Tuning delays
To simplify delay tuning, a library of delay elements in the target technology was implemented
as explained in Section 5.5.1. Delays in the clock generation circuit were chosen to be 25%
larger than needed as a starting margin value. The delays were tuned after the place and route
using an engineering change order (ECO) flow, which trimmed the delays gradually until a
desired final margin (10% in this case study) was reached.
To tune the delays after the place and route, each clock generation loop in Figure 5.2 was
analyzed by PrimeTime to find the resulting clock period. If the period was longer than needed,
the delay element was replaced by a smaller delay element from the delay library, and vice versa.
This process was repeated for each completion detection circuit. After tuning the delays, the
new netlist was fed back to Encounter to update the layout (ECO).
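The tuning loop amounts to: measure each loop period with STA, and swap in smaller delay elements until just before the period would drop below the critical path plus the final margin (a Python sketch; the candidate loop periods are hypothetical stand-ins for PrimeTime results, and each swap in the real flow triggers an Encounter ECO pass).

```python
# Sketch of post-layout delay trimming (hypothetical values).
critical_path = 1.244   # ns
final_margin = 0.10     # stop trimming at 10% over the critical path

# Hypothetical loop periods (ns) for successively smaller delay elements.
loop_periods = [1.555, 1.493, 1.431, 1.368, 1.306]

chosen = None
for period in loop_periods:
    if period >= critical_path * (1 + final_margin):
        chosen = period         # still safely above the margin: keep trimming
    else:
        break                   # next element would cut into the margin: stop

print(chosen)  # 1.431 ns: smallest period still >= 1.244 * 1.10 = 1.3684 ns
```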
As explained earlier, a 10% margin was used for the clock period during delay tuning.
Because of different rise and fall delays in the clock generation circuit, successive clock periods
alternated between 1.367 ns and 1.569 ns, resulting in an effective clock period of 1.468 ns. The
clock period and the critical path of the DLX core change with PVT as shown in Table 5.4.
The chip layout of the PVT-aware processor is shown in Appendix C.
Table 5.4: Comparison of the clock period and the critical path

PVT     | Successive clock periods | Effective period | Critical path
Best    | 0.929 ns, 1.032 ns       | 0.9805 ns        | 0.828 ns
Typical | 1.367 ns, 1.569 ns       | 1.468 ns         | 1.244 ns
Worst   | 2.312 ns, 2.666 ns       | 2.489 ns         | 2.168 ns
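Since even and odd cycles strictly alternate between the two successive periods, the effective period is simply their mean; a quick check of the three PVT rows:

```python
# Effective clock period = average of the two alternating successive periods
# (the rise and fall paths through the clock generation loop differ in delay).
def effective_period(p_short, p_long):
    return (p_short + p_long) / 2

assert abs(effective_period(0.929, 1.032) - 0.9805) < 1e-9   # best case
assert abs(effective_period(1.367, 1.569) - 1.468) < 1e-9    # typical case
assert abs(effective_period(2.312, 2.666) - 2.489) < 1e-9    # worst case
```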
5.6.2 Implementing the fixed-clock counterpart
After finding the effective clock period of the PVT-aware DLX processor under typical PVT
conditions (1.468 ns), a conventional synchronous counterpart (fixed-clock) was implemented
using the same 10% clock period margin. To do so, the DLX core was constrained to a clock
period of 1.34 ns, which was to be met under all three PVT corners. All the optimizations of
Figure 5.3 were applied to the fixed-clock design.
It should be noted that the same 10% margin was used for both the PVT-aware processor
and the fixed-clock version. The actual required margin depends on many parameters such as
the fabrication quality, the size of the design, and the noise level (i.e. supply noise, clock jitter,
etc).
The important difference between the PVT-aware and the fixed-clock implementations is
that the required clock frequency has to be met even under the worst PVT corner for the
latter. The PVT-aware version meets the required clock frequency and performance under
typical conditions. As shown below, the PVT-aware system achieves a significant reduction in
power at the expense of a lower performance under conditions worse than typical.
5.7 Evaluation
In this section, the PVT-aware and the fixed-clock microprocessors are compared in terms of
power consumption and performance. Using several application programs with different power
consumptions, it is demonstrated that the leakage of the PVT-aware processor is 10X less under
typical conditions and 7X less under worst-case conditions. Other properties of the PVT-aware
microprocessor such as resilience to PVT variations and suitability for voltage scaling are also
examined.
The functionality and power consumption of the PVT-aware DLX processor and its fixed-
clock counterpart were analyzed using the three benchmark suites given in Table 5.5. The
benchmarks were compiled by DLX GCC [82]. Post-layout simulations of the circuits were per-
formed for each benchmark to record switching activities in the switching activity interchange
format (SAIF). These, together with parasitic data (SPEF files), were used by PrimeTime-PX
for simulation-based power analysis.
Table 5.5: Benchmarks

Source                    | Benchmarks
MiBench [83]              | adpcm coder, adpcm decoder, crc32, dijkstra, qsort
PowerStone [84]           | bcnt, blit, compress, ucbqsort
Applications from [85,86] | Bubble Sort, JPEG-DCT, MP3-DCT32, MPEG2-Bdist
5.7.1 Power and performance analysis
Typical PVT: Average power consumption and execution times under typical PVT conditions
are presented in Table 5.6. Power values in the table do not include memory and IO. The core
power is the power of the processor excluding the clock tree.
Table 5.6: Power and performance results for fixed-clock and PVT-aware DLX processors under typical PVT conditions, temp = 25°C

Fixed-clock Processor

Program       | Average dynamic power (mW)   | Average leakage power (mW)  | Total average | Execution
              | Core    Clock tree  Total    | Core   Clock tree  Total    | power (mW)    | time (µs)
adpcm coder   | 17.012  18.794      35.806   | 1.237  0.006       1.243    | 37.049        | 286.135
adpcm decoder | 18.713  18.794      37.507   | 1.236  0.006       1.242    | 38.749        | 1274.348
crc32         | 17.113  18.794      35.907   | 1.236  0.006       1.242    | 37.149        | 286.851
dijkstra      | 16.812  18.794      35.606   | 1.236  0.006       1.242    | 36.848        | 143.806
qsort         | 19.713  18.794      38.507   | 1.236  0.006       1.242    | 39.749        | 729.099
bcnt          | 17.812  18.794      36.606   | 1.236  0.006       1.242    | 37.848        | 44.204
blit          | 17.511  18.794      36.305   | 1.237  0.006       1.243    | 37.548        | 103.950
compress      | 16.513  18.794      35.307   | 1.236  0.006       1.242    | 36.549        | 1508.450
ucbqsort      | 19.414  18.794      38.208   | 1.235  0.006       1.241    | 39.449        | 1399.062
Bubble Sort   | 24.012  18.794      42.806   | 1.235  0.006       1.241    | 44.047        | 12.963
JPEG-DCT      | 19.813  18.794      38.607   | 1.236  0.006       1.242    | 39.849        | 588.808
MP3-DCT32     | 20.315  18.894      39.209   | 1.233  0.006       1.239    | 40.448        | 73.323
MPEG2-Bdist   | 18.613  18.794      37.407   | 1.235  0.006       1.241    | 38.648        | 50.578
Average       | 18.380  18.795      37.175   | 1.236  0.006       1.242    | 38.402        |

PVT-aware Processor

Program       | Average dynamic power (mW)   | Average leakage power (mW)  | Total average | Execution
              | Core    Clock tree  Total    | Core   Clock tree  Total    | power (mW)    | time (µs)
adpcm coder   | 10.15   18.996      29.146   | 0.125  0.004       0.129    | 29.275        | 286.136
adpcm decoder | 11.25   18.996      30.246   | 0.125  0.004       0.129    | 30.375        | 1274.349
crc32         | 10.05   18.996      29.046   | 0.125  0.004       0.129    | 29.175        | 286.852
dijkstra      | 9.75    18.996      28.746   | 0.125  0.004       0.129    | 28.875        | 143.807
qsort         | 12.45   18.896      31.346   | 0.125  0.004       0.129    | 31.475        | 729.100
bcnt          | 10.85   18.896      29.746   | 0.125  0.004       0.129    | 29.875        | 44.205
blit          | 10.35   18.996      29.346   | 0.125  0.004       0.129    | 29.475        | 103.951
compress      | 9.65    18.996      28.646   | 0.125  0.004       0.129    | 28.775        | 1508.451
ucbqsort      | 12.15   18.896      31.046   | 0.125  0.004       0.129    | 31.175        | 1399.063
Bubble Sort   | 15.148  18.796      33.944   | 0.125  0.004       0.129    | 34.073        | 12.964
JPEG-DCT      | 11.95   18.996      30.946   | 0.125  0.004       0.129    | 31.075        | 588.809
MP3-DCT32     | 11.85   18.996      30.846   | 0.125  0.004       0.129    | 30.975        | 73.324
MPEG2-Bdist   | 11.15   18.896      30.046   | 0.125  0.004       0.129    | 30.175        | 50.579
Average       | 11.133  18.961      30.094   | 0.125  0.004       0.129    | 30.223        |
The PVT-aware processor executes all benchmark programs with the same execution time
as the fixed-clock processor under typical conditions. However, it consumes less leakage and
dynamic power. To calculate the average power values (highlighted in the table), the energy
consumption of each program is calculated and summed, then divided by the sum of the
execution times. On average, the leakage power of the PVT-aware processor is 10X less than
that of its fixed-clock counterpart and its dynamic power is 19% less for the same performance.
The total power of the PVT-aware processor is 21% smaller, on average. Since the clock
tree power is almost equal for the two designs, only the core power contributes to the power
differences.
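The averaging rule described above is an energy-weighted mean rather than an arithmetic mean of the per-program powers; a small sketch with hypothetical numbers shows the difference:

```python
# Energy-weighted average power: sum each program's energy (power x time),
# then divide by the total execution time, as done for the highlighted
# averages in Table 5.6. The benchmark values below are hypothetical.
def average_power(powers_mw, times_us):
    energies = [p * t for p, t in zip(powers_mw, times_us)]  # mW * us = pJ * 1e3
    return sum(energies) / sum(times_us)

# A short, high-power program and a long, low-power one.
powers = [40.0, 30.0]       # mW
times  = [10.0, 990.0]      # us

avg = average_power(powers, times)
print(round(avg, 2))  # 30.1 mW: dominated by the long-running program,
                      # not the arithmetic mean (35 mW)
```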
Table 5.7 shows how the two designs differ in some of the key parameters that affect power
consumption. The PVT-aware design is implemented mostly with HVT cells. The fixed-clock
system uses a large number of LVT and SVT cells, which have significantly larger leakage
power. Also, the fixed-clock system is a bigger circuit with about 14% more area and more
switching capacitances. Therefore, the fixed-clock design consumes more dynamic power than
its PVT-aware counterpart.
Table 5.7: Post-layout area and leakage breakdown under typical PVT corner, temp = 25°C

Design      | Leakage power (µW) and number of cells                                                | Total area of
            | HVT                | SVT                | LVT                 | Total                 | std. cells (µm²)
PVT-aware   | 14.45 (10680 cells)| 27.12 (1901 cells) | 87.18 (662 cells)   | 128.75 (13243 cells)  | 129637.86
Fixed-clock | 5.45 (3821 cells)  | 85.96 (4715 cells) | 1150.00 (5652 cells)| 1241.41 (14188 cells) | 151573.39
Worst-case PVT: The power and performance values presented above were obtained under
the typical PVT corner provided for the technology, which uses a temperature of 25°C. Power requirements, especially leakage power, were next examined under the worst-case PVT corner at a temperature of 125°C (this was the only available PVT corner with a high temperature).
Average power consumption of all benchmarks and their total execution time for this case
are presented in Table 5.8. Leakage power is a substantial portion of the total power for the
two designs under these conditions. The PVT-aware processor has a 7X lower leakage power.
The speed of the PVT-aware processor is reduced as a result of being exposed to worse-than-typical conditions, and thus, its dynamic power is also reduced. The power-delay product of
the PVT-aware system is 2.30X smaller than that of its fixed-clock counterpart.
Table 5.8: Power and performance results under worst-case PVT, temp = 125°C

Design      | Av. leakage power (mW) | Av. total power (mW) | Total ex. time (ms) | Power-delay product (µJ)
Fixed-clock | 59.193                 | 87.216               | 6.502               | 567.08
PVT-aware   | 8.092                  | 22.344               | 11.023              | 246.30
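The power-delay products in the table follow directly from the other two columns (mW × ms = µJ), and reproduce the 2.30X figure quoted above:

```python
# Power-delay product = average total power x total execution time.
# mW x ms gives uJ directly, matching the last column of Table 5.8.
def pdp_uj(avg_power_mw, exec_time_ms):
    return avg_power_mw * exec_time_ms

assert abs(pdp_uj(87.216, 6.502) - 567.08) < 0.01   # fixed-clock
assert abs(pdp_uj(22.344, 11.023) - 246.30) < 0.01  # PVT-aware
assert round(567.08 / 246.30, 2) == 2.3             # the quoted 2.30X reduction
```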
5.7.2 Resilience to inter-chip PVT variations
The PVT-aware processor was tested under the three PVT corners. The system executed all
benchmarks correctly and tuned itself to produce the best possible results under the prevailing
PVT conditions. Figure 5.4 shows that the execution time changes as the clock period changes
under different PVT conditions. The results in this figure are consistent with the clock periods
presented previously in Table 5.4.
Figure 5.4: Performance of PVT-aware and fixed-clock DLX processors under all PVT corners (execution time relative to the worst case of the PVT-aware design)
5.7.3 Resilience to intra-chip PVT variations
The PVT-aware processor was also tested for its resilience to intra-chip PVT variations. An
area on the right side of the chip, equal to about 1/3 of the total area, was selected. Starting
from the typical-case SDF file, the delays of all the cells in that area were augmented by 10%
using the Design Compiler’s derating commands. A new SDF file was generated to be used in
simulations, which verified that the system executed all the benchmarks correctly. Table 5.9
shows the change in clock periods with the increased delay.
Table 5.9: Clock period changes with intra-chip variations

PVT                       | Successive clock periods | Critical path
Chip under typ.           | 1.367 ns, 1.569 ns       | 1.244 ns
Chip with augmented delay | 1.473 ns, 1.600 ns       | 1.378 ns
5.7.4 Suitability for voltage scaling
The 90nm technology used in the experiments is characterized for two supply voltage levels:
1.0 V and 1.2 V. These characterizations were used to apply voltage scaling to the PVT-aware
DLX processor. It was ensured that the pads were compatible with 1.2 V and no hold violation
occurred. Then, the PVT-aware processor was tested under typical PVT conditions for both
supply voltages. The system automatically adjusted its frequency to changes in supply voltage.
The clock frequency of the system with 1.2 V was 1.17 times that for a voltage supply
of 1.0 V. This shows that the PVT-aware design is amenable to voltage scaling techniques.
5.8 Discussion
5.8.1 Design space
The proposed PVT-aware approach expands the design space by providing more flexibility to
power and performance trade-offs. This can be useful in the implementation of many applica-
tions, such as portable systems. The case study presented in this chapter shows that a system
can be designed to deliver the desired performance under typical conditions, which are the con-
ditions that the system is exposed to most of the time. If the conditions get worse, the system
is still functional as it automatically slows down and if the conditions get better, it will speed
up. The advantage of such a system over the fixed-clock design is an overall reduction in power
and area, which is a result of using a smaller number of high-leakage high-speed cells to reach the desired performance.

Figure 5.5: Design Space expansion using PVT-aware design
Figure 5.5 shows the design space, which was explained for the conventional design in
Chapter 4. Using the conventional design approach, the designer of a system may trade off
performance for power. Aiming for a high clock frequency results in high dynamic
power dissipation. To reach higher frequencies, a larger number of high-speed high-leakage LVT
cells are required, which increase the leakage power consumption. Similarly, the required area
is increased as the target clock frequency increases.
The proposed PVT-aware design approach adds a new curve to the design space, which
may be used to achieve the required speed with lower area and power consumption, compared
to the conventional design. The distance between the two curves in Figure 5.5 increases with
the clock frequency. At lower frequencies, conventional and PVT-aware designs both achieve
the required speed using a small number of LVT cells. However, as the target clock frequency
increases, the number of LVT cells, and thus the total leakage power of the conventional design
become more substantial than those of the PVT-aware design. Similarly, the difference between
the area of the conventional design and that of the PVT-aware design becomes larger as the
target clock frequency increases and so does the difference between their dynamic power values.
Alternatively, a PVT-aware system may be designed to deliver the performance of its fixed-
clock counterpart under worst-case conditions. This is useful when there is a hard limit on
the required clock frequency. Such a system delivers a performance better than that of its
fixed-clock counterpart under typical conditions, at the expense of higher power consumption
as the clock frequency increases. It should be noted that the clock period of the proposed
clock generation circuit is limited by the loop delays and therefore, it may not reach the clock
frequency of a fixed-clock synchronous design under worst-case conditions.
Suitability of PVT-aware systems for voltage scaling adds another degree of freedom to the
trade-offs available to the designer. PVT-aware systems automatically adjust their frequency
to the input voltage. Hence, the input voltage can be reduced using dynamic voltage scaling
techniques to conserve power when top performance is not required. On the other hand, the
input voltage may be increased to boost performance when the system is exposed to poor PVT
conditions.
5.8.2 Clock error detection
Error detection for the clock generation circuit is addressed here. The clock is started by the
reset signal. Since each clock pulse depends on the previous one and the clock generator is not
self-starting, an error detection circuit should exist to detect that the clock has stopped and to
reset the clock generation circuit.
The C-element in the clock generation circuit can mask transient pulses or glitches of up to
a certain width. However, wider glitches caused by noise at the inputs of the C-element may
mistakenly be interpreted as a completion signal. Also, transient pulses at the delay lines may
cause a completion signal to be generated at the wrong time. These may cause the clock pulse
generator to freeze, which should be detected.
In an experiment, the clock generated by the on-chip clock generation circuit was checked
using a simple test circuit and a reference clock signal. A counter, clocked by the generated clock
signal, was sampled periodically in the reference clock domain. If the counter stops changing,
an error signal is generated. The reference clock frequency and the interval between samplings
should be selected appropriately with respect to the frequency range of the generated clock.
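The sampling check can be sketched as follows (a behavioral Python sketch, not the test circuit; the counter samples are hypothetical, and the sampling interval must allow at least one generated-clock increment between samples):

```python
# Behavioral sketch of the clock-stoppage detector: a counter clocked by the
# generated clock is sampled periodically in the reference clock domain;
# two equal consecutive samples mean the generated clock has stopped.
def check_samples(samples):
    """Return the index of the first sample showing a frozen counter, or None."""
    for i in range(1, len(samples)):
        if samples[i] == samples[i - 1]:
            return i
    return None

# Counter values captured at each reference-clock sampling interval.
running = [3, 9, 15, 21, 27]    # generated clock alive: counter keeps advancing
stopped = [3, 9, 14, 14, 14]    # generated clock froze after the third sample

assert check_samples(running) is None
assert check_samples(stopped) == 3   # error flagged at the fourth sample
```

On detection, the error signal would reset the clock generation circuit, restarting the pulse loop.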
5.8.3 Expanding the PVT-aware approach
In the case study presented above, the typical PVT corner at a temperature of 25°C provided
by the technology supplier was used. In the conventional design, important PVT corners for
performance analysis are worst-case and best-case corners and not the typical one. Therefore,
IC foundries usually do not provide enough typical PVT corners. However, information about more typical PVT corners with higher temperatures is required for the PVT-aware design approach, because a design typically runs at a temperature higher than 25°C.
It should be noted that the case study presented in this chapter used four PVT regions
for a microprocessor. The actual required number of regions depends on many parameters
including the size of the design, voltage and temperature profiles over the chip, the quality of
the fabrication process and the margins used for the delays.
Different paths might have different sensitivity to PVT changes and thus, under different
PVT conditions, the critical path changes. As a result, the circuit may have multiple critical
paths. To ensure that the delay loops are always longer than the critical path, the proposed
design flow measures the delay of all paths under various PVT conditions, independent of which
path is the critical path.
It should be possible to apply the PVT-aware clocking approach in a modular way to
complex and large systems on chip (SoCs). SoCs are growing in complexity and hence, multiple
clock domains are inevitable. Complex SoCs are composed of several modules with different
clock domains, which communicate using asynchronous interfaces [87]. In the case of the PVT-
aware design, clock generation circuits can be used in different modules and similar approaches
can be used to connect the modules from different clock domains. Therefore, a large SoC should
benefit from the advantages of the proposed PVT-aware design.
5.9 Comparison to Previous Work
The idea of exploiting typical PVT conditions to improve performance has been used in different
designs including clock frequency control systems such as Dean’s STRiP processor [7] and
TEAtime [73] and also asynchronous circuits [3, 46]. However, the approach presented in this
chapter employs PVT-aware properties primarily to reduce leakage power consumption. There
are also several other differences as discussed briefly in this section.
A predecessor to the clocking scheme in this thesis was presented in Dean’s PhD thesis in the
implementation of a self-timed RISC processor. Dean proposed the clock generation structure
shown in Figure 5.6. A C-element and a pulse generator at the center are used to generate the
clock pulse.

Figure 5.6: Dean’s clocking structure [7]

Tracking cells are designed to match the delay of different functional units. To do
so, several functions are selected according to the frequency of their use and partially replicated
at the transistor-level. When all the tracking cells signal the completion of their corresponding
operations, a new clock pulse is generated. Dean demonstrated a two-fold speed improvement
under typical PVT conditions using his approach. Dean’s work opened up the area of variable
or dynamic clocking.
The clock generation method presented in this thesis builds on Dean’s work and significantly
improves it. The design of tracking cells in Dean’s work is complicated because they should
accurately replicate the critical path of the corresponding function. Also, they are implemented
at the transistor level. Loads on the transistors of the functional units are imitated using
passive transistors. Replicas of circuit paths may be large and power consuming compared
to the matched delays used in this thesis. However, they more accurately model the delay of
corresponding paths. The approach presented in this thesis is easily incorporated in standard-
cell ASIC design implementation and does not require customized transistor-level design. Also,
it employs static timing analysis in the design of the matching delay elements of completion
detection circuits. As such, the design of the completion detection circuit is independent of the
function being matched. This allows the introduction of a simple design methodology that uses
conventional design tools to implement variable-clock systems with standard cells. As a result,
the proposed approach can be readily used in many applications.
A pioneering work in reducing power consumption by reducing worst-case margins is the
Razor project [8]. The Razor project shows the possibility of reducing the voltage margins
used in worst-case analysis of synchronous circuits. This work reduces power consumption by
reducing the input voltage, but keeps the clock frequency intact. The design does not guarantee
error-free operation. Hence, an error recovery circuit is added to cope with any timing errors
that may result from the reduced voltage.
As a result of operating at a reduced voltage, errors may occur in operations that require
full voltage. Such errors are detected using the Razor flip-flops shown in Figure 5.7. A shadow
logic operating with a delayed clock is used to obtain the correct result, which is compared to
the data in the main flip-flop. If there is a difference, an error signal is generated. The error
signal causes the data in the main flip-flop to be replaced with the data in the latch. Then,
the pipeline is flushed in a counter-flow approach. Simulation results show a 64% power saving
with less than 3% performance penalty in a simplified 64-bit Alpha pipeline design in 180-nm
technology.
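The error-detection step of a Razor flip-flop can be summarized in a toy model (a Python sketch of the comparison described above, not the actual circuit): the shadow latch, clocked late, always captures the settled value; a mismatch with the main flip-flop raises the error signal and triggers restoration.

```python
# Behavioral sketch of Razor error detection. The main flip-flop may capture
# a late-arriving (wrong) value; the shadow latch, clocked by a delayed
# clock, captures the correct settled value. A mismatch raises the error
# signal and the main flip-flop is restored from the shadow latch
# (followed by a pipeline flush in the real design).
def razor_cycle(main_captured, shadow_captured):
    """Return (final_value, error) for one clock cycle."""
    error = main_captured != shadow_captured
    final = shadow_captured if error else main_captured
    return final, error

# Cycle with timing slack: both capture the settled value, no error.
assert razor_cycle(1, 1) == (1, False)
# Cycle with a timing error: the main flip-flop latched stale data; the
# shadow latch has the correct result, so it is restored and error is raised.
assert razor_cycle(0, 1) == (1, True)
```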
Compared to the PVT-aware approach presented in this thesis, Razor has the advantage
of keeping the clock frequency fixed, which might be useful for fixed data rate applications.
However, implementing Razor requires both architectural and circuit changes. Implementing
Razor flip-flops and the pipeline flushing mechanism increases the area and complexity of the
system. Flushing a pipeline with many stages may result in a significant performance loss. In
Chapter 5. Leakage Reduction 85
Razor FF
0
1
Logic stageL2Main
flip-flop
Shadowlatch
Error_L
ErrorComparator
clk
clk_delayed
Q1D1Logic stageL1
clk
clk_delayed
D
Error
Q
Cycle 1 Cycle 2 Cycle 3 Cycle 4
Instr 1 Instr 2
Instr 1 Instr 2
(a)
(b)
Figure 5.7: Razor flip-flop for a pipeline stage. (a) A shadow latch controlled by a delayed clockaugments each flipflop. (b) Razor flipflop operation with a timing error in cycle 2 and recoveryin cycle 4. [8]
addition, the Razor-based design methodology inherits the disadvantage of traditional synchronous
design, where the circuit is optimized for worst-case conditions. Although such over-design for
typical conditions can be partially overcome by lowering the supply voltage, an area overhead is
still incurred, which in turn leads to increased leakage power. In contrast, the proposed PVT-
aware mechanism targets typical-case conditions during the synthesis and physical design phases
of the implementation. This leads to smaller area and leakage. Meanwhile, the clock frequency
requirement is guaranteed not only under typical conditions, but also under worst-case conditions,
by raising the supply voltage.
TEAtime [73] adjusts the clock frequency as PVT conditions change. The critical path
of the system is replicated and tested with the clock frequency continuously changing using a
voltage-controlled oscillator (VCO) to find the highest suitable value for the system. A clock
frequency increase of 34% under typical PVT operating conditions compared to worst-case
conditions has been demonstrated in an FPGA implementation of a DLX-style processor. The
design methodology described in this chapter achieves better performance improvements from
worst-case to typical PVT conditions compared to TEAtime. It also adapts to intra-chip PVT
variations.
Asynchronous circuits can also be designed to adjust their speed to PVT conditions. Several
asynchronous design styles exist including desynchronization [46, 53], in which a synchronous
design is converted into an asynchronous one, and Mousetrap [3], which is a methodology to
design high-speed pipelines taking advantage of PVT variability. The methodology proposed
in this chapter is simpler than asynchronous design styles because conventional synchronous
design tools are used and thus, neither asynchronous design methods nor asynchronous design
tools are required. The desynchronization method introduces an area overhead of 13.5% in
a DLX microprocessor [46]. By comparison, area overhead using the presented PVT-aware
architecture is only 0.5% for a similar DLX processor.
5.10 Conclusion
This chapter proposed the use of PVT-aware design to implement systems that are capable
of delivering the same performance as conventional synchronous circuits under typical PVT
conditions, but with much reduced power requirements. This chapter presented a complete
design solution for PVT-aware systems, including a PVT-aware architecture and a design flow
to implement such systems using standard-cell ASICs. The suggested methodology expands the
digital design space for many applications with low power and high performance requirements,
such as portable devices.
The case study of the DLX microprocessor has demonstrated that the PVT-aware system,
implemented in 90nm technology, delivers the same performance as its fixed-clock counterpart
under typical PVT conditions, with 10X less leakage and 19% less dynamic power. The clock
frequency changes automatically to produce the best-possible results under the prevailing PVT
variations. It has also been shown that voltage scaling techniques can be applied to PVT-aware
systems, which automatically adjust their speed to the input voltage.
Chapter 6
VariPipe: Variable-clock Synchronous Pipelines
6.1 Introduction
In many pipelined systems, such as microprocessors, the time required to complete an op-
eration in any given stage of the pipeline depends on the operation being performed. In a
conventional synchronous system, the delay of the longest path of the pipeline under the worst
process-voltage-temperature (PVT) corner is used to determine the clock frequency. However,
the longest path of the system is not necessarily triggered in every cycle. Also, the system
is normally operating under typical PVT conditions. Hence, there are many times when a
frequency much higher than that derived under worst-case conditions is possible.
This chapter introduces a variable-clock synchronous pipeline design (VariPipe), in which
the clock period is adjusted in each clock cycle based on the operations taking place in the
pipeline stages [88]. An on-chip clock generation circuit dynamically matches the delay of
the current operations of the pipeline in every cycle. At the same time, the clock period au-
tomatically adjusts to the current PVT conditions. The proposed approach achieves better
performance than isochronous clocking while retaining the simplicity of synchronous system
design. Other advantages of variable-clock synchronous pipelines include a reduction in elec-
tromagnetic noise and suitability for voltage scaling techniques. These features make variable
clocking appealing for many applications including embedded systems and portable devices.
Several studies have been published that address variable-speed pipelines, including Tele-
scopic units [89] and a variable-clock pipeline processor introduced by Dean [7]. Asynchronous
design methodologies such as desynchronization [46] and Mousetrap [3] have also been proposed
to achieve average-case performance. The main advantage of VariPipe over previous work is
that the overhead incurred by the added clock generation circuit is low, thanks to the use of
variable delay elements. According to the case study presented in this chapter, the overhead of
the added clock generation circuit for a VariPipe DLX processor is only 2.6% in area and 3% in
energy consumption. Dean's approach achieves the same 2X performance improvement over
conventional isochronous design as the VariPipe processor; however, it duplicates functional
units, doubling their area and energy consumption. The desynchronization
method introduces a 13.5% area overhead in a DLX processor [46]; the processor does not adjust
its speed based on the operations in the pipeline and therefore its performance gain is limited.
The VariPipe methodology is based on a synchronous circuit implementation, and thus many
challenges in the design of asynchronous circuits are avoided. Also, it employs a simple design
methodology using standard cells and conventional synchronous design tools, which allows
designers to use the proposed approach in many applications. A comprehensive comparison to
related work is presented in Section 6.8.
Section 6.2 describes the basic idea of the VariPipe approach and Section 6.3 explains the
methodology in more detail. Timing constraints are given in Section 6.4. A design flow to
implement VariPipe application specific integrated circuits (ASICs) is presented in Section 6.5,
and the proposed approach is demonstrated and evaluated through a microprocessor case study
in Sections 6.6 and 6.7.
6.2 VariPipe: The Idea
Consider a pipelined system consisting of several pipeline stages of combinational logic, sepa-
rated by pipeline registers. Each pipeline stage may have different modes of operation exercised
by different instructions as they flow through the pipeline stage. For example, the execution
stage of a RISC processor may execute different operations such as addition, bitwise logical
operations, etc. The key observation is that these operations activate different paths and thus,
they have different delays. In the isochronous clocking scheme, employed in today’s dominant
Electronic Design Automation (EDA) methodology, the clock period is constant, which means
it must be longer than the delay of all possible operations in the pipeline at all times. VariPipe
employs a clocking scheme in which the clock period continuously tracks the maximum delay
under current PVT conditions for all operations currently being performed. As Figure 6.1
shows, a variable delay is associated with each pipeline stage, and its delay is adjusted to
match the delay of the current operation in the stage, as determined by the data in that stage’s
input registers.

Figure 6.1: VariPipe technique

When the delays of all pipeline stages have elapsed, the clock pulse generator
creates a new clock pulse. As a result, some clock cycles are shortened and the overall speed is
increased. The variable delay unit is placed close to the corresponding datapath to be subject
to the same PVT conditions.
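As a toy model (my own, with invented delay values), the effect of per-cycle clock adjustment can be seen by comparing a trace of per-stage operation delays against a fixed worst-case period:

```python
def cycle_period(stage_delays_ns):
    """Each cycle lasts as long as the slowest operation currently in any stage."""
    return max(stage_delays_ns)

# Hypothetical per-cycle delays of the operations in two stages (ns).
trace = [(6.91, 7.18), (9.60, 7.18), (6.91, 9.06)]

varipipe_time = sum(cycle_period(c) for c in trace)
fixed_time = 9.60 * len(trace)  # a fixed clock must cover the worst operation

assert varipipe_time == 7.18 + 9.60 + 9.06
assert varipipe_time < fixed_time
```

Some cycles shrink to the delay of the operations actually in flight, so total execution time drops without any change to the datapath itself.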
Note that although the proposed architecture benefits from its asynchronous nature, the use
of asynchronous design is limited to the clock generation circuit, leaving the rest of the system
still a synchronous circuit that can be designed, synthesized, and laid out using a traditional
design flow.
6.3 Design Methodology
In this section, the methodology for the design and implementation of VariPipe systems is
described in detail. The design process starts with a high-level hardware description of the
system and its implementation in the target technology. Adding the VariPipe facilities involves
three steps: creating delay profiles, simplifying the delay profiles and implementing the clock
generation circuit.
6.3.1 Creating delay profiles
Different operations in any stage of the pipeline can be identified from the high-level hardware
description of the system. Each operation takes the values in the input registers and saves its
result in the output registers. The result of an operation may not be needed in every cycle,
as determined by the selection signals for that operation. The different operations that can be
performed in any pipeline stage and the conditions under which the results of those operations
are selected are recorded in an operation selection table. The case study below shows that
operation selection tables are easy to prepare, because they are constructed using a high-level
description of the system without using the low-level implementation details.
The maximum delay of any operation of the pipeline stage can be determined from the
low-level implementation of the system in the target technology. There are two methods to find
the delays: (I) dynamic timing analysis (DTA), which finds the delays using test vectors, and
(II) static timing analysis (STA), which is the method used here. For each pipeline stage, the delays of all the
operations are found, and a delay profile is created by grouping the operations of the pipeline
stage according to their delay values. The case study below presents a simple and automated
approach to construct delay profiles.
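Constructing a delay profile amounts to grouping the stage's operations by their reported delays, largest first. A minimal sketch (the operation names and delay values here are invented):

```python
def delay_profile(op_delays):
    """op_delays: {operation: delay_ns} -> [(delay_ns, [operations])], longest first."""
    groups = {}
    for op, delay_ns in op_delays.items():
        groups.setdefault(delay_ns, []).append(op)
    return sorted(groups.items(), reverse=True)

stage = {"ADD": 6.5, "SUB": 6.5, "AND": 2.3, "NOP": 2.3}
assert delay_profile(stage) == [(6.5, ["ADD", "SUB"]), (2.3, ["AND", "NOP"])]
```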
6.3.2 Simplifying delay profiles
Each pipeline stage has a minimum delay that can be identified from the delay profile of
that stage. The path having the largest minimum delay of all pipeline stages is the shortest
inevitable path of the pipeline. To reduce the number of delay values needed, delay values less
than the delay of the shortest inevitable path in each profile are grouped and rounded up to
the maximum value of the group. This simplifies the delay profile and the implementation of
the clock generation circuit.

Figure 6.2: Clock generation circuit
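The grouping-and-rounding rule can be sketched as follows; the numbers reuse the execution-unit values from the case study later in this chapter (Section 6.6), and the function itself is illustrative:

```python
def simplify_profile(profile, shortest_inevitable_ns):
    """Round delays below the shortest inevitable path up to their group maximum."""
    group = {op for op, d in profile.items() if d < shortest_inevitable_ns}
    rounded = max(profile[op] for op in group) if group else None
    return {op: (rounded if op in group else d) for op, d in profile.items()}

# Execution-unit delays (ns); 7.18 ns is the pipeline's shortest inevitable path.
exec_unit = {"SUBI": 9.60, "SLT": 7.60, "SRLI": 6.91, "NOP": 2.27}
assert simplify_profile(exec_unit, 7.18) == {
    "SUBI": 9.60, "SLT": 7.60, "SRLI": 6.91, "NOP": 6.91
}
```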
6.3.3 Implementing the clock generation circuit
Figure 6.2 shows the clock generation circuit, which is composed of two parts: the completion
detection circuits and the clock pulse generator. The design of the clock generation circuit
is based on the two-phase single-rail asynchronous design style [2, 12] and thus inherits many
properties from asynchronous systems. The completion detection circuit for each stage is com-
posed of a variable delay, a toggle, an operation selection table and a delay selector. The delay
selector reads appropriate signals from the inputs to the pipeline stage. It uses the operation
selection table, which is ordered according to the delay values, to generate a one-hot delay
selection signal (S) to select the appropriate delay value. If the inputs to the pipeline stage
activate more than one operation (e.g., in a complex multi-task stage), the delay corresponding
to the operation with the longest delay is selected.
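Because the operation selection table is ordered by delay, the selector reduces to a priority scan: the first active entry, i.e. the longest-delay active operation, determines the selection. In hardware S is a one-hot signal; the sketch below (predicates and values are mine) models the same priority behavior:

```python
def select_delay(stage_inputs, table):
    """table: list of (predicate, delay_ns), ordered from longest delay down."""
    for predicate, delay_ns in table:
        if predicate(stage_inputs):
            return delay_ns  # the longest active delay wins
    raise ValueError("no operation matched the stage inputs")

table = [
    (lambda x: x["func"] == "SUB", 9.60),
    (lambda x: x["func"] == "SLT", 7.60),
    (lambda x: True, 6.91),  # default group: all remaining operations
]
assert select_delay({"func": "SUB"}, table) == 9.60
assert select_delay({"func": "NOP"}, table) == 6.91
```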
When the clock pulse emerges from the variable delay element, it is converted to a level
by the toggle before being sent to the C-element. Initially, all toggle elements are reset and
so is the output of the C-element. After the reset is removed, all toggle elements change state
causing the C-element to toggle its output, thus creating a clock pulse of width CPW at the
output of the XOR. The clock pulse loads new values into the input registers of each pipeline
stage.
Note that the delay through the delay elements must be at least long enough for the corre-
sponding delay-selection signals to become valid. After a delay matching the operation with the
longest delay currently in the stage, the toggle changes state. When all stages have switched
to the new state, the C-element toggles creating a new clock pulse.
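The gating element here is a Muller C-element, whose output changes only when all inputs agree. A behavioral sketch (mine, not from the thesis) shows why a new clock event occurs only after every stage's toggle has switched:

```python
def c_element(inputs, prev_output):
    """Muller C-element: output follows the inputs when they all agree, else holds."""
    if all(inputs):
        return 1
    if not any(inputs):
        return 0
    return prev_output  # inputs disagree: hold the previous level

assert c_element([1, 1, 1], 0) == 1  # all stage toggles switched: new clock event
assert c_element([1, 0, 1], 0) == 0  # one stage still busy: output held low
assert c_element([0, 1, 0], 1) == 1  # mixed inputs: previous level retained
```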
6.3.4 Variable delay implementation
Consider a pipeline stage whose delay profile has three values, d1, d2, and d3 (d3 > d2 > d1).
The first two delays are selected by signals S1 and S2. Otherwise, d3 is selected. The design
of the variable delay and the output toggle for that stage are illustrated in Figure 6.3. The
values of the three delay elements k1, k2 and k3 are selected such that the total delay around
the clock loop in Figure 6.2 matches the stage’s delay profile d1, d2 and d3.
Figure 6.3: Variable delay and toggle
Delay k1 in Figure 6.3 consists of a long chain of gates which change state twice with
every input pulse. To reduce power consumption, the delay architecture shown in Figure 6.4
is proposed. The input pulse to the delay chain is converted to a level, then at the end of the
chain, converted back to a pulse. As a result, the gates composing delay L1 switch only once
with each input pulse. Delay L1 should be tuned such that the minimum delay of the path
between the input and the output matches the desired delay, k1, and delay element PW should
be adjusted to generate a suitable pulse width. Simulations showed that the power saving
achieved by this technique is close to 50% for long delay chains.
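A back-of-envelope count makes the roughly 50% figure plausible: a pulse causes two transitions (a rising and a falling edge) per gate of the chain, while a level causes one, and the small pulse/level converters at each end add a few transitions of their own. The converter size below is an assumed value for illustration:

```python
def chain_transitions(n_gates, pulse_mode, converter_gates=4):
    """Transitions per input event in an n-gate delay chain."""
    per_gate = 2 if pulse_mode else 1          # a pulse toggles each gate twice
    converters = 0 if pulse_mode else 2 * converter_gates
    return n_gates * per_gate + converters

n = 200  # a long chain
saving = 1 - chain_transitions(n, pulse_mode=False) / chain_transitions(n, pulse_mode=True)
assert 0.45 < saving < 0.50  # approaches 50% as the chain grows
```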
Figure 6.4: Reducing the switching power of the delay element
6.4 Timing Constraints
Figure 6.5 shows a simplified model of the clock generation circuit with three completion de-
tection circuits. The timing constraints on the design of the clock generation circuit may be
summarized as follows:
• The reset signal to the system must be long enough to ensure that the delay elements are
successfully reset and all the gates and flip-flops become stable.
• Each loop in Figure 6.5 has different rise and fall times. The minimum delay of the loop
should be used for delay tuning.
• Each completion detection circuit must be placed within the corresponding stage to ensure
that it matches the datapaths’ delays under the prevailing PVT conditions in that stage.
Figure 6.5: A simplified model of the clock generation circuit
When adjusting the delay elements, appropriate margins should be used because factors
such as crosstalk, IR drops, noise, inductance, etc. may affect the datapath and the
completion detection circuit differently.
• Part of the delay of the loops in Figure 6.5 is the clock pulse generator and the clock tree
delay. The clock pulse generator and the root cells of the clock tree are not necessarily
close to the pipeline stage and their PVT conditions may be different. Therefore, the
clock pulse generator and clock tree delays must be used with appropriate margins when
tuning delays. In the case study presented below, only 90% of the clock tree and the clock
generation circuit delay is taken into account, thus ensuring that the total delay around
each of the clock loops is slightly larger than the required delay.
• The delays of the clock generation loops should be tested under all PVT corners to ensure
that the delay elements inside the loops are sufficiently large.
• Delay selection signals Si in Figure 6.2 and Figure 6.3 must become valid before the
input clock pulse emerges from the first delay element (k1), and hence, delay k1 must be
sufficiently long.
• The clock pulse width determined by CPW in Figure 6.2 and the pulse width determined
by PW in Figure 6.4 should be tested under all PVT conditions to ensure that the pulse
width requirements of sequential elements are not violated.
• Communication of a variable-clock system with its environment needs special attention to
ensure correct data transfers. The problem of transferring data between unsynchronized
clock domains already exists in many high-speed systems, and many approaches are in use
to minimize metastability and data loss when different clock domains are connected. They
include multi-flop synchronizers, multiplexer recirculation techniques, use of first-in-first-
out buffers between different clock domains and handshake techniques [79, 80]. Similar
synchronization techniques may be applied for inter-chip and intra-chip data transfers
between a VariPipe system and its environment.
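As one concrete example of the techniques listed, a two-flop synchronizer can be modeled behaviorally (this sketch is mine, not from the thesis): a value from the other clock domain becomes visible only after two destination-domain clock edges, giving any metastability in the first flip-flop time to resolve.

```python
class TwoFlopSynchronizer:
    """Behavioral model: ff1 samples the asynchronous input, ff2 samples ff1."""

    def __init__(self):
        self.ff1 = 0
        self.ff2 = 0

    def clock_edge(self, async_in):
        """Advance both flip-flops on one destination-domain clock edge."""
        self.ff2, self.ff1 = self.ff1, async_in
        return self.ff2

sync = TwoFlopSynchronizer()
assert sync.clock_edge(1) == 0  # new value captured into ff1 only
assert sync.clock_edge(1) == 1  # stable value visible after two edges
```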
6.5 Design Flow
Figure 6.6 shows the proposed design flow to implement VariPipe systems using standard cells.
The design flow is explained in detail in the following case study.
6.6 Case Study: VariPipe DLX Microprocessor
To test the performance of a variable-clock synchronous pipeline, a VariPipe version of Hennessy
and Patterson's 32-bit DLX pipeline microprocessor [77] was implemented in 90nm
technology. The Verilog code of the processor was downloaded from opencores.org [78]. The
DLX core is a RISC microprocessor with five pipeline stages: instruction fetch, instruction
decoding, instruction execution, memory access and write back. To implement the processor,
the design flow of Figure 6.6 was realized using the toolset shown in Table 6.1.
The main synchronous core was constrained to a clock period of 8.73 ns to accommodate the
worst PVT corner. Then, two versions of the processor were generated: one version equipped
with the VariPipe technique and the other a conventional synchronous circuit (fixed-clock).
Both designs were optimized for minimum power and area.
Table 6.1: Toolset
Objective                  Tool             Version
Synthesis                  Design Compiler  Y-2006.06-SP5
Timing and power analysis  PrimeTime-PX     Y-2006.06-SP3-1
Physical design            SoC Encounter    5.2
Simulation                 ModelSim         6.3c
6.6.1 Implementing the VariPipe DLX processor
According to the design flow of Figure 6.6, the first step after obtaining the behavioral HDL
of the DLX processor is to analyze the design to identify the operations of each pipeline stage
and the conditions under which each operation is selected. The execution unit and the decoder
are given here as examples.
Figure 6.6: Proposed VariPipe design flow. HDL ≡ Hardware Description Language, DRC ≡ Design Rule Check, STA ≡ Static Timing Analysis, ECO ≡ Engineering Change Order, SDF ≡ Standard Delay Format, SPEF ≡ Standard Parasitic Exchange Format, CTS ≡ Clock Tree Synthesis

Execution unit: Part of the behavioral Verilog code of the execution unit is shown in Figure 6.7. The execution unit performs a range of tasks including logical and arithmetic operations on input registers A and B and places the result into the ALU result register. The results
of these operations are available on the intermediate signals ADD result, AND result and
SUB result. One of these signals is then selected as the output on ALU result based on the
`define ADD 6'b100000
`define SUB 6'b100010
`define AND 6'b100100
...
assign ADD_result = reg_A + reg_B;
assign SUB_result = reg_A - reg_B;
assign AND_result = reg_A & reg_B;
...
if (IR_opcode_field == 0) // R-type format inst. or NOP
  case (IR_function_field)
    `ADD: ALU_result <= ADD_result;
    `SUB: ALU_result <= SUB_result;
    `AND: ALU_result <= AND_result;
...
Figure 6.7: Verilog code of the Execution unit
instruction opcode field and instruction function field, which are available in the input registers
of the execution unit. Thus, the operation selection table of the execution unit can be derived
as in Table 6.2.
Table 6.2: Operation Selection Table of Execution Unit
            Selection signals (Si)
Operation   IR opcode field   IR function field
ADD         0                 6'b100000
SUB         0                 6'b100010
AND         0                 6'b100100
...         ...               ...
Decoder: The decoder is responsible for generating the branch signal, which declares that
a branch has to be taken in the next cycle. The decoder also computes the branch address
and sends it to the fetch unit. The result of this computation is needed only if the branch is
to be taken. Therefore, when the branch signal becomes valid and if it is equal to zero, there
is no need to wait for the computation of the branch address to be completed. The operation
selection table of the decoder is shown in Table 6.3.
Table 6.3: Operation Selection Table of Decoder
Operation        Selection signals (Si)
Branch signal    1 (always computed)
Branch address   Branch signal
...              ...
After synthesizing the main core with the design constraints, pre-layout delay profiles of the
pipeline stages are extracted using the STA tool and operation selection tables. Delay profile
extraction and simplification are explained later for post-layout delay profiles as the process is
the same.
To implement the clock generation of Figure 6.2, the operation selection table of each
pipeline stage is used to create the delay selection logic. Delay elements are not fine-tuned at
this stage as the delays will change during layout. Initially, delay elements are selected to be
around 30% larger than needed.
To simplify the implementation of the clock generation circuit, a library of delay elements
in the target technology is created. A delay element is implemented as a chain of 2n inverters,
where n = 1, 2, ..., N . Then, the delay of each delay element is estimated using the STA tool.
The result is a table of several delay elements and their corresponding delay values. The clock
generation circuit is completed using these delay elements.
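The library lookup can be sketched as below; the per-inverter delay is a placeholder, since a real flow characterizes each chain with the STA tool as described above:

```python
INV_DELAY_NS = 0.035  # assumed per-inverter delay; real values come from STA

def build_delay_library(max_pairs):
    """Chains of 2n inverters (even counts preserve polarity): {size: delay_ns}."""
    return {2 * n: 2 * n * INV_DELAY_NS for n in range(1, max_pairs + 1)}

def pick_element(library, target_ns):
    """Smallest chain whose delay meets or exceeds the target delay."""
    return min(size for size, delay in library.items() if delay >= target_ns)

library = build_delay_library(100)
assert pick_element(library, 2.27) == 66  # 64 inverters fall short; 66 suffice
```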
The clock generation circuit is then connected to the main synchronous core in the top-level
HDL used for the layout flow. Most layout steps are similar to the conventional synchronous
design flow [81]. The main difference is that the completion detection circuit of each stage is
constrained to be placed inside that stage.
After the place and route steps are completed, post-layout delay profiles are created. The
list of operations for which delay values are needed is readily available from the operation
selection tables. The delays of various operations are found using static timing analysis (STA).
The compiler’s STA facility enables the designer to obtain the longest delay in any pipeline
stage. However, constructing the delay profiles requires information about the longest path
for each of the operations in the operation selection table. The required information can be
obtained using the STA facility as follows. To find the delay of a given operation of a pipeline
stage, the corresponding selection fields in the input registers of the stage are set to the values
that select that operation, using assign statements in the HDL netlist. These values will in
turn, set the corresponding selection signals for that operation and the delay reported by the
STA tool will be the delay of the desired operation. This process can be automated using an
appropriate script.
As an example, operations ADD, SUB and AND of the execution stage save their re-
sults in the ALU result register. To find the delay of the ADD operation, selection signals
IR opcode field and IR function field are set to 0 and 6’b100000 in the post-layout netlist, as per
the information in Table 6.2. As a result, the delay from the input registers to the ALU result
register reported by the STA tool is the desired delay.
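This scripting step can be sketched as follows. The helper emits the assign statements that pin a stage's selection fields for one operation from its selection table; the helper and its names are mine, while the field values follow Table 6.2:

```python
def selection_assigns(operation, selection_table):
    """Verilog assigns that pin a stage's selection fields for one operation."""
    return [f"assign {signal} = {value};"
            for signal, value in selection_table[operation].items()]

table = {
    "ADD": {"IR_opcode_field": "0", "IR_function_field": "6'b100000"},
    "SUB": {"IR_opcode_field": "0", "IR_function_field": "6'b100010"},
}
assert selection_assigns("ADD", table) == [
    "assign IR_opcode_field = 0;",
    "assign IR_function_field = 6'b100000;",
]
```

Running the STA tool on the netlist with these assigns in place yields the longest path of that operation alone; iterating over the table rows produces the full delay profile.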
The delay profiles of the execution unit and the decoder under the worst PVT corner are
given in Table 6.4, with a 10% margin. These are the values to be matched by the delay
elements. According to the delays in the table, the critical path corresponds to the SUBI
operation in the execution unit. It has a delay of 9.6 ns, including a 10% margin.
Table 6.4: Post-layout delay profiles of Decoder and Execution unit
Decoder
Operation         Delay + 10% margin (ns)
Branch address    9.06
Slot number       6.35
Branch signal     6.07
...               ...
WriteBack index   0.59

Execution Unit
Operation         Delay + 10% margin (ns)
SUBI              9.60
...               ...
SLT               7.60
SRLI              6.91
...               ...
NOP               2.27
The next step is to simplify the delay profiles. The longest delay of the decoder is the branch
address calculation. The branch signal determines if this calculation is needed. Therefore, the
branch signal computation is an unavoidable operation and its corresponding path is the shortest
inevitable path of the decoder, which is 6.07 ns (with the 10% margin). The simplified view
of the clock generation circuit previously given in Figure 6.5 may be used for the clock period
calculation. To calculate the clock period corresponding to the shortest inevitable path of the
decoder, the delay of the path is augmented by the delay of the toggle (0.35 ns), the clock pulse
generator (0.45 ns) and the clock tree (0.31 ns). As a result, the shortest possible clock period
is 7.18 ns. Since this is more than the delay of the other operations of the decoder (except
Branch address computation), the decoder's delay profile can be simplified to two delays, as
shown in Table 6.5.
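The period calculation above reduces to simple addition, reproduced here as a check:

```python
# Components of the clock loop for the decoder's shortest inevitable path (ns).
branch_signal_path = 6.07     # includes the 10% margin
toggle_delay = 0.35
pulse_generator_delay = 0.45
clock_tree_delay = 0.31

shortest_clock_period = (branch_signal_path + toggle_delay
                         + pulse_generator_delay + clock_tree_delay)
assert round(shortest_clock_period, 2) == 7.18
```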
Table 6.5: Simplified delay profiles
Decoder unit
Operation              Delay (ns)
Branch address         9.06
All others             7.18

Execution unit
Operation              Delay (ns)
SUBI                   9.60
...                    ...
SLT                    7.60
SRLI, and all others   6.91
The minimum delay of the decoder in Table 6.5 is larger than the minimum delay of all
other stages, making it the shortest inevitable path of the pipeline. Hence, delays less than
7.18 ns in the delay profile of other units were grouped and rounded up to the maximum of the
group. In the case of the execution unit, all delays equal to or less than 6.91 ns were grouped
together, as shown in Table 6.5. The clock generation circuit was modified according to the
new delay profiles, followed by an engineering change order (ECO) pass to update the layout.
The next step according to Figure 6.6 is to fine-tune the delay elements. The physical design
tool is used to write the post-layout HDL netlist along with the standard delay format (SDF)
file and the standard parasitic exchange format (SPEF) file for post-layout STA. The long delay
chains used during synthesis are adjusted by replacing them with other delay elements from
the delay library, then tested by the STA tool to check if they are of appropriate length. This
process is repeated until appropriate delay values are achieved. An ECO pass is done to update
the layout with the modified delays.
It should be noted that the delays were chosen to be longer than needed initially. During
the delay tuning, they were trimmed and then tested using the STA tool. Thus, the ECO
update was used only once at the end of the delay tuning process, followed by a final check of
the delays.
The longest delay for the Si signals of each pipeline stage was compared against the k1
delay (see Figure 6.3) to ensure that the delay selection signals are settled before the input
pulse emerges out of the k1 delay element. In our experiments, delay k1 was sufficiently larger
than the delays of the Si signals.
The PVT corners for the 90nm technology used in the experiments are shown in Table 6.6.
Delays were tuned under the worst PVT corner with the 10% margin. Since different paths may
have different levels of sensitivity to PVT variations, delays and data paths were also examined
under the typical and best PVT corners to ensure that the delays are sufficiently large. As
explained in the next section, the VariPipe processor was simulated under all PVT corners to
verify correct functionality.
Table 6.6: PVT corners
PVT corner   Process   Voltage (V)   Temperature
Best         Fast      1.1           -40°C
Typical      Typical   1.0           25°C
Worst        Slow      0.9           125°C
6.7 Evaluation
In this section, different characteristics of VariPipe and fixed-clock processors are compared.
Also, the area and energy overhead of the clock generation circuit are quantified. Function-
ality, performance and energy consumption of the VariPipe DLX processor and its fixed-clock
counterpart were analyzed using the three benchmark suites shown in Table 6.7, which were
compiled by DLX GCC [82]. Post-layout simulations of the circuits were performed for each
benchmark and switching activities were recorded in the switching activity interchange format
(SAIF). These, together with parasitic data (SPEF files), were used by PrimeTime-PX for
simulation-based power analysis.
Table 6.7: Benchmarks
Source                      Benchmarks
MiBench [83]                adpcm coder, adpcm decoder, crc32, dijkstra, qsort
PowerStone [84]             bcnt, blit, compress, ucbqsort
Applications from [85,86]   Bubble Sort, JPEG-DCT, MP3-DCT32, MPEG2-Bdist
6.7.1 Performance analysis
Figure 6.8 shows execution times under best, typical and worst PVT conditions. The perfor-
mance of the fixed-clock system is the same under all conditions, but the performance of the
VariPipe system varies with PVT conditions as shown. Table 6.8 shows the execution time
Figure 6.8: Performance of VariPipe and fixed-clock DLX processors under all PVT corners
Table 6.8: Execution time reduction percentage using VariPipe

Program         Reduction % under worst   Reduction % under typ
adpcm coder     14.0                      51.3
adpcm decoder   13.4                      51.0
crc32           14.5                      51.7
dijkstra        12.8                      50.6
qsort           10.6                      49.4
bcnt            12.4                      50.4
blit            11.4                      50.0
compress        14.1                      51.4
ucbqsort         9.4                      48.7
Bubble Sort     11.0                      49.6
JPEG-DCT        15.9                      52.4
MP3-DCT32       15.0                      51.9
MPEG2-Bdist     15.4                      52.1
Average         12.7                      50.6
reduction percentage obtained using VariPipe. Under the worst-case conditions, the execution
times of benchmarks are 13% shorter, on average, for the VariPipe design, because the VariPipe
system adjusts the clock period in each cycle to match the current operations. The reduction in
execution time varies with the frequency of occurrence of the instructions and their sequence in
the program. The VariPipe system is twice as fast as the fixed-clock counterpart under typical
conditions. The average percentages in the table are calculated from the summed execution
times of all programs, rather than by averaging the per-program percentages.
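The per-cycle clock adjustment can be illustrated with a short sketch (Python; the operation names and delay values below are hypothetical illustrations, not measurements from the VariPipe processor): a variable clock pays only the matched delay of the operation currently in flight, while a fixed clock pays the worst-case period every cycle.

```python
# Hypothetical sketch of per-cycle variable clocking: the clock generator
# picks a matched delay for the operation in the pipeline instead of always
# using the worst-case (fixed) period. All delays are illustrative.

OP_DELAY_NS = {"add": 1.2, "shift": 0.9, "load": 1.5, "mult": 2.0}
WORST_CASE_NS = max(OP_DELAY_NS.values())  # fixed-clock period

def execution_time(trace, variable_clock=True):
    """Total time for a sequence of single-stage operations."""
    if variable_clock:
        # period tuned each cycle to the current operation
        return sum(OP_DELAY_NS[op] for op in trace)
    # fixed clock: every cycle pays the worst-case period
    return WORST_CASE_NS * len(trace)

trace = ["add", "shift", "load", "add", "mult", "shift"]
t_var = execution_time(trace, variable_clock=True)
t_fix = execution_time(trace, variable_clock=False)
reduction = 100 * (t_fix - t_var) / t_fix
```

For this sample trace, the variable clock finishes in 7.7 ns against 12.0 ns for the fixed clock, the same pattern of reduction reported in Table 6.8, with the exact percentage depending on the mix and order of operations.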
It should be noted that in the case of deep pipelines with shallow logic in each stage, the
difference between the delay of operations may be negligible. Hence, the gain that can be
achieved with a VariPipe design may be small. In those cases, VariPipe can still improve the
performance by adjusting the clock period to the prevailing PVT conditions. This performance
gain should be carefully compared against the overhead of the clock generation circuit.
6.7.2 Energy consumption analysis
Energy consumption of the two processors under typical PVT conditions is compared in Ta-
ble 6.9. Energy values in the table do not include memory and IO. The core energy is the
energy consumption of the processor excluding the clock tree. The VariPipe system consumes
only 3% more energy than the fixed-clock system.
Table 6.9: Energy consumption under the typical PVT corner

               Fixed-clock processor                 VariPipe processor
               Energy (µJ)              Time         Energy (µJ)              Time
Program        Core    Clk tree Total   (µs)         Core    Clk tree Total   (µs)
adpcm coder    2.464   4.741    7.205   1871.1       2.663   4.782    7.445   910.3
adpcm decoder  12.309  21.151   33.460  8333.6       13.012  21.323   34.335  4084.1
crc32          2.450   4.748    7.198   1875.8       2.697   4.789    7.486   906.7
dijkstra       1.255   2.380    3.635   940.4        1.324   2.401    3.725   464.1
qsort          7.829   12.091   19.920  4767.9       8.075   12.195   20.270  2412.5
bcnt           0.419   0.732    1.151   289.0        0.441   0.738    1.179   143.2
blit           0.948   1.722    2.670   679.7        1.008   1.737    2.745   340.6
compress       12.340  25.006   37.346  9864.5       13.596  25.218   38.814  4792.4
ucbqsort       14.427  22.934   37.361  9039.7       15.224  23.404   38.628  4634.5
Bubble Sort    0.165   0.215    0.380   84.7         0.168   0.217    0.385   42.7
JPEG-DCT       5.830   9.769    15.599  3850.5       6.203   9.852    16.055  1831.5
MP3-DCT32      0.721   1.223    1.944   479.5        0.773   1.232    2.005   230.4
MPEG2-Bdist    0.461   0.844    1.305   330.7        0.505   0.846    1.351   158.3
Total          61.618  107.556  169.174 42407.1      65.689  108.734  174.423 20951.3
6.7.3 Area and energy overhead
The clock generation circuit takes up only 2.6% of the total area of the VariPipe processor.
The area of the clock generation circuit is mainly taken by the delay elements. As previously
mentioned, the energy overhead of using VariPipe is merely 3%. The clock generation circuit
is a very small portion of the total area and is implemented by low-leakage cells and thus, its
leakage power is negligible compared to that of the processor.
6.7.4 Resilience to PVT variations
The VariPipe system has been shown to work correctly under all PVT conditions. It au-
tomatically adjusts to inter-chip and intra-chip PVT variations to deliver the best-possible
performance. Figure 6.8 represents the inter-chip PVT variation analysis of the VariPipe DLX
processor. The execution times drop by as much as 62% from the worst to the best PVT condi-
tions. The VariPipe system is also resilient to intra-chip PVT variations. To test this, starting
from the typical-case SDF file, the delays of the execution unit and its completion detection
circuit were augmented by 10% using the Design Compiler’s derating commands. The other
units were not changed. A new SDF file was generated to be used in simulations, which verified
that the system executed all the benchmarks correctly.
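The intent of this experiment can be sketched as follows (Python; the delay values are placeholders, not numbers from the design): because the execution unit and its completion-detection delay line are derated by the same factor, the condition that the matched delay covers the logic delay is preserved under intra-chip variation.

```python
# Sketch of the intra-chip variation check: the execution unit and its
# completion-detection (matched) delay are derated together, so the safety
# condition "matched delay >= logic delay" is preserved.
# The delay values here are hypothetical placeholders.

def matched_delay_covers(logic_ns, matched_ns, derate=1.0):
    """True if the derated matched delay still exceeds the derated logic delay."""
    return matched_ns * derate >= logic_ns * derate

LOGIC_NS = 2.0     # execution-unit delay from a typical-case SDF (placeholder)
MATCHED_NS = 2.2   # matched delay including margin (placeholder)

ok_typical = matched_delay_covers(LOGIC_NS, MATCHED_NS)         # baseline
ok_derated = matched_delay_covers(LOGIC_NS, MATCHED_NS, 1.10)   # both +10%
```

The key point is that the inequality is invariant under a common scaling factor, which is why derating both the unit and its completion detector together leaves correct operation intact.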
6.7.5 Reduction in electromagnetic noise
The clock is often the main source of electromagnetic noise in a digital system, because it has
a fixed frequency that is also the highest in the system [90]. Many circuits employ spread-
spectrum oscillators to overcome this problem [91]. In VariPipe systems, the clock frequency
varies within a range around an average value and thus, the clock power is spread over that
range. The clock power spectrum of the VariPipe DLX processor under the worst PVT corner
for one of the benchmarks is compared against its fixed-clock counterpart in Figure 6.9. The
maximum clock power of VariPipe is about 28 dB less than that of the fixed-clock design.
With no central peak in the frequency spectrum, the VariPipe processor should generate less
electromagnetic noise compared to its fixed-clock counterpart.
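The spreading effect can be reproduced numerically (Python; the 4 ns nominal period, the 3 to 5 ns range, the sampling step, and the cycle count are illustrative choices, not the processor's actual clock): a fixed-period square wave concentrates power at its fundamental, whereas a clock whose period varies cycle by cycle leaves only a small component there.

```python
# Sketch of spectrum spreading with a varying clock period. A 0/1 square
# wave is built cycle by cycle; the DFT magnitude at the fixed clock's
# fundamental is then compared for the two cases. All numbers are
# illustrative only.
import cmath
import random

DT_NS = 0.1  # sample step in nanoseconds

def clock_wave(periods_ns):
    """50%-duty clock waveform: one list entry per DT_NS sample."""
    samples = []
    for p in periods_ns:
        n = round(p / DT_NS)
        samples += [1.0] * (n // 2) + [0.0] * (n - n // 2)
    return samples

def dft_mag(wave, freq_ghz):
    """Normalized |DFT| of the mean-removed wave at freq_ghz."""
    mean = sum(wave) / len(wave)
    acc = sum((s - mean) * cmath.exp(-2j * cmath.pi * freq_ghz * k * DT_NS)
              for k, s in enumerate(wave))
    return abs(acc) / len(wave)

random.seed(0)
fixed = clock_wave([4.0] * 400)                               # 250 MHz fixed clock
varying = clock_wave([random.uniform(3.0, 5.0) for _ in range(400)])

F0 = 0.25  # GHz, fundamental of the fixed clock
# dft_mag(varying, F0) is far below dft_mag(fixed, F0): the power that the
# fixed clock concentrates at F0 is spread over the 3-5 ns period range.
```

Because each varying cycle arrives with an essentially random phase at the fixed fundamental, the contributions no longer add coherently, which is the same mechanism that lowers the peak clock power in Figure 6.9.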
6.7.6 Suitability for voltage scaling
The 90nm technology used in the experiments is characterized for two supply voltage levels:
1.0 V and 1.2 V. These characterizations were used to apply voltage scaling to the VariPipe
processor. It was ensured that the pads were compatible with 1.2 V and no hold violation
occurred. The system was simulated under typical PVT conditions for both supply voltages.
The system automatically adjusted its speed to changes in supply voltage. It was 1.2 times
faster using the 1.2 V supply, compared to the 1.0 V supply. This shows that the VariPipe
design is amenable to voltage scaling techniques.
6.8 Related Work
As mentioned in Section 6.1, other studies have proposed variable-speed pipelines that
exploit typical PVT conditions. In this section, some of these studies are reviewed and com-
pared against VariPipe.
[Clock power spectra: clock power (dB) vs. frequency, 0.1-1 GHz; (a) VariPipe, (b) Fixed-clock.]
Figure 6.9: Comparison of the clock power spectra
As mentioned in Section 5.9, variable clocking was first addressed in Dean’s PhD thesis [7].
Dean uses partial duplicates of the functional units in each pipeline stage as tracking
cells. As mentioned before, duplicates may introduce a substantial area overhead, which in
turn, increases the power consumption considerably. VariPipe uses variable delays, which have
a significantly smaller overhead compared to the tracking cells in Dean’s method. The case
study showed that the overhead of the added clock generation circuit for a VariPipe DLX pro-
cessor is 2.6% in area and 3% in energy consumption. The VariPipe DLX processor and Dean’s
both achieve 2X performance improvement over isochronous design. The design of tracking
cells in Dean’s work is complicated, as they depend on the function being matched and are
implemented at the transistor level. Loads on the transistors of the functional units
are imitated using passive transistors. However, compared to matched delays, Dean’s tracking
cells can better model the delay of the corresponding path. VariPipe employs static timing
analysis in the design of matched delay elements for completion detection circuits. As such, the
design of the completion detection circuit is independent of the function being matched. This
allows the introduction of a simple design methodology that uses conventional design tools to
implement variable-clock systems with standard cells. As a result, the proposed approach can
be readily used in many applications.
Telescopic units are introduced in [89] to design variable-speed pipelines. A fixed clock
period shorter than the delay of the critical path is applied to the pipeline. When the critical
path is triggered, a hold signal is raised to show that another clock cycle is required for the
instruction to complete. Since the critical path of each pipeline stage is not triggered in every
cycle, an overall throughput improvement of 27% has been achieved. In comparison, VariPipe
adjusts to the current instructions in the pipeline as well as present PVT conditions, and hence,
achieves a better performance improvement.
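A minimal model of the telescopic-unit scheme makes the source of its throughput gain explicit (Python; the period, two-cycle latency, and trigger fraction are hypothetical, not values from [89]): only the instructions that exercise the critical path pay a second cycle.

```python
# Sketch of the telescopic-unit idea: the pipeline runs a short fixed clock;
# an instruction that triggers the critical path raises a hold signal and
# takes one extra cycle. All numbers are illustrative.

SHORT_PERIOD_NS = 1.0   # shorter than the critical-path delay
LONG_LATENCY = 2        # cycles when the critical path is triggered

def telescopic_time(n_instr, critical_fraction):
    """Total time when `critical_fraction` of instructions need two cycles."""
    n_critical = int(n_instr * critical_fraction)
    cycles = (n_instr - n_critical) + n_critical * LONG_LATENCY
    return cycles * SHORT_PERIOD_NS

fixed_time = 1000 * 2.0                 # clock sized to a 2 ns critical path
tele_time = telescopic_time(1000, 0.2)  # 20% trigger the critical path
```

With 20% of instructions triggering the critical path, this toy pipeline finishes in 1200 ns versus 2000 ns for the critical-path-sized clock; the actual gain in [89] depends on how rarely each stage's critical path is exercised.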
TEAtime [73] uses a replica of the critical path to track PVT variations in a DLX-style
processor on FPGA and achieves a 34% speed improvement. The processor does not change its
speed with instructions and it is not resilient to intra-chip PVT variations.
The work presented in [92] pursues the same goals as VariPipe. Epassa, Boyer and Savaria
first find the latencies of instructions in all pipeline stages and save them in memory. These data
are used at run time for variable clocking using a PLL and a variable-period clock synthesizer.
A 17% speedup for one of the test programs has been achieved. This approach is limited by
the number of the phases that the PLL can produce. Also, the clock does not adjust to PVT
variations automatically.
The Razor project [8] shows the possibility of reducing the voltage margins used in worst-
case analysis of synchronous circuits. This work reduces power consumption by reducing the
supply voltage, but keeps the clock frequency intact. An error recovery circuit is added to cope
with any timing errors due to the reduced voltage.
Asynchronous circuits can also be designed to achieve average-case performance and ad-
justability to PVT variations. Several asynchronous design styles exist, including desynchro-
nization [46], by which a synchronous design is converted into an asynchronous one, and Mouse-
trap [3], which is a methodology to design high-speed pipelines taking advantage of PVT vari-
ability. Desynchronization introduces an area overhead of 13.9% in a DLX microprocessor [46];
the processor does not adjust its speed according to the operations in the pipeline. In a design
approach by Nowick, variable delays are used in the implementation of a speculative completion
detection circuit for an asynchronous adder [93]. VariPipe is simpler than asynchronous design.
It uses standard-cell design implementation and also does not require customized transistor-
level circuits. It uses conventional synchronous design tools and thus, neither asynchronous
design methods nor asynchronous design tools are required.
6.9 Conclusion
This chapter proposed a new clocking scheme in which the clock period tracks the delay of a
pipeline on a cycle-by-cycle basis. A low-overhead clock generation circuit and a standard-cell
design flow compatible with today’s dominant EDA design methodology are demonstrated, which
should enable designers to use the proposed approach in many applications. A case study has
been presented, which demonstrates that the VariPipe DLX processor has a two-fold perfor-
mance advantage over its fixed-clock counterpart. The overhead of the added clock generation
circuit is only 2.6% in area and 3% in energy consumption. Resilience to PVT variations,
reduction in electromagnetic noise, and suitability for voltage scaling are other advantages of
VariPipe systems.
Chapter 7
Conclusion and Future Work
7.1 Conclusion
7.1.1 Summary and Contributions
This dissertation introduces a new methodology to implement synchronous systems with asyn-
chronous advantages. Asynchronous logic is used to generate the clock signal to the main
synchronous core. The resulting system automatically tunes its speed in every cycle to deliver
the best-possible performance under the prevailing PVT conditions. In the case of VariPipe,
the system responds not only to the prevailing PVT conditions but also to the operations cur-
rently taking place in the pipeline. As a result, the performance and power requirements of the
design are no longer defined by the worst-case analysis of the system. Instead, the system is
characterized with a range of deliverable performances and corresponding power requirements.
The main features of the design techniques introduced in this thesis are listed below:
• Only commercial synchronous design tools are used.
• Asynchronous tools and knowledge are not required.
• The overhead in terms of design and circuit control is minimal.
It has been shown that the proposed technique can be used to solve or diminish several
urgent problems in the deep nanometer regime as follows:
• Reduce power consumption and area without deteriorating performance under typical
PVT conditions.
• Improve performance with negligible area and energy consumption overhead.
• Help in handling process variations.
• Expand the design space by introducing new power-performance trade-offs, which are
particularly useful for portable devices.
These advantages make the PVT-aware, self-tuning design a viable solution for today’s
shrinking technologies.
Several other contributions have been presented in this dissertation, which may be summa-
rized as follows:
• In Chapter 3, edge-triggered flip-flops and T-elements are used to improve concurrency
in the operation of asynchronous circuits. It has been shown that by overlapping the
handshakes involved in write-after-read operations, significant speed improvements can
be achieved along with reductions in circuit area and power-delay product. Possible gains
in speed for several configurations are analyzed and experimentally examined.
• A clock generation scheme that accounts for inter-chip and intra-chip variations is pre-
sented in Chapter 4.
• A complete design flow to implement PVT-aware systems using conventional synchronous
design tools is presented in Chapters 5 and 6.
• A low-overhead PVT and operation aware clock generation circuit using only standard
cells is presented in Chapters 5 and 6.
• A technique to implement variable delays with reduced switching power is presented in
Chapter 6, which can be used in the proposed clock generation circuit as well as in
asynchronous systems.
• A new approach to find the delays of different operations of a computational unit using
static timing analysis is presented in Chapter 6.
7.2 Future Work
In this section, different design and computer-aided design (CAD) directions to continue the
work presented in this dissertation are suggested.
In this dissertation, a methodology to implement PVT-aware and VariPipe systems is
introduced and tested on a 32-bit processor. PVT-aware and VariPipe techniques should be
tested on bigger designs and applications. Also, several chips should be fabricated, tested and
compared against the traditional synchronous design. Properties such as performance, power
consumption, yield and electromagnetic noise should be examined under different operating
conditions for many chips.
A future research direction is to fully automate the proposed methodologies. This is espe-
cially important for the VariPipe technique because creating operation selection tables for large
designs may be difficult.
The PVT-aware design technique proposed in this dissertation may be expanded to be
modular for big designs. The system can be divided into multiple subsystems, each implemented
as a PVT-aware self-tuning design and then connected to other subsystems. This results in two
potential advantages: 1) each subsystem works at its own speed, and 2) the necessity of a global
clock is mitigated, which may be useful in the design of large systems on chip (SoCs). This
direction of research shares many goals and challenges with the globally asynchronous locally
synchronous (GALS) design approach.
A future research direction for the VariPipe technique is to employ finer-grained completion
detection techniques. In the approach presented in this thesis, when an operation is
detected, the longest possible delay for that operation is chosen. Using speculative completion
detection [93], it is possible to choose a shorter delay for an operation based on its input data.
For example, if both inputs to an adder are zero, a shorter delay can be chosen to indicate
the completion of addition. Thus, the completion of each operation can be detected more
accurately and cycle reduction can be targeted more aggressively. However, implementing a
speculative completion mechanism for each operation may increase the power consumption and
area of the circuit and complicate the design methodology. Therefore, trade-offs involved in
using speculative completion should be studied carefully.
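The adder example can be sketched as a data-dependent delay selection (Python; the delay values and the zero-operand rule are illustrative simplifications of the speculative completion idea in [93], not an implementation of it):

```python
# Hypothetical sketch of data-dependent speculative completion for an adder:
# the matched delay is chosen from the operands. A zero operand means no
# carry chain can form, so a shorter delay can signal completion earlier.
# Delay values are illustrative only.

FULL_ADD_DELAY_NS = 2.0   # worst-case carry propagation
SHORT_DELAY_NS = 0.5      # when no long carry chain is possible

def add_completion_delay(a, b):
    """Pick the matched delay for a + b based on the input data."""
    if a == 0 or b == 0:
        return SHORT_DELAY_NS      # result is just the other operand
    return FULL_ADD_DELAY_NS       # assume the worst for non-trivial inputs
```

A real speculative scheme would examine carry-kill patterns rather than only zero operands, and each added detection rule contributes the area and power overhead discussed above.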
Finally, more typical PVT corners are required for the PVT-aware digital design imple-
mentations. In the multi-corner design approach, which has been used for many years with
dominant EDA tools, the best-case PVT corner is used to check hold violations and the worst-
case PVT corner is used to check setup time and specify the clock period. The typical PVT
corner in most technologies is characterized at a temperature of 25◦C, whereas real designs
might be running typically at a higher temperature. Therefore, information about more typical
PVT corners with higher temperatures is needed.
Appendix A
Previous Publications
Parts of this thesis have been previously published, as follows:
• Using Edge-triggering in the Asynchronous Synthesis of Write-after-read Operations, Navid
Toosizadeh and Safwat G. Zaky, In the Proc. of the International Conference on Appli-
cation of Concurrency to System Design (ACSD), Xian, China, pp. 23-28, June 2008.
• Application of Concurrency in the Asynchronous Design of Write-after-read Operations,
Navid Toosizadeh and Safwat G. Zaky, Fundamenta Informaticae Journal, IOS press,
2009.
• VariPipe: Low-overhead Variable-clock Synchronous Pipelines, Navid Toosizadeh, Safwat
G. Zaky and Jianwen Zhu, accepted for publication in the IEEE International Conference
on Computer Design (ICCD), Lake Tahoe, California, October 4-9, 2009.
Appendix B
Balsa Code for Radix-4 Booth
Multiplier
import [balsa.types.basic]
type Imm16 is 16 bits
type Imm17 is 17 bits
type Imm3 is 3 bits
type Imm9 is array 0 .. 8 of bit
type Imm1 is array 1 of bit
type Bit9 is 9 signed bits
type A3_t is array 3 of bit
procedure mult_booth (input multiplicand_multiplier: Imm16;
output prod: Imm16) is
variable lostbit : bit
variable test : bit
variable iteration : Imm3
variable product : Imm17
variable temp : Bit9
variable case_var: Imm3
shared add_sub is
begin
product := (#product[0..7] @
((((#product[8..16] as Bit9) + temp)
as Bit9) as Imm9) as Imm17) --accumulation statement(1)
end
begin
loop
multiplicand_multiplier -> then
iteration := 4;
product := ((#multiplicand_multiplier[0..7] @
(0b000000000 as Imm9)) as Imm17);
lostbit := 0;
loop while iteration > 0 then --loop for 4 iterations.
-- Depending on the sequence of bits, update the product:
case ((((lostbit as Imm1) @ #product[0..1]) as A3_t)
as Imm3) of
0b001 then
temp := ((((#multiplicand_multiplier[8..15] @
(#multiplicand_multiplier[15] as Imm1)) as Imm9))
as Bit9);
add_sub()
| 0b010 then
temp := ((((#multiplicand_multiplier[8..15] @
(#multiplicand_multiplier[15] as Imm1)) as Imm9))
as Bit9);
add_sub()
| 0b011 then
temp := (((((0b0 as Imm1) @
#multiplicand_multiplier[8..15]) as Imm9))
as Bit9);
add_sub()
| 0b100 then
temp := -(((((0b0 as Imm1) @
#multiplicand_multiplier[8..15]) as Imm9))
as Bit9);
add_sub()
| 0b101 then
temp := -((((#multiplicand_multiplier[8..15] @
(#multiplicand_multiplier[15] as Imm1)) as Imm9))
as Bit9);
add_sub()
| 0b110 then
temp := -((((#multiplicand_multiplier[8..15] @
(#multiplicand_multiplier[15] as Imm1)) as Imm9))
as Bit9);
add_sub()
end;
--save the second last bit for the next iteration:
lostbit := #product[1];
--shift product:
product := ((#product[2..16] @ (#product[16] as Imm1)
@ (#product[16] as Imm1)) as Imm17); --accumulation statement(2)
iteration := (iteration - 1 as Imm3) --accumulation statement(3)
end;
prod <- (#product[0..15] as Imm16)
end
end
end
Appendix C
Chip Layout of the PVT-aware
Processor
The areas highlighted in red in Figure C.1 are the submodules of the clock generation circuit.
There are four completion detection circuits in four quadrants of the chip and a clock pulse
generator at the center.
Figure C.1: Chip layout of the PVT-aware processor
Bibliography
[1] A. Davis and S. M. Nowick, “An introduction to asynchronous circuit design,” in mono-
graph on asynchronous design, 1997.
[2] I. E. Sutherland, “Micropipelines,” Communications of the ACM, vol. 32, no. 6, pp. 720–
738, June 1989.
[3] M. Singh and S. M. Nowick, “Mousetrap: High-speed transition-signaling asynchronous
pipelines,” IEEE Trans. on Very Large Scale Integration (VLSI) Systems, vol. 15, no. 6,
pp. 684–698, June 2007.
[4] N. Karaki, T. Nanmoto, H. Ebihara, S. Utsunomiya, S. Inoue, and T. Shimoda, “A flexible
8b asynchronous microprocessor based on low-temperature poly silicon tft technology,” in
Proc. of the IEEE International Conference on Solid-state Circuits Conference, vol. 1,
2005, pp. 272–598.
[5] L. Necchi, L. Lavagno, D. Pandini, and L. Vanzago, “An ultra-low energy asynchronous
processor for wireless sensor networks,” in Proc. of the 12th IEEE International Symposium
on Asynchronous Circuits and Systems, Mar. 2006, pp. 78–85.
[6] D. Edwards, A. Bardsley, L. Janin, L. Plana, and W. Toms, “Balsa : A tutorial guide,
version v3.5,” in http://intranet.cs.man.ac.uk/apt/projects/tools/balsa/, 2006.
[7] M. E. Dean, “STRiP: A self-timed RISC processor,” Ph.D. dissertation, Stanford Univer-
sity, 1992.
[8] T. Austin, D. Blaauw, T. Mudge, and K. Flautner, “Making typical silicon matter with
Razor,” Computer, vol. 37, no. 3, pp. 57–65, Mar. 2004.
[9] A. Bardsley and D. Edwards, “Compiling the language balsa to delay-insensitive hard-
ware,” in Hardware Description Languages and their Applications (CHDL), Apr. 1997, pp.
89–91.
[10] “HANDSHAKE SOLUTIONS,” in http://www.handshakesolutions.com.
[11] K. van Berkel, Handshake Circuits: an Asynchronous Architecture for VLSI Programming.
Cambridge University Press, 1993.
[12] A. Peeters, “Single-rail handshake circuits,” Ph.D. dissertation, Eindhoven University of
Technology, 1996.
[13] E. Humenay, D. Tarjan, and K. Skadron, “Impact of process variations on multicore per-
formance symmetry,” in Proc. of the Design, Automation and Test in Europe, Apr. 2007,
pp. 1–6.
[14] W. Kuzmicz, E. Piwowarska, A. Pfitzner, and D. Kasprowicz, “Static power consumption in
nano-cmos circuits: Physics and modelling,” in Proc. of the 14th International Conference
Mixed Design of Integrated Circuits and Systems, June 2007, pp. 163–168.
[15] E. Macii, L. Bolzani, A. Calimera, A. Macii, and M. Poncino, “Integrating clock gating
and power gating for combined dynamic and leakage power optimization in digital cmos
circuits,” in Proc. of the 11th EUROMICRO Conference on Digital System Design Archi-
tectures, Methods and Tools, Sep. 2008, pp. 298–303.
[16] A. J. Martin and M. Nystrom, “ILLIAC II-a short description and annotated bibliography,”
IEEE Transactions on Computers, no. 3, pp. 399–403, June 1965.
[17] “Arithmetic processor 166 instruction manual.” Digital Equipment Corp., Maynard, MA,
1960.
[18] D. A. Huffman, “The synthesis of sequential switching circuits,” in Sequential Machines:
Selected Papers, E. F. Moore, Ed. Reading, MA: Addison-Wesley, 1964.
[19] D. E. Muller and W. S. Bartky, “A theory of asynchronous circuits,” in Proc. Int. Symp.
Theory of Switching, 1959, pp. 204–243.
[20] T. J. Chaney and C. E. Molnar, “Anomalous behavior of synchronizer and arbiter circuits,”
IEEE Transactions on Computers, no. 4, pp. 421–422, Apr. 1973.
[21] A. J. Martin, “The design of a self-timed circuit for distributed mutual exclusion,” in Proc.
1985 Chapel Hill Conf. VLSI. H. Fuchs, Ed., 1985, pp. 245–260.
[22] A. Bardsley, “Implementing balsa handshake circuits,” Ph.D. dissertation, University of
Manchester, 2000.
[23] A. Bardsley and D. A. Edwards, “The Balsa asynchronous circuit synthesis system,” in
Forum on Design Languages, Sep. 2000.
[24] A. J. Martin et al., “The design of an asynchronous microprocessor,” in Proc. Decennial
Caltech Conf. Advanced Research in VLSI. C. L. Seitz, Ed., 1991, pp. 351–373.
[25] S. B. Furber et al., “A micropipelined ARM,” in Proc. of VII Banff Workshop: Asyn-
chronous Hardware Design, 1993.
[26] A. Takamura et al., “TITAC-2: an asynchronous 32-bit microprocessor based on scalable-
delay-insensitive model,” in IEEE International Conference on Computer Design, 1997, pp.
228–235.
[27] A. J. Martin et al., “The design of an asynchronous mips r3000 processor,” in Proc. 17th
Conf. Advanced Research in VLSI. Los Alamitos, CA: IEEE CS Press, 1997.
[28] A. J. Martin and M. Nystrom, “Asynchronous techniques for system-on-chip design,” Proc.
of the IEEE, vol. 94, no. 6, pp. 1089–1120, June 2006.
[29] S. M. Nowick and M. Singh, “Mousetrap: Ultra-high-speed transition-signaling asyn-
chronous pipelines,” in Computer Design, 2001, pp. 9–17.
[30] R. B. Reese, M. A. Thornton, and C. Traver, “Two-phase micropipeline control wrapper
with early evaluation,” Electronics Letters Online no. 20040256, IEE, no. 3, 2004.
[31] I. E. Sutherland and J. Ebergen, “Computers without clocks,” Scientific American, Aug.
2002.
[32] I. E. Sutherland and S. Fairbanks, “GasP: A minimal FIFO control,” in Proc. International
Symposium on Advanced Research in Asynchronous Circuits and Systems, 2001, pp. 46–53.
[33] W. A. Clark, “Macromodular computer systems,” in Proc. of the Spring Joint Computer
Conference, 1967.
[34] S. M. Burns, “Automated compilation of concurrent programs into self-timed circuits,”
Ph.D. dissertation, California Institute of Technology, 1991.
[35] R. E. Miller, Switching Theory, vol. II: Sequential Circuits and Machines. John Wiley &
Sons, New York, NY, 1965.
[36] A. M. Lines, “Pipelined asynchronous circuits,” 1995.
[37] R. O. Ozdag and P. A. Beerel, “High-speed QDI asynchronous pipelines,” in Proc. Inter-
national Symposium on Advanced Research in Asynchronous Circuits and Systems, 2002,
pp. 13–22.
[38] C. E. Molnar, I. W. Jones, W. S. Coates, J. K. Lexau, S. M. Fairbanks, and I. E. Sutherland,
“Two fifo ring performance experiments,” in Proc. of the IEEE, vol. 87, no. 2, 1999, pp.
297–307.
[39] W. P. Burleson, M. Ciesielski et al., “Wave-pipelining: A tutorial and research survey.”
IEEE Trans. on VLSI Systems, vol. 6, no. 3, pp. 464–474, Sep. 1998.
[40] O. Hauck and S. A. Huss, “Asynchronous wave pipelines for high throughput datapaths,”
in Proc. of the IEEE Conference on Electronics, Circuits, and Systems, vol. 1, 1998, pp.
283–286.
[41] B. D. Winters and M. R. Greenstreet, “A negative-overhead, self timed pipeline,” in Proc.
International Symposium on Advanced Research in Asynchronous Circuits and Systems,
Apr. 2002, pp. 32–41.
[42] D. M. Chapiro, “Globally-asynchronous locally-synchronous systems,” Ph.D. dissertation,
Stanford University, 1984.
[43] F. K. Gurkaynak et al., “GALS at ETH Zurich: Success or failure?” in Proc. of the 12th IEEE
International Symposium on Asynchronous Circuits and Systems, 2006, pp. 150–159.
[44] Philips Semiconductors., “P87cl888; 80c51 ultra low power (ulp) telephony controller.”
[45] D. Caucheteux, E. Beigne, M. Renaudin, and E. Crochon, “AsyncRFID: fully asynchronous
contactless systems, providing high data rates, low power and dynamic adaptation,” in
Proc. of the 12th IEEE International Symposium on Asynchronous Circuits and Systems,
2006, pp. 86–97.
[46] N. Andrikos, L. Lavagno, D. Pandini, and C. P. Sotiriou, “A fully-automated desynchro-
nization flow for synchronous circuits,” in Design Automation Conf., June 2007, pp. 982–
985.
[47] S. M. Nowick, M. E. Dean, D. L. Dill, and M. Horowitz, “The design of a high-performance
cache controller: a case study in asynchronous synthesis,” Integration, the VLSI journal,
vol. 15, no. 3, pp. 241–262, Oct. 1993.
[48] Z. Bing, H. Yong, and Q. Yulin, “An asynchronous data-path design for viterbi decoder,” in
Proc. of the 7th International Conference on Solid State and Integrated Circuits Technology,
vol. 3, 2004, pp. 1645–1648.
[49] L. A. Plana, P. A. Riocreux, W. J. Bainbridge, A. Bardsley, J. D. Garside, and S. Temple,
“SPA - a synthesisable Amulet core for smartcard applications,” in Proc. of the 8th IEEE
International Symposium on Asynchronous Circuits and Systems, 2002, pp. 201–210.
[50] J. Ebergen, A. Chow, B. Coates, J. Schauer, and D. Hopkins, “An asynchronous high-
throughput control circuit for proximity communication,” in 12th IEEE International Sym-
posium on Asynchronous Circuits and Systems, 2006, pp. 23–33.
[51] L. Fesquet and M. Renaudin, “A programmable logic architecture for prototyping clockless
circuits,” in International Conference on Field Programmable Logic and Applications, 2005,
pp. 293–298.
[52] C. Lau, “Asynchronous design FPGA,” 2004.
[53] J. Cortadella et al., “Desynchronization: Synthesis of asynchronous circuits from syn-
chronous specifications,” IEEE Transactions on Computer-Aided Design of Integrated
Circuits and Systems, vol. 25, no. 10, pp. 1904–1921, Oct. 2006.
[54] K. M. Fant and S. A. Brandt, “NULL convention LogicTM: a complete and consistent
logic for asynchronous digital circuit synthesis,” in Proc. of the International Conference
on Application Specific Systems, 1996, pp. 261–273.
[55] M. Ligthart et al., “Asynchronous design using commercial hdl synthesis tools,” in Proc. of
the IEEE 37th Annual 2003 International Carnahan Conference on Security Technology,
1996, pp. 501–507.
[56] A. Kondratyev and K. Lwin, “Design of asynchronous circuits using synchronous cad
tools,” IEEE Design and Test of Computers, vol. 19, no. 4, pp. 107–117, July 2002.
[57] “BALSA,” in http://www.cs.manchester.ac.uk/apt/projects/tools/balsa.
[58] L. A. Plana, S. Taylor, and D. Edwards, “Attacking control overhead to improve synthe-
sised asynchronous circuit performance,” in Proc. of IEEE International Conference on
Computer Design (ICCD), Oct. 2005, pp. 703–710.
[59] T. Yoneda, A. Matsumoto, M. Kato, and C. Myers, “High level synthesis of timed asyn-
chronous circuits,” in Proc. of the 11th IEEE International Symposium on Asynchronous
Circuits and Systems, Mar. 2005, pp. 178–189.
[60] A. Bink and R. York, “Arm996hs: The first licensable, clockless 32-bit processor core,”
IEEE Micro, vol. 27, no. 2, pp. 58–68, March-April 2007.
[61] N. Toosizadeh and S. G. Zaky, “Using edge-triggering in the asynchronous synthesis of
write-after-read operations,” in Proc. of the 8th International Conference on Application
of Concurrency to System Design (ACSD). Xian, China: IEEE Computer Society, June
2008, pp. 23–28.
[62] ——, “Application of concurrency in the asynchronous design of write-after-read opera-
tions,” Fundamenta Informaticae Journal, IOS Press, 2009.
[63] R. Kol and R. Ginosar, “A doubly-latched asynchronous pipeline,” in Proc. of IEEE In-
ternational Conference on Computer Design (ICCD), 1997, pp. 706–711.
[64] M. Greenstreet and K. Steiglitz, “Bubbles can make self-timed pipelines fast,” The Journal
of VLSI Signal Processing, vol. 2, no. 3, pp. 139–148, Nov. 1990.
[65] T. A. Chu, “Synthesis of self-timed control circuits from graphs: An example,” in Proc. of
IEEE International Conference on Computer Design (ICCD), 1986, pp. 565–571.
[66] T. Nanya, Y. Ueno, H. Kagotani, M. Kuwako, and A. Takamura, “Titac: Design of a quasi-
delay-insensitive microprocessor,” IEEE Design and Test of Computers, vol. 11, no. 2, pp.
50–63, Summer 1994.
[67] L. Plana and S. Nowick, “Concurrency-oriented optimization for low-power asynchronous
systems,” in Proc. of International Symposium on Low Power Electronics and Design
(ISLPED), 1996, pp. 151–156.
[68] H. van Gageldonk, “An asynchronous low-power 80c51 microcontroller,” Ph.D. disserta-
tion, Eindhoven University, 1998.
[69] I. Nitta, T. Shibuya, and K. Homma, “Statistical static timing analysis technology,” in Fujitsu Publications, Oct. 2007.
[70] N. Menezes, “The good, the bad, and the statistical,” in International Symposium on
Physical Design (ISPD), 2007, p. 168.
[71] T. Gemmeke, M. Gansen, H. J. Stockmanns, and T. G. Noll, “Design optimization of
low-power high-performance DSP building blocks,” IEEE Journal of Solid-State Circuits,
vol. 39, no. 7, pp. 1131–1139, July 2004.
[72] S. Borkar, T. Karnik et al., “Parameter variations and impact on circuits and microar-
chitecture,” in Proc. of the ACM/IEEE Design Automation Conference, June 2003, pp.
338–342.
[73] A. K. Uht, “Uniprocessor performance enhancement through adaptive clock frequency control,” IEEE Transactions on Computers, vol. 54, no. 2, pp. 132–140, Feb. 2005.
[74] K. von Arnim, E. Borinski, and P. Seegebrecht, “Efficiency of body biasing in 90-nm CMOS for low-power digital circuits,” IEEE Journal of Solid-State Circuits, vol. 40, no. 7, pp. 1549–1556, July 2005.
[75] J. W. Tschanz, J. Y. Kao, S. G. Narendra et al., “Adaptive body bias for reducing impacts of die-to-die and within-die parameter variations on microprocessor frequency and leakage,” IEEE Journal of Solid-State Circuits, vol. 37, no. 11, pp. 1396–1402, Nov. 2002.
[76] T. D. Burd, A. Pering, A. J. Stratakos, and R. W. Brodersen, “A dynamic voltage scaled microprocessor system,” IEEE Journal of Solid-State Circuits, vol. 35, no. 11, pp. 1571–1580, Nov. 2000.
[77] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach. Morgan Kaufmann, 2007.
[78] “ASPIDA,” in http://www.opencores.org/projects.cgi/web/aspida.
[79] “Clock domain crossing: Closing the loop on clock domain functional implementation
problems,” in http://w2.cadence.com/whitepapers/cdc_wp.pdf. Cadence Design Systems,
Inc., 2004.
[80] A. Lines, “Asynchronous interconnect for synchronous SoC design,” IEEE Micro, vol. 24,
no. 1, pp. 32–41, Feb. 2004.
[81] “ASIC design flow,” in http://www.faraday-tech.com/html/products/asic/Design_Flow.html.
[82] “DLX GCC,” in http://www2.ucsc.edu/courses/cmps111-elm/dlx.
[83] “MiBench,” in http://www.eecs.umich.edu/mibench.
[84] L. H. Lee, B. Moyer, and J. Arends, “Instruction fetch energy reduction using loop caches
for embedded applications with small tight loops,” in Proc. of International Symposium
on Low Power Electronics and Design (ISLPED), 1999, pp. 267–269.
[85] B. Gorjiara and D. Gajski, “Automatic architecture refinement techniques for customizing
processing elements,” in Proc. of the 45th ACM/IEEE Design Automation Conf., June
2008, pp. 379–384.
[86] “NISC technology website,” in http://www.cecs.uci.edu/~nisc.
[87] S. Sirowy, W. Yonghui, S. Lonardi, and F. Vahid, “Clock-frequency assignment for multiple
clock domain systems-on-a-chip,” in Proc. of the Design, Automation & Test in Europe
Conference & Exhibition, Apr. 2007, pp. 1–6.
[88] N. Toosizadeh, S. G. Zaky, and J. Zhu, “Varipipe: Low-overhead variable-clock syn-
chronous pipelines,” in Proc. of the 27th IEEE International Conference on Computer
Design (ICCD), Lake Tahoe, California, Oct. 2009.
[89] L. Benini, E. Macii, and M. Poncino, “Telescopic units: Increasing the average throughput of pipelined designs by adaptive latency control,” in Design Automation Conf., June 1997, pp. 22–27.
[90] H. W. Ott, Noise Reduction Techniques in Electronic Systems, 2nd ed. John Wiley &
Sons, 1988.
[91] Dallas Semiconductor, “App note 3512: Spread-spectrum clock oscillators lower EMI,” 2005.
[92] H. G. Epassa et al., “Implementation of a cycle by cycle variable speed processor,” in Proc.
of ISCAS, May 2005, pp. 3335–3338.
[93] S. M. Nowick, “Design of a low-latency asynchronous adder using speculative completion,”
in IEE Proceedings – Computers and Digital Techniques, Sep. 1996, pp. 301–307.