Copyright
by
Jae-Seok Yang
2011
The Dissertation Committee for Jae-Seok Yang certifies that this is the approved version of the following dissertation:
Nanometer VLSI Design-Manufacturing Interface for
Large Scale Integration
Committee:
David Z. Pan, Supervisor
Jacob Abraham
Michael Orshansky
Frank Liu
Nur Touba
Nanometer VLSI Design-Manufacturing Interface for
Large Scale Integration
by
Jae-Seok Yang, B.S.; M.S.; M.S.
DISSERTATION
Presented to the Faculty of the Graduate School of
The University of Texas at Austin
in Partial Fulfillment
of the Requirements
for the Degree of
DOCTOR OF PHILOSOPHY
THE UNIVERSITY OF TEXAS AT AUSTIN
May 2011
Dedicated to my father Jun-Hwan Yang,
my mother Mrs. Soon-Hui Jeong
and my wife Mrs. Yoon-Jeong Cho
Acknowledgments
First and foremost, I would like to thank my research advisor Prof.
David Z. Pan. He has been always readily available for technical guidance and
discussions, and I have gained many other skills from him such as identifying
research, executing research ideas, writing clear papers, and time management
during my graduate life. He has not only guided me on research, but also
provided psychological support during times of turmoil in my life. Without
his guidance and support, this dissertation would not have been completed.
I would like to acknowledge the help of my fellow graduate students. I
am thankful for the technical discussions and friendly support they have
provided. Past and current graduate students without whose help this dissertation
would not have been possible include: James Ban, Minsik Cho, Jiwoo Pak,
Duo Ding, Jerrica Gao, Ou He, Wooyoung Jang, Anurag Kumar, Katrina Lu,
Joydeep Mitra, Ashutosh Chakraborty, Kun Yuan, HyunJin Kim, Yilin Zhang
and Bei Yu. It has been my privilege to work with you all. I would especially
like to thank James Ban, who has provided me with immense support throughout these
years.
I would like to express deep gratitude to my PhD committee members:
Prof. Jacob Abraham, Prof. Michael Orshansky, Dr. Frank Liu and Prof.
Nur Touba for agreeing to serve in my PhD committee despite their very busy
schedules, and for their insightful comments and discussions. I would also
like to thank Prof. Sung Kyu Lim and GTCAD lab members at School of
ECE, Georgia Institute of Technology including Krit Athikulwongse, Young-
Joon Lee, Xin Zhao, Dae Hyun Kim and Moongon Jung. Collaboration with
GTCAD lab members has been a great opportunity to take on new research
challenges.
Special thanks to the friendly and professional administrative staff at
the ECE Department and CERC at the University of Texas at Austin (UT). In
particular, I am grateful for the support and encouragement from Debi Prather
and Melanie Gulick, the ECE graduate program coordinator.
I would especially like to thank Prof. Earl Swartzlander and Prof. Mark
McDermott, under whom I took several graduate courses which helped me
improve my research work. I would also like to thank CAE team members
in Samsung for the friendly support they have provided. Past and current
CAE team members without whose help this dissertation would not have been
possible include: Dr. Jeong-Taek Kong, Dr. Chul-Hong Park, Joon-Ho Choi,
Sanghoon Lee, Jong-Bae Lee, Moon-Hyun Yoo, and Young-Kwan Park. I would
especially like to thank Dr. Jeong-Taek Kong, who gave me the opportunity to start the PhD
program and provided me with immense support throughout it, both
in times of deep despair and in times of joy.
Finally, and most importantly, I am indebted to the love and support
that I have received from my family members. In particular, I am thankful to my
wife, my parents, sister, parents-in-law, sister-in-law, and brother-in-law, who made
several sacrifices to ensure that I could pursue my dream of earning a PhD.
Before I joined the University of Texas at Austin, I was strongly inclined to go back
to industry and give up pursuing a PhD. If not for the immense emotional
support and constant encouragement that I got from my wife, I would not
have been able to complete my PhD. I would also like to thank my son, who
will be born in July, 2011.
Nanometer VLSI Design-Manufacturing Interface for
Large Scale Integration
Publication No.
Jae-Seok Yang, Ph.D.
The University of Texas at Austin, 2011
Supervisor: David Z. Pan
As nanometer Very Large Scale Integration (VLSI) demands more transistor
density to fabricate multi-cores and memory blocks in a limited die size,
much research has been performed to sustain Moore’s Law in two different
ways: 2D geometric shrinking and 3D vertical wafer stacking. For geometric
shrinking, nano-patterning with 193nm lithography equipment is one of the
most fundamental challenges beyond 22nm, while next-generation lithography,
such as Extreme Ultra-Violet (EUV) lithography, still faces tremendous
challenges for volume production in the near future. As a practical solution,
Double Patterning Lithography (DPL) has become a leading candidate for
the sub-20nm lithography process. Another approach for multi-core integration is
3D wafer stacking with Through Silicon Via (TSV). Computer-Aided-Design
(CAD) approaches to enable robust DPL and TSV technology are the main
focus of this dissertation.
DPL poses new challenges for overlay and layout decomposition. Therefore,
overlay induced variation modeling and efficient decomposition for better
manufacturability are in great demand. Since the variation of metal spacing
caused by overlay results in coupling capacitance variation, we first model
metal spacing variation with individual overlay sources. Then, all overlay
sources are considered together to determine the worst-case timing with coupling
capacitance variation. A non-parallel pattern caused by overlay is converted to a
parallel one with an equivalent spacing that has the same delay, so that a
traditional RC extraction flow can be applied. Our experiments show that the delay variation
due to overlay in DPL can be up to 9.1%, and that a well decomposed layout can
reduce the variability.
For DPL layout decomposition, we propose a multi-objective and flex-
ible framework for stitch minimization, balanced density, and overlay com-
pensation, simultaneously. We use a graph theoretic algorithm for minimum
stitch insertion and balanced density. Additional decomposition constraints
for overlay compensation are obtained by Integer Linear Programming (ILP).
Robust contact decomposition can be obtained with additional constraints.
With these constraints, global decomposition is performed using a modified
Fiduccia-Mattheyses (FM) graph partitioning algorithm. Experimental re-
sults show that the proposed framework is highly scalable and fast: we can
decompose all 15 benchmark circuits in five minutes in a density balanced
fashion, while an ILP-based approach can finish only the smallest five circuits.
In addition, we can remove more than 95% of the timing variation induced by
overlay for tested structures.
Three-dimensional integration has new manufacturing and design chal-
lenges such as device variation due to TSV induced stress and timing corner
mismatch between different stacked dies. Since TSV fill material and silicon
have different Coefficients of Thermal Expansion (CTE), TSV causes silicon
deformation due to the temperature difference between chip manufacturing and
operation. Therefore, the systematic variation due to TSV induced stress should be
considered for robust 3D IC design. We propose systematic TSV stress aware
timing analysis and show how to optimize layout for better performance. First,
a stress contour map with an analytical radial stress model is generated. Then,
the tensile stress is converted to hole and electron mobility variations depend-
ing on geometric relations between TSVs and transistors. Mobility variation
aware cell library and netlist are generated and incorporated in an industrial
timing engine for 3D-IC timing analysis. TSV stress induced timing variations
can be as much as ± 10% for an individual cell. As an application for layout
optimization, we can exploit the stress-induced mobility enhancement to im-
prove timing on critical cells. We show that stress-aware perturbation could
reduce cell delay by up to 14.0% and critical path delay by 6.5% in our test
case.
Three-dimensional Clock Tree Synthesis (3D CTS) is one of the main
design difficulties in 3D integration because the clock network spreads over all
tiers. In 3D CTS, timing corner mismatch between tiers arises because each
tier is manufactured in an independent process. Therefore, inter-die variation
should be considered when analyzing and optimizing paths that span several
tiers in 3D CTS. In addition, mobility variation of a clock buffer due to stress
from TSV can cause unexpected skew which degrades overall chip performance.
Therefore, we propose clock period optimization that considers both timing corner
mismatch and TSV induced stress. In our experiments, we show that our clock
buffer tier assignment reduces clock period variation by up to 34.2%, and
most of the stress-induced skew can be removed by our stress-aware CTS. Overall,
we show that performance gain can be up to 5.7% with the proposed CTS
algorithm.
As technology scaling continues toward 14nm and 3D-integration, this
dissertation addresses several key issues in the design-manufacturing interface,
and proposes unified analysis and optimization techniques for effective design
and manufacturing integration.
Table of Contents
Acknowledgments v
Abstract viii
List of Tables xv
List of Figures xvi
Chapter 1. Introduction 1
1.1 Challenges for Nanometer VLSI Design and Manufacturing . . 1
1.2 Overview and Contribution of this Dissertation . . . . . . . . . 8
Chapter 2. Overlay aware Timing Variation Modeling for DPL 11
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Preliminaries and Related Works . . . . . . . . . . . . . . . . . 13
2.2.1 DPT Process . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.2 Capacitance Variation induced by Overlay . . . . . . . . 14
2.2.3 Sources of Overlay . . . . . . . . . . . . . . . . . . . . . 15
2.3 Layout Distortion Estimation . . . . . . . . . . . . . . . . . . 18
2.3.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . 18
2.3.2 Translation Overlay Consideration . . . . . . . . . . . . 20
2.3.3 Rotation Overlay Consideration . . . . . . . . . . . . . . 22
2.3.4 Magnification Overlay Consideration . . . . . . . . . . . 24
2.3.5 Overall Capacitance Variation with Overlay . . . . . . . 25
2.4 Timing Analysis with Parameterized Capacitance . . . . . . . 28
2.5 Applications and Results . . . . . . . . . . . . . . . . . . . . . 30
2.6 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Chapter 3. A Graph-Theoretic, Multi-Objective Layout Decomposition 38
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2 Decomposition Requirements and Motivation . . . . . . . . . . 41
3.2.1 Balanced Density . . . . . . . . . . . . . . . . . . . . . . 41
3.2.2 Robust Contact Decomposition . . . . . . . . . . . . . . 42
3.2.3 Overlay Compensation . . . . . . . . . . . . . . . . . . . 43
3.3 Bi-partitioning Based Decomposition . . . . . . . . . . . . . . 44
3.3.1 Overall Decomposition Flow . . . . . . . . . . . . . . . 44
3.3.2 Finding Neighboring Rectangles . . . . . . . . . . . . . 47
3.3.3 Grouping and Relative Coloring . . . . . . . . . . . . . 48
3.3.4 Group Color Assignment Problem . . . . . . . . . . . . 49
3.3.5 Min-Cut based Stitch Minimization . . . . . . . . . . . 50
3.3.6 Modified Graph Min-Cut Partitioning . . . . . . . . . . 53
3.3.7 Complexity Analysis . . . . . . . . . . . . . . . . . . . . 56
3.4 Application to Contact Layer Decomposition . . . . . . . . . . 57
3.5 Application to Robust Metal Layer Decomposition . . . . . . . 60
3.5.1 TDD Constraints for Overlay Compensation . . . . . . . 60
3.5.2 TDD Constraints aware Decomposition . . . . . . . . . 65
3.6 Application to Hierarchical Decomposition . . . . . . . . . . . 67
3.7 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 68
3.8 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Chapter 4. TSV Stress aware Timing Analysis and Layout Optimization for 3D-ICs 81
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.2 Related Work and Motivation . . . . . . . . . . . . . . . . . . 83
4.3 Overall TSV Stress aware Design Flow . . . . . . . . . . . . . 86
4.4 TSV Stress and Mobility Variation Modeling . . . . . . . . . . 88
4.4.1 Mobility Variation for a Single TSV . . . . . . . . . . . 88
4.4.2 Mobility Variation for Multiple TSVs . . . . . . . . . . 91
4.5 Timing Analysis with TSV Stress Consideration . . . . . . . . 94
4.5.1 Timing Analysis for 3D-ICs . . . . . . . . . . . . . . . . 95
4.5.2 Timing Library for Mobility Variation . . . . . . . . . . 96
4.6 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 98
4.7 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Chapter 5. Robust Clock Tree Synthesis with Timing Yield Optimization for 3D-ICs 104
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.2 Related Work and Motivation . . . . . . . . . . . . . . . . . . 106
5.2.1 Clock Buffer Tier Assignment . . . . . . . . . . . . . . . 107
5.2.2 Clock Buffer Variation due to TSV induced Stress . . . 109
5.3 Robust Clock Tree Design . . . . . . . . . . . . . . . . . . . . 112
5.3.1 σCP Minimization for Critical Paths . . . . . . . . . . . 114
5.3.2 Buffer Variation Modeling under TSV induced Stress . . 119
5.3.3 Three-Dimensional Buffered Clock Tree Synthesis (CTS) 122
5.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 126
5.5 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
Chapter 6. Conclusions 134
Bibliography 137
Vita 153
List of Tables
2.1 Timing Errors ignoring the dependency. . . . . . . . . . . . . . . . . 29
3.1 Runtime comparison (mins = 42nm) . . . . . . . . . . . . . . . 73
3.2 Decomposition results with ISCAS benchmark (mins = 54nm, minom = 20nm), ILP solver (exact method) . . . . . . . . . . . 74
3.3 Decomposition results with ISCAS benchmark (mins = 54nm, minom = 20nm), Graph Partition (Proposed heuristic) . . . . . 74
3.4 Overlay compensation with TDD . . . . . . . . . . . . . . . . 79
4.1 TSV specification . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.2 Longest path delay and TNS comparison . . . . . . . . . . . . 100
4.3 Gate optimizations on the target path with perturbation . . . 101
5.1 Buffer rising delay variation according to mobility changes (nominal delay: 210ps) . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.2 Circuit Information . . . . . . . . . . . . . . . . . . . . . . . . 128
5.3 Skew change due to TSV stress according to clock source z-location (CKT9) . . . . . . . . . . . . . . . . . . . . . . . . . 128
5.4 Clock period analysis result. Case1 (No covariance optimization is done, and no stress is considered) . . . . . . . . . . . . . . . 129
5.5 Clock period analysis result. Case2 (No covariance optimization is done, but TSV stress is considered) . . . . . . . . . . . . . . 130
5.6 Clock period analysis result. Case3 (Covariance optimization is done, but no stress is considered) . . . . . . . . . . . . . . . . 131
5.7 Clock period analysis result. Case4 (Covariance optimization is done, and TSV stress is considered) . . . . . . . . . . . . . . . 132
5.8 No stress, no inter-die aware CTS vs. our stress and inter-die aware CTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
List of Figures
1.1 Lithography trend to follow Moore’s law [Courtesy Intel]. . . . 1
1.2 Mask split in DPL . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Unwanted pattern distortion in DPL . . . . . . . . . . . . . . 4
1.4 An illustration showing difficulties in DPL . . . . . . . . . . . 5
1.5 TSV can cause tensile stress in silicon near TSV. . . . . . . . 6
2.1 DPT process and patterning result. . . . . . . . . . . . . . . . 13
2.2 Effective capacitance variation. . . . . . . . . . . . . . . . . . 14
2.3 Translation overlay variable. . . . . . . . . . . . . . . . . . . . 16
2.4 Rotation overlay variable. . . . . . . . . . . . . . . . . . . . . 17
2.5 Magnification overlay variable. . . . . . . . . . . . . . . . . . . 17
2.6 Vector expression of overlay. . . . . . . . . . . . . . . . . . . . 19
2.7 ∆S dependency on location. . . . . . . . . . . . . . . . . . . . 20
2.8 ∆ST for various geometric relations. . . . . . . . . . . . . . . . 21
2.9 ∆XR and ∆YR for rotation overlay. . . . . . . . . . . . . . . . 22
2.10 β′ for various geometric relations. . . . . . . . . . . . . . . . . 24
2.11 ∆XM and ∆YM for magnification overlay. . . . . . . . . . . . 25
2.12 ∆S dependency on location. . . . . . . . . . . . . . . . . . . . 26
2.13 Equivalent n-π model for Seqv calculation. . . . . . . . . . . . 26
2.14 Overlay aware timing analysis flow. . . . . . . . . . . . . . . . 30
2.15 Test structure used to verify overlay aware timing analysis flow 31
2.16 Delay variation with overlay when target layout is in near center. Translation overlay is dominant. . . . . . . . . . . . . . . 32
2.17 Delay variation with overlay when target layout is in near edge of die. Rotation and magnification overlay as well as translation overlay have an impact on delay variation. . . . . . . . . . . . 33
2.18 Test structure for an alternative decomposition which has less overlay effect on timing . . . . . . . . . . . . . . . . . . . . . . 34
2.19 Comparison of delay variation between two different decomposition schemes. . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.20 Compensation trend of overlay induced variation as logic depth increases. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.21 Timing variation over θ for multiple metal layers. . . . . . . . 37
3.1 Layout decomposition. . . . . . . . . . . . . . . . . . . . . . . 39
3.2 Aerial image comparison of decomposed layout with different density. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.3 An illustration of contact to gate space variation due to overlay. 43
3.4 Overlay compensated decomposition. . . . . . . . . . . . . . . 44
3.5 Overall flow of the decomposition framework. . . . . . . . . . 45
3.6 Decomposition results after each step (AOI2 cell, Metal1). . . 46
3.7 Re-segmentation and grouping process. . . . . . . . . . . . . . 48
3.8 An example showing group color assignment formulation after relative coloring. . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.9 An example showing how to handle a repulsive layout pair in FM partitioning algorithm. . . . . . . . . . . . . . . . . . . . . 53
3.10 Group color assignment with different partitioning scheme and decomposition results. . . . . . . . . . . . . . . . . . . . . . . 55
3.11 Layout division for local balance. . . . . . . . . . . . . . . . . 56
3.12 ∆Ion variation for various overlay cases. . . . . . . . . . . . . . 58
3.13 An application to contact layer decomposition. . . . . . . . . . 59
3.14 An example with four neighbors. . . . . . . . . . . . . . . . . 64
3.15 Different decomposition with TDD. . . . . . . . . . . . . . . . 65
3.16 Decomposition with TDD constraints. . . . . . . . . . . . . . . 66
3.17 Flattened layout decomposition. . . . . . . . . . . . . . . . . . 69
3.18 Hierarchical layout decomposition. . . . . . . . . . . . . . . . . 70
3.19 Metal 1 decomposition with our framework. . . . . . . . . . . 71
3.20 Graph showing runtime increment as circuits are bigger. . . . 72
3.21 Extremely unbalanced decomposition (S38584). . . . . . . . . 75
3.22 C432 decomposed layout. . . . . . . . . . . . . . . . . . . . . . 76
3.23 The balanced decomposed layout has less EPE than the unbalanced one (C432). . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.24 Delay variation for an inverter chain to verify the robust contact decomposition in DPL. . . . . . . . . . . . . . . . . . . . . . . 78
3.25 Reduction of timing variation as more stitches are inserted (Net3). 80
4.1 Thermal stress around TSV. . . . . . . . . . . . . . . . . . . . 84
4.2 Mobility change due to tensile stress. . . . . . . . . . . . . . . 85
4.3 Overall flow for TSV stress aware design. . . . . . . . . . . . . 87
4.4 Optimal orientation of MOSFET to maximize the mobility for (001) surface, 〈110〉 channel. . . . . . . . . . . . . . . . . . . . 89
4.5 Stress contour map for a single TSV with 0.5um KOZ. . . . . 91
4.6 Mobility contour map for a TSV. . . . . . . . . . . . . . . . . 92
4.7 Linear super-position of TSV stress. . . . . . . . . . . . . . . . 93
4.8 Zigzag TSV placement has less (∆µ/µ)h between rows due to compensation. . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.9 Timing corner determination according to mobility variation. . 95
4.10 Timing corner with TSV stress. . . . . . . . . . . . . . . . . . 96
4.11 Inverter delay variation with different (∆µ/µ)h and (∆µ/µ)e. . 97
4.12 ∆µ/µ contour map for 22 x 21 TSV array. . . . . . . . . . . . 99
4.13 Cell perturbation to take advantage of mobility variation. . . . 103
5.1 Clock path p with clock buffers . . . . . . . . . . . . . . . . . 107
5.2 Thermal stress around TSV. . . . . . . . . . . . . . . . . . . . 110
5.3 Buffer delay variation due to TSV stress. . . . . . . . . . . . . 111
5.4 Overall proposed 3D CTS flow. . . . . . . . . . . . . . . . . . 112
5.5 An illustration of the buffer tier assignment procedure in a bottom-up manner. Note that MP1’s location is changed at step (c) to minimize #TSVs. . . . . . . . . . . . . . . . . . . . 114
5.6 Necessity of #TSV control between two consecutive clock buffers. 118
5.7 TSV induced stress and clock buffer variation modeling. . . . 121
5.8 Wire, TSV, and buffer modeling for delay calculation. . . . . . 124
5.9 Illustrations for merging point determination. . . . . . . . . . 125
5.10 Runtime trend. . . . . . . . . . . . . . . . . . . . . . . . . . . 132
5.11 Trade-off between covariance and # TSVs. . . . . . . . . . . . 133
Chapter 1
Introduction
1.1 Challenges for Nanometer VLSI Design and Manufacturing
As nanometer very large scale integration (VLSI) becomes the main-
stream of semiconductor industry, more transistors need to be integrated into
one chip. There are mainly two approaches to integrate more transistors within
limited chip size for multi-core design. The first approach is geometrical shrink-
ing of feature size (2D-integration), and the second one is vertical die stacking
with TSVs (3D-integration).
Figure 1.1: Lithography trend to follow Moore’s law [Courtesy Intel].
Geometric shrinking has been preferred to increase chip performance
as well as transistor density. However, the mainstream lithography technology
using 193nm stepper has been facing severe limitations [8, 29, 50] for sub-20nm
technology node as shown in Fig. 1.1. The smallest feature size that can be printed is
defined as K1·λ/NA, in which K1 is referred to as the K-factor for a given process,
λ is the wavelength of the light source, and NA is the numerical aperture determined
by lens size. When we use immersion lithography of NA=1.35, the K-factor must be
about 0.22 to print a 64nm pitch (half pitch: 32nm) pattern. However, the
theoretical limit of the K-factor is 0.25, even with intensive OPC [21, 82, 87]. One
possible solution to overcome the limitation is to use a higher-NA lithography
system. Chip makers have been using immersion lithography for sub-40nm
patterning, which enhances NA from 0.93 (dry) to 1.35 (wet). However, it
is hard to find a new liquid material to increase NA beyond 1.35 in the
near future [1, 27, 39]. Electron beam lithography and nano-imprint have
serious limitations due to throughput and mask defects [51, 86, 98]. As an
ideal solution, EUV (Extreme Ultra-Violet) lithography has been proposed.
Since the wavelength of the EUV light source is 13.5nm, sub-20nm patterning is
possible with EUV. However, EUV lithography equipment is not available for
20nm production due to technical barriers such as the lack of power sources,
resists, and defect-free masks.
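The resolution relation K1·λ/NA quoted above can be checked numerically. The following is a minimal sketch that plugs in only the values stated in the text (193nm source, NA of 0.93 and 1.35, a K-factor limit of 0.25); the function and constant names are illustrative and do not come from the dissertation.

```python
# Resolution relation from the text: feature = K1 * wavelength / NA.
ARF_WAVELENGTH_NM = 193.0   # source wavelength quoted in the text
NA_DRY = 0.93               # dry lithography NA quoted in the text
NA_IMMERSION = 1.35         # immersion lithography NA quoted in the text
K1_LIMIT = 0.25             # theoretical K-factor limit quoted in the text


def required_k1(half_pitch_nm, wavelength_nm, numerical_aperture):
    """K-factor needed to print a given half pitch."""
    return half_pitch_nm * numerical_aperture / wavelength_nm


def smallest_half_pitch_nm(k1, wavelength_nm, numerical_aperture):
    """Smallest printable half pitch for a given K-factor."""
    return k1 * wavelength_nm / numerical_aperture


k1_for_32nm = required_k1(32.0, ARF_WAVELENGTH_NM, NA_IMMERSION)
print(f"K1 needed for 32nm half pitch: {k1_for_32nm:.3f}")
for label, na in (("dry", NA_DRY), ("immersion", NA_IMMERSION)):
    limit = smallest_half_pitch_nm(K1_LIMIT, ARF_WAVELENGTH_NM, na)
    print(f"single-exposure limit ({label}): {limit:.1f} nm half pitch")
```

Under these inputs the required K-factor for a 32nm half pitch comes out near 0.22, below the 0.25 limit, which is why a single 193nm immersion exposure cannot print this pitch.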
(a) Layout before decomposition (b) Layout after decomposition
Figure 1.2: Mask split in DPL
An alternative choice for 20nm patterning is Double Patterning Lithography
(DPL) [4, 48, 55, 67, 68], which prints two lines in a pitch as illustrated in
Fig. 1.2 and has been proposed as a strong candidate for the 20nm/14nm
technology nodes. Double patterning can be implemented in three ways:
Litho-Etch-Litho-Etch (LELE), Litho-Freeze-Litho-Etch (LFLE), and
Self-Aligned Double Patterning (SADP). LELE uses
two lithography exposures and etches on hard-mask to create smaller chip
features. LFLE works by freezing the developed resist pattern of the first ex-
posure, then adding a second resist layer immediately on top for the second
exposure. The resist pattern is etched at one time after developing. SADP
works by depositing a spacer layer over the chip covering all hard mask fea-
tures. The covered layer is selectively etched away leaving two sidewalls along
any ridge, then the ridge is removed [5, 6, 76]. LELE and LFLE require more
accurate overlay control to align two exposures [18, 24, 37, 48, 57]. LFLE uses
fewer processing steps [2, 78]. However, LFLE requires insensitive resists be-
tween two lithography steps. SADP has less flexibility for 2-D patterning and
requires more processing steps such as trimming masks. The overlay require-
ment of SADP is less stringent than for other double patterning methods.
Therefore, SADP is widely used for NAND-flash which has 1-D structures and
requires the most advanced technology.
(a) Designed pattern (b) Printed pattern with overlay
Figure 1.3: Unwanted pattern distortion in DPL
Since metal layers inherently have 2-D structures, we need to resolve the
problems of the LELE/LFLE approach, which are overlay (Fig. 1.3) and efficient
layout decomposition. The advanced lithography equipment has about
cient layout decomposition. The advanced lithography equipment has about
5nm of overlay in 3-σ level which is relatively large compared with 14nm half
pitch size [48, 57]. Since overlay may change the circuit behavior such as tim-
ing, noise, and power due to the distortion of layout pattern as shown in
Fig. 1.3(b), we need a methodology to estimate the variation due to overlay
during chip design. Metal layer variation due to overlay can affect the circuit
performance and yield seriously because overlay changes coupling capacitance
between metals. In this dissertation, we focus on a systematic method to estimate
the performance variation with several overlay variables.
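To get a feel for why a 5nm overlay matters at a 14nm half pitch, the sketch below uses a deliberately simple parallel-plate approximation in which coupling capacitance scales as 1/spacing. It assumes a wire on one mask with both neighbours on the other mask, a nominal spacing equal to the half pitch, and a pure translation overlay; these are illustrative assumptions, not the analytical model developed in Chapter 2.

```python
def coupling_ratio(spacing_nm, overlay_nm):
    """Total side-coupling capacitance with overlay, relative to nominal.

    Assumes C is proportional to 1/spacing. An overlay shift narrows the gap
    on one side of the wire and widens the gap on the other side equally.
    """
    if abs(overlay_nm) >= spacing_nm:
        raise ValueError("overlay must be smaller than the nominal spacing")
    shifted = 1.0 / (spacing_nm - overlay_nm) + 1.0 / (spacing_nm + overlay_nm)
    nominal = 2.0 / spacing_nm
    return shifted / nominal


# Values quoted in the text: 5nm overlay (3-sigma) against a 14nm half pitch.
ratio = coupling_ratio(14.0, 5.0)
print(f"coupling capacitance increase: {100.0 * (ratio - 1.0):.1f}%")
```

In this toy model the narrowed gap gains more capacitance than the widened gap loses, so the total coupling grows by roughly 15% even though the average spacing is unchanged.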
(a) Potential yield loss due to stitches (b) Irresolvable conflict
Figure 1.4: An illustration showing difficulties in DPL
All types of DPL require layout decomposition before manufacturing
[16, 83, 85, 100]. When two features are closer to each other than the minimum
spacing rule allows, they need to be decomposed into two different masks for LELE
and LFLE. SADP also requires layout decomposition to assign a feature to
a specific sidewall. During decomposition, a coloring conflict can be resolved
by inserting stitches. However, minimum stitch insertion is preferred because
stitch insertion requires overlap margin, resulting in unwanted chip area
increase. In addition, stitches may result in significant printability degradation
due to overlay error and line-end effect [17] as shown in Fig. 1.4(a). Not all
conflicts can be resolved by inserting stitches. An irresolvable conflict is called
a native conflict, as shown in Fig. 1.4(b). The only way to remove native conflicts is to
modify layout [97]. After layout modification, we need to decompose layout
iteratively. Therefore, a fast decomposition algorithm is required to shorten
the time of an iterative layout modification and decomposition. Another re-
quirement is density balancing between decomposed patterns. In addition,
we observe that overlay effect on circuit variation can be removed with well-
decomposed layout. The key challenge of DPL is to accomplish high quality
decomposition for large-scale layouts under reasonable runtime with the follow-
ing objectives: a) the number of stitches is minimized, b) the balance between
two decomposed layers is maximized for further enhanced patterning, c) predefined
coloring relations are kept for more robust contact decomposition, d)
the impact of overlay on coupling capacitance is reduced for less timing
variation [93]. We present a scalable and flexible multi-objective decomposition
framework that addresses all of the above objectives.
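The coloring view of decomposition described above can be illustrated with a small sketch. It treats each feature as a graph node and each pair of features that are too close as a conflict edge, then tries to assign the two masks by breadth-first 2-coloring. This is only a textbook bipartiteness check, not the framework proposed in this dissertation: it does not insert stitches, balance density, or handle overlay, and the function name and data layout are hypothetical.

```python
from collections import deque


def two_color(num_features, conflicts):
    """Assign each feature to mask 0 or 1 so that conflicting features differ.

    Returns (colors, None) on success, or (None, (u, v)) where (u, v) is a
    conflict edge that closes an odd cycle and cannot be colored as is.
    """
    adjacency = [[] for _ in range(num_features)]
    for u, v in conflicts:
        adjacency[u].append(v)
        adjacency[v].append(u)

    colors = [None] * num_features
    for start in range(num_features):
        if colors[start] is not None:
            continue
        colors[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if colors[v] is None:
                    colors[v] = 1 - colors[u]   # opposite mask
                    queue.append(v)
                elif colors[v] == colors[u]:
                    return None, (u, v)         # odd cycle found
    return colors, None


# Three features in a row: decomposable onto two masks.
print(two_color(3, [(0, 1), (1, 2)]))
# Three mutually conflicting features: an odd cycle.
print(two_color(3, [(0, 1), (1, 2), (0, 2)]))
```

An odd cycle reported here is a coloring conflict. Whether it can be resolved by splitting one feature with a stitch or is a native conflict that needs a layout change depends on the geometry, which this sketch does not model.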
Many chip makers have released 3D die integration with wire bonding.
The advantages of 3D integration include heterogeneous integration and chip
density increase by stacking dies vertically. 3D integration with TSV has
become a main focus for future multi-core integration because of several
additional benefits compared with the wire bonding scheme, including wire
length reduction and increased bandwidth between dies.
Figure 1.5: TSV can cause tensile stress in silicon near TSV.
However, 3D-ICs with TSVs have new manufacturing challenges. Tungsten (W),
poly-silicon, and copper (Cu) have all been considered as fill materials
of TSVs. Since copper has low resistivity, it is a widely used material
for TSV fill. However, the CTE of copper differs from that of silicon, which can be
a source of silicon strain. The CTE mismatch between copper and silicon
causes inevitable stress on silicon near TSVs. Because copper electroplating
and annealing temperature is higher than operating temperature, tensile stress
appears on silicon [54, 74] after cooling down to room temperature as shown
in Fig. 1.5.
The tensile stress on silicon causes reliability problems such as cracking
[40, 61, 65]. In addition, the stress can change the mobility of carriers. Therefore,
TSV stress induced by CTE mismatch may cause timing violations if
cells on a critical path are placed near TSVs. Tensile stress enhances electron
mobility. However, hole mobility is either enhanced or degraded depending
on the relative direction of the TSV and the transistor channel. Longitudinal tensile
stress reduces hole mobility while transverse tensile stress increases it [80]. When
TSV induced tensile stress is 100MPa and the stress acts in the longitudinal
direction, hole mobility degradation can be up to 7.2%, which slows PMOS
transitions. If the PMOS is on a critical path, it can cause an unexpected setup
time violation which is not detected with the current timing analysis flow. We
propose a design flow to analyze timing variation caused by TSV induced stress, and
show its implications for layout optimizations during 3D-IC design.
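The 7.2% figure above is consistent with a linear piezoresistance model, in which the relative mobility change equals the negative of the piezoresistance coefficient times the stress. The sketch below assumes the commonly cited bulk-silicon coefficients for holes with a 〈110〉 channel on a (001) wafer (71.8e-11 per Pa longitudinal, -66.3e-11 per Pa transverse). These values come from the general literature and are an assumption here; the dissertation's own mobility model is developed in Chapter 4 and may differ.

```python
# Assumed piezoresistance coefficients for holes in bulk silicon,
# <110> channel on a (001) wafer, in 1/Pa (literature values, not from the text).
PI_HOLE = {"longitudinal": 71.8e-11, "transverse": -66.3e-11}


def hole_mobility_change(stress_pa, direction):
    """Relative hole mobility change (delta mu / mu) under uniaxial stress.

    Linear model: delta mu / mu = -pi * stress, with tensile stress positive.
    """
    return -PI_HOLE[direction] * stress_pa


# 100MPa tensile stress, the value quoted in the text.
for direction in ("longitudinal", "transverse"):
    change = hole_mobility_change(100e6, direction)
    print(f"{direction}: {100.0 * change:+.1f}%")
```

With these coefficients the longitudinal case gives about -7.2% and the transverse case a mobility gain, matching the direction dependence described in the text.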
Three-dimensional Clock Tree Synthesis (3D CTS) has new challenges
such as timing corner mismatch between tiers and device variation due to
Through Silicon Via (TSV) induced stress [95]. Timing corner mismatch between
tiers arises because each tier is manufactured in an independent process.
Therefore, inter-die variation should be considered when analyzing and optimizing
paths that span several tiers. TSV induced stress is another challenge in
3D CTS. Mobility variation of a clock buffer due to stress from TSV can cause
unexpected skew which degrades overall chip performance. In this dissertation,
we propose a clock tree design methodology with the following objectives: (a) to
minimize clock period variation by assigning optimal z-locations to clock buffers
with an Integer Linear Program (ILP) formulation, (b) to prevent unwanted
skew induced by the stress.
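A toy statistical model helps show why the tier assignment of clock buffers affects variation. The sketch below assumes that every buffer on a tier shares one die-to-die delay shift, that shifts on different tiers are independent, and that each buffer contributes the same per-tier sigma. Under those assumptions only the difference in buffer counts per tier between two clock paths contributes to skew variation. This is a simplified illustration with hypothetical names, not the ILP formulation of Chapter 5.

```python
import math
from collections import Counter


def skew_sigma_ps(path_a_tiers, path_b_tiers, sigma_per_buffer_ps):
    """Standard deviation of the skew between two clock paths.

    path_a_tiers / path_b_tiers list the tier index of each buffer on a path.
    sigma_per_buffer_ps maps tier index -> die-to-die delay sigma per buffer.
    Buffers on the same tier are fully correlated; tiers are independent.
    """
    count_a = Counter(path_a_tiers)
    count_b = Counter(path_b_tiers)
    variance = 0.0
    for tier in set(count_a) | set(count_b):
        diff = count_a[tier] - count_b[tier]   # shared buffers per tier cancel
        variance += (diff * sigma_per_buffer_ps[tier]) ** 2
    return math.sqrt(variance)


sigma = {0: 2.0, 1: 2.0}   # assumed 2ps per buffer on each tier
print(skew_sigma_ps([0, 0, 1], [0, 0, 1], sigma))   # matched tier usage
print(skew_sigma_ps([0, 0, 1], [0, 1, 1], sigma))   # one buffer differs
print(skew_sigma_ps([0, 0, 0], [1, 1, 1], sigma))   # fully mismatched
```

In this model, paths that use the same number of buffers on each tier see their inter-die shifts cancel, while paths buffered on different tiers accumulate variation, which is the intuition behind choosing buffer z-locations deliberately.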
1.2 Overview and Contribution of this Dissertation
The algorithms and modeling for addressing the above mentioned chal-
lenges are described in the next four chapters. The overall flow of the disser-
tation is as follows:
In Chapter 2, we provide a set of analytical formulae to model coupling
capacitance variation induced by overlay, and show that the timing variation
due to overlay can be up to 9.1%. We also show that layout decomposition
methods [16, 83] play a role in timing variation reduction. The work in Chapter
2 is used as a cost function for overlay tolerant layout decomposition for DPL
in Chapter 3. Within given timing constraints, the work provides a way to
determine the maximum allowed amount of overlay for process development.
Layout decomposition for robust DPL is presented in Chapter 3. For layout
decomposition, we propose a two-step approach consisting of initial coloring and
optimization. We show that the number of stitch insertions required to
decompose a layout is equal to the cut size of a graph partitioning, so stitch
minimization can be achieved by a min-cut partitioning algorithm. Our graph
based decomposition is faster than ILP based decomposition with less than 1%
stitch increase. In addition, we show that layout decomposition plays a role in
enhancing patterning quality by enforcing balanced density and reduction of
overlay effects on contact and metal layers. Our work is the first one dealing
with lithography friendly objectives such as density and variability reduction
during layout decomposition.
In Chapter 4, we propose compact stress and mobility modeling to
consider systematic timing variation caused by TSV stress for 3D-ICs. We
show that TSV stress can change hole mobility quite significantly, e.g., from
-22% to +10%, which means more than 20% variation for single cell delay.
This can deteriorate the overall chip performance and thus must be considered
during timing analysis and optimization. Finally, we propose several layout
optimization techniques including small perturbation, zigzag TSV placement
and optimal cell rotation, and show that the optimization can improve single
cell delay by 14% and critical path delay by up to 6.5%.
Three-dimensional CTS to consider inter-die variation and TSV in-
duced stress is proposed in Chapter 5 by showing that clock buffer tier assign-
ment can play a role to reduce clock period variation. With our clock buffer
assignment, we show that standard deviation of clock period can be reduced
up to 34.2%. Thus, we can increase chip operating frequency for the same
timing yield, or we can increase timing yield for the same operating frequency.
This is also the first work to show that TSV-induced stress can cause unwanted
clock skew and to propose a buffer delay variation model to consider the
stress during CTS. With stress-aware 3D CTS, we show that most of the skew
due to TSV-induced stress can be removed.
Finally, in Chapter 6, we draw conclusions based on the results of the
previous chapters as well as present a few promising research directions to
enable robust and reliable double patterning lithography and 3D integration
with TSV.
Chapter 2
Overlay aware Timing Variation Modeling for
DPL
2.1 Introduction
As the minimum pitch size decreases, the lithography process has been
confronted with severe limitations. Traditional lithography using 193nm
wavelength light cannot print sub-20nm patterns. Even though EUV is a
promising candidate for a light source of future lithography equipment, it is
not clear when EUV equipment will be commercially available because of the
difficulty in dealing with the strong light source. The alternative technology
for 20nm technology is DPL. Since DPL does patterning twice, we can print
two lines in a pitch, which reduces the K-factor to 0.125. Even though DPL is
an inefficient process because of its low throughput, the industry is trying to
adopt DPL as a solution for 20nm patterning because DPL is probably the only
technology that will be available before EUV equipment is shipped.
The difficulties of DPL are stitch generation and overlay control to
reduce mask misalignment between the first and second lithography steps.
Advanced lithography equipment has about 5nm of overlay at the 3-σ level, which
is relatively large compared with the 20nm half pitch size [48, 57]. Since overlay
may change the circuit behavior such as timing, noise, and power due to the
distortion of layout pattern [9, 10, 20], we need a methodology to estimate the
variation due to overlay during design. Metal layer variation due to overlay
can affect the circuit performance and yield seriously because overlay changes
coupling capacitance between metals.
In this chapter, we present a systematic method to estimate the performance
variation with several overlay variables. To the best of our knowledge, this is
the first work to deal with the relation between overlay and timing. From the
overlay measurement, it is possible to distinguish the sources of overlay, which
are translation, rotation, and magnification. Since overlay changes metal spacing,
we model coupling capacitance as a function of the overlay variables. Metal
spacing with overlay is not constant because overlay effects differ from
position to position. Thus, parallel interconnects become non-parallel after
overlay modeling. To be able to use a traditional 2.5-D extraction tool [63],
non-parallel conductors are converted to parallel layers with equivalent spacing in
terms of timing behavior. After modeling capacitance with overlay variables,
we show how to analyze timing by sweeping the overlay variables, and determine
the variables which produce the worst timing. With these variables, we can
construct a new equivalent layout with overlay consideration for an effective
timing analysis.
The rest of the chapter is organized as follows. Preliminaries and related
works are presented in section 2.2. We propose how to model timing variation
in section 2.3. The determination of the overlay variables that produce the worst
timing is presented in section 2.4. In section 2.5, the application of the work and
experimental results are shown. We conclude with a discussion in section 2.6.
2.2 Preliminaries and Related Works
2.2.1 DPT Process
Figure 2.1: DPT process and patterning result.
Various double patterning methods are being developed. A common
DPT flow is the Lithography-Etching-Lithography-Etching process
shown in Figure 2.1 [57]. The first pattern is formed on hard mask 1 in (c), and
the second photo resist is etched in (e). We can print the final pattern on hard
mask 2 in (g). One of the most important challenges of DPT comes from overlay
between two independent lithography steps. Overlay causes the variation of
coupling capacitances for metal layers in (i). The variation of capacitances
between metal layers results in a timing and noise variation.
2.2.2 Capacitance Variation induced by Overlay
Figure 2.2: Effective capacitance variation. (a) One sided cap. (b) Two sided cap. (c) Decoupled cap.
Since coupling capacitance is inversely proportional to spacing between
two conductors, a spacing variation due to overlay in DPT results in a variation
of a coupling capacitance [94]. For instance, if metal spacing without overlay is
50nm and overlay reduces the spacing by 10nm, Cc_overlay is equal to 1.25Cc
in Figure 2.2(a). The one-sided capacitance variation is thus explicit.
The overlay effect in Figure 2.2(b) is less clear because the total capacitance
only changes from 2Cc to 2.08Cc (Cc_overlay1: 1.25Cc, Cc_overlay2: 0.83Cc) at
spacing=50nm, Y-translation overlay=10nm. However, if the interconnects have
different slew rates, the overlay effect on timing becomes clear. The coupling
capacitance can be replaced by an equivalent grounded capacitance for an
equal delay [13, 41]. The Miller factor is defined as MF = 1 + Tv/Ta. Here,
Tv is the transition time of the victim, and Ta is the transition time of the
aggressor. If MF1 is three and MF2 is one in Figure 2.2(c), the total decoupled
capacitance is 4.58Cc, which is 14.5% larger than the capacitance (4Cc) obtained
when ignoring overlay. From this observation, overlay in DPT has a significant
effect on the effective capacitance in the multiple aggressor case as well as the
single aggressor case.
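The arithmetic of this example can be reproduced with a short sketch. It assumes only the inverse-proportional model stated above (coupling capacitance scales as 1/spacing); the function name `scaled_cap` and the normalisation of Cc to 1 are illustrative choices, not part of the original flow.

```python
def scaled_cap(cc, spacing, delta):
    """Coupling capacitance after the spacing changes by delta (C ~ 1/S)."""
    return cc * spacing / (spacing + delta)

CC = 1.0        # nominal coupling capacitance, normalised to 1
S = 50.0        # nominal spacing (nm)
OVERLAY = 10.0  # Y-translation overlay (nm)

c_near = scaled_cap(CC, S, -OVERLAY)  # spacing shrinks to 40nm
c_far = scaled_cap(CC, S, +OVERLAY)   # spacing grows to 60nm

two_sided = c_near + c_far            # Figure 2.2(b)

mf1, mf2 = 3.0, 1.0                   # Miller factors used in the text
decoupled = mf1 * c_near + mf2 * c_far  # Figure 2.2(c)
no_overlay = (mf1 + mf2) * CC           # same case when overlay is ignored
increase = decoupled / no_overlay - 1.0

print(f"one sided: {c_near:.2f} Cc, two sided: {two_sided:.2f} Cc")
print(f"decoupled: {decoupled:.2f} Cc, increase: {100 * increase:.1f}%")
```

The unrounded increase is about 14.6%; the 14.5% quoted in the text follows from the rounded value 4.58Cc.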
2.2.3 Sources of Overlay
In this section, overlay variables are assumed to be measured from the
misalignment between the first and the second patterning. For simplicity,
we assume that overlay is a systematic variation, and randomly generated
overlay is negligible. There are three types of overlay [50]. The first one is
translation overlay, which is caused by a mask mismatch in the horizontal or
vertical direction. Translation overlay has two variables. The first variable
is the overlay amplitude. The amplitude in an advanced lithography system is in the
range of 3nm to 5nm. From a wafer measurement, we can extract the overlay
amplitude at the 3-σ level. The second variable is the overlay angle, which is purely
random because a translation shift can be in any direction. The overlay amplitude
is denoted α, and the overlay angle is denoted θ in Figure 2.3. The
counterclockwise direction is defined as positive θ.
Figure 2.3: Translation overlay variable. (a) Translation overlay (b) Variables for translation overlay
The second source of overlay is a rotation of the mask between the first
patterning and the second patterning. Even though the rotation overlay is less
than 0.1 µrad in advanced equipment, rotation overlay can cause several
nm of layout distortion at the edge of a die. The variable of rotation overlay is
φ, and the clockwise direction of rotation is defined as positive φ as shown in
Figure 2.4. Rotation overlay depends on the patterning location and the amount
of φ.
Figure 2.4: Rotation overlay variable. (a) Rotation overlay (b) Variable for rotation overlay
The last component of overlay is called magnification overlay. If there
is a temperature variation between the first and the second lithography steps,
magnification overlay occurs because the mask has a different thermal expansion
rate from the wafer. In a projection lithography system, fluctuation of the
magnification factor is another source of magnification overlay. The layout
variation due to magnification overlay grows as the pattern gets farther from the
center point. The variable of magnification overlay is denoted M in Figure 2.5. A
positive M means the second patterning is enlarged relative to the first one. M is
less than 0.1ppm in advanced lithography equipment. The overlay variables
are extracted at the 3-σ level from overlay measurement [50].
Figure 2.5: Magnification overlay variable. (a) Magnification overlay (b) Variable for magnification overlay
2.3 Layout Distortion Estimation
In this section, we propose a formula to predict the coupling capacitance
variation induced by overlay.
2.3.1 Problem Definition
In Figure 2.6(a), CC is the coupling capacitance obtained by a standard
RC extraction tool without overlay consideration. Coverlay is the coupling
capacitance with overlay modeling. (X, Y) is a node on the 2nd patterned
interconnect. The original space (S) is increased by ∆S, which changes CC to
Coverlay. Each overlay source shifts (X, Y): translation overlay moves it to
(XT, YT), rotation overlay to (XR, YR), and magnification overlay to
(XM, YM). The overall shift is the sum of each movement.
Figure 2.6: Vector expression of overlay. (a) Coverlay definition (b) ∆S definition
\[ \Delta S = \Delta S_T + \Delta S_R + \Delta S_M \tag{2.1} \]
Since the two metals are parallel horizontally, ∆S is equal to ∆Y. In
Figure 2.6(b), the metal space is increased by ∆YT+∆YR+∆YM. Since coupling
capacitance is inversely proportional to metal space, Coverlay is the following
function of ∆S.
\[ C_{overlay} = \frac{S}{S + \Delta S} \cdot C_C \tag{2.2} \]
To incorporate overlay measurement, we need a formula for ∆S in terms of the
overlay variables.
2.3.2 Translation Overlay Consideration
To predict metal space for translation overlay, let us assume that (X,Y)
is moved to (XT , YT ) by α and θ.
\[ (X, Y) \stackrel{\alpha,\theta}{\longleftrightarrow} (X_T, Y_T) \tag{2.3} \]
Figure 2.7: ∆S dependency on location.
From Figure 2.7, X-translation(∆XT ) is αcosθ, and Y-translation(∆YT )
is αsinθ. Even though ∆XT and ∆YT depend on α and θ, ∆ST is not propor-
tional to α and θ directly because ∆ST depends on a geometric relation.
Figure 2.8 shows possible geometric relations, which are represented by
γ, defined as the angle between the X-axis and the orthogonal vector from the 1st
pattern to the 2nd pattern. Since metal space is changed by X-translation overlay
when two metals are parallel in the vertical direction, ∆ST is equal to ∆XT for γ=0.
Similarly, ∆ST is ∆YT for γ=π/2. We calculate ∆ST for every geometric
relation in Figure 2.8. ∆ST is generalized in the following equation.
\[ \Delta S_T = \alpha \cdot \cos(\theta - \gamma) \tag{2.4} \]
Figure 2.8: ∆ST for various geometric relations.
Equation 2.4 lets us evaluate the capacitance variation due to translation
overlay. Since γ represents the geometric relation, it is a fixed value for a given
decomposed layout. Thus, it is possible to analyze timing variation with the θ
and α variables.
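As an illustration of such a sweep, the sketch below evaluates equation 2.4 over θ for one fixed γ and picks the angle that shrinks the spacing most. Treating the smallest spacing as the worst case for a single neighbor is an assumption of this sketch, and the function names and the 1-degree step are illustrative.

```python
import math

def delta_s_translation(alpha, theta, gamma):
    """Equation 2.4: spacing change caused by translation overlay."""
    return alpha * math.cos(theta - gamma)

def worst_theta(alpha, gamma, steps=360):
    """Sweep theta over one full turn; return the angle with the smallest spacing."""
    angles = [2 * math.pi * k / steps for k in range(steps)]
    return min(angles, key=lambda t: delta_s_translation(alpha, t, gamma))

ALPHA = 3.0          # overlay amplitude (nm)
GAMMA = math.pi / 2  # 2nd pattern directly above the 1st pattern

theta_w = worst_theta(ALPHA, GAMMA)
shift_w = delta_s_translation(ALPHA, theta_w, GAMMA)
print(f"worst theta = {math.degrees(theta_w):.0f} deg, delta S_T = {shift_w:.2f} nm")
```

For γ=π/2 the sweep lands on θ=3π/2, where the second pattern is shifted straight toward the first one and ∆ST = −α.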
2.3.3 Rotation Overlay Consideration
To predict metal space for rotation overlay, let us assume that (X,Y)
is moved to (XR, YR) by φ rotation.
\[ (X, Y) \stackrel{\phi}{\longleftrightarrow} (X_R, Y_R) \tag{2.5} \]
Since the effects of rotation overlay are not the same when their po-
sitions are different, we need to consider locations of patterns for estimating
distortion induced by rotation overlay.
Figure 2.9: ∆XR and ∆YR for rotation overlay.
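The location of (X,Y) is expressed by the distance (D) from the center
and the angle (β) from the X-axis in Figure 2.9. D′ is the distance from the center
to (XR, YR). β is the position angle, which is positive in the counterclockwise
direction.
\[ D = \sqrt{X^2 + Y^2}, \qquad D' = D / \cos\phi \tag{2.6} \]
∆XR with φ rotation is |XR − X|, and ∆YR is |YR − Y|. ∆XR and
∆YR are obtained from Figure 2.9.
\[ \begin{aligned}
\Delta X_R &= D \cdot \left\{ \frac{\cos(\beta - \phi)}{\cos\phi} - \cos\beta \right\} \\
\Delta Y_R &= D \cdot \left\{ \frac{\cos\left(\beta + \frac{\pi}{2} - \phi\right)}{\cos\phi} - \cos\left(\beta + \frac{\pi}{2}\right) \right\}
\end{aligned} \tag{2.7} \]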
∆XR and ∆YR have π/2 phase difference which is useful for deriving a
general formula.
\[ \Delta S_R = D \cdot \left\{ \frac{\cos(\beta' - \phi)}{\cos\phi} - \cos\beta' \right\} \tag{2.8} \]
In equation 2.8, β′ is a new variable considering both geometric rela-
tion and position angle. When the geometric relation is γ=0, metal space is
increased by ∆XR. Thus, β′ is equal to β when γ is zero. If γ is π, ∆SR is
-∆XR because metal space decreases for the relation. Since π phase shift of
cos changes its sign, β′ is β−π when γ is π. Similarly, we calculate β′ for every
relation in Figure 2.10. The general formula of ∆SR is given in the following equation.
\[ \Delta S_R = D \cdot \left\{ \frac{\cos(\beta - \gamma - \phi)}{\cos\phi} - \cos(\beta - \gamma) \right\} \tag{2.9} \]
We can model the capacitance variation due to rotation overlay using equation
2.9. D and β are determined by the initial position (X,Y). Since γ is a fixed
value for a given decomposed layout and φ is the only variable, we can analyze
timing variation by sweeping the rotation variable from -φ to +φ.
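The dependence of equation 2.9 on φ can be made explicit with one trigonometric step. The following worked derivation is a sketch that assumes D multiplies both terms of the rotation expression (which the units require); the small-angle approximation in the last step is an added assumption, reasonable here because φ is well below 1 µrad.

```latex
\begin{aligned}
\frac{\cos(x-\phi)}{\cos\phi} - \cos x
  &= \frac{\cos x\cos\phi + \sin x\sin\phi}{\cos\phi} - \cos x
   = \sin x\,\tan\phi, \qquad x = \beta-\gamma, \\
\Delta S_R &= D\,\sin(\beta-\gamma)\,\tan\phi \;\approx\; D\,\phi\,\sin(\beta-\gamma).
\end{aligned}
```

For example, with D·|sin(β − γ)| = 5000µm and φ = 0.05 µrad, |∆SR| is approximately 5×10^6 nm × 5×10^-8 = 0.25nm, and the sign of ∆SR flips with the sign of φ.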
Figure 2.10: β′ for various geometric relations.
2.3.4 Magnification Overlay Consideration
D″ is defined as the distance from the center to the point shifted by
magnification overlay. Since M is defined as (D″−D)/D, D″−D is equal to M·D.
Thus, ∆XM and ∆YM are obtained as follows.
\[ \Delta X_M = M \cdot D \cdot \cos\beta, \qquad \Delta Y_M = M \cdot D \cdot \sin\beta \tag{2.10} \]
The layout distortion due to magnification overlay is generalized as
follows.
\[ \Delta S_M = M \cdot D \cdot \cos(\beta - \gamma) \tag{2.11} \]
Since M is the only variable in Equation 2.11, we can analyze
timing variation by sweeping the magnification variable from -M to +M.
Figure 2.11: ∆XM and ∆YM for magnification overlay.
2.3.5 Overall Capacitance Variation with Overlay
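\[ \begin{aligned}
\Delta S &= \Delta S_T + \Delta S_R + \Delta S_M \\
         &= \alpha \cdot \cos(\theta - \gamma) \\
         &\quad + D \cdot \left\{ \frac{\cos(\beta - \gamma - \phi)}{\cos\phi} - \cos(\beta - \gamma) \right\} \\
         &\quad + M \cdot D \cdot \cos(\beta - \gamma)
\end{aligned} \tag{2.12} \]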
In equation 2.12, we propose the numerical expression of ∆S param-
eterized by overlay variables, location, and geometric relation. Even though
rotation and magnification overlay depend on their initial position, the linear
summation of each spacing variation gives us an accurate result.
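As a numeric illustration of equation 2.12 combined with equation 2.2, the sketch below evaluates ∆S and the resulting capacitance ratio for a single coupling capacitance. The overlay magnitudes are the ones assumed in section 2.5; the wire location, the geometric relation γ=π/2, the signs chosen for θ, φ and M, and the function names are hypothetical.

```python
import math

def delta_s(x, y, gamma, alpha, theta, phi, m):
    """Equation 2.12: total spacing change (same length unit as x, y, alpha)."""
    d = math.hypot(x, y)     # distance from the die center
    beta = math.atan2(y, x)  # position angle, counterclockwise positive
    ds_t = alpha * math.cos(theta - gamma)
    ds_r = d * (math.cos(beta - gamma - phi) / math.cos(phi)
                - math.cos(beta - gamma))
    ds_m = m * d * math.cos(beta - gamma)
    return ds_t + ds_r + ds_m

def cap_ratio(s, ds):
    """Equation 2.2: C_overlay / C_C for nominal spacing s."""
    return s / (s + ds)

# Hypothetical wire at (5000um, 5000um), written in nm,
# with the second pattern directly above the first (gamma = pi/2).
ds = delta_s(x=5.0e6, y=5.0e6, gamma=math.pi / 2,
             alpha=3.0, theta=3 * math.pi / 2, phi=0.05e-6, m=-0.05e-6)
ratio = cap_ratio(48.0, ds)
print(f"delta S = {ds:.3f} nm, C_overlay / C_C = {ratio:.4f}")
```

With these inputs the translation term contributes −3nm and the rotation and magnification terms about −0.25nm each, so ∆S is about −3.5nm and the capacitance ratio is about 1.079.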
A metal layer that becomes non-parallel after overlay consideration should be
converted to a parallel one with equivalent timing behavior so that a 2.5D RC
extractor, which cannot deal with diagonal patterns, can be applied. In Figure 2.12,
we assume that (X1,Y1) is close to a driver, and (Xn,Yn) is close to a receiver. ∆S1 is
the metal spacing variation at (X1,Y1) while ∆Sn is the metal spacing variation at
(Xn,Yn). ∆ST is the same at every position of a die because ∆ST is independent
of location. However, since ∆SR and ∆SM depend on the initial
position, ∆S1 is different from ∆Sn.
Figure 2.12: ∆S dependency on location.
Seqv is defined as the metal spacing that gives an equivalent delay. To find
Seqv, the interconnect is modeled by an n-π model which is equivalent
to the one-π model in Figure 2.13. Ci and Ri are the capacitance and resistance
of the ith π structure. The capacitance at the driver is C1, and the one at the
receiver is Cn.
Figure 2.13: Equivalent n-π model for Seqv calculation.
\[ \frac{R}{2n} \left\{ \sum_{i=1}^{n-1} \frac{i}{n} (C_i + C_{i+1}) + C_n \right\} \tag{2.13} \]
Using the Elmore delay model [32, 33, 53], the delay of the n-π model is
expressed in equation 2.13. When C1 is equal to Cn, the above equation becomes
R · C/2, which is the same as the delay of the one-π model. When C1 is
not equal to Cn, the delay of the interconnect is R · Ceqv/2 because the resistance
is constant. Ceqv is the equivalent capacitance after converting to a parallel
pattern.
\[ \Delta C = C_{i+1} - C_i = \frac{C_n - C_1}{n - 1}, \qquad C_i = C_1 + \Delta C \cdot (i - 1), \qquad C_{i+1} = C_1 + \Delta C \cdot i \tag{2.14} \]
∆C is the step increase of capacitance in the n-π model. Equation 2.13 can
be expressed in terms of C1 and ∆C. As n goes to infinity, the delay of the
interconnect is given by the following.
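\[ \lim_{n \to \infty} \left[ \frac{R}{2n} \left\{ \sum_{i=1}^{n-1} \frac{i}{n} (2 C_1 - \Delta C + 2 \cdot \Delta C \cdot i) + C_n \right\} \right] \tag{2.15} \]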
After simplifying, equation 2.15 becomes R·(C1+2Cn)/6, which should be
equal to R·Ceqv/2.
\[ C_{eqv} = \frac{1}{3} \cdot C_1 + \frac{2}{3} \cdot C_n, \qquad S_{eqv} = \frac{3 \cdot S_1 \cdot S_n}{2 \cdot S_1 + S_n} \tag{2.16} \]
Here, S1 is S+∆S1, and Sn is S+∆Sn. Because of the resistance
shielding effect, the capacitance close to the receiver has more effect on delay,
and equation 2.16 reflects this physical meaning.
\[ C_{overlay} = \frac{S}{S_{eqv}} \cdot C_C \tag{2.17} \]
Every coupling capacitance can be represented by equation 2.17 after
overlay consideration.
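The limit in equation 2.15 and the resulting equation 2.16 can be checked numerically. The sketch below evaluates equation 2.13 for a large n with capacitances that vary linearly as in equation 2.14; the function names and the sample values of R, C1 and Cn are illustrative assumptions.

```python
import math

def n_pi_delay(r, c1, cn, n):
    """Equation 2.13 with C_i varying linearly from c1 to cn (equation 2.14)."""
    dc = (cn - c1) / (n - 1)
    caps = [c1 + dc * k for k in range(n)]  # caps[k] holds C_{k+1}
    total = sum((i / n) * (caps[i - 1] + caps[i]) for i in range(1, n))
    return r / (2 * n) * (total + caps[-1])

def c_eqv(c1, cn):
    """Equation 2.16: equivalent capacitance of the parallel pattern."""
    return c1 / 3 + 2 * cn / 3

def s_eqv(s1, sn):
    """Equation 2.16: equivalent spacing, using C ~ 1/S."""
    return 3 * s1 * sn / (2 * s1 + sn)

R, C1, CN = 100.0, 1.0, 1.3  # arbitrary sample values
numeric = n_pi_delay(R, C1, CN, n=100_000)
closed_form = R * c_eqv(C1, CN) / 2
print(f"n-pi delay = {numeric:.4f}, R*Ceqv/2 = {closed_form:.4f}")

# The spacing form follows from the capacitance form because C ~ 1/S.
S1, SN = 45.0, 50.0
print(math.isclose(c_eqv(1 / S1, 1 / SN), 1 / s_eqv(S1, SN)))
```

The receiver-side capacitance carries twice the weight of the driver-side one, which is the resistance shielding effect mentioned above.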
2.4 Timing Analysis with Parameterized Capacitance
This section shows how to determine the overlay variables that produce the
worst timing result. Once we decide the variables for the worst delay, we
can modify the layout with these overlay variables so that it can be used for
worst corner timing simulation. Then, the traditional timing analysis flow
can be used on the layout, which is modified to maximize the delay with
overlay. Four overlay variables are defined. Obviously, large values
of the variables maximize the distortion of the layout. However, φ and M are signed
values, and the angle of translation overlay is a purely random variable. Thus,
an efficient way of determining φ, M and θ for the worst timing is required.
Since rotation and magnification overlay depend on the location of a pattern,
the location shift caused by another type of overlay can affect the shift due to
rotation and magnification. However, the shifts are several nm, which is
small compared with the location of the pattern. To see the dependency between
overlay sources, we construct two types of random circuits, each with a logic
depth of ten. We use a different RC generation method for each type of circuit.
One hundred circuits of each type are generated randomly.
Table 2.1: Timing errors ignoring the dependency.
                     Case1    Case2
Average (%)          0.007    0.011
Std. Deviation (%)   0.020    0.036
Avg. + 3Std (%)      0.067    0.119
When we find the θ that produces the worst timing, φ and M are set to zero. θ
and M are set to zero for finding the worst-case φ. Similarly, we find M
when θ and φ are zero. The exact variables that produce the worst timing result
are obtained by sweeping every variable in a given range. Since the number
of variables to be swept is three, sweeping the variables over the entire search
space is a time consuming procedure. In Table 2.1, we compare the timing result
between the case of ignoring the dependency and the exact solution obtained by
sweeping the variables. The difference is less than 0.119% at the 3-σ level, which
is negligible. From this result, we see that each variable for the worst timing can
be obtained by fixing the other variables to zero.
Figure 2.14: Overlay aware timing analysis flow.
Figure 2.14 shows the timing analysis flow for systematic overlay variation.
Geometric information and coupling capacitance without overlay are
extracted. Coupling capacitances are replaced by Coverlay. Since we search
the variables consecutively, we can use the previously obtained variable
as an initial point instead of zero. We observed that the overlay variables obtained
in the consecutive way are exactly the same as those obtained by sweeping
simultaneously. The number of simulations needed to determine the overlay
variables for the worst timing is Nθ+Nφ+NM, where Nθ, Nφ and NM are the numbers
of simulations for sweeping θ, φ and M respectively. With the overlay variables, we
can construct new metal layers that include the variation for worst corner
timing analysis. The best corner timing analysis can be done in a similar way.
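The consecutive search can be sketched as follows. The `delay` function is a toy stand-in (delay taken as proportional to Coverlay/CC of a single coupling capacitance) for what would be a full timing simulation in the flow above; the wire location, γ and the grid resolutions are hypothetical.

```python
import itertools
import math

S = 48.0             # nominal spacing (nm)
ALPHA = 3.0          # translation amplitude (nm)
X, Y = 5.0e6, 5.0e6  # hypothetical wire location (nm)
GAMMA = math.pi / 2  # hypothetical geometric relation
D, BETA = math.hypot(X, Y), math.atan2(Y, X)

def delta_s(theta, phi, m):
    """Equation 2.12 for one coupling capacitance."""
    ds_t = ALPHA * math.cos(theta - GAMMA)
    ds_r = D * (math.cos(BETA - GAMMA - phi) / math.cos(phi)
                - math.cos(BETA - GAMMA))
    ds_m = m * D * math.cos(BETA - GAMMA)
    return ds_t + ds_r + ds_m

def delay(theta, phi, m):
    """Toy delay model: proportional to C_overlay / C_C (equation 2.2)."""
    return S / (S + delta_s(theta, phi, m))

thetas = [2 * math.pi * k / 36 for k in range(36)]
phis = [-5e-8, -2.5e-8, 0.0, 2.5e-8, 5e-8]
mags = [-5e-8, -2.5e-8, 0.0, 2.5e-8, 5e-8]

# Consecutive search: sweep one variable, then reuse it as the starting point.
theta_w = max(thetas, key=lambda t: delay(t, 0.0, 0.0))
phi_w = max(phis, key=lambda p: delay(theta_w, p, 0.0))
m_w = max(mags, key=lambda m: delay(theta_w, phi_w, m))
n_consecutive = len(thetas) + len(phis) + len(mags)

# Simultaneous sweep over the full grid, for comparison only.
full = max(itertools.product(thetas, phis, mags), key=lambda v: delay(*v))
n_full = len(thetas) * len(phis) * len(mags)

print((theta_w, phi_w, m_w) == full, n_consecutive, n_full)
```

In this toy model the three contributions to ∆S are additive, so the consecutive search finds the same corner as the full grid while needing Nθ+Nφ+NM evaluations instead of Nθ·Nφ·NM.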
2.5 Applications and Results
In this section, we verify the accuracy of the modeling and give an application
example. Figure 2.15 shows a test structure used to verify the presented
work. Our target interconnect for timing analysis is the one in the middle.
Translation overlay will dominate because the location of the wire is close to
the center.
Figure 2.15: Test structure used to verify overlay aware timing analysis flow
We assumed that α is 3nm, φ is ±0.05 µrad, M is ±0.05ppm, and
metal spacing without overlay is 48nm. This assumption is reasonable at the 3-σ
level [57]. We calculated the delay along the translation angle as shown in
Figure 2.16. As expected, rotation and magnification overlay are negligible
because the position of the interconnect is close to the center. The variation due
to translation overlay is up to 7.8%.
To see the rotation and magnification overlay effect, the same layout
in Figure 2.15 was shifted by (5000µm, 5000µm). With the same overlay
condition, the overall timing variation is 9.1%. The additional 1.3% variation
comes from rotation and magnification overlay. When φ is negative, the delay
increases because the counterclockwise rotation of the second patterning
reduces the spacing between metal layers. Similarly, if M is positive, the second
patterning moves toward the first patterning. Thus, we see the worst delay
when M is positive. With the overlay vs. delay graph, we can see the
variation effects for both the overall overlay and each separate overlay source.
Figure 2.16: Delay variation with overlay when the target layout is near the center. Translation overlay is dominant. (a) Line graph along translation angle (b) Circular graph along translation angle
It is possible to optimize interconnect with overlay vs. delay graph. Fig-
ure 2.18 shows an alternative decomposition method for DPT which switches
the patterning order of the lower part of targeting interconnect.
The delays of the two decomposition cases are compared in Figure 2.19. The
decomposition in Figure 2.18 is more robust than the first one. The delay of the
worst overlay case is 0.87ns, which is 3% less than in the first case. In the second
decomposition, the variation range is 2.7% while the first decomposition
has a 9.1% variation range. If there are two or more ways to decompose a metal
layer, we need a way to decide which one is better. The overlay vs. delay graph
gives us a way to determine the optimized decomposition.
Figure 2.17: Delay variation with overlay when the target layout is near the edge of the die. Rotation and magnification overlay as well as translation overlay have an impact on delay variation. (a) Line graph along translation angle (b) Circular graph along translation angle
Figure 2.18: Test structure for an alternative decomposition which has less overlay effect on timing
The delay propagation effects are verified here. We generated the
RC network and geometric relations randomly. The same overlay condition as in
the previous experiment was used. Two fan-ins and one fan-out are assumed
to construct a circuit connection. A well-designed circuit with fully random
relations, in which γ is allowed to be 0, π/2, π, or 3π/2, shows a self-compensating
trend. As the logic depth increases, the accumulated delay may have less variation
because positive and negative variations compensate each other in a well-designed
circuit. To compare the result with a poorly designed circuit, we generate a circuit
with the constraint that γ is allowed to be only 0 or π/2. With the γ restriction, the
circuit has particular geometric relations in which the second pattern is always
placed on the upper or right side of the first pattern. As the logic depth goes over
ten, the self-compensation effect of the restricted design is much less than in the
case with fully random relations. Overlay aware routing may therefore reduce the
timing variation of a circuit.
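The compensation trend can be illustrated with a first-order proxy. The sketch below is not the experiment described above: it assumes one coupling capacitance per logic stage, translation overlay only, and uses the accumulated spacing shift from equation 2.4 as a stand-in for accumulated delay variation. The function names, the depth, the trial count and the random seed are all hypothetical.

```python
import math
import random

ALPHA = 3.0  # translation overlay amplitude (nm)

def worst_total_shift(gammas):
    """Largest value over theta of sum_i ALPHA*cos(theta - gamma_i).

    The sum equals A*cos(theta) + B*sin(theta), whose maximum is hypot(A, B).
    """
    a = sum(math.cos(g) for g in gammas)
    b = sum(math.sin(g) for g in gammas)
    return ALPHA * math.hypot(a, b)

def mean_worst_shift(choices, depth, trials, rng):
    """Average worst-case accumulated shift over random gamma assignments."""
    total = 0.0
    for _ in range(trials):
        gammas = [rng.choice(choices) for _ in range(depth)]
        total += worst_total_shift(gammas)
    return total / trials

rng = random.Random(0)
DEPTH, TRIALS = 10, 1000
fully_random = [0.0, math.pi / 2, math.pi, 3 * math.pi / 2]
restricted = [0.0, math.pi / 2]

mean_full = mean_worst_shift(fully_random, DEPTH, TRIALS, rng)
mean_restricted = mean_worst_shift(restricted, DEPTH, TRIALS, rng)
print(f"fully random: {mean_full:.1f} nm, restricted: {mean_restricted:.1f} nm")
```

With γ restricted to 0 and π/2 the per-stage shifts can never cancel, so the accumulated worst case is at least ALPHA·DEPTH/√2, whereas the fully random relations cancel on average.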
Figure 2.21 shows a timing result for multiple metal layers. For simplicity,
only translation overlay is considered in the experiment. With the assumption
that overlay is independent between layers, the overlay angles that produce the
worst delay for each layer also produce the worst delay when we consider the two
layers together. For example, in Figure 2.21, the overlay angle that produces the
worst delay is 280° for metal 1, and 160° for metal 2. When we consider the two
layers together, the worst delay occurs when each layer is at its own worst-delay
angle. After finding the overlay variables of each individual layer that produce the
worst delay, we can analyze circuit behavior with these variables for multiple layer
analysis.
Figure 2.19: Comparison of delay variation between two different decomposition schemes. (a) Line graph along translation angle (b) Circular graph along translation angle
2.6 Discussions
We proposed a method to estimate the layout distortion due to over-
lay which is inevitable for DPT. We define several overlay variables such as
the amplitude of translation overlay, the angle of rotation overlay, and the
magnification factor. With a given overlay variable, we could model the pa-
rameterized coupling capacitance. We showed how to determine the overlay
variables for the worst timing of a chip. In addition, we showed that different
pattern decompositions vary in overlay tolerance. This work provides a way
of designing a robust circuit with consideration of overlay.
Figure 2.20: Compensation trend of overlay induced variation as logic depth increases.
Figure 2.21: Timing variation over θ for multiple metal layers. (a) Test circuit (b) Delay vs. θ on Metal 1 (c) Delay vs. θ on Metal 2 (d) Delay vs. θ on Metal 1 and 2
Chapter 3
A Graph-Theoretic, Multi-Objective Layout
Decomposition
3.1 Introduction
In the previous chapter, we discussed the overlay effect on timing in DPL.
DPL requires layout decomposition before manufacturing, and layout
decomposition is the other challenge for DPL. As we observed in the previous
chapter, robust decomposition can mitigate the overlay effect on timing.
When the distance between two patterns is less than the minimum space (mins),
the two rectangles need to be assigned to different masks. We can resolve conflicts
by inserting stitches during decomposition as shown in Fig. 3.1(a). However,
minimum stitch insertion is preferred to reduce area and improve yield. Not
all conflicts can be resolved by inserting stitches. An irresolvable conflict is
called a native conflict, as shown in Fig. 3.1(b). The only way to remove native
conflicts is to modify the layout. After layout modification, we need to decompose
the layout again, iteratively. Therefore, a fast decomposition algorithm is required
to shorten the time of iterative layout modification and decomposition.
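The mask assignment rule can be sketched with a conflict graph and a breadth-first two-coloring. This is only an illustration of the rule stated above, not the relative coloring algorithm of this chapter: stitch insertion is not modeled, so any odd cycle is simply reported as unresolved. The rectangle format `(x1, y1, x2, y2)` and the function names are hypothetical.

```python
from collections import deque

def rect_gap(a, b):
    """Euclidean gap between two axis-aligned rectangles (x1, y1, x2, y2)."""
    dx = max(a[0] - b[2], b[0] - a[2], 0)
    dy = max(a[1] - b[3], b[1] - a[3], 0)
    return (dx * dx + dy * dy) ** 0.5

def two_color(rects, min_s):
    """Assign mask 0 or 1 so that rectangles closer than min_s differ.

    Returns a list of colors, or None if an odd conflict cycle exists
    (which would need a stitch or a layout change).
    """
    n = len(rects)
    adj = [[j for j in range(n)
            if j != i and rect_gap(rects[i], rects[j]) < min_s]
           for i in range(n)]
    color = [None] * n
    for start in range(n):
        if color[start] is not None:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if color[v] is None:
                    color[v] = 1 - color[u]
                    queue.append(v)
                elif color[v] == color[u]:
                    return None
    return color

# Three parallel wires: neighbors conflict, outer wires do not.
wires = [(0, 0, 100, 20), (0, 40, 100, 60), (0, 80, 100, 100)]
# Three mutually close squares: an odd conflict cycle.
triangle = [(0, 0, 20, 20), (30, 0, 50, 20), (15, 30, 35, 50)]

print(two_color(wires, min_s=30))
print(two_color(triangle, min_s=15))
```

The first layout alternates masks along the wires; the second cannot be two-colored without splitting a pattern or changing the layout.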
Figure 3.1: Layout decomposition. (a) Decomposable layout (b) Undecomposable layout
Several rule-based decomposition methods have been proposed in [17,
43, 45, 90, 91, 99]. The combined approach of routing and decomposition was
proposed in [17]. The papers in [43, 45, 90] proposed a layout decomposition
method based on constraint graph construction and ILP. The paper in [99]
presented a simultaneous optimization of conflicts and stitches with an ILP
formulation. Our algorithm does not need to consider conflicts explicitly during
decomposition because our framework guarantees a conflict-free result after an
initial (relative) coloring. In addition, since ILP is NP-hard, the algorithm
in [43, 99] cannot be applied to large scale decomposition without layout par-
tition, which is required to decompose layout balanced or overlay compensated.
Even though layout partitioning can be used, optimality cannot be guaranteed.
Furthermore, the conventional metrics are just decomposability and stitch count.
Therefore, we need a framework that can handle all requirements within reasonable
time.
The proposed decomposition framework is scalable enough for
large scale layout decomposition. The decomposition framework consists of
three steps: relative coloring, constraint insertion for robust decomposition,
and group color assignment by bi-partitioning. In the framework, relative
coloring removes every conflict except native conflicts, and we optimize
the decomposition for minimum stitches and balanced density during the group
color assignment step. The optimization is mainly based on
a graph theoretical approach, which is extended from a min-cut graph
partitioning algorithm such as FM [26]. Overall, the framework runs in polynomial
time.
Another benefit of our framework is that balanced density can be
achieved concurrently with stitch minimization. Uneven density can cause
not only more pattern distortion and hotspots [87], but also etch bias between
the two patterning steps, while a balanced layout allows more space for scattering
bar insertion during OPC (optical proximity correction) as well as better
patterning quality and uniformity. Therefore, the two decomposed layers need
to be balanced as much as possible. Our framework balances local density as
well as global density.
In addition to the above advantages, our framework is flexible enough to add
more constraints during decomposition. As applications that show the flexibility
of our framework, we propose contact decomposition and overlay
self-compensated decomposition. When DPL is used for contact layer patterning,
the source and drain contacts forming a transistor need to have the same color
in order to reduce side effects due to overlay between the first and the second
contact patterning. In addition, because metal patterns formed by DPL
can have space variation due to overlay, the coupling capacitance
between neighboring wires can change, resulting in timing variation [15, 30, 96].
TDD constraints are generated by an ILP formulation to mitigate the timing
variation due to overlay. The number of variables in our ILP formulation for TDD
constraint generation is small enough to remain tractable for large scale
decomposition. The TDD constraints are inserted as weighted edges in our graph
model for color assignment. In the experimental results, we show that the timing
variation caused by overlay can be reduced from 8.1%∼9.3% to less than 0.5%.
Overall, our graph theoretic decomposition optimizes layout for mini-
mum stitch insertion with balanced density and overlay compensation, simul-
taneously.
The rest of the chapter is organized as follows. We introduce several
requirements for decomposition in section 3.2. Our decomposition framework
is proposed in section 3.3. Robust contact decomposition will be shown in
section 3.4. In section 3.5, we will explain how to extract TDD constraints, and
show our graph based TDD approach. We show hierarchical decomposition
in section 3.6. Experimental results are shown in section 3.7, and we conclude
with a discussion in section 3.8.
3.2 Decomposition Requirements and Motivation
3.2.1 Balanced Density
Even though two features within minimum space are assigned to different
masks, unbalanced density can cause lithography hotspots as well as
lowered CD uniformity due to irregular pitch. The aerial image of an unbalanced
decomposition can have a patterning problem as shown in Fig. 3.2(a).
The balanced decomposition of the same circuit in Fig. 3.2(b) does not have
the problem. Even though the decomposed layout in Fig. 3.2 does not represent
a general case, balanced decomposition provides a more lithography friendly
layout because of higher regularity [73]. Therefore, balanced density should
be considered in decomposition algorithms.
Figure 3.2: Aerial image comparison of decomposed layouts with different density. (a) Unbalanced decomposition: 38% (Red) and 62% (Blue) (b) Balanced decomposition: 48% (Red) and 52% (Blue)
3.2.2 Robust Contact Decomposition
In DPL, overlay between the source/drain (S/D) contacts can change the
distance from contact to gate. Fig. 3.3(a) shows contact patterning without
overlay. In traditional single patterning, the S/D contact displacement due to
overlay is in the same direction for both contacts, as shown in Fig. 3.3(b).
However, if we assign the S/D contacts to different masks in DPL, the overlay
of the two contacts may not be in the same direction. Two worst cases are shown
in Fig. 3.3(c) and Fig. 3.3(d). When we ignore the contact stress effect, the
transistor in Fig. 3.3(c) has the maximum channel current, while Fig. 3.3(d)
shows the minimum current case. In section 3.4, we propose a robust
decomposition for the contact layer to address the S/D contact mismatch.
(a) No overlay between S/D contacts
(b) S/D overlay with single patterning
(c) Maximum Ion case in DPL
(d) Minimum Ion case in DPL
Figure 3.3: An illustration of contact to gate space variation due to overlay.
3.2.3 Overlay Compensation
The main problem of litho-etch-litho-etch (LELE) DPL comes from
overlay between the 1st and the 2nd patterning step. Even though manufacturing
engineers are working to gain more overlay control, it is impossible
to remove overlay completely. Therefore, it is desirable to compensate for
overlay during the design phase. Fig. 3.4 illustrates how the overlay effect
can be compensated. Without overlay consideration during decomposition, the
variations of the coupling capacitances between metal patterns are in the
same direction (Fig. 3.4(a)). In contrast, Fig. 3.4(b) shows less timing
variation because C1 decreases while C2 increases with overlay. In section 3.5,
we propose a decomposition method that compensates for the overlay effect.
(a) Decomposition without overlay compensation
(b) Decomposition with overlay compensation
Figure 3.4: Overlay compensated decomposition.
3.3 Bi-partitioning Based Decomposition
3.3.1 Overall Decomposition Flow
The overall flow of our framework is shown in Fig. 3.5. The first step is
to divide every shape into rectangles, which are the basic units in our
framework, because finding the neighboring shapes of a polygon is difficult,
while dividing shapes into single grid units would increase complexity and
memory usage.
Figure 3.5: Overall flow of the decomposition framework.
Fig. 3.6 shows the decomposition result after each step. A rectangle
is divided into smaller ones based on the projected region from a non-touching
neighboring rectangle in Fig. 3.6(b). The re-segmented rectangles based on the
projection in Fig. 3.6(c) are grouped by traversing non-touching neighbors in
Fig. 3.6(d). If a re-segmented rectangle does not have a non-touching neighbor,
the rectangle becomes an independent group by itself. A group consisting of
more than two rectangles is defined as a dependent group. There are nine
dependent groups in Fig. 3.6(d). In Fig. 3.6(e), initial (relative) colors in
each group are assigned to the rectangles to remove every conflict. There are
23 stitches after relative color assignment, which does not consider stitch
minimization. We will explain grouping and relative coloring in section 3.3.3.
Layout decomposition at the relative coloring step is done without any
objective. To minimize stitches, we introduce group color determination with
the min-cut partitioning based algorithm in section 3.3.5. Fig. 3.6(f) shows
the final coloring result after stitch minimization. After bi-partitioning of
the decomposition graph, rectangles in the same partition have the same color.
After the partitioning, each group either flips its coloring or keeps it.
For example, all rectangles in G1, G3, and G5 flipped their initial color in
Fig. 3.6(f). Notice that an independent group may flip its color as well. After
color flipping, three stitches are required to resolve conflicts.
(a) Segmentation into rectangles and finding conflicts
(b) Projection to conflict rectangles within mins
(c) Re-segmentation based on projected rectangles
(d) Rectangle grouping for initial coloring
(e) Relative coloring to remove conflicts within groups
(f) Group color assignment to minimize stitch insertion
Figure 3.6: Decomposition results after each step (AOI2 cell, Metal1).
The separation of relative coloring and group color assignment enables
us to find a native conflict during the relative coloring step, which reduces the
design time spent iterating between native conflict removal and decomposition.

Algorithm 1 Finding neighbors in left edge
  Sort R according to right coordinate : RR
  for each rectangle r ∈ R do
    RN ← { r′ ∈ RR : r.left − mins ≤ r′.right ≤ r.left }
    for each rectangle r′ ∈ RN do
      if Distance(r, r′) == zero then
        r → touching_neighbor.insert(r′)
      else if Distance(r, r′) ≤ mins then
        r → non_touching_neighbor.insert(r′)
      end if
    end for
  end for
3.3.2 Finding Neighboring Rectangles
After converting polygons into rectangles, we obtain for every rectangle
its touching neighbors as well as its non-touching neighbors, which are the
rectangles within the given distance mins. Algorithm 1 shows the pseudo-code
to find neighbors from the left edge of a rectangle. Searching from the right,
bottom, and top edges is done in a similar way. First, we sort all rectangles
in increasing order of right edge coordinate. Let r.left be the left position
of a rectangle r. After the sort, the non-touching neighbors of r are limited
to rectangles having a right edge coordinate from r.left - mins to r.left.
The complexity of finding neighbors is O(NlogN) due to the sorting, in which
N is the number of rectangles.
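The left-edge search can be sketched as follows. This is a minimal illustration rather than the framework's actual C++ code: the `Rect` class, the Euclidean `distance` helper, and the use of binary search over the sorted right edges are assumptions made for the example.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass, field


@dataclass(eq=False)
class Rect:
    # Hypothetical rectangle record; coordinates in nm.
    left: int
    bottom: int
    right: int
    top: int
    touching: list = field(default_factory=list)
    non_touching: list = field(default_factory=list)


def distance(a, b):
    """Gap between two axis-aligned rectangles (0 if they touch or overlap)."""
    dx = max(a.left - b.right, b.left - a.right, 0)
    dy = max(a.bottom - b.top, b.bottom - a.top, 0)
    return (dx * dx + dy * dy) ** 0.5


def find_left_neighbors(rects, min_s):
    """Classify neighbors found from the left edge of every rectangle."""
    by_right = sorted(rects, key=lambda r: r.right)   # the O(N log N) sort
    rights = [r.right for r in by_right]
    for r in rects:
        # Candidates have a right edge in [r.left - min_s, r.left].
        lo = bisect_left(rights, r.left - min_s)
        hi = bisect_right(rights, r.left)
        for other in by_right[lo:hi]:
            if other is r:
                continue
            d = distance(r, other)
            if d == 0:
                r.touching.append(other)
            elif d <= min_s:
                r.non_touching.append(other)
```

The same routine, mirrored for the right, bottom, and top edges, would give the full neighbor lists.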
(a) Re-segmentation (b) Relative coloring
Figure 3.7: Re-segmentation and grouping process.
3.3.3 Grouping and Relative Coloring
A rectangle is divided into smaller rectangles based on the
candidate locations for stitch insertion. For example, in Fig. 3.7(a), a rectangle
is divided into three small rectangles (r1, r2, r3). The projection from r4 forces
r1 to be colored differently from r4. In addition, to get enough margin for
the overlap at a stitch, r1 should be extended by minom/2, where minom
is the minimum overlap margin for overlay. Similarly, r3 and r5 should have
different colors. r2 does not have a non-touching neighbor, which means
that we can assign any color to r2. The re-segmented rectangles are grouped
by their neighboring relations in Fig. 3.7(b). A relative color is assigned to all the
grouped rectangles, which indicates their relative relationship to other rectangles
in the same group. Grouping and relative coloring can be implemented
simultaneously with Algorithm 2.
Algorithm 2 Grouping and Relative Coloring (DFS)
  A set of re-segmented rectangles : S
  RelativeColor = 0
  for each rectangle s ∈ S not yet assigned to a group do
    Create new group G
    s → relative_color = RelativeColor
    G → rectangle.insert(s)
    A set of non-touching neighbors of s : N
    for each rectangle n ∈ N not yet assigned to a group do
      RecursiveRelativeColoring(G, n, INV(RelativeColor))
    end for
  end for

  RecursiveRelativeColoring(G, s, RelativeColor)
    s → relative_color = RelativeColor
    G → rectangle.insert(s)
    A set of non-touching neighbors of s : N
    for each rectangle n ∈ N not yet assigned to a group do
      RecursiveRelativeColoring(G, n, INV(RelativeColor))
    end for
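A compact way to see grouping and relative coloring together is the sketch below. It assumes rectangles are numbered 0..n-1 and that the non-touching neighbor relation is given as a list of index pairs. It uses an explicit stack instead of recursion, and it reports a pair as a native conflict when both ends receive the same relative color. These representation choices are illustrative, not taken from the original implementation.

```python
def group_and_color(n, conflicts):
    """Group rectangles by non-touching neighbor relation and 2-color each group.

    n         -- number of re-segmented rectangles, numbered 0..n-1
    conflicts -- (u, v) pairs of non-touching neighbors (must differ in color)
    Returns (group, color, native_conflicts).
    """
    adj = [[] for _ in range(n)]
    for u, v in conflicts:
        adj[u].append(v)
        adj[v].append(u)

    group = [-1] * n          # group id per rectangle
    color = [-1] * n          # relative color (0 or 1) per rectangle
    native = set()            # pairs that cannot be colored differently
    next_group = 0
    for s in range(n):
        if group[s] != -1:
            continue          # already reached from an earlier rectangle
        group[s], color[s] = next_group, 0
        stack = [s]
        while stack:          # iterative depth-first traversal
            u = stack.pop()
            for v in adj[u]:
                if group[v] == -1:
                    group[v], color[v] = next_group, 1 - color[u]
                    stack.append(v)
                elif color[v] == color[u]:
                    native.add((min(u, v), max(u, v)))
        next_group += 1
    return group, color, sorted(native)
```

A rectangle with no non-touching neighbor ends up alone in its group, which matches the independent groups described above.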
3.3.4 Group Color Assignment Problem
In a given layout, let G = {gi | 1 ≤ i ≤ n} be a set of groups,
H = {hw | 1 ≤ w ≤ x} be a set of stitch candidates, and S = {sj | 1 ≤ j ≤ m}
be a set of stitches to be inserted. Every boundary between two touching
groups becomes a candidate for stitch insertion, so x is the number of touching
pairs of groups. The goal of our problem is to find the minimum set S from
H. When two touching groups have different colors, hw becomes an element
of S. Let G0 = {gi0 | 1 ≤ i0 ≤ n0} be the set of groups assigned to color0 and
G1 = {gi1 | 1 ≤ i1 ≤ n1} be the set of groups assigned to color1. The group color
assignment is a process which assigns each gi to either G0 or G1 to minimize m,
the number of elements of S.
Let Aw and Bw be the binary colors of the two groups forming the wth
touching pair. If Aw and Bw have different colors, stitch insertion is required
and hw becomes an element of S. From this observation, the objective function
for stitch minimization can be formulated as minimizing \sum_w A_w \oplus B_w,
which is expressed as an ILP in (3.1). The exclusive-OR operation is formulated
in ILP with four constraints on a binary variable Xw, as follows.

\[
\begin{aligned}
\text{Minimize:}\quad   & \sum_{w} X_w \\
\text{Subject to:}\quad & A_w + B_w \ge X_w \\
                        & A_w - B_w \le X_w \\
                        & -A_w + B_w \le X_w \\
                        & -A_w - B_w + 2 \ge X_w
\end{aligned}
\tag{3.1}
\]
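As a quick sanity check of (3.1), the sketch below enumerates all binary values and confirms that the four constraints leave exactly one feasible Xw, namely Aw XOR Bw. The function names and the tiny group example are hypothetical and only serve the illustration.

```python
from itertools import product


def feasible_x(a, b):
    """Binary X values satisfying the four constraints of (3.1) for given A, B."""
    return [x for x in (0, 1)
            if a + b >= x
            and a - b <= x
            and -a + b <= x
            and -a - b + 2 >= x]


def stitch_count(colors, touching_pairs):
    """Objective of (3.1): sum of X_w over all touching pairs of groups."""
    return sum(feasible_x(colors[u], colors[v])[0] for u, v in touching_pairs)


# The constraints pin X to A xor B for every binary combination.
for a, b in product((0, 1), repeat=2):
    assert feasible_x(a, b) == [a ^ b]
```

Because X is forced rather than merely bounded, the objective counts exactly the touching pairs whose groups received different colors.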
3.3.5 Min-Cut based Stitch Minimization
To link the boolean objective function with a layout segmented into
rectangles, a boolean variable is assigned to each rectangle to indicate its group
and relative color. For example, rectangle group A in Fig. 3.8(a) is expressed
by A and Ā to indicate group and relative color. Independent rectangle groups
represented by X, Y, and Z do not have a complementary variable.
There are five groups in Fig. 3.8(a). Our goal is to determine the five
binary variables to minimize the objective function in Fig. 3.8(b). The ILP
formulation in (3.1) is not an efficient way to solve the problem because ILP
is NP-hard. With n groups in a layout there are 2^n candidate color assignments,
and no polynomial-time method is known for finding the exact solution of the
formulation. Therefore, we need an efficient heuristic method for our
problem.
(a) Boolean variable assignment after grouping and relative coloring
(b) Formulation for minimum stitch insertion
Figure 3.8: An example showing group color assignment formulation after relative coloring.
If we represent the objective function as a graph, then according to the
following theorem the group color assignment for minimum stitch insertion can be
achieved with a graph bi-partitioning algorithm. Even though heuristic graph
partitioning does not guarantee an optimal solution, efficient heuristics such as
Fiduccia-Mattheyses (FM) have been studied.
Theorem 1. The number of stitches in layout decomposition is equal to the
cut size of the bi-partitioning problem in graph theory.
Proof. Let G = {gi | 1 ≤ i ≤ n} be a set of groups and S = {sj | 1 ≤ j ≤ m} be
a set of stitches. After group color assignment, G0 = {gi0 | 1 ≤ i0 ≤ n0} is the
set of groups assigned to color0 and G1 = {gi1 | 1 ≤ i1 ≤ n1} is the set of groups
assigned to color1. A stitch sj appears only when two touching groups are
assigned to G0 and G1 respectively, because a stitch disappears when two touching
groups have the same color. Therefore, the number of stitches, m, is equal to the
number of times that two touching groups have different colors.
In a similar way, let C = {ck | 1 ≤ k ≤ p} be a set of cuts and V =
{vl | 1 ≤ l ≤ q} be a set of vertexes in the graph for bi-partitioning. A
cut in bi-partitioning appears when two connected vertexes are in different
partitions. Therefore, the cut size, p, is equal to the sum of the weights of
the edges that link the two partitions. When we consider each group in G as a
vertex in a graph, G0 and G1 become the two partitioned sets of vertexes.
Therefore, m becomes p in the graph partitioning expression, and the number of
stitches is equal to the cut size of the bi-partitioning.
According to Theorem 1, the solution of a min-stitch formulation is the
same as the result of a min-cut bi-partitioning. The only difference from a
conventional min-cut partitioning algorithm is that a vertex representing the
group color is incompatible with the complementary vertex of the same group.
We call two such nodes, which must be in different partitions, a repulsive pair.
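The correspondence between stitches and cut size, including the repulsive-pair restriction, can be illustrated on a toy graph. The sketch below uses exhaustive search in place of a partitioning heuristic, so it is only practical for a handful of vertexes. The vertex names and edge weights are made up for the example and do not correspond to any figure.

```python
from itertools import product


def cut_size(part, edges):
    """Sum of edge weights crossing the two partitions (= number of stitches)."""
    return sum(w for u, v, w in edges if part[u] != part[v])


def min_cut_with_repulsive_pairs(vertices, edges, repulsive):
    """Exhaustive min-cut bi-partitioning that keeps repulsive pairs apart."""
    best = None
    for bits in product((0, 1), repeat=len(vertices)):
        part = dict(zip(vertices, bits))
        if any(part[u] == part[v] for u, v in repulsive):
            continue                      # complementary vertexes must split
        c = cut_size(part, edges)
        if best is None or c < best[0]:
            best = (c, part)
    return best


# Hypothetical graph: group A with complement A', independent groups X and Y.
vertices = ["A", "A'", "X", "Y"]
edges = [("A", "X", 2), ("A'", "Y", 1), ("X", "Y", 1)]   # weight = touching pairs
cut, part = min_cut_with_repulsive_pairs(vertices, edges, [("A", "A'")])
```

In this toy case the weight-2 edge keeps X with A, and one stitch remains unavoidable because Y touches both sides.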
3.3.6 Modified Graph Min-Cut Partitioning
We implement the classic FM partitioning algorithm with the following
modifications in order to support repulsive layout pairs and balanced density.
(a) A and Ā are in the same partition
(b) A and Ā are in different partitions
Figure 3.9: An example showing how to handle a repulsive layout pair in the FM partitioning algorithm.
Repulsive pairs cannot be guaranteed to be assigned to different
partitions simply by assigning negative weights, because min-cut partitioning can
fall into local minima. Therefore, in our method, we process the repulsive pairs
in two different ways. First, if the two nodes are in the same partition, their
moving gains are calculated independently. The one with the higher gain is
moved first; upon moving, the other one is locked as well, with a gain of zero.
In Fig. 3.9(a), A and Ā form a repulsive pair but are in the same partition,
and their gains are zero and eight. When the node with the higher gain is moved,
the other is locked as a result.
On the other hand, if the two nodes are in opposite partitions, they are
given the same gain, which is the sum of their original gains. Upon
picking one of them as the candidate for moving, a swap is carried out to ensure
they are still kept on opposite sides. If A and Ā are in opposite partitions as
shown in Fig. 3.9(b), the sum of their original gains is eight, so both gains
become eight. Picking either one for moving results in a swap of A and Ā.
Regardless of what the initial partition is in this example, our method
reaches the same final solution.
In Fig. 3.10, we construct a graph to decompose the layout in Fig. 3.8.
The pair of group colors represented by A and Ā is a repulsive pair. E and Ē
also need to be in different partitions. The edge weight between two vertexes is
the number of touching pairs between the two groups. For example, since A and
X have two touching points, the edge weight between A and X is two.
Fig. 3.10(a) shows the min-cut partitioning and the corresponding decomposed
layout for minimum stitch insertion. Red is assigned to all rectangles
belonging to A and E, while rectangles in Ā, Ē, X, Y and Z are
assigned blue.
During partitioning, we can control stitch count and density balance.
Fig. 3.10(b) shows the balanced partitioning and the decomposed layout, which
has four stitches. Global balance can be achieved by adding a weight on each
vertex in the graph. In order to maintain a balance between the two partitions,
we need to start with a balanced initial partitioning solution and then keep
track of the total weight on both partitions. Every time before moving a
vertex into the other partition, a check is carried out to ensure the area of the
original partition does not drop below a certain threshold after moving.
To consider local balance, we need to divide the layout into subdivisions
and assign a balance constraint to each of them, as shown in Figure 3.11. The
constraints for local balanced density are presented in Figure 3.11(b). r is the
balance ratio, and when r is 0.5, the decomposed layers are evenly distributed.
Wji is the total area of Rji. smaxji is the weight of the vertex having maximum
weight in Rji. Aji is the sum of vertex weights in a partition. When a layout
is divided into i columns and j rows, the number of constraints is i× j. The
local balance is guaranteed by not moving a vertex to another partition if the
move breaks any of the balance constraints.
(a) Min-cut partitioning and the decomposed layout
(b) Balanced min-cut partitioning and the decomposed layout
Figure 3.10: Group color assignment with different partitioning schemes and decomposition results.
(a) Layout division
(b) Local balance constraints
Figure 3.11: Layout division for local balance.
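The global balance check before a move can be sketched as below. The exact constraint of Fig. 3.11(b) is not reproduced here, so the threshold is an assumed simplification: the source partition must keep at least `min_ratio` of the total vertex weight after the move. All names are hypothetical.

```python
def move_keeps_balance(weights, part, v, min_ratio):
    """Return True if moving vertex v out of its partition keeps the balance.

    weights   -- vertex -> area weight
    part      -- vertex -> current partition (0 or 1)
    min_ratio -- assumed lower bound on each partition's share of total weight
    """
    total = sum(weights.values())
    src = part[v]
    src_weight = sum(w for u, w in weights.items() if part[u] == src)
    # The source partition loses weights[v]; reject the move if it gets too light.
    return src_weight - weights[v] >= min_ratio * total


weights = {"a": 30, "b": 20, "c": 50}
part = {"a": 0, "b": 0, "c": 1}
ok_b = move_keeps_balance(weights, part, "b", 0.25)   # 30 of 100 stays in partition 0
ok_a = move_keeps_balance(weights, part, "a", 0.25)   # only 20 of 100 would stay
```

For local balance, the same check would be applied once per subdivision Rji, using only the vertex weights that fall inside that subdivision.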
3.3.7 Complexity Analysis
Let N be the number of rectangles, and E be the number of neighboring
pairs. Segmentation of polygons into rectangles takes O(N). Finding
neighbors needs O(NlogN) because sorting according to coordinate is required.
The complexity of projection to non-touching neighbors is O(E). Grouping and
relative coloring using DFS requires O(N+E). Group color assignment with
min-cut partitioning can be done in linear time, O(N). Therefore, the
complexity of the framework is O(NlogN). Since generating TDD constraints runs
only for selected nets and the number of variables is usually less than one
hundred, we can ignore the complexity of the TDD part.
The proposed decomposition framework works in polynomial time, while
ILP based approaches [43, 99] solve an NP-hard problem.
3.4 Application to Contact Layer Decomposition
The early applications of DPL are contact and metal layers. As
transistor size becomes smaller, contacts become one of the dominant variation
sources [7]. As we discuss in section 3.2.2, contact mismatch can cause
additional transistor variation.
We use TCAD simulation in order to verify the current variation due to
contact overlay in DPL. We assume that the maximum overlay is 7.5nm. Since the
two contacts can shift by 7.5nm in opposite directions, the overlay case of
Fig. 3.3(d) has +15nm S/D contact space variation. Since single patterning has
no space variation between S/D contacts, a space variation of zero corresponds
to the single patterning result in Fig. 3.12. As expected, if we make the S/D
contacts with different masks in DPL, the ∆Ion variation due to overlay becomes
almost twice that of single patterning. Therefore, we need to print S/D contacts
on the same mask in order to remove the additional variation due to overlay
in DPL.
Our decomposition framework can be used to address this requirement.
Fig. 3.13(a) shows a contact layer to be decomposed. The contact pairs that
violate the minimum spacing rule are (C,G), (F,I), (H,K) and (J,N) in the figure,
because the spacing within each pair is less than mins. In the graph expression,
the violating pairs are repulsive nodes and are represented by fine (red) lines in
Fig. 3.13(c). Since the S/D contacts of a transistor need to be assigned to the
same mask, we insert strongly connected edges between S/D contacts. The
strongly connected edges are not allowed to be cut during graph partitioning.
In order to guarantee that, we group strongly connected nodes into a virtual
node in Fig. 3.13(d). This grouping makes sure that strongly connected nodes
are partitioned to the same mask. The
corresponding decomposed result is shown in Fig. 3.13(b). Even though we
force S/D contacts to be on the same mask, the area increase is negligible if the
S/D contact space is larger than the mins defined in the design rule, because the
proposed decomposition does not generate additional mins violations.
Figure 3.12: ∆Ion variation for various overlay cases.
(a) Original contact layer
(b) Decomposed contact layer with overlay consideration
(c) Decomposition graph for the contact layer (Bold line: strongly connected edge, Fine (Red) line: repulsive relations)
(d) Simplified graph after contact grouping
Figure 3.13: An application to contact layer decomposition.
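One way to picture the grouping of strongly connected contacts into virtual nodes is a union-find pass, sketched below. The violating pairs are the ones named in the text. The S/D pairings are hypothetical, because the transistor connectivity of Fig. 3.13 is not given in the prose, and the function name is likewise an assumption.

```python
def group_contacts(contacts, same_mask_pairs, repulsive_pairs):
    """Merge contacts that must share a mask, then lift repulsive edges.

    Returns (node_of, edges): the virtual node of each contact and the set of
    repulsive edges between virtual nodes.
    """
    parent = {c: c for c in contacts}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]   # path halving
            c = parent[c]
        return c

    for u, v in same_mask_pairs:            # strongly connected edges
        parent[find(u)] = find(v)

    edges = set()
    for u, v in repulsive_pairs:            # spacing below mins
        ru, rv = find(u), find(v)
        if ru == rv:
            raise ValueError(f"{u} and {v} must share a mask but violate mins")
        edges.add(frozenset((ru, rv)))
    return {c: find(c) for c in contacts}, edges


contacts = list("CFGHIJKN")
sd_pairs = [("C", "F"), ("G", "H")]                          # hypothetical S/D pairs
violations = [("C", "G"), ("F", "I"), ("H", "K"), ("J", "N")]  # pairs from the text
node_of, edges = group_contacts(contacts, sd_pairs, violations)
```

A violating pair that falls inside one virtual node cannot be resolved by coloring at all, which is why the sketch raises an error in that case.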
3.5 Application to Robust Metal Layer Decomposition
3.5.1 TDD Constraints for Overlay Compensation
We can obtain a decomposed layout that is robust against timing variation due
to overlay. Delay variation due to overlay is expressed in Eqn(3.2). Since
non-coupling capacitance, such as metal to ground capacitance and output loading,
is independent of overlay, ∆Delayoverlay does not need to consider non-coupling
capacitance.

\[
\Delta Delay_{overlay} = Delay_{no\ overlay} - Delay_{overlay}
\tag{3.2}
\]
In Eqn(3.3), let Cc be a coupling capacitance when there is no overlay
with distance d0. ε is the permittivity of an isolation material such as SiO2.
When we consider space variation due to overlay, Cc becomes Cc overlay, which
has a distance variation denoted by ∆d0. Let g be the variation of the coupling
capacitance between horizontally parallel patterns, and h be the variation of the
coupling capacitance between vertically parallel patterns.

\[
\begin{aligned}
C_c &= \epsilon\,\mathrm{Area}/d_0, \qquad \Delta C_c = d_0/(d_0+\Delta d_0) \\
C_{c\ overlay} &= C_c\,\Delta C_c = \epsilon\,\mathrm{Area}/(d_0+\Delta d_0) \\
g \text{ and } h &= 1-\Delta C_c = \Delta d_0/(d_0+\Delta d_0)
\end{aligned}
\tag{3.3}
\]
Let ak be the multiplication factor of the coupling capacitances for
horizontally parallel patterns and bk be the multiplication factor for vertically
parallel patterns. ak and bk are defined as the kth coupling capacitance times
the resistance from the driver to the location of the coupling capacitance. In
Eqn(3.4), Rd is the driver resistance and Rn is the nth resistance from the driver.
We assume that the coupling capacitance is located at the mth resistance. MFk
is the Miller factor [13, 41] of the kth coupling capacitance for equal timing,
determined by the slew rates of victim and aggressor.

\[
a_k \text{ and } b_k = \left( R_d + \sum_{n=1}^{m-1} R_n + 0.5\,R_m \right) MF_k\, C_{c_k}
\tag{3.4}
\]
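Eqn(3.4) translates directly into a few lines of code. The sketch below assumes the wire resistances are given as a list ordered from the driver, with `m` counted from 1 as in the text. The function name and the numbers in the example are illustrative only.

```python
def coupling_weight(r_driver, r_segments, m, miller_factor, c_coupling):
    """Multiplication factor a_k (or b_k) of Eqn(3.4).

    r_driver      -- driver resistance R_d
    r_segments    -- wire resistances R_1, R_2, ... ordered from the driver
    m             -- 1-based index of the segment holding the coupling capacitance
    miller_factor -- MF_k
    c_coupling    -- k-th coupling capacitance C_ck
    """
    upstream = sum(r_segments[:m - 1])            # R_1 + ... + R_(m-1)
    resistance = r_driver + upstream + 0.5 * r_segments[m - 1]
    return resistance * miller_factor * c_coupling


# Illustrative values: the capacitance sits on the third segment.
a_k = coupling_weight(100.0, [10.0, 20.0, 30.0], 3, 1.0, 2.0)
```

With these example numbers the resistance term is 100 + 10 + 20 + 15 = 145, so the factor is 290 in the corresponding units.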
∆Delayoverlay can be re-written as Eqn(3.5). i denotes the number of
horizontal aggressors, and j denotes the number of vertical aggressors.

\[
\begin{aligned}
\Delta Delay_{overlay} &= a_1 g_1 + \dots + a_i g_i + b_1 h_1 + \dots + b_j h_j \\
                       &= A G^T + B H^T \\
\text{where } A &= [a_1\ a_2\ \dots\ a_i], \quad G = [g_1\ g_2\ \dots\ g_i] \\
              B &= [b_1\ b_2\ \dots\ b_j], \quad H = [h_1\ h_2\ \dots\ h_j]
\end{aligned}
\tag{3.5}
\]
Distance variation due to translation overlay is expressed in Eqn(3.6)
according to [96]. For simplicity, we focus on translation overlay. Our
work can be extended to include rotation and magnification overlay. λ is the
amplitude of translation overlay, and θ is the translation angle, which is a
totally random value. γ represents the relative geometric relation between the
1st and the 2nd pattern. Our problem is then defined as assigning γ to reduce
the overlay effect on timing.

\[
\Delta d_0 = \lambda \cos(\theta - \gamma)
\tag{3.6}
\]
From Eqn(3.3) and (3.6), we can derive gi and hj. Since gi has two
possible choices according to γ (note that a horizontally parallel pattern
can have two geometric relations: γ = π/2 or γ = 3π/2), gi is modeled with
a binary variable xi which indicates the geometric relation. For example, if xi
is one, it means that γ = π/2 is preferred to reduce the overlay effect, while
the opposite geometric relation (γ = 3π/2) is preferred if xi is zero. We can
simplify gi when d0 is large relative to λ. Since the overlay amplitude should
be controlled to less than 10% of the space (d0) between two patterns, d0+λsin(θ)
can be simplified to d0. hj has a π/2 phase difference from gi. After simplifying,
we present gi and hj in Eqn(3.7).

\[
\begin{aligned}
g_i &= \frac{x_i \lambda \cos(\theta - \pi/2)}{d_0 + \lambda \cos(\theta - \pi/2)}
     + \frac{(1 - x_i) \lambda \cos(\theta - 3\pi/2)}{d_0 + \lambda \cos(\theta - 3\pi/2)}
     \approx \frac{\lambda}{d_0} (2 x_i - 1) \sin(\theta) \\
h_j &= \frac{y_j \lambda \cos(\theta)}{d_0 + \lambda \cos(\theta)}
     + \frac{(1 - y_j) \lambda \cos(\theta - \pi)}{d_0 + \lambda \cos(\theta - \pi)}
     \approx \frac{\lambda}{d_0} (2 y_j - 1) \cos(\theta)
\end{aligned}
\tag{3.7}
\]
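The quality of the approximation in Eqn(3.7) can be checked numerically. The sketch below compares the exact gi from Eqn(3.3) and (3.6) with the simplified form for an overlay amplitude of 10% of the spacing. The values λ = 4 and d0 = 40 are example numbers chosen for the check, not data from the experiments.

```python
import math


def g_exact(x, lam, d0, theta):
    """g_i from Eqn(3.3) and (3.6): gamma = pi/2 if x == 1, else 3*pi/2."""
    gamma = math.pi / 2 if x == 1 else 3 * math.pi / 2
    delta_d = lam * math.cos(theta - gamma)
    return delta_d / (d0 + delta_d)


def g_approx(x, lam, d0, theta):
    """Simplified g_i of Eqn(3.7), valid when d0 is much larger than lam."""
    return lam / d0 * (2 * x - 1) * math.sin(theta)


lam, d0 = 4.0, 40.0          # example: overlay amplitude is 10% of the spacing
worst = max(abs(g_exact(x, lam, d0, t) - g_approx(x, lam, d0, t))
            for x in (0, 1)
            for t in (k * math.pi / 12 for k in range(24)))
```

For this ratio the absolute error stays near 0.01, about one tenth of the peak value λ/d0 = 0.1 of gi itself, which is the level of accuracy the simplification trades for a linear dependence on xi.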
From Eqn(3.5) and (3.7), our cost function for overlay impact on timing
is derived in Eqn(3.8). To minimize the cost function, we need to minimize
\sqrt{\alpha^2 + \beta^2}, the amplitude of sin(θ + φ). φ affects only the phase.
Since the horizontal and vertical patterns are orthogonal, we can optimize α²
and β² independently, which means that the overlay impact is minimized when
α² and β² each have their minimum values.

\[
\begin{aligned}
& \sqrt{\alpha^2 + \beta^2}\,\sin(\theta + \phi) \\
\text{where } & \alpha = 2 A X^T - \sum_{n=1}^{i} a_n, \quad
                \beta = 2 B Y^T - \sum_{n=1}^{j} b_n \\
& X = [x_1, x_2, \dots, x_i], \quad Y = [y_1, y_2, \dots, y_j] \\
& \phi = \sin^{-1}\!\left( \frac{\beta}{\sqrt{\alpha^2 + \beta^2}} \right)
\end{aligned}
\tag{3.8}
\]
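Finally, we propose an ILP formulation to find the relative positions of
horizontally parallel patterns to minimize α².

\[
\begin{aligned}
\text{minimize}\quad & 4 \times \sum_{n=1}^{i} a_n \left( A W_n^T - w_{nn} \sum_{p=1}^{i} a_p \right)
                       + \left( \sum_{p=1}^{i} a_p \right)^2 \\
\text{s.t.}\quad & w_{ii} = x_i \\
                 & 1 + w_{ij} \ge x_i + x_j \\
                 & x_i \ge w_{ij} \\
                 & x_j \ge w_{ij}
\end{aligned}
\tag{3.9}
\]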
New binary variables wij are introduced to change the quadratic integer
program into a linear integer program. xi and wij are binary variables
in (3.9), and Wn is defined as follows.

\[
W_n = [w_{n1}\ w_{n2}\ \dots\ w_{ni}]
\tag{3.10}
\]

In a similar way, we can get a solution set of Y to minimize β² by
substituting b for a and y for x in the above ILP formulation.
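To see that the linearized objective of (3.9) really equals α², the sketch below sets wpq = xp·xq (which satisfies the constraints of (3.9) for binary values) and compares both expressions for every binary X. Exhaustive enumeration stands in for an ILP solver here. The weights in `a` are made-up integers used only for the check.

```python
from itertools import product


def alpha_squared(a, x):
    """alpha^2 with alpha = 2*A*X^T - sum(a_n), as in Eqn(3.8)."""
    return (2 * sum(ai * xi for ai, xi in zip(a, x)) - sum(a)) ** 2


def linearized_objective(a, x):
    """Objective of (3.9) with w_pq = x_p * x_q, so that w_pp = x_p."""
    n, s = len(a), sum(a)
    w = [[x[p] * x[q] for q in range(n)] for p in range(n)]
    return 4 * sum(a[p] * (sum(a[q] * w[p][q] for q in range(n)) - w[p][p] * s)
                   for p in range(n)) + s * s


def best_assignment(a):
    """Exhaustive search for the X minimizing alpha^2 (small inputs only)."""
    return min(product((0, 1), repeat=len(a)), key=lambda x: alpha_squared(a, x))


a = [3, 1, 2]                                  # hypothetical a_k values
for x in product((0, 1), repeat=len(a)):
    assert linearized_objective(a, x) == alpha_squared(a, x)
best = best_assignment(a)
```

For these weights the search finds an assignment with α = 0, because the aggressor with weight 3 can be balanced against the other two.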
Figure 3.14: An example with four neighbors.
Fig. 3.14 shows an example pattern to which the proposed method is applied.
a1, a2, b1, and b2, corresponding to each coupling capacitance, are presented in
Fig. 3.14. Here, there are two horizontally parallel aggressors and two vertically
parallel aggressors.

\[
\begin{aligned}
\alpha &= a_1 (2 x_1 - 1) + a_2 (2 x_2 - 1) \\
\beta  &= b_1 (2 y_1 - 1) + b_2 (2 y_2 - 1)
\end{aligned}
\tag{3.11}
\]

α and β of Fig. 3.14 are shown in Eqn(3.11). With the proposed ILP
formulation in Eqn(3.9), we can determine x1 and x2 to minimize α², and
y1 and y2 to minimize β², respectively. Here, x1 and x2 give the relative
decomposition for Cc1 and Cc2 as shown in Fig. 3.15.
(a) Overlay compensation with three stitches
(b) The same compensation with two stitches
Figure 3.15: Different decomposition with TDD.
After applying our ILP formulation, we can get two different decomposition
results, as shown in Fig. 3.15. Three stitches are inserted in Fig. 3.15(a).
However, a different solution for the same overlay compensation requires only
two stitches. We therefore combine TDD constraints with graph based stitch
minimization in order to optimize stitches under TDD constraints.
3.5.2 TDD Constraints aware Decomposition
From the previous subsection, we can get constraints for TDD. X and Y
are determined by the ILP formulation. Let X be partitioned into X0 and X1.
In Fig. 3.15(a), X0 contains x1 while x2 belongs to X1. Since the complement of
x2 is x′2, we can express the constraints with X0 = {x1, x′2}. Since the elements
in X0 need to have the same color, x1 should be forced to be with x′2 in the same
partition. We can place them in the same partition by adding a weighted
edge. The allowed stitch increase can be controlled by adjusting the weight.
Decomposition for more overlay compensation needs a larger edge weight,
which results in more stitches.
(a) After relative coloring
(b) Group color assignment without TDD constraints
(c) TDD constraints insertion
(d) Group color assignment when the edge weight (w) is bigger than one
Figure 3.16: Decomposition with TDD constraints.
Fig. 3.16 shows the graph based TDD procedure. The grouping and
relative coloring result is presented in Fig. 3.16(a). Graph based decomposition
without TDD constraints is in Fig. 3.16(b). Two edges are inserted to consider
TDD constraints in Fig. 3.16(c). When the weight denoted by w is bigger than
one, the new partition result is shown in Fig. 3.16(d). Two stitches are inserted
for TDD, and the decomposition result looks like Fig. 3.15(b).
3.6 Application to Hierarchical Decomposition
One more requirement for layout decomposition is hierarchical
decomposition support. Since standard cells are optimized with manual effort,
changing the coloring within a standard cell may not be preferred. Our proposed
decomposition can be applied to hierarchical decomposition as well as flattened
decomposition. In Fig. 3.17(a), we show flattened decomposition of connected
cells (inverter-nor-inverter). After decomposition graph generation and min-cut
partitioning, we get the decomposed layout for minimum stitch insertion
in Fig. 3.17(b). We can observe three stitches in both the decomposition graph
and the decomposed layout.
We show the decomposition flow with pre-decomposed standard cells
in Fig. 3.18. Let Inverter1 be pre-decomposed into I1 and I1’, the Nor cell be
pre-decomposed into N1 and N1’, and Inverter2 be pre-decomposed into I2 and
I2’. Then, we can construct the decomposition graph based on the pre-decomposed
metal layer. The size of the decomposition graph reduces dramatically. When
N cells need to be decomposed, the number of nodes in the decomposition graph
becomes 2N. After min-cut partitioning of the decomposition graph, we can
generate the hierarchically decomposed layout in Fig. 3.18(c). Only Inverter2
flips its coloring to minimize the number of stitches. Our proposed decomposition
works without losing the pre-assigned relative colors because it only allows
color flipping.
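The effect of allowing only per-cell color flipping can be illustrated with a small search. The sketch below models each pre-decomposed cell by a single flip bit instead of a repulsive vertex pair, which is an equivalent but simplified view. The abutment list is hypothetical and does not reproduce the geometry of Fig. 3.18. Exhaustive enumeration again stands in for min-cut partitioning.

```python
from itertools import product


def count_stitches(flips, abutments):
    """Count touching shapes across cell boundaries that end up on different masks.

    flips     -- tuple of 0/1 per cell (1 = swap the two pre-assigned colors)
    abutments -- (cell_u, color_u, cell_v, color_v) for shapes touching at a boundary
    """
    return sum((cu ^ flips[u]) != (cv ^ flips[v]) for u, cu, v, cv in abutments)


def best_cell_flips(num_cells, abutments):
    """Exhaustively choose which cells to flip (small examples only)."""
    best = min(product((0, 1), repeat=num_cells),
               key=lambda flips: count_stitches(flips, abutments))
    return best, count_stitches(best, abutments)


# Hypothetical row: cell 0 = Inverter1, cell 1 = Nor, cell 2 = Inverter2.
abutments = [(0, 0, 1, 0),    # same pre-assigned color on both sides
             (1, 1, 2, 0)]    # colors disagree unless one cell flips
flips, stitches = best_cell_flips(3, abutments)
```

Because a flip changes every shape of a cell at once, the coloring inside each standard cell is preserved while the stitches at cell boundaries are minimized.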
3.7 Experimental Results
We implement the decomposition framework in C++ with OpenAccess 2.2
in order to interface with GDSII directly. We test on a 3.0GHz Linux
machine with 4GB RAM to verify our algorithm.
First, we present the decomposed layout for the metal 1 layer in Fig. 3.19.
The minimum metal width and space used in Fig. 3.19(a) are 32nm and 34nm,
respectively. Our framework works well for the complicated metal patterning. We
verify that there is no design rule violation after decomposition. 13 stitches
are inserted to resolve rule violations.
Second, we show the scalability and runtime of our algorithm. The
poly-silicon layer in the benchmark circuits is scaled down to 40nm half pitch.
ISCAS-85 and ISCAS-89 benchmark circuits are used to verify the scalability.
Before decomposition, the minimum space between poly-silicon shapes was 40nm.
We select mins=42nm and minom=10nm for decomposition to avoid native conflicts,
which would have to be removed by layout modification. ISCAS-89 circuits have
many native conflicts when mins is bigger than 43nm because the ISCAS-89
benchmarks we have are not designed to be double patterning friendly. Table 3.1
shows the runtime of decomposition as the design size increases. S38584, which
is the biggest circuit in the benchmark, can be decomposed evenly in 285.24s.
Since mins=42nm is close to the 40nm minimum space, only a few stitches are
required to resolve the conflicts. The graph in Fig. 3.20(a) supports that our
min-cut partition based approach has linear time complexity, as stated in
section 3.3.7. Fig. 3.20(b) shows the total runtime of decomposition versus
circuit size.
(a) Decomposition graph with cell abutment (Inverter-Nor2-Inverter)
(b) Corresponding decomposed layout
Figure 3.17: Flattened layout decomposition.
(a) Initial decomposed standard cells
(b) Hierarchical decomposition graph
(c) Hierarchically decomposed layout
Figure 3.18: Hierarchical layout decomposition.
(a) Metal 1 decomposition with 13 stitch insertions
(b) First mask
(c) Second mask
Figure 3.19: Metal 1 decomposition with our framework.
(a) Graph partition time
(b) Total runtime
Figure 3.20: Graph showing runtime increment as circuits are bigger.
Third, we verify the quality and efficiency of our framework. Table 3.2
shows the results of our ILP formulation in (3.1), and the results of our
heuristic based on min-cut partitioning are in Table 3.3. We compare runtime and
stitch optimization during layout decomposition. The GLPK (GNU Linear Programming
Kit) solver is used for ILP solving. Because decomposition with the ILP
formulation becomes intractable as circuit size increases, and fails even for
small circuits like C499, we divide each circuit into several parts by rows of
cells. Since the poly-silicon shapes in one row of cells are isolated from the
other parts of the circuit, decomposition after dividing into several rows still
provides an exact solution with the ILP formulation. Note that our ILP
implementation provides an optimal solution for minimum stitch insertion at the
cost of more runtime, because it does not use any of the speed-up techniques
described in [43, 99], which may sacrifice optimality. We use mins=54nm
and minom=20nm to enable more potential stitch insertions. Since the ISCAS-89
benchmarks already have native conflicts for mins below 54nm, we do not show
their results in Table 3.2 and Table 3.3. When mins is bigger than 54nm, there
are native conflicts in several ISCAS-85 benchmarks. Therefore, we choose
mins=54nm based on the availability of decomposition, which is enough to
show the efficiency of our framework. The quality of min-cut graph partitioning
depends on the initial partitioning. Thus, we execute the modified
min-cut partitioning twenty times and pick the case with the minimum stitch
count. Runtime means the total runtime for the twenty runs. When we do not
consider balanced density, min-cut partitioning and ILP based decomposition have
the same number of stitches except for C1908, which has one more stitch in our
approach. The graph theoretic decomposition is 1.4∼9762.6 times
faster than the ILP based decomposition.

Table 3.1: Runtime comparison (mins = 42nm)
                               Runtime (second)               Results
Circuit  #Group   #Touching  Except     Min-cut    Total    Inserted  Balance
                  neighbors  partition  partition           stitches  ratio(%)
C432       1554       763       0.48      0.01      0.49       0       48.18
C499       3503      2260       1.12      0.23      1.35       0       48.08
C880       3105      1308       0.86      0.07      0.93       0       48.10
C1355      4630      2091       1.39      0.21      1.60       1       48.05
C1908      7403      3447       2.81      0.52      3.33       0       48.23
C2670     11325      5291       3.32      0.92      4.24       0       48.05
C3540     13934      6062       4.71      1.26      5.97       0       48.02
C5315     20393      9382       7.19      2.01      9.20       5       48.02
C6288     18836      7764       6.14      1.27      7.41       0       48.02
C7552     29642     13344      11.84      3.13     14.97       0       48.03
S1488      5952      2558       1.57      0.29      1.86       0       48.01
S15850     7983      3282     130.38      9.78    140.16       0       48.04
S35932   188556     75943     263.29     14.21    277.50       0       48.01
S38417   195448     74311     270.43     18.49    288.92       0       48.00
S38584   188298     72342     262.66     22.58    285.24       1       48.01

Table 3.2: Decomposition results with ISCAS benchmark (mins = 54nm, minom = 20nm), ILP solver (exact method)
                              No balance, ILP (Exact)
Circuit  #Groups  #Touching  #Partitions  RunTime    Inserted  Balanced
                  neighbors  for ILP      (total)    stitches  ratio(%)
C432       1512      1098         1           0.63       1       20.35
C499       3103      3280        12         100.85      50       24.01
C880       3758      2631        14        4525.57     198       30.09
C1355      4836      3083        18         702.40     114       18.91
C1908      7795      5472        18       37019.76     371       22.09
C2670     12863      9905         -        > 24Hr        -           -
C3540     16638     12021         -        > 24Hr        -           -
C5315     24483     18373         -        > 24Hr        -           -
C6288     19922     11577         -        > 24Hr        -           -
C7552     34309     24789         -        > 24Hr        -           -

Table 3.3: Decomposition results with ISCAS benchmark (mins = 54nm, minom = 20nm), Graph Partition (Proposed heuristic)
                      No balance                               48% balance
Circuit  RunTime     RunTime  Inserted  Balanced   RunTime     RunTime  Inserted  Balanced
         comparison  (sec)    stitches  ratio(%)   comparison  (sec)    stitches  ratio(%)
C432     x1.4         0.46        1      33.60     x1.0         0.65        2      48.12
C499     x49.9        2.02       50      46.47     x49.9        2.02       50      48.50
C880     x2773.0      1.63      198      47.12     x2807.4      1.61      198      48.87
C1355    x347.4       2.02      114      36.12     x344.0       2.04      114      48.00
C1908    x9762.6      3.79      372      46.78     x10422.2     3.55      373      48.66
C2670    -            6.70      947      43.51     -            6.87      948      49.30
C3540    -            9.85     1034      41.46     -           10.07     1034      49.39
C5315    -           17.43     1546      40.87     -           18.50     1549      48.00
C6288    -           11.57      256      30.81     -           11.25      256      48.13
C7552    -           30.89     2058      41.97     -           31.52     2060      48.02
(a) Blue(13%), Green(87%) (b) Blue(50%), Green(50%)
Figure 3.21: Extremely unbalanced decomposition (S38584).
Fig. 3.21(a) shows the extremely unbalanced decomposition. Even
though every rule violation is removed in Fig. 3.21(a), we can further improve
patterning quality with balanced partitioning.
Fig. 3.22 compares decomposition between the unbalanced and bal-
anced case at mins=60nm and minom=20nm for C432. The unbalanced layer
has nine stitches with 27% balanced ratio while the balanced decomposition
has 17 stitches with globally 50% balanced ratio. In addition, Fig. 3.22(b)
is locally balanced because we enforce the local density balance between 40%
and 60% in each cell row. The result also shows that there is a trade-off
between stitch counts and density balance.
(a) Yellow(27%), Blue(73%) (b) Yellow(50%), Blue(50%)
Figure 3.22: C432 decomposed layout.
OPC and lithography simulation are executed using CALIBRE for the
two cases in Fig. 3.22. Edge Placement Error(EPE) distributions after OPC
are compared in Fig. 3.23. The balanced decomposition in Fig. 3.22(b) shows
lower EPE distribution than the unbalanced decomposition in Fig. 3.22(a),
which indicates that the balanced decomposition has less variation than the
unbalanced decomposition.
Figure 3.23: The balanced decomposed layout has less EPE than the unbalanced one (C432).
Next, we verify how effectively our method reduces the variation caused by
overlay between contact layers in DPL. Fig. 3.24 compares the delay variation
of a delay path consisting of five inverters. We run Monte-Carlo simulation
based on the TCAD simulation in Fig. 3.12, with 50,000 samples. At the 3-σ
level, we observe that the delay variation reduces from ±1% to approximately
±0.5% with the proposed decomposition method.
Last, we verify the effectiveness of timing driven decomposition. As
a test circuit, we made three net structures with a metal spacing of 32nm
assuming 3nm of overlay as shown in Table 3.4. By changing edge weight,
we could control the number of inserted stitches. For example, we could see
41.4% overlay compensation with one stitch insertion when edge weight is 0.2
for Net3. Since the maximum peak to peak delay variation caused by overlay
is 9.0%, timing variation on overlay becomes 5.274%(=9%*(1-0.414)) after
one stitch insertion. It becomes 1.098%(=9%*(1-0.878)) after three stitch
insertions. When we increase edge weight, we could see more stitch insertions
and higher overlay compensation rate.
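The arithmetic above can be reproduced with a minimal Python sketch. The function name is hypothetical and not part of our implementation; the rows are the Net3 entries of Table 3.4 and the 9.0% peak to peak variation quoted above.

```python
def residual_timing_variation(peak_variation_pct, compensation_rate_pct):
    """Timing variation (in percent) that remains after overlay compensation."""
    return peak_variation_pct * (1.0 - compensation_rate_pct / 100.0)


# Net3 rows of Table 3.4: (edge weight, inserted stitches, compensation rate in %).
NET3_ROWS = [(0.2, 1, 41.4), (0.5, 3, 87.8), (1, 9, 99.8), (100, 9, 99.8)]

for weight, stitches, rate in NET3_ROWS:
    left = residual_timing_variation(9.0, rate)
    print(f"edge weight {weight}: {stitches} stitches, {left:.3f}% variation left")
```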
Figure 3.24: Delay variation for an inverter chain to verify the robust contact decomposition in DPL.
Fig. 3.25 compares the overlay compensation for different numbers of stitch
insertions for Net3. As more stitches are inserted, timing fluctuation along the
translation angle reduces. When nine stitches are inserted, we can see that
there is no timing variation due to overlay. The bottom peak variation is not
symmetric with top peak variation in the graph because of the second order
term of sin(θ) and cos(θ) in Eqn 3.7. Note that ignoring the second order term
with the assumption that d0 + α is approximately equal to d0 is reasonable
because we could compensate overlay more than 95% in every test structure.
Table 3.4: Overlay compensation with TDD
Test     #Horizontal  #Vertical  Timing variation  Edge    Inserted  Overlay compensation
circuit  neighbors    neighbors  on overlay        weight  stitches  rate(%)
net1     3            6          9.3%              0.2     0         0.0
                                                   0.5     1         76.3
                                                   1       3         95.9
                                                   100     3         95.9
net2     10           9          8.1%              0.2     0         0.0
                                                   0.5     0         0.0
                                                   1       3         54.6
                                                   100     10        99.9
net3     16           16         9.0%              0.2     1         41.4
                                                   0.5     3         87.8
                                                   1       9         99.8
                                                   100     9         99.8

Figure 3.25: Reduction of timing variation as more stitches are inserted (Net3).
3.8 Discussions
In this chapter, we propose an efficient and flexible layout decomposition
framework with a graph theoretical approach. All the benchmark circuits can be
decomposed within five minutes with balanced density. Our framework can expedite
decomposition, which requires iterative executions and layout fixes in order
to remove native conflicts. Since the decomposition framework is flexible to
add constraints, we extend our work to timing driven decomposition, which
reduces the timing variation due to overlay. As future work, we can extend the
framework to correlation aware decomposition and multiple decomposition using
a multi-way partitioning algorithm.
Chapter 4
TSV Stress aware Timing Analysis and Layout
Optimization for 3D-ICs
4.1 Introduction
As we discussed in the previous chapters, geometric scaling has been facing
limitations. To keep up with Moore’s law, 3D-IC stacking has gained tremendous
interest because integration with TSVs can increase chip performance as well
as chip density [14, 46, 49, 102]. In addition, two chips manufactured by
different processes can also be integrated as one chip with 3D integration.
The coefficient of thermal expansion (CTE) of copper is 17×10^-6 K^-1 at 20°C,
while the CTE of silicon is 3×10^-6 K^-1 at 20°C [19]. The CTE mismatch
between copper and silicon causes inevitable
stress on silicon for both via-first and via-last approaches. The stress can
change mobility of carriers. Therefore, TSV stress induced by CTE mismatch
may cause timing violation if cells on a critical path are placed near TSVs.
Tensile stress enhances electron mobility. However, hole mobility is either
enhanced or degraded depending on stress and transistor channel direction.
Longitudinal tensile stress reduces hole mobility while transverse tensile stress
increases the mobility [80]. When TSV induced tensile stress is 100MPa and
the stress works for longitudinal direction, hole mobility degradation can be
up to 7.2%, which makes PMOS transition slow. If the PMOS is on a critical
path, it can cause unexpected setup time violation which is not detected with
the current timing analysis flow.
Even though several papers have been published regarding TSV stress for
reliability, this is the first work addressing TSV stress from the circuit
design perspective, to the best of our knowledge. In this chapter, we propose a
design flow to analyze TSV stress induced timing variation, and show its
implications for layout optimization during 3D-IC design. The first step of our
framework is to generate a stress map assuming that TSVs are pre-placed. Stress
calculation is based on an analytical model and linear super-position. The
stress map is used to estimate hole mobility variation and electron mobility
variation. Since every cell near TSVs has a different mobility depending on
the stress and the orientation between channel and TSV, we replace a cell near
TSVs with another cell that has the same topology but different timing
characteristics according to the estimated hole and electron mobility change.
To show the benefit of our framework, we demonstrate that TSV stress aware
design plays an important role in timing optimization: cell locations can be
adjusted to take advantage of the mobility enhancement due to TSV stress. Since
hole mobility contour is different from electron mobility contour, PMOS and
NMOS should be optimized separately. If a PMOS in a cell is on a critical
path, the cell becomes a critical cell for hole mobility optimization. An NMOS
critical cell can be optimally placed using the similar procedure [3, 64].
The rest of the chapter is organized as follows. We introduce related
work regarding strained silicon technology and stress impacts in section 4.2.
The overall stress-aware timing analysis and design flow will be shown in sec-
tion 4.3. We propose compact mobility modeling in section 4.4. In section 4.5,
we will explain how to analyze timing with TSV stress. Experimental results
are shown in section 4.6, and we conclude with discussions in section 4.7.
4.2 Related Work and Motivation
The formula (4.1) shows the relation between stress and strain. E is Young’s
Modulus, which is 160GPa for silicon. σ is the applied stress and ε is the
strain (deformation rate). For example, 160MPa stress in silicon results in
0.1% strain in silicon.
σ = E × ε (4.1)
Strained silicon has been used to enhance Ion of a transistor [81]. How-
ever, there are several unwanted stress sources, which should be considered
during the design phase. Shallow trench isolation (STI) is one of the unintentional
stress sources [42, 62] because SiO2 used for STI fill pushes out silicon
atoms near STI.
During 3D-IC manufacturing, another stress is caused by CTE mismatch between
copper TSV and silicon as shown in Fig. 4.1. Investigations [75] show that at
200°C an anneal time of 30-60 minutes is required in order to achieve
reasonable copper layer properties. Since the CTE of copper is larger than that
of silicon, copper contracts more on cooling, so at room temperature it has
less volume than during the annealing process. Several papers have simulated
the TSV induced stress [19, 54] using finite element analysis (FEA). They show
that TSV can cause tensile stress of more than 200MPa.
Figure 4.1: Thermal stress around TSV.
The change of mobility (µ) as a function of applied stress (σ) is given by
the formula (4.2) [77], where Π is the tensor of piezo-resistive coefficients
for holes and electrons, and σ is the applied stress in silicon. Tensile stress
has a positive sign and compressive stress has a negative sign.
\[ \frac{\Delta\mu}{\mu} = -\Pi \times \sigma \tag{4.2} \]
Since tensile stress increases the mean free path of electrons, it enhances
NMOS performance. However, longitudinal tensile stress degrades PMOS
performance as shown in Fig. 4.2(a) [36]. With longitudinal stress, the
piezo-resistive coefficient for electrons is -3.16×10^-10 Pa^-1, and the
coefficient for holes is 7.18×10^-10 Pa^-1 for a (001) wafer surface and 〈110〉
channel, which is the most popular scheme in semiconductor manufacturing
[77, 79]. For example, when TSV stress is 200MPa, (∆µ/µ)e is +6.32% for NMOS,
and (∆µ/µ)h is -14.36% for PMOS.
(a) ∆µ/µ for longitudinal tensile stress
(b) ∆µ/µ for transverse tensile stress
Figure 4.2: Mobility change due to tensile stress.
However, if the TSV is placed perpendicular to a transistor channel, mobility
for both holes and electrons is enhanced because the stress adds more space in
the silicon lattice for carriers to move fast. For transverse stress, the
piezo-resistive coefficient for electrons is -1.76×10^-10 Pa^-1, and the
coefficient for holes is -6.63×10^-10 Pa^-1 for a (001) surface and 〈110〉
channel. Similarly, we can expect (∆µ/µ)e = +3.52% and (∆µ/µ)h = +13.26% with
σ=200MPa. Empirically, it is known that (∆Ion/Ion)pmos is 0.5∼0.9 times
(∆µ/µ)h, and (∆Ion/Ion)nmos is 0.4∼0.6 times (∆µ/µ)e [56, 84] because Ion of a
transistor is determined by the sum of source, drain, and channel resistance.
Therefore, TSV stress aware timing analysis and layout optimization are
essential steps for 3D-IC design.
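The percentages quoted in this section follow directly from (4.2). The Python sketch below is only an illustration with hypothetical function names; the coefficients are the ones quoted above for a (001) surface and 〈110〉 channel, and the Ion range is the empirical factor range from [56, 84].

```python
# Piezo-resistive coefficients in 1/Pa, as quoted in the text.
PI = {
    ("electron", "longitudinal"): -3.16e-10,
    ("hole", "longitudinal"): 7.18e-10,
    ("electron", "transverse"): -1.76e-10,
    ("hole", "transverse"): -6.63e-10,
}


def mobility_change_pct(carrier, direction, stress_pa):
    """Relative mobility change from (4.2) in percent; tensile stress is positive."""
    return -PI[(carrier, direction)] * stress_pa * 100.0


def ion_change_range_pct(carrier, mobility_pct):
    """Empirical (low, high) range of the Ion change for a given mobility change."""
    lo, hi = (0.5, 0.9) if carrier == "hole" else (0.4, 0.6)
    return tuple(sorted((lo * mobility_pct, hi * mobility_pct)))


for key in PI:
    print(key, f"{mobility_change_pct(*key, 200e6):+.2f}%")
```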
4.3 Overall TSV Stress aware Design Flow
The overall flow of our 3D-IC design methodology is shown in Fig. 4.3. Our
timing analysis consists of two steps. The first step is to calculate TSV
stress and mobility change. Since FEA simulation, which provides an accurate
solution, takes several hours to simulate stress for one TSV, we use the
analytical model proposed in [54]. Mobility change can be calculated by an
extension of the formula (4.2). We will explain the process and device modeling
for a single TSV in section 4.4.1, and extend it to consider multiple TSVs in
section 4.4.2. The second step is 3D timing analysis with TSV stress. We use
PrimeTime as the STA (static timing analysis) engine. In section 4.5, we
explain how to deal with the Verilog netlist and timing library to consider
mobility variation. The timing result can be used for layout optimization.
Intuitively, if a PMOS in a cell is on a critical path, the cell should be
moved to a region that has positive (∆µ/µ)h. Then, we can run timing analysis
iteratively to verify the optimization effect.
Figure 4.3: Overall flow for TSV stress aware design.
4.4 TSV Stress and Mobility Variation Modeling
In this section, we will present compact process and device modeling
to consider TSV stress effect on timing.
4.4.1 Mobility Variation for a Single TSV
In this section, we assume that the TSV has a cylindrical shape, which is
widely used for better manufacturability. FEA based TSV simulation has been
proposed [19, 54]. The simulation approaches provide an accurate solution but
with a long runtime, which is not acceptable for our design flow that should
calculate stress for several thousands of TSVs iteratively after each
optimization. Assuming 2-D radial plane stress, we use the following
analytical solution, which is known as the Lamé stress solution in [54].
\[ \sigma_{rr} = -\frac{B \Delta\alpha \Delta T}{2} \left( \frac{R}{r} \right)^2 \tag{4.3} \]
The analytical stress model provides a relatively accurate solution [54]. In
the formula (4.3), B is the biaxial modulus, ∆α is the CTE difference between
copper and silicon, and ∆T is the temperature difference between copper
annealing and operating temperature. R is the TSV radius, and r is a distance
from TSV edge. We assume that ∆T is 175°C, which corresponds to 25°C for the
room temperature and 200°C for the copper annealing temperature, a relatively
low annealing temperature [75]. The formula shows that the thermal stress near
a TSV depends on the ratio of the TSV radius to the distance from the TSV
edge.
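A minimal Python sketch of (4.3) is given below as an illustration only. The function name is hypothetical, and the value of B is a placeholder assumption because no numeric biaxial modulus is stated in the text; ∆α and ∆T use the values quoted in this chapter. The sign of the result follows (4.3) as printed, so the sketch relies only on the magnitude and on the (R/r)^2 dependence.

```python
def radial_stress(biaxial_modulus_pa, delta_alpha_per_k, delta_t_k, tsv_radius_um, r_um):
    """Evaluate (4.3): sigma_rr = -(B * d_alpha * d_T / 2) * (R / r)**2, in Pa."""
    if r_um <= 0:
        raise ValueError("r must be positive")
    scale = biaxial_modulus_pa * delta_alpha_per_k * delta_t_k / 2.0
    return -scale * (tsv_radius_um / r_um) ** 2


DELTA_ALPHA = 17e-6 - 3e-6  # 1/K, copper minus silicon, values quoted in section 4.1
DELTA_T = 175.0             # K, as assumed in the text
ASSUMED_B = 1.0e11          # Pa, hypothetical placeholder, NOT a value from the text

near = radial_stress(ASSUMED_B, DELTA_ALPHA, DELTA_T, 1.5, 2.0)
far = radial_stress(ASSUMED_B, DELTA_ALPHA, DELTA_T, 1.5, 4.0)
print(f"|stress| ratio when r doubles: {abs(near) / abs(far):.2f}")
```

Doubling r reduces the stress magnitude by a factor of four, whatever value is chosen for B.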
(a) NMOS mobility variation (b) PMOS mobility variation
(c) Optimal placement for the best NMOS performance
(d) Optimal placement for the best PMOS performance
Figure 4.4: Optimal orientation of MOSFET to maximize the mobility for (001) surface, 〈110〉 channel.
The formula (4.2) provides an efficient way to calculate mobility varia-
tion due to σrr. As we observed in section 4.2, mobility change depends on not
only σrr but also orientation between applied force and a transistor channel.
The empirical value for showing the relation of mobility change and a channel
direction has been proposed in [36]. We extend the formula (4.2) to consider
stress and channel direction in (4.4).
\[ \frac{\Delta\mu}{\mu}(\theta) = -\Pi \times \sigma_{rr} \times \alpha(\theta), \qquad \theta = \tan^{-1} \left| \frac{Y_{TSV} - Y_{poly}}{X_{TSV} - X_{poly}} \right| \tag{4.4} \]
where α(θ) is an orientation factor as a function of θ, which is defined as
the angle between the center of the TSV and the center of a transistor channel
when the transistor is placed vertically as shown in Fig. 4.4(a),(b). Π is the
piezo-resistive coefficient at θ = 0, where the stress acts as longitudinal
stress.
In Fig. 4.4(a), if the NMOS is on the right side of the TSV, θ becomes zero
and α(0) becomes one, which maximizes the NMOS mobility enhancement. However,
if the NMOS is on the upper side of the TSV, α(π/2) is 0.5, which means that
the NMOS mobility increase is half of the enhancement at θ=0. PMOS shows the
opposite trend: it has the best mobility enhancement at θ=π/2, and if θ is
zero, the PMOS becomes slower than in the case of no stress. Fig. 4.4(c) and
(d) show the transistor direction for the best performance. Even though mixed
channel directions are not allowed due to the patterning difficulty, the
observation provides a way to optimize layout for 3D-ICs.
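The angle and the orientation dependent mobility change in (4.4) can be sketched in Python as follows. The function names are hypothetical. The orientation factor is passed in as a callable because the empirical curve comes from [36]; the linear NMOS placeholder below only matches the two end points stated above (α(0)=1 and α(π/2)=0.5) and is an assumption, not the empirical curve.

```python
import math


def orientation_angle(x_tsv, y_tsv, x_poly, y_poly):
    """theta from (4.4) in radians, always within [0, pi/2]."""
    dx = abs(x_tsv - x_poly)
    dy = abs(y_tsv - y_poly)
    if dx == 0 and dy == 0:
        raise ValueError("transistor channel coincides with the TSV center")
    return math.atan2(dy, dx)


def nmos_alpha_placeholder(theta):
    """Illustrative stand-in: linear between alpha(0)=1 and alpha(pi/2)=0.5."""
    return 1.0 - 0.5 * theta / (math.pi / 2)


def mobility_change(pi_coeff, sigma_rr, theta, alpha):
    """Relative mobility change from (4.4) for a given orientation factor."""
    return -pi_coeff * sigma_rr * alpha(theta)


theta_right = orientation_angle(0.0, 0.0, 3.0, 0.0)  # channel to the right of the TSV
theta_above = orientation_angle(0.0, 0.0, 0.0, 3.0)  # channel above the TSV
print(mobility_change(-3.16e-10, 100e6, theta_right, nmos_alpha_placeholder))
print(mobility_change(-3.16e-10, 100e6, theta_above, nmos_alpha_placeholder))
```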
We generate a stress contour map based on (4.3). Fig. 4.5 shows the contour
for a TSV having a radius of 1.5um. Since the region near a TSV may have a
crack or extremely high stress, we define the region within 0.5um from the TSV
edge as the Keep-Out-Zone (KOZ), in which no cell is allowed to be placed. We
can see stress of more than 200MPa outside the KOZ. Approximately 100MPa of
stress appears in the region 1um from the KOZ edge.
Figure 4.5: Stress contour map for a single TSV with 0.5um KOZ.
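The keep-out-zone rule can be checked with a few lines of Python. This is a simplified sketch under the assumption that a cell is represented by a single point; the function name is hypothetical, and the defaults are the 1.5um radius and 0.5um KOZ used in Fig. 4.5.

```python
import math


def violates_koz(cell_xy, tsv_xy, tsv_radius_um=1.5, koz_um=0.5):
    """True if the point lies inside the TSV or within the KOZ around its edge."""
    distance = math.hypot(cell_xy[0] - tsv_xy[0], cell_xy[1] - tsv_xy[1])
    return distance < tsv_radius_um + koz_um


print(violates_koz((1.9, 0.0), (0.0, 0.0)))  # 1.9um from the center, inside the KOZ
print(violates_koz((2.1, 0.0), (0.0, 0.0)))  # 2.1um from the center, outside the KOZ
```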
Fig. 4.6(a) shows a contour map for hole mobility variation. From the
contour, we can see that hole mobility decreases in the horizontal direction,
while it increases in the vertical direction. The 45° direction has no hole
mobility change. The contour map for electron mobility variation is presented
in Fig. 4.6(b). As we see in Fig. 4.4(a), the horizontal direction has a
larger mobility enhancement zone.
(a) Contour map for hole mobility variation
(b) Contour map for electron mobility variation
Figure 4.6: Mobility contour map for a TSV.
4.4.2 Mobility Variation for Multiple TSVs
Since we use many TSVs for signaling, power/ground and clock network, we need
to consider the stress effect of multiple TSVs. Each TSV works as a stress
source to silicon. When a position in a wafer is strained by multiple stress
sources, linear super-position can provide the multiple stress solution [54].
We propose the following mobility variation model for multiple TSVs.
\[ \left( \frac{\Delta\mu}{\mu} \right)_{total} = \sum \frac{\Delta\mu}{\mu}(\theta) = -\Pi \sum_{i \in TSVs} \left( \sigma_i \times \alpha(\theta_i) \right) \tag{4.5} \]
where σi is the tensile stress caused by the ith TSV, and α(θi) is the
orientation factor of the ith TSV. θi is the angle between the center of the
ith TSV and the point at which we want to get the mobility variation.
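The super-position in (4.5) amounts to a single loop over the TSVs. The following Python sketch is illustrative only: the function name is hypothetical, and both the stress model `stress_at` and the orientation factor `alpha` are supplied by the caller, since their concrete forms come from (4.3) and [36].

```python
import math


def total_mobility_change(point, tsv_centers, pi_coeff, stress_at, alpha):
    """Evaluate (4.5): -Pi * sum over TSVs of sigma_i * alpha(theta_i).

    stress_at(distance) returns the stress of one TSV at that distance from
    its center, and alpha(theta) returns the orientation factor.
    """
    px, py = point
    total = 0.0
    for tx, ty in tsv_centers:
        dx, dy = abs(tx - px), abs(ty - py)
        distance = math.hypot(dx, dy)
        if distance == 0:
            raise ValueError("point coincides with a TSV center")
        theta = math.atan2(dy, dx)
        total += stress_at(distance) * alpha(theta)
    return -pi_coeff * total


# Toy check with a constant 100MPa stress per TSV and alpha fixed to one.
change = total_mobility_change(
    point=(0.0, 0.0),
    tsv_centers=[(5.0, 0.0), (-5.0, 0.0)],
    pi_coeff=-3.16e-10,
    stress_at=lambda distance: 100e6,
    alpha=lambda theta: 1.0,
)
print(f"{change * 100:.2f}%")
```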
(a) σtotal (b) (∆µ/µ)h (c) (∆µ/µ)e
Figure 4.7: Linear super-position of TSV stress.
Fig. 4.7 shows stress and mobility variation contours with linear
super-position for a four-TSV array. We can see more stress in the region
between TSVs. The (∆µ/µ)e contour has a similar trend to the stress contour.
However, (∆µ/µ)h has less variation between TSVs. In Fig. 4.8(a) and (b), we
compare (∆µ/µ)h for two different TSV placement schemes having the same TSV
density. Since zigzag TSV placement has a compensation effect between positive
and negative hole mobility changes in adjacent rows, Fig. 4.8(a) has a larger
stress free zone than Fig. 4.8(b) even if the mobility degradation effect
within a row remains the same. Fig. 4.8(c) and (d) show the electron mobility
contours for zigzag and regular TSV placement, respectively. They do not have
a compensation effect. From Fig. 4.8, we can see that zigzag TSV placement is
preferred for less PMOS variation, while regular TSV placement is preferred
for a larger hole mobility enhancement zone.
(a) Zigzag TSV placement (b) Regular TSV placement
(c) Zigzag TSV placement (d) Regular TSV placement
Figure 4.8: Zigzag TSV placement has less (∆µ/µ)h between rows due to compensation.
4.5 Timing Analysis with TSV Stress Consideration
In this section, we explain how to incorporate the mobility variation
into cell level STA flow.
4.5.1 Timing Analysis for 3D-ICs
Figure 4.9: Timing corner determination according to mobility variation.
Even though topology of a cell is the same, its timing characteristic will
be changed. Fig. 4.9 shows the example that cells having the same topology
and size can be in different timing corners systematically determined by TSVs.
When two TSVs are near three inverters, cell characteristics are different in
a different position. From the formula (4.5), we can determine ∆µ/µ in any
point for a given layout. After mobility calculation, our framework renames
cells to include the mobility variation in the Verilog netlist. For example, I2
is renamed to INVX1 N8 P8, which means -8% hole mobility and +8% electron
mobility in Fig. 4.9.
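The renaming step can be illustrated with a short Python sketch. The helper names are hypothetical, and the underscore separator is an assumption made so that the result is a single netlist identifier; the tag order (hole mobility first, then electron mobility, with N for negative and P for non-negative values) follows the example above, and values are snapped to the mobility step of the library.

```python
def mobility_tag(percent, step=2):
    """Snap a mobility change to the library step and encode its sign."""
    snapped = int(round(percent / step)) * step
    prefix = "N" if snapped < 0 else "P"
    return f"{prefix}{abs(snapped)}"


def stress_aware_cell_name(base_cell, hole_pct, electron_pct, step=2, sep="_"):
    """Build a cell name that carries the hole and electron mobility corner."""
    tags = [mobility_tag(hole_pct, step), mobility_tag(electron_pct, step)]
    return sep.join([base_cell] + tags)


print(stress_aware_cell_name("INVX1", -8, 8))
print(stress_aware_cell_name("INVX1", -7.3, 8.4))
```

Both calls map to the same library corner, since -7.3% and +8.4% snap to -8% and +8% with a 2% step.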
We prepare a Verilog netlist and a parasitic extraction file (SPEF) per
die. In addition, we make a top level Verilog netlist that instantiates the dies
and connects them using wires which correspond to TSV connections. Then
we make a top level SPEF file for the TSV connections. With a proper timing
constraints file, we can run PrimeTime and get the 3D STA results.
4.5.2 Timing Library for Mobility Variation
Figure 4.10: Timing corner with TSV stress.
To consider the systematic variation during timing analysis, we characterize
a cell with different mobility corners as shown in Fig. 4.10. Hole mobility
variation is from -14% to +8%, and electron mobility variation is up to +8% to
cover the stress caused by TSVs in Fig. 4.9. I1 in Fig. 4.9 is matched to the
corner near the FF corner, while I3 is in the FS corner. With the mobility
variation aware library and the Verilog netlist having renamed cells, we can
run PrimeTime to do timing analysis with TSV stress.
(a) Rising delay dependency on (∆µ/µ)h
(b) Falling delay dependency on (∆µ/µ)e
Figure 4.11: Inverter delay variation with different (∆µ/µ)h and (∆µ/µ)e.

Table 4.1: TSV specification
Width   Landing pad  KOZ    Height  Dielectric  Resistance  Capacitance
4.14um  4.54um       0.4um  20um    0.2um       0.1Ω        70fF

To cover mobility variation caused by multiple TSVs, we need to extend the
mobility variation range (-22% ≤ (∆µ/µ)h ≤ +10%, 0% ≤ (∆µ/µ)e ≤ +24%). If the
mobility step is 2%, we need to characterize 221 (=17×13) libraries with
different mobility values, which is not practical. However, we can observe from
Fig. 4.11 that rising delay variation only depends on (∆µ/µ)h and falling
delay variation only depends on (∆µ/µ)e. When we simulate inverter rising delay
with mobility variation, electron mobility variation does not affect the delay.
Similarly, we can see that falling delay only depends on electron mobility
variation. In addition, from Fig. 4.11, we can see that hole mobility
variation can cause more than 20% PMOS performance variation depending on
device technology, and electron mobility variation can enhance NMOS
performance up to 7.5%. We use an inverter in the NCSU library and the PTM
spice model [101] to get the tables in Fig. 4.11. Therefore, we can fix
(∆µ/µ)e when we sweep (∆µ/µ)h, and 30 (=17+13) library characterizations will
be enough to cover the entire mobility set. If the mobility step is 4%, a set
of 16 (=9+7) libraries is required. Since delay variation has a semi-linear
dependency on mobility variation, we can use interpolation for a mobility value
between two libraries.
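The library counts and the interpolation step can be checked with a small Python sketch. The function names are hypothetical, the ranges are integer percentages taken from the text, and the two delay values used for interpolation are made-up numbers for illustration only.

```python
def corner_count(lo_pct, hi_pct, step_pct):
    """Number of library corners needed to sweep one mobility range."""
    return (hi_pct - lo_pct) // step_pct + 1


def interpolate_delay(mobility_pct, lib_lo, lib_hi):
    """Linear interpolation between two (mobility %, delay) library points."""
    (m0, d0), (m1, d1) = lib_lo, lib_hi
    return d0 + (d1 - d0) * (mobility_pct - m0) / (m1 - m0)


for step in (2, 4):
    holes = corner_count(-22, 10, step)
    electrons = corner_count(0, 24, step)
    print(f"step {step}%: full grid {holes * electrons}, independent sweeps {holes + electrons}")

# Hypothetical rising delays (ns) characterized at -8% and -4% hole mobility.
print(interpolate_delay(-6, (-8, 1.00), (-4, 0.90)))
```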
4.6 Experimental Results
We implement TSV stress aware 3-D timing analysis flow in C++. We
test on a 3.0GHz Linux machine with 4G RAM to verify our implementation.
We generate the mobility aware library based on NCSU 45nm cell library with
2% mobility step. The TSV used in this experiment is specified in Table 4.1.
First, we show the efficiency of our compact stress and mobility mod-
eling. When we want to get ∆µ/µ at any point on a die, we can get the value
promptly. Even though we generate mobility contour in Fig. 4.12 (Die size:
1.752mm2, #TSVs: 462), it takes only 14.9s. The proposed timing analy-
sis with compact process/device model is fast enough to be used for iterative
optimization purpose. Fig. 4.12(a) shows an observation for layout optimiza-
tion that the leftmost and rightmost sides have more hole mobility enhanced
zone than the middle area because the region has less mobility degradation by
horizontally placed neighboring TSVs.
(a) (∆µ/µ)h (b) (∆µ/µ)e
Figure 4.12: ∆µ/µ contour map for 22 x 21 TSV array.
Second, we compare the stress aware timing result with the no stress case.
Ten benchmark circuits are used to show the timing variation in Table 4.2. The
benchmark circuits are placed for wire length minimization [46] without TSV
stress consideration. We assume a stack of four dies, and the number of
inserted TSVs is 10% of #cells in each circuit. When we consider the TSV
stress effect, the longest path delay of the benchmarks varies from -1.32% to
3.58%. Some benchmarks have timing gain while some benchmarks have timing
penalty. If we consider the TSV stress effect during cell and TSV placement, we
can expect performance improvement for every benchmark. TNS (total negative
slack) varies more, from -12.43% to 22.9%, which is bigger than the delay
variation. That motivates the need for TSV stress aware layout optimization.
Table 4.2: Longest path delay and TNS comparison
                   Without TSV stress  With TSV stress     Difference
Circuit    #Cells    Longest      TNS    Longest      TNS  Longest      TNS
                   Delay(ns)     (ns)  Delay(ns)     (ns)    Delay
IDCT       14,864      12.07  -21,293      11.91  -19,652   -1.32%   -7.71%
8051       15,712       4.78   -7,868       4.94   -7,956    3.32%    1.12%
8086       19,895       9.56   -8,557       9.56   -9,045    0.00%    5.71%
MAC2       29,706       7.72  -17,561       7.72  -17,619    0.07%    0.33%
ETHERNET   77,234      18.30     -476      18.95     -482    3.58%    1.24%
RISC       88,401       8.28   -1,249       8.34   -1,535    0.74%   22.90%
B18       103,711      11.28   -2,082      11.25   -1,823   -0.27%  -12.43%
DES PERT  109,181       8.61   -2,801       8.64   -2,575    0.25%   -8.06%
VGA LCD   126,379       8.01     -543       8.14     -538    1.56%   -1.02%
B19       168,943      13.01   -5,539      12.98   -4,974   -0.20%  -10.20%
Last, we manually optimize the critical path in 8051 to present the potential
benefit of TSV stress aware layout optimization. Before optimization, the path
delay is 4.94ns with stress aware timing analysis. However, we could reduce
the delay to 4.62ns with small layout perturbation, which is a 6.5%
improvement. It is even less than the path delay without stress, which is
4.78ns in Table 4.2. Table 4.3 shows the gates on the path. We can see the cell
renaming according to the mobility variation. We adjust each cell location
with small perturbation so that each cell has timing gain. The maximum timing
gain in a cell is a 14% improvement. Fig. 4.13 shows how cell relocation works
for timing optimization. We capture the placement result on die2 with mobility
variation contours. The cells in logic depth 2, 4, 8 and 9 are hole mobility
critical cells because the timing arc is rising on the path. Therefore, we
perturb the cells to be placed close to the green area of the hole mobility
contour. However, the cells in logic depth 3 and 7 are electron mobility
critical. Therefore, we push the cells to have more mobility enhancement in
Fig. 4.13(c) and (d).

Table 4.3: Gate optimizations on the target path with perturbation
Logic  Original         Optimized        Timing  Original   Optimized  Reduction
Depth  Gate             Gate             Arc     Delay(ns)  Delay(ns)  Ratio
       DFFPOSX1         DFFPOSX1         fall    0.337      0.334      -0.6%
1      NOR3X1 N2 P14    NOR3X1 P4 P14    rise    0.800      0.767      -4.1%
2      AND2X1 N12 P12   AND2X1 P0 P12    rise    0.539      0.492      -8.7%
3      INVX1 N6 P12     INVX1 N6 P16     fall    0.207      0.191      -7.9%
4      INVX1 N12 P12    INVX1 P2 P12     rise    0.653      0.585      -10.4%
5      AND2X1 N16 P16   AND2X1 N4 P14    rise    0.576      0.535      -7.2%
6      BUFX2 P6 P12     BUFX2 P6 P12     rise    0.245      0.216      -11.8%
7      AOI22X1 P4 P10   AOI22X1 P4 P14   fall    0.159      0.148      -7.3%
8      INVX1 P0 P10     INVX1 P2 P12     rise    0.107      0.105      -1.5%
9      OR2X1 N4 P10     OR2X1 P2 P8      rise    0.490      0.468      -4.3%
10     OR2X2 N16 P18    OR2X2 N2 P12     rise    0.068      0.059      -13.3%
11     NOR3X1 P0 P14    NOR3X1 P0 P16    fall    0.100      0.089      -11.6%
12     NAND3X1 N4 P14   NAND3X1 P2 P12   rise    0.055      0.051      -7.3%
13     BUFX2 N4 P14     BUFX2 P4 P12     rise    0.157      0.149      -4.8%
14     OR2X2 P0 P8      OR2X2 P2 P8      rise    0.170      0.169      -1.0%
15     AOI22X1 N16 P16  AOI22X1 N16 P16  fall    0.076      0.075      -1.7%
16     OAI21X1 N4 P14   OAI21X1 P4 P12   rise    0.072      0.069      -4.9%
17     NOR3X1 P2 P14    NOR3X1 P2 P16    fall    0.035      0.034      -2.3%
18     AOI21X1 N18 P18  AOI21X1 P4 P12   rise    0.047      0.040      -14.0%
19     INVX1 N16 P16    INVX1 N16 P18    fall    0.027      0.024      -9.6%
20     OAI21X1 P6 P14   OAI21X1 P6 P14   rise    0.017      0.017      1.6%
Path Delay                                       4.93719    4.61777    -6.5%
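The path level numbers can be cross-checked against the per-gate delays of Table 4.3 with a short Python sketch (the function name is hypothetical). Because the per-gate delays are rounded to three decimals, their sums agree with the reported path delays only to about 1ps.

```python
# Per-gate delays (ns) from Table 4.3, launch flip-flop first.
ORIGINAL = [0.337, 0.800, 0.539, 0.207, 0.653, 0.576, 0.245, 0.159, 0.107,
            0.490, 0.068, 0.100, 0.055, 0.157, 0.170, 0.076, 0.072, 0.035,
            0.047, 0.027, 0.017]
OPTIMIZED = [0.334, 0.767, 0.492, 0.191, 0.585, 0.535, 0.216, 0.148, 0.105,
             0.468, 0.059, 0.089, 0.051, 0.149, 0.169, 0.075, 0.069, 0.034,
             0.040, 0.024, 0.017]


def reduction_pct(before, after):
    """Relative change of a delay in percent; negative means faster."""
    return (after - before) / before * 100.0


print(f"sum of original gate delays:  {sum(ORIGINAL):.3f} ns")
print(f"sum of optimized gate delays: {sum(OPTIMIZED):.3f} ns")
print(f"path delay change: {reduction_pct(4.93719, 4.61777):.1f}%")
```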
4.7 Discussions
The 3D IC stacking requires TSV for interconnection between wafers.
Cu TSV causes thermal stress which can lead to significant timing variations.
Stress, though commonly believed to have negative impact on timing can ac-
tually be taken advantage of for timing optimization, since it is a strongly
layout dependent, systematic effect. In this chapter, we develop the first-order
compact model for mobility variation and propose a design methodology to
analyze the systematic variation and optimize layout by locating critical cells
in a mobility enhanced region. We observe up to 24.9% delay variation for
PMOS and 7.5% delay variation for NMOS, which is not considered in exist-
ing timing analysis flow. Our TSV stress-aware timing analysis framework for
3D-IC also opens the opportunity for many stress-aware layout optimizations,
such as placement and TSV optimizations.
(a) Hole mobility contour with original cell placement
(b) Hole mobility contour after cell perturbation
(c) Electron mobility contour with original cell placement
(d) Electron mobility contour after cell perturbation
Figure 4.13: Cell perturbation to take advantage of mobility variation.
Chapter 5
Robust Clock Tree Synthesis with Timing
Yield Optimization for 3D-ICs
5.1 Introduction
As we discussed in the previous chapter, 3D integration with TSVs has become a
main focus for future System On Chip (SOC) integration. However, we need to
resolve several design challenges for robust 3D integration. Since TSV induced
stress is a source of systematic variation, we can make use of the stress
during chip design. In this chapter, we present design optimization techniques
for TSV-based 3D-ICs using the stress model presented in the previous chapter,
especially in the clock network design [71, 72].
There have been several works on CTS in 3D-ICs. BURITO [59] ad-
dresses buffered clock tree in two stacked dies, and the work in [47] clarifies
the whole flow for the 3D CTS in N-stacked dies without buffer insertion.
The paper in [102] proposed pre-bond testable CTS methods. However, the
previous works have not considered new design challenges for 3D-IC such as
inter-die variation and TSV induced stress.
Process variation can be decomposed into three components [52]: wafer-
to-wafer (inter-die) variation, intra-die variation and random variation. The
main challenge of 3D design comes from integration of tiers in different timing
corners, which means that cells along a path can have totally different char-
acteristics on variation. In addition, cells in different tiers lose their spatial
correlation. In other words, only cells placed in the same tier are spatially
correlated in process variation [89]. The paper in [25] proposed how to select
tiers for 3D integration based on the pre-bond measurement data in order to
maximize parametric yield. In this chapter, we propose more aggressive clock
network design to take advantage of timing corner mismatch. After all the cells
and signal TSVs are placed, we can adjust clock buffer z-location to minimize
sum of covariance for better timing yield. We propose an ILP formulation to
determine clock buffer z-location for optimization of near critical paths.
Another design for manufacturing (DFM) challenge of 3D-ICs comes
from difference of Coefficients of Thermal Expansion (CTE) [19, 54, 92]. Be-
cause CTE of copper is larger than the value of silicon, tensile stress appears
on silicon near TSVs after cooling down to room temperature. The stress can
change clock buffer driving capability due to mobility variation. Since PMOS
is more sensitive to silicon stress [92], the stress has more impact on rising
delay, which means that a clocking scheme using positive edge triggered
flip-flops is more susceptible to TSV induced stress. In this chapter, we
propose a buffer delay model for the stress and stress aware clock network
design.
Initially, we generate an abstract tree. Since the abstract tree does not
provide where clock buffers are inserted, we cannot determine z-location of
clock buffer before clock buffer insertions are determined. To break this cycle,
we use a bottom-up tree construction approach from sink to source, iteratively.
At leaf nodes, we identify if buffer insertion is required, then, determine z-
location with our ILP formulation which works for minimizing clock period
variation. We fix z-location of buffers determined already at the previous steps.
In the next level, we find buffer insertion points and determine z-location of
buffers for nodes. Iteratively, z-locations of clock buffers are determined to
optimize timing yield until the clock source is reached. Meanwhile, buffer
variation due to the stress is calculated and considered.
The rest of this chapter is organized as follows. We present related
work and motivation in section 5.2. We propose our robust clock tree
construction in section 5.3. Experimental results are shown in section 5.4,
and we conclude with discussions in section 5.5.
5.2 Related Work and Motivation
The clock period (CP) under process variation is determined at the 3σ level by
equation 5.1.
\[ CP = \mu_{cp} + 3\sigma_{cp} \tag{5.1} \]
Here, the mean of the clock period is given by equation 5.2. TCtQ and Tsetup
are the clock-to-Q propagation delay and the setup time of a flip-flop, respectively.
Combinational logic delay is denoted by Tlogic. Tskew is the clock skew of the clock
network.
\[ \mu_{cp} = T_{CtQ} + T_{logic} + T_{setup} + T_{skew} \tag{5.2} \]
There are two ways to improve chip performance, and thereby enhance
timing yield, during CTS. First, we can minimize σcp. In this chapter,
we show that σcp reduction can be achieved during CTS for 3D-ICs. This is
a random variation reduction process. Second, we can minimize µcp.
We achieve this by reducing Tskew in 3D CTS through consideration of TSV
induced stress.
5.2.1 Clock Buffer Tier Assignment
Figure 5.1: Clock path p with clock buffers
Fig. 5.1 shows a clock path spanning two dies. We define clock
buffers connected to the flip-flops (F/Fs) at path inputs as type-A buffers.
Similarly, clock buffers connected to the F/Fs at path outputs are defined as
type-B buffers. In Fig. 5.1, buffer A is type-A, and buffer B is type-B. Let F1
and B be placed in die0, and L1 and F2 be placed in die1. If the z-location of clock buffer
A is flexible, we can assign clock buffer A to either die0 or die1. Intuitively,
we can assign A to die0 to avoid TSV insertion between A and F1. However,
we have to consider the covariance between A and other cells when determining
the z-location of the buffer.
For the path p in Fig. 5.1, µcp is determined by equation 5.3 if the
propagation delay from the clock source to buffer A is the same as the delay from
the clock source to buffer B. Here, E(F1) stands for the mean clock-to-Q delay of F1,
and E(F2) is the mean setup time of F2. The mean delay of each remaining cell
is denoted by E(cell).
\[ \mu^{(p)}_{cp} = E(A) + E(F1) + E(L1) + E(F2) - E(B) \tag{5.3} \]
The variance of CP for the path p is given by equation 5.4. The variance
of each cell delay is denoted by Var(cell). The variance of CP is the sum of the
variances of the individual cells and twice the signed covariance of every pair of
cells, where a pair involving buffer B enters with a minus sign because E(B) is
subtracted in equation 5.3. Since two cells in different dies lose
their correlation, their covariance terms become zero. After cell placement,
we can still choose the clock buffer z-location in order to minimize the sum of
covariances, which reduces σcp and enhances operating frequency and timing
yield in 3D-ICs.
\[
\begin{aligned}
\left(\sigma^{(p)}_{cp}\right)^{2} ={}& Var\left(CP^{(p)}\right) \\
={}& Var(A) + Var(F1) + Var(L1) + Var(F2) + Var(B) \\
&+ 2\,\{\, Cov(A,F1) + Cov(A,L1) + Cov(A,F2) - Cov(A,B) \\
&\quad + Cov(F1,L1) + Cov(F1,F2) - Cov(F1,B) \\
&\quad + Cov(L1,F2) - Cov(L1,B) - Cov(F2,B) \,\}
\end{aligned}
\tag{5.4}
\]
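In our example, let each cell have the same variance and each correlated pair
the same covariance, denoted by VAR and COV, respectively. If buffer A is placed
on die1, Cov(A,F1), Cov(A,B), Cov(F1,L1), Cov(F1,F2), Cov(L1,B) and
Cov(F2,B) become zero because these pairs lose their correlation. The remaining
terms are Cov(A,L1) + Cov(A,F2) + Cov(L1,F2) − Cov(F1,B) = 2COV, which
contributes 4COV after doubling. Similarly, when buffer A is assigned to die0,
the remaining terms Cov(A,F1) − Cov(A,B) − Cov(F1,B) + Cov(L1,F2) cancel,
as summarized in equation 5.5.
We can therefore minimize clock period variation by putting buffer A into die0.
\[
\sigma_{cp}^{2} =
\begin{cases}
5\,VAR & \text{if buffer A is on die0} \\
5\,VAR + 4\,COV & \text{if buffer A is on die1}
\end{cases}
\tag{5.5}
\]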
In this chapter, we propose an optimal buffer tier assignment to mini-
mize σcp for near critical paths during 3D CTS.
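The tier comparison for the path in Fig. 5.1 can be reproduced with a few lines of Python. This is only a minimal sketch under the same simplifying assumptions as the example (a single VAR per cell, a single COV per same-die pair, zero covariance across dies); the function name `path_variance` and the numeric values are hypothetical and not part of the original flow.

```python
from itertools import combinations

# Sign of each cell's delay in the clock period of path p (equation 5.3):
# launch-side terms add, the capture-side clock buffer B subtracts.
SIGN = {"A": +1, "F1": +1, "L1": +1, "F2": +1, "B": -1}

def path_variance(tier, var=1.0, cov=0.25):
    """Variance of CP for path p; cells on different dies are uncorrelated."""
    total = var * len(SIGN)
    for u, v in combinations(SIGN, 2):
        if tier[u] == tier[v]:
            # Each same-die pair contributes twice its signed covariance.
            total += 2 * SIGN[u] * SIGN[v] * cov
    return total

# Placement from the example: F1 and B on die0, L1 and F2 on die1.
fixed = {"F1": 0, "B": 0, "L1": 1, "F2": 1}
on_die0 = path_variance({**fixed, "A": 0})   # 5*VAR
on_die1 = path_variance({**fixed, "A": 1})   # 5*VAR + 4*COV
print(on_die0, on_die1)
```

With the illustrative values VAR = 1.0 and COV = 0.25, the two results differ by exactly 4COV, matching equation 5.5.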
5.2.2 Clock Buffer Variation due to TSV induced Stress
Strained silicon has been used to enhance Ion of a transistor [81]. How-
ever, there are several unwanted stress sources, which should be considered
during the design phase.
Figure 5.2: Thermal stress around TSV.
In 3D-IC manufacturing, unwanted stress is caused by the CTE mismatch
between the copper TSV and silicon, as shown in Fig. 5.2. Investigations [75] show
that at 200°C an anneal time of 30-60 minutes is required in order to achieve
reasonable copper layer properties. Since the CTE of copper is larger than that of
silicon, copper contracts more than the surrounding silicon after annealing. Several
papers have simulated TSV induced stress [19, 54] using finite
element analysis (FEA). They show that a TSV can cause tensile
stress of more than 200MPa.
\[ \Delta Mobility = -\Pi \times TSV_{stress} \tag{5.6} \]
The change in mobility (µ) as a function of applied stress is described
by the piezo-resistance model [77] in equation 5.6, where Π is the tensor of
piezo-resistive coefficients for holes and electrons, and TSVstress is the applied
stress in silicon due to TSVs.
(a) Slower rising delay with longitudinal tensile stress
(b) Faster rising delay with transverse tensile stress
Figure 5.3: Buffer delay variation due to TSV stress.
Longitudinal tensile stress degrades PMOS performance as shown in
Fig. 5.3(a) [36]. With longitudinal stress, the piezo-resistive coefficient for holes
is 7.18 × 10^-10 Pa^-1 for a (001) wafer surface and 〈110〉 channel, which is the
most common configuration in semiconductor manufacturing [77, 79]. For example,
when TSV stress is 200MPa, ∆Mobility is -14.36% for PMOS.
However, if the TSV is placed perpendicular to the transistor channel as in
Fig. 5.3(b), hole mobility is enhanced because the stress adds more space in the
silicon lattice for carriers to move. For transverse stress, the piezo-resistive
coefficient for holes is -6.63 × 10^-10 Pa^-1 for the (001) surface and 〈110〉 channel.
Similarly, we can expect ∆Mobility = +13.26% with TSVstress = 200MPa. Empirically,
the driving current variation for PMOS is known to be 0.5 to 0.9 times ∆Mobility [56, 84].
Therefore, systematic clock buffer variation should be considered for robust
clock tree construction.
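The two worked numbers above follow directly from equation 5.6. The short Python sketch below repeats the arithmetic; the function and constant names are hypothetical, and the drive-current range simply applies the empirical 0.5 to 0.9 factor quoted in the text.

```python
def mobility_change(pi_coeff_per_pa, stress_pa):
    """Relative mobility change from the piezo-resistance model (equation 5.6)."""
    return -pi_coeff_per_pa * stress_pa

# Hole piezo-resistive coefficients quoted in the text, (001) surface, <110> channel.
PI_HOLE_LONGITUDINAL = 7.18e-10   # 1/Pa
PI_HOLE_TRANSVERSE = -6.63e-10    # 1/Pa
TSV_STRESS = 200e6                # 200 MPa expressed in Pa

longitudinal = mobility_change(PI_HOLE_LONGITUDINAL, TSV_STRESS)
transverse = mobility_change(PI_HOLE_TRANSVERSE, TSV_STRESS)

# Empirical PMOS drive-current change: 0.5 to 0.9 times the mobility change.
drive_current_range = (0.5 * longitudinal, 0.9 * longitudinal)

print(f"longitudinal: {longitudinal:+.2%}, transverse: {transverse:+.2%}")
```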
5.3 Robust Clock Tree Design
Figure 5.4: Overall proposed 3D CTS flow.
In Fig. 5.4, we propose a 3D CTS flow to deal with the new challenges presented
in section 5.2. The first step is to generate an initial abstract tree with
minimum wire-length using the 3D-MMM algorithm [59]. The 3D-MMM algorithm
constructs a 3D abstract tree, deciding the z-location of merging points in
a recursive top-down manner. We assign the clock TSVs under a given TSV
upper bound, and determine the hierarchical connection among the clock sinks,
internal nodes and clock TSVs. The abstract tree has only merging point and
child node information. In other words, after abstract tree generation, we
do not know where clock buffers are inserted, so we cannot decide the
z-location of a clock buffer. However, to determine buffer insertion, we need
to know the TSV insertion points and buffer z-locations in order to calculate
downstream capacitance. To break this cyclic dependency, we propose a depth-by-depth
buffered clock tree construction approach from sink to source, as illustrated in Fig. 5.5.
First, as shown in Fig. 5.5(a), we identify buffer insertion points where
the downstream capacitance exceeds the allowed maximum capacitance.
Then, in Fig. 5.5(b), we determine the z-location of the buffers in order to minimize
the covariance terms with the ILP formulation in section 5.3.1. After buffer
z-location determination, we adjust the z-location of the merging point of
the up-stream tree in order to minimize TSV insertion, as in Fig. 5.5(c). At the
next level of the abstract tree, the same procedure is repeated, as in Fig. 5.5(d).
Once the z-location is determined for a clock buffer, we determine the x and y
location of the buffer. After that, buffer variation due to TSV stress is calculated
and the wire-length is adjusted to eliminate skew, as described in section 5.3.2.
(a) Identification of buffer insertion points
(b) Buffer tier assignment
(c) Updating tier information for merging points (MPs)
(d) Repeating the procedures at the next level
Figure 5.5: An illustration of the buffer tier assignment procedure in a bottom-up manner. Note that MP1's location is changed at step (c) to minimize #TSVs.
5.3.1 σCP Minimization for Critical Paths
From the observation in section 5.2.1, our goal is to minimize sum of
covariance by assigning clock buffer z-location optimally.
\[
\text{Minimize} \quad \sum_{i=1}^{M-1} \sum_{j=i+1}^{M} \alpha_{i,j} \cdot Cov_{i,j}
\tag{5.7}
\]
Our problem is defined in formulation 5.7. Every pair of two cells in
a clock path has a covariance value denoted by αi,j. M is the number of
instances including clock buffers, flip-flops and logic gates in a clock path.
\[
\begin{aligned}
& Cov_{i,j} = D_{i,0}D_{j,0} + D_{i,1}D_{j,1} + \cdots + D_{i,N-1}D_{j,N-1} \\
& \text{where } D \in \{0, 1\}, \\
& D_{i,0} + D_{i,1} + \cdots + D_{i,N-1} = 1, \\
& D_{j,0} + D_{j,1} + \cdots + D_{j,N-1} = 1
\end{aligned}
\tag{5.8}
\]
Covi,j in the boolean equation 5.8 indicates whether the two cells are correlated.
If the z-location of cell i is the same as that of cell j, Covi,j becomes one.
Otherwise, Covi,j becomes zero, which means that there is no spatial correlation
between the two cells. Di,n is a binary variable used to indicate the z-location of cell
i. For example, if Di,0 is one, cell i is placed on die0. N is
the number of tiers to be stacked for 3D integration.
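\[
\begin{aligned}
\text{Minimize} \quad & \sum_{i=1}^{M-1} \sum_{j=i+1}^{M} \alpha_{i,j} \sum_{k=0}^{N-1} Y_{i,j,k} \\
\text{Subject to} \quad & D_{i,0} + D_{i,1} + \cdots + D_{i,N-1} = 1 \\
& D_{j,0} + D_{j,1} + \cdots + D_{j,N-1} = 1 \\
& D_{i,k} + D_{j,k} - Y_{i,j,k} \le 1 \\
& D_{i,k} + D_{j,k} - Y_{i,j,k} \ge 0 \\
& D_{i,k} - D_{j,k} - Y_{i,j,k} \ge -1 \\
& -D_{i,k} + D_{j,k} - Y_{i,j,k} \ge -1
\end{aligned}
\tag{5.9}
\]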
By combining formulations 5.7 and 5.8, we obtain the ILP formulation 5.9,
which minimizes covariance for the most critical path.
Yi,j,k are auxiliary binary variables introduced to convert the AND operation
(Di,kDj,k) into linear ILP constraints. If the z-locations of two cells are already
determined during 3D placement, we can skip the pair in formulation 5.9 and save
runtime for solving the ILP formulation.
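To see why the four inequalities in formulation 5.9 reproduce the AND operation, one can enumerate all binary combinations. The Python check below is a minimal illustration written for this purpose (the helper name `satisfies` is hypothetical); it lists, for each pair (Di,k, Dj,k), the values of Yi,j,k that satisfy every constraint.

```python
from itertools import product

def satisfies(d_i, d_j, y):
    """The four linear constraints of formulation 5.9 for one (i, j, k)."""
    return (d_i + d_j - y <= 1 and
            d_i + d_j - y >= 0 and
            d_i - d_j - y >= -1 and
            -d_i + d_j - y >= -1)

# For every binary (D_i, D_j), collect the feasible values of Y.
feasible = {(d_i, d_j): [y for y in (0, 1) if satisfies(d_i, d_j, y)]
            for d_i, d_j in product((0, 1), repeat=2)}

for (d_i, d_j), ys in feasible.items():
    print(d_i, d_j, ys)
```

In every case exactly one value of Y is feasible, and it equals the product Di,k · Dj,k, so the linear constraints encode the AND without any nonlinear term.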
Clock buffers can be connected to multiple clock sinks. If a buffer
z-location determined by one path differs from z-location determined from
another clock path, there will be conflicts of optimization procedure.
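\[
\begin{aligned}
& D_{i,0} + D_{i,1} + \cdots + D_{i,N-1} = 1 && \text{(no restriction)} \\
\Rightarrow \; & D_{i,t-1} + D_{i,t} + D_{i,t+1} = 1 && \text{(with restriction)}
\end{aligned}
\tag{5.10}
\]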
In addition, we need to prevent insertion of multiple TSVs between
consecutive clock buffers. For example, a parent buffer can be assigned to
die3 when a child buffer is already fixed to die1. In that case, two TSVs
are required. To avoid the hopping problem, we restrict z-location for parent
buffer i from t − 1 to t + 1 when pre-determined child buffer is on die t as
shown in formulation 5.10.
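The restriction in formulation 5.10 amounts to limiting the candidate tiers of a parent buffer to the child's tier and its immediate neighbors. A small Python sketch of that candidate set follows; the function name is hypothetical, and clipping at the bottom and top of the stack is an assumption that the formulation does not state explicitly.

```python
def allowed_parent_tiers(child_tier, num_tiers):
    """Candidate tiers for a parent buffer whose child is fixed on child_tier.

    Only tiers t-1, t and t+1 are kept, so at most one TSV is needed between
    consecutive clock buffers. Tiers outside the stack are dropped (assumption).
    """
    candidates = (child_tier - 1, child_tier, child_tier + 1)
    return [t for t in candidates if 0 <= t < num_tiers]

print(allowed_parent_tiers(1, 4))  # child on die1 in a 4-tier stack
```

For the example in the text, a child buffer fixed on die1 no longer allows the parent on die3, which is the assignment that would have required two TSVs.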
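\[
\begin{aligned}
\text{Minimize} \quad & Z_1 + Z_2 + \cdots + Z_p + \cdots + Z_L \\
\text{Subject to} \quad & \sum_{i=1}^{M'-1} \sum_{j=i+1}^{M'} \alpha_{i,j} \sum_{k=0}^{N-1} Y_{i,j,k,p} = Z_p \\
& D_{i,t-1,p} + D_{i,t,p} + D_{i,t+1,p} = 1 \\
& D_{j,t'-1,p} + D_{j,t',p} + D_{j,t'+1,p} = 1 \\
& D_{i,k,p} + D_{j,k,p} - Y_{i,j,k,p} \le 1 \\
& D_{i,k,p} + D_{j,k,p} - Y_{i,j,k,p} \ge 0 \\
& D_{i,k,p} - D_{j,k,p} - Y_{i,j,k,p} \ge -1 \\
& -D_{i,k,p} + D_{j,k,p} - Y_{i,j,k,p} \ge -1
\end{aligned}
\tag{5.11}
\]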
We extend the ILP formulation to optimize multiple critical paths in
formulation 5.11. L is the number of paths targeted by our optimization
problem. M′ is the number of instances including clock buffers, flip-flops and
logic gates in clock path p. t and t′ are the child node z-locations for clock buffers i
and j, respectively. The formulation aims to minimize delay variation for the
selected critical paths.
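For very small instances, the objective of formulation 5.11 can be evaluated by exhaustive search, which makes the role of the αi,j weights easy to inspect. The Python sketch below is such a stand-in, not the ILP solved in this work: it enumerates the candidate tiers of the free buffers and sums, over all targeted paths, the αi,j of every pair that lands on the same tier. All names and the toy weights (α = ±2 with COV = 1, taken from the path of Fig. 5.1) are illustrative assumptions.

```python
from itertools import product

def same_tier_cost(assign, alpha):
    """Sum of alpha over the cell pairs of one path that share a tier."""
    return sum(w for (i, j), w in alpha.items() if assign[i] == assign[j])

def best_assignment(fixed, candidates, paths):
    """Exhaustive stand-in for the ILP: try every tier choice of the free buffers.

    fixed      -- cell -> tier for already placed cells
    candidates -- free buffer -> list of allowed tiers
    paths      -- list of {(cell_i, cell_j): alpha_ij}, one dict per critical path
    """
    names = list(candidates)
    best = None
    for tiers in product(*(candidates[n] for n in names)):
        choice = dict(zip(names, tiers))
        assign = {**fixed, **choice}
        cost = sum(same_tier_cost(assign, alpha) for alpha in paths)
        if best is None or cost < best[0]:
            best = (cost, choice)
    return best

# Path p of Fig. 5.1: pairs involving the type-B buffer carry a negative weight.
alpha_p = {("A", "F1"): 2, ("A", "L1"): 2, ("A", "F2"): 2, ("A", "B"): -2,
           ("F1", "L1"): 2, ("F1", "F2"): 2, ("F1", "B"): -2,
           ("L1", "F2"): 2, ("L1", "B"): -2, ("F2", "B"): -2}
fixed = {"F1": 0, "B": 0, "L1": 1, "F2": 1}
result = best_assignment(fixed, {"A": [0, 1]}, [alpha_p])
print(result)
```

The search grows exponentially with the number of free buffers, which is why a dedicated ILP solver is used for realistic path counts.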
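\[
\begin{aligned}
\alpha_{i,j} &= \pm 2 \left( Cov(i,j) \cdot \rho_{i,j} - \beta_{i,j} \right) \\
\rho_{i,j} &=
\begin{cases}
1 - \dfrac{x_{i,j}}{X_L} \left( 1 - \rho_{min} \right) & \text{if } x_{i,j} \le X_L \\
\rho_{min} & \text{if } x_{i,j} > X_L
\end{cases}
\end{aligned}
\tag{5.12}
\]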
We use the spatial correlation model in [28] to consider distance factor
of spatial correlation as shown in equation 5.12. Let covariance between two
cells i and j be Cov(i, j). We can characterize Cov(i, j) from Hspice measure-
ment. ρi,j is the distance factor to represent that spatial correlation reduces as
distance between two cells increases. xi,j means geometrical distance between
two cells. If xi,j is smaller than XL, ρi,j decreases as xi,j increases. When xi,j
reaches XL, ρi,j becomes ρmin.
(a) Too many TSV insertions
(b) Fewer TSV insertions with a less optimal solution
Figure 5.6: Necessity of #TSV control between two consecutive clock buffers.
The proposed formulation can insert many TSVs between clock buffers,
as shown in Fig. 5.6(a). In order to control the number of TSVs, we introduce
a new parameter βi,j in equation 5.12. By increasing βi,j, we decrease
αi,j and thereby raise the likelihood of assigning clock buffers i and j to the same
die, which reduces the number of inserted TSVs. We can explore the
βi,j value that minimizes clock period variance for a specific number of inserted
TSVs. αi,j has a minus sign only if one of the cells is a type-B clock buffer as defined in
section 5.2.1, because the variation of a type-B buffer can compensate the overall clock
period variation.
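Equation 5.12 translates almost literally into code. The Python sketch below is a hedged illustration only: the function names, the `involves_type_b` flag (our reading of the sign rule above), and the numeric values XL = 100, ρmin = 0.2 are assumptions chosen for the example, not parameters reported in this work.

```python
def distance_factor(x, x_l, rho_min):
    """Distance factor rho of equation 5.12: linear decay down to rho_min at X_L."""
    if x <= x_l:
        return 1.0 - (x / x_l) * (1.0 - rho_min)
    return rho_min

def alpha(cov, x, x_l, rho_min, beta=0.0, involves_type_b=False):
    """Pair weight alpha_ij of equation 5.12.

    The sign is negative when the pair involves a type-B buffer (assumed flag);
    beta lowers alpha so that placing both cells on one die is penalized less.
    """
    sign = -1.0 if involves_type_b else 1.0
    return sign * 2.0 * (cov * distance_factor(x, x_l, rho_min) - beta)

# Illustrative values only: X_L = 100 (distance units), rho_min = 0.2.
print(distance_factor(0.0, 100.0, 0.2))
print(distance_factor(150.0, 100.0, 0.2))
print(alpha(1.0, 50.0, 100.0, 0.2))
print(alpha(1.0, 50.0, 100.0, 0.2, beta=0.1))
```

Raising `beta` lowers the weight of a same-die pair in the objective, which is the mechanism used to trade covariance reduction against the number of TSVs.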
5.3.2 Buffer Variation Modeling under TSV induced Stress
Our stress induced variation modeling consists of three steps: 1) compact
stress modeling, 2) a piezo-resistive model to calculate ∆Mobility, and 3) buffer
characterization by sweeping hole and electron mobility. Since FEA simulation
takes several hours even for a single TSV, we use the analytical
compact model in [54] and linear superposition for multiple TSVs [92]
as a practical alternative. Then, we convert the stress to mobility variation with
the piezo-resistive model in equation 5.6. Since mobility variation due to stress
depends not only on the strength of the applied stress but also on the orientation
between the TSV and the transistor channel [36], we use the modified piezo-resistive
model in equation 5.13. Here, Of(θ) is an orientation factor obtained from the empirical
data in [36], and θ is the angle between the center of the TSV and the transistor channel.
\[ \Delta Mobility = -\Pi \times TSV_{stress} \times O_f(\theta) \tag{5.13} \]
Clock buffer delay is pre-characterized according to hole and electron
mobility variation. Assuming rising-edge-triggered flip-flops, our concern about
buffer delay variation can be narrowed to rising delay only. In Table 5.1,
we present the rising delay variation to show how much the clock buffer delay
changes with mobility variation. The work can be extended to falling-edge-triggered
cases in a similar way. According to Table 5.1, rising delay variation
mainly depends on hole mobility variation because PMOS is used to charge the
output capacitance during the rising transition. We use the NanGate library and
the 45nm PTM model [101] to characterize the delay variation.
Table 5.1: Buffer rising delay variation according to mobility changes (nominal delay: 210ps)
∆ Electron    ∆ Hole Mobility
Mobility      -16%      -8%       0%       8%      16%
0%           12.0%     5.1%     0.0%    -4.0%    -7.6%
8%           10.8%     4.8%    -0.3%    -4.4%    -7.9%
16%          11.3%     5.3%     0.1%    -4.7%    -8.4%
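A pre-characterized table of this kind is typically consumed through a lookup. The Python sketch below uses bilinear interpolation between the grid points of Table 5.1 and clamps inputs to the characterized range; both choices are assumptions made for illustration, since the text does not state how intermediate mobility values are handled, and the function names are hypothetical.

```python
from bisect import bisect_right

HOLE = [-16.0, -8.0, 0.0, 8.0, 16.0]   # delta hole mobility (%), table columns
ELECTRON = [0.0, 8.0, 16.0]            # delta electron mobility (%), table rows
RISE_DELAY = [                         # rising delay variation (%), Table 5.1
    [12.0, 5.1, 0.0, -4.0, -7.6],
    [10.8, 4.8, -0.3, -4.4, -7.9],
    [11.3, 5.3, 0.1, -4.7, -8.4],
]

def _bracket(grid, v):
    """Lower grid index and fractional position of v, clamped to the grid."""
    v = min(max(v, grid[0]), grid[-1])
    lo = min(bisect_right(grid, v) - 1, len(grid) - 2)
    frac = (v - grid[lo]) / (grid[lo + 1] - grid[lo])
    return lo, frac

def rising_delay_variation(d_electron, d_hole):
    """Bilinear interpolation of Table 5.1 (inputs and result in percent)."""
    r, fr = _bracket(ELECTRON, d_electron)
    c, fc = _bracket(HOLE, d_hole)
    top = RISE_DELAY[r][c] * (1 - fc) + RISE_DELAY[r][c + 1] * fc
    bot = RISE_DELAY[r + 1][c] * (1 - fc) + RISE_DELAY[r + 1][c + 1] * fc
    return top * (1 - fr) + bot * fr

print(rising_delay_variation(0.0, -16.0))
print(rising_delay_variation(0.0, -12.0))
```

At the grid points the lookup returns the tabulated values unchanged, so it is consistent with the characterization data wherever that data exists.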
To show clock buffer variation under our modeling, we present the rising
delay contour based on the proposed modeling with four TSVs in Fig. 5.7.
Fig. 5.7(a) shows the TSV induced stress contour. The radius of the TSVs is 2um and
the Keep-Out-Zone (KOZ), denoted by the gray cylindrical shape, is 1um. Stress due
to the TSV is approximately 150MPa outside the KOZ. Fig. 5.7(b) and (c) show the electron
mobility and hole mobility variation contours, respectively. Since hole mobility
can be either enhanced or degraded depending on the relative orientation between a
TSV and a transistor channel, we can see that hole mobility is more susceptible
to the stress than electron mobility. Finally, Fig. 5.7(d) shows the buffer delay
variation contour for the rising transition. As expected, rising buffer delay
depends strongly on hole mobility variation. In the four-TSV case, we
observe approximately 10% delay variation for clock buffers, from -3% to +7%.
Therefore, TSV stress can lead to excessive skew beyond the minimum skew target
if we do not take the TSV induced stress effect into account during CTS.
(a) TSV induced stress contour
(b) Electron mobility variation contour
(c) Hole mobility variation contour
(d) Rising buffer delay variation contour
Figure 5.7: TSV induced stress and clock buffer variation modeling.
5.3.3 Three-Dimensional Buffered Clock Tree Synthesis (CTS)
The major difference between 2D and 3D clock trees comes from TSVs.
TSVs not only add much larger capacitance, which causes more buffer insertion
than in a 2D clock tree, but also apply stress to nearby clock buffers and change
the effective resistance of those buffers. Since TSVs may lead to manufacturability
problems as well, it is desirable to reduce the number of TSVs during 3D CTS,
in addition to the fundamental goal of a 2D clock tree: zero skew with minimum
wire-length. The 3D CTS is done in a bottom-up manner. We assume that TSVs
for logic paths are already fixed, and TSVs for clock trees can be arbitrarily
located unless they overlap with other TSVs or cells.
Abstract Tree Generation: As briefly explained in section 5.3, we
use the 3D-MMM algorithm to obtain the abstract tree from the given sink locations under
the given TSV upper bound [59]. After this step, the z-location of each merging
point (MP) is determined.
For every depth, in a bottom-up manner, do the following:
a) Identify candidates for buffer insertion, if the child node capacitance
exceeds the predefined capacitance.
b) Determine z-location of buffer, using the ILP formulation to
minimize covariance. The ILP formulation uses the clock tree information
constructed so far, together with logic path information, to find the optimal
z-location of a newly inserted buffer. If the z-location of the buffer determined by
the ILP formulation is different from the z-location of the child node, a TSV is
inserted between the child node and the buffer. If the buffers on two edges are assigned
to the same tier and the MP is not, we change the MP tier to the buffer tier in order
to reduce the number of TSVs.
c) Determine (x, y) location for clock buffers. To obtain the delay
variation of a buffer due to TSV stress, we need to fix the buffer and TSV locations.
For simplicity, we assume that TSVs are located immediately after child
nodes if they are required. To determine the buffer location, we calculate the maximum
allowed wire-length from the TSV to the buffer that guarantees small enough
capacitance. Fig. 5.8 shows the wire, TSV, and buffer models used to calculate
downstream capacitance and downstream delay. The buffer's (x, y) location is defined
as a point that does not overlap any TSV or KOZ, on the straight line connecting
the two child nodes, within the maximum allowed wire-length from the TSV.
d) Get the wire-length of each edge. Based on the downstream
capacitance and downstream delay of the left and right child nodes, we calculate
the wire-length from each child node to the merging point to meet zero skew. Since
we already know the exact (x, y, z) location of each child node, we also have the
minimum wire-length between the two child nodes based on the half perimeter model.
As shown in Fig. 5.9(a), we need to search for the location of the merging point on a
1-dimensional coordinate, from zero (child1) to totalWL (child2), where
\[ totalWL = |x_2 - x_1| + |y_2 - y_1|. \tag{5.14} \]
Figure 5.8: Wire, TSV, and buffer modeling for delay calculation.
We use binary search to get the wire-length of each edge. To be more
specific, as depicted in Fig. 5.9(a), from the current reference point, point1 =
γ for the merging point, we calculate the skew at point2 = (γ+dl) and at
point3 = (γ-dl), where dl is the unit length to move. If the skew at point2 is the
minimum among the three, we move the reference point to the right, and if
the skew at point3 is the minimum, the next reference point will be to the left.
The location of the reference point, γ, is updated using the following
equation 5.15, where i is the iteration index of the binary search.
\[
\gamma_{i+1} =
\begin{cases}
\gamma_i - 0.5^{(i+2)} \times totalWL & \text{if point3 has the minimum skew} \\
\gamma_i + 0.5^{(i+2)} \times totalWL & \text{if point2 has the minimum skew}
\end{cases}
\tag{5.15}
\]
When the skew at a certain point is smaller than the skew tolerance, the
calculation of the wire-length from child node to merging point is finished. We
limit the binary search to 15 iterations, which guarantees 3nm
resolution for a 100um wire-length (100um / 2^15 ≈ 3nm). Elongation of the wire is
needed when the minimum skew along the whole wire occurs at the left or right
child node and is still larger than the skew tolerance. In such a case, we calculate the
wire-length to be elongated as explained in [70].
(a) Merging point on 1-D
(b) Merging point on 2-D
Figure 5.9: Illustrations for merging point determination.
e) Determine (x, y) location of merging point and TSVs. The merging
point can be placed anywhere between the two child nodes in the x-y plane. We
decide the (x, y) location of the merging point and TSVs based on the ratio of the
wire-lengths of the left and right edges, as described in Fig. 5.9(b). The (x, y) location of
the merging point can be expressed as equation 5.16.
\[
x_{MP} = \gamma \times \frac{a}{a+b}, \qquad y_{MP} = \gamma \times \frac{b}{a+b}
\tag{5.16}
\]
Similarly, the (x, y) location of each TSV can be determined in the same manner because
TSVs are evenly distributed along the edge. For example, in Fig. 5.9(b), the TSV
for child1 is located midway between child1 and MP.
f) Calculate stress-induced buffer resistance and refine the
wire-length to compensate it. With the stress map, we can adjust the buffer
delay at the current buffer location. Delay variation is directly interpreted as
buffer resistance variation, so the buffer resistance under the stress map can
be calculated as well. We then revisit step e) with the updated buffer resistance
to compensate for the change in buffer resistance. Note that this time all
TSV locations are kept fixed at their previous locations to preserve the same stress
effect; only the wire-length is adjusted, and the (x, y) location of the merging point
changes accordingly.
By doing step a) to f) for every depth from the bottom of the clock
tree, a buffered 3D clock tree with N dies can be constructed with minimum
wire-length as well as the skew under skew tolerance of the system.
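The merging point search of step d) can be sketched as follows. This is a minimal Python illustration of equations 5.14 and 5.15, not the C++ implementation of this work: the skew evaluation is passed in as a hypothetical callback `skew_at`, the starting point γ0 = totalWL/2 and the stop rule when neither neighbor improves the skew are assumptions, and the toy skew function used below is purely illustrative.

```python
def total_wirelength(child1, child2):
    """Half perimeter distance between two child nodes (equation 5.14)."""
    (x1, y1), (x2, y2) = child1, child2
    return abs(x2 - x1) + abs(y2 - y1)

def find_merging_point(skew_at, total_wl, dl, tol=0.0, max_iter=15):
    """Binary search for gamma on the 1-D edge [0, total_wl] (equation 5.15).

    skew_at(gamma) returns the absolute skew if the merging point sits at gamma.
    """
    gamma = 0.5 * total_wl                          # assumed starting point
    for i in range(max_iter):
        s1 = skew_at(gamma)                         # point1: current reference
        if s1 <= tol:
            break
        s2 = skew_at(min(gamma + dl, total_wl))     # point2: right neighbor
        s3 = skew_at(max(gamma - dl, 0.0))          # point3: left neighbor
        step = 0.5 ** (i + 2) * total_wl
        if s2 < s1 and s2 <= s3:
            gamma += step                           # move right
        elif s3 < s1:
            gamma -= step                           # move left
        else:
            break                                   # neither neighbor is better
    return gamma

# Toy example: skew vanishes 30um away from child1 on a 100um edge.
total_wl = total_wirelength((0.0, 0.0), (60.0, 40.0))
gamma = find_merging_point(lambda g: abs(g - 30.0), total_wl, dl=1e-3)
print(total_wl, gamma)
```

Because the step is halved at every iteration, the search interval shrinks geometrically, which is what bounds the resolution after a fixed number of iterations.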
5.4 Experimental Results
We implement the proposed CTS flow in C++ and test it on a 2.93GHz
Linux machine with 16GB RAM. We use the NanGate library and the 45nm PTM
model [101] to characterize variance and covariance, assuming 5% inter-die
and 5% intra-die variation of transistor lengths. Gurobi [35] is used as the ILP
solver.
We use several random circuits to verify the efficiency of our algorithm.
Table 5.2 shows the circuit information used for our experiments. We use the same
number of clock sinks and the same TSV density for all benchmarks to focus on the trend
across different numbers of stacked tiers. # T.P. means the number of
paths targeted by the optimization. For example, if we choose # T.P.=1, our
algorithm tries to optimize the most critical path. TSV density is the percentage
of area occupied by TSVs. The TSV diameter is 4um and the KOZ is 1um. We assume
that the TSV capacitance is 28fF and the resistance is 0.053Ω.
First, we show that our work can provide a design guideline to reduce
the stress effect on clock skew. Table 5.3 shows the skew caused by TSV stress
according to the clock source z-location. To see the stress induced skew change, we run
CTS without stress consideration, targeting zero skew, and then measure the skew with
the stress model. Since the bottom tier in 3D stacking does not need TSVs in its
silicon substrate, a clock buffer in Tier 0 is not affected by the stress.
If the clock source is in Tier 0 (bottom tier), clock buffers tend to be concentrated
on Tier 0, which reduces the skew variation due to the stress. However, we see a
large increase of the skew (62.9ps) when the clock source is placed on Tier 1.
This result provides a guideline for reducing skew caused by TSV stress. For
the remaining experiments, we assume that clock sources are placed in Tier 0
to show conservative results.
Second, we verify the usefulness of our stress aware CTS. Table 5.4 and
Table 5.5 show case1 and case2, respectively, to show the skew variation for all of the
benchmarks. Case1 in Table 5.4 is CTS without covariance optimization
or stress consideration, while case2 in Table 5.5 is stress aware CTS without
covariance optimization. In the tables, Cov. means the average covariance for the
optimized paths, and σ stands for the standard deviation of CP. Covariance and σ
are average values over all targeted paths. The comparison shows that the
skew due to the stress can be up to 12.8ps for CKT8 if we do not consider
TSV stress variation during CTS. The clock period of CKT8 can increase by 2.8%,
from 454ps to 466.8ps. If the clock source is on Tier 1, Table 5.3 indicates that the
overall clock period can increase by more than 10%. Table 5.4 and Table 5.5 show that
stress aware CTS incurs no penalty in clock buffers, TSVs or wire-length.
Table 5.2: Circuit Information
Name   #Tier   Die size (um^2)   #T.P.   #Sink   TSV density
CKT1   2       1000^2            1       2000    10%
CKT2   3       1000^2            1       2000    10%
CKT3   4       1000^2            1       2000    10%
CKT4   2       2000^2            10      2000    10%
CKT5   3       2000^2            10      2000    10%
CKT6   4       2000^2            10      2000    10%
CKT7   2       5000^2            100     2000    10%
CKT8   3       5000^2            100     2000    10%
CKT9   4       5000^2            100     2000    10%
Table 5.3: Skew change due to TSV stress according to clock source z-location (CKT9)
Source Tier                Tier 0     Tier 1     Tier 2     Tier 3
Skew without TSV stress    < 0.1ps    < 0.1ps    < 0.1ps    < 0.1ps
Skew with TSV stress       9.3ps      62.9ps     53.7ps     37.0ps
Next, our variation reduction using the ILP formulation is verified in
Table 5.6 and Table 5.7. CTS without stress consideration, case3 in Table 5.6,
shows relatively large skew caused by TSV stress because our ILP formulation
forces clock buffers to spread more evenly over the tiers. We use β = 0 to
see the maximum variation reduction. β is the control parameter introduced in
equation 5.12 to avoid too many TSV insertions. In case4, CKT9 takes 143 seconds
to do CTS. Fig. 5.10 shows the runtime increase as the number of optimized
paths increases. Even though solving an ILP is NP-hard in general, we show
that five thousand paths can be optimized in 275 minutes, which is still
reasonable for practical use.
Table 5.4: Clock period analysis result. Case1 (No covariance optimization is done, and no stress is considered)
Circuit   Cov.     σ(ps)   Skew(ps)   µcp+3σ(ps)   #Buf   #TSV   WL(um)   CPU(s)
CKT1      7.8      14.0    1.4        405.6        877    676    2.03e7   18
CKT2      -1.4     13.9    5.0        400.4        1025   1288   2.39e7   24
CKT3      -23.5    13.0    6.5        476.4        1168   1824   3.00e7   23
CKT4      -13.8    13.4    0.2        429.5        892    684    2.15e7   20
CKT5      11.9     14.6    10.1       484.8        1025   1293   2.31e7   23
CKT6      0.0      14.4    3.8        430.1        1195   1918   3.16e7   22
CKT7      89.2     16.7    1.0        460.4        878    674    2.08e7   19
CKT8      107.0    17.4    12.8       466.8        1042   1307   2.61e7   18
CKT9      133.6    18.2    9.3        466.0        1185   1895   3.05e7   21
Last, Table 5.8 compares case1 in Table 5.4 and case4 in Table 5.7 to
show the combined impact of covariance reduction and stress consideration.
With our ILP formulation, the standard deviation (σ) of the clock period decreases
by up to 34.2% (CKT6). Combining our ILP formulation and stress modeling,
we can reduce the clock period at the 3σ level by up to 5.7% (CKT8). All of the
benchmarks show large variation reduction (avg: 22.3%) and skew reduction (avg: 5.6ps)
even though clock sources are assumed to be on the bottom tier.
Table 5.5: Clock period analysis result. Case2 (No covariance optimization is done, but TSV stress is considered)
Circuit   Cov.     σ(ps)   Skew(ps)   µcp+3σ(ps)   #Buf   #TSV   WL(um)   CPU(s)
CKT1      8.0      14.0    0.1        404.3        877    676    2.03e7   15
CKT2      -1.2     13.9    0.0        395.5        1025   1288   2.39e7   24
CKT3      -23.5    13.0    0.0        469.9        1170   1824   3.00e7   25
CKT4      -13.9    13.4    0.0        429.2        892    684    2.15e7   19
CKT5      12.6     14.7    0.0        474.8        1024   1293   2.31e7   24
CKT6      -0.2     14.4    0.0        426.2        1198   1918   3.15e7   24
CKT7      89.7     16.8    0.0        459.5        881    674    2.08e7   20
CKT8      106.9    17.4    0.0        454.0        1044   1307   2.62e7   15
CKT9      134.8    18.1    0.8        457.0        1184   1895   3.05e7   21
The increases in buffer count and wire-length are negligible during covariance
optimization. However, the number of TSVs increases dramatically, by up to 59.1%.
In our ILP formulation, we can control the degree of covariance optimization and the
number of inserted TSVs by adjusting β in equation 5.12. More interestingly, we
observe a trade-off between covariance reduction and TSV
insertion: more covariance reduction requires more TSV insertion, as shown in Fig. 5.11.
From the graph, we can find the maximum covariance reduction achievable for a given
maximum allowed number of TSVs.
Table 5.6: Clock period analysis result. Case3 (Covariance optimization is done, but no stress is considered)
Circuit   Cov.     σ(ps)   Skew(ps)   µcp+3σ(ps)   #Buf   #TSV   WL(um)   CPU(s)
CKT1      -100.3   9.3     1.4        391.6        877    680    2.03e7   34
CKT2      -34.8    12.6    12.4       404.0        1024   1322   2.39e7   42
CKT3      -59.3    11.6    31.2       496.8        1168   1851   3.00e7   43
CKT4      -56.9    11.7    40.0       464.4        893    797    2.15e7   41
CKT5      -94.5    10.4    25.3       487.4        1026   1494   2.32e7   51
CKT6      -116.8   9.5     11.8       423.3        1195   2303   3.18e7   57
CKT7      -14.6    13.3    45.6       494.7        883    1002   2.13e7   92
CKT8      -29.9    12.9    50.9       491.4        1044   2080   2.69e7   107
CKT9      -13.7    13.6    42.5       485.4        1186   2302   3.08e7   141
5.5 Discussions
For 3D-IC design, we observe two important design challenges: variation
between tiers and TSV induced stress. The inter-die variation effect can be used
to compensate clock path variation, which reduces random variation. TSV
induced stress is a systematic component of variation. We can reduce the nominal
value of the clock period by considering the stress during CTS, and minimize
the variation of the clock period with optimal assignment of clock buffer z-locations.
The proposed 3D CTS can enhance maximum frequency by up to 5.7% by combining
the two approaches. Our observations are not limited to CTS, and open
new opportunities for statistical timing analysis and physical design.
Table 5.7: Clock period analysis result. Case4 (Covariance optimization is done, and TSV stress is considered)
Circuit   Cov.     σ(ps)   Skew(ps)   µcp+3σ(ps)   #Buf   #TSV   WL(um)   CPU(s)
CKT1      -100.3   9.3     0.1        390.3        877    680    2.03e7   32
CKT2      -28.8    12.9    0.0        392.4        1025   1322   2.39e7   43
CKT3      -60.4    11.5    0.0        465.4        1170   1851   3.00e7   43
CKT4      -55.3    11.8    0.0        424.5        893    797    2.16e7   45
CKT5      -95.2    10.4    0.0        462.0        1025   1494   2.32e7   52
CKT6      -113.5   9.6     0.0        412.1        1198   2303   3.17e7   56
CKT7      -12.9    13.4    0.0        449.3        886    1006   2.13e7   90
CKT8      -30.4    12.8    0.0        440.4        1046   2080   2.70e7   113
CKT9      -11.3    13.7    0.0        443.1        1186   2280   3.09e7   143
Figure 5.10: Runtime trend.
Figure 5.11: Trade-off between covariance and # TSVs.
Table 5.8: No stress, no inter-die aware CTS vs. our stress and inter-die aware CTS
Circuit   ∆σ        ∆Skew(ps)   ∆(µcp+3σ)   ∆#Buf   ∆#TSV    ∆WL
CKT1      -33.3%    -1.3        -3.8%       0.0%    0.6%     0.1%
CKT2      -9.1%     -5.0        -2.0%       0.0%    2.6%     0.1%
CKT3      -11.2%    -6.5        -2.3%       0.2%    1.5%     0.0%
CKT4      -12.3%    -0.2        -1.2%       0.1%    16.5%    0.3%
CKT5      -28.7%    -10.1       -4.7%       0.0%    15.5%    0.3%
CKT6      -34.2%    -3.8        -4.2%       0.3%    20.1%    0.5%
CKT7      -20.5%    -1.0        -2.4%       0.9%    49.3%    2.2%
CKT8      -26.0%    -12.8       -5.7%       0.4%    59.1%    3.3%
CKT9      -25.4%    -9.3        -4.9%       0.1%    20.3%    1.3%
AVG       -22.3%    -5.6        -3.5%       0.2%    20.6%    0.9%
Chapter 6
Conclusions
This dissertation studied algorithms and modeling techniques to mit-
igate the difficulty of continuing Moore’s law. To achieve high density in-
tegration, two different approaches were exploited. The first direction was to
overcome patterning limitation with double patterning technology. The second
technique was 3D wafer stacking with TSV connection.
DPL should be used to print patterns below 80nm pitch, which cannot
be printed with current lithography equipment. At the 20nm technology node,
DPL will be used for metal layers. However, gate, contact and metal layers
need to be printed with DPL for 14nm technology if EUV is not available.
In this dissertation, we proposed CAD approaches to enable more robust DPL.
In Chapter 2, we proposed a method to estimate the layout distortion due
to overlay, which is inevitable for DPT. We defined several overlay variables
such as the amplitude of translation overlay, the angle of rotation overlay, and
the magnification factor. With a given overlay variable, we could model the
parameterized coupling capacitance. We showed how to determine the overlay
variables for the worst timing of a chip. This work provides a way of designing
a robust circuit with consideration of overlay. In Chapter 3, we proposed a
fast and flexible layout decomposition framework with a graph theoretical
approach. To show the flexibility of our work, we extended it to reduce
the timing variation due to overlay for contact and metal layer decomposition.
TSVs will first be used for memory stacking; their usage will then be extended
to memory and logic integration, and finally TSVs will be used to connect
logic chips. Since wafer stacking with TSVs is still under development,
we need to investigate possible problems and solutions for robust TSV
integration. In Chapter 4, we showed that Cu TSVs cause thermal stress which
can lead to significant timing variations. Stress can be taken advantage of for
timing optimization, since it is a strongly layout dependent, systematic effect.
In this thesis, we developed a first-order compact model for mobility variation
and proposed a design methodology to analyze the systematic variation and
optimize the layout by locating critical cells in a mobility enhanced region. For
3D-IC design, we observed one more design challenge, which is variation
between tiers. The inter-die variation effect can be used to compensate clock path
variation, which reduces random variation, while TSV induced stress is a
systematic component of variation. We can reduce the nominal value of the clock
period by considering the stress during CTS, and minimize the variation of the
clock period with optimal assignment of clock buffer z-locations. The proposed
3D CTS can enhance maximum frequency by combining the two approaches.
We hope that this dissertation will foster follow-up research in the
above areas. Some possible directions include:
• At the 14nm technology node, double patterning may not be enough to
achieve 50% scaling. If EUV is not ready for 14nm technology, triple
patterning may be the only choice for metal layers. Triple patterning
requires more complicated decomposition. Since our decomposition algorithm
uses bi-partitioning, our work can be extended to layout decomposition
for triple patterning with a 3-way partitioning method. It could even
be extended to provide an effective solution for quadruple patterning.
In addition, rule-based decomposition may not be efficient because it does
not consider hot spots. Therefore, hybrid layout decomposition combining
rule-based and model-based decomposition may generate more effective
decomposed layouts.
• Our TSV stress-aware timing analysis framework for 3D-IC may open
opportunities for many stress-aware layout optimizations, such as CTS
and TSV optimizations. Our observations are not limited to CTS and
physical design; they also open new opportunities for statistical timing
analysis. Although this dissertation focuses on stress in the horizontal
direction, TSV-induced stress can also occur in the vertical direction,
so vertical stress is a promising research topic. In addition, since there
are many other stress sources in semiconductor manufacturing, the work
can be extended to consider the combined effect of multiple stress
sources.
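The extension from bi-partitioning to 3-way partitioning mentioned in the first
direction can be pictured as assigning each layout feature to one of k masks so
that few conflicting neighbors share a mask. The sketch below uses a simple
greedy heuristic on a conflict graph. It is an assumed stand-in chosen for
brevity, not the decomposition algorithm of this dissertation, and the names
assign_masks and count_conflicts are hypothetical.

```python
def assign_masks(num_features, conflicts, num_masks):
    """Greedily assign each feature to the mask that clashes least with
    its already-assigned neighbors in the conflict graph.

    conflicts is a list of (a, b) pairs of feature indices that are too
    close to be printed on the same mask.
    """
    neighbors = [[] for _ in range(num_features)]
    for a, b in conflicts:
        neighbors[a].append(b)
        neighbors[b].append(a)

    mask = [None] * num_features
    # Visit the most constrained (highest-degree) features first.
    order = sorted(range(num_features), key=lambda f: -len(neighbors[f]))
    for f in order:
        cost = [0] * num_masks
        for n in neighbors[f]:
            if mask[n] is not None:
                cost[mask[n]] += 1
        mask[f] = cost.index(min(cost))
    return mask

def count_conflicts(mask, conflicts):
    """Number of conflict pairs that ended up on the same mask."""
    return sum(1 for a, b in conflicts if mask[a] == mask[b])

# Three mutually conflicting features cannot be split cleanly onto two masks,
# but a third mask resolves the remaining conflict.
triangle = [(0, 1), (1, 2), (0, 2)]
print(count_conflicts(assign_masks(3, triangle, 2), triangle))
print(count_conflicts(assign_masks(3, triangle, 3), triangle))
```

The same routine accepts num_masks = 4, which is how a quadruple patterning
variant could be explored; a production decomposer would also need to handle
stitches and balance pattern density across masks.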
Bibliography
[1] K. Adam and W. Maurer. Polarization effects in immersion lithogra-
phy. Journal of Microlithography, Microfabrication and Microsystems,
4(3):031106, 2005.
[2] Tomoyuki Ando, Masaru Takeshita, Ryoich Takasu, Yasuhiro Yoshii,
Jun Iwashita, Shogo Matsumaru, Sho Abe, and Takeshi Iwai. Pattern
Freezing Process Free Litho-Litho-Etch Double Patterning. In Proc. of
SPIE, volume 7140, Feb 2008.
[3] Krit Athikulwongse, Ashutosh Chakraborty, Jae-Seok Yang, David Z.
Pan, and Sung Kyu Lim. Stress-Driven 3D-IC Placement with TSV
Keep-Out Zone and Regularity Study. In Proc. Int. Conf. on Computer
Aided Design, Nov 2010.
[4] G. Bailey, A. Tritchkov, J.-W. Park, L. Hong, V. Wiaux, E. Hendrickx,
S. Verhaegen, P. Xie, and J. Versluijs. Double pattern EDA solutions
for 32nm HP and beyond. In Proc. SPIE 6521, 2007.
[5] Yongchan Ban, Soo-Han Choi, Kevin Lucas, Chul-Hong Park, and David Z.
Pan. Layout Decomposition of Self-Aligned Double Patterning for 2D
Random Logic Patterning. In Proc. SPIE, 2011.
137
[6] Yongchan Ban, Kevin Lucas, and David Z. Pan. Flexible 2D Layout De-
composition Framework for Spacer-type Double Pattering Lithography.
In Proc. Design Automation Conf., June 2011.
[7] Yongchan Ban and David Z. Pan. Compact Modeling and Robust Lay-
out Optimization for Contacts in Deep Sub-wavelength Lithography. In
Proc. Design Automation Conf., Jun 2010.
[8] Yongchan Ban and Jae-Seok Yang. Layout Aware Line-Edge Roughness
Modeling and Poly Optimization for Leakage Minimization. In Proc.
Design Automation Conf., June 2011.
[9] K. D. Boese, A. B. Kahng, B. A. McCoy, and G. Robins. Fidelity and
near-optimality of Elmore-based routing constructions. In Proc. IEEE
Int. Conf. on Computer Design, pages 81–84, 1993.
[10] K. Cao, J. Hu, and S. Dobre. Standard cell characterization considering
lithography induced variations. In IEEE/ACM Int. Workshop on on
Timing Issues in the Specification and Synthesis of Digital Systems (TAU
), 2006.
[11] H. Chang and S.S. Sapatnekar. Statistical Timing Analysis Under Spa-
tial Correlations. In IEEE Trans. on Computer-Aided Design of Inte-
grated Circuits and Systems, September 2005.
[12] Y.-S. Chang, M.-F. Tsai, C-C Lin, and J-C Lai. Pattern Decomposi-
tion and Process Integration of Self-Aligned Double Patterning for 30nm
138
Node NAND FLASH Process and Beyond. In Proc. SPIE 7274, 2009.
[13] P. Chen, D. A. Kirkpatrick, and K. Keutzer. Miller Factor for Gate-
Level Coupling Delay Calculation. In Proc. Int. Conf. on Computer
Aided Design, Nov 2000.
[14] C. Chiang and S. Sinha. The road to 3d eda tool readiness. In Proc.
Asia and South Pacific Design Automation Conf., Jan 2009.
[15] E. Y. Chin and A. R. Neureuther. Variability aware interconnect timing
models for double patterning. In Proc. SPIE 7275, 2009.
[16] T.-B. Chiou, R. Socha, H. Chen, L. Chen, S. Hsu, P. Nikolsky, A. van
Oosten, and A. C. Chen. Development of layout split algorithms and
printability evaluation for double patterning technology. In Proc. SPIE
6924, 2008.
[17] Minsik Cho, Yongchan Ban, and David Z. Pan. Double Patterning Tech-
nology Friendly Detailed Routing. In Proc. Int. Conf. on Computer
Aided Design, Nov 2008.
[18] Dongsub Choi, Chulseung Lee, Changjin Bang, Daehee Cho, Myunggoon
Gil, Pavel Izilson, Seunghoon Yoon, and Dohwa Lee. Optimization of
high order control including overlay, alignment, and sampling. In Proc.
SPIE 6922, 2008.
139
[19] T. Dao, D. H. Triyoso, M. Petras, and M. Canonico. Through Silicon
Via Stress Characterization. In IEEE International Conference on IC
Design and Technology, 2009.
[20] F. Dartu, N. Menezes, J. Qian, and L. T. Pillage. A gate-delay model for
high-speed CMOS circuits. In Proc. Design Automation Conf., pages
576–580, 1994.
[21] Duo Ding, Jhih-Rong Gao, Kun Yuan, and David Z. Pan. A Generic
Lithography-friendly Detailed Router based on Post RET Data Learning
and Hotspot Detection . In Proc. Design Automation Conf., Jun 2011.
[22] Micrea Dusa, Jo Finders, and Stephen Hsu. Double patterning lithog-
raphy: The bridge between low k1 ArF and EUV. In mic, Feb 2008.
[23] Mircea Dusa, John Quaedackers, Olaf F. A. Larsen, Jeroen Meessen,
Eddy van der Heijden, Gerald Dicker, Onno Wismans, Paul de Haas,
Koen van Ingen Schenau, Jo Finders, Bert Vleemingb, Geert Storms,
Patrick Jaenen, Shaunee Cheng, and Mireille Maenhoudt. Pitch dou-
bling through dual patterning lithography challenges in integration and
litho budgets. In Proc. SPIE, volume 6520, February 2007.
[24] Ilan Englard, Rich Piech, Claudio Masia, Noam Hillel, Liraz Gershtein,
Dana Sofer, Ram Peltinov, and Ofer Adan. Accurate in-resolution level
overlay metrology for multi patterning lithography techniques. In Proc.
SPIE 6922, 2008.
140
[25] C. Ferri, S. Reda, and R. I. Bahar. Strategies for improving the para-
metric yield and profits of 3d ics. In Proc. Int. Conf. on Computer
Aided Design, Nov 2007.
[26] C.M. Fiduccia and R.M. Mattheyses. A Linear-Time Heuristic for Im-
proving Network Partitions. In Proc. Design Automation Conf., June
1982.
[27] D. Flagello, Bernd Geh, Steve Hansen, and Michael Totzeck. Polar-
ization effects associated with hyper-numerical-aperture (> 1) lithogra-
phy. Journal of Microlithography, Microfabrication and Microsystems,
4(3):031104, 2005.
[28] P. Friedberg, Y. Cao, J. Cain, R. Wang, J. Rabaey, and C. Spanos.
Modeling Within-Die Spatial Correlation Effects for Process-Design Co-
Optimization. In Proc. Int. Symp. on Quality Electronic Design, Mar
2005.
[29] H. Fukutome, T. Aoyama, Y. Momiyama, T. Kubo, Y. Tagawa, and
H. Arimoto. Direct evaluation of gate line edge roughness impact on
extension profiles in sub-50nm n-mosfets. In Proc. Int. Symp. on
Physical Design, pages 433–436, 2004.
[30] R. S. Ghaida and P. Gupta. Design-Overlay Interactions in Metal Dou-
ble Patterning. In Proc. SPIE 7275, 2009.
141
[31] Mohit Gupta, Kwangok Jeong, and Andrew B. Kahng. Timing yield-
aware color reassignment and detailed placement perturbation for double
patterning lithography. In Proc. Int. Conf. on Computer Aided Design,
November 2009.
[32] R. Gupta, B. Tutuianu, B. Krauter, and L. T. Pillage. The Elmore
delay as a bound for RC trees with generalized input signals. In Proc.
Design Automation Conf., pages 364–369, June 1995.
[33] R. Gupta, B. Tutuianu, and L. T. Pileggi. The Elmore delay as a
bound for RC trees with generalized input signals. In IEEE Trans.
on Computer-Aided Design of Integrated Circuits and Systems, pages
95–104, January 1997.
[34] http://www.gnu.org/software/glpk/.
[35] http://www.gurobi.com/.
[36] H. Irie, K. Kita, K. Kyuno, and A. Toriumi. In-Plane Mobility Anisotropy
and Universality Under Uni-axial Strains in n- and p-MOS Inversion
Layers on (100), (110), and (111) Si. In IEEE International Electron
Devices Meeting, 2004.
[37] K. Jeong, A. B. Kahng, and R. O. Topaloglu. Assessing Chip-Level
Impact of Double-Patterning Lithography. In Proc. Int. Symp. on
Quality Electronic Design, March 2010.
142
[38] K. Jeong and A.B. Kahng. Timing Analysis and Optimization Impli-
cations of Bimodel CD Distribution in Double Patterning Lithography.
In Proc. Asia and South Pacific Design Automation Conference, 2009.
[39] Minhee Jung, Joon-Min Park, Moonseok Kim, Sukjoon Hong, Jaisoon
Kim, In-Ho Park, and Hye-Keun Oh. 32 nm half pitch formation with
high numerical aperture single exposure. In Proc. of SPIE, volume
7274, Feb 2009.
[40] Moongon Jung, Joydeep Mitra, David Z. Pan, and Sung Kyu Lim. TSV
Stress-aware Full-Chip Mechanical Reliability Analysis and Optimiza-
tion for 3D IC. In Proc. Design Automation Conf., June 2011.
[41] A. B. Kahng, S. Muddu, and E. Sarto. On Switch Factor Based Analysis
of Coupled RC Interconnects. In Proc. Design Automation Conf., June
2000.
[42] A. B. Kahng, P. Sharma, and R. O. Topaloglu. Exploiting STI stress
for perform. In Proc. Int. Conf. on Computer Aided Design, Nov 2007.
[43] A.B. Kahng, C.-H. Park, and H. Yao. Layout Decomposition for Double
Patterning Lithography. In Proc. Int. Conf. on Computer Aided
Design, Nov 2008.
[44] Andrew B. Kahng, Sudhakar Muddu, and Devendra Vidhani. Noise and
Delay Uncertainty Studies for Coupled RC Interconnects. In Proc. Int.
Conf. Asic/SOC, 1999.
143
[45] Andrew B. Kahng, Chul-Hong Park, Xu Xu, and Hailong Yao. Layout
Decomposition Approaches for Double Patterning Lithography.
[46] D. H. Kim, K. Athikulwongse, and S. K. Lim. A Study of Through-
Silicon-Via Impact on the 3-D Stacked IC Layout. In Proc. Int. Conf.
on Computer Aided Design, Nov 2009.
[47] T.-Y. Kim and T. Kim. Clock tree embedding for 3d ics. In Proc. Asia
and South Pacific Design Automation Conf., Jan 2010.
[48] D. Laidler, P. Leray, K. D’have, and S. Cheng. Sources of Overlay Error
in Double Patterning Integration Schemes. In Proc. SPIE 6922, 2008.
[49] Y.-J. Lee, R. Goel, and S. K. Lim. Multi-functional Interconnect Co-
optimization for Fast and Reliable 3D Stacked ICs. In Proc. Int. Conf.
on Computer Aided Design, Nov 2009.
[50] Harry J. Levinson. Principles of Lithography, 2nd Edition. SPIE
Publications, 2005.
[51] Burn J. Lin. Successors of ArF Water-Immersion Lithography: EUV
Lithography, Multi-e-beam Maskless Lithography, or Nanoimprint? In
J. Micro/Nanolith. MEMS MOEMS, volume 7, Dec 2008.
[52] F. Liu. A General Framwwork for Spatial Correlation Modeling in VLSI
Design. In Proc. Design Automation Conf., Jun 2007.
144
[53] F.-J. Liu, J. Lillis, and C.-K. Cheng. A new layout-driven timing model
for incremental layout optimization. In Proc. Asia and South Pacific
Design Automation Conf., 1997.
[54] K. H. Lu, X. Zhang, S.-K. Ryu, J. Im, R. Huang, and P. S. Ho. Thermo-
Mechanical Reliability of 3-D ICs containing Through Silicon Vias. In
Electronic Components and Technology Conference, 2009.
[55] Kevin Lucas, Chris Cork, Alex Miloslavsky, Gerry Luk-Pat, Levi Barnes,
John Hapli, John Lewellen, Greg Rollins, Vincent Wiaux, and Staf Ver-
haegen. Interactions of double patterning technology with wafer pro-
cessing, OPC and design flows. In Proc. SPIE 6924, 2008.
[56] M. S. Lundstrom. On the Mobility Versus Drain Current Relation for
a Nanoscale MOSFET. In IEEE Electron Device Letters, volume 22,
pages 293–295, June 2001.
[57] W.-K. Ma, J.-H. Kang, C.-M. Lim, H.-S. Kim, S.-C. Moon, S. Lalbaha-
doersing, and S.-C Oh. Alignment system and process optimization for
improvement of double patterning overlay. In Proc. SPIE 6922, 2008.
[58] Yongchan Ban Minsik Cho, Kun Yuan and David Z. Pan. ELIAD:
Efficient Lithography Aware Detailed Routing Algorithm with Compact
and Macro Post-OPC Printability Prediction. In IEEE Transactions
on Computer-Aided Design of Integrated Circuits and Systems (TCAD),
volume 28, pages 1006–1016, 2009.
145
[59] J. Minz, X. Zhao, and S. K. Lim. Buffered clock tree synthesis for 3d
ics under thermal variations. In Proc. Asia and South Pacific Design
Automation Conf., Jan 2008.
[60] J. Mitra, P. Yu, and D. Z. Pan. RADAR: RET-Aware Detailed Rout-
ing Using Fast Lithography Simulations. In Proc. Design Automation
Conf., Jun 2005.
[61] Joydeep Mitra, Moongon Jung, Rui Huang, Suk-Kyu Ryu, Sung Kyu
Lim, and David Z. Pan. A Fast Simulation Framework for Full-Chip
Thermo-Mechanical Stress and Reliability Analysis of Through-Silicon-
Via based 3D ICs. In Electronic Components and Technology Confer-
ence, 2011.
[62] V. Moroz, L. Smith, X.-W. Lin, D. Pramanik, and G. Rollins. Stress-
Aware Design Methodology. In Proc. Int. Symp. on Quality Electronic
Design, March 2006.
[63] K. Nabors and J. White. Fastcap: A multipole accelerated 3-d capaci-
tance extraction program. IEEE Trans. on Computer-Aided Design of
Integrated Circuits and Systems, 10:1447–1459, Nov., 1991.
[64] R. Nair, L. Berman, P. S. Hauge, and E. J. Yoffa. Generation of perfor-
mance constraints for layout. IEEE Trans. on Computer-Aided Design
of Integrated Circuits and Systems, 8(8):860–874, 1989.
146
[65] Jiwoo Pak, Mohit Pathak, Sung Kyu Lim, and David Z. Pan. Modeling
of Electromigration in Through-Silicon-Via Based 3D IC. In Electronic
Components and Technology Conference, 2011.
[66] David Z. Pan, Minsik Cho, and Kun Yuan. Manufacturability Aware
Routing in Nanometer VLSI. In Foundations and Trends in Electronic
Design Automation, volume 4, page 197, 2010.
[67] David Z. Pan, Jae-Seok Yang, Kun Yuan, and Minsik Cho. CAD for
Double Patterning Lithography. In IEEE International Conference on
IC Design and Technology, Jun 2010.
[68] David Z. Pan, Jae-Seok Yang, Kun Yuan, Minsik Cho, and Yongchan
Ban. Layout Optimizations for Double Patterning Lithography. In
IEEE 8th International Conference on ASIC (ASICON), Oct 2009.
[69] Paul Penfield and Jorge Rubinstein. Signal delay in RC tree networks.
In Proc. Design Automation Conf., Jun 1981.
[70] R.-S.Tsay. An exact zero-skew clock routing algorithm. In IEEE Trans.
on Computer-Aided Design of Integrated Circuits and Systems, Vol.12,
No.2, pp.242-249 1993.
[71] A. Rajaram, J. Hu, and R. Mahapatra. Reducing clock skew variability
via cross links. In Proc. Design Automation Conf., 2004.
147
[72] A. Rajaram, D. Z. Pan, and J. Hu. Improved algorithms for link based
non-tree clock network for skew variability reduction. In Proc. Int.
Symp. on Physical Design, 2005.
[73] J. Rubinstein and A. Neureuther. Post-decomposition assessment of
double patterning layout. In Proc. SPIE 6924, 2008.
[74] C. S. Selvanayagam, J. H. Lau, X. Zhang, S.K.W. Seah, K. Vaidyanathan,
and T. C. Chai. Nonlinear Thermal Stress/Strain Analysis of Copper
Filled TSV and their Flip-Chip Microbumps. In Electronic Components
and Technology Conference, 2008.
[75] N. Serin, T. Serin, S. Horzum, and Y. Celik. Annealing effects on the
properties of copper oxide thin films prepared by chemical deposition.
In Electronic Journals, volume 20, pages 398–401, May 2005.
[76] Weicheng Shiu, Hung Jen Liu, Jan Shiun Wu, Tsu-Li Tseng, Chun Te
Liao, Chien Mao Liao, Jerry Liu, and Troy Wang. Advanced self-aligned
double patterning development for sub-30-nm DRAM manufacturing.
In Proc. of SPIE, volume 7274, Feb 2009.
[77] C. S. Smith. Piezoresistance effect in germanium and silicon. In Physi-
cal Review, volume 94, pages 42–49, Apr 1954.
[78] Hisanori Sugimachi, Hitoshi Kosugi, Tsuyoshi Shibata, Junichi Kitano,
Koichi Fujiwara, Michihiro Mita, Akimasa Soyano, Shiro Kusumoto, Mo-
toyuki Shima, and Yoshikazu Yamaguchi. CD Uniformity improvement
148
for Double-Patterning Lithography (Litho-Litho-Etch) Using Freezing
Process. In Proc. of SPIE, volume 7273, Feb 2009.
[79] S. Suthram, J. C. Ziegert, T. Nishida, and S. E. Thompson. Piezore-
sistance Coefficients of (100) Silicon nMOSFETs Measured at Low and
High Channel Stress. In IEEE Electron Device Letters, volume 28, pages
58–60, Jan 2007.
[80] S. E. Thompson, M. Armstrong, and C. Auth et al. A 90 nm logic tech-
nology featuring strained-silicon. In IEEE Trans. on Electron Devices,
volume 51, pages 1790–1797, Nov 2004.
[81] S. E. Thompson, G. Sun, Y. Sung Choi, and T. Nishida. Uniaxial-
Process-Induced Strained-Si: Extending the CMOS Roadmap. In IEEE
Trans. on Electron Devices, volume 53, pages 1010–1020, May 2006.
[82] M. Totzeck, P. Graupner, T. Heil, A. Gohnermeier, O. Dittmann, D. Krah-
mer, V. Kamenov, J. Ruoff, and D. Flagello. Polarization influence on
imaging. Journal of Microlithography, Microfabrication and Microsys-
tems, 4(3):031108, 2005.
[83] N. Toyama, T. Adachi, Y. Inazuki, T. Sutou, Y. Morikawa, H. Mohri,
and N. Hayashi. Pattern decomposition for double patterning from
photomask viewpoint. In Proc. SPIE 6521, 2007.
[84] K. Uchida, T. Krishnamohan, K.C. Saraswat, and Y. Nishi. Physical
mechanisms of electron mobility enhancement in uniaxial stressed MOS-
149
FETs and impact of uniaxial stress engineering in ballistic regime. In
IEEE International Electron Devices Meeting, 2005.
[85] Vincent Wiaux, Staf Verhaegen, Shaunee Cheng, Fumio Iwamoto, Patrick
Jaenen, Mireille Maenhoudt, Takashi Matsuda, Sergei Postnikov, and
Geert Vandenberghe. Split and design guidelines for double patterning.
In Proc. of SPIE, volume 6924, Feb 2008.
[86] M. J. Wieland, G. de Boer, G. F. ten Berge, R. Jager, T. van de Peut,
J. J. M. Peijster, E. Slot, S. W. H. K. Steenbrink, T. F. Teepen, A. H. V.
van Veen, and B. J. Kampherbeek. MAPPER: high-throughput mask-
less lithography. In Proc. of SPIE, volume 7271, Feb 2009.
[87] Alfred K. Wong. Resolution Enhancement Techniques in Optical Lithog-
raphy. SPIE Publications, 2001.
[88] Obert Wood, Chiew-Seng Koay, Karen Petrillo, Hiroyuki Mizuno, and
Sudhar Raghunathan. Integration of EUV lithography in the fabrication
of 22-nm node devices. In Proc. of SPIE, volume 7271, Feb 2009.
[89] Jinjun Xiong, Vladimir Zolotov, and Lei He. Robust extraction of
spatial correlation. In Proc. Int. Symp. on Physical Design, pages 2–9,
New York, NY, USA, 2006. ACM Press.
[90] Y. Xu and C. Chu. GREMA: Graph reduction based mask assignment
for double patterning technology. In Proc. Int. Conf. on Computer
Aided Design, Nov 2009.
150
[91] Yue Xu and Chris Chu. A Matching Based Decomposer for Double
Patterning Lithography. In IEEE Trans. on Computer-Aided Design of
Integrated Circuits and Systems, March 2010.
[92] Jae-Seok Yang, Krit Athikulwongse, Young-Joon Lee, Sung Kyu Lim,
and David Z. Pan. TSV Stress Aware Timing Analysis with Appli-
cations to 3D-IC Layout Optimization. In Proc. Design Automation
Conf., June 2010.
[93] Jae-Seok Yang, Katrina Lu, Minsik Cho, Kun Yuan, and David Z. Pan.
A New GraphTheoretic, MultiObjective Layout Decomposition Frame-
work for Double Patterning Lithography. In Proc. Asia and South
Pacific Design Automation Conf., Jan 2010.
[94] Jae-Seok Yang and A. Neureuther. Crosstalk Noise Variation Assess-
ment and Analysis for the Worst Process Corner. In Proc. Int. Symp.
on Quality Electronic Design, March 2008.
[95] Jae-Seok Yang, Jiwoo Pak, Xin Zhao, Sung Kyu Lim, and David Z. Pan.
Robust Clock Tree Synthesis with Timing Yield Optimization for 3DICs.
In Proc. Asia and South Pacific Design Automation Conf., Jan 2011.
[96] Jae-Seok Yang and David Z. Pan. Overlay Aware Interconnect and
Timing Variation Modeling for Double Patterning Technology. In Proc.
Int. Conf. on Computer Aided Design, Nov 2008.
151
[97] Kun Yuan and David Z. Pan. WISDOM: Wire Spreading Enhanced
Decomposition of Masks in Double Patterning Lithography. In Proc.
Int. Conf. on Computer Aided Design, Nov 2010.
[98] Kun Yuan and David Z. Pan. E-Beam Lithography Stencil Planning
and Optimization with Overlapped Characters. In Proc. Int. Symp. on
Physical Design, March 2011.
[99] Kun Yuan, Jae-Seok Yang, and David Z. Pan. Double Patterning Layout
Decomposition for Simultaneous Conflict and Stitch Minimization. In
Proc. Int. Symp. on Physical Design, March 2009.
[100] Kun Yuan, Jae-Seok Yang, and David Z. Pan. Double Patterning Lay-
out Decomposition for Simultaneous Conflict and Stitch Minimization.
In IEEE Trans. on Computer-Aided Design of Integrated Circuits and
Systems, Feb 2010.
[101] W. Zhao and Y. Cao. New generation of Predictive Technology Model
for sub-45nm early design exploration. In IEEE Trans. on Electron
Devices, 2006.
[102] X. Zhao, D. Lewis, H.-H. S. Lee, and S. K. Lim. Pre-bond Testable
Low-Power Clock Tree Design for 3D Stacked ICs. In Proc. Int. Conf.
on Computer Aided Design, Nov 2009.
152
Vita
Jae-Seok Yang received the B.S. degree in electrical engineering from
Sogang University, Seoul, Korea, in 1997, and the M.S. degree in electrical
engineering and computer science from the University of California, Berkeley,
in 2007. He is currently working toward the Ph.D. degree in electrical and
computer engineering at The University of Texas at Austin. He was with the
Samsung semiconductor research center in Hwasung, Korea, from 1999 to 2005.
Since fall 2010, he has been working for the Samsung semiconductor research
center. He has published more than 14 papers in international conferences and
journals.
Jae-Seok Yang's research interests include nanometer VLSI design for
manufacturability and design automation for 3D-IC design. In particular, he
has worked on algorithms and modeling for robust double patterning
lithography, 3D integration with TSV, signal integrity analysis, statistical
timing analysis, clock tree synthesis, and layout-dependent stress modeling.
He was the recipient of the Best Paper Award at the SOC Design Conference,
Seoul, Korea, in 2002, the Samsung Scholarship in 2005, and the Best Paper
Award at the Asia and South Pacific Design Automation Conference in 2010. He
has served as a reviewer for several international conferences and journals,
including DAC, ICCAD, ASPDAC, SLIP, TCAD, and TVLSI.
Permanent address: Jae-Seok Yang,[email protected]
This dissertation was typeset with LaTeX† by the author.
†LaTeX is a document preparation system developed by Leslie Lamport as a special version of Donald Knuth's TeX Program.