Date post: | 13-Jan-2016 |
Clock Clustering and IO Optimization for 3D Integration
Samyoung Bang*, Kwangsoo Han‡, Andrew B. Kahng‡† and Vaishnav Srinivas‡
‡ECE and †CSE Departments, UC San Diego, La Jolla, CA 92093
*Samsung Electronics Co. Ltd, Hwaseong-si, South Korea
[email protected], {kwhan, abk, vaishnav}@ucsd.edu
2
Outline
• Motivation
• Power, Area and Timing Model
• P&R and Timing Flow
• Experimental Results
• Conclusion
3
Motivation
• For 3D integration with large bandwidth needs between dies, the choice of clocking option needs to be made upfront
• The tradeoff between area and power is needed upfront
• The choice affects floorplanning
[Figure: block diagram with serializer, deserializer, 3DIO and PLL blocks]
4
Key Choices for Clocking Options
• Local clustering
  • Partition a given region into sub-regions
• Clock synchronization scheme
  • Synchronous
  • Source-synchronous
  • Asynchronous
• 3DIO frequency ↔ # of 3DIOs
To enable design space pathfinding/exploration:
• Power/Area/Timing model based on total bandwidth, clustering, synchronization scheme, 3DIO frequency
• Combine clocking and 3DIO power/area/timing
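The link between 3DIO frequency and the number of 3DIOs can be illustrated with a minimal sketch. It assumes (this is not stated on the slides) that every data 3DIO transfers `bits_per_cycle` bits per IO clock cycle and that clock and control IOs are ignored; the function name and parameters are hypothetical.

```python
import math

def num_data_3dio(bandwidth_gbytes_per_s, io_freq_mhz, bits_per_cycle=1):
    """Estimate how many data 3DIOs are needed for a target bandwidth.

    Assumption (not from the slides): each 3DIO moves `bits_per_cycle`
    bits per IO clock cycle; clock and control IOs are not counted.
    """
    total_bits_per_s = bandwidth_gbytes_per_s * 8e9
    bits_per_io_per_s = io_freq_mhz * 1e6 * bits_per_cycle
    return math.ceil(total_bits_per_s / bits_per_io_per_s)

# Same bandwidth, two IO frequencies: the faster IO needs fewer 3DIOs.
low = num_data_3dio(12, 1000)   # 12 GB/s at 1000 MHz
high = num_data_3dio(12, 4000)  # 12 GB/s at 4000 MHz
```

Under these assumptions, quadrupling the IO frequency cuts the 3DIO count by four, which is the area side of the frequency choice.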
5
3DIO Clustering
• Localize the clock tree of the 3D interconnect
• Advantages when the number of clusters increases:
  • Size of cluster clock tree ↓ (smaller skew, jitter)
  • Shorter data paths to the 3DIO array at the center of each cluster
  • Enables efficient clocking schemes (forwarded clock, asynchronous)
• Disadvantages when the number of clusters increases:
  • Overhead to synchronize between clusters on top die
  • Overhead in cluster clock 3DIO per cluster
[Figure: the layout of the bottom die, showing the clock entry point, clusters, 3DIO arrays and data paths]
6
Synchronization Schemes for 3DIO Clocking
• Synchronous
  • Cluster clock tree is balanced to all F/Fs on both the bottom and the top die
  • Simplest clocking scheme (similar to on-die)
  • Vulnerable to inter-die process/voltage variation (large skews)
• Source-synchronous
  • Forwarded clock from one die to another
  • No skew balancing needed across two dies
  • Requires balance delays (Tb) within each die on the data path to match the clock insertion delay
• Asynchronous
  • Separate clocks on each die
  • FIFO to help clock domain crossing
  • Needs a much smaller number of 3DIOs due to the higher speeds achievable with embedded clock and CDR techniques
[Figure: (a) Synchronous clocking: launch FFs on the bottom die, data path, capture FFs on the top die. (b) Source-synchronous clocking: forwarded clock and balance delay (Tb) between launch FFs (bottom) and capture FFs (top). (c) Asynchronous clocking: serializer (cluster clock, IO clock) and deserializer (cluster clock, recovered IO clock). Labels: Tclk0, Tclk1, Tdata, Thold_fix, DDR]
7
Our Work
• Given the choices of clock synchronization scheme, number of clusters and 3DIO frequency, find the maximum bandwidth for the 3D interconnect under maximum power and area constraints.
[Figures: four plots over the max power constraint (x-axis) and max area constraints (y-axis): Max Achievable BW; Optimal Clocking scheme for Max BW (Synch., Source-synch., Asynch.); Optimal Clocking frequency for Max BW; Optimal number of clusters for Max BW]
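The problem statement above can be read as a constrained search. The following is a minimal brute-force sketch under stated assumptions: `toy_model` is a hypothetical stand-in for the fitted power/area model (its coefficients are arbitrary), and the candidate lists are illustrative, not the authors' values.

```python
from itertools import product

def toy_model(scheme, n_clusters, freq_mhz, bw_gbps):
    """Hypothetical stand-in for the fitted model: returns (power, area)
    in arbitrary units. Coefficients are made up for illustration."""
    power_per_gbps = {"sync": 1.0, "async": 2.0}[scheme]
    area_per_gbps = {"sync": 2.0, "async": 1.0}[scheme]
    scale = freq_mhz / 1000.0
    power = bw_gbps * power_per_gbps * scale + n_clusters
    area = bw_gbps * area_per_gbps / scale + n_clusters
    return power, area

def max_bandwidth(model, schemes, clusters, freqs, bandwidths,
                  max_power, max_area):
    """Return (bw, scheme, n_clusters, freq) for the largest feasible
    bandwidth, or None if no candidate meets both constraints."""
    best = None
    for s, c, f, bw in product(schemes, clusters, freqs, bandwidths):
        power, area = model(s, c, f, bw)
        feasible = power <= max_power and area <= max_area
        if feasible and (best is None or bw > best[0]):
            best = (bw, s, c, f)
    return best

best = max_bandwidth(toy_model, ("sync", "async"), (1, 4), (1000, 2000),
                     (10, 20, 40), max_power=50, max_area=50)
```

Sweeping `max_power` and `max_area` over a grid and recording `best` at each point would produce maps of the kind described on this slide.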
9
3DIO/CTS Directed Graph
• Primary inputs are indicated by circles
• Rectangles are determined by the primary inputs
• Solid and dotted arrows indicate positive and negative correlation
• Estimate the rounded rectangles as analytic expressions
[Figure: directed graph. Nodes: # Clusters, Freq., Clocking scheme, Region Area, BW, # FFs, IO Freq., Per-IO power/area, # 3DIO, Max skew/transition, Jitter, Clock WL, Clock buf. area, Data WL, Data buf. area, Skew outcome / clock ins. delay, WNS, Area, Power. Legend: Input, Deterministic, Est. outcome, Estimated; arrows mark Increase / Decrease]
10
Clock Wirelength
• Hierarchical approach to estimate clock wirelength
• Assume the clock tree is well balanced because FFs are uniformly distributed over the region area
• Length of the Steiner minimal tree over N points uniformly distributed within a given region $A_{reg}$ is proportional to $\sqrt{N \cdot A_{reg}}$
• Total clock wirelength is
  $W_{clk} = \sum_{i=0}^{max\_depth} w_i = k_C \sqrt{A_{reg} \cdot N_{ff}} + k_g \sqrt{A_{reg} \cdot N_C}$
Notation
  $i$: depth of clock tree ($i = 0$ for clock source)
  $N_C$: number of clusters
  $N_{ff}$: total number of flip-flops
  $k_C, k_g$: fitted coefficients
[Figure: clock tree levels i = 0, 1, 2 with wirelengths w0, w1, w2, showing the global clock tree and the cluster clock trees driving the FFs]
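As a numeric illustration of the clock wirelength model, here is a minimal sketch. It assumes the Steiner-tree length over N uniformly distributed points in area A scales as sqrt(N·A), so the global tree contributes k_g·sqrt(N_C·A_reg) and the cluster trees together contribute k_C·sqrt(N_ff·A_reg). The function name and the coefficient values are hypothetical; real coefficients would come from fitting.

```python
import math

def clock_wirelength(n_clusters, n_ff, region_area, k_g, k_c):
    """Estimate total clock wirelength as global tree + cluster trees.

    Assumption: tree length over N uniform points in area A ~ sqrt(N * A).
    k_g and k_c are fitted coefficients (hypothetical values used below).
    """
    global_wl = k_g * math.sqrt(n_clusters * region_area)
    # N_C clusters, each with N_ff/N_C FFs in area A_reg/N_C:
    # N_C * sqrt((N_ff/N_C) * (A_reg/N_C)) = sqrt(N_ff * A_reg)
    cluster_wl = k_c * math.sqrt(n_ff * region_area)
    return global_wl + cluster_wl

w_clk = clock_wirelength(n_clusters=4, n_ff=10000, region_area=25.0,
                         k_g=1.0, k_c=1.0)
```

In this form only the global-tree term depends on the number of clusters; the cluster-tree term is independent of it because the per-cluster shrinkage cancels.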
11
Clock Buffer Area
• Tellez and Sarrafzadeh propose a method to insert the minimum number of buffers under a given transition time ($T_{max\_tran}$) constraint
• Linearize the problem by using the concept of maximum capacitance ($C_{max}$): any buffer stage $i$ with stage capacitance $C_i < C_{max}$ will have $T_{i\_tran} < T_{max\_tran}$
• Using $C_{max}$, we estimate the number of clock buffers ($N_{cbuf}$):
  $C_{max} \cdot N_{cbuf} = W_{clk} C_0 + N_{ff} C_{ffg} + N_{cbuf} C_{bufg}$
  $N_{cbuf} = \dfrac{W_{clk} C_0 + N_{ff} C_{ffg}}{C_{max} - C_{bufg}}$
• Kashyap et al. discuss transition time degradation, from which $C_{max}$ can be expressed as follows:
  $T_{max\_tran} = \sqrt{(T_0)^2 + (k_1 R_{max} C_{max})^2}$
  $C_{max} = k_2 \sqrt{(T_{max\_tran})^2 - (T_0)^2}$
• Total clock buffer area is proportional to the buffer count:
  $A_{clk} \propto \dfrac{W_{clk} C_0 + N_{ff} C_{ffg}}{k_2 \sqrt{(T_{max\_tran})^2 - (T_0)^2} - C_{bufg}}$
[Figure: wire of maximum length Wmax, with input transition T0 and output transition Tmax_tran]
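The buffer-count estimate can be sketched numerically. This is a minimal illustration assuming the capacitance balance C_max·N_cbuf = W_clk·C_0 + N_ff·C_ffg + N_cbuf·C_bufg and C_max = k_2·sqrt(T_max_tran² − T_0²); all function names and numeric values are hypothetical and the units are arbitrary.

```python
import math

def max_stage_cap(t_max_tran, t0, k2):
    """C_max from the transition-time budget (k2 is a fitted coefficient)."""
    return k2 * math.sqrt(t_max_tran ** 2 - t0 ** 2)

def num_clock_buffers(w_clk, n_ff, c0, c_ffg, c_bufg, c_max):
    """Solve C_max * N = W_clk*C0 + N_ff*C_ffg + N*C_bufg for N.

    c0: wire capacitance per unit length, c_ffg: FF clock-pin capacitance,
    c_bufg: buffer input capacitance.
    """
    if c_max <= c_bufg:
        raise ValueError("C_max must exceed the buffer input capacitance")
    return (w_clk * c0 + n_ff * c_ffg) / (c_max - c_bufg)

# Hypothetical numbers, for illustration only.
c_max = max_stage_cap(t_max_tran=0.5, t0=0.3, k2=100.0)
n_cbuf = num_clock_buffers(w_clk=1000.0, n_ff=500, c0=0.2,
                           c_ffg=1.0, c_bufg=5.0, c_max=c_max)
```

A tighter transition constraint lowers `c_max`, which raises the estimated buffer count and hence the clock buffer area.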
12
Data Wirelength and Data Buffer Area
• Data path wirelength is proportional to the number of data wires and the cluster dimension:
  $W_{data} = k_0 \cdot N_{ff} \cdot \sqrt{A_{reg} / N_C}$
• Distribution exists based on sink placement with respect to the 3DIO cluster
• For data buffer area, we use a similar concept to clock buffer area estimation
  • Need to consider each data path separately → cannot use total wirelength
  • Need minimum number of data buffers to meet hold timing
[Figure: cluster with a 3DIO array and a sink, with distances d and p]
13
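3DIO/Overall Power and Area
• 3DIO power and area models are based on CACTI-IO
  $A_{IO} = N_{IO} \left( A_0 + \dfrac{k_0}{\min(R_{ON},\, 2 R_{TT})} \right) + N_{IO} \dfrac{1}{R_{ON}} \left( k_1 f + k_2 f^2 + k_3 f^3 \right)$
  $P_{IO} = N_{IO} \left( P_{dyn} + P_{term} + P_{static} \right)$
• Overall (3DIO + clocking) power and area are
  $A_{total} = k_1 W_{clk} + k_2 W_{data} + A_{clk} + A_{data} + A_{IO}$
  $P_{total} = (k_3 W_{clk} + k_4 W_{data} + k_5 A_{clk} + k_6 A_{data}) F_{clk} + k_7 (A_{clk} + A_{data}) + P_{IO}$
  The terms of $P_{total}$ cover switching power, internal and leakage power, and IO power.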
15
3D P&R Flow - Synchronous
• Synchronous
  • Synthesize the cluster clock tree on the top die first to balance the clock tree on both dies
  • Extract maximum clock insertion delay (Tclk1)
  • Propagate the data path delay (Tdata) for the routing on top die
Flow (synchronous scheme): Gate-level netlist and SDC file generation → Custom placement → CTS (top die) → Extract max Tclk1 → CTS and Route (bottom die) → Extract Tdata → Route (top die) → Report power/area/timing
[Figure: the layout of the top die and the layout of the bottom die, showing the clock, the 3DIOs, and the propagated Tclk1 and Tdata]
16
3D P&R Flow – Source-synchronous and Asynchronous
• Source-synchronous
  • Synthesize the clock tree and route on the bottom die, and separately synthesize the clock tree only for the top die
  • Extract balance delay Tb (i.e., Tclk1) for each capture FF and annotate the delays to the corresponding data 3DIOs
• Asynchronous
  • Run traditional 2D flow on both dies separately
Flow (source-synchronous scheme): Gate-level netlist and SDC file generation → Custom placement → CTS and Route (bottom die) → CTS (top die) → Extract Tb → Route (top die) → Report power/area/timing
[Figure: the layout of the top die and the layout of the bottom die, showing the clock, the 3DIOs, the propagated Tclk1 and the annotated Tb (i.e., Tclk1)]
17
Conventional 2D STA vs. our 3D STA
• We focus on inter-die variation, and do not consider intra-die variation, which can be comprehended by timing derate or OCV
• Two process corners {BC, WC} for inter-die variation
• Assign the same corner to the paths on the same die
• Report worst setup WNS out of four combinations of corners (i.e., BC-BC, BC-WC, WC-BC, WC-WC)
[Figure: timing path from an FF on the bottom die through a buffer on the bottom die and a buffer on the top die to an FF on the top die, annotated with Tlaunch, Tc2q, Tdata1, Tdata2 and Tcapture]
Conventional 2D STA (without inter-die variation):
  Setup slack = Tper – Tsu – T{c2q, WC} – T{data1, WC} – T{data2, WC} + (T{capture, BC} – T{launch, WC})
Our 3D STA (setup):
  slack1 = Tper – Tsu – T{c2q, BC} – T{data1, BC} – T{data2, BC} + (T{capture, BC} – T{launch, BC})
  slack2 = Tper – Tsu – T{c2q, BC} – T{data1, BC} – T{data2, WC} + (T{capture, WC} – T{launch, BC})
  slack3 = Tper – Tsu – T{c2q, WC} – T{data1, WC} – T{data2, BC} + (T{capture, BC} – T{launch, WC})
  slack4 = Tper – Tsu – T{c2q, WC} – T{data1, WC} – T{data2, WC} + (T{capture, WC} – T{launch, WC})
  slack = min (slack1, slack2, slack3, slack4)
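The four-corner setup check can be sketched as follows. This is a minimal illustration of the slack1 to slack4 expressions, assuming launch, clock-to-Q and data1 delays belong to the bottom die and data2 and capture delays to the top die; the dictionary layout and the delay values are hypothetical.

```python
from itertools import product

CORNERS = ("BC", "WC")

def setup_slack(t_per, t_su, bottom, top):
    """Setup slack for one corner assignment.

    `bottom` holds bottom-die delays (launch, c2q, data1) and `top`
    holds top-die delays (data2, capture), each at that die's corner.
    """
    return (t_per - t_su - bottom["c2q"] - bottom["data1"] - top["data2"]
            + (top["capture"] - bottom["launch"]))

def worst_setup_slack(t_per, t_su, bottom_delays, top_delays):
    """Return (slack, (bottom_corner, top_corner)) for the worst of the
    four corner combinations."""
    return min(
        (setup_slack(t_per, t_su, bottom_delays[b], top_delays[t]), (b, t))
        for b, t in product(CORNERS, repeat=2)
    )

# Hypothetical delays in ns, for illustration only.
bottom_delays = {
    "BC": {"launch": 0.3, "c2q": 0.05, "data1": 0.1},
    "WC": {"launch": 0.6, "c2q": 0.10, "data1": 0.2},
}
top_delays = {
    "BC": {"data2": 0.1, "capture": 0.3},
    "WC": {"data2": 0.2, "capture": 0.6},
}
worst, corners = worst_setup_slack(1.0, 0.05, bottom_delays, top_delays)
```

With these numbers the worst case is a slow bottom die paired with a fast top die: the launch clock and data arrive late while the capture clock arrives early.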
19
Experimental Setup
• P&R tool is Synopsys IC Compiler I-2013.12-SP1
• Timing analysis tool is Synopsys PrimeTime H-2013.06-SP2
• We use a 65nm TSMC library
• Design of experiments
  • Bandwidth (10 – 200 GB/s)
  • Region area (25 – 100 mm²)
  • 3DIO clock frequency
    • Synchronous (100 – 2000 MHz)
    • Source-synchronous (1500 – 4000 MHz)
    • Asynchronous (3500 – 8000 MHz)
  • Number of clusters (1 – 25)
• We select four data points for each parameter → 256 design implementations for each clocking scheme
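The count of 256 follows from four values for each of four parameters (4⁴). A minimal sketch of such a full-factorial design of experiments is shown below. The slide gives only the parameter ranges, so the evenly spaced values and the cluster counts `[1, 4, 16, 25]` are assumptions made for illustration.

```python
from itertools import product

def four_points(lo, hi):
    """Four evenly spaced values from lo to hi (the spacing is an assumption)."""
    step = (hi - lo) / 3
    return [lo + i * step for i in range(4)]

# 3DIO clock frequency ranges (MHz) per clocking scheme, from the slide.
IO_FREQ_RANGE_MHZ = {
    "synchronous": (100, 2000),
    "source-synchronous": (1500, 4000),
    "asynchronous": (3500, 8000),
}

def design_points(scheme):
    """Full-factorial list of (bandwidth, area, io_freq, n_clusters)."""
    bandwidths = four_points(10, 200)   # GB/s
    areas = four_points(25, 100)        # mm^2
    freqs = four_points(*IO_FREQ_RANGE_MHZ[scheme])
    clusters = [1, 4, 16, 25]           # hypothetical choice within 1-25
    return list(product(bandwidths, areas, freqs, clusters))

doe = {scheme: design_points(scheme) for scheme in IO_FREQ_RANGE_MHZ}
```

Each entry of `doe` would correspond to one design implementation to place, route and time.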
20
Model Fitting Approach
• We use an Artificial Neural Network (ANN) model for our fit, guided by the directed graph
• Iteratively progress through the directed graph to fit each node
  • Clock wirelength
  • Data wirelength
  • Clock buffer area
  • Data buffer area
  • 3DIO power/area
  • Total Area/Power/WNS
• We use Fmax for the timing model (instead of WNS)
• Multiple runs with different training, validation and test data sets → improved generality and robustness of the resulting models
21
Area, Power and Timing Model Results
• Min-max error within ±20%
• For the synchronous scheme, the tool inserts a large number of hold buffers due to inter-die variation → larger error
• Mean error within ±0.5%
[Figures: Area, Power, Fmax]
22
Design Space Results
• Max BW:
  • Figure shows the iso-bandwidth curves
  • Vertical and horizontal walls show the min power/area required to hit a bandwidth requirement
• Clocking scheme:
  • The asynchronous scheme is area-efficient
  • The synchronous scheme is power-efficient
  • The source-synchronous scheme provides a valuable tradeoff between power and area along the knee of the iso-bandwidth curve
  • The interesting tradeoffs between the schemes occur along these knee points as we change the power/area constraints
[Figures: Max BW; Optimal clocking scheme for Max BW]
23
Design Space Results
• Cluster clock frequency:
  • As the power constraint gets tighter, frequency goes down
  • As the area constraint gets tighter, frequency goes up
  • Source-synchronous schemes provide benefits at higher cluster frequencies
  • The asynchronous scheme provides a way to keep the cluster frequency down but still have high 3DIO frequency, through serialization
• Number of clusters:
  • Not monotonic along edges of hypercube and clocking scheme boundaries
  • Also sensitive to the total region area
[Figures: Optimal # of Clusters for Max BW; Optimal Cluster Frequency for Max BW]
25
Conclusion
• We have developed a power, area and timing model for 3DIO and CTS that includes clustering and three different clock synchronization schemes (synchronous, source-synchronous, asynchronous)
• Our model estimates power, area and timing within 20% error across a large range of bandwidths, region areas, numbers of clusters and 3DIO frequencies
• Our modeling methodology will enable architects to study and optimize the design space upfront
• Key takeaways:
  • Iso-bandwidth lines identify the min area/power required to hit a particular BW
  • Clocking scheme tradeoffs are interesting along the knee of iso-bandwidth lines
  • Cluster frequency for asynchronous schemes can be kept low while still reducing the number of 3DIOs due to serialization
26
Future Work
• Extend our model to be aware of
  • Placement uniformity
  • Technology dependence
  • Datapath logic
  • More comprehensive STA including intra-die variation
  • Blockages
  • Asymmetric clustering
  • Different 3DIO placement
  • Serial 3DIO circuit options for asynchronous scheme
• 2.5D (interposer-based) design
Thank you
BACKUP
29
Synchronous
• All end points on both dies are synchronized
• Colored FFs are uniformly distributed over the region
• Non-colored FFs are placed right next to the 3DIO array
• Clock tree is vulnerable to the inter-die variation
• Use DDR to minimize the number of 3DIOs
• Two factors determine the max 3DIO clock frequency (FIO)
  • Clock skew due to the inter-die variation
  • Jitter
• Increasing #clusters increases max FIO because the clock tree becomes more robust to the inter-die variation

BW (GB/s)   Region Area (mm²)   Ncluster   Fmax (MHz)   FIO (MHz)
12          25                  1          640          1280
11.25       25                  25         900          1800
12          100                 1          300          600
11.25       100                 25         600          1200
50.025      25                  1          460          920
50.625      25                  25         900          1800
200.25      100                 1          300          600
202.5       100                 25         600          1200
30
Source-Synchronous
• Forward the clock from one die to the other die
  • For any paths across dies, the launch and capture path delays from source to 3DIO at the bottom die are balanced → no inter-die variation
  • Requires balance delay Tb to compensate clock insertion delay Tclk1
• Two factors determine the max 3DIO clock frequency (FIO)
  • Skew between Tb and Tclk1 due to the intra-die variation
  • Jitter

BW (GB/s)   Region Area (mm²)   Ncluster   Fmax (MHz)   FIO (MHz)
12.095      25                  1          820          1640
10.625      25                  25         1700         3400
50.02       25                  1          820          1640
46.875      25                  25         1500         3000
12          100                 1          500          1000
15          100                 25         1200         2400
200         100                 1          350          700
195         100                 25         1200         2400
31
Asynchronous
• Use FIFO (1:8 serializer, 8:1 deserializer) to separate clock domains
  • No inter-die variation
  • Minimize the number of 3DIOs
• Requires PLL for the cluster clock on the top die and for the IO clock on both dies
  • Large power overhead
• One factor determines the max 3DIO clock frequency (FIO)
  • Jitter

BW (GB/s)   Region Area (mm²)   Ncluster   Fmax (MHz)   FIO (MHz)
11.9        25                  1          700          5600
25          25                  25         1000         8000
49.7        25                  1          700          5600
40          25                  25         800          6400
12          100                 1          400          3200
20          100                 25         800          6400
200         100                 1          125          1000
200         100                 25         500          4000
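In the three backup tables, FIO is a fixed multiple of Fmax for each scheme: 2× for the synchronous and source-synchronous schemes (consistent with DDR) and 8× for the asynchronous scheme (consistent with the 1:8 serializer). A minimal sketch of that relation follows; reading Fmax as the cluster clock frequency is an interpretation of the tables, and the names are hypothetical.

```python
# Ratio of 3DIO frequency to cluster clock frequency, as read off the
# backup tables (2x with DDR, 8x with 1:8 serialization).
IO_CLOCK_MULTIPLIER = {
    "synchronous": 2,
    "source-synchronous": 2,
    "asynchronous": 8,
}

def io_frequency_mhz(scheme, cluster_freq_mhz):
    """3DIO frequency implied by a cluster clock frequency for a scheme."""
    return IO_CLOCK_MULTIPLIER[scheme] * cluster_freq_mhz

# First row of each backup table.
sync_fio = io_frequency_mhz("synchronous", 640)
ssync_fio = io_frequency_mhz("source-synchronous", 820)
async_fio = io_frequency_mhz("asynchronous", 700)
```

This is the sense in which serialization lets the asynchronous scheme keep the cluster frequency low while running the 3DIOs fast.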
32
Flow of Synch. Clocking Schemes
Example: BW: 12GB/s, Areg: 81mm², nc: 4, fclus: 1000MHz
Flow steps: custom placement on bottom/top dies; CTS, CTO and route on bottom die; CTS on top die; extract delay
• Delay to balance the clock insertion delays across dies: 0.307 – 0.125 = 0.182ns (bc), 0.618 – 0.247 = 0.371ns (wc)
• Input delay to prevent unnecessary hold buffer insertions: 0.307 + 0.089 – 0.182 = 0.214ns
[Figure: bottom and top dies with cluster buffers, annotated with delays 0.307ns (bc) / 0.618ns (wc), 0.089ns (bc) / 0.200ns (wc), 0.125ns (bc) / 0.247ns (wc) and 0.147ns (bc) / 0.306ns (wc); steps 1 and 2 are marked]
33
Flow of Synch. Clocking Schemes
Flow steps: CTS on top die; extract balance delay; CTO and route on top die (step 3); STA (step 4)
• Step 3: run CTO and route at the worst corner, considering hold time and clock uncertainty
• Step 4 (STA):
  • Setup: 0.5 (half cycle) + 0.802 (tclk) – 0.075 (tunc) – 0.008 (ts) – 1.195 (tdata) = 0.024ns
  • Hold: 0.683 (tdata) – 0.576 (tclk) – 0.060 (tunc) – 0.030 (th) = 0.017ns
[Figure: bottom and top dies with cluster buffers, annotated with delays 0.307ns (bc) / 0.618ns (wc), 0.089ns (bc) / 0.200ns (wc), 0.125ns (bc) / 0.247ns (wc), 0.147ns (bc) / 0.306ns (wc), 0.071ns (bc) / 0.140ns (wc), balance delay 0.182ns (bc) / 0.371ns (wc) and input delay 0.214ns]
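The setup and hold numbers on this slide can be reproduced with a short sketch. The function and parameter names are hypothetical; the formulas follow the term order printed on the slide (half-cycle capture window, clock delay, uncertainty, setup or hold time, data delay), with all values in ns.

```python
def setup_slack(capture_window, t_clk, t_unc, t_setup, t_data):
    """Setup slack: capture window + clock delay - uncertainty - setup - data."""
    return capture_window + t_clk - t_unc - t_setup - t_data

def hold_slack(t_data, t_clk, t_unc, t_hold):
    """Hold slack: data delay - clock delay - uncertainty - hold."""
    return t_data - t_clk - t_unc - t_hold

# Values from the synchronous-scheme example (ns).
setup = setup_slack(capture_window=0.5, t_clk=0.802, t_unc=0.075,
                    t_setup=0.008, t_data=1.195)
hold = hold_slack(t_data=0.683, t_clk=0.576, t_unc=0.060, t_hold=0.030)
```

Both slacks come out positive, so the example path meets setup and hold timing with the annotated input delay.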
34
Flow of Source-synch. Clocking Schemes
Example: BW: 12GB/s, Areg: 81mm², nc: 4, fclus: 1000MHz
Flow steps: custom placement on bottom/top dies; CTS, CTO and route on bottom die; CTS on top die; extract delay
• Balance the delay from clock source to data 3DIO and the delay from clock source to clock 3DIO: 0.618 + 0.200 = 0.818ns
• Annotate "balancing delay": 0.247ns
[Figure: bottom and top dies with cluster buffers, annotated with delays 0.618ns, 0.200ns, 0.247ns and 0.306ns; steps 1 and 2 are marked]
35
Flow of Source-synch. Clocking Schemes
Flow steps: CTS on top die; extract balance delay; CTO and route on top die (step 3); STA (step 4)
• Step 3: run CTO and route at the worst corner, considering hold time and clock uncertainty
• Step 4 (STA):
  • Setup: 0.5 (half cycle) + 1.371 (tclk) – 0.075 (tunc) – 0.008 (ts) – 1.471 (tdata) = 0.317ns
  • Hold: 1.471 (tdata) – 1.371 (tclk) – 0.060 (tunc) – 0.030 (th) = 0.010ns
[Figure: bottom and top dies with cluster buffers and balancing delays, annotated with delays 0.618ns, 0.200ns, 0.818ns, 0.247ns (wc), 0.306ns and 0.100ns]