
Clock Clustering and IO Optimization for 3D Integration

Samyoung Bang*, Kwangsoo Han‡, Andrew B. Kahng‡†, and Vaishnav Srinivas‡

‡ECE and †CSE Departments, UC San Diego, La Jolla, CA 92093

*Samsung Electronics Co. Ltd, Hwaseong-si, South Korea

[email protected], {kwhan, abk, vaishnav}@ucsd.edu

Outline
• Motivation
• Power, Area and Timing Model
• P&R and Timing Flow
• Experimental Results
• Conclusion

Motivation
• For 3D integration with large bandwidth needs between dies, the clocking option must be chosen upfront
  • The tradeoff between area and power must be known upfront
  • The choice affects floorplanning

[Figure: inter-die link through 3DIOs, showing serializer, deserializer, and PLL blocks]

Key Choices for Clocking Options
• Local clustering
  • Partition a given region into sub-regions
• Clock synchronization scheme
  • Synchronous
  • Source-synchronous
  • Asynchronous
• 3DIO frequency versus number of 3DIOs

To enable design space pathfinding/exploration:
• Power/area/timing model based on total bandwidth, clustering, synchronization scheme, and 3DIO frequency
• Combine clocking and 3DIO power/area/timing

3DIO Clustering
• Localize the clock tree of the 3D interconnect
• Advantages as the number of clusters increases:
  • Smaller cluster clock tree (smaller skew and jitter)
  • Shorter data paths to the 3DIO array at the center of each cluster
  • Enables efficient clocking schemes (forwarded clock, asynchronous)
• Disadvantages as the number of clusters increases:
  • Overhead to synchronize between clusters on the top die
  • Overhead of a cluster clock 3DIO per cluster

[Figure: the layout of the bottom die, showing the clock entry point, clusters, 3DIO arrays, and data paths]

Synchronization Schemes for 3DIO Clocking
• Synchronous
  • Cluster clock tree is balanced to all F/Fs on both the bottom and the top die
  • Simplest clocking scheme (similar to on-die)
  • Vulnerable to inter-die process/voltage variation (large skews)
• Source-synchronous
  • Clock is forwarded from one die to the other
  • No skew balancing needed across the two dies
  • Requires balance delays (Tb) on the data path within each die to match the clock insertion delay
• Asynchronous
  • Separate clocks on each die
  • FIFO to help clock domain crossing
  • Needs far fewer 3DIOs because embedded-clock and CDR techniques achieve higher speeds

[Figure: (a) Synchronous clocking, (b) Source-synchronous clocking, (c) Asynchronous clocking. Each panel shows launch FFs on the bottom die, the data path, and capture FFs on the top die. Labels: Tclk0, Tclk1, Tdata, Thold_fix, DDR; forwarded clock and balance delay (Tb) in panel (b); serializer, deserializer, cluster clock, IO clock, and recovered IO clock in panel (c)]

Our Work
• Given the choices of clock synchronization scheme, number of clusters, and 3DIO frequency, find the maximum bandwidth of the 3D interconnect under maximum power and area constraints.

[Figure: four plots over the max power constraint and max area constraint axes: Max Achievable BW, Optimal Clocking Scheme for Max BW (Synch., Source-synch., Asynch.), Optimal Clocking Frequency for Max BW, and Optimal Number of Clusters for Max BW]
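The search stated on this slide can be pictured as a sweep over candidate configurations. The following is only a sketch under stated assumptions: `evaluate` stands in for the fitted power/area/bandwidth model, and `toy_model`, the candidate values, and the constraint numbers are hypothetical placeholders, not taken from the slides.

```python
from itertools import product

def max_bandwidth(candidates, evaluate, max_power, max_area):
    """Return (bandwidth, config) of the best feasible candidate, or None."""
    best = None
    for config in candidates:
        bandwidth, power, area = evaluate(config)
        feasible = power <= max_power and area <= max_area
        # Keep the first candidate that reaches the highest bandwidth.
        if feasible and (best is None or bandwidth > best[0]):
            best = (bandwidth, config)
    return best

def toy_model(config):
    """Hypothetical stand-in for the fitted model (not the authors' model)."""
    scheme, n_clusters, freq_mhz = config
    is_async = scheme == "asynchronous"
    bandwidth = n_clusters * freq_mhz / 100.0
    power = n_clusters * freq_mhz / 50.0 + (40.0 if is_async else 0.0)
    area = n_clusters * (2.0 if is_async else 5.0)
    return bandwidth, power, area

schemes = ["synchronous", "source-synchronous", "asynchronous"]
candidates = list(product(schemes, [1, 4, 9], [500, 1000, 2000]))
best = max_bandwidth(candidates, toy_model, max_power=200.0, max_area=30.0)
print(best)
```

A real exploration would swap `toy_model` for the fitted models and repeat the sweep for each (max power, max area) pair to draw the plots above.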


3DIO/CTS Directed Graph
• Primary inputs are indicated by circles
• Rectangles are determined by the primary inputs
• Solid and dotted arrows indicate positive and negative correlation
• Rounded rectangles are estimated as analytic expressions

[Figure: directed graph. Nodes: #Clusters, Freq., Clocking scheme, Region Area, BW, # FFs, IO Freq., Per-IO power/area, # 3DIO, Max skew/transition, Jitter, Clock WL, Clock buf. area, Data WL, Data buf. area, Skew outcome and clock ins. delay, WNS, Area, Power. Legend: Input, Deterministic, Estimated, Est. outcome; arrows: Increase, Decrease]

Clock Wirelength
• Hierarchical approach to estimate clock wirelength
• Assume the clock tree is well balanced because FFs are uniformly distributed over the region area
• The length of a Steiner minimal tree over $N$ points uniformly distributed within a given region $A_{reg}$ is proportional to $\sqrt{N \cdot A_{reg}}$
• Total clock wirelength is

$$W_{clk} = \sum_{i=0}^{max\_depth} w_i = k_C \sqrt{N_{ff} \cdot A_{reg}} + k_g \sqrt{N_C \cdot A_{reg}}$$

Notation
• $i$: depth of clock tree ($i = 0$ for the clock source)
• $N_C$: number of clusters
• $N_{ff}$: total number of flip-flops
• $k_C$, $k_g$: fitted coefficients

[Figure: clock tree levels $i = 0, 1, 2$ with wirelengths $w_0$, $w_1$, $w_2$ down to the FFs, split into a global clock tree and cluster clock trees]
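As a numeric illustration of the hierarchical estimate, the sketch below assumes the total takes the form $W_{clk} = k_C \sqrt{N_{ff} A_{reg}} + k_g \sqrt{N_C A_{reg}}$ (cluster trees plus global tree). The default coefficients of 1.0 are placeholders; real values would come from fitting against P&R data.

```python
import math

def clock_wirelength(n_ff, n_clusters, region_area, k_c=1.0, k_g=1.0):
    """Estimate total clock wirelength from the Steiner-tree scaling law.

    k_c and k_g are fitted coefficients; the defaults are placeholders.
    """
    # All cluster trees together: N_C * sqrt((N_ff/N_C) * (A_reg/N_C)).
    cluster_trees = k_c * math.sqrt(n_ff * region_area)
    # Global tree spanning the N_C cluster entry points over the region.
    global_tree = k_g * math.sqrt(n_clusters * region_area)
    return cluster_trees + global_tree

for n_clusters in (1, 4, 25):
    print(n_clusters, clock_wirelength(10000, n_clusters, 25.0))
```

Under this form only the global-tree term depends on the number of clusters, so the fitted ratio of `k_g` to `k_c` decides how much clustering changes the total.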

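Clock Buffer Area
• Tellez and Sarrafzadeh propose a method to insert the minimum number of buffers under a given transition time ($T_{max\_tran}$) constraint
• Linearize the problem by using the concept of maximum capacitance ($C_{max}$): any buffer stage $i$ with stage capacitance $C_i < C_{max}$ will have $T_{i\_tran} < T_{max\_tran}$
• Using $C_{max}$, we estimate the number of clock buffers ($N_{cbuf}$):

$$N_{cbuf} \cdot C_{max} \ge W_{clk} \cdot C_0 + N_{ff} \cdot C_{gff} + N_{cbuf} \cdot C_{gbuf}$$

$$N_{cbuf} \ge \frac{W_{clk} \cdot C_0 + N_{ff} \cdot C_{gff}}{C_{max} - C_{gbuf}}$$

• Kashyap et al. discuss transition time degradation, from which $C_{max}$ can be expressed as follows:

$$(T_{max\_tran})^2 = (T_0)^2 + k_1 (R_{max} \cdot C_{max})^2$$

$$C_{max} = k_2 \sqrt{(T_{max\_tran})^2 - (T_0)^2}$$

• Total clock buffer area is

[Equation not recoverable from the extraction: $A_{clk}$ in terms of $W_{clk}$, $C_0$, $N_{ff}$, $C_{gff}$, $C_{gbuf}$, $T_{max\_tran}$, and $T_0$, with fitted coefficients $k_1$ and $k_2$]

[Figure: a buffer driving a wire of maximum length $W_{max}$, with input transition $T_0$ and output transition $T_{max\_tran}$]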
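The buffer-count estimate can be sketched as follows, assuming the capacitance budget $N_{cbuf} \cdot C_{max} \ge W_{clk} C_0 + N_{ff} C_{gff} + N_{cbuf} C_{gbuf}$. The symbol meanings used here are assumptions: $C_0$ as wire capacitance per unit length, $C_{gff}$ as flip-flop clock-pin capacitance, and $C_{gbuf}$ as buffer input capacitance. The numbers in the usage lines are made up for illustration.

```python
import math

def clock_buffer_count(w_clk, n_ff, c0, c_gff, c_gbuf, c_max):
    """Smallest integer buffer count that keeps every stage under c_max.

    Solves N * c_max >= w_clk * c0 + n_ff * c_gff + N * c_gbuf for N.
    """
    if c_max <= c_gbuf:
        raise ValueError("c_max must exceed the buffer input capacitance")
    load = w_clk * c0 + n_ff * c_gff
    return math.ceil(load / (c_max - c_gbuf))

# Hypothetical values: 1000 units of wire, 100 FFs, capacitances in fF.
print(clock_buffer_count(1000.0, 100, c0=0.25, c_gff=1.0, c_gbuf=2.0, c_max=52.0))
```

Multiplying the count by a per-buffer area would give an area estimate; the slide's own area expression includes fitted coefficients that this sketch does not model.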

Data Wirelength and Data Buffer Area
• Data path wirelength is proportional to the number of data wires and the cluster dimension:

$$W_{data} = k_0 \cdot N_{ff} \cdot \sqrt{A_{reg} / N_C}$$

• A distribution exists based on sink placement with respect to the 3DIO cluster
• For data buffer area, we use a concept similar to the clock buffer area estimation
  • Each data path must be considered separately; the total wirelength cannot be used
  • A minimum number of data buffers is needed to meet hold timing

[Figure: a cluster with its 3DIO array and a sink, with distances labeled d and p]
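A small sketch of the data wirelength scaling, assuming $W_{data} = k_0 \cdot N_{ff} \cdot \sqrt{A_{reg} / N_C}$, with the cluster dimension taken as the side of a square cluster. The coefficient default and the input values are placeholders.

```python
import math

def data_wirelength(n_ff, n_clusters, region_area, k0=1.0):
    """Estimate total data wirelength; k0 is a fitted coefficient."""
    # Side length of one square cluster when the region is split evenly.
    cluster_dimension = math.sqrt(region_area / n_clusters)
    return k0 * n_ff * cluster_dimension

for n_clusters in (1, 4, 25):
    print(n_clusters, data_wirelength(100, n_clusters, 100.0))
```

With the region area fixed, quadrupling the number of clusters halves the estimate, which matches the shorter data paths expected from clustering.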

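3DIO/Overall Power and Area
• 3DIO power and area models are based on CACTI-IO

$$A_{IO} = N_{IO} \cdot \left(A_0 + \frac{k_0}{\min(R_{ON},\, 2 \cdot R_{TT})}\right) + N_{IO} \cdot \frac{1}{R_{ON}} \cdot (k_1 f + k_2 f^2 + k_3 f^3)$$

$$P_{IO} = N_{IO} \cdot (P_{dyn} + P_{term} + P_{static})$$

• Overall (3DIO + clocking) power and area are

$$A_{total} = k_1 W_{clk} + k_2 W_{data} + A_{clk} + A_{data} + A_{IO}$$

$$P_{total} = (k_3 W_{clk} + k_4 W_{data} + k_5 A_{clk} + k_6 A_{data}) \cdot F_{clk} + k_7 (A_{clk} + A_{data}) + P_{IO}$$

The terms of $P_{total}$ cover switching power, internal and leakage power, and IO power.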
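To show how the pieces combine, here is a minimal sketch of the two simplest roll-ups, assuming $A_{total} = k_1 W_{clk} + k_2 W_{data} + A_{clk} + A_{data} + A_{IO}$ and $P_{IO} = N_{IO} (P_{dyn} + P_{term} + P_{static})$. The coefficient and input values in the usage lines are hypothetical.

```python
def total_area(w_clk, w_data, a_clk, a_data, a_io, k1, k2):
    """Wire area (scaled by fitted k1, k2) plus buffer and 3DIO area."""
    return k1 * w_clk + k2 * w_data + a_clk + a_data + a_io

def io_power(n_io, p_dyn, p_term, p_static):
    """Total 3DIO power from the per-IO power components."""
    return n_io * (p_dyn + p_term + p_static)

# Hypothetical numbers, only to exercise the formulas.
print(total_area(100.0, 200.0, 1.0, 2.0, 3.0, k1=0.01, k2=0.02))
print(io_power(10, p_dyn=1.0, p_term=0.5, p_static=0.25))
```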


3D P&R Flow - Synchronous
• Synchronous
  • Synthesize the cluster clock tree on the top die first to balance the clock tree on both dies
  • Extract the maximum clock insertion delay (Tclk1)
  • Propagate the data path delay (Tdata) for the routing on the top die

Synchronous scheme flow:
1. Gate-level netlist and SDC file generation
2. Custom placement
3. CTS (top die)
4. Extract max Tclk1
5. CTS and route (bottom die)
6. Extract Tdata
7. Route (top die)
8. Report power/area/timing

[Figure: the layout of the top die and the layout of the bottom die, showing the propagated clock, the clock 3DIO, Tclk1, and Tdata]

3D P&R Flow – Source-synchronous and Asynchronous
• Source-synchronous
  • Synthesize the clock tree and route on the bottom die, and separately synthesize only the clock tree for the top die
  • Extract the balance delay Tb (i.e., Tclk1) for each capture FF and annotate the delays to the corresponding data 3DIOs
• Asynchronous
  • Run the traditional 2D flow on both dies separately

Source-synchronous scheme flow:
1. Gate-level netlist and SDC file generation
2. Custom placement
3. CTS and route (bottom die)
4. CTS (top die)
5. Extract Tb
6. Route (top die)
7. Report power/area/timing

[Figure: the layout of the top die and the layout of the bottom die, showing the propagated clock, the clock 3DIO, Tclk1, and the annotation of Tb (i.e., Tclk1)]

Conventional 2D STA vs. our 3D STA

• We focus on inter-die variation and do not consider intra-die variation, which can be comprehended by timing derate or OCV
• Two process corners {BC, WC} for inter-die variation
• Assign the same corner to all paths on the same die
• Report the worst setup WNS out of the four corner combinations (BC-BC, BC-WC, WC-BC, WC-WC)

Conventional 2D STA (without inter-die variation):

Setup slack = Tper – Tsu – T{c2q, WC} – T{data1, WC} – T{data2, WC} + (T{capture, BC} – T{launch, WC})

Our 3D STA:

slack1 = Tper – Tsu – T{c2q, BC} – T{data1, BC} – T{data2, BC} + (T{capture, BC} – T{launch, BC})
slack2 = Tper – Tsu – T{c2q, BC} – T{data1, BC} – T{data2, WC} + (T{capture, WC} – T{launch, BC})
slack3 = Tper – Tsu – T{c2q, WC} – T{data1, WC} – T{data2, BC} + (T{capture, BC} – T{launch, WC})
slack4 = Tper – Tsu – T{c2q, WC} – T{data1, WC} – T{data2, WC} + (T{capture, WC} – T{launch, WC})

Setup slack = min (slack1, slack2, slack3, slack4)

[Figure: timing path with Tlaunch, Tc2q, and Tdata1 through the FF and buffer on the bottom die, and Tdata2 and Tcapture through the buffer and FF on the top die]
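The four-corner check can be sketched in a few lines. This is an illustration of the slack equations on this slide, not the authors' PrimeTime scripts; the dictionary layout and all delay values are hypothetical. Bottom-die terms (c2q, data1, launch) share one corner and top-die terms (data2, capture) share the other.

```python
from itertools import product

def setup_slack(t_per, t_su, bottom, top):
    """Setup slack with one corner per die (bottom launches, top captures)."""
    return (t_per - t_su - bottom["c2q"] - bottom["data1"] - top["data2"]
            + (top["capture"] - bottom["launch"]))

def worst_setup_slack(t_per, t_su, bottom_by_corner, top_by_corner):
    """Evaluate BC-BC, BC-WC, WC-BC, WC-WC and return the worst one."""
    slacks = {}
    for bottom_corner, top_corner in product(("BC", "WC"), repeat=2):
        slacks[(bottom_corner, top_corner)] = setup_slack(
            t_per, t_su, bottom_by_corner[bottom_corner], top_by_corner[top_corner])
    worst = min(slacks, key=slacks.get)
    return worst, slacks[worst]

# Hypothetical delays in ns.
bottom = {"BC": {"c2q": 0.1, "data1": 0.2, "launch": 0.3},
          "WC": {"c2q": 0.2, "data1": 0.4, "launch": 0.6}}
top = {"BC": {"data2": 0.1, "capture": 0.3},
       "WC": {"data2": 0.2, "capture": 0.6}}
print(worst_setup_slack(2.0, 0.05, bottom, top))
```

With these numbers the slow-bottom, fast-top combination is the worst case, because the launch clock and data are slow while the capture clock arrives early.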


Experimental Setup
• P&R tool is Synopsys IC Compiler I-2013.12-SP1
• Timing analysis tool is Synopsys PrimeTime H-2013.06-SP2
• We use a 65nm TSMC library
• Design of experiments
  • Bandwidth (10 – 200 GB/s)
  • Region area (25 – 100 mm2)
  • 3DIO clock frequency
    • Synchronous (100 – 2000 MHz)
    • Source-synchronous (1500 – 4000 MHz)
    • Asynchronous (3500 – 8000 MHz)
  • Number of clusters (1 – 25)
• We select four data points for each parameter, giving 256 design implementations for each clocking scheme
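The count of 256 follows from four parameters with four points each (4^4). A sketch of such a full-factorial grid for the synchronous scheme is shown below; only the range endpoints come from the slide, and the interior points in each list are hypothetical.

```python
from itertools import product

# Range endpoints are from the slide; interior points are placeholders.
doe_points = {
    "bandwidth_gb_per_s": [10, 50, 100, 200],
    "region_area_mm2": [25, 50, 75, 100],
    "io_frequency_mhz": [100, 500, 1000, 2000],  # synchronous range
    "n_clusters": [1, 4, 9, 25],
}

def enumerate_runs(points):
    """Return one dict per design implementation in the full-factorial grid."""
    names = list(points)
    return [dict(zip(names, values)) for values in product(*points.values())]

runs = enumerate_runs(doe_points)
print(len(runs))
```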

Model Fitting Approach
• We use an Artificial Neural Network (ANN) model for our fit, guided by the directed graph
• Iteratively progress through the directed graph to fit each node
  • Clock wirelength
  • Data wirelength
  • Clock buffer area
  • Data buffer area
  • 3DIO power/area
  • Total Area/Power/WNS
• We use Fmax for the timing model (instead of WNS)
• Multiple runs with different training, validation, and test data sets improve the generality and robustness of the resulting models

[Figure: the 3DIO/CTS directed graph, repeated from the "3DIO/CTS Directed Graph" slide]

Area, Power and Timing Model Results
• Min-max error within +/-20%
  • For the synchronous scheme, the tool inserts a large number of hold buffers due to inter-die variation, which leads to larger error
• Mean error within +/-0.5%

[Figure: plots for Area, Power, and Fmax]

Design Space Results
• Max BW:
  • The figure shows the iso-bandwidth curves
  • Vertical and horizontal walls show the minimum power/area required to hit a bandwidth requirement
• Clocking scheme:
  • The asynchronous scheme is area-efficient
  • The synchronous scheme is power-efficient
  • The source-synchronous scheme provides a valuable tradeoff between power and area along the knee of the iso-bandwidth curve
  • The interesting tradeoffs between the schemes occur along these knee points as the power/area constraints change

[Figure: two plots, Max BW and Optimal clocking scheme for Max BW]

Design Space Results
• Cluster clock frequency:
  • As the power constraint gets tighter, frequency goes down
  • As the area constraint gets tighter, frequency goes up
  • Source-synchronous schemes provide benefits at higher cluster frequencies
  • The asynchronous scheme keeps the cluster frequency down while still reaching a high 3DIO frequency through serialization
• Number of clusters:
  • Not monotonic along edges of the hypercube and clocking scheme boundaries
  • Also sensitive to the total region area

[Figure: two plots, Optimal # of Clusters for Max BW and Optimal Cluster Frequency for Max BW]


Conclusion
• We have developed a power, area and timing model for 3DIO and CTS that includes clustering and three different clock synchronization schemes (synchronous, source-synchronous, asynchronous)
• Our model estimates power, area and timing within 20% error across a large range of bandwidths, region areas, numbers of clusters and 3DIO frequencies
• Our modeling methodology will enable architects to study and optimize the design space upfront
• Key takeaways:
  • Iso-bandwidth lines identify the minimum area/power required to hit a particular BW
  • Clocking scheme tradeoffs are interesting along the knee of iso-bandwidth lines
  • Cluster frequency for asynchronous schemes can be kept low while still reducing the number of 3DIOs due to serialization

Future Work
• Extend our model to be aware of
  • Placement uniformity
  • Technology dependence
  • Datapath logic
  • More comprehensive STA including intra-die variation
  • Blockages
  • Asymmetric clustering
  • Different 3DIO placement
  • Serial 3DIO circuit options for asynchronous scheme
• 2.5D (interposer-based) design

Thank you

BACKUP

Synchronous
• All end points on both dies are synchronized
• Colored FFs are uniformly distributed over the region
• Non-colored FFs are placed right next to the 3DIO array
• Clock tree is vulnerable to inter-die variation
• Use DDR to minimize the number of 3DIOs
• Two factors determine the max 3DIO clock frequency (FIO)
  • Clock skew due to inter-die variation
  • Jitter
• Increasing the number of clusters increases the max FIO because the clock tree becomes more robust to inter-die variation

BW (GB/s)   Region Area (mm2)   Ncluster   Fmax (MHz)   FIO (MHz)
12          25                  1          640          1280
11.25       25                  25         900          1800
12          100                 1          300          600
11.25       100                 25         600          1200
50.025      25                  1          460          920
50.625      25                  25         900          1800
200.25      100                 1          300          600
202.5       100                 25         600          1200

Source-Synchronous
• Forward the clock from one die to the other
• For any path across dies, the launch and capture path delays from the source to the 3DIO on the bottom die are balanced, so there is no inter-die variation effect
• Requires balance delay Tb to compensate for the clock insertion delay Tclk1
• Two factors determine the max 3DIO clock frequency (FIO)
  • Skew between Tb and Tclk1 due to intra-die variation
  • Jitter

BW (GB/s)   Region Area (mm2)   Ncluster   Fmax (MHz)   FIO (MHz)
12.095      25                  1          820          1640
10.625      25                  25         1700         3400
50.02       25                  1          820          1640
46.875      25                  25         1500         3000
12          100                 1          500          1000
15          100                 25         1200         2400
200         100                 1          350          700
195         100                 25         1200         2400

Asynchronous
• Use FIFO (1:8 serializer, 8:1 deserializer) to separate the clock domains
  • No inter-die variation
  • Minimize the number of 3DIOs
• Requires a PLL for the cluster clock on the top die and for the IO clock on both dies
  • Large power overhead
• One factor determines the max 3DIO clock frequency (FIO)
  • Jitter

BW (GB/s)   Region Area (mm2)   Ncluster   Fmax (MHz)   FIO (MHz)
11.9        25                  1          700          5600
25          25                  25         1000         8000
49.7        25                  1          700          5600
40          25                  25         800          6400
12          100                 1          400          3200
20          100                 25         800          6400
200         100                 1          125          1000
200         100                 25         500          4000
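Across the three backup tables, FIO is a fixed multiple of Fmax: 2x for the synchronous and source-synchronous schemes (consistent with DDR) and 8x for the asynchronous scheme (consistent with the 8-way serialization). The sketch below encodes that observation; the ratio table is read off the tabulated rows and is not a formula stated on the slides.

```python
# Ratios inferred from the tables (FIO / Fmax), not stated as a formula.
IO_TO_CLUSTER_RATIO = {
    "synchronous": 2,         # DDR
    "source-synchronous": 2,  # DDR
    "asynchronous": 8,        # 8-way serialization
}

def io_frequency_mhz(fmax_mhz, scheme):
    """3DIO frequency implied by a cluster Fmax for a given clocking scheme."""
    return fmax_mhz * IO_TO_CLUSTER_RATIO[scheme]

# Spot checks against one row of each table.
print(io_frequency_mhz(640, "synchronous"))
print(io_frequency_mhz(1700, "source-synchronous"))
print(io_frequency_mhz(700, "asynchronous"))
```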

Flow of Synch. Clocking Schemes

Example: BW: 12 GB/s, Areg: 81 mm2, nc: 4, fclus: 1000 MHz

Flow steps shown:
• Custom placement on bottom/top dies
• CTS, CTO and route on bottom die
• CTS on top die
• Extract delay

• Delay to balance the clock insertion delays across dies:
  • 0.307 – 0.125 = 0.182 ns (bc)
  • 0.618 – 0.247 = 0.371 ns (wc)
• Input delay to prevent unnecessary hold buffer insertions: 0.307 + 0.089 – 0.182 = 0.214 ns

[Figure: bottom and top die paths through cluster buffers, annotated with 0.307 ns (bc) / 0.618 ns (wc), 0.089 ns (bc) / 0.200 ns (wc), 0.125 ns (bc) / 0.247 ns (wc), and 0.147 ns (bc) / 0.306 ns (wc)]
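The arithmetic on this slide can be replayed directly. In the sketch below, 0.307/0.618 ns are read as the bottom-die clock insertion delays and 0.125/0.247 ns as the top-die ones, following the slide's subtraction; treating 0.089 ns as a best-case bottom-die delay added to the launch side is an assumption, and the function name is hypothetical.

```python
def balance_delay(bottom_insertion_ns, top_insertion_ns):
    """Extra delay needed so both dies see the same clock insertion delay."""
    return bottom_insertion_ns - top_insertion_ns

# Values from the slide, in ns.
bc_balance = balance_delay(0.307, 0.125)
wc_balance = balance_delay(0.618, 0.247)

# Input delay used to avoid unnecessary hold buffers (best-case numbers).
input_delay = 0.307 + 0.089 - bc_balance

print(f"{bc_balance:.3f} {wc_balance:.3f} {input_delay:.3f}")
```

The best-case and worst-case balance delays differ by roughly a factor of two, which illustrates why the synchronous scheme is sensitive to inter-die variation.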

Flow of Synch. Clocking Schemes

Flow steps shown:
• CTS on top die
• Extract balance delay
• CTO and route on top die: run CTO and route at the worst corner, considering hold time and clock uncertainty
• STA

Setup: 0.5 (half cycle) + 0.802 (tclk) – 0.075 (tunc) – 0.008 (ts) – 1.195 (tdata) = 0.024 ns

Hold: 0.683 (tdata) – 0.576 (tclk) – 0.060 (tunc) – 0.030 (th) = 0.017 ns

[Figure: top-die view annotated with 0.247 ns (wc) and 0.214 ns; two bottom/top path diagrams through cluster buffers, one annotated with 0.618 ns (wc), 0.200 ns (wc), 0.125 ns (bc), and 0.306 ns (wc), the other with 0.307 ns (bc), 0.089 ns (bc), 0.247 ns (wc), and 0.147 ns (bc); further annotations 0.371 ns (wc), 0.182 ns (bc), 0.071 ns (bc), and 0.140 ns (wc)]

Flow of Source synch. Clocking Schemes

Example: BW: 12 GB/s, Areg: 81 mm2, nc: 4, fclus: 1000 MHz

Flow steps shown:
• Custom placement on bottom/top dies
• CTS, CTO and route on bottom die
• CTS on top die
• Extract delay

• Balance the delay from clock source to data 3DIO and the delay from clock source to clock 3DIO: 0.618 + 0.200 = 0.818 ns
• Annotate the “balancing delay”: 0.247 ns

[Figure: bottom and top die paths through cluster buffers, annotated with 0.618 ns, 0.200 ns, 0.247 ns, and 0.306 ns]

Flow of Source synch. Clocking Schemes

Flow steps shown:
• CTS on top die
• Extract balance delay
• CTO and route on top die: run CTO and route at the worst corner, considering hold time and clock uncertainty
• STA

Setup: 0.5 (half cycle) + 1.371 (tclk) – 0.075 (tunc) – 0.008 (ts) – 1.471 (tdata) = 0.317 ns

Hold: 1.471 (tdata) – 1.371 (tclk) – 0.060 (tunc) – 0.030 (th) = 0.010 ns

[Figure: top-die view annotated with 0.247 ns (wc) and the 0.247 ns balancing delay; bottom/top path diagrams through cluster buffers and balancing delays, annotated with 0.618 ns, 0.200 ns, 0.247 ns, 0.306 ns, 0.818 ns, and 0.100 ns]
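The setup and hold numbers quoted on the flow slides can be cross-checked with a short script. The functions below simply restate the slide arithmetic (half-cycle setup check, same-edge hold check); the function names and the dictionary layout are hypothetical, and all values are in ns as printed on the slides.

```python
def setup_slack_ns(half_cycle, t_clk, t_unc, t_setup, t_data):
    """Half-cycle setup check: available time minus data arrival."""
    return half_cycle + t_clk - t_unc - t_setup - t_data

def hold_slack_ns(t_data, t_clk, t_unc, t_hold):
    """Hold check: data must arrive after clock plus uncertainty and hold."""
    return t_data - t_clk - t_unc - t_hold

slacks = {
    "synchronous": (
        setup_slack_ns(0.5, 0.802, 0.075, 0.008, 1.195),
        hold_slack_ns(0.683, 0.576, 0.060, 0.030),
    ),
    "source-synchronous": (
        setup_slack_ns(0.5, 1.371, 0.075, 0.008, 1.471),
        hold_slack_ns(1.471, 1.371, 0.060, 0.030),
    ),
}

for scheme, (setup, hold) in slacks.items():
    print(f"{scheme}: setup {setup:.3f} ns, hold {hold:.3f} ns")
```

For this example the source-synchronous flow leaves noticeably more setup margin, since its clock and data paths track each other through the balancing delay.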

Date post:	13-Jan-2016