Date post: 06-Feb-2018
Stratix™10 High Performance Routable Clock Networks
Carl Ebeling, Dana How, Herman Schmit, David Lewis
Slides: isfpga.org/fpga2016/index_files/Slides/2_2.pdf

In the beginning, there was an FPGA

and clocks were simple

But Moore’s Law was relentless

FPGAs became larger
We needed balanced trees

Clock Trees Became Even Larger

Different Sizes of Fixed Clock Trees

Quadrant clocks
Peripheral clocks
Overlay them all

This Approach Does Not Scale Well

Cost: How many fixed clocks? What size trees?

Performance

Clock Distribution and Clock Loss

Register setup (tsu) and propagation delay (tco)
Clock skew (S)

[Figure: clock period timeline; labels: tco, tsu, S, useful clock]

The Performance Implications of Clock Distribution

Advanced technologies
  Variation: Process, voltage, temperature
  Jitter
  Aging/End of Life

[Figure: clock period timeline; labels: tco, tsu, S, PVT, J, EOL, useful clock]

The Performance Implications of Clock Distribution

And now we want to double the clock frequency!

[Figure: clock period timeline; labels: tco, tsu, S, PVT, J, EOL, useful clock]

The Performance Implications of Clock Distribution

All these factors are a function of the path length
Reducing the clock tree size reduces clock loss

[Figure: clock period timeline; labels: tco, tsu, useful clock]
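The budget pictured on these slides can be read as simple arithmetic: whatever part of the clock period is not consumed by tco, tsu, skew, PVT variation, jitter, and end-of-life aging is left as useful clock. A minimal Python sketch of that reading follows. The additive model, the function name `useful_clock_ps`, and every number in it are illustrative assumptions, not values from the talk.

```python
def useful_clock_ps(period_ps, tco_ps, tsu_ps, skew_ps,
                    pvt_ps=0.0, jitter_ps=0.0, eol_ps=0.0):
    """Return the part of the clock period left for logic and routing.

    Assumes (hypothetically) that all clock-loss terms simply add up.
    """
    clock_loss_ps = tco_ps + tsu_ps + skew_ps + pvt_ps + jitter_ps + eol_ps
    return period_ps - clock_loss_ps


# Made-up overheads in picoseconds, for illustration only.
overheads = dict(tco_ps=100.0, tsu_ps=50.0, skew_ps=80.0,
                 pvt_ps=60.0, jitter_ps=30.0, eol_ps=30.0)

slow = useful_clock_ps(1000.0, **overheads)   # 1000 ps period
fast = useful_clock_ps(500.0, **overheads)    # doubled frequency

print(f"useful clock at 1000 ps period: {slow:.0f} ps ({slow / 1000.0:.0%})")
print(f"useful clock at  500 ps period: {fast:.0f} ps ({fast / 500.0:.0%})")
```

With these made-up numbers the overhead stays at 350 ps while the period halves, so the useful share of the period drops from 65% to 30%.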

Clock Loss in Fixed Trees

Clock loss chasms

Clock Region Alignment to Fixed Trees

Clock loss chasms

[Figure sequence; labels: tco, tsu]

The Advantage of Customized Clock Trees

[Figure panels: Fixed clock trees, Custom clock trees]

Several clock regions can share the same clock plane
Custom clock regions minimize divergent insertion delay and clock loss

Stratix 10 Routable Clocks

Construct any clock tree
  Arbitrary size
  Arbitrary placement
  Support different types of tree: H-tree, fishbone, …

Flexibility has cost and delay implications
  But in the end, we’ll come out ahead

Stratix 10 is an Array of “Sectors”

[Figure label: Sector]

The Routable Clocks are in the Seams between Sectors

Clock Grid

[Figure labels: 32 clock wires, Clock switchbox, Repeaters, Repeaters + clock taps]

Wires are bi-directional

Intra-Sector Clock Distribution

The Routable Clock Challenge

Provide sufficient flexibility
  With a minimum of added cost and delay

Clock “Planes”

Label clock segments 0–31
Each set of segments labeled N forms a plane: 32 clock planes

H-tree can be constructed in one plane
  No crossovers

Intra-Plane Muxing

Simple 3-1 mux on each segment
Small incremental delay
Small cost

Inter-Plane Muxing

Must support cross-overs
  Route from source to tree root
  More complex clock trees

Add one more multiplexer
  Connect plane N+1 to plane N
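One way to picture the two kinds of muxing is as a rule for which segment may drive which. The Python sketch below is a toy model built on assumptions the slides do not state: segments are identified by a hypothetical `(plane, position)` pair, the intra-plane mux is taken to select among adjacent segments of the same plane, and the extra mux lets a segment on plane N take its input from plane N+1 at the same position. The names `Segment`, `can_drive`, and `count_hops` are invented for illustration.

```python
from typing import Dict, List, NamedTuple, Optional

NUM_PLANES = 32  # the slides label clock segments 0-31


class Segment(NamedTuple):
    plane: int     # clock plane index, 0..31
    position: int  # hypothetical index of the segment along the grid


def can_drive(src: Segment, dst: Segment) -> Optional[str]:
    """Classify a src -> dst hop as 'intra', 'inter', or None (not allowed).

    Toy rule set (an assumption, not the device's actual connectivity):
    - intra-plane: same plane, adjacent positions
    - inter-plane: plane N+1 drives plane N at the same position
    """
    for seg in (src, dst):
        if not 0 <= seg.plane < NUM_PLANES:
            raise ValueError(f"plane out of range: {seg.plane}")
    if src.plane == dst.plane and abs(src.position - dst.position) == 1:
        return "intra"
    if src.plane == dst.plane + 1 and src.position == dst.position:
        return "inter"
    return None


def count_hops(path: List[Segment]) -> Dict[str, int]:
    """Count intra- and inter-plane hops along a path; reject illegal hops."""
    counts = {"intra": 0, "inter": 0}
    for src, dst in zip(path, path[1:]):
        kind = can_drive(src, dst)
        if kind is None:
            raise ValueError(f"illegal hop {src} -> {dst}")
        counts[kind] += 1
    return counts


# A source on plane 5 reaching a tree root on plane 3: two plane crossings.
route = [Segment(5, 0), Segment(5, 1), Segment(4, 1), Segment(3, 1), Segment(3, 2)]
print(count_hops(route))
```

In this model a hop from plane N up to plane N+1 is rejected, which mirrors the one-directional "connect plane N+1 to plane N" wording on the slide.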

Clock Tree Synthesis

Identify the clock region
Choose a clock plane and generate a balanced clock tree
Route clock from source to root of clock tree
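The three steps can be strung together in a few lines. The Python sketch below is only an outline under simplifying assumptions: a clock region is taken to be the bounding rectangle (in sector coordinates) of the sectors containing the clock's loads, a plane is chosen by taking the lowest-numbered plane not already marked as used, and the tree root is placed at the centre of the region. None of these choices is stated in the talk, and the names `bounding_region`, `choose_plane`, and `tree_root` are hypothetical.

```python
from typing import Iterable, Set, Tuple

Sector = Tuple[int, int]            # (column, row) of a sector
Region = Tuple[int, int, int, int]  # (x0, y0, x1, y1), inclusive


def bounding_region(load_sectors: Iterable[Sector]) -> Region:
    """Step 1: smallest rectangle of sectors containing every clock load."""
    sectors = list(load_sectors)
    if not sectors:
        raise ValueError("a clock needs at least one load")
    xs = [x for x, _ in sectors]
    ys = [y for _, y in sectors]
    return (min(xs), min(ys), max(xs), max(ys))


def choose_plane(used_planes: Set[int], num_planes: int = 32) -> int:
    """Step 2 (simplified): pick the lowest-numbered free plane."""
    for plane in range(num_planes):
        if plane not in used_planes:
            return plane
    raise RuntimeError("no free clock plane")


def tree_root(region: Region) -> Tuple[float, float]:
    """Step 3 target: the source is routed to the root, here the region centre."""
    x0, y0, x1, y1 = region
    return ((x0 + x1 + 1) / 2, (y0 + y1 + 1) / 2)


loads = [(2, 1), (3, 1), (2, 4), (3, 5)]
region = bounding_region(loads)
plane = choose_plane(used_planes={0, 1})
print(region, plane, tree_root(region))
```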

Non-canonical Clock Trees

Clock regions are generally not square
  Nor 2^n x 2^m

What about trees of arbitrary size and shape?

Clock Tree Synthesis Algorithm

Generate minimal height tree

Example: 2 x 5 clock region

Clock Tree Synthesis – Graphical Representation

1. Cover with smallest enclosing 2^n x 2^n tree
2. Prune tree
3. Remove redundant segments in subtree
4. Reduce tree height

Trade off tree height for less overlap
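Steps 1 and 2 lend themselves to a short recursive sketch. The Python below covers a w x h clock region with the smallest enclosing 2^n x 2^n tree, modeled here as a quadtree whose leaves are single sectors, and prunes every subtree that reaches no sector of the region. It is an illustration under assumptions, not the authors' algorithm: the quadtree stands in for the H-tree, the region is anchored at one corner of the enclosing square, and steps 3 and 4 (removing redundant segments and reducing tree height) are not modeled.

```python
from typing import List, Optional, Tuple


def enclosing_side(width: int, height: int) -> int:
    """Smallest power of two that is >= both dimensions of the region."""
    side = 1
    while side < max(width, height):
        side *= 2
    return side


def build_pruned_tree(x: int, y: int, side: int,
                      width: int, height: int) -> Optional[dict]:
    """Return the subtree for the square at (x, y), or None if pruned.

    The region occupies sectors 0 <= col < width, 0 <= row < height.
    """
    if x >= width or y >= height:
        return None  # square lies entirely outside the region: prune it
    if side == 1:
        return {"leaf": (x, y), "children": []}
    half = side // 2
    children = [
        build_pruned_tree(cx, cy, half, width, height)
        for cx in (x, x + half)
        for cy in (y, y + half)
    ]
    return {"leaf": None, "children": [c for c in children if c is not None]}


def leaf_depths(node: dict, depth: int = 0) -> List[Tuple[Tuple[int, int], int]]:
    """List (sector, depth) for every leaf that survives pruning."""
    if node["leaf"] is not None:
        return [(node["leaf"], depth)]
    result = []
    for child in node["children"]:
        result.extend(leaf_depths(child, depth + 1))
    return result


# The 2 x 5 clock region used as the example on the slides.
width, height = 2, 5
side = enclosing_side(width, height)
tree = build_pruned_tree(0, 0, side, width, height)
leaves = leaf_depths(tree)
print(f"enclosing tree: {side} x {side}, leaves kept: {len(leaves)} of {side * side}")
print("all leaves at the same depth:", len({d for _, d in leaves}) == 1)
```

For the 2 x 5 example the enclosing tree is 8 x 8, and pruning keeps 10 of its 64 leaves, all at depth 3. Because pruning only removes whole subtrees, the surviving leaves stay at equal depth in this model.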

Clock Tree Delay

Most connections go through a single primary mux
  Minimal additional delay

A few go through an additional secondary mux
  Where overlap occurs in the tree

Path delays are inherently balanced
  Leaves are all on plane N, at the same depth
  All paths from source (plane N+k) to leaves go through the same number of primary and the same number of secondary muxes

Clock Insertion Delay

Flexibility does add insertion delay

Bi-directional clock wires add some delay
  Trade off delay for cost

Total delay relative to fixed clocks:
  ~1.3-1.4x delay for same distance

What is the effect on performance?
  Routable clock tree can be much smaller than fixed trees
  But for the same size, fixed trees have less clock loss

Performance Analysis 1

Clock loss modeled as a linear function of insertion delay
Compare performance based on clock tree size
Assumptions:
  Some critical path must cross the worst-case skew chasm

Example:
  Critical path = 760ps: What Fmax can I expect?
  Depends on the size of the clock region
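The comparison can be sketched with a toy model. The Python below follows the slide in treating clock loss as a linear function of insertion delay, and takes the clock period to be the critical-path delay plus the clock loss. The coefficient `LOSS_PER_PS`, the insertion delays, and the 1.35x routable-clock penalty (the midpoint of the ~1.3-1.4x range quoted for routable clocks) are placeholders chosen for illustration, not figures from the talk's model.

```python
LOSS_PER_PS = 0.25       # hypothetical: ps of clock loss per ps of insertion delay
ROUTABLE_PENALTY = 1.35  # midpoint of the ~1.3-1.4x delay quoted for routable clocks


def clock_loss_ps(insertion_delay_ps: float) -> float:
    """Linear clock-loss model: loss grows in proportion to insertion delay."""
    return LOSS_PER_PS * insertion_delay_ps


def fmax_mhz(critical_path_ps: float, insertion_delay_ps: float) -> float:
    """Fmax when the period must cover the critical path plus the clock loss."""
    period_ps = critical_path_ps + clock_loss_ps(insertion_delay_ps)
    return 1e6 / period_ps  # a period of 1e6 ps corresponds to 1 MHz


CRITICAL_PATH_PS = 760.0

# Hypothetical insertion delays: a fixed tree spanning the whole device
# versus a routable tree sized to a small clock region.
fixed_global_ps = 960.0
routable_small_ps = 200.0 * ROUTABLE_PENALTY

print(f"fixed global tree  : {fmax_mhz(CRITICAL_PATH_PS, fixed_global_ps):.0f} MHz")
print(f"small routable tree: {fmax_mhz(CRITICAL_PATH_PS, routable_small_ps):.0f} MHz")
```

With these placeholder numbers the fixed global tree gives 1000 MHz (760 ps plus 240 ps of loss), and the small routable tree comes out faster despite its per-distance penalty. At equal insertion distance the routable tree would be the slower of the two, which is the tradeoff the slides describe.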

Performance vs. Clock Region Size (760ps critical path)

[Plot; clock region sizes shown: 25K LE, 400K LE, 1.6M LE, 3M LE]

Performance Analysis 2

Monte Carlo analysis
Large set of benchmarks
  From previous generation FPGA

Generate a variation profile according to model
Locate a design randomly on the die
Compute Fmax
  1. Using a fixed clock tree
  2. Using a custom clock tree for clock region

Repeat for many designs, variation profiles and placements
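The experiment loop can be outlined as follows. Everything numerical here is invented for illustration: the variation profile is a random linear gradient across a hypothetical 8 x 8 grid of sectors plus per-sector noise, clock loss is taken to be the spread of variation across the sectors a tree spans, and routable trees pay an assumed 1.35x penalty. The talk's actual variation model and benchmarks are not reproduced.

```python
import random

GRID = 8                 # hypothetical die: GRID x GRID sectors
ROUTABLE_PENALTY = 1.35  # assumed extra delay of routable clocks
CRITICAL_PATH_PS = 760.0


def variation_profile(rng):
    """Random gradient plus noise: a stand-in for a correlated variation map."""
    gx, gy = rng.uniform(-5, 5), rng.uniform(-5, 5)
    return [[gx * x + gy * y + rng.gauss(0, 2) for x in range(GRID)]
            for y in range(GRID)]


def spread(profile, x0, y0, w, h):
    """Max minus min variation over a w x h window of sectors."""
    values = [profile[y][x] for y in range(y0, y0 + h) for x in range(x0, x0 + w)]
    return max(values) - min(values)


def one_trial(rng, w, h):
    """Return (Fmax with a fixed die-wide tree, Fmax with a custom tree) in MHz."""
    profile = variation_profile(rng)
    x0 = rng.randrange(GRID - w + 1)   # random placement of the design
    y0 = rng.randrange(GRID - h + 1)
    fixed_loss = spread(profile, 0, 0, GRID, GRID)
    custom_loss = ROUTABLE_PENALTY * spread(profile, x0, y0, w, h)
    return (1e6 / (CRITICAL_PATH_PS + fixed_loss),
            1e6 / (CRITICAL_PATH_PS + custom_loss))


rng = random.Random(0)  # seeded so the run is repeatable
trials = [one_trial(rng, w=2, h=3) for _ in range(1000)]
wins = sum(custom > fixed for fixed, custom in trials)
print(f"custom tree faster in {wins} of {len(trials)} trials")
```

The seeded generator keeps the run repeatable. Swapping in a different variation model only requires replacing `variation_profile`.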

Example Variation

Analysis captures effect of variation correlation

Monte Carlo Experiment

Map design to random locations with fixed clocks
Map using routable clocks

Results – Routable Clocks Reduce Worst Case by 6%

[Scatter plot: Global vs Routable Clocks. x-axis: 99%ile Global Clock Delay (0.92 to 1.08); y-axis: 99%ile Routable Clock Delay (0.92 to 1.02)]

Capacity

Does the clock grid with 32 planes provide enough clocks?
Some designs use well over 100 clocks

Capacity Experiments

Used large Arria 10 designs with large clock demands
Extended the benchmarks
  scaled to Stratix 10
  merged designs
  added many local transceiver clock regions

Mapped each clock region on Arria 10 to corresponding set of sectors on Stratix 10
Used prototype clock tree synthesis tools
  Optimal, balanced clock trees

Generated clock trees using N clock planes
  N = 16 – 32
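The capacity question can be framed as packing clock trees onto planes. The sketch below uses a greedy first-fit assignment in Python: clock regions are hypothetical rectangles in sector coordinates, two trees may share a plane only if their rectangles do not overlap, and source-to-root routes are ignored. This is a simplification for illustration and is not the prototype tool described on the slide. First-fit is also not guaranteed to find the minimum number of planes.

```python
from typing import Dict, List, Optional, Tuple

Rect = Tuple[int, int, int, int]  # (x0, y0, x1, y1), inclusive sector coordinates


def overlaps(a: Rect, b: Rect) -> bool:
    """True if two inclusive rectangles share at least one sector."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def assign_planes(regions: Dict[str, Rect],
                  num_planes: int) -> Optional[Dict[str, int]]:
    """First-fit: give each clock the lowest plane free over its region.

    Returns a clock -> plane map, or None if some clock does not fit.
    """
    per_plane: List[List[Rect]] = [[] for _ in range(num_planes)]
    assignment: Dict[str, int] = {}
    for clock, rect in regions.items():
        for plane, placed in enumerate(per_plane):
            if not any(overlaps(rect, other) for other in placed):
                placed.append(rect)
                assignment[clock] = plane
                break
        else:
            return None  # ran out of planes for this clock
    return assignment


# Hypothetical clock regions: clk_a and clk_b overlap, clk_c is elsewhere.
regions = {
    "clk_a": (0, 0, 3, 3),
    "clk_b": (2, 2, 5, 5),
    "clk_c": (6, 0, 7, 1),
}
print(assign_planes(regions, num_planes=32))
print(assign_planes(regions, num_planes=1))
```

Here `clk_a` and `clk_c` share plane 0 because their regions are disjoint, `clk_b` is pushed to plane 1, and with a single plane the assignment fails and returns `None`.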

Capacity Results

[Chart; labels: Base Benchmarks, Expanded, Expanded/Merged, Merged, 3clocks/Expanded]

Stratix 10 Clock Network Summary

Transition to custom-built clock trees
  Scalability
  Performance

Challenges:
  Provide sufficient flexibility
  Without too much delay overhead

Performance difference averages about 6%
  But can be larger if designs are partitioned well

Future Work

ASIC design methodologies for FPGAs
Many opportunities for CAD tools
  Design partitioning
  Clock partitioning
  Iterative floorplanning, placement, clock tree synthesis
  Skew-aware place and route
  Take advantage of “Interesting” clock tree structures

Buy Stratix 10!

Thank You

Performance Comparison – Fixed vs. Routable Clocks

Infeasible to run real experiments
  No benchmarks, no CAD tools

Model performance based on size of clock region
  Fixed global clocks vs. customized clocks

Change in design methodology
  Global clocks kill performance
  Long recognized in ASICs

Partition design into communicating clock regions
  GALS: Globally Asynchronous, Locally Synchronous
  May be meso-synchronous (same clock, arbitrary skew)

Sectors are large!
  ~25K LEs

Performance vs. Clock Region Size (1200ps critical path)

Performance vs. Clock Region Size (800ps critical path)

Related Work

Very little academic work
  Limited to fixed clocks with routable source-to-root routing

Xilinx Ultrascale clocks
  “Clock region” tiles
    Similar to sectors, but are not uniform
  Two separate clock networks:
    “Routing”: Route source to clock tree
    “Distribution”: Build local clock tree
  Clock tracks run through the center of CRs
  Trees not inherently balanced
  Seems biased towards fishbone clocks: Rib-and-Spine
  Includes tunable delays

