ECE 565: VLSI CAD Flow + Brief Intro to HLS, Logic Optimization &
Technology Mapping
Shantanu Dutt
ECE Dept., UIC
VLSI CAD Flow: Overview
• Design entry (e.g., using a hardware description language: Verilog, VHDL, SystemC, or schematic capture).
• High-level synthesis (HLS): produces a network of interconnected modules: functional units [FUs], registers, muxes, demuxes, ...
• Logic optimization: mainly of controller FSMs and other “random” logic; other module designs such as arithmetic FUs, registers, muxes, etc. are well known.
• a) Tech. mapping (mapping small subcircuits to pre-designed cells in the library). b) Physical synthesis (inserting buffers, upsizing gates, deciding on multiple Vdd and Vth). Can be combined with physical design (PD).
• Verification: a) Design rule checking b) Layout vs. schematic verification c) Electrical rule checking d) Electrical simulation and functional (logic) + metric (delay, power) verification.
• Iterate if simulation identifies metric problems (hot spots, timing violation, crosstalk).
System to Silicon Design
[Figure: system-to-silicon design flow with the stages System Requirements, Algorithm, Hardware Architecture, Synthesis, Physical, Design For Test (Control, Observe), Fabricate and Test, and System Integration. The Algorithm stage is illustrated by the transform pair X[k] = Σ_n x[n] e^(-j2πkn/N), x[n] = Σ_k X[k] e^(+j2πkn/N).]
Advanced Reliable Systems (ARES) Lab. Jin-Fu Li, EE, NCU
[Source: MITRE]
VLSI CAD Flow: Overview
Constrained optimization problem: at almost all stages, one metric among power, delay, chip area, and total wire length (WL) is optimized, with constraints (upper bounds) on the others plus on temperature (approx. power density), crosstalk (routing stage), yield/variability (lower bound), etc. These are very complex processes.
Disconnect between higher stages (above physical design) and lower stages (PD and below): at higher stages, metrics that are fully or partly interconnect-based (delay, area, dynamic power) cannot be estimated very accurately ⇒ incorrect decisions that percolate to lower stages (e.g., after routing between an adder and a multiplier, the interconnect length is much larger than anticipated due to, say, congestion and mis-routing) ⇒ some specs (esp. delay) may not be met ⇒ iterate back to higher stages to correct the approximations and re-design.
VLSI CAD Flow: Overview
Course Coverage: with particular focus on HLS and Physical Design
VLSI CAD Flow: Overview
Partitioning cuts up the system into multiple subsystems to minimize some metric (generally the number of wires/nets cut: min-cut partitioning), either for multi-chip implementation or for recursive invocation for placement on a single chip.
Floorplanning places major/large polygon-shaped modules on a chip to minimize chip area, wirelength (WL), delay, or dynamic power, with constraints on the others.
Placement places smaller rectangular cells, whose sizes/shapes are well defined, in sub-chip regions for the same goals as floorplanning (FP).
Routing interconnects cells/modules via horizontal/vertical or 45-degree wires for similar goals as above.
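As a small illustration of the min-cut metric used in partitioning, the sketch below counts the nets cut by a two-way partition. The netlist, the module names, and the function name `cut_size` are hypothetical and chosen only for this example; real partitioners (and their net models) are more elaborate.

```python
def cut_size(nets, side):
    """Count the nets that have pins on both sides of a two-way partition.

    nets: list of sets of module names (one set per net).
    side: dict mapping each module name to 0 or 1.
    """
    return sum(1 for pins in nets if len({side[p] for p in pins}) > 1)


# Hypothetical netlist: each net is the set of modules it connects.
nets = [{"a", "b"}, {"b", "c", "d"}, {"c", "d"}, {"a", "d"}]

part1 = {"a": 0, "b": 0, "c": 1, "d": 1}  # keeps tightly connected modules together
part2 = {"a": 0, "b": 1, "c": 0, "d": 1}  # splits them apart

print(cut_size(nets, part1), cut_size(nets, part2))
```

A min-cut partitioner would prefer `part1` here, since it cuts fewer nets than `part2`.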
ASIC Design Flow (from Advanced VLSI Design, CMPE 641)
Standard Cell Place and Route Flow
Adapted from: CMOS VLSI Design: A Circuits and Systems Perspective, 3rd Edition, Neil Weste et al. © 2005 Pearson Addison-Wesley
A complete chip showing floorplanned modules
• Some modules are custom designed and laid out (e.g., RAMs), while others are laid out automatically using, say, standard cells
Modules
Routing
Another complete chip showing RAMs (custom design) + standard-cell layout
RAMs
Standard-cell based layout (most of the chip)
What HLS Produces by Examples
Hardware Synthesis – Example 1
Programming language statement:
  a := b + c;
Computer code generation (execution in a CPU):
  Load r5 b
  Load r7 c
  ADD r2 r5 r7    (r2 <- r5 + r7)
  Store r2 a
[Figure: CPU datapath with a register file (r2, r5, r7), Read Bus A, Read Bus B, and Write Bus (32 bits each), a mux, and an ALU performing ADD. The control unit issues the control signals Reg_r/w, Reg_addr, Alu_oper_select, and Mux_select.]
Hardware synthesis:
[Figure: registers a, b, c with load signals lda, ldb, ldc, muxes with select signals Mux1_select, Mux2_select, Mux3_select, and an ADDER. The control signals come from a control unit (FSM).]
Hardware Synthesis – Example 2
Programming language construct:
  if (a <= b) then begin
    Block A of code;
  end
  else begin
    Block B of code;
  end
Computer code generation (execution in a computer):
     Load r2 a
     Load r3 b
     SUB r2 r2 r3
     BZ A
     BNEG A
  B: Block B assmb code
     JMP C
  A: Block A assmb code
  C: Block C
Hardware synthesis:
[Figure: registers a and b (load signals lda, ldb), a 2’s compl unit, and an adder with flags zero, sign, and ovfl; a mux (Mux1_select) and a demux (Demux1_select); hardware for Block A and hardware for Block B, each synthesized recursively, between the I/Ps to the code block and the O/Ps of the code block. The control signals come from a control unit (FSM). The adder may be reused for other operations.]
Brief Intro to Logic Optimization & Technology Mapping
Extracted from Notes by:
Srinivas Devadas, MIT
Two-Level Logic Minimization
Can realize an arbitrary logic function in sum-of-products or two-level form
F1 = A B + A B D + A B C D+ A B C D + A B + A B D
F1 = B + D + A C + A C
Of great interest to find a minimum sum-of-products representation
– Solved problem even for functions with 100’s of inputs (variants of Quine-McCluskey)
Two-Level versus Multilevel
2-Level:
f1 = AB + AC + AD
f2 = AB + AC + AE
6 product terms which cannot be shared. 24 transistors in static CMOS.
Multi-level:
Note that B + C is a common term in f1 and f2
K = B + C
f1 = AK + AD
f2 = AK + AE
3 levels. 20 transistors in static CMOS, not counting inverters.
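The claim that extracting K = B + C leaves f1 and f2 unchanged can be checked exhaustively over all 32 input combinations. The following is a minimal sketch; the function names `two_level` and `multi_level` are hypothetical and used only for this check.

```python
from itertools import product


def two_level(a, b, c, d, e):
    """Two-level (sum-of-products) forms of f1 and f2."""
    f1 = (a and b) or (a and c) or (a and d)
    f2 = (a and b) or (a and c) or (a and e)
    return f1, f2


def multi_level(a, b, c, d, e):
    """Multi-level forms sharing the common term K = B + C."""
    k = b or c
    f1 = (a and k) or (a and d)
    f2 = (a and k) or (a and e)
    return f1, f2


# Compare the two realizations on every input assignment.
equivalent = all(
    two_level(*v) == multi_level(*v) for v in product([False, True], repeat=5)
)
print(equivalent)
```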
Technologies
“Closed book”: gate-array, standard-cell
“Open book”: CMOS Domino, complex gate static CMOS
[Flow: LOGIC EQUATIONS → TECHNOLOGY-INDEPENDENT OPTIMIZATION (Factoring, Commonality Extraction) → TECH-DEPENDENT OPTIMIZATION (MAPPING, TIMING), using the LIBRARY → OPTIMIZED LOGIC NETWORK]
Tech.-Independent Optimization
Involves:
• Minimizing two-level logic functions.
• Finding common subexpressions.
• Substituting one expression into another.
• Factoring single functions.
Factored versus Disjunctive forms
Sum-of-products or disjunctive form: f = ac + ad + bc + bd + ae
Factored form (multi-level or complex gate): f = (a + b)(c + d) + ae
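A quick way to see what factoring buys is to compare literal counts while confirming that both forms compute the same function. The sketch below does both for the example above; counting literals by counting letters in the expression text is a simplification adopted only for this illustration, and the names `sop`, `factored`, and `literal_count` are hypothetical.

```python
from itertools import product

SOP_TEXT = "ac + ad + bc + bd + ae"
FACTORED_TEXT = "(a + b)(c + d) + ae"


def sop(a, b, c, d, e):
    """Sum-of-products (disjunctive) form."""
    return (a and c) or (a and d) or (b and c) or (b and d) or (a and e)


def factored(a, b, c, d, e):
    """Factored form."""
    return ((a or b) and (c or d)) or (a and e)


def literal_count(expr):
    """Number of literals, counted as variable occurrences in the text."""
    return sum(ch.isalpha() for ch in expr)


same_function = all(
    sop(*v) == factored(*v) for v in product([False, True], repeat=5)
)
print(same_function, literal_count(SOP_TEXT), literal_count(FACTORED_TEXT))
```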
Optimizations
(X' denotes the complement of X.)
F:
  f1 = AB + AC + AD + AE + A'B'C'D'E'
  f2 = AB + AC + AD + AF + A'B'C'D'F'
Factor F:
  f1 = A(B + C + D + E) + A'B'C'D'E'
  f2 = A(B + C + D + F) + A'B'C'D'F'
Extract common expression (G):
  g1 = B + C + D
  f1 = A(g1 + E) + A'E'g1'
  f2 = A(g1 + F) + A'F'g1'
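The factoring and extraction steps can be sanity-checked by exhaustive simulation. The sketch below assumes that the last product term of f1 and f2 consists entirely of complemented literals (A'B'C'D'E' and A'B'C'D'F'), which is the reading under which g1' = B'C'D' makes the extracted form match the original; the function names are hypothetical.

```python
from itertools import product


def original(A, B, C, D, E, F):
    """Sum-of-products forms (assumed: last term is all-complemented)."""
    tail = not A and not B and not C and not D
    f1 = (A and B) or (A and C) or (A and D) or (A and E) or (tail and not E)
    f2 = (A and B) or (A and C) or (A and D) or (A and F) or (tail and not F)
    return f1, f2


def extracted(A, B, C, D, E, F):
    """Forms after extracting the common expression g1 = B + C + D."""
    g1 = B or C or D
    f1 = (A and (g1 or E)) or (not A and not E and not g1)
    f2 = (A and (g1 or F)) or (not A and not F and not g1)
    return f1, f2


# Compare both versions on all 64 input assignments.
all_equal = all(
    original(*v) == extracted(*v) for v in product([False, True], repeat=6)
)
print(all_equal)
```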
What Does “Best” Mean?
Transistor count → AREA
Number of circuits → POWER
Number of levels → DELAY (speed)
Need quick estimators of area, delay and power which are also accurate
Tech.-Dependent Optimization
Area, delay and power dissipation cost functions
[Flow: OPTIMIZED LOGIC EQUATIONS → TECHNOLOGY MAPPING (inputs: LIBRARY, TIMING CONSTRAINTS) → GATE NETLIST]
“Closed Book” Technologies
A standard cell technology or library is typically restricted to a few tens of gates, e.g., MSU library: 31 cells
Gates may be NAND, NOR, NOT, AOIs.
[Figure: example library gates with inputs A, B, C, including a gate labeled AB+C.]
Mapping via DAG Covering
Represent network in canonical form ⇒ subject DAG
Represent each library gate with canonical forms for the logic function ⇒ primitive DAGs
Each primitive DAG has a cost
Goal: Find a minimum cost covering of the subject DAG by the primitive DAGs
Canonical form: 2-input NAND gates and inverters
Sample Library
INVERTER 2
NAND2 3
NAND3 4
NAND4 5
Sample Library - 2
AOI21 4
AOI22 5
Standard cell library
• For each cell (e.g., NANDs, NORs, INVs, AOIs)
– Functional information
– Timing information
• Output slew
• Intrinsic delay
• Input capacitance
– Physical footprint
– Power characteristics (leakage power)
Trivial Covering
Reduce netlist into ND2 gates → subject DAG
Gate     Count   # pins      # transistors
NAND2    7       21          28
INV      5       10          10
Total            31          38
                 (pin cost)  (area cost)
Covering #1
Gate     Count   # pins      # transistors
INV      2       4           4
NAND2    2       6           8
NAND3    1       4           6
NAND4    1       5           8
Total            19          26
                 (pin cost)  (area cost)
Covering #2
Gate     Count   # pins      # transistors
INV      1       2           2
NAND2    1       3           4
NAND3    2       8           12
AOI21    1       4           ?
Total            17          ?
                 (pin cost)  (area cost)
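The cost totals of the three coverings follow directly from the per-gate costs. The sketch below recomputes them; the per-gate pin and transistor counts are the ones implied by the covering tables, the AOI21 transistor count is deliberately left out because the slide leaves it open, and the function name `cover_cost` is hypothetical.

```python
# Per-gate costs as implied by the covering tables.
PINS = {"INV": 2, "NAND2": 3, "NAND3": 4, "NAND4": 5, "AOI21": 4}
TRANSISTORS = {"INV": 2, "NAND2": 4, "NAND3": 6, "NAND4": 8}  # AOI21 left open


def cover_cost(cover, table):
    """Total cost of a covering given as {gate name: count}."""
    return sum(count * table[gate] for gate, count in cover.items())


trivial = {"NAND2": 7, "INV": 5}
covering1 = {"INV": 2, "NAND2": 2, "NAND3": 1, "NAND4": 1}
covering2 = {"INV": 1, "NAND2": 1, "NAND3": 2, "AOI21": 1}

print("trivial:    ", cover_cost(trivial, PINS), cover_cost(trivial, TRANSISTORS))
print("covering #1:", cover_cost(covering1, PINS), cover_cost(covering1, TRANSISTORS))
print("covering #2:", cover_cost(covering2, PINS), "(area cost left open)")
```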
DAG Covering
• Sound algorithmic approach
• NP-hard optimization problem
• Tree covering heuristic: if the subject and primitive DAGs are trees, an efficient algorithm can find the optimum cover in linear time ⇒ dynamic programming formulation
[Figure: subject DAG with the multiple-fanout points marked.]
Partitioning a Graph
• Partition input netlist into a forest of trees
• Solve each tree optimally
• Stitch trees back together
• Are there any issues to take care of when doing TM for each individual tree so that the stitching back can be done? (This is really a bookkeeping issue rather than an intrinsically algorithmic one.)
Resulting Trees
Break at multiple fanout points
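Breaking at multiple-fanout points amounts to finding the gate outputs that become tree roots: the primary outputs plus every gate whose output feeds more than one gate input. The following is a minimal sketch under the assumption that the netlist is given as a dictionary from each gate to its list of fanin signals; the netlist and the function name `tree_roots` are hypothetical.

```python
from collections import Counter


def tree_roots(fanins, primary_outputs):
    """Return the gate outputs at which the DAG is broken into trees.

    fanins: dict mapping each gate name to the list of signals it reads.
    primary_outputs: iterable of gate names that drive circuit outputs.
    """
    fanout = Counter(src for ins in fanins.values() for src in ins)
    roots = set(primary_outputs)
    # A gate whose output is read more than once starts its own tree.
    roots.update(g for g in fanins if fanout[g] > 1)
    return roots


# Hypothetical netlist: g1 fans out to both g2 and g3; a..d are primary inputs.
fanins = {
    "g1": ["a", "b"],
    "g2": ["g1", "c"],
    "g3": ["g1", "d"],
    "g4": ["g2", "g3"],
}
print(sorted(tree_roots(fanins, ["g4"])))
```

In this sketch, the output of each root other than a primary output would be treated as a leaf input by the trees that read it.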
Dynamic Programming
Principle of optimality: Optimal cover for a tree consists of a match at the root of the tree plus the optimal cover for the sub-trees starting at each input of the match
[Figure: two candidate matches at the root of a tree. The best cover for the first match uses the best covers for x, y, z; the best cover for the second match uses the best covers for p, z.]
Optimum Tree Covering
NAND2 3
AOI21 4 + 3 = 7
INV 11 + 2 = 13
NAND2 2 + 6 + 3 = 11
NAND2 3 + 3 = 6
NAND2 3
INV 2
Is it a good idea to use the largest possible “cover” or “cut” going top-down or bottom-up (this may not be visible in this small example)?
DP Algorithm for Tech. Mapping of Tree Circuits
Procedure DP_TM(p: gate output, G: circuit); /* G is a tree circuit */
  if p is a primary input then return(0, empty solution);
  if DP_TM(p, G) is “stored” then return(its cost, corresp. solution);
  Going backward from p to its fanins, produce all possible “cuts” (that separate the subckt w/ p as o/p from the rest of the circuit) so that the # of wires cut <= max # of i/ps available for cells/gates in the library;
  cost = infinity;
  For each such cut Ci do
    if the subckt containing p is mappable to some gate gi in the library AND does not include any “forbidden” interconnects then begin
      cost(Ci) = cost(gi) + Σ over qj in Ci of DP_TM(qj, G)[1]; /* [1] is the cost component of the result; if DP_TM(qj, G) has already been investigated, it will be “stored”, and there is no need to execute this procedure again */
      if cost(Ci) < cost then { cost = cost(Ci); solution = corresponding soln. }
    end if;
  End for;
  store (cost, solution) for p;
  return(cost, solution);
end DP_TM;
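The procedure above can be sketched in Python for subject trees in the canonical NAND2/INV form. This is a simplified illustration, not the slides' implementation: instead of enumerating cuts, it matches library patterns structurally at each node, and memoization plays the role of the “stored” results. The tuple encoding, the small library (with pin costs taken from the sample library), and the names `matches` and `best_cover` are assumptions made for this example.

```python
from functools import lru_cache

# Canonical form: ("NAND", x, y) or ("INV", x). A string is a primary input
# in a subject tree, or a leaf variable in a library pattern.
LIBRARY = [
    ("INV", 2, ("INV", "x")),
    ("NAND2", 3, ("NAND", "x", "y")),
    ("NAND3", 4, ("NAND", ("INV", ("NAND", "x", "y")), "z")),
    ("AOI21", 4, ("INV", ("NAND", ("NAND", "x", "y"), ("INV", "z")))),
]


def matches(pattern, node):
    """Yield, for each way pattern fits at node, the subtrees at its leaves."""
    if isinstance(pattern, str):
        yield [node]
    elif not isinstance(node, str) and pattern[0] == node[0]:
        if pattern[0] == "INV":
            yield from matches(pattern[1], node[1])
        else:  # NAND is commutative: try both input orders
            for a, b in ((node[1], node[2]), (node[2], node[1])):
                for left in matches(pattern[1], a):
                    for right in matches(pattern[2], b):
                        yield left + right


@lru_cache(maxsize=None)  # plays the role of the "stored" results
def best_cover(node):
    """Return (minimum cost, gates used) for the tree rooted at node."""
    if isinstance(node, str):  # primary input: nothing to cover
        return 0, ()
    best = (float("inf"), ())
    for name, cost, pattern in LIBRARY:
        for leaves in matches(pattern, node):
            total, gates = cost, (name,)
            for leaf in leaves:
                sub_cost, sub_gates = best_cover(leaf)
                total += sub_cost
                gates += sub_gates
            if total < best[0]:
                best = (total, gates)
    return best


# Hypothetical subject tree: NAND of (a AND b) with an AOI21-shaped subtree.
aoi = ("INV", ("NAND", ("NAND", "c", "d"), ("INV", "e")))
tree = ("NAND", ("INV", ("NAND", "a", "b")), aoi)
total_cost, used_gates = best_cover(tree)
print(total_cost, used_gates)
```

For this tree the trivial cover (one library gate per NAND2/INV node) would cost 18, while the sketch finds a NAND3 at the root over an AOI21 for a total pin cost of 8.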
Example Cuts in DP_TM
[Figure: example cuts C1, C2, C3, C4 in a circuit.]
Recursive Calls for DP_TM
[Figure: the complement subcircuits C1’, C2’, C3’, C4’ on which DP_TM is called recursively.]
Ci’ is the set of gate o/ps that are in the “complement” (remainder) subckt. of G generated by cut Ci
DP Algorithm for Tech. Mapping of Tree Circuits
Analysis:
• Optimality?
• Runtime analysis:
– # of subsets (cuts) that can be generated from a gate g’s o/p q that include g and include fan-ins only up to a fan-in size of m (= max. number of i/ps among all cells in the library): if k is the # of all gate i/ps that can be enclosed by a cut across all valid cuts w/ o/p at q, then # of valid cuts = O(2^k), and k itself is O(2^m), and thus # of valid cuts is O(2^(2^m)). However, since m is a constant like 6, the # of valid cuts per gate o/p turns out to be a medium-size constant in the range 20-50.
– # of 1st time calls of DP_TM for each gate o/p
– Total # of DP_TM calls
• What if the circuit is not a tree: how to process, optimal or non-optimal?
• How to obtain optimality for a general DAG and at the cost of how much increased runtime?
What about delay optimization?
[Figure: root cut C1 with child cuts C2, C3, C4. Labels: Cin(g(C1)); Rout(g(C2)), Rout(g(C3)), Rout(g(C4)); intrinsic delays d(g(C1)), d(g(C2)), d(g(C4)); delays D1, D2, D3, with D1 marked as the delay till the root cut C2 i/p.]
• Each cell g has an o/p resistance Rout(g), input capacitance Cin(g) and intrinsic delay d(g).
• Delay at the C1 input = max[D1 + Rout(g(C2))*Cin(g(C1)), D2 + Rout(g(C3))*Cin(g(C1)), D3 + Rout(g(C4))*Cin(g(C1))]
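The delay expression is a maximum over the child cuts of the delay accumulated so far plus an RC term for driving the root cell's input. A minimal sketch follows; the numeric values are made up for illustration, units are arbitrary, and the function name `arrival_at_input` is hypothetical.

```python
def arrival_at_input(children, cin_root):
    """Delay at the root cell's input.

    children: list of (D_i, Rout_i) pairs, one per child cut, where D_i is the
              delay accumulated up to that child and Rout_i is the output
              resistance of the cell chosen for it.
    cin_root: input capacitance Cin of the cell chosen for the root cut.
    """
    return max(d + rout * cin_root for d, rout in children)


# Hypothetical values for the child cuts C2, C3, C4 driving root cut C1.
children = [(5.0, 2.0), (4.0, 3.0), (6.0, 1.0)]
cin_c1 = 0.5
print(arrival_at_input(children, cin_c1))
```

Because the RC term depends on the cell chosen for the root cut, the best choice for a child cut can depend on its parent, which is what makes delay harder to optimize than area or pin cost.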
Extension to DAGs: What is the problem?
[Figs. 1 and 2: node u with multiple fanouts, cuts Ci, Cj, Ck, and a cut Cr.]
• 3 different calls, one for each fanout, to TM of DAG rooted at u (TM(u)).
• Will the solution for TM(u) that leads to the best overall solution for the entire problem be affected by which cut/cover TM(u) is called for when the optimization metric is:
a) Area or pin cost? b) Delay?
Appendix (not necessary to consider): Tackling fanouts in DAGs
Black arcs represent the parent-child relationship among cuts in a D&C context.
The only way a sibling cut can change the subcircuit rooted at u is if it encloses a “predecessor” of u, as the cut Cr does in Fig. 2. But such a cut can only be valid if it cuts only one gate o/p, which has to be the one for its root node, which is not the case here (Cr is invalid as it cuts two node o/ps). Such a “changing” cut is valid only if its root node is also a predecessor of u, as in Fig. 3.
In such cases we will need to do as many TM calls of the subcircuit subckt(u) rooted at u as there are changes to it made by other cuts, and this can potentially be exponential over all such combinations of TMs of nodes similar to u. Practically there may be very few such “changing” cuts of subckt(u), but even if there are 2, we need to consider the combination of all such changed subckt(u)’s for all u’s w/ fanouts, leading to an exponential number of combinations if the nodes w/ fanouts are Θ(n), where n is the total # of nodes. If the number of such nodes is a constant, then we get a linear-complexity TM algorithm as in the tree case.
[Figs. 1-3: node u with fanouts, cuts Ci, Cj, Ck, and a sibling cut Cr. Fig. 3: two cuts Cr and Cr’ that “change” subckt(u).]