Date post: | 29-Jan-2016 |
Upload: | shonda-jefferson |
EE/CPRE 465
VLSI Design Process

Outline
• Design Partitioning
• Design process: MIPS Processor as an example
– Architecture Design
– Microarchitecture Design
– Logic Design
– Circuit Design
– Physical Design
• Fabrication, Packaging, Testing

Coping with Complexity
• How to design System-on-Chip?
– Many millions (even billions!) of transistors
– Tens to hundreds of engineers
• Structured Design
• Partitioning of Design Process

Structured Design
• Hierarchy: Divide and Conquer
– Recursively partition a system into modules
• Regularity
– Reuse modules wherever possible
– Example: Uniformly sized transistors at circuit level, standard cell library at gate level
• Modularity: well-formed interfaces
– Allows modules to be treated as black boxes
• Locality
– Physical and temporal

Partitioning of Design Process
• Architecture Design: User’s perspective, what does it do?
– Instruction set, register set, and memory model
– MIPS, x86, PIC, ARM, Power, SPARC, Alpha, …
• Microarchitecture Design: how the architecture is partitioned into registers and functional units
– Single cycle, multicycle, pipelined, superscalar?
– For x86: 386, 486, Pentium, PII, PIII, P4, Core, Core 2, Atom, Celeron, Cyrix MII, AMD K5, Athlon, Phenom
• Logic Design: how functional blocks are constructed
– Ripple carry, carry lookahead, carry select adders
• Circuit Design: how transistors are used to implement the logic
– Complementary CMOS, pass transistors, domino
• Physical Design: chip layout

Two Types of Engineers
• “Short and fat” engineers
– Understand a large amount about a narrow field
• “Tall and skinny” engineers
– Understand something about a broad range of topics
• Digital VLSI design favors the tall and skinny engineer
– Can evaluate how choices in one part of the system impact other parts of the system

MIPS Architecture
• Example: subset of MIPS processor architecture
– Drawn from Patterson & Hennessy
• MIPS is a 32-bit architecture with 32 registers
– Consider 8-bit subset using 8-bit datapath
– Only implement 8 registers ($0 - $7)
– $0 hardwired to 00000000
– 8-bit program counter

                     Original MIPS    Simplified MIPS (here)
Data width           32 bits          8 bits
Address width        32 bits          8 bits
# of registers       32               8
Instruction length   32 bits          32 bits

Instruction Set
[Table: instruction set of the MIPS subset; only fragments ("imm", "x4", "101000") survive in this copy]

Instruction Encoding
• 32-bit instruction encoding
– Requires four cycles to fetch on 8-bit datapath
• Note that the destination register is specified by:
– Bits 15:11 for R-type instructions
– Bits 20:16 for addi instruction

format  example              encoding (field widths in bits)
R       add $rd, $ra, $rb    0 (6) | ra (5) | rb (5) | rd (5) | 0 (5) | funct (6)
I       beq $ra, $rb, imm    op (6) | ra (5) | rb (5) | imm (16)
J       j dest               op (6) | dest (26)
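
The three formats can be sketched as bit-packing functions. This is a minimal Python illustration, not part of the original slides: the function names are hypothetical, and the opcode and funct values used in the example are the standard MIPS ones (the slide's own opcode table did not survive extraction).

```python
def encode_r(ra, rb, rd, funct):
    """R-type: 0 (6) | ra (5) | rb (5) | rd (5) | 0 (5) | funct (6)."""
    return (0 << 26) | (ra << 21) | (rb << 16) | (rd << 11) | (0 << 6) | funct


def encode_i(op, ra, rb, imm):
    """I-type: op (6) | ra (5) | rb (5) | imm (16); imm is truncated to 16 bits."""
    return (op << 26) | (ra << 21) | (rb << 16) | (imm & 0xFFFF)


def encode_j(op, dest):
    """J-type: op (6) | dest (26)."""
    return (op << 26) | (dest & 0x03FFFFFF)


def dest_register(word, is_r_type):
    """Destination is bits 15:11 for R-type, bits 20:16 for addi."""
    return (word >> 11) & 0x1F if is_r_type else (word >> 16) & 0x1F


ADD_FUNCT = 0b100000  # standard MIPS funct code for add (assumed here)
ADDI_OP = 0b001000    # standard MIPS opcode for addi (assumed here)

add_word = encode_r(ra=3, rb=4, rd=5, funct=ADD_FUNCT)   # add $5, $3, $4
addi_word = encode_i(ADDI_OP, ra=3, rb=4, imm=-1)        # addi $4, $3, -1
print(f"{add_word:08x} writes ${dest_register(add_word, True)}")
print(f"{addi_word:08x} writes ${dest_register(addi_word, False)}")
```

Because every instruction is 32 bits wide, each encoded word occupies four bytes of the 8-bit memory, which is why the fetch takes four cycles.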

Fibonacci (C)
f_0 = 1; f_{-1} = -1
f_n = f_{n-1} + f_{n-2}
f_1 = 0, f_2 = 1, f_3 = 1, f_4 = 2, f_5 = 3, ...
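
The recurrence above can be checked with a short loop. The slide's own C listing is not present in this copy, so the following Python sketch is an illustrative stand-in, with a hypothetical function name, that uses the same seed values f_0 = 1 and f_{-1} = -1.

```python
def fib(n):
    """Return f_n for n >= 0, using the seeds f_{-1} = -1 and f_0 = 1."""
    f_prev, f = -1, 1              # f_{-1}, f_0
    for _ in range(n):
        f_prev, f = f, f_prev + f  # f_n = f_{n-1} + f_{n-2}
    return f


print([fib(n) for n in range(1, 6)])
```

Starting from these seeds, the first step yields f_1 = 0, so the loop needs no special case for small n.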

Fibonacci (Assembly)
[Figure: assembly language listing]

Fibonacci (Binary)
• Machine language program

Multicycle MIPS Microarchitecture
[Figure: multicycle datapath. Blocks: PC, Memory (Address, Write data, MemData), Instruction register (fields Instruction[31:26], [25:21], [20:16], [15:11], [15:0], [7:0], [5:0]), Memory data register, Registers (Read register 1, Read register 2, Write register, Write data, Read data 1, Read data 2), A, B, ALU (ALU result, Zero), ALUOut, ALU control, Shift left 2, Jump address, and multiplexers. Control outputs, decoded from Op[5:0]: IorD, MemRead, MemWrite, MemtoReg, PCWriteCond, PCWrite, PCEn, IRWrite[3:0], ALUOp, ALUSrcB, ALUSrcA, RegDst, PCSource, RegWrite.]

Multicycle MIPS µ-arch (32-bit Design)

Multicycle Controller
[Figure: controller state diagram with states numbered 0 to 12, entered on Reset. Control outputs recoverable from the figure:]
• Instruction fetch (four states, one per instruction byte), each asserting MemRead, ALUSrcA = 0, IorD = 0, ALUSrcB = 01, ALUOp = 00, PCWrite, PCSource = 00
– The four states assert IRWrite3, IRWrite2, IRWrite1, and IRWrite0 in turn
• Instruction decode/register fetch: ALUSrcA = 0, ALUSrcB = 11, ALUOp = 00
• Memory address computation: ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00
• Memory access (read, Op = 'LB'): MemRead, IorD = 1
• Write-back step: RegDst = 0, RegWrite, MemtoReg = 1
• Memory access (write): MemWrite, IorD = 1
• Execution: ALUSrcA = 1, ALUSrcB = 00, ALUOp = 10
• R-type completion: RegDst = 1, RegWrite, MemtoReg = 0
• Branch completion: ALUSrcA = 1, ALUSrcB = 00, ALUOp = 01, PCWriteCond, PCSource = 01
• Jump completion (Op = 'J'): PCWrite, PCSource = 10

Summary of Steps for Each Instruction Class
(Chapter 5 of Patterson and Hennessy, 32-bit Design)
• Instruction fetch (all instructions; becomes 4 steps in our 8-bit design)
– IR <= Memory[PC]
– PC <= PC + 4
• Instruction decode / register fetch (all instructions)
– A <= Reg[IR[25:21]]
– B <= Reg[IR[20:16]]
– ALUOut <= PC + (sign-extend(IR[15:0]) << 2)
• Execution / address computation / branch/jump completion
– R-type: ALUOut <= A op B
– Load, store: ALUOut <= A + sign-extend(IR[15:0])
– Branch: if (A==B) PC <= ALUOut
– Jump: PC <= {PC[31:28], IR[25:0], 2'b00}
• Memory access / R-type completion
– R-type: Reg[IR[15:11]] <= ALUOut
– Load: MDR <= Memory[ALUOut]
– Store: Memory[ALUOut] <= B
• Memory read completion
– Load: Reg[IR[20:16]] <= MDR

Instructions from ISA Perspective
• Consider each instruction from the perspective of the ISA (Instruction Set Architecture).
• Example: Add instruction
– Instruction specified by the PC.
– Operand registers are specified by bits 25:21 and 20:16 of the instruction.
– New value is the sum of two registers.
– Register written is specified by bits 15:11 of instruction.
Reg[Memory[PC][15:11]] <= Reg[Memory[PC][25:21]] + Reg[Memory[PC][20:16]]
PC <= PC + 4
• In order to accomplish this we must break up the instruction.
– Kind of like introducing variables when programming

Breaking Down an Instruction
• ISA definition of arithmetic:
Reg[Memory[PC][15:11]] <= Reg[Memory[PC][25:21]] + Reg[Memory[PC][20:16]]
• Could break down to:
– IR <= Memory[PC]
– A <= Reg[IR[25:21]]
– B <= Reg[IR[20:16]]
– ALUOut <= A + B
– Reg[IR[15:11]] <= ALUOut
• Don’t forget an important part of the definition of arithmetic!
– PC <= PC + 4
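
The breakdown above can be traced with explicit intermediate registers. This is a minimal Python sketch under stated assumptions: the function name is hypothetical, memory is modelled as a dictionary from byte address to 32-bit word, and only the add instruction is handled.

```python
def run_add_multicycle(memory, reg, pc):
    """Execute one 32-bit MIPS add, one register transfer per line."""
    ir = memory[pc]                    # IR <= Memory[PC]
    pc = pc + 4                        # PC <= PC + 4
    a = reg[(ir >> 21) & 0x1F]         # A <= Reg[IR[25:21]]
    b = reg[(ir >> 16) & 0x1F]         # B <= Reg[IR[20:16]]
    alu_out = (a + b) & 0xFFFFFFFF     # ALUOut <= A + B (32-bit wraparound)
    reg[(ir >> 11) & 0x1F] = alu_out   # Reg[IR[15:11]] <= ALUOut
    return pc


memory = {0: 0x00642820}               # add $5, $3, $4 in standard MIPS encoding
reg = [0] * 32
reg[3], reg[4] = 7, 8
pc = run_add_multicycle(memory, reg, 0)
print(pc, reg[5])
```

Each local variable plays the role of one of the new registers (IR, A, B, ALUOut), which is the sense in which breaking up the instruction resembles introducing variables.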

Idea Behind Multicycle Approach
• We define each instruction from the ISA perspective
• Break it down into steps:
– Balance the amount of work to be done in different steps
– Restrict each cycle to use only one major functional unit
• Introduce new registers as needed
– A, B, ALUOut, MDR, IR, etc.
• Finally, try to pack as much work as possible into each step (avoids unnecessary cycles) while also sharing steps where possible (minimizes control, helps to simplify the solution)
• Result: Our book’s multicycle implementation!

Five Execution Steps
1. Instruction Fetch (becomes 4 steps since we have an 8-bit design)
2. Instruction Decode and Register Fetch
3. Execution, Memory Address Computation, or Branch / Jump Completion
4. Memory Access or R-type Instruction Completion
5. Memory Read Completion
INSTRUCTIONS TAKE FROM 3 - 5 CYCLES! (6-8 cycles in our 8-bit design)

Step 1: Instruction Fetch
• Use PC to get instruction and put it in the Instruction Register.
• Increment the PC by 4 and put the result back in the PC.
• Can be described succinctly using "Register-Transfer Language" (RTL):
IR <= Memory[PC];
PC <= PC + 4;
Can we figure out the values of the control signals?
What is the advantage of updating the PC now?
Becomes 4 steps in our 8-bit design
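
The four-step fetch of the 8-bit design can be sketched as follows. This is an illustrative Python model, not the course's implementation: the function name is hypothetical, and both the little-endian byte order and the increment of one byte per cycle are assumptions that the slides do not state explicitly.

```python
def fetch_instruction(memory, pc):
    """Fetch a 32-bit instruction over an 8-bit memory in four cycles."""
    ir = 0
    for cycle in range(4):
        byte = memory[pc]             # MemRead with IorD = 0 (address from PC)
        ir |= byte << (8 * cycle)     # one IRWrite enable loads one byte of IR
        pc = (pc + 1) & 0xFF          # PCWrite: 8-bit PC advances every cycle
    return ir, pc


# add $5, $3, $4 (0x00642820) stored least significant byte first (assumed)
memory = [0x20, 0x28, 0x64, 0x00] + [0] * 252
ir, pc = fetch_instruction(memory, 0)
print(f"{ir:08x}", pc)
```

After the fourth cycle the PC has advanced by 4 in total, matching PC <= PC + 4 in the 32-bit description.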

Step 2: Instruction Decode and Register Fetch
• Read registers rs and rt in case we need them
• Compute the branch address in case the instruction is a branch
• RTL:
A <= Reg[IR[25:21]];
B <= Reg[IR[20:16]];
ALUOut <= PC + (sign-extend(IR[15:0]) << 2);
• We are not setting any control lines based on the instruction type (we are busy "decoding" it in our control logic)
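
The speculative branch-address computation in this step can be written out directly. A minimal Python sketch for the 32-bit design, with hypothetical helper names; `pc` here is the value already incremented by the fetch step.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc, ir):
    """ALUOut <= PC + (sign-extend(IR[15:0]) << 2), truncated to 32 bits."""
    return (pc + (sign_extend16(ir) << 2)) & 0xFFFFFFFF


# An immediate of -1 points one instruction back from the updated PC.
print(hex(branch_target(0x104, 0x1000FFFF)))
```

The result is computed for every instruction and simply goes unused when the instruction turns out not to be a branch.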

Step 3: Execution, Memory Address Computation, or Branch / Jump Completion (Instruction Dependent)
• ALU is performing one of three functions, based on instruction type
• R-type:
ALUOut <= A op B;
• Memory Reference:
ALUOut <= A + sign-extend(IR[15:0]);
• Branch:
if (A==B) PC <= ALUOut;
• Jump:
PC <= {PC[31:28], IR[25:0], 2'b00}
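
The jump case is a pure concatenation and does not use the ALU. A minimal Python sketch of that concatenation for the 32-bit design; the function name is hypothetical.

```python
def jump_target(pc, ir):
    """PC <= {PC[31:28], IR[25:0], 2'b00}."""
    upper = pc & 0xF0000000            # PC[31:28] stays in place
    index = (ir & 0x03FFFFFF) << 2     # IR[25:0] followed by two zero bits
    return upper | index


# j with a 26-bit field of 5 jumps to word 5 of the current 256 MB region.
print(hex(jump_target(0x10000004, 0x08000005)))
```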

Step 4: Memory Access or R-type Instruction Completion
• Loads and stores access memory
MDR <= Memory[ALUOut];
or
Memory[ALUOut] <= B;
• R-type instructions completion
Reg[IR[15:11]] <= ALUOut;

Step 5: Memory Read Completion
• Reg[IR[20:16]] <= MDR;
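
Logic Design
• Start at top level
– Hierarchically decompose MIPS into units
• Top-level interface
[Figure: top-level interface. A crystal oscillator drives a 2-phase clock generator, which supplies ph1 and ph2 to the MIPS processor; reset is an input. The processor connects to external memory through memread, memwrite, and the 8-bit adr, writedata, and memdata buses.]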

Block Diagram
[Figure: block diagram of the processor, made up of controller, alucontrol, and datapath, with the multicycle datapath schematic repeated alongside.
External signals: ph1, ph2, reset, memdata[7:0], writedata[7:0], adr[7:0], memread, memwrite.
Signals between the blocks: op[5:0], zero, pcen, regwrite, irwrite[3:0], memtoreg, iord, pcsource[1:0], alusrcb[1:0], alusrca, aluop[1:0], regdst, funct[5:0], alucontrol[2:0].]
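
Hierarchical Design
[Figure: design hierarchy. mips at the top contains controller, alucontrol, and datapath. Lower levels shown: standard cell library, bitslice, zipper, alu, and the cells and2, flop, inv4x, mux2, mux4, ramslice, fulladder, nand2, nor2, or2, inv, tri.]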

HDLs
• Hardware Description Languages
– Widely used in logic design
– Verilog and VHDL
• Describe hardware using code
– Document logic functions
– Simulate logic before building
– Synthesize code into gates and layout
• Requires a library of standard cells

Verilog Example
module adder(input  logic [7:0] a, b,
             input  logic       c,
             output logic [7:0] s,
             output logic       cout);
  wire [6:0] carry;  // carry[i] is the carry out of bit i

  // Ripple-carry chain; every instance needs its own name
  fulladder fa0(a[0], b[0], c,        s[0], carry[0]);
  fulladder fa1(a[1], b[1], carry[0], s[1], carry[1]);
  fulladder fa2(a[2], b[2], carry[1], s[2], carry[2]);
  // . . . .
  fulladder fa7(a[7], b[7], carry[6], s[7], cout);
endmodule

module fulladder(input logic a, b, c, output logic s, cout);
  sum   s1(a, b, c, s);      // sum bit
  carry c1(a, b, c, cout);   // carry out
endmodule

module carry(input logic a, b, c, output logic cout);
  assign cout = (a&b) | (a&c) | (b&c);  // majority of the three inputs
endmodule

[Figure: fulladder with inputs a, b, c and outputs s, cout, built from the sum and carry modules]
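
The structure of the Verilog adder can be mirrored in a small software model to check its behavior. This Python sketch is illustrative only: the carry function follows the slide's `assign`, while the XOR used for the sum bit is an assumption, since the `sum` module is not shown.

```python
def carry(a, b, c):
    """Carry out, as in the slide: (a&b) | (a&c) | (b&c)."""
    return (a & b) | (a & c) | (b & c)


def sum_bit(a, b, c):
    """Sum bit; the slide's sum module is not shown, XOR is assumed."""
    return a ^ b ^ c


def fulladder(a, b, c):
    return sum_bit(a, b, c), carry(a, b, c)


def adder(a, b, c, width=8):
    """Ripple-carry adder: bit i's carry out feeds bit i+1's carry in."""
    s = 0
    for i in range(width):
        bit, c = fulladder((a >> i) & 1, (b >> i) & 1, c)
        s |= bit << i
    return s, c


print(adder(200, 100, 0))   # 8-bit sum and carry out
```

The loop variable `c` plays the role of the `carry` wires in the Verilog: each iteration corresponds to one `fulladder` instance.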

Circuit Design
• How should logic be implemented?
– NANDs and NORs vs. ANDs and ORs?
– Fan-in and fan-out?
– How wide should transistors be?
• These choices affect speed, area, power
• Logic synthesis makes these choices for you
– Good enough for many applications
– Hand-crafted circuits are still better

Example: Carry Logic
• assign cout = (a&b) | (a&c) | (b&c);
[Figure: gate-level schematic. AND gates g1, g2, g3 compute x = a·b, y = a·c, z = b·c; OR gate g4 combines x, y, z into cout.]
Gate-level design: 26 transistors, 4 stages of gate delays
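
Example: Carry Logic
• assign cout = (a&b) | (a&c) | (b&c);
[Figure: transistor-level schematic. nMOS transistors n1 to n5 and pMOS transistors p1 to p5 form a compound gate driving cn through internal nodes i1 to i4; n6 and p6 form an inverter from cn to cout.]
Transistor-level design: 12 transistors, 2 stages of gate delays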
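
The two transistor counts can be reproduced with simple arithmetic. The Python sketch below assumes, as is usual for static CMOS, that each AND or OR gate is built as a NAND or NOR followed by an inverter, and that an N-input NAND or NOR uses 2N transistors; the slides state only the totals.

```python
def nand_or_nor(inputs):
    """Static CMOS NAND/NOR: one nMOS and one pMOS per input."""
    return 2 * inputs


INVERTER = 2

# Gate-level design: three AND2 gates feeding one OR3 gate.
and2 = nand_or_nor(2) + INVERTER          # NAND2 + inverter
or3 = nand_or_nor(3) + INVERTER           # NOR3 + inverter
gate_level_transistors = 3 * and2 + or3
gate_level_stages = 2 + 2                 # (NAND, INV) then (NOR, INV)

# Transistor-level design: one compound gate (n1-n5, p1-p5) plus an inverter.
compound_gate = 5 + 5
transistor_level_transistors = compound_gate + INVERTER
transistor_level_stages = 2               # compound gate, then inverter

print(gate_level_transistors, gate_level_stages)
print(transistor_level_transistors, transistor_level_stages)
```

Under these assumptions the totals come out to 26 transistors in 4 stages versus 12 transistors in 2 stages, matching the figures quoted on the slides.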

Gate-level Netlist
module carry(input a, b, c, output cout);
  wire x, y, z;

  and g1(x, a, b);
  and g2(y, a, c);
  and g3(z, b, c);
  or  g4(cout, x, y, z);
endmodule
[Figure: gate-level schematic with gates g1 to g4 and internal nets x, y, z]
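
Transistor-Level Netlist
[Figure: transistor-level schematic with nMOS n1 to n6, pMOS p1 to p6, internal nodes i1 to i4 and cn]
module carry(input a, b, c, output cout);
  wire i1, i2, i3, i4, cn;
  supply0 gnd;  // ground rail (the constant 0 on the slide)
  supply1 vdd;  // power rail (the constant 1 on the slide)

  // nMOS pull-down network for cn
  tranif1 n1(i1, gnd, a);
  tranif1 n2(i1, gnd, b);
  tranif1 n3(cn, i1, c);
  tranif1 n4(i2, gnd, b);
  tranif1 n5(cn, i2, a);
  // pMOS pull-up network for cn
  tranif0 p1(i3, vdd, a);
  tranif0 p2(i3, vdd, b);
  tranif0 p3(cn, i3, c);
  tranif0 p4(i4, vdd, b);
  tranif0 p5(cn, i4, a);
  // output inverter: cout = ~cn
  tranif1 n6(cout, gnd, cn);
  tranif0 p6(cout, vdd, cn);
endmodule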
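
One way to convince yourself that the switch-level netlist implements the carry function is to evaluate its conduction paths for all eight input combinations. The Python sketch below is a hand-derived model of the netlist above, not a general switch-level simulator; the function name is hypothetical.

```python
from itertools import product


def carry_from_netlist(a, b, c):
    """Evaluate cout by following the conducting paths in the netlist."""
    # nMOS conduct when the gate is 1: cn-i1-gnd via n3 and (n1 or n2),
    # or cn-i2-gnd via n5 and n4.
    pull_down = (c and (a or b)) or (a and b)
    # pMOS conduct when the gate is 0: cn-i3-vdd via p3 and (p1 or p2),
    # or cn-i4-vdd via p5 and p4.
    pull_up = ((not c) and ((not a) or (not b))) or ((not a) and (not b))
    # Exactly one network should conduct for any input combination.
    assert bool(pull_down) != bool(pull_up)
    cn = 1 if pull_up else 0
    return 1 - cn                      # n6/p6 invert cn to produce cout


for a, b, c in product((0, 1), repeat=3):
    print(a, b, c, carry_from_netlist(a, b, c))
```

The pull-up and pull-down networks have the same series-parallel shape, which is why exactly one of them conducts for every input.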

SPICE Netlist
.SUBCKT CARRY A B C COUT VDD GND
MN1 I1 A GND GND NMOS W=1U L=0.18U AD=0.3P AS=0.5P
MN2 I1 B GND GND NMOS W=1U L=0.18U AD=0.3P AS=0.5P
MN3 CN C I1 GND NMOS W=1U L=0.18U AD=0.5P AS=0.5P
MN4 I2 B GND GND NMOS W=1U L=0.18U AD=0.15P AS=0.5P
MN5 CN A I2 GND NMOS W=1U L=0.18U AD=0.5P AS=0.15P
MP1 I3 A VDD VDD PMOS W=2U L=0.18U AD=0.6P AS=1P
MP2 I3 B VDD VDD PMOS W=2U L=0.18U AD=0.6P AS=1P
MP3 CN C I3 VDD PMOS W=2U L=0.18U AD=1P AS=1P
MP4 I4 B VDD VDD PMOS W=2U L=0.18U AD=0.3P AS=1P
MP5 CN A I4 VDD PMOS W=2U L=0.18U AD=1P AS=0.3P
MN6 COUT CN GND GND NMOS W=2U L=0.18U AD=1P AS=1P
MP6 COUT CN VDD VDD PMOS W=4U L=0.18U AD=2P AS=2P
CI1 I1 GND 2FF
CI3 I3 GND 3FF
CA A GND 4FF
CB B GND 4FF
CC C GND 2FF
CCN CN GND 4FF
CCOUT COUT GND 2FF
.ENDS

Physical Design
• Floorplan
– Area estimation
• Place & route
– Standard cells
• Datapaths
– Slice planning

Synthesized MIPS Layout
[Figure: layout of the synthesized MIPS processor]

MIPS Floorplan
[Figure: floorplan; dimensions in λ, areas in λ²]
• mips: 2700 x 1690 (4.6 Mλ²)
– datapath: 2700 x 1050 (2.8 Mλ²)
– zipper: 2700 x 250
– bitslice: 2700 x 100
– wiring channel: 30 tracks = 240
– control: 1500 x 400 (0.6 Mλ²)
– alucontrol: 200 x 100 (20 kλ²)
• 3500 x 3500 core area inside a 5000 x 5000 die
• 10 I/O pads on each of the four sides
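
The floorplan numbers are consistent with simple stacking arithmetic, sketched below in Python. The stacking order (datapath, then wiring channel, then control) and the use of 8 bitslices are assumptions drawn from the 8-bit datapath described elsewhere in these slides; all dimensions are taken from the floorplan and are in λ.

```python
WIDTH = 2700
BITSLICE_HEIGHT = 100
ZIPPER_HEIGHT = 250
CONTROL = (1500, 400)
ALUCONTROL = (200, 100)
N_BITSLICES = 8            # assumed: one bitslice per bit of the 8-bit datapath
CHANNEL_TRACKS = 30
CHANNEL_HEIGHT = 240

datapath_height = N_BITSLICES * BITSLICE_HEIGHT + ZIPPER_HEIGHT
track_pitch = CHANNEL_HEIGHT / CHANNEL_TRACKS
mips_height = datapath_height + CHANNEL_HEIGHT + CONTROL[1]

datapath_area = WIDTH * datapath_height
control_area = CONTROL[0] * CONTROL[1]
alucontrol_area = ALUCONTROL[0] * ALUCONTROL[1]
mips_area = WIDTH * mips_height

print(datapath_height, track_pitch, mips_height)
print(datapath_area, control_area, alucontrol_area, mips_area)
```

The computed heights (1050 and 1690) and the rounded areas (2.8 M, 0.6 M, 20 k, and 4.6 M) agree with the annotations on the floorplan, and the channel works out to 8 λ per wiring track.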

Area Estimation
• Need area estimates to make floorplan
– Compare to another block you already designed
– Or estimate from transistor counts
– Budget room for large wiring tracks
– Your mileage may vary!

MIPS Layout
[Figure: MIPS processor layout]

Standard Cells
• Uniform cell height
• Uniform well height
• M1 VDD and GND rails
• M2 access to I/Os
• Well / substrate taps
• Exploits regularity

Synthesized Controller
• Synthesize HDL into gate-level netlist
• Place & Route using standard cell library

Snap-Together Cells
• Synthesized controller area is mostly wires
– Design is smaller if wires run through/over cells
– Smaller = faster, lower power as well!
• Design snap-together cells for datapaths and arrays
– Plan wires into cells
– Pitch matching required
– Connect by abutment
• Exploits locality
• Takes lots of effort
[Figure: cells A, B, C, and D tiled by abutment: a 4 x 4 array of A cells, four B cells, two C cells, and one D cell]

MIPS Datapath
• 8-bit datapath built from 8 bitslices (regularity)
• Zipper at top drives control signals to datapath

MIPS ALU
• Arithmetic / Logic Unit is part of bitslice

Slice Plans
• Slice plan for bitslice
– Cell ordering, dimensions, wiring tracks
– Arrange cells for wiring locality

Design Verification
• Fabrication is slow & expensive
– MOSIS 0.6 µm: $1000, 3 months
– 65 nm: $3M, 1 month
• Debugging chips is very hard
– Limited visibility into operation
• Prove design is right before building!
– Logic simulation
– Circuit simulation / formal verification
– Layout vs. schematic (LVS) comparison
– Design & electrical rule checks (DRC & ERC)
• Verification is > 50% of effort on most chips!
[Figure: verification flow. Specification, Architecture Design, Logic Design, Circuit Design, and Physical Design are checked for equivalence (=) at each step; the checks are labelled Function, Function, Function, and Function/Timing/Power.]

Fabrication & Packaging
• Tapeout final layout
• Fabrication
– 6, 8, 12" wafers
– Optimized for throughput, not latency (10 weeks!)
– Cut into individual dice
• Packaging
– Bond gold wires from die I/O pads to package

Testing
• Test that chip operates
– Design errors
– Manufacturing errors
• A single dust particle or wafer defect kills a die
– Yields from 90% to < 10%
– Depends on die size, maturity of process
– Test each part before shipping to customer
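
The dependence of yield on die size and process maturity can be illustrated with the simple Poisson yield model, Y = exp(-A·D), where A is the die area and D the defect density. The slides do not name a yield model, so the model choice, the function name, and the defect densities below are illustrative assumptions rather than data from the course.

```python
import math


def die_yield(die_area_cm2, defects_per_cm2):
    """Poisson yield model: probability that a die contains no defect."""
    return math.exp(-die_area_cm2 * defects_per_cm2)


# Hypothetical numbers: a small die on a mature process versus
# a large die on an immature process.
print(f"{die_yield(0.25, 0.4):.0%}")
print(f"{die_yield(4.0, 0.6):.0%}")
```

Under this model a larger die or a higher defect density lowers the yield exponentially, which is one way to get the wide spread of yields quoted above.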