
EE/CPRE 465

VLSI Design Process

Outline

• Design Partitioning
• Design process: MIPS Processor as an example
  – Architecture Design
  – Microarchitecture Design
  – Logic Design
  – Circuit Design
  – Physical Design
• Fabrication, Packaging, Testing

Coping with Complexity

• How to design System-on-Chip?
  – Many millions (even billions!) of transistors
  – Tens to hundreds of engineers
• Structured Design
• Partitioning of Design Process

Structured Design

• Hierarchy: Divide and Conquer
  – Recursively partition a system into modules
• Regularity
  – Reuse modules wherever possible
  – Example: uniformly sized transistors at circuit level, standard cell library at gate level
• Modularity: well-formed interfaces
  – Allows modules to be treated as black boxes
• Locality
  – Physical and temporal

Partitioning of Design Process

• Architecture Design: user's perspective, what does it do?
  – Instruction set, register set, and memory model
  – MIPS, x86, PIC, ARM, Power, SPARC, Alpha, …
• Microarchitecture Design: how the architecture is partitioned into registers and functional units
  – Single cycle, multicycle, pipelined, superscalar?
  – For x86: 386, 486, Pentium, PII, PIII, P4, Core, Core 2, Atom, Celeron, Cyrix MII, AMD K5, Athlon, Phenom
• Logic Design: how functional blocks are constructed
  – Ripple carry, carry lookahead, carry select adders
• Circuit Design: how transistors are used to implement the logic
  – Complementary CMOS, pass transistors, domino
• Physical Design: chip layout

Two Types of Engineers

• “Short and fat” engineers
  – Understand a large amount about a narrow field
• “Tall and skinny” engineers
  – Understand something about a broad range of topics
• Digital VLSI design favors the tall and skinny engineer
  – Can evaluate how choices in one part of the system impact other parts of the system

MIPS Architecture

• Example: subset of MIPS processor architecture
  – Drawn from Patterson & Hennessy
• MIPS is a 32-bit architecture with 32 registers
  – Consider 8-bit subset using 8-bit datapath
  – Only implement 8 registers ($0 - $7)
  – $0 hardwired to 00000000
  – 8-bit program counter

                      Original MIPS     Simplified MIPS
                      Architecture      Architecture here
Data width            32 bits           8 bits
Address width         32 bits           8 bits
# of registers        32                8
Instruction length    32 bits           32 bits

Instruction Set

[Table: instruction set of the MIPS subset]

Instruction Encoding

• 32-bit instruction encoding
  – Requires four cycles to fetch on 8-bit datapath
• Note that the destination register is specified by:
  – Bits 15:11 for R-type instructions
  – Bits 20:16 for addi instruction

format   example              encoding (field: width in bits)
R        add $rd, $ra, $rb    0: 6 | ra: 5 | rb: 5 | rd: 5 | 0: 5 | funct: 6
I        beq $ra, $rb, imm    op: 6 | ra: 5 | rb: 5 | imm: 16
J        j dest               op: 6 | dest: 26
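The three formats above can be sketched as bit-packing functions. This is only an illustration: the function names are hypothetical, and the opcode and funct values are the standard MIPS ones (add funct 100000, addi op 001000, beq op 000100, j op 000010) rather than values taken from the slide.

```python
def encode_r(ra, rb, rd, funct):
    """R-type: op = 0 (6) | ra (5) | rb (5) | rd (5) | 0 (5) | funct (6)."""
    return (ra << 21) | (rb << 16) | (rd << 11) | funct


def encode_i(op, ra, rb, imm):
    """I-type: op (6) | ra (5) | rb (5) | imm (16)."""
    return (op << 26) | (ra << 21) | (rb << 16) | (imm & 0xFFFF)


def encode_j(op, dest):
    """J-type: op (6) | dest (26)."""
    return (op << 26) | (dest & 0x03FFFFFF)


def dest_register(word):
    """Destination field: bits 15:11 for R-type (op == 0), else bits 20:16.

    The non-R-type case is only meaningful for addi in this subset.
    """
    if (word >> 26) == 0:
        return (word >> 11) & 0x1F
    return (word >> 16) & 0x1F


ADD_FUNCT = 0b100000  # standard MIPS funct code for add
ADDI_OP = 0b001000    # standard MIPS opcode for addi

add_word = encode_r(ra=4, rb=5, rd=3, funct=ADD_FUNCT)   # add  $3, $4, $5
addi_word = encode_i(ADDI_OP, ra=4, rb=3, imm=1)         # addi $3, $4, 1
print(hex(add_word), hex(addi_word))
```

Both example words name register 3 as the destination, but in different bit fields, which is why the datapath needs the RegDst multiplexer.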

Fibonacci (C)

f_0 = 1; f_{-1} = -1

f_n = f_{n-1} + f_{n-2}

f_1 = 0, f_2 = 1, f_3 = 1, f_4 = 2, f_5 = 3, ...
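The recurrence can be checked with a short sketch. The slide's C program is not reproduced in the text, so this is not that program; the function name `fib` is an illustrative choice, and only the seed values and the recurrence come from the slide.

```python
def fib(n):
    """Return f_n for n >= 0, seeded with f_{-1} = -1 and f_0 = 1."""
    prev, curr = -1, 1          # f_{-1}, f_0
    for _ in range(n):
        # f_n = f_{n-1} + f_{n-2}
        prev, curr = curr, prev + curr
    return curr


print([fib(n) for n in range(1, 6)])
```

Seeding with f_{-1} = -1 and f_0 = 1 makes the first sum come out to 0, so the loop needs no special case for the start of the sequence.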

Fibonacci (Assembly)

Fibonacci (Binary)

• Machine language program

Multicycle MIPS Microarchitecture

[Figure: multicycle MIPS datapath and control. Datapath blocks: PC, Memory (Address, Write data, MemData), Instruction register, Memory data register, Registers (Read register 1, Read register 2, Write register, Write data, Read data 1, Read data 2), A, B, ALU (ALU result, Zero), ALUOut, ALU control, Shift left 2, and the multiplexers that select among them. Control outputs: IorD, MemRead, MemWrite, MemtoReg, PCWriteCond, PCWrite, IRWrite[3:0], ALUOp, ALUSrcB, ALUSrcA, RegDst, PCSource, RegWrite, PCEn. Instruction fields used: Instruction[31:26] (Op[5:0]), Instruction[25:21], Instruction[15:11], Instruction[15:0], Instruction[7:0], Instruction[5:0].]

Multicycle MIPS µ-arch (32-bit Design)

Multicycle Controller

[Figure: controller state diagram with states 0 to 12, entered at instruction fetch on Reset. Surviving transition labels: (Op = 'J'), (Op = 'LB'). Control outputs by state:]

• Instruction fetch (four states, one per instruction byte)
  – MemRead, ALUSrcA = 0, IorD = 0, IRWrite3, ALUSrcB = 01, ALUOp = 00
  – MemRead, ALUSrcA = 0, IorD = 0, IRWrite2
  – MemRead, ALUSrcA = 0, IorD = 0, IRWrite1
  – MemRead, ALUSrcA = 0, IorD = 0, IRWrite0
• Instruction decode/register fetch: ALUSrcA = 0, ALUSrcB = 11, ALUOp = 00
• Memory address computation: ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00
• Memory access (load): MemRead, IorD = 1
• Write-back step: RegDst = 0, RegWrite, MemtoReg = 1
• Memory access (store): MemWrite, IorD = 1
• Execution: ALUSrcA = 1, ALUSrcB = 00, ALUOp = 10
• R-type completion: RegDst = 1, RegWrite, MemtoReg = 0
• Branch completion: ALUSrcA = 1, ALUSrcB = 00, ALUOp = 01, PCWriteCond, PCSource = 01
• Jump completion: PCWrite, PCSource = 10

Summary of Steps for Each Instruction Class
(Chapter 5 of Patterson and Hennessy, 32-bit design)

Instruction fetch (becomes 4 steps in our 8-bit design)
  All instructions:   IR <= Memory[PC]
                      PC <= PC + 4

Instruction decode / register fetch
  All instructions:   A <= Reg[IR[25:21]]
                      B <= Reg[IR[20:16]]
                      ALUOut <= PC + (sign-extend(IR[15:0]) << 2)

Execution / address computation / branch/jump completion
  R-type:             ALUOut <= A op B
  Load, store:        ALUOut <= A + sign-extend(IR[15:0])
  Branch:             if (A == B) PC <= ALUOut
  Jump:               PC <= {PC[31:28], IR[25:0], 2'b00}

Memory access / R-type completion
  R-type:             Reg[IR[15:11]] <= ALUOut
  Load:               MDR <= Memory[ALUOut]
  Store:              Memory[ALUOut] <= B

Memory read completion
  Load:               Reg[IR[20:16]] <= MDR


Instructions from ISA Perspective

• Consider each instruction from the perspective of the ISA (Instruction Set Architecture).
• Example: add instruction
  – Instruction specified by the PC.
  – Operand registers are specified by bits 25:21 and 20:16 of the instruction.
  – New value is the sum of two registers.
  – Register written is specified by bits 15:11 of the instruction.

    Reg[Memory[PC][15:11]] <= Reg[Memory[PC][25:21]] + Reg[Memory[PC][20:16]]
    PC <= PC + 4

• In order to accomplish this we must break up the instruction.
  – Kind of like introducing variables when programming


Breaking Down an Instruction

• ISA definition of arithmetic:

    Reg[Memory[PC][15:11]] <= Reg[Memory[PC][25:21]] + Reg[Memory[PC][20:16]]

• Could break down to:
  – IR <= Memory[PC]
  – A <= Reg[IR[25:21]]
  – B <= Reg[IR[20:16]]
  – ALUOut <= A + B
  – Reg[IR[15:11]] <= ALUOut
• Don't forget an important part of the definition of arithmetic!
  – PC <= PC + 4
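The breakdown can be traced in software as a sequence of register transfers. The sketch below is a behavioral model only, not the hardware: the function name is hypothetical, and memory is modeled as a dictionary of 32-bit words keyed by byte address, which is a simplification for illustration.

```python
def run_add_multicycle(memory, reg, pc):
    """Execute one add instruction as separate register transfers.

    memory: dict mapping byte address -> 32-bit instruction word (assumed model)
    reg:    list of 32 register values, updated in place
    Returns the updated PC.
    """
    ir = memory[pc]                       # IR <= Memory[PC]
    pc = pc + 4                           # PC <= PC + 4
    a = reg[(ir >> 21) & 0x1F]            # A <= Reg[IR[25:21]]
    b = reg[(ir >> 16) & 0x1F]            # B <= Reg[IR[20:16]]
    alu_out = (a + b) & 0xFFFFFFFF        # ALUOut <= A + B (32-bit wraparound)
    reg[(ir >> 11) & 0x1F] = alu_out      # Reg[IR[15:11]] <= ALUOut
    return pc


memory = {0: 0x00851820}                  # add $3, $4, $5
reg = [0] * 32
reg[4], reg[5] = 7, 5
new_pc = run_add_multicycle(memory, reg, pc=0)
print(reg[3], new_pc)
```

The intermediate variables `ir`, `a`, `b`, and `alu_out` play the role of the extra registers that the multicycle datapath introduces.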


Idea Behind Multicycle Approach

• We define each instruction from the ISA perspective
• Break it down into steps:
  – Balance the amount of work to be done in different steps
  – Restrict each cycle to use only one major functional unit
• Introduce new registers as needed
  – A, B, ALUOut, MDR, IR, etc.
• Finally, try to pack as much work as possible into each step (avoids unnecessary cycles) while also sharing steps where possible (minimizes control, helps to simplify the solution)
• Result: our book's multicycle implementation!


Five Execution Steps

1. Instruction Fetch (becomes 4 steps since we have an 8-bit design)
2. Instruction Decode and Register Fetch
3. Execution, Memory Address Computation, or Branch / Jump Completion
4. Memory Access or R-type Instruction Completion
5. Memory Read Completion

INSTRUCTIONS TAKE FROM 3 - 5 CYCLES! (6-8 cycles in our 8-bit design)
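The cycle counts follow from the last step each instruction class needs. The sketch below is a rough tally: the step numbers per class are read off the summary of steps, and the only other input is that fetch takes four cycles instead of one on the 8-bit datapath. The names are illustrative.

```python
# Last execution step used by each instruction class in the 32-bit design.
STEPS_32BIT = {
    "R-type": 4,   # finishes at R-type completion
    "load": 5,     # finishes at memory read completion
    "store": 4,    # finishes at memory access
    "branch": 3,   # finishes at branch completion
    "jump": 3,     # finishes at jump completion
}

# Fetch is one step in the 32-bit design but four in the 8-bit design.
EXTRA_FETCH_CYCLES = 4 - 1


def cycles_8bit(instruction_class):
    """Cycles per instruction when fetch is split into four byte reads."""
    return STEPS_32BIT[instruction_class] + EXTRA_FETCH_CYCLES


for name in STEPS_32BIT:
    print(name, STEPS_32BIT[name], cycles_8bit(name))
```

Every class pays the same three extra cycles, so the 3 to 5 range shifts to 6 to 8 without changing which instructions are fastest or slowest.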


Step 1: Instruction Fetch

• Use PC to get instruction and put it in the Instruction Register.
• Increment the PC by 4 and put the result back in the PC.
• Can be described succinctly using "Register-Transfer Language" (RTL):

    IR <= Memory[PC];
    PC <= PC + 4;

• Can we figure out the values of the control signals?
• What is the advantage of updating the PC now?
• Becomes 4 steps in our 8-bit design
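A four-cycle fetch over an 8-bit memory can be sketched as follows. The byte order is an assumption: the most significant byte is read first and written into IR[31:24], loosely matching the IRWrite3 to IRWrite0 order in the controller, but the slides do not state which byte lane each write enable drives. The function name is hypothetical.

```python
def fetch_8bit(memory, pc):
    """Fetch one 32-bit instruction from a byte-wide memory in four cycles.

    memory: sequence of 256 byte values (8-bit address space)
    Returns (instruction word, updated 8-bit PC).
    """
    ir = 0
    for cycle in range(4):
        byte = memory[(pc + cycle) & 0xFF]   # one byte per cycle
        lane = 3 - cycle                      # ASSUMED: cycle 0 fills IR[31:24]
        ir |= byte << (8 * lane)
    return ir, (pc + 4) & 0xFF                # the 8-bit PC wraps at 256


memory = [0x00, 0x85, 0x18, 0x20] + [0] * 252   # add $3, $4, $5 at address 0
ir, pc = fetch_8bit(memory, 0)
print(hex(ir), pc)
```

If the real design stores the least significant byte first, only the `lane` expression changes.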


Step 2: Instruction Decode and Register Fetch

• Read registers rs and rt in case we need them
• Compute the branch address in case the instruction is a branch
• RTL:

    A <= Reg[IR[25:21]];
    B <= Reg[IR[20:16]];
    ALUOut <= PC + (sign-extend(IR[15:0]) << 2);

• We are not setting any control lines based on the instruction type (we are busy "decoding" it in our control logic)
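The speculative branch-address computation in this step can be sketched for the 32-bit design. The helper names are illustrative, and `pc` is taken to be the value already incremented during instruction fetch.

```python
def sign_extend16(imm):
    """Interpret the low 16 bits of imm as a two's-complement value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def branch_target(pc, ir):
    """ALUOut <= PC + (sign-extend(IR[15:0]) << 2), as a 32-bit value."""
    offset = sign_extend16(ir & 0xFFFF) << 2   # word offset to byte offset
    return (pc + offset) & 0xFFFFFFFF


# beq with immediate -1 and +3, evaluated with PC already at 8.
print(branch_target(8, 0x1000FFFF), branch_target(8, 0x10000003))
```

The target is computed for every instruction, whether or not it turns out to be a branch, because the ALU would otherwise sit idle during decode.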


Step 3: Execution, Memory Address Computation, or Branch / Jump Completion (Instruction Dependent)

• ALU is performing one of three functions, based on instruction type
• R-type:

    ALUOut <= A op B;

• Memory reference:

    ALUOut <= A + sign-extend(IR[15:0]);

• Branch:

    if (A == B) PC <= ALUOut;

• Jump:

    PC <= {PC[31:28], IR[25:0], 2'b00}
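The two PC updates in this step can be written out for the 32-bit design. This is a sketch with hypothetical function names; `alu_out` stands for the branch target computed during decode.

```python
def jump_target(pc, ir):
    """PC <= {PC[31:28], IR[25:0], 2'b00}."""
    return (pc & 0xF0000000) | ((ir & 0x03FFFFFF) << 2)


def branch_completion(pc, a, b, alu_out):
    """if (A == B) PC <= ALUOut; otherwise the PC keeps its value."""
    return alu_out if a == b else pc


print(hex(jump_target(0x40000004, 0x08000010)))
print(branch_completion(pc=8, a=5, b=5, alu_out=20))
print(branch_completion(pc=8, a=5, b=6, alu_out=20))
```

The jump keeps the top four PC bits because the 26-bit field, shifted left by two, only supplies 28 address bits.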


Step 4: Memory Access or R-type Instruction Completion

• Loads and stores access memory

    MDR <= Memory[ALUOut];
  or
    Memory[ALUOut] <= B;

• R-type instructions completion

    Reg[IR[15:11]] <= ALUOut;

Step 5: Memory Read Completion

• Reg[IR[20:16]] <= MDR;

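Logic Design

• Start at top level
  – Hierarchically decompose MIPS into units
• Top-level interface

[Figure: top-level interface. A crystal oscillator feeds a 2-phase clock generator that supplies ph1 and ph2 to the MIPS processor, which also receives reset. The processor connects to external memory through the 8-bit adr, writedata, and memdata buses and the memread and memwrite signals.]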

Block Diagram

[Figure: block diagram of the processor, made up of the controller, alucontrol, and datapath blocks. Signals shown: ph1, ph2, reset, memdata[7:0], writedata[7:0], adr[7:0], memread, memwrite, op[5:0], zero, pcen, regwrite, irwrite[3:0], memtoreg, iord, pcsource[1:0], alusrcb[1:0], alusrca, aluop[1:0], regdst, funct[5:0], alucontrol[2:0]. The multicycle datapath diagram is repeated alongside.]

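Hierarchical Design

[Figure: design hierarchy. mips contains controller, alucontrol, and datapath. Lower levels include the standard cell library, bitslice, zipper, and alu, with leaf cells and2, flop, inv4x, mux2, mux4, ramslice, fulladder, nand2, nor2, or2, inv, and tri.]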

HDLs

• Hardware Description Languages
  – Widely used in logic design
  – Verilog and VHDL
• Describe hardware using code
  – Document logic functions
  – Simulate logic before building
  – Synthesize code into gates and layout
• Requires a library of standard cells

Verilog Example

module adder(input  logic [7:0] a, b,
             input  logic       c,
             output logic [7:0] s,
             output logic       cout);

  wire [6:0] carry;   // carry[i] links full adder i to full adder i+1

  // Each instance needs a unique name.
  fulladder fa0(a[0], b[0], c,        s[0], carry[0]);
  fulladder fa1(a[1], b[1], carry[0], s[1], carry[1]);
  fulladder fa2(a[2], b[2], carry[1], s[2], carry[2]);
  . . . .
  fulladder fa7(a[7], b[7], carry[6], s[7], cout);

endmodule

module fulladder(input logic a, b, c, output logic s, cout);

  sum   s1(a, b, c, s);
  carry c1(a, b, c, cout);

endmodule

module carry(input logic a, b, c, output logic cout);

  assign cout = (a&b) | (a&c) | (b&c);

endmodule

[Figure: fulladder schematic. Inputs a, b, c feed the sum and carry blocks, which produce s and cout.]
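A behavioral model of the same ripple-carry structure can serve as a reference when simulating the Verilog. The carry function is taken from the slide. The sum function is an assumption (the XOR of the three inputs), since the slide does not show the `sum` module.

```python
def carry(a, b, c):
    """Majority function, as in the carry module."""
    return (a & b) | (a & c) | (b & c)


def sum_bit(a, b, c):
    """ASSUMED body of the sum module: XOR of the three inputs."""
    return a ^ b ^ c


def fulladder(a, b, c):
    return sum_bit(a, b, c), carry(a, b, c)


def adder(a, b, c, width=8):
    """Chain `width` full adders, passing each carry to the next bit."""
    s = 0
    for i in range(width):
        bit, c = fulladder((a >> i) & 1, (b >> i) & 1, c)
        s |= bit << i
    return s, c            # (sum bits, carry out)


print(adder(200, 100, 1))
```

Because the inputs are only 17 bits in total, the model can be checked exhaustively against ordinary integer addition.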

Circuit Design

• How should logic be implemented?
  – NANDs and NORs vs. ANDs and ORs?
  – Fan-in and fan-out?
  – How wide should transistors be?
• These choices affect speed, area, power
• Logic synthesis makes these choices for you
  – Good enough for many applications
  – Hand-crafted circuits are still better

Example: Carry Logic

• assign cout = (a&b) | (a&c) | (b&c);

[Figure: gate-level schematic. AND gates g1, g2, g3 produce x = a&b, y = a&c, z = b&c; OR gate g4 combines x, y, z into cout.]

Gate-level design: 26 transistors, 4 stages of gate delays

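Example: Carry Logic

• assign cout = (a&b) | (a&c) | (b&c);

[Figure: transistor-level schematic. nMOS transistors n1 to n5 and pMOS transistors p1 to p5 drive the internal node cn through nodes i1 to i4; the inverter n6/p6 drives cout from cn.]

Transistor-level design: 12 transistors, 2 stages of gate delays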
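The two carry implementations can be compared with a short check. The pull-down and pull-up conditions below are read from the transistor connections (n1 to n5 and p1 to p5). The gate-level transistor count assumes static CMOS cells in which an AND2 is a NAND2 plus an inverter and an OR3 is a NOR3 plus an inverter; that cell choice is an assumption, although it does reproduce the slide's figure of 26.

```python
from itertools import product


def cout_gate_level(a, b, c):
    x, y, z = a & b, a & c, b & c        # g1, g2, g3
    return x | y | z                     # g4


def cout_transistor_level(a, b, c):
    na, nb, nc = 1 - a, 1 - b, 1 - c
    pull_down = (c & (a | b)) | (a & b)          # n3 with n1||n2, or n5 with n4
    pull_up = (nc & (na | nb)) | (na & nb)       # p3 with p1||p2, or p5 with p4
    assert pull_down != pull_up                  # exactly one network conducts
    cn = 0 if pull_down else 1
    return 1 - cn                                # inverter n6/p6


for a, b, c in product((0, 1), repeat=3):
    assert cout_gate_level(a, b, c) == cout_transistor_level(a, b, c)

# Transistor counts under the assumed static CMOS cells.
AND2 = 4 + 2          # NAND2 + inverter
OR3 = 6 + 2           # NOR3 + inverter
gate_level_count = 3 * AND2 + OR3
transistor_level_count = 10 + 2   # n1-n5, p1-p5, plus the output inverter
print(gate_level_count, transistor_level_count)
```

The custom circuit is smaller because the majority function is built as a single complex gate followed by one inverter, rather than as four separate gates.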

Gate-level Netlist

module carry(input a, b, c, output cout);

  wire x, y, z;

  and g1(x, a, b);
  and g2(y, a, c);
  and g3(z, b, c);
  or  g4(cout, x, y, z);

endmodule

[Figure: gate-level schematic. AND gates g1, g2, g3 produce x, y, z; OR gate g4 drives cout.]

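Transistor-Level Netlist

[Figure: transistor-level schematic with nMOS transistors n1 to n6, pMOS transistors p1 to p6, internal nodes i1 to i4 and cn, and output cout.]

module carry(input a, b, c, output cout);

  wire i1, i2, i3, i4, cn;
  supply0 gnd;   // ground rail: switch terminals must be nets, not literal 0
  supply1 vdd;   // power rail: switch terminals must be nets, not literal 1

  tranif1 n1(i1, gnd, a);
  tranif1 n2(i1, gnd, b);
  tranif1 n3(cn, i1, c);
  tranif1 n4(i2, gnd, b);
  tranif1 n5(cn, i2, a);
  tranif0 p1(i3, vdd, a);
  tranif0 p2(i3, vdd, b);
  tranif0 p3(cn, i3, c);
  tranif0 p4(i4, vdd, b);
  tranif0 p5(cn, i4, a);
  tranif1 n6(cout, gnd, cn);
  tranif0 p6(cout, vdd, cn);

endmodule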

SPICE Netlist

.SUBCKT CARRY A B C COUT VDD GND
MN1 I1 A GND GND NMOS W=1U L=0.18U AD=0.3P AS=0.5P
MN2 I1 B GND GND NMOS W=1U L=0.18U AD=0.3P AS=0.5P
MN3 CN C I1 GND NMOS W=1U L=0.18U AD=0.5P AS=0.5P
MN4 I2 B GND GND NMOS W=1U L=0.18U AD=0.15P AS=0.5P
MN5 CN A I2 GND NMOS W=1U L=0.18U AD=0.5P AS=0.15P
MP1 I3 A VDD VDD PMOS W=2U L=0.18U AD=0.6P AS=1P
MP2 I3 B VDD VDD PMOS W=2U L=0.18U AD=0.6P AS=1P
MP3 CN C I3 VDD PMOS W=2U L=0.18U AD=1P AS=1P
MP4 I4 B VDD VDD PMOS W=2U L=0.18U AD=0.3P AS=1P
MP5 CN A I4 VDD PMOS W=2U L=0.18U AD=1P AS=0.3P
MN6 COUT CN GND GND NMOS W=2U L=0.18U AD=1P AS=1P
MP6 COUT CN VDD VDD PMOS W=4U L=0.18U AD=2P AS=2P
CI1 I1 GND 2FF
CI3 I3 GND 3FF
CA A GND 4FF
CB B GND 4FF
CC C GND 2FF
CCN CN GND 4FF
CCOUT COUT GND 2FF
.ENDS

Physical Design

• Floorplan
  – Area estimation
• Place & route
  – Standard cells
• Datapaths
  – Slice planning

Synthesized MIPS

Layout

MIPS Floorplan

[Figure: chip floorplan. Dimensions in λ, areas in λ².]

Block             Dimensions (λ)      Area (λ²)
mips              2700 x 1690         4.6 M
datapath          2700 x 1050         2.8 M
zipper            2700 x 250
bitslice          2700 x 100
wiring channel    30 tracks = 240
control           1500 x 400          0.6 M
alucontrol        200 x 100           20 k

Core: 3500 x 3500. Die: 5000 x 5000, with 10 I/O pads on each of the four sides.
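The floorplan numbers can be cross-checked with simple arithmetic. The sketch assumes the dimensions are in λ and that the datapath, wiring channel, and control block are stacked vertically. That stacking is an interpretation of the figure, adopted because it reproduces the stated 1690 height.

```python
ZIPPER_H, BITSLICE_H, N_SLICES = 250, 100, 8
DATAPATH_W = 2700
CHANNEL_H, CHANNEL_TRACKS = 240, 30
CONTROL_W, CONTROL_H = 1500, 400
ALUCONTROL_W, ALUCONTROL_H = 200, 100

# Datapath height: one zipper on top of eight bitslices.
datapath_h = ZIPPER_H + N_SLICES * BITSLICE_H

# ASSUMED vertical stack: datapath, wiring channel, control.
mips_h = datapath_h + CHANNEL_H + CONTROL_H

areas = {
    "datapath": DATAPATH_W * datapath_h,
    "control": CONTROL_W * CONTROL_H,
    "alucontrol": ALUCONTROL_W * ALUCONTROL_H,
    "mips": DATAPATH_W * mips_h,
}
track_pitch = CHANNEL_H / CHANNEL_TRACKS   # wiring pitch per track

print(datapath_h, mips_h, track_pitch)
for name, area in areas.items():
    print(name, area)
```

The block areas sum to less than the mips area. The difference is space set aside for the wiring channel and unused area beside the control blocks.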

Area Estimation

• Need area estimates to make floorplan
  – Compare to another block you already designed
  – Or estimate from transistor counts
  – Budget room for large wiring tracks
  – Your mileage may vary!

MIPS Layout

Standard Cells

• Uniform cell height
• Uniform well height
• M1 VDD and GND rails
• M2 access to I/Os
• Well / substrate taps
• Exploits regularity

Synthesized Controller

• Synthesize HDL into gate-level netlist
• Place & route using standard cell library

Snap-Together Cells

• Synthesized controller area is mostly wires
  – Design is smaller if wires run through/over cells
  – Smaller = faster, lower power as well!
• Design snap-together cells for datapaths and arrays
  – Plan wires into cells
  – Pitch matching required
  – Connect by abutment
• Exploits locality
• Takes lots of effort

[Figure: array of abutting cells. A 4 x 4 block of A cells with B, C, and D cells along its edges.]

MIPS Datapath

• 8-bit datapath built from 8 bitslices (regularity)
• Zipper at top drives control signals to datapath

MIPS ALU

• Arithmetic / Logic Unit is part of bitslice

Slice Plans

• Slice plan for bitslice
  – Cell ordering, dimensions, wiring tracks
  – Arrange cells for wiring locality

Design Verification

• Fabrication is slow & expensive
  – MOSIS 0.6 µm: $1000, 3 months
  – 65 nm: $3M, 1 month
• Debugging chips is very hard
  – Limited visibility into operation
• Prove design is right before building!
  – Logic simulation
  – Ckt. simulation / formal verification
  – Layout vs. schematic (LVS) comparison
  – Design & electrical rule checks (DRC & ERC)
• Verification is > 50% of effort on most chips!

[Figure: verification flow. Specification, Architecture Design, Logic Design, Circuit Design, and Physical Design are each checked for equivalence (=) with the neighboring level: function at the upper levels; function, timing, and power at the lowest.]

Fabrication & Packaging

• Tapeout final layout
• Fabrication
  – 6, 8, 12” wafers
  – Optimized for throughput, not latency (10 weeks!)
  – Cut into individual dice
• Packaging
  – Bond gold wires from die I/O pads to package

Testing

• Test that chip operates
  – Design errors
  – Manufacturing errors
• A single dust particle or wafer defect kills a die
  – Yields from 90% to < 10%
  – Depends on die size, maturity of process
  – Test each part before shipping to customer

Date post:	29-Jan-2016
Upload:	shonda-jefferson