2/24/2005
1
EE 108A Lecture 11 — (c) 2005-2007 W. J. Dally and D. Black-Schaffer — 11/1/2006
EE108A
Lecture 11: Pipelining
Announcements
• Lab 5: easier to implement than lab 4, but harder to debug
• Quiz 2 will be on 11/14
• Metastability demonstration at next week's lecture (11/7)
  – When good flip-flops go bad
  – This is cool because no one ever demonstrates this except Bill!
Some Comments on Labs 4 and 5
• Complete your design before you start coding
  – Understand your interfaces
  – Prepare a timing table (remember it takes a cycle to read a RAM/ROM)
  – Partition functions
• Keep it simple
  – Address each feature in exactly one place
  – Don't change interfaces
• Incremental debugging
  – Get one piece working at a time
  – Find out where the signal stops
• Think about "event flow"
  – What event causes things to happen, and what event follows?
  – e.g., new_frame, then next_note, etc.
System Design – a process
• Design (25% of the time)
  – Specification
    • Understand what you need to build
  – Divide and conquer
    • Break it down into manageable pieces
  – Define interfaces (this takes practice to get right)
    • Clearly specify every signal between pieces
    • Hide implementation
    • Choose representations
  – Timing and sequencing
    • Overall timing – use a table
    • Timing of each interface – use a simple convention (e.g., valid–ready)
  – Add parallelism as needed (pipeline or duplicate units)
  – Timing and sequencing (of parallel structures)
  – Design each module
• Code (25% of the time)
• Verify (50+% of the time) <= Note the time here. Plan for it.
Iterate back to the top at any step as needed.
Making things Faster
• Question:
  – Why do people design circuits for computation?
  – Why not just use a programmable processor and a compiler?
  – Verilog is so much harder to program and debug... what's the benefit?
• Answer:
  – Performance (parallelism)
  – Efficiency (just use what you need)
    • You can tailor the circuit to be just the right tradeoff for your design
    • Exactly as much computation power as you need
    • Result: much higher (100x or more) efficiency than a generic processor,
      but... much higher (1000x or more) cost to develop
      (when does this tradeoff make sense?)
Two ways to make things faster: pipeline and parallel
[Figure: tasks A, B, C, D arranged two ways – chained as a pipeline, and as parallel copies fed by a master]
Pipelines
More Like an Assembly Line
Pipelines
• Like an assembly line – each pipeline stage does part of the work and passes the 'workpiece' to the next stage
• Example 1: Pipelined 32b Adder
[Figure: 32-bit ripple-carry adder – full adders FA chained from (a0, b0, c0 → s0, c1) up through (a31, b31, c31 → s31, c32)]
Split into 4 8-bit adders
[Figure: the 32-bit add split into four chained 8-bit adders – (a[7:0], b[7:0], c0 → s[7:0], c8), (a[15:8], b[15:8], c8 → s[15:8], c16), (a[23:16], b[23:16], c16 → s[23:16], c24), (a[31:24], b[31:24], c24 → s[31:24], c32)]
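The byte-sliced decomposition above can be checked with a small software model. This is a Python sketch (the names `add8` and `add32` are mine, not from the lecture) in which each 8-bit adder consumes the carry-out of the previous one:

```python
def add8(a, b, cin):
    """8-bit add with carry-in; returns (sum byte, carry-out)."""
    t = a + b + cin
    return t & 0xFF, t >> 8

def add32(a, b):
    """32-bit add built from four chained 8-bit adders (c0 = 0)."""
    s, c = 0, 0
    for i in range(4):        # byte lanes [7:0], [15:8], [23:16], [31:24]
        byte, c = add8((a >> 8 * i) & 0xFF, (b >> 8 * i) & 0xFF, c)
        s |= byte << 8 * i
    return s, c               # (s[31:0], c32)
```

For example, `add32(0xFFFFFFFF, 1)` wraps to sum 0 with carry-out 1, exactly as the carry chain c8 → c16 → c24 → c32 would ripple in hardware.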
Split into stages – 4 problems 'in process' at once
[Figure: the four 8-bit adders separated by pipeline registers (which store intermediate results); the higher input bytes a[15:8]…a[31:24], b[15:8]…b[31:24] are delayed through registers so they arrive with their carries, and the lower sum bytes are delayed so s[31:0] and c32 emerge together]
Pipeline Diagram (illustrates pipeline timing)

           Cycle:  0      1      2      3      4      5      6      7
Problems  P0       7:0    15:8   23:16  31:24
          P1              7:0    15:8   23:16  31:24
          P2                     7:0    15:8   23:16  31:24
          P3                            7:0    15:8   23:16  31:24
          P4                                   7:0    15:8   23:16  31:24
Movie
[Figure: the pipelined adder (as above), shown empty before any problems enter]
Cycle 1
[Figure: problem P0's low byte is in the first adder; its partial sum S0 is latched into the first pipeline register]
Cycle 2
[Figure: P1 enters the first adder while P0 advances to the second; partial sums S0 and S1 sit in the pipeline registers]
Cycle 3
[Figure: P2 enters as P0 and P1 advance; partial sums S0–S2 are in flight]
Cycle 3
[Figure: the pipelined adder at the end of cycle 3 – partial results S0–S2 latched into the pipeline registers as P1 and P2 advance]
Pipeline Diagram (illustrates pipeline timing)

           Cycle:  0      1      2      3      4      5      6      7
Problems  P0       7:0    15:8   23:16  31:24
          P1              7:0    15:8   23:16  31:24
          P2                     7:0    15:8   23:16  31:24
          P3                            7:0    15:8   23:16  31:24
          P4                                   7:0    15:8   23:16  31:24

[Figure: the pipelined adder, as above]

Takes 3 cycles to fill the pipeline.
Then 1 output per cycle.
Takes 3 cycles to drain the pipeline.
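The fill/drain behavior can be watched in a small cycle-level model. This Python sketch (my own construction – it carries each problem's full operands and partial sum through the pipeline rather than modeling the individual skew registers) issues one problem per cycle and records when results emerge:

```python
from collections import deque

def add8(a, b, cin):
    """8-bit add with carry-in; returns (sum byte, carry-out)."""
    t = a + b + cin
    return t & 0xFF, t >> 8

def pipelined_add32(problems):
    """Cycle-level sketch of the 4-stage byte-sliced pipelined adder.
    Stage k adds byte k of its current problem; operands, partial sum,
    and carry ride through the pipeline registers together."""
    stages = [None] * 4                 # stage k holds (a, b, partial, carry)
    feed = deque(problems)
    out, cycle = [], 0
    while feed or any(s is not None for s in stages):
        if stages[3] is not None:       # a finished problem leaves the pipe
            out.append((cycle, stages[3][2]))
        for k in (3, 2, 1):             # clock edge: shift everything forward
            stages[k] = stages[k - 1]
        stages[0] = (feed.popleft() + (0, 0)) if feed else None
        for k, slot in enumerate(stages):   # each stage adds its byte lane
            if slot is not None:
                a, b, partial, carry = slot
                byte, carry = add8((a >> 8 * k) & 0xFF, (b >> 8 * k) & 0xFF, carry)
                stages[k] = (a, b, partial | (byte << 8 * k), carry)
        cycle += 1
    return out   # first result appears at cycle 4 (n = 4 stages), then 1/cycle
```

Feeding three problems back-to-back, the completion cycles come out 4, 5, 6: four cycles of latency for the first result, one result per cycle after that, matching the diagram.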
Latency and Throughput of a Pipeline
• Suppose before pipelining the delay of our 32b adder is 3200 ps (100 ps
  per bit), and this adder can do one problem every 3200 ps, for a
  throughput of 1/3200 ps ≈ 312 Mops
• What is the delay (latency) and throughput of the adder with pipelining?
  Suppose tdCQ = 100 ps, ts = 50 ps, tk = 50 ps (200 ps overhead per FF)
• tpipe = n(tstage + tdCQ + ts + tk) = 4 × (800 + 200) ps = 4000 ps
• Θ = n/tpipe = 1/(tstage + tdCQ + ts + tk) = 1/1000 ps = 1 Gops
• So it now takes 4000 ps for the first result, but we get one every 1000 ps
• Big win if you're doing lots of adds back-to-back…
  …a loss if you only do one add every 4 or more cycles.
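Plugging the slide's numbers in as a quick sanity check (plain arithmetic in Python; variable names are mine):

```python
# Numbers from the slide: 3200 ps of adder logic split into n = 4 stages;
# per-flip-flop overhead t_dCQ + t_s + t_k = 100 + 50 + 50 = 200 ps.
n, t_logic, t_ff = 4, 3200, 200
t_stage = t_logic // n + t_ff      # 800 + 200 = 1000 ps per pipeline stage
t_pipe = n * t_stage               # latency: 4000 ps to the first result
throughput = 1e12 / t_stage        # 1 result / 1000 ps = 1e9 ops/s (1 Gops)
unpipelined = 1e12 / t_logic       # ~312 Mops, the pre-pipelining throughput
print(t_pipe, throughput)
```

Note that the 200 ps of flip-flop overhead per stage is why throughput improves by 3.2x, not the full 4x: each stage pays the register tax once per cycle.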
Example 2: Processor Pipeline
[Figure: processor datapath – PC → Inst Cache → IRR → Regs → IRA → ALU → IRM → Data Cache → IRW → Regs + Mux (writeback), with IRR/IRA/IRM/IRW as the pipeline registers]
Inst1:  Fetch  Read   ALU    Mem    Write
Inst2:         Fetch  Read   ALU    Mem    Write
Inst3:                Fetch  Read   ALU    Mem    Write
Inst4:                       Fetch  Read   ALU    Mem    Write
• Example 3: Graphics rendering pipeline
[Figure: triangle pipeline (Xform triangles → Clip → Light → Rasterize) feeding a fragment pipeline (Shade → Composite); Shade reads Textures, Composite uses the Z-buffer and writes the Frame Buffer; triangles enter the rasterizer, fragments leave it]
Example 4 – Packet Processing Pipeline
[Figure: Framer → Policing → Route Lookup → Switch Scheduling → Queue Mgt → Output Scheduler → Framer]

And each of these modules is internally pipelined.
You get the idea – lots of systems are organized this way.
Pipelines: Key Point
• Slower for a single problem (overhead from FFs and partitioning)
• If we can input one new problem per cycle, we can get an output every cycle
• By splitting up the logic, the cycles get shorter
• So, if we can keep the pipeline full, we can get dramatically better throughput
• But it's rarely so easy...
Issues with pipelines (all deal with time per stage)
• Load balance (across stages)
  – One stage takes longer to process each input than the others – it
    becomes a 'bottleneck'
  – Example: rasterizing an 'average' triangle in a graphics pipeline takes
    more time than 'lighting' its vertices
• Variable load (across data)
  – A given stage takes more time on some inputs than others
  – Example: the time needed to rasterize a triangle is proportional to the
    number of fragments in the triangle. The average triangle may contain 20
    fragments, but triangles range from 0 to over 1M.
• Long latency
  – A stage may require a long-latency operation (e.g., a texture access)
• Rigid pipeline: each stage takes the same time, so all stages must take
  the maximum time (inefficient if some are waiting for others)
Load Balancing Pipelines
• Suppose transform takes 2 cycles and clip 4 cycles
• Clip is a ‘bottleneck’ pipeline stage
• Xform unit is busy only half the time
[Figure: timing chart – each 2-cycle Xform is followed by a 2-cycle stall while the 4-cycle Clip finishes]
Load Balancing Solutions
1 – Parallel copies of slow unit
[Figure: Xform feeds a Distribute block that alternates work between Clip (1) and Clip (2); a Join block merges their outputs. The timing chart shows the 2-cycle Xform keeping both 4-cycle Clip units busy]
Load Balancing Solutions
2 – Split slow pipeline stage
[Figure: pipeline Xform → Clip A → Clip B, with a new problem entering every 2 cycles]

When is it better to split? To copy?
Throughput and latency are the same.

[Figure: the split pipeline (Xform → Clip A → Clip B) shown beside the copied version (Xform → Distribute → Clip (1)/(2) → Join)]
Variable Load
Stage A always takes 10 cycles.
Stage B takes 5 or 15 cycles – averages 10 cycles.
Pipeline averages ____ cycles per element.
[Figure: stage A (10 cycles) feeding stage B (5 or 15 cycles); the timing chart shows idle (I) slots in B and stall (S) slots in A as B's time varies]
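One way to answer the fill-in question above is to simulate it. This is an event-driven Python sketch of my own construction, assuming B strictly alternates 5- and 15-cycle items; because A must hold each element until B is free, the rigid pipeline's steady-state average is worse than the 10-cycle average of either stage alone:

```python
import itertools

def rigid_two_stage(n_items):
    """Event-driven model of the rigid A -> B pipeline above.
    A always takes 10 cycles; B alternates 5 and 15 (average 10).
    A cannot hand an element to B until B is free, so A stalls
    whenever B runs long."""
    b_time = itertools.cycle([5.0, 15.0])
    a_done = b_free = 0.0
    finish = []
    for _ in range(n_items):
        a_done += 10.0                   # A works on its next element
        start_b = max(a_done, b_free)    # hand-off waits for B: A stalls here
        b_free = start_b + next(b_time)
        a_done = start_b                 # A restarts only after the hand-off
        finish.append(b_free)
    return finish

# In steady state, completions alternate 20 and 5 cycles apart, so under
# this alternating pattern the rigid pipeline averages 12.5 cycles/element.
```

When B runs long A stalls; when B runs short B idles waiting for A, and neither lost slot can be recovered, which is exactly the motivation for the elastic pipeline later in the lecture.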
Stalling a Rigid Pipeline
A stall in any stage halts all stages upstream of the stall point instantly (on the next clock).
What if we stopped all stages, not just upstream stages?
How does the delay of this structure scale with the number of stages?
[Figure: stages A, B, C chained with valid/ready handshakes; each stage's int_ready gates its output, and the ready signal ripples combinationally back through every upstream stage]
Double Buffer
Add an extra buffer to each stage that is filled during the first cycle of a stall.
[Figure: one pipeline stage with a Full flag, a Buffer register, and a Mux on the stage input; signals validu/readyu face upstream, validd/readyd face downstream, with validb, readyb, and int_ready internal]

Input goes to the stage when ready, or to the buffer when stalled.
What is the logic equation for “next_full”? “next_buf”? “mux_sel”?
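The next_full/next_buf/mux_sel question can be prototyped in software first. Below is a Python sketch of one plausible skid-buffer discipline (the state encoding and hand-off rules are my assumptions, not a transcription of the slide): ready-to-upstream depends only on local state, so a stall propagates one stage per clock instead of rippling combinationally.

```python
def skid_step(stage, buf, new, out_ready):
    """One clock edge of a skid-buffered ('double buffer') pipeline stage.
    stage/buf each hold a datum or None; `new` is the word the upstream
    offers this cycle (None if nothing). Returns
    (next_stage, next_buf, output, ready_u).
    ready_u (the slide's readyu) is just `buf is None`: purely local,
    so the backpressure chain is broken at every stage."""
    ready_u = buf is None                 # advertise ready while buffer free
    output = stage if out_ready else None
    accepted = new if ready_u else None   # word actually taken this cycle
    if out_ready or stage is None:
        # Stage advances: drain the buffer first (mux_sel = buffer),
        # otherwise take the freshly accepted word (mux_sel = input).
        next_stage = buf if buf is not None else accepted
        next_buf = None
    else:
        # Stalled: the stage keeps its word; the accepted word parks in
        # the buffer (this is where next_full goes high).
        next_stage = stage
        next_buf = buf if buf is not None else accepted
    return next_stage, next_buf, output, ready_u
```

A quick check: stall the downstream for three cycles while feeding words 1..5 — every word comes out exactly once, in order, with the buffer absorbing the one word that arrives on the first stalled cycle.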
Double Buffer Timing
[Figure: stage-vs-time chart (stages 0–3, cycles 0–5); while the output is not ready, elements A–D pack two per stage into the double buffers, and once it becomes ready they drain one per cycle as E and F enter]
Double Buffer Alternate Timing (how do you make this happen?)

[Figure: the same not-ready/ready scenario as the previous slide, with elements A–F advancing on a different schedule]
Elastic Pipelines
A FIFO between stages decouples timing and allows stages to operate at their 'average' speed.
[Figure: stage A (10 cycles) → FIFO → stage B (5 or 15 cycles); the timing chart shows both stages running continuously, without stalls or idles]
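The same A/B pair from the variable-load example can be modeled with an ideal (deep-enough) FIFO between the stages. Under the same alternating 5/15 assumption, the steady-state average drops to the 10-cycle stage average (a Python sketch, names mine):

```python
import itertools

def elastic_two_stage(n_items):
    """Stage A (10 cycles) -> FIFO -> stage B (5 or 15 cycles, alternating).
    The FIFO is assumed deep enough that A never stalls: element i leaves A
    at cycle 10*(i+1), and B starts it as soon as both B and the element
    are available."""
    b_time = itertools.cycle([5.0, 15.0])
    b_free = 0.0
    finish = []
    for i in range(n_items):
        start_b = max(10.0 * (i + 1), b_free)   # item waits in the FIFO
        b_free = start_b + next(b_time)
        finish.append(b_free)
    return finish

# Steady-state average: 10 cycles per element -- B's short items soak up
# the time its long items borrowed, because the FIFO holds the backlog.
```

The contrast with the rigid version of the same pair is the whole point of the slide: same stages, same average stage times, but decoupled timing recovers the lost throughput.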
Resource Sharing
• Suppose two pipeline stages need to access the same memory
[Figure: pipeline A → B → C → D; stages a and d both access a shared Memory through a Mux controlled by an arbiter (Arb)]
How would you set the priority on the arbiter?
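One common answer (an assumption on my part, not stated on the slide) is fixed priority for the later stage: a stalled downstream stage backs up everything behind it, while an upstream request can simply wait. A minimal Python sketch of that policy:

```python
def arb(req_a, req_d):
    """Fixed-priority arbiter for the shared memory: at most one grant
    per cycle, with the downstream stage d winning ties (my assumed
    policy, posed as a question on the slide)."""
    if req_d:
        return "d"
    if req_a:
        return "a"
    return None
```

Note the hazard of any fixed priority: if d requests every cycle, a starves; a round-robin or age-based arbiter avoids that at the cost of more state.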
Pipeline overview
• Divide a large problem into stages, assembly-line style
• Divide evenly, or load imbalance will occur
  – Fix by splitting or copying the bottleneck stage
• Rigid pipelines have no extra storage between stages
  – A stall on any stage halts all upstream stages
  – Hard to stop 100 stages at once
  – Make this scalable with double-buffering
• Variable load results in stalls and idle cycles on a rigid pipeline
  – Make the pipeline elastic by adding FIFOs between key stages