Lec Jan15 2009

Anshul Kumar, CSE IITD

CSL718 : Pipelined Processors

Pipeline Timings

15th Jan, 2009

Pipelined Processors

Parallel architectures
• Data-parallel
• Function-parallel
  – Instr level (ILP): pipelined processors, VLIWs, superscalar processors
  – Thread level
  – Process level

Intel’s terminology:
• intra ILP
• inter ILP


Ideal Pipelining

[Figure: an instruction of duration Tinst divided into S stages]


Determining Clock Period

[Figure: combinational logic (Comb) between two registers (Reg), clocked with period Δt]

Δt ≥ P, where P = propagation delay

Δt = Pmax, where Pmax = max propagation delay


Ideal Pipelining

[Figure: an instruction of duration Tinst divided into S stages]

Pmax = Tinst / S

Δt = Tinst / S

Effective CPI = 1

Effective time per inst Teff = CPI * Δt = 1 * Tinst / S


Pipelining with hazards

[Figure: an instruction of duration Tinst divided into S stages]

b = frequency of interruptions (each interruption is charged S - 1 extra cycles)

Δt = Tinst / S

CPI = 1 + (S - 1) * b

Teff = (1 + (S - 1) * b) * Tinst / S

[Plot: Teff vs. S (Tinst = 10), for S = 1 to 10, with one curve each for b = .2, b = .1 and b = .05]
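The curves described above can be regenerated from the hazard formula. The following is a minimal sketch in Python; the function name `teff_hazard` is made up for illustration, and the values of Tinst and b are the ones named on the slide.

```python
def teff_hazard(t_inst, stages, b):
    """Effective time per instruction: (1 + (S - 1) * b) * Tinst / S."""
    cpi = 1 + (stages - 1) * b      # each interruption costs S - 1 cycles
    return cpi * t_inst / stages    # CPI * Δt, with Δt = Tinst / S


T_INST = 10
for b in (0.2, 0.1, 0.05):
    row = [round(teff_hazard(T_INST, s, b), 2) for s in range(1, 11)]
    print(f"b = {b}: {row}")
```

Expanding the formula gives Teff = (1 - b) * Tinst / S + b * Tinst, so under this model every curve flattens toward b * Tinst as S grows.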


A more realistic view

[Figure: combinational logic (Comb) with propagation delay P between two registers (Reg); the clock period also has to cover register output delay, register setup time and clock skew]


Clocking Overhead

• Fixed overhead c
  – Setup time
  – Output delay
• Variable overhead (stretching factor) k
  – Clock skew

Δt = Pmax + k * Pmax + c
   = (1 + k) * Tinst / S + c

Teff = [1 + (S - 1) * b] * [(1 + k) * Tinst / S + c]

[Plot: Teff vs. S (Tinst = 10, c = 1, k = .1), for S = 1 to 15, with one curve each for b = .2, b = .1 and b = .05]
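With clocking overhead included, Teff no longer falls monotonically with S; each curve has a minimum. The sketch below is an illustration rather than part of the lecture: it evaluates the Teff formula for integer S and picks the best value by exhaustive search. The names `teff` and `best_stage_count` and the search limit of 15 stages (the range of the plot) are assumptions.

```python
def teff(t_inst, stages, b, c, k):
    """Teff = [1 + (S - 1) * b] * [(1 + k) * Tinst / S + c]."""
    cpi = 1 + (stages - 1) * b
    dt = (1 + k) * t_inst / stages + c
    return cpi * dt


def best_stage_count(t_inst, b, c, k, max_stages=15):
    """Integer S in 1..max_stages that minimises Teff (brute force)."""
    return min(range(1, max_stages + 1),
               key=lambda s: teff(t_inst, s, b, c, k))


for b in (0.2, 0.1, 0.05):
    s = best_stage_count(10, b, c=1, k=0.1)
    print(f"b = {b}: best S = {s}, Teff = {teff(10, s, b, 1, 0.1):.3f}")
```

A smaller b moves the minimum toward deeper pipelines, which matches the shape of the three curves.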


Pipelining with Clocking Overhead

Teff = [1 + (S - 1) * b] * [(1 + k) * Tinst / S + c]

Sopt = √[(1 - b) * (1 + k) * Tinst / (b * c)]
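The expression for Sopt follows from minimising Teff over S. The steps below are a sketch that treats S as a continuous variable, which is an assumption; a real design then has to round to an integer number of stages.

```latex
\begin{align*}
T_{\mathrm{eff}}(S) &= \bigl[(1-b) + bS\bigr]\left[\frac{(1+k)\,T_{\mathrm{inst}}}{S} + c\right] \\
&= \frac{(1-b)(1+k)\,T_{\mathrm{inst}}}{S} + (1-b)\,c + b\,(1+k)\,T_{\mathrm{inst}} + b\,c\,S \\
\frac{dT_{\mathrm{eff}}}{dS} &= -\frac{(1-b)(1+k)\,T_{\mathrm{inst}}}{S^{2}} + b\,c = 0 \\
\Rightarrow\quad S_{\mathrm{opt}} &= \sqrt{\frac{(1-b)(1+k)\,T_{\mathrm{inst}}}{b\,c}}
\end{align*}
```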


Partitioning instruction into cycles with non-uniform stage times

One action - one pipeline stage => large quantization overhead

Multiple actions per stage?

Multiple stages per action?


Example

Action        Time
PC - MAR      4 ns
Cache Dir     6 ns
Cache Data    10 ns
Data - IR     3 ns
Decode        6+6 ns
Gen Addr      9 ns
Addr - MAR    3 ns
Cache Dir     6 ns
Cache Data    10 ns
Data - ALU    3 ns
Execute       7+7+8 ns
Put Away      2 ns


Optimal Pipelining

Tinst = 4+6+10+3+12+9+3+6+10+3+22+2 = 90 ns

b = 0.2, c = 4 ns, k = 5%

Sopt = √[(1 - b) * (1 + k) * Tinst / (b * c)] = 9.7 ⇒ 9
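The value of Sopt can be checked numerically. This is a small sketch using the parameters stated above; the function name `optimal_stages` is made up for illustration.

```python
import math


def optimal_stages(t_inst, b, c, k):
    """Sopt = sqrt((1 - b) * (1 + k) * Tinst / (b * c)), S treated as continuous."""
    return math.sqrt((1 - b) * (1 + k) * t_inst / (b * c))


s_opt = optimal_stages(t_inst=90, b=0.2, c=4, k=0.05)
print(f"Sopt = {s_opt:.2f}")
```

The result lies between 9 and 10, so both stage counts are worth evaluating.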

S = 10

[Figure: the actions of the example grouped into 10 stages]

Pmax = 10 ns

Δt = 14.5 ns

S * Δt = 145 ns



S = 9

[Figure: the actions of the example grouped into 9 stages]

Pmax = 13 ns

Δt = 17.65 ns

S * Δt = 159 ns



S = 5

[Figure: the actions of the example grouped into 5 stages]

Pmax = 20 ns

Δt = 25 ns

S * Δt = 125 ns


Comparison

 S   Pmax (ns)   Δt (ns)   S * Δt (ns)   Teff (ns)
 9      13        17.65        159         45.89
10      10        14.50        145         40.60
 5      20        25.00        125         45.00
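The rows of this comparison can be reproduced from the action times with b = 0.2, c = 4 ns and k = 5%. The sketch below rests on two assumptions that are not stated in the slides: Decode and Execute may be split at their "+" boundaries, and consecutive actions are packed greedily into stages up to a chosen limit. The greedy rule happens to give the same S and Pmax as the slides for limits of 13, 10 and 20 ns, but it is not necessarily how the original partitions were drawn.

```python
# Indivisible action times in ns, in the order they are summed for Tinst.
# Decode (6+6) and Execute (7+7+8) are split into sub-steps (assumption).
ACTIONS = [4, 6, 10, 3, 6, 6, 9, 3, 6, 10, 3, 7, 7, 8, 2]
B, C, K = 0.2, 4.0, 0.05


def partition(actions, limit):
    """Greedily pack consecutive actions into stages of at most `limit` ns."""
    stages, current = [], 0
    for t in actions:
        if t > limit:
            raise ValueError(f"action of {t} ns does not fit in {limit} ns")
        if current and current + t > limit:
            stages.append(current)   # close the stage before it overflows
            current = 0
        current += t
    stages.append(current)
    return stages


def summarize(limit):
    """Return (S, Pmax, Δt, S * Δt, Teff) for the greedy partition."""
    stages = partition(ACTIONS, limit)
    s, p_max = len(stages), max(stages)
    dt = (1 + K) * p_max + C
    t_eff = (1 + (s - 1) * B) * dt
    return s, p_max, dt, s * dt, t_eff


for limit in (13, 10, 20):
    s, p_max, dt, total, t_eff = summarize(limit)
    print(f"{s:>2} {p_max:>5} {dt:>7.2f} {total:>7.0f} {t_eff:>7.2f}")
```

The 10-stage partition gives the lowest Teff of the three, even though its total latency S * Δt is not the lowest.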


Cycle Quantization

Delays are not integral multiples of the clock period

Total overhead = clocking overhead + quantization overhead

Δt ≥ Tinst / S + c (ignoring k)

∴ S * Δt ≥ Tinst + S * c

Quantization overhead = S * (Δt - c) - Tinst

This reduces as the clock period becomes small
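As a worked illustration, the quantization overhead can be evaluated for the three partitions of the 90 ns example (S = 9, 10, 5 with Pmax = 13, 10, 20 ns). The sketch ignores k, as the formula above does, so Δt = Pmax + c. The function name is made up for illustration.

```python
T_INST = 90.0   # ns, total instruction time of the example
C = 4.0         # ns, fixed clocking overhead


def quantization_overhead(stages, dt, c=C, t_inst=T_INST):
    """S * (Δt - c) - Tinst: time lost because stages are not all full."""
    return stages * (dt - c) - t_inst


for stages, p_max in ((9, 13.0), (10, 10.0), (5, 20.0)):
    dt = p_max + C   # k ignored (assumption carried over from the formula)
    print(f"S = {stages:>2}: quantization overhead = "
          f"{quantization_overhead(stages, dt):.0f} ns")
```

The 9-stage partition wastes the most time, because a 13 ns stage is set by one action pair while several other stages are much shorter.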


Other Timing Approaches

• Self Timed Circuits
  – No centralized free running clock
  – An operation begins as soon as its inputs are available, that is, all its predecessors have completed
  – Higher speed, lower power consumption
• Wave Pipelining
  – Omit inter-stage registers
  – Reduced clocking overhead


Conventional vs Wave Pipelining

Conventional Pipeline
• Registers separate adjoining stages
• Clock period > max prop delay
• Inter-stage data stored in registers

Wave Pipeline
• No registers between adjoining stages
• Clock period less than max prop delay
• Waves of data propagate through combinational network (effectively, data is stored in the combinational circuit delay!)


No pipelining

[Figure: block diagram with two clocked registers (Reg) and input X, and a timing diagram with signal labels X, X' and Y]

Conventional pipelining

[Figure: block diagram with clocked registers (Reg) and input X, and a timing diagram with signal labels X, X', Y, Y', Z, Z' and W]


Wave pipelining

[Figure: block diagram with two clocked registers (Reg) and input X, and a timing diagram with signal labels X, Z' and W]


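Timing

[Figure: combinational circuit (Comb ckt) with input X and output Y between two clocked registers (Reg); timing diagram of X and Y]

p = propagation delay

s = set-up time

T = clock period

T ≥ p + s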


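Timing with clock skew

[Figure: combinational circuit (Comb ckt) with input X and output Y between two clocked registers (Reg); timing diagram of X and Y showing p, s, T and a skew margin δ at each clock edge]

Clock skew = ±δ

T ≥ p + s + 2δ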


Variation in propagation delay

• Different delays in different paths
• Delay variation due to process / temperature / power variations
• Data-dependent delay variations


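Timing for wave pipelining

[Figure: combinational circuit (Comb ckt) with input X and output Y between two clocked registers (Reg); timing diagram showing pmin, pmax, Δp, clock skew ±δ and clock period T]

T ≥ Δp + s + 4δ, where Δp = pmax - pmin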


Timing for wave pipelining (expanded view)

[Figure: timing diagram of X and Y over n clock periods, marking (n-1) T, nT, pmin, pmax, Δp and T]

pmin ≥ (n-1) T + 2δ

nT ≥ pmax + s + 2δ

⇒ T ≥ Δp + s + 4δ
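The last inequality follows by subtracting the first constraint from the second. The step is written out below as a sketch, with Δp = pmax - pmin.

```latex
\begin{align*}
nT &\ge p_{\max} + s + 2\delta \\
(n-1)\,T &\le p_{\min} - 2\delta \\
\Rightarrow\quad T = nT - (n-1)\,T &\ge (p_{\max} + s + 2\delta) - (p_{\min} - 2\delta) \\
&= \Delta p + s + 4\delta, \qquad \Delta p = p_{\max} - p_{\min}
\end{align*}
```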


Comparison

Conventional Pipeline
T ≥ pmax/n + s + 2δ (plus cycle quantization overhead)
nT ≥ pmax + ns + 2nδ

Wave Pipeline
T ≥ Δp + s + 4δ
nT ≥ pmax + s + 2δ
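The two sets of bounds can be compared numerically. The sketch below uses hypothetical delay values (pmin = 36 ns, pmax = 40 ns, s = 1 ns, δ = 0.5 ns) that do not come from the lecture, and the function names are made up for illustration. For wave pipelining it returns the window of clock periods allowed by nT ≥ pmax + s + 2δ and pmin ≥ (n-1) T + 2δ.

```python
def conventional_min_period(p_max, s, delta, n):
    """T >= pmax/n + s + 2δ (cycle quantization overhead ignored)."""
    return p_max / n + s + 2 * delta


def wave_period_window(p_min, p_max, s, delta, n):
    """Return (lowest T, highest T) for n waves in flight, or None if empty."""
    low = (p_max + s + 2 * delta) / n                  # nT >= pmax + s + 2δ
    if n == 1:
        high = float("inf")                            # no earlier wave to collide with
    else:
        high = (p_min - 2 * delta) / (n - 1)           # pmin >= (n-1) T + 2δ
    return (low, high) if low <= high else None


# Hypothetical values in ns, chosen only for illustration.
P_MIN, P_MAX, S, DELTA = 36.0, 40.0, 1.0, 0.5

for n in range(1, 8):
    conv = conventional_min_period(P_MAX, S, DELTA, n)
    wave = wave_period_window(P_MIN, P_MAX, S, DELTA, n)
    print(f"n = {n}: conventional T >= {conv:.2f}, wave window = {wave}")
```

With these numbers the window narrows as n grows and closes once T would have to drop below Δp + s + 4δ = 7 ns, which is one way to see why wave pipelining tolerates only a narrow range of clock frequencies.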


Problems with wave pipelining

• Need to balance delays
• Narrow range of clock frequencies
• Control difficult
• Not very suitable for non-linear pipelines


References

1. M.J. Flynn, "Computer Architecture: Pipelined and Parallel Processor Design", Narosa Publishing House / Jones and Bartlett, 1996.

2. Wayne P. Burleson, Maciej Ciesielski, Fabian Klass, and Wentai Liu, "Wave-Pipelining: A Tutorial and Research Survey", IEEE Trans. on VLSI Systems, vol. 6, no. 3, September 1998, pp. 464–474.
