Computation Gap
Instruction Level Parallelism

A.R. Hurson
Computer Science Department
Missouri University of Science & Technology
[email protected]
The computation gap is defined as the difference between the computational power demanded by the application environments and the computational capability of the existing computers.
Today, one can find many applications which require orders of magnitude more computations than the capability of the so-called super-computers and super-systems.
Some Applications
It is estimated that the so-called Problem Solving and Inference Systems require an environment with computational power on the order of 100 MLIPS to 1 GLIPS (1 LIPS ≈ 100-1000 instructions).
Some Applications
Experiences in Fluid Dynamics have shown that the conventional super-computers can calculate steady 2-dimensional flow in minutes.
However, conventional super-computers require up to 20 hours to handle time dependent 2-dimensional flow or steady 3-dimensional flows on simple objects.
Some Applications
The Numerical Aerodynamics Simulator requires an environment with a sustained speed of 1 billion FLOPS.
The Strategic Defense Initiative requires a distributed, fault tolerant computing environment with a processing rate in the MOPS range.
Some Applications
The U.S. Patent and Trademark Office has a database of size 25 terabytes subject to search and update.
An angiogram department of a mid-size hospital generates more than 64 * 10^11 bits of data a year.
NASA's Earth Observing System will generate more than 11,000 terabytes of data during the 15-year time period of the project.
Performance
CDC STAR-100          25-100 MFLOPS
DAP                   100 MFLOPS
ILLIAC IV             160 MFLOPS
HEP                   160 MFLOPS
CRAY-1                25-80 MFLOPS
CRAY X-MP(1)          210 MFLOPS
CRAY X-MP(4)          840 MFLOPS
CRAY-2                250 MFLOPS
CDC CYBER 200         400 MFLOPS
Hitachi S-810(10)     315 MFLOPS
Hitachi S-810(20)     630 MFLOPS
Fujitsu FACOM VP-50   140 MFLOPS
Fujitsu FACOM VP-100  285 MFLOPS
Fujitsu FACOM VP-200  570 MFLOPS
Fujitsu FACOM VP-400  1,140 MFLOPS
Performance
NEC SX-1              570 MFLOPS
NEC SX-2              1,300 MFLOPS
IBM RP3               1,000 MFLOPS
MPP (8-bit integer)   1,545-6,553 MIPS
MPP (12-bit integer)  795-4,428 MIPS
MPP (16-bit integer)  484-3,343 MIPS
MPP (32-bit FL)       165-470 MIPS
MPP (40-bit FL)       126-383 MIPS
Performance
NEC (Earth Simulator)  35 tera FLOPS
IBM Blue Gene          70 tera FLOPS
Some New Applications
A recent estimate puts the amount of new information generated in 2002 at 5 exabytes (1 exabyte = 10^18 bytes, which is approximately equal to all words ever spoken by human beings), and 92% of this information is stored on hard disks. While a good fraction of this information is of transient interest, useful information of archival value will continue to accumulate.
The TREC database holds around 800 million static pages having 6 trillion bytes of plain text, equal to the size of millions of books.
The Google system routinely accumulates millions of pages of new text information every week.
[Figure: growth of data over time (1970-2000), from binary and text data to image, multimedia, and sensor data.]
Problem
Suppose a machine capable of handling 10^6 characters per second is in hand. How long does it take to search 25 terabytes of data?
25 * 10^12 / 10^6 = 25 * 10^6 sec ≈ 4 * 10^5 min ≈ 7 * 10^3 hours ≈ 290 days
Even if the performance is improved by a factor of 1000, it takes about 8 hours to exhaustively search this database!
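A quick sketch checks this arithmetic (the 10^6 characters/second rate and the 25-terabyte size come from the problem statement above):

```python
# Time to exhaustively scan 25 terabytes at 10**6 characters per second.
data_chars = 25 * 10**12
rate = 10**6                       # characters per second
seconds = data_chars / rate        # 25 * 10**6 seconds
days = seconds / (60 * 60 * 24)
print(f"{seconds:.0f} s = {days:.0f} days")

# Even a 1000x faster scanner still needs hours, not seconds.
hours_fast = seconds / 1000 / 3600
print(f"with 1000x speedup: {hours_fast:.1f} hours")
```

The 1000x case comes out at roughly 7 hours, the same order of magnitude as the slide's "about 8 hours".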
Problem
NOT PRACTICAL!
WHAT ARE THE SOLUTIONS?
Reduce the amount of needed computations (advances in software technology and algorithms).
Improve the speed of the computers:
• Physical Speed (advances in hardware technology).
• Logical Speed (advances in computer architecture/organization).
Architectural Advances of the Uni-processor Organization
The organization of conventional uni-processor systems can be modified in order to remove the existing bottlenecks. For example, the Access Gap is one of the problems in the von Neumann organization.
Access Gap
The access gap is defined as the time difference between the CPU cycle time and the main memory cycle time.
The access gap problem was created by the advances in technology.
Access Gap
In early computers such as the IBM 704, the CPU and main memory cycle times were identical — i.e., 12 μsec.
The IBM 360/195 had a logic delay of 5 nsec per stage, a CPU cycle time of 54 nsec, and a main memory cycle time of .756 μsec; the CDC 7600 had a CPU cycle time and main memory cycle time of 27.5 nsec and 275 nsec, respectively.
System Architecture/Organization
To overcome the technological limitations, computer designers have long been attracted to techniques that are classified under the term "Concurrency".
Concurrency
Concurrency is a generic term which defines the ability of the computer hardware to simultaneously execute many actions at any instant.
Within this general term are several well recognized techniques such as Parallelism, Pipelining, and Multiprocessing.
Concurrency
Although these techniques have the same origin and are often hard to distinguish, in practice they are different in their general approach.
Concurrency
Parallelism achieves concurrency by replicating/duplicating the hardware structure many times, while
Pipelining takes the approach of splitting the function to be performed into smaller pieces, allocating separate hardware to each piece, and overlapping the operations of the pieces.
Concurrent Systems
Classification
• Feng's Classification
• Flynn's Classification
• Handler's Classification
Concurrent Systems
Feng's Classification
• In this classification, the concurrent space is identified as a two dimensional space based on the bit and word multiplicities.
Concurrent Systems
Feng's Classification
[Figure: Feng's space, with bit multiplicity on the horizontal axis and word multiplicity on the vertical axis. Example points: A (1,1); IBM 360 (32,1); C.mmp (16,16); Staran (1,256); MPP (1,16384); and generic points C (1,N) and D (M,N).]
Concurrent Systems
Feng's Classification
• Point A represents a pure sequential machine - i.e., a uni-processor with serial ALU.
• Point B represents a uni-processor with parallel ALU.
• Point C represents a parallel bit slice organization.
• Point D represents a parallel word organization.
Concurrent Systems
Flynn's Classification
• Flynn has classified the concurrent space according to the multiplicity of instruction and data streams:
I = { Single Instruction Stream (SI), Multiple Instruction Stream (MI) }
D = { Single Data Stream (SD), Multiple Data Stream (MD) }
Concurrent Systems
Flynn's Classification
• The Cartesian product of these two sets defines four different classes:
− SISD
− SIMD
− MISD
− MIMD
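The Cartesian product can be enumerated mechanically; a minimal sketch:

```python
from itertools import product

I = ["SI", "MI"]   # single / multiple instruction stream
D = ["SD", "MD"]   # single / multiple data stream

# Cartesian product I x D yields Flynn's four classes.
classes = [i + d for i, d in product(I, D)]
print(classes)   # ['SISD', 'SIMD', 'MISD', 'MIMD']
```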
Concurrent Systems
Flynn's Classification — Revisited
• The MIMD class can be further divided based on:
− Memory structure — global or distributed.
− Communication/synchronization mechanism — shared variable or message passing.
Concurrent Systems
Flynn's Classification — Revisited
• As a result we have four additional classes of computers:
− GMSV — Shared memory multiprocessors
− GMMP — ?
− DMSV — Distributed shared memory
− DMMP — Distributed memory (multi-computers)
Concurrent Systems
Handler’s Classification
• Handler has extended Feng's concurrent space by a third dimension, namely, the number of control units.
• Handler's space is defined as T = (k, d, w):
− k: number of control units,
− d: number of ALUs controlled by a control unit,
− w: number of bits handled by an ALU.
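As an illustrative sketch (not part of the original slides), the example machines from the accompanying figure can be encoded as Handler triples; the product k*d*w is commonly used as a rough measure of the maximum degree of parallelism:

```python
# Handler triples T = (k, d, w): control units, ALUs per control unit, bits per ALU.
machines = {
    "C.mmp":      (16, 1, 16),
    "IBM 360/91": (1, 3, 64),
    "TI ASC":     (1, 4, 64),
    "ILLIAC IV":  (4, 64, 64),
}

def degree_of_parallelism(t):
    """Rough upper bound on bits processed simultaneously."""
    k, d, w = t
    return k * d * w

for name, t in machines.items():
    print(name, t, degree_of_parallelism(t))
```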
Concurrent Systems
Handler’s Classification
[Figure: Handler's three-dimensional space, with axes k (number of control units), d (number of ALUs), and w (word length); example points: C.mmp (16,1,16), IBM 360/91 (1,3,64), TI ASC (1,4,64), and ILLIAC IV* (4,64,64).]
Concurrent Systems
Handler’s Classification
• Point (1,1,1) represents a von Neumann machine with serial ALU.
• Point (1,1,M) represents a von Neumann machine with parallel ALU.
Concurrent Systems
Handler’s Classification
• To represent pipelining at different levels - i.e., macro pipeline, instruction pipeline, and arithmetic pipeline - as well as diversity, sequentiality, and flexibility/adaptability, the original Handler scheme has been extended by three variables (k΄, d΄, w΄) and three operators (+, *, v).
Concurrent Systems
Handler’s Classification
• k΄ represents macro pipeline
• d΄ represents instruction pipeline
• w΄ represents arithmetic pipeline
• + represents diversity (parallelism)
• * represents sequentiality (pipelining)
• v represents flexibility/adaptability
Concurrent Systems
Handler’s Classification
• According to the extended Handler's scheme:
ILLIAC IV: (1*1, 1*1, 48*1) * (1*1, 64*1, 64*1)
           Front-end (B 6700)    Array Processor
CDC 7600: (1*1, 1*9, 60*1)
CDC Star: (1*1, 2*1, 64*4)
Concurrent Systems
Handler’s Classification
• According to the extended Handler's scheme:
DAP: (1*1, 1*1, 32*1) * [ (1*1, 128*1, 32*1) v (1*1, 4096*1, 1*1) ]
     front-end              Array Processor
Questions
What are the motivations behind the classification of the computer systems?
What are the shortcomings of the aforementioned classification schemes?
Can you propose a new classification scheme?
Why classify computer architecture?
Generalize and identify the characteristics of different systems.
Group machines with common architectural features:
• To study systems more easily.
• To transfer solutions more easily.
Why classify computer architecture?
Better estimate the weak and strong points of a system:
• To utilize a system more effectively.
Anticipate future trends and developments:
• Research directions.
Goals of a classification scheme
Categorize all existing and foreseeable designs.
Differentiate different designs.
Assign an architecture to a unique class.
Summary
Computation Gap
How to reduce the Computation Gap:
• Advances in Software and Algorithms
• Advances in Technology
• Advances in Computer Organization/Architecture
Concurrency
Classification:
• Feng
• Flynn/Extended MIMD
• Handler
Concurrent Systems
[Figure: evolution of concurrent systems. Scalar machines progress from sequential execution to lookahead, I/E overlap, and functional parallelism (multiple functional units or pipelining); pipelined machines offer implicit or explicit vector support (memory-to-memory or register-to-register); SIMD splits into associative processors and processor arrays; MIMD splits into multi-computers and multiprocessors; VLIW and SuperScalar designs also appear.]
Concurrent Systems
We group concurrent systems into two groups:
• Control Flow
• Data Flow
Concurrent Systems
In the control flow model of computation, execution of an instruction activates the execution of the next instruction.
In the data flow model of computation, availability of the data activates the execution of the next instruction(s).
Concurrent Systems
[Figure: taxonomy. Concurrent Systems divide into Control Flow and Data Flow. Control Flow covers Parallel Systems (ensemble, SIMD array, associative), Multiprocessor Systems (loosely coupled, tightly coupled), and Pipelined Systems (linear/feedback, unifunction/multifunction, static/dynamic). Data Flow covers Data Driven Systems (static, dynamic) and Demand Driven Systems.]
Concurrent Systems
Within the scope of the control flow systems we distinguish three classes — namely:
• Parallel Systems
• Pipeline Systems
• Multiprocessors
Concurrent Systems
This distinction is due to the exploitation of concurrency and the interrelationships among the control unit, processing elements, and memory modules in each group.
Multiprocessor Systems
Multiprocessor systems can be grouped into two classes:
• Tightly Coupled (Central Memory — not scalable)
• Loosely Coupled (Distributed Memory — scalable)
Multiprocessor Systems
Tightly Coupled (Central Memory — not Scalable): shared memory modules are separated from the processors by an interconnection network or a multiport interface. All processors have equal access time to all memory words. Therefore, the memory access time (assuming no conflict) is independent of the module being accessed (C.mmp, HEP, Encore's Multimax, Cedar, and the NYU Ultracomputer).
Multiprocessor Systems
Loosely Coupled (Distributed Memory — Scalable): each processor has a local-public memory. Each processor can directly access its memory module, but all other accesses to non-local memory modules must be made through an interconnection network. The access time varies with the location of the memory module (Cm*, BBN Butterfly, and Dash).
Multiprocessor Systems
Besides the higher throughput, multiprocessor systems offer more reliability, since failure in any one of the redundant components can be tolerated through system reconfiguration.
Multiprocessor organization is a logical extension of the parallel system — i.e., the array of processors organization. However, the degree of freedom associated with the processors is much higher than it is in an array of processors.
Multiprocessor Systems
The independence of the processors and the sharing of resources among the processors — both desirable features — are achieved at the expense of an increase in complexity at both the hardware and software levels.
Multi-computer Systems
A multi-computer system is a collection of processors interconnected by a message passing network.
Each processor is an autonomous computer having its own local memory and communicating with the others through the message passing network (iPSC, nCUBE, and Mosaic).
Pipeline Systems
The term pipelining refers to a design technique that introduces concurrency by taking a basic function to be invoked repeatedly in a process and partitioning it into several sub-functions with the following properties:
Pipeline Systems
• Evaluation of the basic function is equivalent to some sequential evaluation of the sub-functions.
• Other than the exchange of inputs and outputs, there is no interrelationship between sub-functions.
• Hardware may be developed to execute each sub-function.
• The execution times of these hardware units are usually approximately equal.
Pipeline Systems
Under the aforementioned conditions, the speed up from pipelining equals the number of pipe stages.
However, stages are rarely balanced and, furthermore, pipelining does involve some overhead.
Pipeline Systems
The concept of pipelining can be implemented at different levels. With regard to this issue, one can address:
• Arithmetic Pipelining
• Instruction Pipelining
• Processor Pipelining
Pipeline Systems
Non-pipelined instruction cycle (each instruction completes all of its steps before the next one starts):
Inst. i    IF ID EX MEM WB
Inst. i+1                  IF ID EX MEM WB
Inst. i+2                                  IF ID EX ...
Pipeline Systems
Pipelined instruction cycle:
Inst. i    IF ID EX MEM WB
Inst. i+1     IF ID EX MEM WB
Inst. i+2        IF ID EX MEM WB
Inst. i+3           IF ID EX MEM WB
A pipelined instruction cycle gives a peak performance of one instruction every step.
Pipeline Systems — Example
Assume an unpipelined machine has a 10 ns clock cycle. It requires four clock cycles for the ALU and branch operations and five clock cycles for the memory reference operations. Calculate the average instruction execution time, if the relative frequencies of these operations are 40%, 20%, and 40%, respectively.
Ave. instr. exec. time = 10 * [ (40% + 20%) * 4 + 40% * 5 ] = 44 ns
Pipeline Systems — Example
Now assume we have a pipelined version of this machine. Furthermore, due to the clock skew and setup, pipelining adds 1 ns overhead to the clock time. Ignoring the latency, now calculate the average instruction execution time.
Ave. instr. exec. time = 10 + 1 = 11 ns, and
Speed up = 44/11 = 4
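The two calculations above can be reproduced directly; a minimal sketch using the example's numbers:

```python
clock = 10  # ns
# ALU and branch ops take 4 cycles, memory references 5; mix is 40%, 20%, 40%.
unpipelined = clock * ((0.4 + 0.2) * 4 + 0.4 * 5)   # average time per instruction
pipelined = clock + 1                               # 1 ns skew/setup overhead
speedup = unpipelined / pipelined
print(unpipelined, pipelined, speedup)   # 44.0 11 4.0
```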
Pipeline Systems — Example
Assume that the times required for the five units in an instruction cycle are 10 ns, 8 ns, 10 ns, 10 ns, and 7 ns. Further, assume that pipelining adds 1 ns overhead. Find the speed up factor:
Ave. instr. exec. time (unpipelined) = 10 + 8 + 10 + 10 + 7 = 45 ns
Ave. instr. exec. time (pipelined) = 10 + 1 = 11 ns
Speed up = 45/11 ≈ 4.1
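A sketch of the same calculation; the fifth stage time is taken as 7 ns so that the stage times sum to the slide's 45 ns:

```python
stages = [10, 8, 10, 10, 7]   # ns per stage (last value inferred from the 45 ns total)
overhead = 1                  # ns pipeline register/skew overhead

unpipelined = sum(stages)             # 45 ns per instruction
pipelined = max(stages) + overhead    # clock is set by the slowest stage: 11 ns
speedup = unpipelined / pipelined
print(round(speedup, 1))   # 4.1
```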
Pipeline Systems
Issues of concern:
• Overlapping operations should not overcommit resources — every pipe stage is active on every clock cycle.
• All operations in a pipe stage should complete in one clock, and any combination of operations should be able to occur at once.
Pipeline Systems
Pipeline systems can be further classified as:
• Linear Pipe / Feedback Pipe
• Scalar Pipe / Vector Pipe
• Uni-function Pipe / Multifunction Pipe
• Statically/Dynamically Configured Pipe
Pipeline Systems
Uni-function Pipeline: the pipeline is dedicated to a specific function — the CRAY-1 has 12 dedicated pipelines.
Multifunction Pipeline: the pipeline system may perform different functions either at different times or at the same time — the TI ASC has four multifunction pipelines reconfigurable for a variety of arithmetic operations.
Pipeline Systems
Static Pipeline: the pipeline system assumes one configuration at a time.
Dynamic Pipeline: the pipeline system allows several functional configurations to exist simultaneously.
Pipeline Systems — Example
In a multifunction pipe of 5 stages, calculate the speed-up factor for
Y = Σ_{i=1}^{n} A(i) * B(i)
Pipeline Systems — Example
Product terms will be generated in (n - 1) + 5 steps.
Additions will be performed in:
[5 + (n/2 - 1)] + [5 + (n/4 - 1)] + ... + [5 + (1 - 1)] ≈ 4 log2 n + n steps.
Speed-up ratio:
S = 5(2n - 1) / (2n + 4 log2 n + 4) ≈ 5 for large n
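A quick sketch evaluates the slide's speed-up expression and shows it approaching 5 (the pipeline depth) as n grows:

```python
import math

def speedup(n):
    # S = 5(2n - 1) / (2n + 4*log2(n) + 4), the slide's expression
    return 5 * (2 * n - 1) / (2 * n + 4 * math.log2(n) + 4)

for n in (8, 64, 1024, 2**20):
    print(n, round(speedup(n), 3))
# S approaches 5 as n grows
```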
Pipeline Systems
A concept known as a hazard is a major concern in a pipeline organization.
A hazard prevents the pipeline from accepting data at the maximum rate that the staging clock might support.
Pipeline Systems
A hazard can be of three types:
• Structural Hazard: arises from resource conflicts when the hardware cannot support all possible combinations of instructions in simultaneous overlapped execution — two different pieces of data attempt to use the same stage at the same time.
• Data-Dependent Hazard: arises when an instruction depends on the result of a previous instruction — the pass through a stage is a function of the data value.
Pipeline Systems
• Control Hazard: arises from the pipelining of instructions that affect the PC — Branch.
Pipeline Systems
Structural Hazard (assume a single-memory pipeline system):
Inst. i    IF ID EX MEM WB
Inst. i+1     IF ID EX MEM WB
Inst. i+2        IF ID EX MEM WB
Inst. i+3           IF ID EX MEM WB
Inst. i+4              Stall IF ID EX MEM WB
With a single memory, the MEM access of an earlier instruction and the IF of a later one collide, forcing the stall.
Pipeline Systems
Data Hazard
• A data hazard is created whenever there is a dependence between instructions, and they are close enough that the overlap caused by pipelining would change the order of access to an operand.
ADD R1, R2, R3
SUB R4, R1, R5
Pipeline Systems
Data Hazard
• A data hazard can be resolved with a simple forwarding technique — if the forwarding hardware detects that the previous ALU operation has written to a source register of the current ALU operation, the control logic selects the forwarded result as the ALU input rather than the value read from the register file.
Pipeline Systems
Data Hazard — Classification
• Assume i and j are two instructions and j is a successor of i; then one could expect three types of data hazard:
− Read after write (RAW)
− Write after write (WAW)
− Write after read (WAR)
Pipeline Systems
Data Hazard — Classification
• Read after write (RAW) — j reads a source before i writes it (flow dependence).
• Write after write (WAW) — j writes into the same destination as i does (output dependence), e.g., in a pipe with two memory stages:
LW  R1, 0(R2)   IF ID EX MEM1 MEM2 WB
Add R1, R2, R3     IF ID EX   WB
Pipeline Systems
Data Hazard — Classification
• Write after read (WAR) — j writes into a source of i (anti dependence), e.g.:
SW  0(R1), R2   IF ID EX MEM1 MEM2 WB
Add R2, R4, R3     IF ID EX   WB
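These three cases can be detected mechanically from each instruction's read and write register sets. The sketch below is illustrative only; it is not any particular machine's interlock logic:

```python
def classify_hazards(i, j):
    """i and j are (writes, reads) register sets; j is issued after i."""
    i_w, i_r = i
    j_w, j_r = j
    hazards = []
    if j_r & i_w:
        hazards.append("RAW")   # j reads what i writes (flow dependence)
    if j_w & i_w:
        hazards.append("WAW")   # both write the same destination (output dependence)
    if j_w & i_r:
        hazards.append("WAR")   # j overwrites a source of i (anti dependence)
    return hazards

# ADD R1, R2, R3 followed by SUB R4, R1, R5 (the earlier slide's pair)
add = ({"R1"}, {"R2", "R3"})
sub = ({"R4"}, {"R1", "R5"})
print(classify_hazards(add, sub))   # ['RAW']
```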
Pipeline Systems
Data Hazard — Forwarding
• One can use the concept of data forwarding to overcome stall(s) due to a data hazard.
Add R1, R2, R3    IF ID EX MEM WB
Sub R4, R1, R5       IF ID EX MEM WB
And R6, R1, R7          IF ID EX MEM WB
OR  R8, R1, R9             IF ID EX MEM WB
XOR R10, R1, R11              IF ID EX MEM ...
Pipeline Systems
Data Hazard — Forwarding
• In some cases data forwarding does not work:
LW  R1, 0(R2)   IF ID EX MEM WB
Sub R4, R1, R5     IF ID EX MEM WB      (forwarding does not work: the load result is not ready in time)
Add R6, R1, R7        IF ID EX MEM WB   (forwarding works)
OR  R8, R1, R9           IF ID EX MEM WB
Pipeline Systems
Data Hazard — Stalling
• In cases where data forwarding does not work, the pipe has to be stalled:
LW  R1, A        IF ID EX MEM WB
Add R4, R1, R7      IF ID EX MEM WB
Sub R5, R1, R8         IF ID EX MEM WB
And R6, R1, R7            IF ID EX MEM WB
Pipeline Systems
Data Hazard — Stalling
LW  R1, A        IF ID EX MEM WB
Add R4, R1, R7      IF ID Stall EX MEM WB
Sub R5, R1, R8         IF Stall ID EX MEM
And R6, R1, R7            Stall IF ID EX
Pipeline Systems
Data Hazard — Stalling
• The pipeline interlock detects a hazard and stalls the pipeline until the hazard is cleared.
• This delay cycle — the bubble or pipeline stall — allows the load data to be generated at the time it is needed by the instruction.
Pipeline Systems
Data Hazard — Stalling
• Let us look at A = B + C:
LW  R1, B       IF ID EX MEM WB
LW  R2, C          IF ID EX MEM WB
Add R3, R1, R2        IF ID Stall EX MEM WB   (stall needed to allow the load of C to complete; the result is forwarded)
ST  A, R3                IF Stall ID EX MEM WB
Pipeline Systems
Data Hazard — Example
• Assume 30% of the instructions are loads, and half the time the instruction following a load instruction depends on the result of the load. If the hazard creates a single-cycle delay, how much faster is the ideal pipelined machine?
CPI(ideal) = 1
CPI(new) = (.7 * 1 + .3 * 1.5) = 1.15
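A sketch reproducing the CPI calculation above:

```python
load_frac = 0.30   # fraction of instructions that are loads
dep_frac = 0.50    # following instruction uses the load result half the time
stall = 1          # single-cycle delay per such hazard

cpi_ideal = 1.0
cpi_new = cpi_ideal + load_frac * dep_frac * stall
print(cpi_new, cpi_new / cpi_ideal)   # 1.15: the ideal machine is 1.15x faster
```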
Pipeline Systems
Data Hazard — Pipeline Scheduling or Instruction Scheduling
• The compiler attempts to schedule the pipeline to avoid the stalls by rearranging the code sequence to eliminate the hazard — software support to avoid data hazards.
• Sometimes, if the compiler cannot schedule the interlocks, a no-op instruction may be inserted.
Pipeline Systems
Data Hazard — Pipeline Scheduling or Instruction Scheduling
• Let us look at the following sequence of instructions:
a = b + c
d = e - f
Pipeline Systems
Data Hazard — Pipeline Scheduling or Instruction Scheduling
Load Rb, b
Load Rc, c
ADD Ra, Rb, Rc
Store a, Ra
Load Re, e
Load Rf, f
SUB Rd, Re, Rf
Store d, Rd
[Figure: pipeline timing of the above sequence; the ADD must wait for the load of c and the SUB for the load of f. Scheduling removes the stalls by interleaving the loads of the two statements.]
Pipeline Systems
Control Hazard
• If instruction i is a successful branch, then the PC is changed at the end of the MEM phase. This means stalling the next instructions for three clock cycles.
Pipeline Systems
Control Hazard
[Figure: pipeline timing of a branch; the instructions after the branch are stalled until the branch resolves at the end of MEM.]
Pipeline Systems
Control Hazard — Observations
• Three clock cycles are wasted for every branch.
• However, the above sequence is not even possible, since we do not know the nature of the instruction until after instruction i + 1 is fetched.
Pipeline Systems
Control Hazard — Solution
• Still, the performance penalty is severe.
• What are the solution(s) to speed up the pipeline?
Pipeline Systems
Control Hazard — Reducing pipeline branch penalties
• Detect, earlier in the pipeline, whether or not the branch is successful.
• For a successful branch, calculate the value of the PC earlier.
• It should be noted that these solutions come at the expense of extra hardware.
Pipeline Systems
Control Hazard — Reducing pipeline branch penalties
• Freeze the pipeline — hold any instruction after the branch until the branch destination is known. Easy to enforce.
• Assume unsuccessful branch — continue to fetch instructions as if the branch were a normal instruction. If the branch is taken, then stop the pipeline and restart the fetch.
Pipeline Systems
Control Hazard — Reducing pipeline branch penalties
• Assume the branch is successful — as soon as the target address is calculated, fetch and execute instructions at the target.
• Delayed Branch — software attempts to make the successor instruction valid and useful.
Pipeline Systems
Control Hazard — Reducing pipeline branch penalties
• Assume the branch is not successful:
Untaken branch Inst.  IF ID EX MEM WB
Inst. i + 1              IF ID EX MEM WB
Inst. i + 2                 IF ID EX MEM WB
Inst. i + 3                    IF ID EX MEM WB
Inst. i + 4                       IF ID EX MEM WB
Pipeline Systems
Control Hazard — Reducing pipeline branch penalties
[Figure: timing when the branch is taken under the assume-not-taken scheme; the fetched instructions are flushed and fetching restarts at the target.]
Pipeline Systems
Structural Hazard
• For Statically Configured pipelines, one can predict precisely when a structural hazard may occur, and hence it is possible to schedule the pipeline so that collisions do not occur.
Pipeline Systems
Structural Hazard
• Let Si (1 ≤ i ≤ n) denote a stage of a pipeline that performs a well defined subtask with a delay time ti.
• Define latency as the minimum time elapsed between the initiation of two processes. Therefore, for a linear pipeline the latency is Max(ti), 1 ≤ i ≤ n.
Pipeline Systems
Structural Hazard
• A Reservation Table represents the flow of data through the pipeline for one complete evaluation of a given function.
• It is a table which shows the activation of the pipeline stages at each moment of time.
Pipeline Systems
Structural Hazard
• Assume the following pipeline:
[Figure: a feedback pipeline with stages S0-S6; the evaluation makes a 2nd and 3rd pass through some of the stages (S4, S5). Stage delays: ti = t for 0 ≤ i ≤ 6, except t3 = 2t.]
Pipeline Systems
Structural Hazard
• The following is the reservation table of the aforementioned pipeline organization:
[Reservation table: rows S0-S6, columns t0-t12; an x marks each time step at which a stage is active. The exact markings are not recoverable from this copy; the table yields the forbidden list L = (4, 8, 1, 3, 5) used below.]
Pipeline Systems
Structural Hazard
• For a given pipeline organization, one can always derive its unique reservation table. However, different pipeline organizations might have the same reservation table.
Pipeline Systems
Structural Hazard
• A pipeline is statically configured if it assumes the same reservation table for each activation.
• A pipeline is multiply configured if the reservation table is one from a set of reservation tables.
• A pipeline is dynamically configured if an activation does not have a predetermined reservation table.
Pipeline Systems
Structural Hazard
• A collision occurs if two or more activations attempt to use the same pipeline segment simultaneously.
• A collision will occur if reservation tables are offset by l time units and activation of the same pipeline segment overlaps.
• l is called a forbidden latency — two activations should not be initiated l time units apart.
Pipeline Systems
Structural Hazard
• Given a pipeline system, one can define the forbidden list L as the set of forbidden latencies:
L = (l1, l2, ..., ln)
• For a given L we can define the collision vector C as:
C = (Cn, Cn-1, ..., C1), where n = Max(lj), lj in L, and
Ci = 1 if i is in L
Ci = 0 otherwise
Pipeline Systems
Structural Hazard
• The collision vector can be interpreted as follows: an initiation is allowed at every time unit i such that Ci = 0. This allows us to build a finite state diagram of possible initiations.
• The initial state is the collision vector, and each state is represented by a combination of the collision vectors which have led to such a state.
Pipeline Systems
Structural Hazard
• In the case of our example we have:
L = (4, 8, 1, 3, 5)
C = 10011101
• The initial state 10011101 indicates that we can have an initiation at time 2, 6, or 7. If we have a new initiation at time 2, then the finite state machine would be in state (00100111) OR (10011101) = 10111111. Following such a procedure we then have:
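The collision-vector bookkeeping reduces to bit operations; this sketch reproduces the slide's numbers (bit Ci set means latency i is forbidden, and the next state is a right shift by the chosen latency followed by an OR with C):

```python
def collision_vector(forbidden):
    """Pack a forbidden-latency list into an n-bit collision vector."""
    n = max(forbidden)
    c = 0
    for l in forbidden:
        c |= 1 << (l - 1)          # bit i-1 represents latency i
    return c, n

def next_state(state, latency, c, n):
    # Shift the current state right by the chosen latency, then OR in C.
    return ((state >> latency) | c) & ((1 << n) - 1)

L = [4, 8, 1, 3, 5]
c, n = collision_vector(L)
print(format(c, f"0{n}b"))         # 10011101

# Latencies whose bit is 0 are permissible initiations.
allowed = [i for i in range(1, n + 1) if not (c >> (i - 1)) & 1]
print(allowed)                     # [2, 6, 7]

print(format(next_state(c, 2, c, n), f"0{n}b"))   # 10111111
```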
Pipeline Systems
Structural Hazard
[Figure: the state diagram built from the initial collision vector 10011101; arcs are labeled with the chosen latencies.]
Pipeline Systems
Structural Hazard
• From the state diagram one can then design a controller to regulate the initiation of the activations.
• In a state diagram:
− A simple cycle is a cycle in which each state appears only once.
− The average latency of a cycle is the sum of its latencies divided by the number of states in the cycle.
− The greedy cycle is a cycle that always minimizes the latency between the current initiation and the very next initiation.
Pipeline Systems — Example
For a 5-stage pipeline characterized by the following reservation table:
[Reservation table: 5 stages (rows 1-5) by 8 time steps (columns 1-8); x marks stage activity. The exact markings are not recoverable from this copy.]
a) Determine the forbidden list and collision vector.
b) Draw the state diagram and determine the minimal average latency and the maximum throughput.
Pipeline Systems
Multifunction Pipeline
• The scheduling method for static uni-function pipelines can be generalized for multifunction pipelines.
• A pipeline that can perform P distinct functions can be classified by P overlaid reservation tables.
Pipeline Systems
Multifunction Pipeline
• Each task to be initiated can be associated with a function tag to identify the reservation table to be used.
• In this case a collision may occur between two or more tasks with the same function tag or with distinct function tags.
Pipeline Systems
Multifunction Pipeline
• For example, the following reservation table characterizes a 3-stage 2-function pipeline:
[Reservation table: 3 stages (rows 1-3) by 5 time steps (columns 0-4); entries are marked A, B, or AB according to which function uses the stage at each time. The exact placement is not recoverable from this copy; it yields the cross-collision vectors given below.]
Pipeline Systems
Multifunction Pipeline
• A forbidden set of latencies for a multifunction pipeline is the collection of collision-causing latencies.
• A cross-collision vector marks the forbidden latencies between the functions - i.e., vAB represents the forbidden latencies between A and B. Therefore, for a P function pipeline one can define P^2 cross-collision vectors.
• The P^2 cross-collision vectors can be represented by P collision matrices.
Pipeline Systems
Multifunction Pipeline
• For our example we have:
Cross Collision Vectors:
vAA = (0110)   vAB = (1011)
vBA = (1010)   vBB = (0110)
Collision Matrices: each function's collision matrix stacks its cross-collision vectors (the matrices themselves are not recoverable from this copy).
Pipeline Systems
Multifunction Pipeline
• Similar to the uni-function pipeline, one can use the collision matrices in order to construct a state diagram.
Pipeline Systems
Multifunction Pipeline
[Figure: state diagram for the 2-function pipeline; states are collision matrices such as 0110/1010 and 1011/0110, and arcs are labeled with the initiated function and latency (e.g., A1, A3, A4, B1, B4, B5+, A5+, and B1,B3).]
Pipeline Systems
Vector Processors
• A vector processor is equipped with multiple vector pipelines that can be concurrently used under hardware or firmware control.
• There are two groups of pipeline processors:
− Memory-to-Memory Architecture
− Register-to-Register Architecture
Pipeline Systems
Vector Processors
• Memory-to-Memory architecture supports the pipelined flow of vector operands directly from the memory to the pipelines and then back to the memory (Cyber 205).
• Register-to-Register architecture uses vector registers as operands for the functional pipelines (the Cray series, which uses fixed-size registers, and the Fujitsu VP-2000 series, which uses reconfigurable vector registers).
Pipeline Systems
Vector Processors
• Vector machines allow efficient use of pipelines while reducing memory latency and pipeline scheduling penalties.
• Computations on vector elements are mainly independent from each other — lack of data hazards.
Pipeline Systems
Vector Processors
• A vector instruction is equivalent to a loop, which implies:
− Smaller program size, hence reducing the instruction bandwidth requirement.
− A smaller number of control hazards.
• Vector instructions initiate a regular operand fetch pattern — allowing efficient use of memory interleaving and efficient address calculations.
Pipeline Systems
Vector Processors — Vector Stride
• What if adjacent elements of a vector operand are not positioned in sequence in the memory?
• The distance separating elements that ought to be merged into a single vector is called the stride.
• Almost all vector machines allow access to vectors with any constant stride. Some constant strides may cause memory-bank conflicts.
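Stride can be illustrated with a row-major matrix stored as one flat array: the elements of a column lie a constant stride (the row length) apart. The 4x3 dimensions below are hypothetical:

```python
rows, cols = 4, 3
# Row-major layout: element (i, j) lives at flat index i*cols + j.
flat = [i * cols + j for i in range(rows) for j in range(cols)]  # values 0..11

def gather(base, stride, count):
    """Fetch 'count' elements starting at 'base', 'stride' apart."""
    return [flat[base + k * stride] for k in range(count)]

row0 = gather(0, 1, cols)      # unit stride: a row
col0 = gather(0, cols, rows)   # stride = row length: a column
print(row0, col0)              # [0, 1, 2] [0, 3, 6, 9]
```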
Pipeline Systems
Vector Processors — Chaining
• Chaining allows a vector operation to start as soon as the individual elements of its vector source operand become available.
• Results of the first functional unit (pipeline) in the chain are forwarded to the second functional unit.
Pipeline Systems
Efficient Use of Vector Processors — Memory-to-Memory Organization
• Increase the vector size — if possible:
− Change the nesting order of the loop,
− Convert multidimensional arrays into one-dimensional arrays,
− Rearrange data into unconventional forms so that smaller vectors may be combined into a single large vector.
• Perform as many operations on an input vector as possible before storing the result vector back into the main memory.
Pipeline Systems
Efficient Use of Vector Processors — Memory-to-Memory Organization
• Changing the nesting order of the loop:
Do I = 1, 100
  A(I, 1:60) = 0
End
becomes
Do J = 1, 60
  A(1:100, J) = 0
End
Pipeline Systems
Efficient Use of Vector Processors — Memory-to-Memory Organization
• Convert multidimensional arrays into one-dimensional arrays:
Do I = 1, 100
  A(I, 1:60) = 0
End
becomes
A(1:6000) = 0
Pipeline Systems
Efficient Use of Vector Processors — Register-to-Register Organization
• Values often used in a program should be kept in internal registers.
• Perform as many operations on an input vector as possible before storing the result vector back in the main memory.
• Organize vectors into sections of size equal to the length of the vector registers — strip mining.
• Convert multidimensional arrays into one-dimensional arrays.
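Strip mining can be sketched as a loop over fixed-length sections, one "vector instruction" per section; VLEN below is a hypothetical register length:

```python
VLEN = 64   # hypothetical vector-register length

def strip_mined_sum(a, b):
    """Add two long vectors one VLEN-element strip at a time."""
    out = []
    for start in range(0, len(a), VLEN):
        # Each strip would map onto one vector add on the hardware.
        out.extend(x + y for x, y in zip(a[start:start + VLEN],
                                         b[start:start + VLEN]))
    return out

a = list(range(1000))
b = list(range(1000))
c = strip_mined_sum(a, b)
print(c[:3], len(c))   # [0, 2, 4] 1000
```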
Question
Compare and contrast Memory-to-Memory and Register-to-Register pipeline systems against each other.