PIPELINE PROCESSING
PARALLEL PROCESSING
• A parallel processing system is able to perform concurrent data processing to achieve a faster execution time.
• The system may have two or more ALUs and be able to execute two or more instructions at the same time.
A computer employing parallel processing is also called a parallel computer.
Parallel processing classification
Single instruction stream, single data stream – SISD
Single instruction stream, multiple data stream – SIMD
Multiple instruction stream, single data stream – MISD
Multiple instruction stream, multiple data stream – MIMD
Types of Parallel Computers
• Based on architectural configuration:
– Pipelined computers
– Array processors
– Multiprocessor systems
Architectural Classification

                                      Number of Data Streams
                                      Single        Multiple
Number of            Single          SISD          SIMD
Instruction Streams  Multiple        MISD          MIMD
– Flynn's classification
• Based on the multiplicity of Instruction Streams and Data Streams
• Instruction Stream
– Sequence of instructions read from memory
• Data Stream
– Operations performed on the data in the processor
COMPUTER ARCHITECTURES FOR PARALLEL PROCESSING

• Von Neumann based
– SISD: superscalar processors, superpipelined processors, VLIW (Very Long Instruction Word architecture)
– MISD: nonexistent in practice
– SIMD: array processors, systolic arrays, associative processors
– MIMD:
  • Shared-memory multiprocessors: bus based, crossbar switch based, multistage IN based
  • Message-passing multicomputers: hypercube, mesh, reconfigurable
• Dataflow
• Reduction
PIPELINE PROCESSING
• Pipelining is a technique of overlapping the execution of several instructions to reduce the execution time of a set of instructions.
• A pipeline is a cascade of processing stages which are linearly connected to perform a fixed function over a stream of data flowing from one end to the other.
Advantages of Pipeline Processing
• Reduced overall processing time for a stream of tasks
• Increased throughput
Types of Pipeline Models
• Asynchronous pipeline
• Synchronous pipeline
In both models, external inputs are fed into the first stage. The processed results are passed from stage Si to stage Si+1 for all i = 1, 2, …, k−1.
The final result appears in stage Sk.
Asynchronous Pipeline Model
• Data flow between adjacent stages is controlled by a handshaking protocol.
• When stage Si is ready to transmit data, it sends a ready signal to stage Si+1. After stage Si+1 receives the incoming data, it returns an ACK signal to stage Si.
• Advantages
– Useful for designing communication channels in message-passing multicomputers
• Disadvantages
– Variable throughput
– Different amounts of delay may occur in different stages
Synchronous pipeline
• Clocked latches are used to interface between stages; the latches isolate inputs from outputs.
• Upon arrival of a clock pulse, all latches transfer data to the next stage simultaneously.
• Advantage
– Equal delay in all stages
[Diagram: a 4-stage synchronous pipeline — stages S1–S4 separated by latches R1–R4, input fed in on the left, with a common clock driving all latches]
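The simultaneous latch transfer described above can be illustrated with a small Python sketch (not from the original slides; the four stage functions are arbitrary placeholders). On each clock tick, every stage computes from its current latch contents, and only then are all latches updated at once, mimicking edge-triggered latches.

```python
# Minimal sketch of a 4-stage synchronous pipeline (hypothetical stage
# functions). On every clock tick all latches transfer simultaneously.

def clock_tick(latches, stages, new_input):
    """Advance the pipeline by one clock: compute all stage outputs
    from the current latch contents, then commit them at once."""
    outputs = [f(x) if x is not None else None for f, x in zip(stages, latches)]
    # New latch contents: fresh input enters stage 1, each stage's
    # output moves to the next stage's latch; the last output exits.
    return [new_input] + outputs[:-1], outputs[-1]

stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x ** 2]
latches = [None] * 4
inputs = [1, 2, 3, 4]
results = []
for cycle in range(8):                      # feed 4 inputs, then drain
    data_in = inputs[cycle] if cycle < len(inputs) else None
    latches, out = clock_tick(latches, stages, data_in)
    if out is not None:
        results.append(out)

print(results)  # [1, 9, 25, 49]: each input emerges after 4 clocks
```

Each result is ((x + 1) * 2 − 3)², and once the pipeline fills, one result appears per clock, which is the point of the synchronous model.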
Instruction Execution steps
• Instruction Fetch (IF) from main memory (MM)
• Instruction Decoding (ID)
• Operand Fetch (OF), if any
• Execution of the decoded instruction (EX)
Non-pipelined computer
– 6 stages: Instruction Fetch, Instruction Decode, Operand Address Calculate, Operand Fetch, Execute, Write Result
Space-Time Diagram

Clock cycles  1   2   3   4   5   6   7   8   9
Segment 1:    T1  T2  T3  T4  T5  T6
Segment 2:        T1  T2  T3  T4  T5  T6
Segment 3:            T1  T2  T3  T4  T5  T6
Segment 4:                T1  T2  T3  T4  T5  T6
Pipelined Computer

Stages/Time  1   2   3   4   5   6   7
IF           I1  I2  I3  I4
ID               I1  I2  I3
OF                   I1  I2  I3
EX                       I1  I2  I3
In the first cycle instruction I1 is fetched from memory. In the second cycle another instruction I2 is fetched from memory and simultaneously I1 is decoded by the instruction decoding unit.
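The overlap described above can be generated mechanically. The following sketch (stage names assumed from the slides; an ideal pipeline with no stalls) builds the cycle-by-cycle schedule: instruction i enters stage s in cycle i + s.

```python
# Sketch of an ideal 4-stage instruction pipeline schedule (no stalls).

STAGES = ["IF", "ID", "OF", "EX"]

def schedule(n_instructions):
    """Return {cycle: {stage: instruction}} for an ideal pipeline."""
    table = {}
    for i in range(n_instructions):
        for s, name in enumerate(STAGES):
            cycle = i + s + 1                # 1-indexed clock cycles
            table.setdefault(cycle, {})[name] = f"I{i + 1}"
    return table

tab = schedule(4)
for cycle in sorted(tab):
    row = "  ".join(f"{st}:{tab[cycle].get(st, '--')}" for st in STAGES)
    print(f"cycle {cycle}: {row}")
# In cycle 2, I2 is fetched while I1 is decoded, exactly the overlap
# described above; I1 finishes EX in cycle 4, and I4 in cycle 7.
```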
INSTRUCTION PIPELINE

Execution of Three Instructions in a 4-Stage Pipeline

Conventional (sequential):
i:     FI DA FO EX
i+1:               FI DA FO EX
i+2:                           FI DA FO EX

Pipelined:
i:     FI DA FO EX
i+1:      FI DA FO EX
i+2:         FI DA FO EX
PIPELINING

A technique of decomposing a sequential process into suboperations, with each suboperation being executed in a special dedicated segment that operates concurrently with all other segments.

Example: Ai * Bi + Ci for i = 1, 2, 3, …, 7

Segment 1: R1 ← Ai, R2 ← Bi        Load Ai and Bi
Segment 2: R3 ← R1 * R2, R4 ← Ci   Multiply and load Ci
Segment 3: R5 ← R3 + R4            Add
[Diagram: Segment 1 loads Ai and Bi from memory into R1 and R2; Segment 2 feeds R1 and R2 into a multiplier (result in R3) and loads Ci into R4; Segment 3 adds R3 and R4 into R5]
OPERATIONS IN EACH PIPELINE STAGE

Clock    Segment 1        Segment 2             Segment 3
Pulse    R1     R2        R3         R4         R5
1        A1     B1
2        A2     B2        A1 * B1    C1
3        A3     B3        A2 * B2    C2         A1 * B1 + C1
4        A4     B4        A3 * B3    C3         A2 * B2 + C2
5        A5     B5        A4 * B4    C4         A3 * B3 + C3
6        A6     B6        A5 * B5    C5         A4 * B4 + C4
7        A7     B7        A6 * B6    C6         A5 * B5 + C5
8                         A7 * B7    C7         A6 * B6 + C6
9                                               A7 * B7 + C7
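The table above can be reproduced by simulating the three segments in Python (a sketch with assumed sample operand values; the register names R1–R5 model the segment latches). Each clock pulse, every segment operates on the register contents latched on the previous pulse.

```python
# Sketch of the 3-segment pipeline computing Ai * Bi + Ci for i = 1..7.
# Operand values are arbitrary sample data, not from the slides.

A = [1, 2, 3, 4, 5, 6, 7]
B = [10, 20, 30, 40, 50, 60, 70]
C = [5, 5, 5, 5, 5, 5, 5]

R1 = R2 = R3 = R4 = R5 = None
results = []
for clock in range(len(A) + 2):          # 7 loads + 2 cycles to drain
    # Compute the "next" register values from the OLD contents first,
    # so all segments operate concurrently like latched hardware.
    next_R5 = R3 + R4 if R3 is not None else None
    next_R3 = R1 * R2 if R1 is not None else None
    next_R4 = C[clock - 1] if 1 <= clock <= len(C) else None
    next_R1, next_R2 = (A[clock], B[clock]) if clock < len(A) else (None, None)
    R1, R2, R3, R4, R5 = next_R1, next_R2, next_R3, next_R4, next_R5
    if R5 is not None:
        results.append(R5)

print(results)  # [15, 45, 95, 165, 255, 365, 495] = Ai*Bi + Ci
```

All seven results emerge in 9 clock pulses, matching the table: 2 pulses to fill the pipeline, then one result per pulse.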
INSTRUCTION EXECUTION IN A 4-STAGE PIPELINE

Step:            1   2   3   4   5   6   7   8   9   10  11  12  13
Instruction 1:   FI  DA  FO  EX
            2:       FI  DA  FO  EX
(Branch)    3:           FI  DA  FO  EX
            4:               FI  --  --  FI  DA  FO  EX
            5:                               FI  DA  FO  EX
            6:                                   FI  DA  FO  EX
            7:                                       FI  DA  FO  EX
[Flowchart: four-segment instruction pipeline]
Segment 1: Fetch instruction from memory
Segment 2: Decode instruction and calculate effective address; Branch? — if yes, update PC and empty the pipe, then fetch the next instruction
Segment 3: Fetch operand from memory
Segment 4: Execute instruction; Interrupt? — if yes, perform interrupt handling, update PC and empty the pipe; if no, continue with the next instruction
Example: 6 tasks, divided into 4 segments

Clock:      1   2   3   4   5   6   7   8   9
Segment 1:  T1  T2  T3  T4  T5  T6
Segment 2:      T1  T2  T3  T4  T5  T6
Segment 3:          T1  T2  T3  T4  T5  T6
Segment 4:              T1  T2  T3  T4  T5  T6
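The diagram above can be checked numerically: a k-segment pipeline finishes n tasks in k + (n − 1) cycles, versus n × k cycles without pipelining. A short sketch (not from the slides):

```python
# Completion time of an ideal k-segment pipeline processing n tasks.

def pipeline_cycles(n_tasks, k_segments):
    """k cycles for the first task, then one more task per cycle."""
    return k_segments + (n_tasks - 1)

n, k = 6, 4
print(pipeline_cycles(n, k))            # 9 cycles, matching the diagram
print(n * k)                            # 24 cycles without pipelining
print(n * k / pipeline_cycles(n, k))    # speedup of about 2.67
```

As n grows, the speedup approaches k, the number of segments.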
Pipeline Performance
• Latency
– The amount of time a single operation takes to execute.
• Throughput
– The rate at which operations get executed (operations/second or operations/cycle).

Non-pipelined processor: throughput = 1 / latency
Pipelined processor: throughput > 1 / latency
• Cycle time of a pipelined processor
– Depends on 4 factors:
• Cycle time of the unpipelined processor
• Number of pipeline stages
• How evenly the datapath logic is divided among the stages
• Latency of the pipeline latches
• If the logic is evenly divided, the clock period of the pipelined processor is
  Cycle Time_pipelined = (Cycle Time_unpipelined / No. of pipeline stages) + pipeline latch latency
• If the logic cannot be evenly divided, the clock period of the pipelined processor is
  Cycle Time = longest pipeline stage + pipeline latch latency
• Latency of the pipeline = cycle time of the pipeline × no. of pipeline stages
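The formulas above translate directly into code; a minimal sketch (times in ns, function names are my own):

```python
# Cycle time and latency of an evenly divided pipeline (times in ns).

def pipelined_cycle_time(unpipelined_cycle, n_stages, latch_latency):
    """Evenly divided logic: T = T_unpipelined / n + latch latency."""
    return unpipelined_cycle / n_stages + latch_latency

def pipeline_latency(cycle_time, n_stages):
    """Total latency = cycle time x number of stages."""
    return cycle_time * n_stages

t = pipelined_cycle_time(25, 5, 1)
print(t)                         # 6.0 ns cycle time
print(pipeline_latency(t, 5))    # 30.0 ns total latency
```

Note the trade-off the formulas capture: more stages shrink the cycle time toward the latch latency, but the total latency grows because the fixed latch overhead is paid once per stage.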
Questions
• An unpipelined processor has a cycle time of 25 ns. What is the cycle time of a pipelined version of the processor with 5 evenly divided pipeline stages, if each pipeline latch has a latency of 1 ns? What is the total latency of the pipeline? How about if the processor is divided into 50 pipeline stages?
Solution
Given data: Cycle Time_unpipelined = 25 ns
No. of pipeline stages = 5
Pipeline latch latency = 1 ns

Cycle Time_pipelined = (Cycle Time_unpipelined / No. of pipeline stages) + latch latency
= (25 / 5) + 1 ns = 6 ns
Therefore, cycle time of the 5-stage pipeline = 6 ns
Latency of the pipeline = cycle time of the pipeline × no. of pipeline stages
= 6 ns × 5 = 30 ns
For the 50-stage pipeline, cycle time = (25 ns / 50) + 1 ns = 1.5 ns
Therefore, cycle time of the 50-stage pipeline = 1.5 ns
Latency of the pipeline = 1.5 ns × 50 = 75 ns
Questions
• Suppose an unpipelined processor with a 25 ns cycle time is divided into 5 pipeline stages with latencies of 5, 7, 3, 6 and 4 ns. If the pipeline latch latency is 1 ns, what is the cycle time of the pipelined processor? What is the latency of the resulting pipeline?
Solution
• Here the logic is unevenly divided among the stages.
• The longest pipeline stage is 7 ns.
• Pipeline latch latency = 1 ns
• Cycle time = longest pipeline stage + pipeline latch latency
= 7 + 1 = 8 ns
Therefore, the cycle time of the pipelined processor = 8 ns
There are 5 pipeline stages.
Total latency = cycle time of the pipeline × no. of pipeline stages
= 8 ns × 5 = 40 ns
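The uneven-stage rule used above is also easy to encode; a sketch (times in ns, names are my own):

```python
# Cycle time when stage latencies differ: the clock must accommodate
# the slowest stage plus the latch overhead (times in ns).

def uneven_cycle_time(stage_latencies, latch_latency):
    return max(stage_latencies) + latch_latency

stages = [5, 7, 3, 6, 4]
cycle = uneven_cycle_time(stages, 1)
print(cycle)                  # 8 ns: slowest stage (7) + latch (1)
print(cycle * len(stages))    # 40 ns total latency
```

The faster stages (3–6 ns) simply idle for part of each 8 ns cycle, which is why even division of the datapath logic matters.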
Question
• Suppose that an unpipelined processor has a cycle time of 25 ns and that its datapath is made up of modules with latencies of 2, 3, 4, 7, 3, 2 and 4 ns (in that order). In pipelining this processor, it is not possible to rearrange the order of the modules (for example, putting the register-read stage before the instruction-decode stage) or to divide a module into multiple pipeline stages (for complexity reasons). Given pipeline latches with 1 ns latency, what is the minimum cycle time that can be achieved by pipelining this processor?
Solution
• There is no limit on the number of pipeline stages.
• The minimum cycle time =
Latency of the longest module in the datapath + Pipeline latch time
= 7 + 1 ns
= 8 ns
Question
• Given an unpipelined processor with a 10 ns cycle time and pipeline latches with 0.5 ns latency:
a. What are the cycle times of pipelined versions of the processor with 2, 4, 7 and 16 stages if the datapath logic is evenly divided among the pipeline stages?
b. What is the latency of each of the pipelined versions of the processor?
c. How many stages of pipelining are required to achieve a cycle time of 2 ns? Of 1 ns?
Solution – a
Given data: Cycle Time_unpipelined = 10 ns; no. of pipeline stages = 2, 4, 7 and 16; pipeline latch latency = 0.5 ns

Cycle Time_pipelined = (Cycle Time_unpipelined / No. of pipeline stages) + latch latency

Cycle time for the 2-stage pipeline = (10 ns / 2) + 0.5 = 5.5 ns
Cycle time for the 4-stage pipeline = (10 ns / 4) + 0.5 = 3 ns
Cycle time for the 7-stage pipeline = (10 ns / 7) + 0.5 = 1.42857 + 0.5 = 1.92857 ns
Cycle time for the 16-stage pipeline = (10 ns / 16) + 0.5 = 0.625 + 0.5 = 1.125 ns
Solution – b
• Latency of each pipeline = cycle time of the pipeline × no. of pipeline stages
Latency for the 2-stage pipeline = 5.5 × 2 = 11 ns
Latency for the 4-stage pipeline = 3 × 4 = 12 ns
Latency for the 7-stage pipeline = 1.92857 × 7 ≈ 13.5 ns
Latency for the 16-stage pipeline = 1.125 × 16 = 18 ns
Solution – c
• First solve for the number of pipeline stages. Rearranging
  Cycle Time_pipelined = (Cycle Time_unpipelined / No. of pipeline stages) + latch latency
gives
  No. of pipeline stages = Cycle Time_unpipelined / (Cycle Time_pipelined − latch latency)
= 10 ns / (2 ns − 0.5 ns) = 10 / 1.5 = 6.6667
Therefore, the number of pipeline stages required to achieve a 2 ns cycle time is 7 (rounded up, since a fractional number of pipeline stages is not possible).
Similarly, the number of pipeline stages required to achieve a 1 ns cycle time = 10 ns / (1 ns − 0.5 ns) = 10 / 0.5 = 20 stages.
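The rearranged formula, with the round-up step, can be sketched as follows (times in ns; the function name is my own):

```python
# Number of evenly divided stages needed to hit a target cycle time:
# n = T_unpipelined / (T_target - latch latency), rounded up because
# a fractional number of pipeline stages is not possible.

import math

def stages_needed(unpipelined_cycle, target_cycle, latch_latency):
    return math.ceil(unpipelined_cycle / (target_cycle - latch_latency))

print(stages_needed(10, 2, 0.5))    # 7 stages for a 2 ns cycle
print(stages_needed(10, 1, 0.5))    # 20 stages for a 1 ns cycle
```

The formula also shows a hard limit: no number of stages can push the cycle time at or below the latch latency itself.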
Pipeline Hazards
• Pipelining increases processor performance.
– When several instructions are overlapped in the pipeline, the cycle time can be reduced, increasing the rate at which instructions are executed.
– There are a number of factors that limit a pipeline's ability to execute instructions at its peak rate, including dependencies between instructions, branches, and the time required to access memory.
Types of Hazards
• Instruction Hazards
• Structural Hazards
• Control Hazards
• Branches
Hazards in Pipelining
• Procedural dependencies => Control hazards
– conditional and unconditional branches, calls/returns
• Data dependencies => Data hazards
– RAW (read after write)
– WAR (write after read)
– WAW (write after write)
• Resource conflicts => Structural hazards
– use of the same resource in different stages
Instruction Hazards
– Occur when instructions read/write registers that are used by other instructions.
• RAR (read after read)
– Occurs when 2 instructions both read from the same register; this imposes no ordering, so it is not a true hazard.
– Example:
  ADD R1, R2, R3
  SUB R4, R5, R3
• RAW Hazards
– Occur when an instruction reads a register that was written by a previous instruction.
– Example:
  ADD R1, R2, R3
  SUB R4, R5, R1
• WAR Hazards
– Occur when the output register of an instruction has been read by a previous instruction.
– Example:
  ADD R1, R2, R3
  SUB R2, R5, R6
• WAW Hazards
– Occur when the output register of an instruction has been written by a previous instruction.
– Example:
  ADD R1, R2, R3
  SUB R1, R5, R6
STRUCTURAL HAZARDS
Structural Hazards
A structural hazard occurs when the processor's hardware is not capable of executing all the instructions in the pipeline simultaneously.
Example: with one memory port, a data fetch and an instruction fetch cannot be initiated in the same clock cycle.
The pipeline is stalled for a structural hazard: two loads contend for one memory port. A two-port memory would serve both without a stall.
i (load):  FI DA FO EX
i+1:          FI DA FO EX
i+2:             --    --    FI DA FO EX   (stalled while the load uses the memory port)
Control Hazards
• The delay between when a branch instruction enters the pipeline and when the next instruction can enter the pipeline is called the processor's branch delay; this is a control hazard.
• The delay is mainly due to the control flow of the program.
CONTROL HAZARDS
Branch Instructions
– The branch target address is not known until the branch instruction has completed.
– Stall -> wasted clock cycles

Branch instruction:  FI DA FO EX
Next instruction:                FI DA FO EX
                                 (target address available after EX)
Branches
• Branch instructions can also cause delays in a pipelined processor because the processor cannot determine which instruction to fetch next until the branch has executed.
• A conditional branch instruction creates a dependency between the branch instruction and the instruction fetch stage of the pipeline.
Question
A. Identify all of the RAW hazards in this instruction sequence:
DIV R2, R5, R8
SUB R9, R2, R7
ASH R5, R14, R6
MUL R11, R9, R5
BEQ R10, #0, R12
OR R8, R15, R2
B. Identify all of the WAR hazards in the sequence.
C. Identify all of the WAW hazards in the sequence.
D. Identify all of the control hazards in the sequence.
Solution
A) RAW hazards exist between the following instructions:
– Between DIV and SUB (R2)
– Between ASH and MUL (R5)
– Between SUB and MUL (R9)
– Between DIV and OR (R2)
B) WAR hazards exist between the following instructions:
– Between DIV and ASH (R5)
– Between DIV and OR (R8)
C) There are no WAW hazards.
D) There is only one control hazard: between BEQ and OR.
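The data-hazard portion of this answer can be checked mechanically. The sketch below (my own encoding, not from the slides) represents each instruction as (op, destination, source1, source2), with None marking a non-register operand such as the immediate in BEQ, and compares every pair of instructions; control hazards are not detected this way, since they follow from BEQ alone.

```python
# Naive pairwise RAW/WAR/WAW detector for the instruction sequence
# from the question. It flags every dependent pair, ignoring any
# intervening writes, which is adequate for a short sequence.

instrs = [
    ("DIV", "R2",  "R5",  "R8"),
    ("SUB", "R9",  "R2",  "R7"),
    ("ASH", "R5",  "R14", "R6"),
    ("MUL", "R11", "R9",  "R5"),
    ("BEQ", None,  "R10", "R12"),
    ("OR",  "R8",  "R15", "R2"),
]

raw, war, waw = [], [], []
for i, (op_i, d_i, *s_i) in enumerate(instrs):
    for j in range(i + 1, len(instrs)):
        op_j, d_j, *s_j = instrs[j]
        if d_i is not None and d_i in s_j:
            raw.append((op_i, op_j))    # later instr reads earlier's result
        if d_j is not None and d_j in s_i:
            war.append((op_i, op_j))    # later instr writes a reg earlier reads
        if d_i is not None and d_i == d_j:
            waw.append((op_i, op_j))    # both write the same register

print("RAW:", raw)   # DIV->SUB, DIV->OR, SUB->MUL, ASH->MUL
print("WAR:", war)   # DIV->ASH, DIV->OR
print("WAW:", waw)   # none
```

The output matches the solution above: four RAW hazards, two WAR hazards, and no WAW hazards.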