Processor Level Parallelism

Improving the Pipeline

• Pipelined processor
  – Ideal speedup = number of stages
  – Branches / conflicts mean limited returns after a certain point

• Instruction Level Parallelism
  – Ability to run multiple instructions at the same time
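A rough way to see why deeper pipelines give limited returns is a simple cycles-per-instruction model. The sketch below is an illustration only: the function names, the 10% flush rate, and the assumption that a flush costs (stages - 1) cycles are all hypothetical, not taken from the slides.

```python
def pipeline_speedup(num_stages: int, stall_cycles_per_instr: float = 0.0) -> float:
    """Speedup over an unpipelined design under a simple CPI model.

    Assumes the unpipelined design needs num_stages cycles per instruction
    and the pipelined design needs 1 cycle plus its average stall cycles.
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    return num_stages / (1.0 + stall_cycles_per_instr)


def speedup_with_flushes(num_stages: int, flush_fraction: float) -> float:
    """Same model, assuming a flush throws away (num_stages - 1) cycles."""
    stalls = flush_fraction * (num_stages - 1)
    return pipeline_speedup(num_stages, stalls)


# Ideal case: no stalls, so speedup equals the number of stages.
print(pipeline_speedup(5))

# Hypothetical workload where 10% of instructions cause a pipeline flush.
for stages in (5, 10, 20, 40):
    print(stages, round(speedup_with_flushes(stages, 0.10), 2))
```

Under these assumptions the speedup can never exceed 1 / flush_fraction, no matter how many stages are added.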

Superscalar

• Superscalar : capable of running multiple instructions at a time
  – Multiple execution units
    • Widen slowest part of pipeline

Superscalar

• Multi-issue : Start multiple instructions per clock
  – Parallel pipes

Superscalar

• Multi-issue pipeline feeding multiple execution units

Superscalar

• Issue: Dependency issues just got MUCH harder…
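To make the dependency problem concrete, here is a minimal sketch of the check a 2-wide issue stage has to make before starting two instructions in the same clock. `Instr` and `can_dual_issue` are hypothetical names, and the check is deliberately conservative: it blocks on any register overlap, whereas real hardware typically removes write-after-read and write-after-write conflicts with register renaming.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    dest: Optional[str]        # register written, or None if nothing is written
    srcs: Tuple[str, ...]      # registers read


def can_dual_issue(first: Instr, second: Instr) -> bool:
    """True if `second` may issue in the same cycle as `first` (program order)."""
    raw = first.dest is not None and first.dest in second.srcs    # read after write
    waw = first.dest is not None and first.dest == second.dest    # write after write
    war = second.dest is not None and second.dest in first.srcs   # write after read
    return not (raw or waw or war)


add = Instr("r1", ("r2", "r3"))   # r1 = r2 + r3
sub = Instr("r4", ("r1", "r5"))   # r4 = r1 - r5, needs the result of add
mul = Instr("r6", ("r7", "r8"))   # r6 = r7 * r8, unrelated to add

print(can_dual_issue(add, sub))   # False
print(can_dual_issue(add, mul))   # True
```

With N instructions considered per cycle, every pair has to be checked, which is one reason the hardware grows complex as the issue width increases.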

Superscalar Pro/Con

• Good
  – The hardware solves everything:
    • Hardware solves scheduling/registers/etc…
    • Compiler can still help matters
  – Binary compatibility
    • New hardware issues old instructions in a more efficient way

• Bad
  – Complex hardware
  – Limit to scale

• VLIW : Very Long Instruction Word
  – One instruction contains multiple ops

• Instructions VERY large
  – 240 bits?
  – Wasted space addressed by bundles
    • No dependencies within bundle

Who does work?

• Compiler assembles long instructions
  – Reorders at compile time

• Compiler has more time, information
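The compile-time reordering can be sketched as a small greedy scheduler that places each op in the earliest bundle that comes after everything it conflicts with. This is an illustrative toy, not how any particular VLIW compiler works; `Op`, `conflicts`, `pack_bundles`, and the default width of 3 are assumptions made for the example.

```python
from typing import List, NamedTuple, Tuple


class Op(NamedTuple):
    name: str
    dest: str                  # register written
    srcs: Tuple[str, ...]      # registers read


def conflicts(earlier: Op, later: Op) -> bool:
    """Conservative test: any register overlap forces `later` into a later bundle."""
    return (earlier.dest in later.srcs        # read after write
            or earlier.dest == later.dest     # write after write
            or later.dest in earlier.srcs)    # write after read


def pack_bundles(ops: List[Op], width: int = 3) -> List[List[Op]]:
    """Place each op in the earliest non-full bundle after all ops it conflicts with."""
    bundles: List[List[Op]] = []
    placed_in: List[int] = []                 # bundle index of each op, in program order
    for i, op in enumerate(ops):
        earliest = 0
        for j in range(i):
            if conflicts(ops[j], op):
                earliest = max(earliest, placed_in[j] + 1)
        slot = earliest
        while slot < len(bundles) and len(bundles[slot]) >= width:
            slot += 1                         # skip bundles that are already full
        if slot == len(bundles):
            bundles.append([])
        bundles[slot].append(op)
        placed_in.append(slot)
    return bundles


program = [
    Op("a", "r1", ("r2", "r3")),
    Op("b", "r4", ("r1", "r5")),   # needs a
    Op("c", "r6", ("r7", "r8")),   # independent, can move up next to a
    Op("d", "r9", ("r4", "r6")),   # needs b and c
]
for bundle in pack_bundles(program):
    print([op.name for op in bundle])
```

Here op c is hoisted into the first bundle alongside a. Slots left empty in each bundle correspond to the wasted space mentioned above.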

VLIW Uses

• Itanium :
  – EPIC : Explicitly Parallel Instruction Computing
  – 3 instruction bundles

VLIW Pro/Con

• Good
  – Simple hardware
    • Add new functional units with no new scheduling hardware
  – Better optimization in compiler

• Bad
  – Binary compatibility : compiler builds for one specific hardware
  – Good compilers are HARD to write

ARM 15

• Modern CPU:

Processor Parallelism

• Process Parallelism : Run multiple instruction streams simultaneously

Process vs Thread

• Process : Program
  – Own memory space
  – Has at least one thread

Process vs Thread

• Thread : Instruction sequence
  – Own registers/stack
  – Share memory with other threads in process
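The split between per-thread state and shared memory can be seen in a few lines using Python's standard `threading` module. This is a minimal sketch, not the demo from the lecture; `worker`, `counter`, and the thread and iteration counts are made up for illustration.

```python
import threading

counter = 0                  # lives in the process's memory, visible to every thread
lock = threading.Lock()      # protects the shared counter


def worker(n: int) -> None:
    global counter
    local_total = 0          # local variable: lives on this thread's own stack
    for _ in range(n):
        local_total += 1
    with lock:               # only one thread updates shared memory at a time
        counter += local_total


threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)   # 4000
```

Each thread keeps its own `local_total`, while all four update the same `counter`. Note that CPython's global interpreter lock keeps these threads from executing bytecode in parallel, so this shows the memory model rather than a speedup.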

Threaded Code

• Demo…

Context Switching

• Four threads running in 4-wide pipeline
  – Can't always fill all 4 issue slots
  – Have bubbles from memory access, page faults, etc…

Context Switching

• Threads often have bubbles…

Multithreading

• Multithreading : Alternate threads to maximize hardware use
  – Coarse : run until stall, then switch
  – Fine : switch every cycle
  – Either one needs extra hardware
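The two switching policies can be compared with a toy single-issue simulator. Everything here is a simplifying assumption: each thread is a list of instructions, each value is the number of cycles the thread is blocked after issuing that instruction, switching threads is free, and `simulate` and the policy names are hypothetical.

```python
def simulate(threads, policy):
    """Return the cycles needed to issue every instruction on a 1-wide pipeline.

    threads: list of lists; each entry is one instruction, and its value is how
             many cycles the thread stalls after issuing it.
    policy:  "none"   run each thread to completion in turn,
             "coarse" stay on a thread until it stalls, then switch,
             "fine"   rotate to the next ready thread every cycle.
    """
    n = len(threads)
    pc = [0] * n                      # next instruction index per thread
    ready_at = [0] * n                # first cycle each thread may issue again
    current = -1 if policy == "fine" else 0
    remaining = sum(len(t) for t in threads)
    cycle = 0
    while remaining:
        if policy == "none":
            order = [next(t for t in range(n) if pc[t] < len(threads[t]))]
        elif policy == "coarse":
            order = [(current + k) % n for k in range(n)]
        elif policy == "fine":
            order = [(current + 1 + k) % n for k in range(n)]
        else:
            raise ValueError(f"unknown policy: {policy}")
        for t in order:
            if pc[t] < len(threads[t]) and ready_at[t] <= cycle:
                ready_at[t] = cycle + 1 + threads[t][pc[t]]
                pc[t] += 1
                remaining -= 1
                current = t
                break                 # single issue: at most one instruction per cycle
        cycle += 1
    return cycle


# Two hypothetical threads that each stall for 2 cycles on every second instruction.
workload = [[0, 2, 0, 2], [0, 2, 0, 2]]
for policy in ("none", "coarse", "fine"):
    print(policy, simulate(workload, policy))
```

In this model both policies hide most of the stall cycles that the no-switching baseline leaves idle. Because the model charges nothing for a switch, it says nothing about which policy is better on real hardware.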

Multithreading Superscalar

• A 2-instruction wide pipeline with multithreading:
  – Still only one thread per cycle

[Figure: fine-grained vs. coarse-grained multithreading]

• SMT : Simultaneous Multithreading
  – AKA Hyperthreading

• Issue ops from multiple threads in one cycle

• Maximize use of functional units
  – But need to track registers each instruction goes with…
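A small sketch of the slot-filling idea, under heavy simplification: `ready[t][c]` is assumed to be the number of independent ops thread t could issue in cycle c, and unissued ops do not carry over to later cycles. The function names and the sample numbers are hypothetical. Each issued op is tagged with its thread id, standing in for the bookkeeping that tells the hardware which register set an op belongs to.

```python
def one_thread_per_cycle(ready, width):
    """Issue slots used when a single thread (round-robin) owns each cycle."""
    n = len(ready)
    cycles = len(ready[0])
    return sum(min(width, ready[c % n][c]) for c in range(cycles))


def smt_issue(ready, width):
    """Per-cycle list of thread ids, one entry per filled issue slot."""
    schedule = []
    for c in range(len(ready[0])):
        slots = []
        for tid, thread in enumerate(ready):
            take = min(thread[c], width - len(slots))
            slots.extend([tid] * take)        # tag each op with its thread id
        schedule.append(slots)
    return schedule


# Hypothetical ready-op counts for two threads over four cycles, 2-wide issue.
ready = [[1, 0, 2, 1],
         [1, 2, 0, 1]]

print(one_thread_per_cycle(ready, 2))          # slots used out of 8
schedule = smt_issue(ready, 2)
print(schedule)
print(sum(len(s) for s in schedule))           # slots used out of 8
```

This version always gives thread 0 first pick of the slots; a real design would need a fairer policy.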

SMT Challenges

• Resources must be duplicated or split
  – Split too thin hurts performance…
  – Duplicate everything and you aren't maximizing use of hardware…

Intel vs AMD

• Variations on SMT

Getting Faster

• Pipelining helps to a point
• Superscalar/VLIW helps to a point
• SMT helps a bit
• Chips getting faster
• Only so much speedup possible
  – Power = heat
  – Power ∝ C × V² × f
    • C = Capacitance, how well it “stores” a charge
    • V = Voltage
    • f = frequency. I.e., how fast clock is (e.g., 3 GHz)
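A short worked example shows why clock speed alone runs into a power wall. It assumes, as a simplification not stated on the slide, that voltage has to rise in proportion to frequency; the 1.5× clock increase is an arbitrary illustrative number.

```latex
\[
P \propto C V^2 f, \qquad V \propto f \;\Longrightarrow\; P \propto C f^3
\]
\[
\frac{P_{\text{new}}}{P_{\text{old}}}
  = \left(\frac{f_{\text{new}}}{f_{\text{old}}}\right)^{3}
  = 1.5^{3} = 3.375
\]
```

Under that assumption a 50% faster clock costs more than three times the power, and all of it ends up as heat.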

Power Density Prediction circa 2000

[Figure: power density by processor generation (4004, 8008, 8080, 8085, 286, 386, 486, Pentium®, P6, Core 2), 1970 to 2010, compared against a hot plate, a nuclear reactor, a rocket nozzle, and the Sun's surface. Source: S. Borkar (Intel)]

Adapted from UC Berkeley "The Beauty and Joy of Computing"

Moore's Law Related Curves

Going Multi-core Helps Energy Efficiency

• Power of typical integrated circuit ∝ C × V² × f
  – C = Capacitance, how well it “stores” a charge
  – V = Voltage
  – f = frequency. I.e., how fast clock is (e.g., 3 GHz)

William Holt, HOT Chips 2005
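The energy argument for multi-core can be sketched numerically. The figures below are hypothetical: they assume a core can run at 75% of the voltage when clocked at 75% of the frequency, that capacitance per core is unchanged, and that the workload splits perfectly across two cores. `relative_power` is an illustrative helper, with all values relative to a single baseline core.

```python
def relative_power(capacitance: float, voltage: float, frequency: float) -> float:
    """Dynamic power up to a constant factor: C * V^2 * f."""
    return capacitance * voltage ** 2 * frequency


# Baseline: one core at nominal voltage and frequency.
single_power = relative_power(1.0, 1.0, 1.0)
single_throughput = 1.0

# Hypothetical dual-core design: each core at 75% voltage and 75% frequency.
dual_power = 2 * relative_power(1.0, 0.75, 0.75)
dual_throughput = 2 * 0.75      # assumes perfectly parallel work

print(single_power, single_throughput)
print(dual_power, dual_throughput)
print(dual_throughput / dual_power)   # throughput per unit power
```

Under these assumptions the two slower cores deliver more total throughput than the single fast core while drawing less power. The benefit shrinks if the workload does not parallelize or if voltage cannot be lowered.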
