IS360 - High Performance Computing
Basavaraj Talawar, CSE, NITK
Course Syllabus
● Definition, RISC ISA, RISC Pipeline, Performance Quantification
● Instruction Level Parallelism
  – Pipeline Hazards, Combating hazards, Scheduling, Branch Prediction, Superscalar Processors, Out-of-Order execution
● Cache Memory
● VLIW, Vector Processors
● Interconnection Networks
● Topics of current research
Course Structure
● Textbook
  – Hennessy and Patterson, Computer Architecture: A Quantitative Approach, MK, 5th ed.
● References
  – Shen & Lipasti, Modern Processor Design
● About Course
  – Quizzes – Week 5, Week 9, Week 13 – 30%
– Assignments – 20%
– Mini Project – 20%
– Final Exam – 30%
Course Objective
● Identify the trade-offs involved in designing a multiprocessor
● Why?
  – Improve the execution time of programs executed on such machines.
Definition
● Computer Architecture
– Specific requirements of the target machine
– ISA design
– Cache and memory hierarchy
– I/O, storage, disk
– Multi-processors, networked systems
– Max performance, within constraints: cost, power, availability
Computer Architecture
Computer architecture is the design of the
abstraction/implementation layers that allow
us to execute information processing applications
efficiently using manufacturing technologies
Application
Algorithm
Programming Language
Operating System/Virtual Machines
Instruction Set Architecture
Microarchitecture
Register-Transfer Level
Gates
Circuits
Devices
Physics

David Wentzlaff, ELE 475 – Computer Architecture, Princeton University
Wikipedia: Moore's Law

Single Processor Performance
[Figure: growth of single-processor performance over time, marking the RISC era and the move to multi-processor – Hennessy & Patterson, CA-QA, 5ed. MK, 2013]
Intel Sandy Bridge – Successor to i7
http://images.bit-tech.net/content_images/2011/01/intel-sandy-bridge-review/sandy-bridge-die-map.jpg
● Instruction Level Parallelism: Superscalar, Very Long Instruction Word (VLIW)
● Long Pipelines (Pipeline Parallelism)
● Advanced Memory and Caches
● Data Level Parallelism: Vector, GPU
● Thread Level Parallelism: Multithreading, Multiprocessor, Multicore, Manycore
Architecture vs. Microarchitecture
● Architecture
– Instruction Set Architecture
– Programmer visible state (Memory & Register)
– Operations (Instructions and how they work)
– Execution Semantics (interrupts)
– Input/Output
– Data Types/Sizes
● Microarchitecture/Organization
  – Tradeoffs on how to implement the ISA for some metric (Speed, Energy, Cost)
– Examples: Pipeline depth, number of pipelines, cache size, silicon area, peak power, execution ordering, bus widths, ALU widths
Same Architecture, Different Microarchitectures
David Wentzlaff, ELE 475 – Computer Architecture, Princeton University
● AMD Athlon II X4
  – x86 Instruction Set, Quad Core, Out-of-order, 2.9GHz, 125W
  – Decode 3 Instructions/Cycle/Core
  – 64KB L1 I Cache, 64KB L1 D Cache, 512KB L2 Cache
● Intel Atom
  – x86 Instruction Set, Single Core, In-order, 1.6GHz, 2W
  – Decode 2 Instructions/Cycle/Core
  – 32KB L1 I Cache, 24KB L1 D Cache, 512KB L2 Cache
Trends in Technology
● Integrated circuit technology
  – Transistor density: 35%/year
  – Die size: 10-20%/year
  – Integration overall: 40-55%/year
● DRAM capacity: 25-40%/year (slowing)
● Flash capacity: 50-60%/year
  – 15-20X cheaper/bit than DRAM
● Magnetic disk technology: 40%/year
  – 15-25X cheaper/bit than Flash
  – 300-500X cheaper/bit than DRAM
Dynamic Energy & Power
● Dynamic Energy
  – Capacitive load × Voltage²
● Dynamic power
  – ½ × Capacitive load × Voltage² × Frequency switched
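As a minimal sketch (not part of the slides), the two formulas above can be evaluated directly; the capacitance, voltage, and frequency values below are illustrative, not measured:

```python
def dynamic_energy(cap_load, voltage):
    """Energy per switching transition: C * V^2 (joules)."""
    return cap_load * voltage ** 2

def dynamic_power(cap_load, voltage, freq_switched):
    """Average dynamic power: 1/2 * C * V^2 * f (watts)."""
    return 0.5 * cap_load * voltage ** 2 * freq_switched

# Illustrative values: 1 pF load, 1.0 V supply, 1 GHz switching rate.
e = dynamic_energy(1e-12, 1.0)      # 1e-12 J per transition
p = dynamic_power(1e-12, 1.0, 1e9)  # 0.5 mW
print(e, p)
```

Note the quadratic dependence on voltage: lowering the supply from 1.0 V to 0.8 V cuts dynamic power to 64% at the same frequency, which is why voltage scaling is so effective.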
Reducing Power
● Techniques for reducing power:
  – Do nothing well
  – Dynamic Voltage-Frequency Scaling
  – Low power state for DRAM, disks
  – Overclocking, turning off cores
Static Power
● Static power consumption
  – Current_static × Voltage
  – Scales with number of transistors
  – To reduce: Power Gating
Pipelining and Performance Recap
Operations and Operands
[Figure: processor block diagram – control, ALU with inputs i1, i2 and output o, register file, and memory]
Machine Models
[Figure: four machine models – STACK (operands at top-of-stack, TOS), ACCUMULATOR, REGISTER-MEMORY, and REGISTER-REGISTER – each built around an ALU, registers, and memory]

C = A + B

Stack:
  Push A
  Push B
  Add
  Pop C

Accumulator:
  Load A
  Add B
  Store C

Register-Memory:
  Load R1, A
  Add R3, R1, B
  Store R3, C

Register-Register:
  Load R1, A
  Load R2, B
  Add R3, R1, R2
  Store R3, C
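To make the stack model concrete, here is a hypothetical interpreter for the C = A + B sequence above; the mnemonics and name-to-value memory are illustrative assumptions, not a real ISA:

```python
def run_stack_machine(program, memory):
    """Execute (op, operand) pairs against an operand stack and a name->value memory."""
    stack = []
    for op, arg in program:
        if op == "Push":      # push Mem[arg] onto the stack
            stack.append(memory[arg])
        elif op == "Add":     # pop two operands, push their sum
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "Pop":     # pop top of stack into Mem[arg]
            memory[arg] = stack.pop()
    return memory

mem = {"A": 2, "B": 3}
run_stack_machine([("Push", "A"), ("Push", "B"), ("Add", None), ("Pop", "C")], mem)
print(mem["C"])  # 5
```

Note how the Add names no operands at all: both sources and the destination are implicit in the stack, which is exactly what distinguishes this model from the register-register sequence.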
Addressing Modes
● Where do operands come from?

Register:          Add R4, R3, R2          Regs[R4] <- Regs[R3] + Regs[R2]
Immediate:         Add R4, R3, #5          Regs[R4] <- Regs[R3] + 5
Displacement:      Add R4, R3, 100(R1)     Regs[R4] <- Regs[R3] + Mem[100 + Regs[R1]]
Register Indirect: Add R4, R3, (R1)        Regs[R4] <- Regs[R3] + Mem[Regs[R1]]
Absolute:          Add R4, R3, (0x475)     Regs[R4] <- Regs[R3] + Mem[0x475]
Memory Indirect:   Add R4, R3, @(R1)       Regs[R4] <- Regs[R3] + Mem[Mem[Regs[R1]]]
PC-relative:       Add R4, R3, 100(PC)     Regs[R4] <- Regs[R3] + Mem[100 + PC]
Scaled:            Add R4, R3, 100(R1)[R5] Regs[R4] <- Regs[R3] + Mem[100 + Regs[R1] + Regs[R5] * 4]
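A small sketch of how a few of these modes compute the second source operand; the register and memory contents are made-up illustrative values:

```python
regs = {"R1": 100, "R3": 7, "R5": 2}
mem = {100: 11, 200: 22, 0x475: 33, 908: 44}

def operand(mode, regs, mem, imm=None):
    """Fetch the second source operand under a given addressing mode."""
    if mode == "register":            # Regs[R1]
        return regs["R1"]
    if mode == "immediate":           # constant encoded in the instruction
        return imm
    if mode == "displacement":        # Mem[imm + Regs[R1]]
        return mem[imm + regs["R1"]]
    if mode == "register_indirect":   # Mem[Regs[R1]]
        return mem[regs["R1"]]
    if mode == "scaled":              # Mem[imm + Regs[R1] + Regs[R5]*4]
        return mem[imm + regs["R1"] + regs["R5"] * 4]

print(operand("displacement", regs, mem, imm=100))  # Mem[100 + 100] = Mem[200] = 22
print(operand("scaled", regs, mem, imm=800))        # Mem[800 + 100 + 2*4] = Mem[908] = 44
```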
ISA Encoding
● Fixed Width
  – E.g.: RISC Architectures: MIPS, PowerPC, SPARC, ARM
● Variable Length
  – E.g.: CISC Architectures: IBM 360, x86, Motorola 68K, VAX, …
● Mostly Fixed or Compressed
  – E.g.: MIPS16, THUMB (only two formats: 2 and 4 bytes)
● Very Long Instruction Words
  – Multiple instructions in a fixed-width bundle
  – E.g.: Multiflow, HP/ST Lx, TI C6000
Example – MIPS64 ISA
● RISC, load-store architecture, simple addressing modes
● 32-bit instructions, fixed format
● 32 64-bit GPRs (R0-R31) and 32 64-bit FPRs (F0-F31)
  – R0 is hardwired to 0.
  – An FPR can also hold a 32-bit float (with the other half unused).
  – "SIMD" extensions operate on more floats in one FPR.
● A few special registers
  – Floating-point status register
● Loads/stores of 8-, 16-, 32-, and 64-bit integers
  – All sign-extended to fill the 64-bit GPR
  – Also 32-bit floats and 64-bit doubles
MIPS64 Addressing Modes
● Register (Arithmetic, Logical ops only)
● Immediate (Arithmetic, Logical) & Displacement (loads/stores only)
  – 16-bit immediate/offset field
  – Register indirect: use 0 as the displacement offset
  – Direct (absolute): use R0 as the displacement base
● Byte-addressed memory, 64-bit address
● Software-settable big-endian/little-endian flag
● Alignment required
MIPS Instructions

Data Transfer Instructions
  Load:  LB, LBU, LH, LHU, LW, LWU, LD, L.S, L.D – e.g. LD R1, 30(R2); L.S F0, 50(R3)
  Store: SB, SH, SW, SD, S.S, S.D – e.g. SH R3, 502(R2); SB R2, 41(R3)
  Move:  MOV.S, MOV.D – e.g. MOV.S F2, F3

Arithmetic/Logical Instructions
  Add, Subtract, Multiply, Divide, …: DADD, DADDI, DSUB, DMUL, DDIV, AND, OR, XOR, LUI, DSLL, SLT
  – e.g. DADDU R1, R2, R3; LUI R1, #43; SLT R1, R2, R3

Control Instructions
  Branch, Jump, Control transfer: BEQZ, BNEZ, BEQ, BNE, J, JR, JAL, JALR, TRAP, ERET
  – e.g. J label; BEQ R1, R2, label; MOVZ R1, R2, R3

Floating Point
  FP Arithmetic: ADD.D, SUB.D, MUL.D, MADD.S
MIPS Instruction Formats
● R-type:
  op (6 bits) | rs (5 bits) | rt (5 bits) | rd (5 bits) | shamt (5 bits) | funct (6 bits)
● I-type:
  op (6 bits) | rs (5 bits) | rt (5 bits) | immediate (16 bits)
● J-type:
  op (6 bits) | offset added to PC (26 bits)
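The fixed field positions above make decoding a matter of shifting and masking. A sketch for the R-type format, using the classic MIPS encoding of `add $8, $9, $10` as a worked example:

```python
def decode_r_type(word):
    """Slice the R-type fields out of a 32-bit instruction word."""
    return {
        "op":    (word >> 26) & 0x3F,  # bits 31..26
        "rs":    (word >> 21) & 0x1F,  # bits 25..21
        "rt":    (word >> 16) & 0x1F,  # bits 20..16
        "rd":    (word >> 11) & 0x1F,  # bits 15..11
        "shamt": (word >> 6)  & 0x1F,  # bits 10..6
        "funct":  word        & 0x3F,  # bits 5..0
    }

# add $8, $9, $10: op=0, rs=9, rt=10, rd=8, shamt=0, funct=0x20
fields = decode_r_type(0x012A4020)
print(fields)
```

Because every instruction is 32 bits with op in a fixed place, the decoder can read the register fields in parallel with identifying the opcode ("fixed field decoding", as noted in the ID stage later).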
Implementation of RISC ISA - 1
● Instruction Fetch (IF)
[Figure: PC, adder (+4), instruction memory, IR, NPC]
  IR <- Mem[PC]
  NPC <- PC + 4
Implementation of RISC ISA - 2
● Instruction Decode/Register Fetch (ID)
[Figure: IR fields rs, rt, rd indexing the register file; 16-bit immediate sign-extended to 32 bits]
  A <- Regs[rs]
  B <- Regs[rt]
  Imm <- sign-extended immediate field of IR
Implementation of RISC ISA - 3
● Execution/Effective Address (EX)
[Figure: ALU fed by A and, via a MUX, B or Imm, producing ALUOutput]
  Memory Reference:               ALUOutput <- A + Imm
  Register-Register Instruction:  ALUOutput <- A func B
  Register-Immediate Instruction: ALUOutput <- A func Imm
Implementation of RISC ISA – 3 (cont.)
● Execution/Effective Address (EX)
[Figure: as before, plus a MUX selecting NPC and a Zero? test on A producing Cond]
  Memory Reference:               ALUOutput <- A + Imm
  Register-Register Instruction:  ALUOutput <- A func B
  Register-Immediate Instruction: ALUOutput <- A func Imm
  Branch Instruction:             ALUOutput <- NPC + (Imm << 2);  Cond <- (A == 0)
Implementation of RISC ISA - 4
● Memory Access/Branch Completion (MEM)
[Figure: data memory producing LMD; MUX selecting the next PC from NPC or ALUOutput based on Cond]
  Memory Reference:  LMD <- Mem[ALUOutput]  (load)   or   Mem[ALUOutput] <- B  (store)
  Branch:            if (Cond) PC <- ALUOutput, else PC <- NPC
Implementation of RISC ISA - 5
● Write Back (WB)
[Figure: MUX selecting ALUOutput or LMD as the value written into the register file]
  Register-Register Instruction:  Regs[rd] <- ALUOutput
  Register-Immediate Instruction: Regs[rt] <- ALUOutput
  Load Instruction:               Regs[rt] <- LMD
Implementation of RISC ISA - Stages
● Instruction Fetch (IF)
● Instruction Decode/Register Fetch (ID)
  – Fixed field decoding
● Execution/Effective Address (EX)
● Memory Access (MEM)
● Write Back (WB)
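The per-stage register transfers above can be stitched into a rough single-instruction walk through all five stages. The dictionary-based instruction encoding and memory contents are illustrative assumptions, not the real MIPS encoding:

```python
def execute(instr, regs, mem, pc):
    """Run one instruction through IF/ID/EX/MEM/WB; returns the next PC."""
    # IF: IR <- Mem[PC]; NPC <- PC + 4  (instr stands in for the fetched IR)
    npc = pc + 4
    # ID: A <- Regs[rs]; B <- Regs[rt]; Imm <- sign-extended immediate
    a = regs[instr["rs"]]
    b = regs.get(instr.get("rt"), 0)
    imm = instr.get("imm", 0)
    if instr["op"] == "LD":            # load
        alu_out = a + imm              # EX:  ALUOutput <- A + Imm
        lmd = mem[alu_out]             # MEM: LMD <- Mem[ALUOutput]
        regs[instr["rt"]] = lmd        # WB:  Regs[rt] <- LMD
    elif instr["op"] == "DADD":        # register-register ALU op
        alu_out = a + b                # EX:  ALUOutput <- A func B
        regs[instr["rd"]] = alu_out    # WB:  Regs[rd] <- ALUOutput
    return npc

regs = {"R1": 0, "R2": 96, "R3": 5, "R4": 0}
mem = {100: 42}
pc = execute({"op": "LD", "rs": "R2", "rt": "R1", "imm": 4}, regs, mem, 0)
pc = execute({"op": "DADD", "rs": "R1", "rt": "R3", "rd": "R4"}, regs, mem, pc)
print(regs["R4"], pc)  # 47 8
```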
MIPS Datapath
[Figure: the complete datapath – PC with +4 adder and instruction memory (IM); register file indexed by rs/rt/rd with 16-to-32-bit sign-extend; ALU with operand MUXes and Zero? test producing Cond; data memory (DM) producing LMD; and the write-back MUX – partitioned into five stages]

Instruction Fetch (IF) | Instruction Decode/Register Fetch (ID) | Execute/Address Calculation (EX) | Memory Access (MEM) | Write Back (WB)
MIPS Pipeline
Hennessy & Patterson, CA-QA, Appendix C, 5ed. MK, 2013

Time (clock cycles):  1    2    3    4    5    6    7    8    9
i1                    IF   ID   EX   MEM  WB
i2                         IF   ID   EX   MEM  WB
i3                              IF   ID   EX   MEM  WB
i4                                   IF   ID   EX   MEM  WB
...                                       IF   ID   EX   MEM  WB
Example: When will i1000 complete? What is the average number of clock cycles per instruction? If the processor were not pipelined, when would i1000 complete, and what would the average clock cycles per instruction be? Which is faster?
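A sketch answering the exercise above for an ideal 5-stage pipeline with no stalls: once the pipeline fills, one instruction completes every cycle, so instruction i finishes in cycle depth + (i - 1); unpipelined, each instruction occupies all five cycles by itself:

```python
DEPTH = 5  # IF, ID, EX, MEM, WB

def completion_cycle_pipelined(i):
    """Cycle in which instruction i completes: pipeline fill + one per cycle after."""
    return DEPTH + (i - 1)

def completion_cycle_unpipelined(i):
    """Each instruction runs alone for all DEPTH cycles."""
    return DEPTH * i

n = 1000
print(completion_cycle_pipelined(n))        # 1004
print(completion_cycle_unpipelined(n))      # 5000
print(completion_cycle_pipelined(n) / n)    # 1.004 cycles/instruction on average
print(completion_cycle_unpipelined(n) / n)  # 5.0 cycles/instruction
```

The pipelined machine is roughly 5x faster here; the average CPI approaches 1 as n grows, since the 4-cycle fill cost is amortized over all 1000 instructions.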
Measuring Performance
● Metrics: Response Time, Throughput
● Speedup of X relative to Y:
  n = ExecutionTime_Y / ExecutionTime_X = Performance_X / Performance_Y
● Execution time
  – Wall clock time: includes all system overheads
  – CPU time: only computation time
Benchmarks
● Kernels (e.g. Matrix Multiply), Toy programs (e.g. Sorting), Synthetic benchmarks (e.g. Dhrystone)
● Server Benchmarks – SPECWeb, SPECSFS, SPECjbb, SPECvirt_sc
● Embedded Systems Benchmarks – EEMBC, Dhrystone
● Database Server Benchmarks – TPC
● Desktop Benchmarks – SPECint, SPECfp, SPECpower
  – CINT2006: perlbench, bzip2, gcc, sjeng, libquantum, h264ref, etc.
  – CFP2006: bwaves, gamess, zeusmp, leslie3d, povray, calculix, lbm, wrf, sphinx3
● www.spec.org
Measuring Performance

SPECRatio_A = ExecutionTime_reference / ExecutionTime_A

1.25 = SPECRatio_A / SPECRatio_B
     = (ExecutionTime_reference / ExecutionTime_A) / (ExecutionTime_reference / ExecutionTime_B)
     = ExecutionTime_B / ExecutionTime_A
     = Performance_A / Performance_B
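A quick numeric check of the identity above on made-up execution times: the reference time cancels, so the ratio of SPECRatios equals the performance ratio regardless of the reference machine chosen:

```python
ref, t_a, t_b = 1000.0, 40.0, 50.0   # illustrative execution times (seconds)

spec_a = ref / t_a   # SPECRatio of machine A: 25.0
spec_b = ref / t_b   # SPECRatio of machine B: 20.0

print(spec_a / spec_b)  # 1.25, matching t_b / t_a: A is 25% faster than B
```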
Measuring Performance
● Processor counters
  – Instructions executed, Clock cycles completed
● Profile-based, static modeling
  – H/w counters, code instrumentation, interpreting the program at instruction level
● Trace-driven simulation
  – Memory references and instruction addresses
● Execution-driven simulation
  – Pipeline activity, data and instruction references
Amdahl's Law
● What is the overall speedup from enhancing the performance of a single block?
● Speedup_enhanced (always > 1)
● Fraction_enhanced (always < 1)

Speedup_enhanced = ExecutionTime_original / ExecutionTime_enhanced

[Figure: program execution timeline, original vs. enhanced – the FP Arithmetic blocks shrink in the enhanced version]
Amdahl's Law

The performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.

ExecutionTime_new = ExecutionTime_old × ((1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)

In the example, FP arithmetic was used for 50% of the program, and the new FP arithmetic block is 3 times faster than the previous version. What is the overall speedup?

Objective: make the program 10 times faster. Suppose 25% of the program is waiting on I/O and cannot be enhanced. How much should the speedup of the enhanced portion be?
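The two exercises above can be worked directly from the formula. A minimal sketch:

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup = 1 / ((1 - F) + F / S)."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# Exercise 1: FP arithmetic is 50% of execution time and becomes 3x faster.
print(amdahl_speedup(0.5, 3))  # 1.5x overall

# Exercise 2: 25% of the program (I/O wait) cannot be enhanced. Even an
# infinitely fast enhancement of the other 75% caps the overall speedup
# at 1 / 0.25 = 4x, so a 10x overall speedup is unreachable.
print(amdahl_speedup(0.75, float("inf")))  # 4.0
```

This is the key lesson of the law: the unenhanced fraction, not the enhancement itself, sets the ceiling on overall speedup.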
Processor Performance Equation

CPU time = Seconds/Program
         = (Instructions/Program) × (Clock cycles/Instruction) × (Seconds/Clock cycle)

CPU clock cycles = Σ (i = 1..n) IC_i × CPI_i

CPI = (CPU clock cycles for a program) / (Instruction Count)
    = Σ (i = 1..n) (IC_i / Instruction Count) × CPI_i
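A sketch evaluating the performance equation on a made-up instruction mix (the counts, per-class CPIs, and clock period below are illustrative):

```python
clock_ns = 1.0   # seconds per clock cycle, expressed here in ns

mix = [          # (IC_i, CPI_i) per instruction class
    (400, 1.0),  # ALU ops
    (300, 2.0),  # loads/stores
    (300, 3.0),  # branches
]

total_ic = sum(ic for ic, _ in mix)
total_cycles = sum(ic * cpi for ic, cpi in mix)  # sum of IC_i * CPI_i
cpi = total_cycles / total_ic                    # weighted-average CPI
cpu_time_ns = total_cycles * clock_ns            # CPU time = cycles * clock period

print(cpi, cpu_time_ns)  # 1.9 1900.0
```

Note that CPI is weighted by each class's share of the instruction count, so shifting the mix toward cheap ALU ops lowers CPU time even with the clock and total IC unchanged.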
Pipeline Performance

An unpipelined processor has a 1ns clock cycle. ALU operations and branches take 4 cycles, and memory operations take 5 cycles; their relative frequencies are 40%, 20%, and 40%. Due to clock skew and setup, pipelining adds 0.2ns of overhead to the clock. What is the speedup from pipelining?

Average instruction execution time = Clock cycle × Average CPI
Pipeline Performance

Speedup_pipelining = (Avg. instruction time unpipelined) / (Avg. instruction time pipelined)

Speedup_pipelining = (1 / (1 + Pipeline stall cycles per instruction)) × Pipeline depth
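Working the example numbers from the previous slide, and assuming the pipelined machine achieves the ideal CPI of 1 (no stalls):

```python
clock_ns = 1.0  # unpipelined clock cycle

# Average CPI, unpipelined: 40% ALU (4 cycles), 20% branch (4), 40% memory (5)
avg_cpi_unpipelined = 0.4 * 4 + 0.2 * 4 + 0.4 * 5       # 4.4
avg_time_unpipelined = clock_ns * avg_cpi_unpipelined    # 4.4 ns per instruction

# Pipelined: same work split over stages, clock stretched by 0.2 ns overhead,
# one instruction completing per cycle (CPI = 1).
avg_time_pipelined = (clock_ns + 0.2) * 1.0              # 1.2 ns per instruction

speedup = avg_time_unpipelined / avg_time_pipelined
print(round(speedup, 2))  # 3.67
```

The overhead matters: without the 0.2 ns of skew and setup, the speedup would be the full 4.4x; the clock overhead alone costs about 17% of the ideal gain.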