Parallel programming: Introduction to GPU architecture
Sylvain Collange, Inria Rennes – Bretagne Atlantique
GPU internals
What makes a GPU tick?
NVIDIA GeForce GTX 980 Maxwell GPU. Artist rendering!
Outline
Computer architecture crash course
The simplest processor
Exploiting instruction-level parallelism
GPU, many-core: why, what for?
Technological trends and constraints
From graphics to general purpose
Forms of parallelism, how to exploit them
Why we need (so much) parallelism: latency and throughput
Sources of parallelism: ILP, TLP, DLP
Uses of parallelism: horizontal, vertical
Let's design a GPU!
Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD
Putting it all together
Architecture of current GPUs: cores, memory
The free lunch era... was yesterday
1980's to 2002: Moore's law, Dennard scaling, micro-architecture improvements
Exponential performance increase
Software compatibility preserved
Do not rewrite software, buy a new machine!
Hennessy, Patterson. Computer Architecture: A Quantitative Approach. 4th ed., 2006.
Computer architecture crash course
How does a processor work?
Or rather, how it worked in the 1980s to 1990s: modern processors are much more complicated!
An attempt to sum up 30 years of research in 15 minutes
Machine language: instruction set
Registers
State for computations
Keeps variables and temporaries
Instructions
Perform computations on registers, move data between registers and memory, branch…
Instruction word
Binary representation of an instruction
Assembly language
Readable form of machine language
Examples
Which instruction set does your laptop/desktop run?
Your cell phone?
Example: the instruction word 01100111 encodes ADD R1, R3; the registers are R0, R1, R2, R3, ... R31.
The Von Neumann processor
Let's look at it step by step
[Diagram: fetch unit with PC (+1), decoder, register file, arithmetic and logic unit, load/store unit, branch unit, state machine, memory, and result bus]
Step by step: Fetch
The processor maintains a Program Counter (PC)
Fetch: read the instruction word pointed to by PC in memory
[Diagram: the fetch unit reads instruction word 01100111 from memory at the address in PC]
Decode
Split the instruction word to understand what it represents
Which operation? → ADD
Which operands? → R1, R3
[Diagram: the decoder splits instruction word 01100111 into operation ADD and operands R1, R3]
Read operands
Get the value of registers R1, R3 from the register file
[Diagram: the register file returns the values 42 and 17 for operands R1, R3]
Execute operation
Compute the result: 42 + 17
[Diagram: the arithmetic and logic unit computes 42 + 17 = 59]
Write back
Write the result back to the register file
[Diagram: the result 59 is written back to register R1 over the result bus]
Increment PC
[Diagram: PC is incremented (+1) to point to the next instruction]
Load or store instruction
Can read and write memory from a computed address
[Diagram: the load/store unit sits between the register file and memory]
Branch instruction
Instead of incrementing PC, set it to a computed value
[Diagram: the branch unit can overwrite PC with a computed target]
What about the state machine?
The state machine controls everybody
Sequences the successive steps
Sends signals to units depending on the current state
At every clock tick, switches to the next state
The clock is a periodic signal (as fast as possible)
[State diagram: fetch-decode → read operands → execute and write back → increment PC]
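To make the fetch, decode, execute, write back, increment-PC cycle concrete, here is a minimal software sketch of such a machine (not from the original slides; the 3-byte instruction encoding and opcodes are invented for illustration):

    #include <cstdint>
    #include <cstdio>

    // Toy Von Neumann machine: 32 registers, instructions stored in memory.
    // Invented encoding: opcode byte, destination register, source register.
    enum Opcode : uint8_t { ADD = 0, MUL = 1, HALT = 2 };

    int main() {
        uint8_t memory[] = { ADD, 1, 3,    // R1 <- R1 + R3
                             MUL, 2, 3,    // R2 <- R2 * R3
                             HALT, 0, 0 };
        int32_t reg[32] = {0};
        reg[1] = 42; reg[2] = 2; reg[3] = 17;

        unsigned pc = 0;                          // program counter
        for (;;) {
            // Fetch: read the instruction word pointed to by PC
            uint8_t op = memory[pc], rd = memory[pc + 1], rs = memory[pc + 2];
            if (op == HALT) break;
            // Decode, read operands, execute, write back
            if (op == ADD) reg[rd] = reg[rd] + reg[rs];
            else           reg[rd] = reg[rd] * reg[rs];
            pc += 3;                              // increment PC
        }
        printf("R1 = %d\n", reg[1]);              // prints R1 = 59
        return 0;
    }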
Recap
We can build a real processor
As it was in the early 1980's
How did processors become faster?
Reason 1: faster clock
Progress in semiconductor technology allows higher frequencies (frequency scaling)
But this is not enough!
Outline recap. Next: Exploiting instruction-level parallelism
Going faster using ILP: pipeline
Idea: we do not have to wait until instruction n has finished to start instruction n+1
Like a factory assembly line
Or the bandejão (the university cafeteria line)
Pipelined processor
Independent instructions can follow each other
Exploits ILP to hide instruction latency
Program:
  1: add r1, r3
  2: mul r2, r3
  3: load r3, [r1]
[Diagram, three successive cycles: instruction 1 (add) advances through the Fetch, Decode, Execute, Writeback stages, followed one cycle later by instruction 2 (mul), then instruction 3 (load)]
Superscalar execution
Multiple execution units in parallel
Independent instructions can execute at the same time
Exploits ILP to increase throughput
[Diagram: a single Fetch stage feeding several parallel Decode, Execute, Writeback pipelines]
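As an aside (example code, not from the original slides), here is what the difference looks like at the source level; a superscalar core can overlap the three independent operations, but must serialize the dependent chain:

    // Dependent chain: each line needs the previous result,
    // so the pipeline stalls between them (little ILP).
    int chain(int b, int c, int e, int g) {
        int a = b + c;
        int d = a * e;      // waits for a
        return d - g;       // waits for d
    }

    // Independent operations: no dependencies between the three lines,
    // so a superscalar core can issue them in the same cycle (lots of ILP).
    int independent(int b, int c, int e, int h, int g, int k) {
        int x = b + c;
        int y = e * h;
        int z = g - k;
        return x + y + z;
    }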
Locality
Time to access main memory: ~200 clock cycles
One memory access every few instructions
Are we doomed?
Fortunately: principle of locality
~90% of memory accesses on ~10% of data
Accessed locations are often the same
Temporal locality: access the same location at different times
Spatial locality: access locations close to each other
Caches
Large memories are slower than small memories
The computer theorists lied to you: in the real world, access in an array of size n costs O(log n), not O(1)!
Think about looking up a book in a small or huge library
Idea: put frequently-accessed data in a small, fast memory
Can be applied recursively: hierarchy with multiple levels of cache
L1 cache: capacity 64 KB, access time 2 ns
L2 cache: 1 MB, 10 ns
L3 cache: 8 MB, 30 ns
Memory: 8 GB, 60 ns
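As a sketch of how much this matters in practice (example code, not from the original slides): both functions below compute the same sum over an n×n row-major matrix, but the first walks memory with stride 1 (spatial locality, mostly cache hits), while the second walks with stride n and misses far more often for large n:

    #include <vector>

    float sum_row_order(const std::vector<float>& m, int n) {
        float s = 0.f;
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                s += m[i * n + j];   // stride 1: cache-friendly
        return s;
    }

    float sum_col_order(const std::vector<float>& m, int n) {
        float s = 0.f;
        for (int j = 0; j < n; ++j)
            for (int i = 0; i < n; ++i)
                s += m[i * n + j];   // stride n: cache-hostile
        return s;
    }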
Branch prediction
What if we have a branch?
We do not know the next PC to fetch from until the branch executes
Solution 1: wait until the branch is resolved
Problem: programs have 1 branch every 5 instructions on average
We would spend most of our time waiting
Solution 2: predict (guess) the most likely direction
If correct, we have bought some time
If wrong, just go back and start over
Modern CPUs can correctly predict over 95% of branches
World record holder: 1.691 mispredictions / 1000 instructions
General concept: speculation
P. Michaud and A. Seznec. "Pushing the branch predictability limits with the multi-poTAGE+SC predictor." JWAC-4: Championship Branch Prediction, 2014.
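A classic way to feel the cost of mispredictions (illustrative sketch, not from the original slides): the loop below runs much faster when v is sorted, because the branch then follows a long regular pattern that the predictor gets almost always right, while on shuffled data it mispredicts about half the time:

    #include <vector>

    long count_large(const std::vector<int>& v) {
        long count = 0;
        for (int x : v)
            if (x >= 128)   // the branch being predicted
                ++count;
        return count;
    }
    // Same data, same instruction count: sorting v first typically makes
    // this loop several times faster, purely from better branch prediction.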
Example CPU: Intel Core i7 Haswell
Up to 192 instructions in flight
Maybe 48 predicted branches ahead
Up to 8 instructions/cycle executed out of order
About 25 pipeline stages at ~4 GHz
Quiz: how far does light travel during the 0.25 ns of a clock cycle?
Too complex to explain in 1 slide, or even 1 lecture
David Kanter. Intel's Haswell CPU architecture. RealWorldTech, 2012. http://www.realworldtech.com/haswell-cpu/
Recap
Many techniques to run sequential programs as fast as possible
Discovers and exploits parallelism between instructions
Speculates to remove dependencies
Works on existing binary programs, without rewriting or re-compiling
Upgrading hardware is cheaper than improving software
Extremely complex machine
Outline recap. Next: GPU, many-core: why, what for?
Technology evolution
Memory wall
Memory speed does not increase as fast as computing speed
More and more difficult to hide memory latency
Power wall
Power consumption of transistors does not decrease as fast as density increases
Performance is now limited by power consumption
ILP wall
Law of diminishing returns on instruction-level parallelism
Pollack's rule: cost ≃ performance²
[Graphs: the widening gap between compute and memory performance over time; transistor density, per-transistor power, and total power over time; serial performance vs. cost]
Usage changes
New applications demand parallel processing
Computer games: 3D graphics
Search engines, social networks… “big data” processing
New computing devices are power-constrained
Laptops, cell phones, tablets…
Small, light, battery-powered
Datacenters
High power supply and cooling costs
Latency vs. throughput
Latency: time to solution
CPUs: minimize time, at the expense of power
Throughput: quantity of tasks processed per unit of time
GPUs: assume unlimited parallelism, minimize energy per operation
Amdahl's law
Bounds the speedup attainable on a parallel machine
S = 1 / ((1 − P) + P / N)
where S is the speedup, P the ratio of parallel portions, and N the number of processors
[Graph: speedup S vs. N; total time splits into time to run sequential portions and time to run parallel portions]
G. Amdahl. Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities. AFIPS 1967.
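A quick worked example of the formula (illustrative code, not from the original slides): with P = 90% of the work parallel, even infinitely many processors cannot do better than 10×:

    #include <cstdio>

    // Amdahl's law: S = 1 / ((1 - P) + P / N)
    double amdahl(double P, double N) {
        return 1.0 / ((1.0 - P) + P / N);
    }

    int main() {
        printf("P=0.9, N=8:   S = %.2f\n", amdahl(0.9, 8));    // ~4.71
        printf("P=0.9, N=1e9: S = %.2f\n", amdahl(0.9, 1e9));  // ~10.00
        return 0;
    }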
Why heterogeneous architectures?
S = 1 / ((1 − P) + P / N)
Latency-optimized multi-core (CPU)
Low efficiency on parallel portions: spends too many resources
Throughput-optimized multi-core (GPU)
Low performance on sequential portions
Heterogeneous multi-core (CPU+GPU)
Use the right tool for the right job
Allows aggressive optimization for latency or for throughput
[Diagram: time to run sequential portions vs. time to run parallel portions for each design]
M. Hill, M. Marty. Amdahl's law in the multicore era. IEEE Computer, 2008.
Example: System on Chip for smartphone
Big cores for applications
Small cores for background activity
GPU
Special-purpose accelerators
Lots of interfaces
Outline recap. Next: From graphics to general purpose
The (simplest) graphics rendering pipeline
Vertices → Vertex shader → Primitives (triangles…) → Clipping, Rasterization, Attribute interpolation → Fragments → Fragment shader (reads Textures) → Z-Compare, Blending (uses the Z-Buffer) → Pixels → Framebuffer
Vertex and fragment shaders are programmable stages; clipping/rasterization and Z-compare/blending are parametrizable stages
How much performance do we need
… to run 3DMark 11 at 50 frames/second?

Element        Per frame   Per second
Vertices       12.0M       600M
Primitives     12.6M       630M
Fragments      180M        9.0G
Instructions   14.4G       720G

Intel Core i7 2700K: 56 Ginsn/s peak
We need to go 13× faster
Make a special-purpose accelerator
Source: Damien Triolet, Hardware.fr
Beginnings of GPGPU
[Timeline, 2000 to 2010. Microsoft DirectX: 7.x, 8.0, 8.1, 9.0, 9.0a/b/c, 10.0, 10.1, 11. NVIDIA: NV10, NV20, NV30, NV40, G70, G80-G90, GT200, GF100. ATI/AMD: R100, R200, R300, R400, R500, R600, R700, Evergreen. Milestones: programmable shaders, dynamic control flow, FP 16, FP 24, FP 32, FP 64, unified shaders, SIMT, CTM, CAL, CUDA, growing GPGPU traction]
Today: what do we need GPUs for?
1. 3D graphics rendering for games
Complex texture mapping, lighting computations…
2. Computer Aided Design workstations
Complex geometry
3. GPGPU
Complex synchronization, data movements
One chip to rule them all
Find the common denominator
Outline recap. Next: Forms of parallelism, how to exploit them
Little's law: data in flight = throughput × latency
[Chart: throughput (GB/s) vs. latency (ns, from ~3 to ~350) for the cache/memory hierarchies (L1, L2, DRAM) of an Intel Core i7 920 and an NVIDIA GeForce GTX 580; the GTX 580's DRAM (~350 ns, ~190 GB/s) requires far more data in flight than any level of the CPU hierarchy]
J. Little. A proof for the queuing formula L = λW. Operations Research, 1961.
Hiding memory latency with pipelining
Memory throughput: 190 GB/s; memory latency: 350 ns
At 1 GHz: 190 bytes/cycle, 350 cycles to wait
Data in flight = throughput × latency = 66,500 bytes
[Diagram: a continuous stream of memory requests over time; issuing one request per cycle keeps ~65 KB in flight]
Consequence: more parallelism
GPU vs. CPU
8× more parallelism to feed more units (throughput)
8× more parallelism to hide longer latency
64× more total parallelism
How to find this parallelism?
[Diagram: request streams replicated ×8 in space (more units) and stretched ×8 in time (longer latency)]
Sources of parallelism
ILP: Instruction-Level Parallelism
Between independent instructions in a sequential program, e.g.:
  add r3 ← r1, r2
  mul r0 ← r0, r1   (independent of the add: can execute in parallel)
  sub r1 ← r3, r0
TLP: Thread-Level Parallelism
Between independent execution contexts (threads), e.g. Thread 1 runs add while Thread 2 runs mul, in parallel
DLP: Data-Level Parallelism
Between elements of a vector: same operation on several elements, e.g. vadd r ← a, b computes r1 = a1+b1, r2 = a2+b2, r3 = a3+b3 in parallel
Example: X ← a×X
In-place scalar-vector product: X ← a×X
Sequential (ILP):  for i = 0 to n-1: X[i] ← a * X[i]
Threads (TLP):     launch n threads, each computing X[tid] ← a * X[tid]
Vector (DLP):      X ← a * X
Or any combination of the above (see the CUDA sketch below)
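For the TLP variant, a minimal CUDA sketch (kernel name and launch configuration are illustrative): each of the n threads handles one element.

    #include <cuda_runtime.h>

    // One thread per element: thread tid computes X[tid] <- a * X[tid]
    __global__ void scal(float a, float* X, int n) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < n)                   // guard the tail of the array
            X[tid] = a * X[tid];
    }

    void scal_on_gpu(float a, float* devX, int n) {
        int threadsPerBlock = 256;
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        scal<<<blocks, threadsPerBlock>>>(a, devX, n);
    }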
Uses of parallelism
“Horizontal” parallelism for throughput: more units working in parallel
“Vertical” parallelism for latency hiding: pipelining keeps units busy when waiting for dependencies, memory
[Diagram: units A, B, C, D side by side along the throughput axis; along the latency axis, instructions A, B, C, D overlap across cycles 1 to 4]
How to extract parallelism?

       Horizontal          Vertical
ILP    Superscalar         Pipelined
TLP    Multi-core, SMT     Interleaved / switch-on-event multithreading
DLP    SIMD / SIMT         Vector / temporal SIMT

We have seen the first row: ILP
We will now review techniques for the next rows: TLP, DLP
Outline recap. Next: Let's design a GPU!
Sequential processor
Focuses on instruction-level parallelism
Exploits ILP vertically (pipelining) and horizontally (superscalar)
Source code:
  for i = 0 to n-1: X[i] ← a * X[i]
Machine code:
    move i ← 0
  loop:
    load t ← X[i]
    mul t ← a × t
    store X[i] ← t
    add i ← i+1
    branch i<n? loop
[Diagram: a sequential CPU (Fetch, Decode, Execute, Memory) working through mul, store X[17], add i ← 18]
The incremental approach: multi-core
Several processors on a single chip sharing one memory space
Area: benefits from Moore's law
Power: extra cores consume little when not in use
e.g. Intel Turbo Boost
[Die photo: Intel Sandy Bridge. Source: Intel]
Homogeneous multi-core
Horizontal use of thread-level parallelism
Improves peak throughput
[Diagram: two identical cores (Fetch, Decode, Execute, Load/Store unit) sharing memory; thread T0 runs mul, store X[17], add i ← 18 on one core while thread T1 runs mul, store X[49], add i ← 50 on the other]
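On the host side, the same X ← a×X example maps onto a homogeneous multi-core as one chunk per thread (a sketch, assuming num_threads matches the core count):

    #include <algorithm>
    #include <thread>
    #include <vector>

    void scal_chunk(float a, float* X, int begin, int end) {
        for (int i = begin; i < end; ++i)
            X[i] = a * X[i];
    }

    // Horizontal TLP: split the array into one contiguous chunk per core
    void scal_multicore(float a, float* X, int n, int num_threads) {
        std::vector<std::thread> workers;
        int chunk = (n + num_threads - 1) / num_threads;
        for (int t = 0; t < num_threads; ++t) {
            int begin = t * chunk;
            int end = std::min(n, begin + chunk);
            if (begin < end)
                workers.emplace_back(scal_chunk, a, X, begin, end);
        }
        for (auto& w : workers) w.join();   // wait for all cores
    }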
Example: Tilera Tile-GX
Grid of (up to) 72 tiles
Each tile: 3-way VLIW processor, 5 pipeline stages, 1.2 GHz
[Diagram: 9×8 grid of tiles, from Tile (1,1) to Tile (9,8)]
Interleaved multi-threading
Vertical use of thread-level parallelism
Hides latency thanks to explicit parallelism; improves achieved throughput
[Diagram: one pipeline (Fetch, Decode, Execute, load-store unit, Memory) interleaving instructions from threads T0 to T3 cycle by cycle: mul, add i ← 50, load X[89], store X[72], load X[17], store X[49], add i ← 73, …]
Example: Oracle Sparc T5
16 cores / chip
Core: out-of-order superscalar, 8 threads
15 pipeline stages, 3.6 GHz
[Diagram: cores 1 to 16, each running threads 1 to 8]
Clustered multi-core
For each individual unit, select between
Horizontal replication
Vertical time-multiplexing
Examples
Sun UltraSparc T2, T3
AMD Bulldozer
IBM Power 7
Area-efficient tradeoff
Blurs boundaries between cores
[Diagram: threads T0 to T3 share replicated fetch/decode units and time-multiplexed execute and load/store units, grouped into Cluster 1 and Cluster 2]
Implicit SIMD
In NVIDIA-speak
SIMT: Single Instruction, Multiple Threads
Convoy of synchronized threads: warp
Extracts DLP from multi-threaded applications
Factorization of the fetch/decode and load-store units:
Fetch 1 instruction on behalf of several threads
Read 1 memory location and broadcast to several registers
[Diagram: threads T0 to T3 share a single Fetch/Decode stage; four execute units run (0) mul, (1) mul, (2) mul, (3) mul in lockstep; the (0-3) load and (0-3) store are factorized into one memory access]
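From the programmer's point of view, SIMT looks like plain multi-threading; an illustrative CUDA sketch (not from the original slides):

    // Threads 0..31 of a block form a warp sharing one instruction stream.
    // A uniform path costs one fetched instruction for 32 threads; a
    // data-dependent branch that splits the warp serializes both paths.
    __global__ void simt_example(const float* in, float* out, int n) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= n) return;
        float x = 2.0f * in[tid];   // uniform: full SIMT efficiency
        if (x < 0.0f)               // divergent if the warp disagrees:
            x = -x;                 // the two paths run one after the other
        out[tid] = x;
    }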
Explicit SIMD
Single Instruction, Multiple Data
Horizontal use of data-level parallelism
Examples
Intel MIC (16-wide)
AMD GCN GPU (16-wide × 4-deep)
Most general-purpose CPUs (4-wide to 8-wide)
Machine code:
  loop:
    vload T ← X[i]
    vmul T ← a × T
    vstore X[i] ← T
    add i ← i+4
    branch i<n? loop
[Diagram: a SIMD CPU (Fetch, Decode, Execute, Memory) working through vmul, vstore X[16..19], add i ← 20]
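For comparison, the same loop written with explicit SIMD on the CPU side (a sketch using 8-wide AVX intrinsics; assumes AVX support and that n is a multiple of 8):

    #include <immintrin.h>

    void scal_avx(float a, float* X, int n) {
        __m256 va = _mm256_set1_ps(a);            // broadcast a to 8 lanes
        for (int i = 0; i < n; i += 8) {
            __m256 vx = _mm256_loadu_ps(X + i);   // vload T <- X[i..i+7]
            vx = _mm256_mul_ps(vx, va);           // vmul  T <- a * T
            _mm256_storeu_ps(X + i, vx);          // vstore X[i..i+7] <- T
        }
    }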
Quiz: link the words
Parallelism
ILP
TLP
DLP
Use
Horizontal: more throughput
Vertical: hide latency
Architectures
Superscalar processor
Homogeneous multi-core
Multi-threaded core
Clustered multi-core
Implicit SIMD
Explicit SIMD
(The taxonomy table a few slides ahead gives the answers.)
Outline recap. Next: Putting it all together
Hierarchical combination
Both CPUs and GPUs combine these techniques
Multiple cores
Multiple threads/core
SIMD units
Example CPU: Intel Core i7
Is a wide superscalar, but also has
Multicore: 4 CPU cores
Multi-threading per core: Simultaneous Multi-Threading, 2 threads
SIMD units: 256-bit AVX
Up to 117 operations/cycle from 8 threads
Example GPU: NVIDIA GeForce GTX 580
SIMT: warps of 32 threads
16 SMs / chip
2×16 cores / SM, 48 warps / SM
Up to 512 operations per cycle from 24,576 threads in flight
[Diagram: SM1 through SM16; within an SM, cores 1-16 and cores 17-32 time-multiplex warps 1 to 48 over time]
Taxonomy of parallel architectures

       Horizontal           Vertical
ILP    Superscalar / VLIW   Pipelined
TLP    Multi-core, SMT      Interleaved / switch-on-event multithreading
DLP    SIMD / SIMT          Vector / temporal SIMT
Classification: multi-core
[Chart plotting horizontal vs. vertical use of ILP, TLP, and DLP for Oracle Sparc T5 (16 cores × 8 threads), Intel Haswell (4 cores × 2-way Hyperthreading, SIMD (AVX)), IBM Power 8 (12 cores × 8 threads), and Fujitsu SPARC64 X (16 cores × 2 threads)]
General-purpose multi-cores: balance ILP, TLP and DLP
Sparc T: focus on TLP
Classification: GPU and many small-core
[Chart plotting horizontal vs. vertical use of ILP, TLP, and DLP for Intel MIC (60 cores × 4 threads, 16-wide SIMD), Nvidia Kepler (16×4 cores×units, 32 warps, 32-wide SIMT), AMD GCN (20×4 cores×units, 40 wavefronts, 16-wide SIMD), Kalray MPPA-256 (17×16 cores, 5-way VLIW), and Tilera Tile-GX (72 cores, 3-way VLIW)]
GPU: focus on DLP and TLP, horizontal and vertical
Many small-core: focus on horizontal TLP
Takeaway
All processors use hardware mechanisms to turn parallelism into performance
GPUs focus on Thread-level and Data-level parallelism
Outline recap. Next: Architecture of current GPUs: cores, memory
Computation cost vs. memory cost
Power measurements on NVIDIA GT200

Operation                        Energy/op (nJ)   Total power (W)
Instruction control              1.8              18
Multiply-add on a 32-wide warp   3.6              36
Load 128B from DRAM              80               90

With the same amount of energy:
Load 1 word from external memory (DRAM), or
Compute 44 flops
Must optimize memory accesses first!
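Concretely, on the GPU this means making each warp's accesses coalesce into as few DRAM transactions as possible; an illustrative CUDA sketch (kernel names invented):

    // Coalesced: consecutive threads touch consecutive addresses, so the
    // 32 loads of a warp merge into a few wide DRAM transactions.
    __global__ void copy_coalesced(const float* in, float* out, int n) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < n) out[tid] = in[tid];
    }

    // Strided: consecutive threads touch addresses 'stride' apart, so each
    // load pays for a whole DRAM burst and wastes most of the bytes fetched.
    __global__ void copy_strided(const float* in, float* out, int n, int stride) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < n) out[tid] = in[(long long)tid * stride % n];
    }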
External memory: discrete GPU
Classical CPU-GPU model
Split memory spaces
Highest bandwidth from GPU memory
Transfers to main memory are slower
[Diagram: CPU with 8 GB main memory (26 GB/s), connected through PCI Express (16 GB/s) to a GPU with 3 GB graphics memory (290 GB/s). Example: Intel Core i7 4770, Nvidia GeForce GTX 780]
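In CUDA terms, the split memory spaces show up as explicit allocations and transfers (a sketch; error checking omitted):

    #include <cuda_runtime.h>

    void run_on_discrete_gpu(const float* hostX, float* hostY, size_t n) {
        float* devX;
        cudaMalloc(&devX, n * sizeof(float));
        // Host -> device crosses PCI Express: the slow path, do it rarely
        cudaMemcpy(devX, hostX, n * sizeof(float), cudaMemcpyHostToDevice);
        // ... launch kernels that reuse devX from fast graphics memory ...
        cudaMemcpy(hostY, devX, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(devX);
    }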
External memory: embedded GPU
Most GPUs today
Same memory
May support memory coherence
GPU can read directly from CPU caches
More contention on external memory
[Diagram: CPU and GPU on one chip, sharing a cache and 8 GB of main memory at 26 GB/s]
GPU: on-chip memory
Conventional wisdom: cache area in CPU vs. GPU, according to the NVIDIA CUDA Programming Guide
But... if we include registers, total on-chip memory (register files + caches):
  NVIDIA GM204 GPU: 8.3 MB
  AMD Hawaii GPU: 15.8 MB
  Intel Core i7 CPU: 9.3 MB
GPU/accelerator internal memory exceeds desktop CPUs
Registers: CPU vs. GPU
Registers keep the contents of local variables
Typical values:

                     CPU       GPU
Registers/thread     32        32
Registers/core       256       65536
Read / Write ports   10R/5W    2R/1W

GPU: many more registers, but made of simpler memory
Internal memory: GPU
Cache hierarchy
Keep frequently-accessed data
Reduce throughput demand on main memory
Managed by hardware (L1, L2) or software (shared memory)
[Diagram: each core has a private L1 cache; a crossbar connects the cores to L2 cache slices and then to external memory (290 GB/s); aggregate on-chip bandwidth ~2 TB/s, on-chip capacities on the order of 1 to 6 MB]
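The software-managed level is exposed in CUDA as shared memory; a minimal sketch (assumes the block size is 256 and n is a multiple of it):

    // Each block stages a tile of its input in fast on-chip memory,
    // then reuses it: here, reversing the tile within the block.
    __global__ void block_reverse(const float* in, float* out, int n) {
        __shared__ float tile[256];        // on-chip, per-block storage
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = in[i];         // one global-memory read each
        __syncthreads();                   // wait until the tile is loaded
        out[blockIdx.x * blockDim.x + (blockDim.x - 1 - threadIdx.x)]
            = tile[threadIdx.x];           // served from on-chip memory
    }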
Caches: CPU vs. GPU
On CPU, caches are designed to avoid memory latency
Throughput reduction is a side effect
On GPU, multi-threading deals with memory latency
Caches are used to improve throughput (and energy)

              CPU                   GPU
Latency       Caches, prefetching   Multi-threading
Throughput                          Caches
GPU: thousands of cores?

NVIDIA GPUs   G80/G92 (2006)  GT200 (2008)  GF100 (2010)  GK104 (2012)  GK110 (2012)  GM204 (2014)
Exec. units   128             240           512           1536          2688          2048
SMs           16              30            16            8             14            16

AMD GPUs      R600 (2007)  R700 (2008)  Evergreen (2009)  NI (2010)  SI (2012)  VI (2013)
Exec. units   320          800          1600              1536       2048       2560
SIMD-CUs      4            10           20                24         32         40

Computational resources grow, but the number of clients in the interconnection network (cores) stays limited
Takeaway
Result of many tradeoffs
Between locality and parallelism
Between core complexity and interconnect complexity
GPU optimized for throughput
Exploits primarily DLP, TLP
Energy-efficient on parallel applications with regular behavior
CPU optimized for latency
Exploits primarily ILP
Can use TLP and DLP when available
Next time
Next Tuesday, 1:00pm, room 2014: CUDA
Execution model
Programming model
API
Thursday, 1:00pm, room 2011: lab work (what is my GPU and when should I use it?)
There may be available seats even if you are not enrolled