7/30/2019 08 Performance Overview
Understanding GPUs Through Benchmarking
Mike Houston
Stanford University
Introduction
Key areas for GPGPU
Memory latency behavior
Memory bandwidths Upload/Download
Instruction rates
Branching performance
Chips analyzed
ATI X1900XTX (R580)
NVIDIA 7900GTX (G71)
NVIDIA 8800GTX (G80)
AMD HD2900 (R600)
GPUBench
An open-source suite of micro-benchmarks
GL
DX9 (alpha version)
Developed at Stanford to aid our
understanding of GPUs
Vendors wouldn't directly tell us architecture details
Behavior under GPGPU apps different than games
and other benchmarks
Library of results
http://graphics.stanford.edu/projects/gpubench/
Memory latency
Questions
Can latency be hidden?
Does access pattern affect latency?
Methodology
Try different numbers of texture fetches
Different access patterns:
Cache hit: every fetch to the same texel
Sequential: every fetch increments address by 1
Random: dependent lookup with random texture
Increase the ALU ops of the shader
ALU ops must be dependent to avoid
optimization
GPUBench test: fetchcost
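The methodology above can be sketched as a small shader generator. This is a hypothetical illustration in Python that emits GLSL-style fragment shader text; it is not GPUBench's actual fetchcost source. The function name `make_fetch_shader`, the sampler names `tex` and `rand_tex`, the `texel` uniform, and the use of a multiply-add as the ALU op are all assumptions.

```python
def make_fetch_shader(num_fetches, num_alu_ops, pattern):
    """Return GLSL-style shader text for one fetch-cost data point.

    pattern is "cache_hit", "sequential", or "random".
    Hypothetical sketch; not GPUBench's actual shader.
    """
    if pattern not in ("cache_hit", "sequential", "random"):
        raise ValueError("unknown pattern: " + pattern)

    lines = [
        "uniform sampler2D tex;       // input texture (assumed name)",
        "uniform sampler2D rand_tex;  // random coordinates (assumed name)",
        "uniform vec2 texel;          // size of one texel (assumed name)",
        "void main() {",
        "    vec4 acc = vec4(0.0);",
        "    vec2 coord = gl_TexCoord[0].xy;",
    ]
    for i in range(num_fetches):
        if pattern == "cache_hit":
            # Every fetch reads the same texel.
            lines.append("    acc += texture2D(tex, vec2(0.5, 0.5));")
        elif pattern == "sequential":
            # Each fetch advances the address by one texel.
            lines.append(
                "    acc += texture2D(tex, coord + vec2(%d.0 * texel.x, 0.0));" % i
            )
        else:
            # Dependent lookup: a random texture supplies the next address,
            # so this pattern issues two fetches per iteration.
            lines.append("    coord = texture2D(rand_tex, coord).xy;")
            lines.append("    acc += texture2D(tex, coord);")
    for _ in range(num_alu_ops):
        # Each op consumes the previous result, so the chain is dependent
        # and the compiler cannot drop intermediate steps.
        lines.append("    acc = acc * acc + acc;")
    lines.append("    gl_FragColor = acc;")
    lines.append("}")
    return "\n".join(lines)
```

Sweeping `num_alu_ops` upward for a fixed `num_fetches` and timing each generated shader would give one curve per access pattern.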
Fetch cost - ATI - cache hit
ATI X1800XT: 4 ALU ops
ATI X1900XTX: 12 ALU ops
Cost = max(ALU, TEX)
X1900XTX has 3X the ALUs per pipe
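The fetch-cost slides label each chip with one of two cost models, max(ALU, TEX) or sum(ALU, TEX). A minimal sketch of the difference follows. Expressing the fetch cost `tex_cost` in units of ALU ops is an assumption made for illustration, and `find_knee` is a hypothetical helper for reading the flat region off a measured curve.

```python
def cost_overlapped(alu_ops, tex_cost):
    """Cost = max(ALU, TEX): ALU work overlaps with the fetch."""
    return max(alu_ops, tex_cost)


def cost_serialized(alu_ops, tex_cost):
    """Cost = sum(ALU, TEX): ALU work is paid on top of the fetch."""
    return alu_ops + tex_cost


def find_knee(samples, tolerance=0.05):
    """Return the largest ALU-op count that is still 'free'.

    samples is a list of (alu_ops, time) pairs sorted by alu_ops.
    A point counts as free while its time stays within `tolerance`
    of the first sample's time.
    """
    base = samples[0][1]
    knee = samples[0][0]
    for alu_ops, t in samples:
        if t <= base * (1.0 + tolerance):
            knee = alu_ops
        else:
            break
    return knee
```

Under the max model the curve stays flat until the ALU work exceeds the fetch cost, so `find_knee` returns that fetch cost; under the sum model the curve rises from the first added ALU op and the knee is zero.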
Fetch cost - ATI - cache hit
ATI X1900XTX
5 ALU ops
Cost = max(ALU, TEX)
AMD HD2900XT
12 ALU ops
Fetch cost - ATI - sequential
ATI X1800XT: 8 ALU ops
ATI X1900XTX: 24 ALU ops
Cost = max(ALU, TEX)
X1900XTX has 3X the ALUs per pipe
Fetch cost - ATI - sequential
ATI X1900XTX
24 ALU ops
Cost = max(ALU, TEX)
AMD HD2900XT
20 ALU ops
Fetch cost - NVIDIA - cache hit
Cost = sum(ALU, TEX)
4 ALU op penalty
NVIDIA 7900 GTX
Fetch cost - NVIDIA - sequential
NVIDIA 7900 GTX
8 ALU op issue penalty
Cost = sum(ALU, TEX)
Fetch cost - NVIDIA 8800 GTX
Cost = max(ALU, TEX)
Cache hit: 4 ALU ops
Sequential: 8 ALU ops
Bandwidth to ALUs
Questions
Cache performance?
Sequential performance?
Random-read performance?
Methodology
Cache hit
Use a constant as index to texture(s)
Sequential
Use fragment position to index texture(s)
Random
Index a seeded texture with fragment position to
look up into input texture(s)
GPUBench test: inputfloatbandwidth
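The three access patterns above can be sketched on the CPU as the texel each fragment would read. This is a hypothetical model in Python, not GPUBench code; the function name `make_indices` and the use of `random.Random` to stand in for the seeded texture are assumptions.

```python
import random


def make_indices(pattern, width, height, seed=0):
    """Return the (x, y) texel read by each fragment, in row-major order.

    Hypothetical model of the three access patterns.
    """
    count = width * height
    if pattern == "cache_hit":
        # A constant index: every fragment reads the same texel.
        return [(0, 0)] * count
    if pattern == "sequential":
        # The fragment position is the index.
        return [(x, y) for y in range(height) for x in range(width)]
    if pattern == "random":
        # Stand-in for the seeded texture: one random coordinate per
        # fragment, which is then used to index the input texture.
        rng = random.Random(seed)
        return [
            (rng.randrange(width), rng.randrange(height))
            for _ in range(count)
        ]
    raise ValueError("unknown pattern: " + pattern)
```

Using a fixed seed keeps the random pattern repeatable between runs, so two chips are compared on the same address stream.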
Results
ATI X1900XTX NVIDIA 7900GTX
Better effective
cache bandwidth
Better random
bandwidth
Sequential bandwidth
(SEQ) about the same
Results
NVIDIA 8800GTX
2X bandwidth of
7900GTX
NVIDIA 7900GTX
Results
NVIDIA 8800GTX AMD HD2900XT
Off-board bandwidth
Questions
How fast can we get data on the board (download)?
How fast can we get data off the board (readback)?
GPUBench tests:
download
readback
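Both questions reduce to timing a transfer of known size. A minimal sketch of that measurement follows, using a host memory copy as a stand-in for the transfer. The helper name `measure_bandwidth` is hypothetical. In an OpenGL harness the `transfer` callable would presumably wrap an upload such as `glTexSubImage2D` or a readback such as `glReadPixels`, followed by `glFinish`; that mapping is an assumption, not a description of GPUBench.

```python
import time


def measure_bandwidth(transfer, num_bytes, repeats=5):
    """Return the best observed bandwidth in bytes per second.

    transfer is a zero-argument callable that moves num_bytes and
    returns only once the data has arrived.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        transfer()
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    # Guard against a timer too coarse to see the transfer at all.
    return num_bytes / best if best > 0 else float("inf")


# Stand-in transfer: copy a 1 MiB host buffer.
buf = bytearray(1 << 20)
bandwidth = measure_bandwidth(lambda: bytes(buf), len(buf))
```

Taking the best of several repeats reduces noise from the first-call setup cost, which matters when the transfer itself is short.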
Download
ATI X1900XTX NVIDIA 7900GTX
Host to GPU is slow
Download
NVIDIA 7900GTX NVIDIA 8800GTX
Next generation not much better
Readback
ATI X1900XTX NVIDIA 7900GTX
GPU to host is slow
ATI GL Readback
performance is
abysmal
Readback
NVIDIA 7900GTX NVIDIA 8800GTX
Next generation not much better
Instruction Issue Rate
Questions
What is the raw performance achievable?
Do different instructions have different costs?
Vector vs. scalar issue differences?
Methodology
Write long shaders with dependent
instructions
>100 instructions
All instructions dependent
But try to structure to allow for multi-issue
Test float1 vs. float4 performance
GPUBench tests:
instrissue
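A long dependent instruction chain of this kind can be sketched with a small generator. This is a hypothetical Python illustration that emits GLSL-style text, not the actual instrissue source. The names `make_issue_shader`, `OPS`, `r`, and `c` are assumptions, and the sketch does not attempt the multi-issue structuring mentioned above.

```python
# One GLSL-style statement per tested instruction (assumed encodings).
OPS = {
    "ADD": "r = r + c;",
    "MUL": "r = r * c;",
    "MAD": "r = r * c + c;",
}


def make_issue_shader(op, length=128, width=4):
    """Return shader text with `length` dependent `op` instructions.

    width selects float1 (1) or float4 (4) operands.
    Hypothetical sketch; not GPUBench's actual shader.
    """
    if op not in OPS:
        raise ValueError("unknown op: " + op)
    if width not in (1, 4):
        raise ValueError("width must be 1 or 4")
    vtype = "float" if width == 1 else "vec4"
    source = "gl_TexCoord[0].x" if width == 1 else "gl_TexCoord[0]"
    lines = [
        "uniform %s c;" % vtype,
        "void main() {",
        # Start from a per-fragment value so nothing can be constant-folded.
        "    %s r = %s;" % (vtype, source),
    ]
    # Every statement reads and writes r, so each depends on the last.
    lines.extend(["    " + OPS[op]] * length)
    lines.append("    gl_FragColor = vec4(r);")
    lines.append("}")
    return "\n".join(lines)
```

Generating the same chain at `width=1` and `width=4` and comparing the timings is one way to see whether a chip issues vector or scalar instructions.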
Results - Vector issue
ATI X1900XTX NVIDIA 7900GTX
= More costly than others
Results - Vector issue
ATI X1900XTX NVIDIA 7900GTX
Faster
ADD/SUB
Peak (single instruction)
FLOPS with MAD
Results - Vector issue
NVIDIA 7900GTX NVIDIA 8800GTX
8800GTX is 37%
faster (peak)
Results - Vector issue
AMD HD2900XT NVIDIA 8800GTX
When benchmarks go wrong
Smart compilers subverting testing and optimizing away shaders
Bug found in previous subtract test
No clever way to write RCP test found yet
Always sanity check results against theoretical peak!!!
NVIDIA 7800GTX
GPUBench 1.2
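The sanity check against theoretical peak can be sketched as simple arithmetic. The chip parameters below (32 ALUs, 0.5 GHz, 8 flops per ALU per clock for a float4 MAD) are made-up illustration values, not the specifications of any chip in this talk, and the function names are hypothetical.

```python
def measured_gflops(pixels, instrs_per_pixel, flops_per_instr, seconds):
    """GFLOPS implied by a timed run of a shader."""
    return pixels * instrs_per_pixel * flops_per_instr / seconds / 1e9


def theoretical_peak_gflops(num_alus, clock_ghz, flops_per_alu_per_clock):
    """Upper bound on GFLOPS from the hardware's issue width and clock."""
    return num_alus * clock_ghz * flops_per_alu_per_clock


def is_plausible(measured, peak, slack=1.02):
    """A result above peak means the shader probably did not run as written."""
    return measured <= peak * slack


# Hypothetical chip: 32 ALUs at 0.5 GHz, float4 MAD = 8 flops per clock.
peak = theoretical_peak_gflops(32, 0.5, 8)

# 1024x1024 pixels, 128 float4 MADs each, timed at 10 ms and at 1 ms.
honest = measured_gflops(1024 * 1024, 128, 8, 0.010)
suspicious = measured_gflops(1024 * 1024, 128, 8, 0.001)

print(is_plausible(honest, peak))      # below peak
print(is_plausible(suspicious, peak))  # above peak: compiler likely removed work
```

A run that reports more than the hardware can issue is the signature of a compiler having optimized part of the shader away.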
Results - Scalar issue
NVIDIA 7900GTX NVIDIA 8800GTX
8800GTX is a scalar issue
processor
Branching Performance
Questions
Is predication better than branching?
Is using Early-Z culling a better option?
What is the cost of branching?
What branching granularity is required?
How much can I really save branching around heavy computation?
Methodology
Early-Z
Set a Z-buffer and compare function to mask out compute
Change coherence of blocks
Change sizes of blocks
Set differing amounts of pixels to be drawn
Shader Branching
if { do a little } else { LOTS of math }
Change coherence of blocks
Change sizes of blocks
Have differing amounts of pixels execute heavy math branch
GPUBench tests:
branching
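Why block size matters for shader branching can be illustrated with a toy cost model. This is a sketch under stated assumptions, not a description of any real chip: the hardware is assumed to evaluate branches per `batch` x `batch` pixel tile, a tile containing both outcomes is assumed to pay for both sides, and heavy blocks are laid out as a checkerboard. The function name `branch_cost` and the cost values are hypothetical.

```python
def branch_cost(size, block, batch, heavy=100.0, light=1.0):
    """Cost of a size x size render relative to running heavy everywhere.

    Checkerboard blocks of block x block pixels take the heavy branch.
    Branches are resolved per batch x batch tile (assumed model).
    """
    total = 0.0
    for ty in range(0, size, batch):
        for tx in range(0, size, batch):
            y_end = min(ty + batch, size)
            x_end = min(tx + batch, size)
            any_heavy = False
            any_light = False
            for y in range(ty, y_end):
                for x in range(tx, x_end):
                    if ((x // block) + (y // block)) % 2 == 0:
                        any_heavy = True
                    else:
                        any_light = True
            # A divergent tile executes both sides of the branch.
            per_pixel = (heavy if any_heavy else 0.0) + (
                light if any_light else 0.0
            )
            total += (y_end - ty) * (x_end - tx) * per_pixel
    return total / (size * size * heavy)


coherent = branch_cost(64, block=16, batch=16)   # blocks align with tiles
incoherent = branch_cost(64, block=4, batch=16)  # every tile diverges
```

In this model, blocks at least as large as the tile let half the screen skip the heavy path, while smaller blocks make every tile pay for both sides, which is slightly worse than not branching at all.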
Results - Early-Z - NVIDIA
NVIDIA 7900GTX
4x4 coherence is almost perfect!
Random is bad!
Results - Branching - NVIDIA
NVIDIA 7900GTX
Fully coherent
has good
performance
But overhead
Results - Branching - NVIDIA
NVIDIA 7900GTX
Performance
increases with
branch
coherence
Need > 32x32 branch coherence
Results - Branching - NVIDIA
NVIDIA 8800GTX
Performance
increases with
branch
coherence
Need > 16x16 branch coherence
(Turns out 16x4 is as good as 16x16)
Summary
Benchmarks can help discern app behavior
and architecture characteristics
We use these benchmarks as predictive models when designing algorithms
Folding@Home
ClawHMMer
CFD
Be wary of driver optimizations
Driver revisions change behavior
Raster order, scheduler, compiler