CS 354: Performance Analysis
Mark Kilgard, University of Texas, April 26, 2012
CS 354 2
Today’s material
In-class quiz on acceleration structures lecture
Lecture topic: Graphics Performance Analysis
CS 354 3
My office hours
Tuesday, before class: Painter (PAI) 5.35, 8:45 a.m. to 9:15 a.m.
Thursday, after class: ACE 6.302, 11:00 a.m. to 12:00
Randy's office hours: Monday & Wednesday, 11:00 a.m. to 12:00, Painter (PAI) 5.33
CS 354 4
Last time, this time
Last lecture, we discussed Acceleration structures
This lecture Graphics Performance Analysis
Projects Project 4 on ray tracing on Piazza
Due May 2, 2012 Get started!
CS 354 5
Daily Quiz
1. Multiple choice: Which is NOT a bounding volume representation?
a) sphere
b) axis-aligned bounding box
c) object aligned bounding box
d) bounding graph point
e) convex polyhedron
2. True or False: Placing objects within a uniform grid is easier than placing objects within a KD tree.
3. True or False: Volume rendering can be accelerated by the GPU by drawing blended slices of the volume.
On a sheet of paper• Write your EID, name, and date• Write #1, #2, #3 followed by its answer
CS 354 6
Graphics Performance Analysis
Generating synthetic images by computer is computationally and bandwidth intensive; achieving interactive rates is key
60 frames/second ≈ real-time interactivity, so worth optimizing
Entertainment and intuition are tied to interactivity
How do we think about graphics performance analysis?
CS 354 7
Framing Amdahl’s Law
Assume a workload with two parts: the first part is A%, the second part is B%, such that A% + B% = 100%
If we have a technique to speed up the second part by N times, but have no speedup for the first part, what overall speedup can we expect?
CS 354 8
Amdahl’s Equation
Assume A% + B% = 100% If the un-optimized effort is 100%, the optimized
effort should be smaller
Speedup is ratio of UnoptimizedEffort to OptimizedEffort
OptimizedEffort = A% + B%/N
Speedup = 100% / (A% + B%/N) = 1 / ((1 − B) + B/N)
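The equation above can be checked with a short calculation. A minimal sketch; the function name and the sample 50/50 workload are my own, not from the slides:

```python
def amdahl_speedup(serial_frac, n):
    """Overall speedup when only the parallel fraction is sped up n times.

    serial_frac is A (the part with no speedup); 1 - serial_frac is B.
    """
    return 1.0 / (serial_frac + (1.0 - serial_frac) / n)

# A 50/50 workload with a 4x speedup on the parallel half:
print(amdahl_speedup(0.5, 4))  # 1.6
```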
CS 354 9
Who was Amdahl?
Gene Amdahl CPU architect for IBM in 1960s
Helped design IBM’s System/360 mainframe architecture
Left IBM to found Amdahl Corporation, building IBM-compatible mainframes
Why? Evaluating whether to invest in parallel
processing or not
CS 354 10
Parallelization
Broadly speaking, computer tasks can be broken into two portions Sequential sub-tasks
Naturally requires steps to be done in a particular order Examples: text layout, entropy decoding
Parallel sub-tasks Problem splits into lots of independent chunks of work Chunks of work can be done by separate processing units
simultaneously: parallelization Examples: tracing rays, shading pixels, transforming
vertices
CS 354 11
Serial Work Sandwiching Parallel Work
CS 354 12
Example of Amdahl’s Law
Say a task is 50% serial and 50% parallel Consider using 4 parallel processors on the
parallel portion Speedup: 1.6x
Consider using 40 parallel processor on parallel portion Speedup: 1.951x
Consider the limit: lim (n→∞) 1 / (0.5 + 0.5/n) = 2
So the speedup can never exceed 2x
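Tabulating the speedup for growing n makes the convergence visible. A sketch; the helper name is my own:

```python
def amdahl_speedup(serial_frac, n):
    # Amdahl's law: 1 / (A + B/n) with A = serial_frac, B = 1 - serial_frac
    return 1.0 / (serial_frac + (1.0 - serial_frac) / n)

# Speedup approaches, but never reaches, 2x as n grows.
for n in (4, 40, 400, 4000):
    print(n, round(amdahl_speedup(0.5, n), 3))
```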
CS 354 13
Graph of Amdahl’s Law
CS 354 14
Pessimism about Parallelism?
Amdahl’s Law can instill pessimism about parallel processing
If the serial work percent is high, adding parallel units has low benefit
Assumes a fixed problem size, so the workload stays the same size even as parallel execution resources are added
So why do GPUs offer hundreds of cores then?
CS 354 15
Gustafson's Law
Observation by John Gustafson: with N parallel units, bigger problems can be attacked
Great example: increasing display resolution
Was 640x480 pixels, now 1920x1200
More parallel units means more pixels can be processed simultaneously, supporting rendering resolutions previously unattainable
Problem size improvement: problemScale = N − A(N − 1)
CS 354 16
Example
Say a task is 50% serial and 50% parallel Consider using 4 parallel processors on the
parallel portion Problem scales up: 2.5x
Consider 100 parallel processors Problem scales up: 50.5x
Also consider heterogeneous nature of graphics processing units
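The scaled-problem figures follow directly from problemScale = N − A(N − 1). A small sketch; the function name is my own:

```python
def gustafson_scale(serial_frac, n):
    """Gustafson's law: how much larger a problem n parallel units can attack."""
    return n - serial_frac * (n - 1)

print(gustafson_scale(0.5, 4))    # 2.5
print(gustafson_scale(0.5, 100))  # 50.5
```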
CS 354 17
Coherent Work vs. Incoherent Work
Not all parallel work is created equal Coherent work = “adjacent” chunks of work
performing similar operations and memory accesses Example: camera rays, pixel shading Allows sharing control of instruction execution Good for caches
Incoherent work = “adjacent” chunks of work performing dissimilar operations and memory accesses Examples: reflection, shadow, and refraction rays Bad for caches
CS 354 18
Coherent vs. Incoherent Rays
coherent = camera rays coherent = light rays
incoherent = reflected rays
CS 354 19
Keeping Work Coherent?
How do we keep work coherent? Pipelines
Careful because they can introduce latency Data structures SPMD (or SIMD) execution
Single Program, Multiple Data To exploit Single Instruction, Multiple Data (SIMD)
units Bundling “adjacent” work elements helps cache and
memory access efficiency
CS 354 20
Pipeline Processing
Parallel and naturally coherent
CS 354 21
A Simplified Graphics Pipeline
Application
Vertex batching & assembly
Triangle assembly
Triangle clipping
Triangle rasterization
Fragment shading
Depth testing
Color update
Application-OpenGL API boundary
Framebuffer
NDC to window space
Depth buffer
CS 354 22
Another View of the Graphics Pipeline
Geometry Program
3D Application or Game
OpenGL API
GPU Front End
Vertex Assembly
Vertex Shader
Clipping, Setup, and Rasterization
Fragment Shader
Texture Fetch
Raster Operations
Framebuffer Access
Memory Interface
CPU – GPU Boundary
OpenGL 3.3
Attribute Fetch
Primitive Assembly
Parameter Buffer Read
Legend: programmable vs. fixed-function
CS 354 23
Modeling Pipeline Efficiency
Rate of processing for sequential tasks
Assume three tasks; run time is the sum of each operation's time: A + B + C
Rate of processing in a pipeline
Assume three tasks, treated as stages; performance is gated by the slowest operation
Three operations in pipeline: A, B, C; run time = max(A, B, C)
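The sum-vs-max contrast can be made concrete with toy stage costs (hypothetical numbers, not from the slides):

```python
# Hypothetical per-item costs for three operations A, B, C (arbitrary units).
a, b, c = 3, 5, 2

# Sequential processing: each item pays for every operation in turn.
sequential_time_per_item = a + b + c    # 10

# Pipelined, in steady state: a new item completes every max(A, B, C) units,
# so throughput is gated by the slowest stage.
pipelined_time_per_item = max(a, b, c)  # 5

print(sequential_time_per_item, pipelined_time_per_item)
```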
CS 354 24
Hardware Clocks
Heart beat of hardware Measured in frequency
Hertz (Hz) = cycles per second Megahertz, gigahertz = million, billion Hz
Faster clocks = faster computation and data transfer
So why not simply raise clocks? High clocks consume more power Circuits are only rated to a maximum clock
speed before becoming unreliable
CS 354 25
Clock Domains
A given chip may have multiple clocks running
Three key domains (GPU-centric)
Graphics clock—for fixed-function units Example uses: rasterization, texture filtering, blending Optimize for throughput, not latency
Can often instance more units instead of raising clocks Processor clock—for programmable shader units
Example: shader instruction execution Generally higher than graphics clock
Because optimized for latency rather than throughput Memory clock—for talking to external memory
Depends on speed rating of external memory Other domains too
Display clock, PCI-Express bus clock Generally not crucial to rendering performance
CS 354 26
Primitive Program
3D Pipeline Programmable Domains run on Unified Hardware
Unified Streaming Processor Array (SPA) architecture means same capabilities for all domains Plus tessellation + compute (not shown below)
GPU Front End
Vertex Assembly
Vertex Program
Clipping, Setup, and Rasterization
Fragment Program
Texture Fetch
Raster Operations
Framebuffer Access
Memory Interface
Attribute Fetch
Primitive Assembly
Parameter Buffer Read
Can be unified hardware!
CS 354 27
Memory Bandwidth
Raw memory bandwidth Physical clock rate
Example: 3 GHz
Memory bus width
64-bit, 128-bit, 192-bit, 256-bit, 384-bit; wider buses are faster but more expensive because of routing all those wires
Signaling rate Double data rate (DDR) means signals are sent on the rising and
falling clock edges Often logical memory clock rate includes signaling rate
Computing raw memory bandwidth
bandwidth = physicalClock × signalsPerClock × busWidth
CS 354 28
Latency vs. Throughput
Raw bandwidth is reduced by imperfect memory utilization; it is unrealistic to expect 100% utilization, though GPUs are generally much better at this than CPUs
Trade-off Maximizing throughput (utilization) increases
latency Minimizing latency reduces utilization
CS 354 29
Computing Bandwidth
Example: GeForce GTX 680 Latest NVIDIA generation 3.54 billion transistors in 28 nm process
Memory characteristics 6 GHz memory clock (includes signaling rate) 256-bit memory interface = 192 gigabytes/second
6 billion/second × 256 bits/clock × 1 byte/8 bits = 192 gigabytes/second
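That arithmetic can be written out directly, using the GTX 680 figures from this slide:

```python
effective_memory_clock_hz = 6e9  # 6 GHz, already including the DDR signaling rate
bus_width_bits = 256

# bandwidth = clock rate × bus width, converted from bits to bytes
bytes_per_second = effective_memory_clock_hz * bus_width_bits / 8
print(bytes_per_second / 1e9)  # 192.0 gigabytes/second
```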
[GK104 die]
[GeForce GTX 680board]
CS 354 30
GeForce Peak Memory Bandwidth Trends
[Chart: peak memory bandwidth in gigabytes per second (0 to 200) for GeForce2 GTS, GeForce3, GeForce4 Ti 4600, GeForce FX, GeForce 6800 Ultra, and GeForce 7800 GTX; plots raw bandwidth and effective raw bandwidth with compression, each with an exponential trend line; marks the shift from 128-bit to 256-bit interfaces]
CS 354 31
Effective GPU Memory Bandwidth
Compression schemes
Lossless depth and color compression (when multisampling)
Lossy texture compression (S3TC / DXTC), typically assuming 4:1 compression
Avoidance of useless work
Early killing of fragments (Z cull)
Avoiding useless blending and texture fetches
Very clever memory controller designs Combining memory accesses for improved coherency Caches for texture fetches
CS 354 32
Other Metrics
Host bandwidth Vertex pulling Vertex transformation Triangle rasterization and setup Fragment shading rate Shader instruction rate Raster (blending) operation rate Early Z reject rate
CS 354 33
Kepler GeForce GTX 680 High-level Block Diagram
8 Streaming Multiprocessors (SMX)
1536 CUDA Cores, 8 Geometry Units, 4 Raster Units, 128 Texture Units, 32 Raster Operations, 256-bit GDDR5 memory
CS 354 34
Kepler Streaming Multiprocessor
8 more copies of this
CS 354 35
Prior Generation Streaming Multiprocessor (SM)
Multi-processor execution unit (Fermi): 32 scalar processor cores
Warp is a unit of thread execution of up to 32 threads
Two workloads Graphics
Vertex shader Tessellation Geometry shader Fragment shader
Compute
CS 354 36
Power Gating
Computer architecture has hit the “power wall” Low-power operation is at a premium
Battery-powered devices Thermal constraints Economic constraints
Power Management (PM) works to reduce power by Lower clocks when performance isn’t required Disabling hardware units
Avoids leakage
CS 354 37
Scene Graph Labor
High-level division of scene graph labor: four pipeline stages
App (application) Code that manipulates/modifies the scene graph in response to
user input or other events Isect (intersection)
Geometric queries such as collision detection or picking Cull
Traverse the scene graph to find the nodes to be rendered Best example: eliminate objects out of view
Optimize the ordering of nodes Sort objects to minimize graphics hardware state changes
Draw Communicating drawing commands to the hardware Generally through graphics API (OpenGL or Direct3D)
Can map well to multi-processor CPU systems
CS 354 38
App-cull-draw Threading
App-cull-draw processing on one CPU core
App-cull-draw processing on multiple CPUs
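One way to picture app-cull-draw threading is a three-stage software pipeline in which each stage runs on its own CPU thread and hands frames to the next stage through a queue. A minimal Python sketch with the per-stage work stubbed out; the function and queue names are my own, not from the slides:

```python
import queue
import threading

def app(out_q, frames):
    for f in range(frames):
        out_q.put(("scene", f))    # update the scene graph for frame f
    out_q.put(None)                # sentinel: no more frames

def cull(in_q, out_q):
    while (item := in_q.get()) is not None:
        _scene, f = item
        out_q.put(("visible", f))  # keep only in-view nodes (stubbed)
    out_q.put(None)

def draw(in_q, drawn):
    while (item := in_q.get()) is not None:
        _visible, f = item
        drawn.append(f)            # issue graphics API calls (stubbed)

app_to_cull, cull_to_draw, drawn = queue.Queue(), queue.Queue(), []
threads = [
    threading.Thread(target=app, args=(app_to_cull, 3)),
    threading.Thread(target=cull, args=(app_to_cull, cull_to_draw)),
    threading.Thread(target=draw, args=(cull_to_draw, drawn)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(drawn)  # [0, 1, 2]
```

Because the queues preserve order, frame order is maintained even though all three stages run concurrently on different frames.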
CS 354 39
Scene Graph Profiling
Scene graph should help provide insight into performance
Process statistics What’s going on? Time stamps
Database statistics How complex is the scene in any frame?
CS 354 40
Example:Depth Complexity Visualization
How many pixels are being rendered? Pixels can be rasterized by multiple objects Depth complexity is the average number of times a
pixel or color sample is updated per frame
yellow and black indicate higher depth complexity
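Average depth complexity can be computed by counting per-pixel framebuffer updates over a frame. A toy sketch with two hypothetical overlapping rectangles standing in for rasterized objects:

```python
# Count how many times each pixel is written while "rasterizing" overlapping
# rectangles, then average over all pixels: that average is the depth complexity.
width, height = 8, 8
updates = [[0] * width for _ in range(height)]

# Two hypothetical rectangles (x0, y0, x1, y1), half-open ranges.
for x0, y0, x1, y1 in [(0, 0, 8, 8), (2, 2, 6, 6)]:
    for y in range(y0, y1):
        for x in range(x0, x1):
            updates[y][x] += 1

total_updates = sum(sum(row) for row in updates)
depth_complexity = total_updates / (width * height)
print(depth_complexity)  # (64 + 16) / 64 = 1.25
```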
CS 354 41
Example:Heads-up Display of Statistics
Process statistics: how long is everything taking?
Database statistics: what is being rendered?
Overlaying on the active scene is often valuable
Dynamic update
CS 354 42
Benchmarking
Synthetic benchmarks focus on rendering particular operations in isolation
Example: what is the blended pixel performance?
Application benchmarks Try to reflect what a real application would do
CS 354 43
Tips for InteractivePerformance Analysis
Vary things you can control Change window resolution
Making it smaller and seeing better performance
Null driver analysis
Skip the actual rendering calls: what if the driver were "infinitely" fast?
Use occlusion queries to monitor how many samples (pixels) actually need to be rendered
Keep data on the GPU Let GPU do Direct Memory Access (DMA) Keep from swapping textures and buffers
Easier now that multi-gigabyte graphics cards are available
CS 354 44
Next Class
Next lecture Surfaces Programmable tessellation
Reading None
Project 4 Project 4 is a simple ray tracer Due Wednesday, May 2, 2012