7/29/2019 03-powerOf3DRendering-BPS2011-koduri
1/32
Power of Realtime 3D-Rendering
Raja Koduri
2/32
We ate our GPU cake and had more too! 16+ years of (sugar) high!
In every GPU generation:
More performance and performance-per-watt
More programmability, precision and features
While maintaining compatibility with 8+ generations of APIs
"You want the barrel full and the wife drunk" (Italian proverb: vuoi la botte piena e la moglie ubriaca)
3/32
Chip Power = Static Power + Dynamic Power
System Power = CPU + GPU + Other
Static power is leakage of inactive transistors
Static Power = N*V*e^(-Vt)
Dynamic power is from active switching transistors
Dynamic Power = A*N*C*F*V^2
N - Number of transistors
V - Voltage
C - Capacitance per transistor
F - Frequency
Vt - Threshold Voltage
A - Activity Factor
Energy = Power(Watts) * Time
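The relations above can be explored numerically. The sketch below treats them as proportionalities in arbitrary units, since the slides give no constants. The function names and the sample operating points are illustrative assumptions, not values from the talk.

```python
import math

def static_power(n, v, vt):
    """Leakage term from the slide: N * V * e^(-Vt), in arbitrary units."""
    return n * v * math.exp(-vt)

def dynamic_power(a, n, c, f, v):
    """Switching term from the slide: A * N * C * F * V^2, in arbitrary units."""
    return a * n * c * f * v ** 2

def chip_power(a, n, c, f, v, vt):
    """Chip Power = Static Power + Dynamic Power."""
    return static_power(n, v, vt) + dynamic_power(a, n, c, f, v)

def energy(power_watts, seconds):
    """Energy = Power (Watts) * Time."""
    return power_watts * seconds

# Hypothetical operating points: scale frequency and voltage down by 20%.
base = dynamic_power(a=1.0, n=1.0, c=1.0, f=1.0, v=1.0)
scaled = dynamic_power(a=1.0, n=1.0, c=1.0, f=0.8, v=0.8)
ratio = scaled / base  # F contributes linearly, V contributes quadratically
```

In this model, lowering both F and V by 20% roughly halves dynamic power, because voltage enters as a square.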
4/32
0-10 Watts TDP: Mobile Devices (phones, tablets, etc.)
10-50 Watts TDP: Mobile Computers (computers with a battery)
50-300 Watts TDP: Desktops, Servers, Consoles, etc. (always plugged in)
TDP - Thermal design power
Maximum amount of power that the thermal system can sustain
5/32
Sub-Moore's-law scaling of performance-per-watt
N wants to go up, but sub-Moore scaling on C and V.
GPUs scale better than CPUs
Beware of the Fine Print though
6/32
Towards lower power computers
Battery life
Thermal limits
Acoustics
$-per-KiloWattHour
Green
7/32
[Figure: power scale with marks at 0 W, 10 W, 50 W and 300 W; segments labeled Mobile Devices, Mobile Computers and Desktops]
HW vendors compelled to succeed in mobile
8/32
Chip Power = N*V*e^(-Vt) + A*N*C*F*V^2
CPUs
Prioritize Frequency - higher V and Vt
Spend N for caches, cores, flexibility and compute qu
GPUs
Lower Frequency and Voltage
Spend N for shaders, textures, pixels etc
FixedFunction
Lowest N, F and V for a given task
9/32
If your workload parallelizes well on a GPU, use the GPU
Optimize for System Energy
It's only a win if
GPU_Workload_power * GPU_Executiontime + CPU_GPU_feeding_power * CPU_GPU_feeding_time
is less than
CPU_Workload_power * CPU_Executiontime
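The inequality above is easy to evaluate once the four GPU-side terms and two CPU-side terms have been measured. The following is a minimal sketch; the function name, parameter names and the sample wattages and times are hypothetical and chosen only to show both outcomes.

```python
def gpu_offload_saves_energy(gpu_power, gpu_time,
                             feed_power, feed_time,
                             cpu_power, cpu_time):
    """Return True when running on the GPU costs less system energy.

    Powers are in watts and times in seconds, so each product is joules.
    The feeding terms cover the CPU work needed to keep the GPU supplied.
    """
    gpu_path = gpu_power * gpu_time + feed_power * feed_time
    cpu_path = cpu_power * cpu_time
    return gpu_path < cpu_path

# Hypothetical measurements: a cheap feeding path makes the GPU a win ...
cheap_feed = gpu_offload_saves_energy(20.0, 0.01, 5.0, 0.01, 15.0, 0.05)
# ... while an expensive feeding path erases the advantage.
costly_feed = gpu_offload_saves_energy(20.0, 0.01, 10.0, 0.06, 15.0, 0.05)
```

The second call illustrates the slide's caveat: a GPU that finishes faster can still lose on system energy if the CPU spends too long feeding it.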
10/32
11/32
Power off unused areas (power gating)
When not in use, Static Power ~= 0
Fine print:
Latency with power toggle (a few nanoseconds to a few hundred microseconds)
Too aggressive switching may cause performance problems; too conservative switching may lead to wasted power
Static Power = N*V*e^(-Vt)
12/32
Primary OS+HW strategy is to control F & V
Based on history of A
Applications have control of Activity (A)
Lower A leads to lower F and V, and lower power.
Fine print:
Switching F & V states can take anywhere from a few milliseconds to a few seconds
Dynamic Power = A*N*C*F*V^2
13/32
[Figure: Activity/Performance/Power sample illustration. Activity, Frequency and Power (scale 0 to 1.0) plotted against Time, with the state switch latency marked.]
14/32
Applications
GPU API/Runtime
GPU User Mode Drivers
GPU Kernel Mode Drivers
GPU Power Management Driver
GPU Power Micro-Controller
15/32
Tip 1: Control frame rate to the minimum desired
Pumping out more frames than the user can see is wasteful anyway
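One way to apply Tip 1 is a simple frame limiter that sleeps away whatever is left of each frame's time slot. The sketch below is an assumption about how an application loop might do this, not something shown in the talk; `run_capped`, `remaining_frame_budget` and the injectable `clock` and `sleep` parameters are hypothetical names, and a real renderer would more likely rely on vsync or the platform's presentation timing.

```python
import time

def remaining_frame_budget(frame_start, now, target_fps):
    """Seconds left in this frame's slot, or 0.0 if the frame overran."""
    budget = 1.0 / target_fps
    return max(0.0, budget - (now - frame_start))

def run_capped(render_frame, target_fps, num_frames,
               clock=time.monotonic, sleep=time.sleep):
    """Render num_frames frames, idling after each one to hold target_fps."""
    for _ in range(num_frames):
        start = clock()
        render_frame()
        # Sleeping (rather than rendering another frame) lets the
        # hardware drop to a lower power state for the rest of the slot.
        sleep(remaining_frame_budget(start, clock(), target_fps))

# Example: three trivial frames capped at 50 frames per second.
run_capped(lambda: None, target_fps=50, num_frames=3)
```

The clock and sleep functions are parameters only so the loop can be exercised with fake time in tests.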
16/32
[Figure: 50 Watt System. Axes: Frames-per-sec and Power in Watts. Curves: Simple rendering (like menus) and Complex rendering (like game p...). Power limit marked at 50 W.]
17/32
[Figure: 25 Watt System. Axes: Frames-per-sec and Power in Watts. Curves: Simple rendering (like menus) and Complex rendering (like game p...). Power limit marked at 25 W; annotations: Thermal limit, 60 fps app.]
18/32
19/32
Tip 2: Optimize the frame rendering time to a minimum
Don't stop optimizing when you hit your minimum frame rate targets!
20/32
[Figure: Activity (0 to 100) versus Time in Milliseconds for Case1]
Case1 Energy = 4*Pmax + 12*Pstatic
Pstatic is near zero in power-optimized GPUs, so
Case1 Energy ~= 4*Pmax
21/32
[Figure: Activity (0 to 100) versus Time in Milliseconds for Case1 and Case2]
Case1 Energy ~= 4*Pmax
Case2 Energy = 16*Pmax
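The two cases can be reproduced with a one-line energy model: time spent active is charged at Pmax and the rest of the frame at Pstatic. This is a sketch under stated assumptions: the 16 ms frame length is inferred from the 4 + 12 split in Case1, power is in relative units with Pmax = 1, and `frame_energy` is a hypothetical helper name.

```python
def frame_energy(active_ms, frame_ms, p_max, p_static):
    """Energy for one frame: active time at p_max, idle time at p_static."""
    idle_ms = frame_ms - active_ms
    return active_ms * p_max + idle_ms * p_static

P_MAX = 1.0  # relative units

# Case1: the frame is rendered in 4 ms, then the GPU idles for 12 ms.
case1 = frame_energy(active_ms=4, frame_ms=16, p_max=P_MAX, p_static=0.0)

# Case2: the frame keeps the GPU at Pmax for the whole 16 ms.
case2 = frame_energy(active_ms=16, frame_ms=16, p_max=P_MAX, p_static=0.0)

# With a non-zero leakage floor the Case1 advantage shrinks but remains.
case1_leaky = frame_energy(active_ms=4, frame_ms=16, p_max=P_MAX, p_static=0.1)
```

With Pstatic near zero, the frame that finishes early uses a quarter of the energy at the same delivered frame rate, which is the point of Tip 2.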
22/32
Tip 3: Don't scatter work in a frame (coalesce)
Scattered work leaves insufficient idle intervals for power-state reduction
23/32
[Figure: Activity (0 to 100) versus Time in Milliseconds, with work scattered across the frame]
May not be enough idle time for switching to lower power
24/32
Tip 4: Avoid spin-loops
E.g., CPU waiting on GPU
A spin-loop looks like real work to the CPU
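The difference between spinning and blocking can be shown with an ordinary thread event standing in for a GPU completion signal. This is only an analogy: `fake_gpu`, `wait_spinning` and `wait_blocking` are hypothetical names, and a real application would use whatever fence or event wait its graphics API provides.

```python
import threading

def wait_spinning(done):
    """Busy-wait: the CPU runs flat out polling the flag (what Tip 4 warns against)."""
    while not done.is_set():
        pass

def wait_blocking(done, timeout=None):
    """Blocking wait: the thread sleeps until signalled, so the CPU can idle."""
    return done.wait(timeout)

# A timer thread stands in for the GPU finishing its work after 10 ms.
gpu_done = threading.Event()
fake_gpu = threading.Timer(0.01, gpu_done.set)
fake_gpu.start()

# Preferred: block until the signal arrives instead of polling for it.
finished = wait_blocking(gpu_done, timeout=1.0)
fake_gpu.join()
```

Both functions return once the work is done, but only the blocking version gives the power management layers an idle CPU to act on.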
25/32
26/32
A complex subject
Dynamic scheduling systems based on power and thermal feedback
Hardware vs. software schedulers
Scheduling CPU and GPU
Many more topics
27/32
Premature declaration of death of fixed-function
in GPU hardware
What new candidates can we move to
FixedFunction?
What interface principles should
FixedFunction hardware adopt to be
mainstream programmer friendly?
28/32
Can we build predictive models to augment current reactive models?
Should there be APIs for apps to influence power states and monitor feedback?
29/32
2560x1600 Screen @60Hz ~ 250 MPixels/Sec
A 25W GPU today
5800 MPixels/Sec
17400 MTexels/Sec
696000 MFLOPS
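The arithmetic behind these figures can be checked directly. The sketch below recomputes the display's pixel rate and divides the quoted GPU throughput numbers by it; the variable names are illustrative, and the per-pixel ratios are a derived reading of the slide's numbers rather than figures stated in the talk.

```python
# Pixel rate needed to refresh a 2560x1600 screen at 60 Hz.
screen_pixels_per_sec = 2560 * 1600 * 60
screen_mpixels_per_sec = screen_pixels_per_sec / 1e6  # the slide rounds to ~250

# Throughput figures quoted on the slide for a 25 W GPU.
gpu_mpixels_per_sec = 5800
gpu_mtexels_per_sec = 17400
gpu_mflops = 696000

# How much GPU throughput is available per displayed pixel.
fills_per_pixel = gpu_mpixels_per_sec / screen_mpixels_per_sec
texels_per_pixel = gpu_mtexels_per_sec / screen_mpixels_per_sec
flops_per_pixel = gpu_mflops / screen_mpixels_per_sec

print(f"{screen_mpixels_per_sec:.2f} MPixels/Sec needed by the display")
print(f"{fills_per_pixel:.1f} pixel writes per displayed pixel")
print(f"{texels_per_pixel:.1f} texel fetches per displayed pixel")
print(f"{flops_per_pixel:.0f} floating-point operations per displayed pixel")
```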
30/32
31/32
Reduction choices:
Reduce N, but that reduces capabilities
Reduce V, but that limits performance and is not in your control (process limits)
Raise Vt, but that means slower transistors and lower performance
Static Power = N*V*e^(-Vt)
Dominant battery life factor for common usage scenarios
32/32
[Figure: Activity (0 to 100) versus Time in Milliseconds for Case1 and Case3]
Case1 Energy ~= 4*Pmax
Case3 Energy = 8*0.7*P?