Darrell Boggs, CPU Architecture
Co-authors: Gary Brown, Bill Rozas,
Nathan Tuck, K S Venkatraman
HOT CHIPS 2014
NVIDIA’S DENVER PROCESSOR
TEGRA K1
The First 64-bit Android Kepler-Class Chip with Dual Denver CPUs
TEGRA K1 192-core Kepler-Class Chip
One Chip — Two Versions
Pin Compatible
Tegra K1-32: Quad A15 CPUs, 32-bit, 3-way superscalar, up to 2.3GHz, 32K+32K L1$
Tegra K1-64: Dual Denver CPUs, 64-bit, 7-way superscalar, up to 2.5GHz, 128K+64K L1$
DENVER VALUE PROPOSITION
Highest performance and very energy-efficient ARMv8 processor
Greater dynamic sharing with GPU
Extended battery life
Low latency power-state transition
Best web browsing experience
Designed to bring PC-class performance to the ARM ecosystem
Content creation
Gaming
Enterprise applications
DENVER CPU
Highest Perf ARMv8 CPU
7-wide superscalar
Aggressive HW prefetcher
Dynamic Code Optimization: optimize once, use many times
OOO execution without the power
[Block diagram: Denver Core with IFU and BPU, L1 Inst Cache (128 K, 4-way), ARM HW Decoder, Scheduler, L1 Data Cache (64 K, 4-way), HW Prefetch, and execution units JSR, IEU0, IEU1, FP0, FP1, LS0, LS1.]
TEGRA K1 SUPERSCALAR ARCHITECTURE
[Figure: execution-unit diagrams for the two cores. One shows a scheduler and a 3-wide decoder feeding Branch, Integer, Integer, Mul, FP/NEON, FP/NEON, Load and Store units. The other shows a scheduler and an 8-wide decoder (predecode) feeding Branch, Mul, Integer, Integer, FP/NEON, FP/NEON, Load/Store/Integer and Load/Store/Integer units.]
Cortex-A15 (Tegra K1-32)
Branch: 1
Integer: 2
Multiply: 1
Floating Point/NEON: 2 x 64-bit
LD/ST: 1 LD and 1 ST
Peak IPC 3
Denver (Tegra K1-64)
Branch: 1
Integer: 2 (+ Mul) + 2
Floating Point/NEON: 2 x 128-bit
LD/ST: 2 LD and/or ST
Peak IPC 7+
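The peak IPC figures can be combined with the clock ceilings quoted earlier in the deck (2.3GHz for the A15 version, 2.5GHz for the Denver version) to get a rough upper bound on instruction throughput. The sketch below is only that arithmetic: the function name is made up for illustration, "7+" is treated as 7, and sustained rates on real code will be lower.

```python
# Peak-rate arithmetic from the slide figures; real workloads sustain less.
CORES = {
    # name: (cores per chip, peak IPC, max clock in GHz)
    "Tegra K1-32 (Cortex-A15)": (4, 3, 2.3),
    "Tegra K1-64 (Denver)": (2, 7, 2.5),  # slide says "7+"; 7 is assumed here
}


def peak_ginstr_per_s(ipc, ghz, cores=1):
    """Upper bound on instructions issued per second, in billions."""
    return ipc * ghz * cores


for name, (n, ipc, ghz) in CORES.items():
    per_core = peak_ginstr_per_s(ipc, ghz)
    per_chip = peak_ginstr_per_s(ipc, ghz, n)
    print(f"{name}: {per_core:.1f} G instr/s per core, {per_chip:.1f} per chip")
```

Under these assumptions the per-core bound is about 6.9 versus 17.5 billion instructions per second, and the per-chip bound about 27.6 (four A15 cores) versus 35.0 (two Denver cores).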
Pipeline Microarchitecture – Mispredict Penalty
[Figure: Denver pipeline diagram. Stage labels: IP1, IC2, IW3, IN4, IN5, SB1, SB2, EB0, EB1, EA2, ED3, EL4, EE5, ES6, EW7. Function labels: ITLB, I$ Rd, Way Sel, Pick, Dec, PB, Sch, RF Rd, Bypass, ALU, CMP, JCC, RF wr. From the misprediction signal to the correct target: 13 cycle penalty.]
Tegra K1-32: 15 cycle mispredict
Tegra K1-64: 13 cycle mispredict
Higher ILP and efficiency (lower is better)
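To put the two penalties in context, the sketch below converts them to wall-clock time and to added cycles per instruction. It assumes each core runs at its quoted maximum clock, and the mispredict rate (`MPKI`) is a hypothetical value chosen only for illustration; neither the rate nor the function names come from the deck.

```python
def penalty_ns(cycles, ghz):
    """Wall-clock cost of one mispredict: cycles divided by clock rate."""
    return cycles / ghz


def cpi_overhead(penalty_cycles, mispredicts_per_kilo_instr):
    """Extra cycles per instruction attributable to branch mispredicts."""
    return penalty_cycles * mispredicts_per_kilo_instr / 1000.0


MPKI = 5.0  # hypothetical mispredicts per 1000 instructions, not from the deck

for name, cycles, ghz in [("Tegra K1-32", 15, 2.3), ("Tegra K1-64", 13, 2.5)]:
    print(f"{name}: {penalty_ns(cycles, ghz):.2f} ns per mispredict, "
          f"{cpi_overhead(cycles, MPKI):.3f} CPI added at MPKI={MPKI}")
```

The shorter pipeline and the higher clock compound: about 5.2 ns per mispredict against roughly 6.5 ns under these assumptions.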
CORE CLUSTER RETENTION STATE
New power management state: CC4
Allows cache and architectural state retention
Allows voltage to be reduced below Vmin to a retention voltage
Fast entry and exit latencies enable wider range of use
DENVER IDLE POWER IMPROVES WITH RETENTION
[Chart annotations: overhead from energy required to flush L2; leakage; efficiency crossover; power penalty if entered for short durations.]
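One way to read the efficiency crossover is as a break-even idle time between the retention state and a deeper state that must flush the L2 before power is removed. The sketch below models that reading; the function names and all numeric values are hypothetical, and the deck gives no energy or power figures.

```python
def break_even_idle_s(flush_energy_j, retention_power_w, gated_power_w):
    """Idle duration at which a flush-and-power-gate state matches retention.

    Retention costs retention_power_w for the whole idle period. The deeper
    state costs a one-time flush_energy_j plus gated_power_w while idle.
    """
    saved_per_second = retention_power_w - gated_power_w
    if saved_per_second <= 0:
        return float("inf")  # the deeper state never pays off
    return flush_energy_j / saved_per_second


def cheaper_state(idle_s, flush_energy_j, retention_power_w, gated_power_w):
    """Pick the lower-energy state for an idle period of idle_s seconds."""
    crossover = break_even_idle_s(flush_energy_j, retention_power_w, gated_power_w)
    return "retention (CC4)" if idle_s < crossover else "flush and power gate"


# Hypothetical values: 3 mJ to flush L2, 20 mW in retention, 5 mW when gated.
print(break_even_idle_s(0.003, 0.020, 0.005))
print(cheaper_state(0.050, 0.003, 0.020, 0.005))
print(cheaper_state(1.000, 0.003, 0.020, 0.005))
```

With these made-up numbers the crossover sits near 0.2 s: shorter idle periods favor retention, longer ones favor the deeper state.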
DYNAMIC CODE OPTIMIZATION: OPTIMIZE ONCE, USE MANY TIMES
[Diagram: Instructions enter the Denver Hardware, which contains the Hardware Decoder, Execution Units, and Optimizer. Label: Optimization μ-interrupt.]
DYNAMIC CODE OPTIMIZATION: OPTIMIZE ONCE, USE MANY TIMES
[Diagram: Instructions enter the Denver Hardware (Hardware Decoder, Execution Units, Optimizer). Dynamic Profile Information feeds the Optimizer, which writes Optimized μcode into the Optimization Cache.]
Unrolls loops
Renames registers
Reorders loads and stores
Improves control flow
Removes unused computation
Hoists redundant computation
Sinks uncommonly executed computation
Improves scheduling
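Three of the listed transformations (hoisting redundant computation, removing unused computation, and unrolling loops) can be shown on ordinary source code. The sketch below is a hand-written before-and-after pair in Python with made-up function names. It illustrates the kind of rewrite only; it does not show Denver's optimizer, which works on ARM instructions and emits microcode.

```python
def weighted_sum_baseline(data, scale, offset):
    """Straightforward loop with redundant and unused work."""
    total = 0
    for x in data:
        k = scale * offset + 1   # same value recomputed every iteration
        unused = x * x           # result is never read
        total += x * k
    return total


def weighted_sum_transformed(data, scale, offset):
    """Same result after hoisting, dead-code removal, and 2x unrolling."""
    k = scale * offset + 1       # hoisted out of the loop
    total = 0
    n = len(data)
    i = 0
    while i + 1 < n:             # loop body unrolled by two
        total += data[i] * k
        total += data[i + 1] * k
        i += 2
    if i < n:                    # leftover element when n is odd
        total += data[i] * k
    return total


sample = [3, 1, 4, 1, 5]
print(weighted_sum_baseline(sample, 2, 7), weighted_sum_transformed(sample, 2, 7))
```

Both functions return the same value for any input; the transformed version does less work per element and fewer loop-control steps.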
DYNAMIC CODE OPTIMIZATION: OPTIMIZE ONCE, USE MANY TIMES
[Diagram: Instructions enter the Denver Hardware (Hardware Decoder, Execution Units). An Optimization Lookup finds Optimized μcode in the Optimization Cache. Caption: optimized code execution from optimization cache.]
DYNAMIC CODE OPTIMIZATION: OPTIMIZE ONCE, USE MANY TIMES
[Diagram: Instructions enter the Denver Hardware (Hardware Decoder, Execution Units, Optimization Lookup, Optimizer). Dynamic Profile Information feeds the Optimizer. The Optimization Cache holds several Optimized μcode entries linked by Chaining.]
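The flow across these diagrams (profile, optimize once, then look up and reuse) can be mimicked with a toy model. Everything in the sketch below is an assumption made for illustration: the class name, the hot-region threshold, and the string stand-in for optimized microcode are not from the deck, and chaining between cached entries is not modeled.

```python
class ToyDCO:
    """Toy model: profile regions, optimize hot ones once, then reuse."""

    def __init__(self, hot_threshold=3):
        self.hot_threshold = hot_threshold  # hypothetical tuning knob
        self.profile = {}                   # region address -> run count
        self.opt_cache = {}                 # region address -> optimized form
        self.optimizer_runs = 0

    def _optimize(self, addr):
        self.optimizer_runs += 1
        return f"optimized<{addr:#x}>"      # stand-in for optimized microcode

    def execute(self, addr):
        """Run the region at addr and report which path executed it."""
        if addr in self.opt_cache:          # optimization lookup hit
            return "optimization cache"
        self.profile[addr] = self.profile.get(addr, 0) + 1
        if self.profile[addr] >= self.hot_threshold:
            self.opt_cache[addr] = self._optimize(addr)
        return "hardware decoder"


dco = ToyDCO()
paths = [dco.execute(0x4000) for _ in range(6)]
print(paths)
print("optimizer runs:", dco.optimizer_runs)
```

In this model the first three executions of the region go through the hardware decoder, the optimizer runs once, and every later execution is served from the optimization cache.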
DENVER PERFORMANCE
[Bar chart: performance relative to Tegra K1 32, y axis from 0% to 300%. Benchmarks: DMIPS, SPECInt 2K, SPECFP 2K, AnTuTu 4, Geekbench 3 Single-Core, Google Octane v2.0, 16MB Memcpy (GB/s), 16MB Memset (GB/s), 16MB Memread (GB/s). Series: Baytrail (Celeron N2910), Krait-400 (8974-AA), iPhone 5s (A7 Cyclone), Haswell (Celeron 2955U), Tegra K1 64.]
DCO: AN EXAMPLE
SPECINT – CRAFTY EXECUTION (full benchmark run)
[Chart: profile of execution. Y axis: % Exec Types. Series: Optimized μcode execution, HW Decoder execution, Dynamic Code Optimizer.]
DCO: AN EXAMPLE
SPECINT – CRAFTY EXECUTION (full benchmark run)
[Chart panels: % Exec Types; New ARM Code (profile of new ARM instructions).]
DCO: AN EXAMPLE
SPECINT – CRAFTY EXECUTION (full benchmark run)
[Chart panels: % Exec Types; New ARM Code; IPC (profile of Instructions Per Cycle).]
DCO: AN EXAMPLE
SPECINT – CRAFTY EXECUTION (3% of benchmark run)
[Chart panels: % Exec Types; New ARM Code; IPC.]
CONCLUSION
Dynamic Code Optimization is the architecture of the future
Breaks the out-of-order window physical limitation
Opens synergy between HW and SW that current architectures lack
Improves efficiency by optimizing once and using many times
Delivering PC-class performance to mobile form factors
Enables PC-class gaming experience
Enables true enterprise applications
Enables content creation
ACKNOWLEDGMENT
We would like to thank the CPU team at NVIDIA for all the creativity, hard work, and dedication that brought this vision to reality.