

Date post: 28-Apr-2018
Darrell Boggs, CPU Architecture Co-authors: Gary Brown, Bill Rozas, Nathan Tuck, K S Venkatraman HOT CHIPS 2014 NVIDIA’S DENVER PROCESSOR
Transcript
Page 1: HOT CHIPS 2014 NVIDIA’S DENVER PROCESSOR Boggs, CPU Architecture Co-authors: Gary Brown, Bill Rozas, Nathan Tuck, K S Venkatraman HOT CHIPS 2014 NVIDIA’S DENVER PROCESSOR ... 3

Darrell Boggs, CPU Architecture

Co-authors: Gary Brown, Bill Rozas,

Nathan Tuck, K S Venkatraman

HOT CHIPS 2014

NVIDIA’S DENVER PROCESSOR

Page 2:

TEGRA K1

The First 64-bit Android Kepler-Class Chip with Dual Denver CPUs

Page 3:

TEGRA K1: 192-core Kepler-Class Chip

One Chip, Two Versions (Pin Compatible)

               Quad A15 CPUs        Dual Denver CPUs
Architecture   32-bit               64-bit
Issue width    3-way Superscalar    7-way Superscalar
Clock          Up to 2.3GHz         Up to 2.5GHz
L1$            32K+32K              128K+64K

Page 4:

DENVER VALUE PROPOSITION

Highest performance and very energy-efficient ARMv8 processor

Greater dynamic sharing with GPU

Extended battery life

Low latency power-state transition

Best web browsing experience

Designed to bring PC-class performance to the ARM ecosystem

Content creation

Gaming

Enterprise applications

Page 5:

DENVER CPU

Highest Perf ARMv8 CPU

7-wide superscalar

Aggressive HW prefetcher

Dynamic Code Optimization: optimize once, use many times

OOO execution without the power

[Block diagram, Denver Core. Labels: IFU; BPU; L1 Inst Cache (128 K, 4 Way); ARM HW Decoder; Scheduler; L1 Data Cache (64 K, 4 Way); execution units JSR, IEU0, IEU1, FP0, FP1, LS0, LS1; HW Prefetch]

Page 6:

TEGRA K1 SUPERSCALAR ARCHITECTURE

[Block diagrams. Cortex-A15: 3 wide Decoder and Scheduler feeding Branch, Integer, Integer, Mul, FP/NEON, FP/NEON, Load, and Store units. Denver: 8 wide Decoder (predecode) and Scheduler feeding Branch, Mul, Integer, Integer, FP/NEON, FP/NEON, Load/Store/Integer, and Load/Store/Integer units.]

                      Cortex-A15 (Tegra K1-32)   Denver (Tegra K1-64)
Branch                1                          1
Integer               2                          2 (+ Mul) + 2
Multiply              1                          (listed under Integer)
Floating Point/NEON   2 x 64-bit                 2 x 128-bit
LD/ST                 1 LD and 1 ST              2 LD and/or ST
Peak IPC              3                          7+
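As a rough worked example, a core's peak issue rate is its peak IPC multiplied by its clock frequency. The sketch below combines the peak IPC figures on this slide with the "up to" clocks and core counts from the Tegra K1 comparison slide. `CoreSpec` and `peak_ginstr_per_s` are illustrative names, Denver's "7+" is taken as 7, and the results are upper bounds only, not measured throughput.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreSpec:
    name: str
    peak_ipc: int          # peak instructions per cycle, from the slide
    max_clock_ghz: float   # "up to" clock from the Tegra K1 comparison slide
    cores: int             # cores per chip, from the same slide


def peak_ginstr_per_s(spec: CoreSpec) -> float:
    """Upper bound on instructions issued per second by one core, in billions."""
    return spec.peak_ipc * spec.max_clock_ghz


A15 = CoreSpec("Cortex-A15 (Tegra K1-32)", peak_ipc=3, max_clock_ghz=2.3, cores=4)
# The slide says "7+"; 7 is used here as a conservative stand-in.
DENVER = CoreSpec("Denver (Tegra K1-64)", peak_ipc=7, max_clock_ghz=2.5, cores=2)

for spec in (A15, DENVER):
    per_core = peak_ginstr_per_s(spec)
    print(f"{spec.name}: {per_core:.1f} G instr/s per core, "
          f"{per_core * spec.cores:.1f} G instr/s per chip")
```

Peak figures like these ignore cache misses, branch mispredictions, and dependency stalls, so they say nothing about sustained performance on real workloads.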

Page 7:

Pipeline Microarchitecture – Mispredict Penalty

[Pipeline diagram, Denver. Stage labels: IP1, IC2, IW3, IN4, IN5, SB1, SB2, EB0, EB1, EA2, ED3, EL4, EE5, ES6, EW7. Function labels: ITLB, I$ Rd, Way Sel, Pick, Dec, PB, Sch, RF Rd, Bypass, ALU, JCC, CMP, RF wr. Annotations: Misprediction Signal; Correct Target; 13 Cycle Penalty]

Tegra K1-32: 15 cycle mispredict

Tegra K1-64: 13 cycle mispredict

Higher ILP and efficiency

Lower is better
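To put the two penalties in context, a common first-order model charges each mispredicted branch the full penalty, so the added cycles per instruction are branch fraction times mispredict rate times penalty. The sketch below applies that model to the 15 and 13 cycle figures. The branch fraction and mispredict rate are hypothetical values chosen for illustration, and using the same rate for both cores is an assumption, since their predictors differ.

```python
def mispredict_cpi(branch_fraction: float, mispredict_rate: float,
                   penalty_cycles: int) -> float:
    """Extra cycles per instruction lost to branch mispredictions (first-order model)."""
    return branch_fraction * mispredict_rate * penalty_cycles


# Hypothetical workload characteristics, not taken from the slides.
BRANCH_FRACTION = 0.20   # one instruction in five is a branch
MISPREDICT_RATE = 0.05   # 5% of branches are mispredicted

# Penalties as stated on the slide.
PENALTIES = {"Tegra K1-32": 15, "Tegra K1-64": 13}

for chip, penalty in PENALTIES.items():
    extra = mispredict_cpi(BRANCH_FRACTION, MISPREDICT_RATE, penalty)
    print(f"{chip}: {extra:.3f} extra cycles per instruction")
```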

Page 8:

CORE CLUSTER RETENTION STATE

New power management state: CC4

Allows cache and architectural state retention

Allows voltage to be reduced below Vmin to a retention voltage

Fast entry and exit latencies enable wider range of use

Page 9:

DENVER IDLE POWER IMPROVES WITH RETENTION

[Chart annotations: Leakage; Overhead from energy required to flush L2; Efficiency crossover; Power penalty if entered for short durations]
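One plausible reading of the chart annotations is a break-even calculation between a shallow state, which is cheap to enter but keeps leaking, and a deeper state, which pays a fixed cost to flush the L2 but draws less power afterwards. The sketch below models each state as a fixed overhead plus a constant idle power and solves for the crossover duration. The `IdleState` type and every number in it are hypothetical; the slides give no energy or power values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdleState:
    name: str
    entry_exit_uj: float   # fixed energy to enter and leave the state (hypothetical)
    idle_power_mw: float   # power drawn while resident in the state (hypothetical)

    def energy_uj(self, idle_ms: float) -> float:
        # mW * ms = uJ
        return self.entry_exit_uj + self.idle_power_mw * idle_ms


def crossover_ms(shallow: IdleState, deep: IdleState) -> float:
    """Idle duration beyond which the deep state uses less energy than the shallow one."""
    if deep.idle_power_mw >= shallow.idle_power_mw:
        raise ValueError("deep state must draw less idle power than the shallow state")
    return (deep.entry_exit_uj - shallow.entry_exit_uj) / (
        shallow.idle_power_mw - deep.idle_power_mw
    )


# Illustrative numbers only.
RETENTION = IdleState("CC4 retention", entry_exit_uj=5.0, idle_power_mw=2.0)
POWER_GATED = IdleState("power-gated, L2 flushed", entry_exit_uj=100.0, idle_power_mw=0.1)

t_star = crossover_ms(RETENTION, POWER_GATED)
print(f"crossover at about {t_star:.1f} ms of idle time")
for idle_ms in (1.0, 10.0, 200.0):
    best = min((RETENTION, POWER_GATED), key=lambda s: s.energy_uj(idle_ms))
    print(f"idle {idle_ms:6.1f} ms -> {best.name}")
```

Under this model, idle periods shorter than the crossover favor retention, which matches the slide's note about a power penalty when a state with high entry cost is entered for short durations.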

Page 10:

DYNAMIC CODE OPTIMIZATION

OPTIMIZE ONCE, USE MANY TIMES

[Diagram labels: Instructions; Denver Hardware; Hardware Decoder; Execution Units; Optimizer; Optimization μ-interrupt]

Page 11:

DYNAMIC CODE OPTIMIZATION

OPTIMIZE ONCE, USE MANY TIMES

[Diagram labels: Instructions; Denver Hardware; Hardware Decoder; Execution Units; Optimizer; Optimization Cache; Optimized μcode; Dynamic Profile Information]

Unrolls loops

Renames registers

Reorders loads and stores

Improves control flow

Removes unused computation

Hoists redundant computation

Sinks uncommonly executed computation

Improves scheduling
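To make one item in the list above concrete, "removes unused computation" can be illustrated with a textbook backward liveness pass over a straight-line trace. This is a generic sketch and not Denver's optimizer: the `Op` type, the three-address form, and the assumption that no operation has side effects are all introduced here for illustration.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Op(NamedTuple):
    dest: str               # register or temporary written by the operation
    opcode: str             # mnemonic, kept only for readability
    srcs: Tuple[str, ...]   # registers or temporaries read by the operation


def remove_unused(ops: List[Op], live_out: Iterable[str]) -> List[Op]:
    """Drop operations whose results are never read.

    Walks the trace backwards, tracking which values are still needed.
    Assumes every operation is side-effect free (no stores, no traps).
    """
    live = set(live_out)
    kept: List[Op] = []
    for op in reversed(ops):
        if op.dest in live:
            kept.append(op)
            live.discard(op.dest)   # this definition satisfies the later uses
            live.update(op.srcs)    # its inputs must now be kept alive
    kept.reverse()
    return kept


trace = [
    Op("t1", "add", ("r0", "r1")),
    Op("t2", "mul", ("r0", "r0")),   # result never read: removable
    Op("r2", "sub", ("t1", "r1")),
]
optimized = remove_unused(trace, live_out={"r2"})
for op in optimized:
    print(op.dest, "=", op.opcode, *op.srcs)
```

A real optimizer must also preserve stores, exceptions, and other architecturally visible effects, which this sketch ignores.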

Page 12:

DYNAMIC CODE OPTIMIZATION

OPTIMIZE ONCE, USE MANY TIMES

[Diagram labels: Instructions; Denver Hardware; Hardware Decoder; Execution Units; Optimization Cache; Optimized μcode; Optimization Lookup]

Optimized code execution from optimization cache

Page 13:

DYNAMIC CODE OPTIMIZATION

OPTIMIZE ONCE, USE MANY TIMES

[Diagram labels: Instructions; Denver Hardware; Hardware Decoder; Execution Units; Optimization Lookup; Optimizer; Optimization Cache holding four Optimized μcode routines; Dynamic Profile Information; Chaining]
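The flow across these four slides (decode through hardware, profile, optimize once, then run from the optimization cache with chaining) can be sketched as a toy simulation. Everything below is an assumption made for illustration: the `ToyCore` class, the hot threshold of 3, and the idea that chaining means recording a direct link between consecutively executed optimized routines are not details given in the slides.

```python
from collections import Counter

HOT_THRESHOLD = 3  # hypothetical: executions before a region counts as hot


class ToyCore:
    """Toy model of 'optimize once, use many times'."""

    def __init__(self) -> None:
        self.profile = Counter()   # dynamic profile: region -> execution count
        self.opt_cache = {}        # region -> placeholder for optimized microcode
        self.chains = set()        # (from_region, to_region) direct links
        self.log = []              # which path each execution took
        self.prev = None           # previously executed region

    def execute(self, region: int) -> None:
        if region in self.opt_cache:
            # Optimization lookup hit: run the optimized routine.
            if self.prev in self.opt_cache:
                # Both routines are optimized, so link them directly.
                self.chains.add((self.prev, region))
            self.log.append("optimized")
        else:
            # Miss: run through the hardware decoder and collect profile data.
            self.log.append("hw_decoder")
            self.profile[region] += 1
            if self.profile[region] >= HOT_THRESHOLD:
                # Region is hot: optimize it once and cache the result.
                self.opt_cache[region] = f"ucode@{region:#x}"
        self.prev = region


core = ToyCore()
for region in [0x1000] * 5:
    core.execute(region)
print(core.log)
```

In this model the first three executions go through the hardware decoder and later ones hit the optimization cache, which is the amortization the slide title describes.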

Page 14:

DENVER PERFORMANCE

[Bar chart. Y-axis: Performance Relative to Tegra K1 32, 0% to 300%. Benchmarks: DMIPS; SPECInt 2K; SPECFP 2K; AnTuTu 4; Geekbench 3 Single-Core; Google Octane v2.0; 16MB Memcpy (GB/s); 16MB Memset (GB/s); 16MB Memread (GB/s). Series: Baytrail (Celeron N2910); Krait-400 (8974-AA); iPhone 5s (A7 Cyclone); Haswell (Celeron 2955U); Tegra K1 64]

Page 15:

DCO: AN EXAMPLE

SPECINT – CRAFTY EXECUTION

Full benchmark run

[Chart: Profile of execution. Y-axis: % Exec Types. Series: Optimized ucode execution; HW Decoder execution; Dynamic Code Optimizer]

Page 16:

DCO: AN EXAMPLE

SPECINT – CRAFTY EXECUTION

Full benchmark run

[Chart panels: % Exec Types; New ARM Code (Profile of new ARM instructions)]

Page 17:

DCO: AN EXAMPLE

SPECINT – CRAFTY EXECUTION

Full benchmark run

[Chart panels: % Exec Types; New ARM Code; IPC (Profile of Instructions Per Cycle)]

Pages 18 to 22:

DCO: AN EXAMPLE

SPECINT – CRAFTY EXECUTION

Full benchmark run

[Chart panels, repeated on each of these five slides with no other text: % Exec Types; New ARM Code; IPC]

Page 23:

DCO: AN EXAMPLE

SPECINT – CRAFTY EXECUTION

3% of benchmark run

[Chart panels: % Exec Types; New ARM Code; IPC]

Page 24:

CONCLUSION

Dynamic Code Optimization is the architecture of the future

Breaks the out-of-order window physical limitation

Opens synergy between HW and SW that current architectures lack

Improves efficiency by optimizing once and using many times

Delivering PC-class performance to mobile form factors

Enables PC-class gaming experience

Enables true enterprise applications

Enables content creation

Page 25:

ACKNOWLEDGMENT

We would like to thank the CPU team at NVIDIA for all the creativity, hard work, and dedication in bringing this vision to reality.

