Date post: 10-Apr-2020
Alex Ramirez, NVIDIA Research Embedded Supercomputing at NVIDIA
Transcript
Page 1: Embedded Supercomputing at NVIDIA · Tegra enabled ARM Multicore testing But the embedded GPU was not programmable … Tegra 2 2x ARM Cortex-A9 VFP for 64-bit FP Low-power GPU …

Alex Ramirez, NVIDIA Research

Embedded Supercomputing at NVIDIA

Page 2

The project: build HPC using embedded commodity technology

Tibidabo: Tegra 2

KAYLA: Tegra 3 + Quadro GPU

Pedraforca: Tegra 3 + Tesla K20

Mont-Blanc: Exynos 5250

Page 3

Tegra enabled ARM multicore testing, but the embedded GPU was not programmable …

Tegra 2: 2x ARM Cortex-A9, VFP for 64-bit FP, low-power GPU …

Tegra 3: 4x ARM Cortex-A9, VFP for 64-bit FP, 3x faster GPU …

Tegra 4: 4x ARM Cortex-A15, 72-core embedded GPU …

Page 4

Samsung Exynos 5250

2x ARM Cortex-A15 @ 1.7 GHz

4-core ARM Mali T604

OpenCL 1.1

Dual-channel DDR3

USB 3.0 to 1 GbE bridge

First embedded + programmable GPU

Page 5

NVIDIA Tegra K1

4x ARM Cortex-A15 @ 2.3 GHz

32KB + 32KB L1 cache

2 MB L2 cache

192-core Kepler GPU

CUDA 6

The first embedded CUDA GPU
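To put the core count on this slide in perspective, here is a minimal sketch of the usual theoretical-peak arithmetic: cores times FLOPs per core per cycle times clock. The 192 cores come from the slide; the GPU clock (0.85 GHz) and the figure of 2 FLOPs per core per cycle (one fused multiply-add) are assumptions made for illustration and are not stated in the talk. The helper name `peak_gflops` is hypothetical.

```python
def peak_gflops(cores: int, clock_ghz: float, flops_per_core_per_cycle: int = 2) -> float:
    """Theoretical peak in GFLOPS: cores x FLOPs per core per cycle x clock in GHz."""
    return cores * flops_per_core_per_cycle * clock_ghz


# 192 cores is from the slide; the 0.85 GHz clock is an assumed, illustrative value.
tegra_k1_gpu = peak_gflops(cores=192, clock_ghz=0.85)
print(f"Tegra K1 GPU at an assumed 0.85 GHz: {tegra_k1_gpu:.1f} GFLOPS peak (32-bit FP)")
```

This is an upper bound only; sustained performance depends on memory bandwidth and on how well the kernel keeps the cores busy.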

Page 6

Jetson DevKit Clusters @ SC’14

Lead the way … and they will follow

Page 7

NVIDIA Tegra K1-64 (Denver)

2x NVIDIA Denver CPU @ 2.5 GHz

64-bit ARMv8

128KB + 64KB L1 cache

2 MB L2 cache

192-core Kepler GPU

CUDA 6

Pin compatible with Tegra K1

64-bit ARM CPU

Page 8

NVIDIA Denver CPU

7-wide superscalar

2x FP pipelines

Dynamic code optimization

OOO performance with in-order execution

Aggressive HW prefetcher

Highest performance ARMv8 Mobile CPU

Page 9

NVIDIA Tegra X1

4x ARM Cortex-A57

48KB + 32KB L1 cache

2 MB L2 cache

4x ARM Cortex-A53

32KB + 32KB L1 cache

512KB L2 cache

256-core Maxwell GPU

Embedded GPU overhaul

Page 10

Drive-PX

2x Tegra X1

4x ARM Cortex-A57

4x ARM Cortex-A53

256-core Maxwell GPU

2.3 TFLOPS (32-bit FP)

1 GbE cluster interconnect

Embedded supercomputer for autonomous driving
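The two figures on this slide, 2.3 TFLOPS of compute and a 1 GbE cluster interconnect, can be combined into a rough balance ratio: how many bytes the network can move per floating-point operation at peak. The sketch below uses only those two numbers; the function name `interconnect_bytes_per_flop` is hypothetical, and the calculation ignores protocol overhead and assumes the link runs at its nominal line rate.

```python
def interconnect_bytes_per_flop(link_gbit_per_s: float, peak_tflops: float) -> float:
    """Bytes the link can deliver per floating-point operation at peak compute rate."""
    bytes_per_s = link_gbit_per_s * 1e9 / 8   # nominal line rate, bits to bytes
    flops_per_s = peak_tflops * 1e12
    return bytes_per_s / flops_per_s


# Both inputs are taken from the slide: 1 GbE link, 2.3 TFLOPS (32-bit FP).
ratio = interconnect_bytes_per_flop(link_gbit_per_s=1.0, peak_tflops=2.3)
print(f"{ratio:.2e} bytes of network traffic per FLOP")
```

A ratio this small means a node can perform tens of thousands of operations in the time it takes to receive a single byte, so only workloads with very little inter-node communication scale well over such a link.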

Page 11

Drive-PX2

2x Tegra X2

4x ARM Cortex-A57

2x NVIDIA Denver2

2x 3840-core Pascal GPU

8 TFLOPS (64-bit FP)

24 TFLOPS (16-bit FP)

1 GbE cluster interconnect

Discrete GPU for embedded deep learning capabilities

Page 12

This is what 8 TFLOPS looked like in 2001

ASCI White, LLNL, CA

Page 13

This is what 8 TFLOPS look like in 2016

Drive-PX2, under the hood of your next self-driving car

Page 14

This is what 40 TFLOPS looked like in 2004

MareNostrum, Barcelona Supercomputing Center, Spain

Page 15

This is what 40 TFLOPS look like in 2016

DGX-1 Deep Learning System

Page 16

NVLink is NVIDIA’s NUMA link

NVLink across Tegra SoCs

NVLink across Tegra and GPU

For GPUs only?

[Diagram: Tegra XYZ SoCs linked to one another, alongside Tegra XYZ SoCs paired with Tesla Ω GPUs, each GPU with four HBM stacks]

* Diagram does not imply any current or planned NVIDIA products

Page 17

The self-hosted GPU accelerator?

NVLink across Tegra and GPU

Self-hosted GPU package with heterogeneous memory

DDR + HBM

Why integrate the CPU with the GPU?

[Diagram: two Tegra XYZ SoCs, each paired with a Tesla Ω GPU with four HBM stacks]

* Diagram does not imply any current or planned NVIDIA products

Page 18

Conclusions

Embedded SoCs are gaining HPC compute capabilities

The benefits of SoC customization have not yet been exploited

Scaling performance across multiple SoCs is still challenging …

… but connecting many of them is getting easier

Heterogeneous IP blocks across heterogeneous SoCs

Heterogeneous memories

Homogeneous programming model!?

Unleash the genie …

Phenomenal cosmic powers … Itty bitty living space

