Experiences Using Tegra K1 and X1 for Highly Energy Efficient Computing

Gaurav Mitra, Andrew Haigh, Luke Angove, Anish Varghese, Eric McCreath, Alistair P. Rendell

Research School of Computer Science, Australian National University

Canberra, Australia

April 07, 2016

Introduction & Background

Overview

1 Introduction & Background

2 Power Measurement Environment

3 Experimental Platforms

4 Approach

5 Results & Analysis

6 Conclusion

Mitra et al. (ANU), GTC 2016, San Francisco, April 07, 2016


Use of low-powered SoCs for HPC

Nvidia Jetson TK1: ARM + GPU SoC

Nvidia Jetson TX1: ARM + GPU SoC

TI Keystone II: ARM + DSP SoC

Adapteva Parallella: ARM + 64-core NoC

TI BeagleBoard: ARM + DSP SoC

Terasic DE1: ARM + FPGA SoC

Rockchip Firefly: ARM + GPU SoC

Freescale Wandboard: ARM + GPU SoC

Cubieboard4: ARM + GPU SoC

http://cs.anu.edu.au/systems




Use of low-powered SoCs for HPC

For SoC processors to be considered viable exascale building blocks, important factors to explore include:

Absolute performance

Balancing use of different on-chip devices

Understanding the performance-energy trade-off



Contributions

Environment for monitoring and collecting high-resolution power measurements for SoC systems

Understanding the benefits of exploiting both the host CPU and accelerator GPU cores simultaneously for critical HPC kernels

Performance and energy comparisons with conventional HPC systems: Intel Xeon CPUs and NVIDIA K20 and K80 GPUs


Power Measurement Environment

Measurement Requirements

SoC systems generally consume very little power, on the order of a few Watts

Subtle differences in energy consumption are triggered by different factors, such as the use of CPU or on-chip GPU cores

Changes in DC current supplied to SoC system boards must be reliably measured

Current draw ranges from µAmps to a few Amps, so a very high-precision ammeter must be used to measure subtle changes



Measurement Apparatus

µCurrent Gold: High-precision ammeter for measuring low currents

An mbed LPC1768 micro-controller with a 12-bit ADC (0-3.3V) is used to measure analog output signals from the µCurrent Gold

The ADC has a resolution of 0.81±0.40mV, which corresponds to 0.81mA. This is 9.7±4.8mW at 12V.
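
The resolution figures quoted above can be reproduced from the ADC parameters. The following is a minimal sketch of that arithmetic, not the authors' code. It assumes the µCurrent Gold is used on a range that outputs 1 mV per mA (inferred from 0.81mV corresponding to 0.81mA) and that the quoted uncertainty is half of one ADC step.

```python
# Quantization step of a 12-bit ADC over a 0-3.3 V input range.
ADC_BITS = 12
V_REF = 3.3        # volts, full-scale ADC input
SUPPLY_V = 12.0    # volts, board supply voltage quoted on the slide
MV_PER_MA = 1.0    # ASSUMED ammeter gain: 1 mV of output per mA of current

step_mv = V_REF / 2**ADC_BITS * 1000.0   # one ADC count, in millivolts
step_ma = step_mv / MV_PER_MA            # current represented by one count
step_mw = step_ma * SUPPLY_V             # power represented by one count (mA x V = mW)

print(f"{step_mv:.2f} mV/count, {step_ma:.2f} mA/count, {step_mw:.1f} mW/count")
print(f"half-step uncertainty: {step_mv / 2:.2f} mV, {step_mw / 2:.1f} mW")
```

Under these assumptions one count is about 0.81 mV, or roughly 9.7 mW at 12V, matching the figures on the slide.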

https://www.eevblog.com/projects/ucurrent/

https://developer.mbed.org/



https://developer.mbed.org/platforms/mbed-LPC1768/




Experimental Platforms


            TK1              TX1              SANDY            HASWELL
CPU         ARM Cortex-A15   ARM Cortex-A57   Xeon E5-2665     Xeon E5-2670 v3
CPU Cores   4                4                2×8              2×12
CPU Freq.   2.3 GHz          2.2 GHz          2.4 GHz          2.3 GHz
RAM         2GB LPDDR3       3GB LPDDR4       128GB DDR3       128GB DDR3
GPU         GK20A            GM20B            K20m (GK110)     K80 (GK210)
GPU Cores   192              256              2496             2496
GPU Freq.   852 MHz          998 MHz          706 MHz          875 MHz
GPU RAM     Shared           Shared           5GB              12GB
CUDA        v6.5             v7.0             v7.0             v7.5


Approach

Evaluation Kernel

C = A × B, with B and C partitioned by columns:

[C1 C2] = A × [B1 B2]

C1 = A × B1 (computed on the CPU)

C2 = A × B2 (computed on the GPU)
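
The column partitioning of the evaluation kernel can be illustrated with a small sketch. It uses a naive pure-Python multiply on nested lists only to show that the two column blocks are independent and concatenate to the full product; the function names are hypothetical, and both blocks are computed on the CPU here rather than on separate devices.

```python
def matmul(a, b):
    """Naive product of two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def split_matmul(a, b, cpu_cols):
    """Compute C = A x B as two independent column blocks, C1 and C2."""
    b1 = [row[:cpu_cols] for row in b]   # first cpu_cols columns of B
    b2 = [row[cpu_cols:] for row in b]   # remaining columns of B
    c1 = matmul(a, b1)                   # block that would go to the CPU
    c2 = matmul(a, b2)                   # block that would go to the GPU
    return [r1 + r2 for r1, r2 in zip(c1, c2)]   # [C1 C2]

a = [[1, 2], [3, 4]]
b = [[5, 6, 7], [8, 9, 10]]
print(split_matmul(a, b, cpu_cols=1) == matmul(a, b))
```

Because neither block depends on the other, the only tuning parameter is how many columns each device receives.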



Approaches

Traditional methods:

Assign all work to GPU or CPU

Static Partitioning: Partition work between GPU and CPU based on a priori information

Beaumont et al., Matrix Multiplication on Heterogeneous Platforms
C. Yang et al., Adaptive Optimization for Petascale Heterogeneous CPU/GPU Computing
Donfack et al., Dynamically Balanced Synchronization-Avoiding LU Factorization with Multicore and GPUs

Dynamic Partitioning:
Papadrakakis et al., A New Era in Scientific Computing: Domain Decomposition Methods in Hybrid CPU-GPU Architectures

→ Existing approaches do not consider the use of shared physical memory or the implications for energy efficiency



Our approach

Static partitioning:

Guess a partition based on experimentally measured peak performances of CPU and GPU

Use the achieved peak performances to refine the partition

Repeat until convergence

Suitable for repeated calculations of the same size
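
The guess-and-refine loop above could look roughly like the sketch below. The slides do not give the exact update rule, so the proportional rebalancing used here is an assumption, and `measure` is a hypothetical callback that runs the split kernel once and returns the CPU and GPU times in seconds.

```python
def initial_split(n_cols, cpu_gflops, gpu_gflops):
    """Columns given to the CPU, proportional to its share of peak throughput."""
    return round(n_cols * cpu_gflops / (cpu_gflops + gpu_gflops))

def refine_split(n_cols, cpu_cols, measure, max_iters=20, tol=1):
    """Rebalance until the CPU share changes by at most `tol` columns.

    Assumed update rule: give each device a share proportional to the
    throughput (columns per second) it achieved on the previous run.
    """
    for _ in range(max_iters):
        cpu_s, gpu_s = measure(cpu_cols)          # hypothetical timing callback
        cpu_rate = cpu_cols / cpu_s
        gpu_rate = (n_cols - cpu_cols) / gpu_s
        new_cols = round(n_cols * cpu_rate / (cpu_rate + gpu_rate))
        new_cols = min(max(new_cols, 1), n_cols - 1)   # keep both devices busy
        if abs(new_cols - cpu_cols) <= tol:
            return new_cols
        cpu_cols = new_cols
    return cpu_cols

# Stand-in for real timings: devices that sustain 13.0 and 12.5 columns/s.
def fake_measure(cpu_cols, n_cols=4096):
    return cpu_cols / 13.0, (n_cols - cpu_cols) / 12.5

guess = initial_split(4096, cpu_gflops=14, gpu_gflops=12)
print(guess, refine_split(4096, guess, fake_measure))
```

The timing model in `fake_measure` is made up for illustration. On real hardware the achieved rates vary with the split size, which is why several iterations may be needed and why the approach suits repeated calculations of the same size.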

Use of shared memory on SoC systems:

The CUDA driver automatically protects CUDA-allocated memory during the kernel execution phase

We circumvent this by calling mprotect() to unprotect the memory immediately after initiating a kernel execution

Dynamic partitioning:

CPU and GPU remove chunks of matrix columns from a work queue

Chunk size must be sufficient to occupy CPU and GPU fully

On traditional discrete GPU systems, copies have to be carefully scheduled

Implemented using OpenMP

Two threads, one each for CPU and GPU, taking work off a master queue

The GPU thread runs on a CPU core, at the expense of productive work on the CPU cores
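
The work-queue scheme can be sketched as follows. The talk's implementation uses OpenMP; this illustration swaps in Python's `threading` and `queue` modules, and both workers run the same naive CPU code, so it shows only the scheduling pattern. All names are hypothetical.

```python
import queue
import threading

def matmul_cols(a, b, lo, hi):
    """Columns lo..hi-1 of A x B, returned as a list of columns."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for i in range(len(a))]
            for j in range(lo, hi)]

def dynamic_matmul(a, b, chunk):
    """Two workers pull chunks of columns from a shared queue until it is empty."""
    n_cols = len(b[0])
    work = queue.Queue()
    for lo in range(0, n_cols, chunk):
        work.put((lo, min(lo + chunk, n_cols)))
    cols = [None] * n_cols            # result columns, each written exactly once
    done_by = {"cpu": 0, "gpu": 0}    # columns processed by each worker

    def worker(name):
        while True:
            try:
                lo, hi = work.get_nowait()
            except queue.Empty:
                return                # no work left for this worker
            for j, col in zip(range(lo, hi), matmul_cols(a, b, lo, hi)):
                cols[j] = col
            done_by[name] += hi - lo

    threads = [threading.Thread(target=worker, args=(n,)) for n in ("cpu", "gpu")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [list(row) for row in zip(*cols)], done_by

a = [[1, 2], [3, 4]]
b = [[5, 6, 7], [8, 9, 10]]
c, done_by = dynamic_matmul(a, b, chunk=1)
print(c, done_by)
```

The faster worker naturally takes more chunks, so no prior knowledge of device performance is needed. The trade-off is chunk size: too small and neither device is fully occupied, too large and the load balance becomes coarse.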


Results & Analysis

Results: Best split performance

Platform   Matrix Size   CPU GFLOPS   GPU GFLOPS   CPU Split Cols   Split GFLOPS

DGEMM
TK1        4096          14           12           2176             26
TX1        4096          18           9            2608             25
SANDY      8192          311          836          2128             1099
HASWELL    16384         804          1124         6912             1870

SGEMM
TK1        4096          34           205          448              227
TX1        4096          38           391          128              399
SANDY      16384         643          2318         3392             2887
HASWELL    16384         1753         2526         6896             4109
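
One way to read the table is to compare each split result with the sum of the standalone CPU and GPU rates. The sketch below computes that ratio from the numbers in the table above; treating the sum as the reference point is a choice made here for illustration, not a metric the slides define.

```python
# (platform, kernel): (CPU GFLOPS, GPU GFLOPS, split GFLOPS), copied from the table.
results = {
    ("TK1", "DGEMM"): (14, 12, 26),
    ("TX1", "DGEMM"): (18, 9, 25),
    ("SANDY", "DGEMM"): (311, 836, 1099),
    ("HASWELL", "DGEMM"): (804, 1124, 1870),
    ("TK1", "SGEMM"): (34, 205, 227),
    ("TX1", "SGEMM"): (38, 391, 399),
    ("SANDY", "SGEMM"): (643, 2318, 2887),
    ("HASWELL", "SGEMM"): (1753, 2526, 4109),
}

def split_efficiency(cpu, gpu, split):
    """Split throughput as a fraction of the summed standalone throughputs."""
    return split / (cpu + gpu)

for (platform, kernel), row in results.items():
    print(f"{platform:8s} {kernel:6s} {split_efficiency(*row):6.1%}")
```

A ratio near 100% means the split run recovered almost all of the combined standalone throughput.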



Best Split Search - Tegra K1/X1

[Figure: two panels (DGEMM, SGEMM), each plotting GFLOPS and Joules for TK1 and TX1 against the split size given to the CPU, from 0 to 4,000.]



Best Split Search - Intel + NVIDIA GPUs

[Figure: two panels (DGEMM, SGEMM), each plotting GFLOPS and Joules for SANDY and HASWELL against the split size given to the CPU, from 0 to 4,000.]



Performance Scaling - TK1

[Figure: two panels, DGEMM GFLOPS and SGEMM GFLOPS on TK1, plotted against matrix dimension M=N=K from 16 to 4096. Series: CPU, GPU, SPLIT, DYNAMIC, TBALANCE, PEAK (CPU+GPU).]



Performance Scaling - TX1

[Figure: two panels, DGEMM GFLOPS and SGEMM GFLOPS on TX1, plotted against matrix dimension from 16 to 4096. Series: CPU, GPU, SPLIT, DYNAMIC, TBALANCE, PEAK (CPU+GPU).]



Energy Efficiency - TX1 - SGEMM

[Figure: Joules/FLOP (SP) on a logarithmic axis, plotted against matrix dimension from 128 to 4096. Series: CPU, GPU, SPLIT, TBALANCE, DYNAMIC. Annotated values: 4.22 × 10^-10 and 3.75 × 10^-11 Joules/FLOP.]



Energy Efficiency - Haswell - SGEMM

[Figure: Joules/FLOP (SP) on a logarithmic axis, plotted against matrix dimension from 512 to 16384. Series: CPU, GPU, SPLIT, TBALANCE, DYNAMIC. Annotated values: 1.76 × 10^-10 and 8.24 × 10^-11 Joules/FLOP.]
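
The Joules/FLOP metric used in the energy efficiency plots can be derived from a sampled power trace. The sketch below is one plausible way to do it, assuming energy is obtained by trapezoidal integration of power over time and that a square GEMM with M=N=K=n performs 2n^3 floating-point operations. The trace is made up for illustration and is not measured data from the talk.

```python
def energy_joules(times_s, power_w):
    """Trapezoidal integral of sampled power (W) over time (s)."""
    return sum((t1 - t0) * (p0 + p1) / 2
               for t0, t1, p0, p1 in zip(times_s, times_s[1:], power_w, power_w[1:]))

def gemm_flops(n):
    """Floating-point operations in an n x n x n matrix multiply."""
    return 2 * n**3

def joules_per_flop(times_s, power_w, n):
    return energy_joules(times_s, power_w) / gemm_flops(n)

# Made-up trace: a constant 10 W draw for 0.5 s during one 4096^3 multiply.
times = [0.0, 0.25, 0.5]
power = [10.0, 10.0, 10.0]
jpf = joules_per_flop(times, power, 4096)
print(f"{jpf * 1e12:.1f} pJ/FLOP, {1e-9 / jpf:.1f} GFLOPS/W")
```

The same conversion relates the annotated plot values to picojoules: 3.75 × 10^-11 Joules/FLOP is 37.5 pJ/FLOP.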


Conclusion


The high-accuracy, high-resolution energy measurement system introduced here enables tuning algorithms for optimal energy usage. This would allow libraries like ATLAS to tune and produce best-performance and best-energy optimized libraries.

How might a running application use information on energy usage to dynamically change its behaviour?

Use of shared physical memory on SoC systems eliminates transfer overhead

In at least one case (TX1 DGEMM), an energy benefit was observed from exploiting both CPU and GPU together

The best energy efficiency observed on SoC systems was 37.5 pJ/FLOP for SGEMM on the TX1, while on conventional systems the best was 82.4 pJ/FLOP for SGEMM on the K80.

Contact:

[email protected]

https://www.linkedin.com/in/alistair-rendell-6230b72

