Experiences Using Tegra K1 and X1 for Highly Energy-Efficient Computing
Gaurav Mitra, Andrew Haigh, Luke Angove, Anish Varghese, Eric McCreath, Alistair P. Rendell
Research School of Computer Science, Australian National University
Canberra, Australia
April 07, 2016
Overview
1 Introduction & Background
2 Power Measurement Environment
3 Experimental Platforms
4 Approach
5 Results & Analysis
6 Conclusion
Mitra et al. (ANU), GTC 2016, San Francisco, April 07, 2016
Introduction & Background
Use of low-powered SoCs for HPC
Nvidia Jetson TK1: ARM + GPU SoC
Nvidia Jetson TX1: ARM + GPU SoC
TI Keystone II: ARM + DSP SoC
Adapteva Parallella: ARM + 64-core NoC
TI BeagleBoard: ARM + DSP SoC
Terasic DE1: ARM + FPGA SoC
Rockchip Firefly: ARM + GPU SoC
Freescale Wandboard: ARM + GPU SoC
Cubieboard4: ARM + GPU SoC
http://cs.anu.edu.au/systems
Use of low-powered SoCs for HPC
In order for SoC processors to be considered viable exascale building blocks, important factors to explore include:
Absolute performance
Balancing use of different on-chip devices
Understanding the performance-energy trade-off
Contributions
An environment for monitoring and collecting high-resolution power measurements for SoC systems
Understanding the benefits of exploiting both the host CPU and accelerator GPU cores simultaneously for critical HPC kernels
Performance and energy comparisons with conventional HPC systems: Intel Xeon CPUs and NVIDIA K20 and K80 GPUs
Power Measurement Environment
Measurement Requirements
SoC systems generally consume very little power, on the order of a few watts
Subtle differences in energy consumption are triggered by different factors, such as the use of CPU or on-chip GPU cores
Changes in the DC current supplied to SoC system boards must be reliably measured
Current draw ranges from microamps to a few amps, so a very high-precision ammeter must be used to measure subtle changes
Measurement Apparatus
µCurrent Gold: high-precision ammeter for measuring low currents
An mbed LPC1768 micro-controller with a 12-bit ADC (0-3.3 V) is used to measure the analog output signal from the µCurrent Gold
The ADC has a resolution of 0.81 ± 0.40 mV, which corresponds to 0.81 mA. This is 9.7 ± 4.8 mW at 12 V.
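The resolution figures above can be reproduced with a few lines of arithmetic. The sketch below assumes the µCurrent Gold is used on a 1 mV per mA range (implied by 0.81 mV corresponding to 0.81 mA) and that the board is supplied at a constant 12 V. The function names and the sample period are hypothetical and are not taken from the measurement software.

```python
ADC_BITS = 12                  # mbed LPC1768 ADC width (from the slide)
ADC_REF_VOLTS = 3.3            # ADC input range 0-3.3 V (from the slide)
SUPPLY_VOLTS = 12.0            # board supply voltage (from the slide)
UCURRENT_VOLTS_PER_AMP = 1.0   # assumption: 1 mV per mA output range


def adc_step_volts():
    """Voltage represented by one ADC count."""
    return ADC_REF_VOLTS / (2 ** ADC_BITS)


def counts_to_watts(counts):
    """Convert a raw ADC reading into instantaneous board power."""
    volts = counts * adc_step_volts()
    amps = volts / UCURRENT_VOLTS_PER_AMP
    return amps * SUPPLY_VOLTS


def energy_joules(samples, sample_period_s):
    """Integrate power over equally spaced samples (rectangle rule)."""
    return sum(counts_to_watts(c) for c in samples) * sample_period_s


if __name__ == "__main__":
    step_mv = adc_step_volts() * 1e3
    step_mw = counts_to_watts(1) * 1e3
    print(f"ADC step: {step_mv:.2f} mV, half step: {step_mv / 2:.2f} mV")
    print(f"Power step at 12 V: {step_mw:.1f} mW, half step: {step_mw / 2:.1f} mW")
```

Energy for a kernel run would then be obtained by passing the ADC samples captured during the run to `energy_joules`, together with whatever sampling period the logger actually uses.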
https://www.eevblog.com/projects/ucurrent/
https://developer.mbed.org/
[Figure: Power Measurement Environment]
Experimental Platforms
            TK1              TX1              SANDY           HASWELL
CPU         ARM Cortex-A15   ARM Cortex-A57   Xeon E5-2665    Xeon E5-2670 v3
CPU Cores   4                4                2×8             2×12
CPU Freq.   2.3 GHz          2.2 GHz          2.4 GHz         2.3 GHz
RAM         2GB LPDDR3       3GB LPDDR4       128GB DDR3      128GB DDR3
GPU         GK20A            GM20B            K20m (GK110)    K80 (GK210)
GPU Cores   192              256              2496            2496
GPU Freq.   852 MHz          998 MHz          706 MHz         875 MHz
GPU RAM     Shared           Shared           5GB             12GB
CUDA        v6.5             v7.0             v7.0            v7.5
Approach
Evaluation Kernel
C = A × B, with B and C partitioned by columns: [C1 C2] = A × [B1 B2]
C1 = A × B1 (computed on the CPU)
C2 = A × B2 (computed on the GPU)
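The column partitioning of the matrix product can be illustrated with a small pure-Python sketch. It only shows that multiplying A by the two column blocks of B independently and joining the results reproduces the full product. The helper names are hypothetical, and both halves run on the CPU here; in the actual work each half would be handed to a tuned CPU or GPU GEMM routine.

```python
def matmul(A, B):
    """Plain triple-loop matrix product of two lists of rows."""
    inner = len(B)
    cols = len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(len(A))]


def split_columns(B, cpu_cols):
    """Split B into B1 (first cpu_cols columns) and B2 (the remainder)."""
    B1 = [row[:cpu_cols] for row in B]
    B2 = [row[cpu_cols:] for row in B]
    return B1, B2


def join_columns(C1, C2):
    """Place C1 and C2 side by side to form C = [C1 C2]."""
    return [r1 + r2 for r1, r2 in zip(C1, C2)]


if __name__ == "__main__":
    A = [[1, 2], [3, 4], [5, 6]]
    B = [[1, 0, 2, 1], [0, 1, 1, 3]]
    B1, B2 = split_columns(B, cpu_cols=1)
    C1 = matmul(A, B1)   # would run on the CPU
    C2 = matmul(A, B2)   # would run on the GPU
    print(join_columns(C1, C2) == matmul(A, B))
```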
Approaches
Traditional methods:
Assign all work to GPU or CPU
Static Partitioning: Partition work between GPU and CPU based on a priori information
Beaumont et al., Matrix Multiplication on Heterogeneous Platforms
C. Yang et al., Adaptive Optimization for Petascale Heterogeneous CPU/GPU Computing
Donfack et al., Dynamically Balanced Synchronization-Avoiding LU Factorization with Multicore and GPUs
Dynamic Partitioning:
Papadrakakis et al., A New Era in Scientific Computing: Domain Decomposition Methods in Hybrid CPU-GPU Architectures
→ Existing approaches do not consider the use of shared physical memory or the implications for energy efficiency
Our approach
Static partitioning:
Guess a partition based on experimentally measured peak performances of the CPU and GPU
Use the achieved performance to refine the partition
Repeat until convergence
Suitable for repeated calculations of the same size
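The guess-and-refine loop for static partitioning can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the proportional update rule, the 16-column alignment, and the two rate models standing in for real timed GEMM runs are all hypothetical. The peak values of 34 and 205 GFLOPS are the TK1 SGEMM figures reported in the results.

```python
def proportional_split(n_cols, cpu_rate, gpu_rate, align=16):
    """Columns for the CPU, proportional to its share of the total rate."""
    cols = n_cols * cpu_rate / (cpu_rate + gpu_rate)
    return int(round(cols / align)) * align


def refine_split(n_cols, measure_cpu, measure_gpu, cpu_peak, gpu_peak,
                 max_iter=20):
    """Start from measured peaks, then refine using achieved rates."""
    split = proportional_split(n_cols, cpu_peak, gpu_peak)
    for _ in range(max_iter):
        cpu_rate = measure_cpu(split)            # achieved GFLOPS on CPU part
        gpu_rate = measure_gpu(n_cols - split)   # achieved GFLOPS on GPU part
        new_split = proportional_split(n_cols, cpu_rate, gpu_rate)
        if new_split == split:                   # converged
            break
        split = new_split
    return split


# Hypothetical rate models: efficiency drops when a device gets few columns.
def model_cpu(cols):
    return 34.0 * cols / (cols + 64.0)


def model_gpu(cols):
    return 205.0 * cols / (cols + 256.0)


if __name__ == "__main__":
    cpu_cols = refine_split(4096, model_cpu, model_gpu,
                            cpu_peak=34.0, gpu_peak=205.0)
    print("columns given to CPU:", cpu_cols)
```

In a real run, `measure_cpu` and `measure_gpu` would time the actual GEMM calls on each device for the given number of columns, which is why the scheme suits repeated calculations of the same size.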
Use of shared memory on SoC systems:
The CUDA driver automatically protects CUDA-allocated memory during the kernel execution phase
We circumvent this by calling mprotect() to unprotect the memory immediately after initiating a kernel execution
Dynamic partitioning:
CPU and GPU remove chunks of matrix columns from a work queue
Chunk size must be sufficient to occupy the CPU and GPU fully
On traditional discrete GPU systems, copies have to be carefully scheduled
Implemented using OpenMP
Two threads, one each for the CPU and GPU, take work off a master queue
The thread driving the GPU runs on a CPU core, at the expense of productive work on the CPU cores
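The work-queue scheme for dynamic partitioning can be sketched with two Python threads. The original was implemented with OpenMP; Python's `threading` and `queue` modules are swapped in here purely for illustration, and both workers call the same pure-Python routine as a stand-in for the CPU and GPU GEMM calls. All names are hypothetical.

```python
import queue
import threading


def matmul_cols(A, B, col_start, col_end):
    """Return columns [col_start, col_end) of A x B as a list of rows."""
    inner = len(B)
    return [[sum(A[i][k] * B[k][j] for k in range(inner))
             for j in range(col_start, col_end)]
            for i in range(len(A))]


def dynamic_gemm(A, B, chunk_cols):
    """Compute C = A x B by letting two workers pull column chunks."""
    n_rows, n_cols = len(A), len(B[0])
    C = [[0] * n_cols for _ in range(n_rows)]

    # Master queue of column ranges, filled before the workers start.
    work = queue.Queue()
    for start in range(0, n_cols, chunk_cols):
        work.put((start, min(start + chunk_cols, n_cols)))

    chunks_done = {"cpu": 0, "gpu": 0}   # each key is written by one thread

    def worker(device):
        while True:
            try:
                start, end = work.get_nowait()
            except queue.Empty:
                return                    # no work left
            block = matmul_cols(A, B, start, end)   # stand-in for device GEMM
            for i in range(n_rows):
                C[i][start:end] = block[i]          # chunks never overlap
            chunks_done[device] += 1

    threads = [threading.Thread(target=worker, args=(d,))
               for d in ("cpu", "gpu")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return C, chunks_done
```

A faster device naturally drains more chunks from the queue, so no a priori performance information is needed; the chunk size still has to be large enough to keep each device busy.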
Results & Analysis
Results: Best split performance
Platform   Matrix Size   CPU GFLOPS   GPU GFLOPS   CPU Split Cols   Split GFLOPS
DGEMM
TK1        4096          14           12           2176             26
TX1        4096          18           9            2608             25
SANDY      8192          311          836          2128             1099
HASWELL    16384         804          1124         6912             1870
SGEMM
TK1        4096          34           205          448              227
TX1        4096          38           391          128              399
SANDY      16384         643          2318         3392             2887
HASWELL    16384         1753         2526         6896             4109
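One way to read the table is to compare each split result with the sum of the standalone CPU and GPU rates, and to compare the best observed CPU column count with a simple proportional estimate. The sketch below does this arithmetic on the tabulated values only; the proportional estimate is an assumption introduced for illustration and is not claimed to be how the best split was found.

```python
# (kernel, platform, matrix size, CPU GFLOPS, GPU GFLOPS,
#  CPU split columns, split GFLOPS), copied from the results table.
RESULTS = [
    ("DGEMM", "TK1", 4096, 14, 12, 2176, 26),
    ("DGEMM", "TX1", 4096, 18, 9, 2608, 25),
    ("DGEMM", "SANDY", 8192, 311, 836, 2128, 1099),
    ("DGEMM", "HASWELL", 16384, 804, 1124, 6912, 1870),
    ("SGEMM", "TK1", 4096, 34, 205, 448, 227),
    ("SGEMM", "TX1", 4096, 38, 391, 128, 399),
    ("SGEMM", "SANDY", 16384, 643, 2318, 3392, 2887),
    ("SGEMM", "HASWELL", 16384, 1753, 2526, 6896, 4109),
]


def summarize(row):
    """Derive simple ratios from one table row."""
    kernel, platform, size, cpu, gpu, cpu_cols, split = row
    return {
        "kernel": kernel,
        "platform": platform,
        # Split rate as a fraction of the sum of the standalone rates.
        "fraction_of_sum": split / (cpu + gpu),
        # Columns the CPU would get if work were divided by standalone rate.
        "proportional_cols": size * cpu / (cpu + gpu),
        "observed_cols": cpu_cols,
    }


if __name__ == "__main__":
    for row in RESULTS:
        s = summarize(row)
        print(f'{s["kernel"]:5} {s["platform"]:8} '
              f'{s["fraction_of_sum"]:.2f} of CPU+GPU sum, '
              f'proportional {s["proportional_cols"]:.0f} cols, '
              f'observed {s["observed_cols"]} cols')
```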
Best Split Search - Tegra K1/X1
[Figure: DGEMM and SGEMM panels, each plotting GFLOPS and Joules for TK1 and TX1 against the split size given to the CPU (0 to 4,000 columns).]
Best Split Search - Intel + NVIDIA GPUs
[Figure: DGEMM and SGEMM panels, each plotting GFLOPS and Joules for SANDY and HASWELL against the split size given to the CPU (0 to 4,000 columns).]
Performance Scaling - TK1
[Figure: two panels plotting DGEMM GFLOPS and SGEMM GFLOPS against matrix dimension M=N=K (16 to 4096). Series: CPU, GPU, SPLIT, DYNAMIC, TBALANCE, PEAK (CPU+GPU).]
Performance Scaling - TX1
[Figure: two panels plotting DGEMM GFLOPS and SGEMM GFLOPS against matrix dimension M=N=K (16 to 4096). Series: CPU, GPU, SPLIT, DYNAMIC, TBALANCE, PEAK (CPU+GPU).]
Energy Efficiency - TX1 - SGEMM
[Figure: Joules/FLOP (SP), log scale, against matrix dimension M=N=K (128 to 4096). Series: CPU, GPU, SPLIT, TBALANCE, DYNAMIC. Annotated values: 4.22 × 10^-10 and 3.75 × 10^-11 Joules/FLOP.]
Energy Efficiency - Haswell - SGEMM
[Figure: Joules/FLOP (SP), log scale, against matrix dimension M=N=K (512 to 16384). Series: CPU, GPU, SPLIT, TBALANCE, DYNAMIC. Annotated values: 1.76 × 10^-10 and 8.24 × 10^-11 Joules/FLOP.]
Conclusion
The high-accuracy, high-resolution energy measurement system introduced here enables algorithms to be tuned for optimal energy usage. This would allow libraries like ATLAS to tune and produce libraries optimized for best performance and for best energy.
How might a running application use information on energy usage to dynamically change its behaviour?
Use of shared physical memory on SoC systems eliminates transfer overhead
In at least one case (TX1 DGEMM), an energy benefit was observed from exploiting both the CPU and GPU together
The best energy efficiency observed on SoC systems was 37.5 pJ/FLOP for SGEMM on the TX1, while on conventional systems the best observed was 82.4 pJ/FLOP for SGEMM on the K80.
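The two energy figures can be restated in the more familiar GFLOPS per watt by taking the reciprocal, since one watt is one joule per second. The short sketch below performs that conversion and computes the ratio between the two platforms; the function name is hypothetical.

```python
def pj_per_flop_to_gflops_per_watt(pj_per_flop):
    """Convert energy per operation (pJ/FLOP) to throughput per watt."""
    joules_per_flop = pj_per_flop * 1e-12
    flops_per_joule = 1.0 / joules_per_flop   # equals FLOPS per watt
    return flops_per_joule / 1e9


TX1_SGEMM_PJ = 37.5   # best SoC result reported above
K80_SGEMM_PJ = 82.4   # best conventional result reported above

if __name__ == "__main__":
    tx1 = pj_per_flop_to_gflops_per_watt(TX1_SGEMM_PJ)
    k80 = pj_per_flop_to_gflops_per_watt(K80_SGEMM_PJ)
    print(f"TX1: {tx1:.1f} GFLOPS/W, K80: {k80:.1f} GFLOPS/W")
    print(f"K80 energy per FLOP is {K80_SGEMM_PJ / TX1_SGEMM_PJ:.1f}x that of TX1")
```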
Contact:
https://www.linkedin.com/in/alistair-rendell-6230b72