Date post: 25-Jun-2020
Page 1: On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing

synergy.cs.vt.edu

On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing

Mayank Daga, Ashwin M. Aji, and Wu-chun Feng

Dept. of Computer Science

Page 2

“Sampling” of fields that use GPUs

• Mac OS X
• Cosmology
• Molecular Dynamics and Modeling
• Computational Fluid Dynamics

Page 3

GPUs in HPC

Rank  Computer                                                             Rmax (TFlop/s)  Rpeak (TFlop/s)  Rmax/Rpeak
1     K computer – SPARC64 VIIIfx 2.0 GHz, Tofu Interconnect               8162.00         8773.00          93.03 %
2     Tianhe-1A – NUDT TH MPP, X5670 2.93 GHz 6C, NVIDIA GPU FT-1000 8C    2566.00         4701.00          54.6 %
3     Jaguar – Cray XT5-HE Opteron 6-core 2.6 GHz                          1759.00         2331.00          75.5 %
4     Nebulae – Dawning TC6300 Blade, Intel X5650, NVIDIA Tesla C2050 GPU  1271.00         2984.30          42.6 %
5     TSUBAME 2.0 – HP ProLiant SL390s G7 Xeon 6C X5670, NVIDIA GPU        1192.00         2287.63          52.1 %

Source: http://www.top500.org

• Systems with GPUs achieve only ~50 % of Rpeak
• Systems without GPUs achieve ~84 % of Rpeak
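The two summary figures can be checked directly from the Rmax and Rpeak columns. The sketch below is not part of the original slides; it simply averages Rmax/Rpeak over the GPU and non-GPU systems listed above, and the `uses_gpu` flags are read off the system descriptions in the table.

```python
# Rmax and Rpeak values copied from the table on this page.
systems = [
    # (name, uses_gpu, rmax, rpeak)
    ("K computer", False, 8162.00, 8773.00),
    ("Tianhe-1A", True, 2566.00, 4701.00),
    ("Jaguar", False, 1759.00, 2331.00),
    ("Nebulae", True, 1271.00, 2984.30),
    ("TSUBAME 2.0", True, 1192.00, 2287.63),
]


def efficiency(rmax, rpeak):
    """Fraction of the theoretical peak that the system sustains."""
    return rmax / rpeak


def mean_efficiency(with_gpu):
    """Average Rmax/Rpeak, in percent, over systems with or without GPUs."""
    values = [efficiency(rmax, rpeak)
              for _, gpu, rmax, rpeak in systems if gpu == with_gpu]
    return 100.0 * sum(values) / len(values)


gpu_mean = mean_efficiency(True)
cpu_mean = mean_efficiency(False)
print(f"with GPUs:    {gpu_mean:.1f} %")
print(f"without GPUs: {cpu_mean:.1f} %")
```

Both averages should land close to the rounded ~50 % and ~84 % quoted on the slide.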

Page 4

Architecture of Discrete GPUs

[Figure: block diagram of a discrete GPU system. Host side: x86 CPU cores and system memory (host). Device side: thread execution control, SIMD engines (~500 Gflop/s) made up of thread processors, and device memory. The two sides are connected by DMA/PCIe.]

Page 5

A Reason for Poor Efficiency

[Figure: execution-time bars for three systems. Sequential processor: s + p. Symmetric multi-core (N cores): s + p/N. Accelerator-based system: s + p' plus overhead, labeled “Data Transfer Overhead”.]

Page 6

A Reason for Poor Efficiency

[Figure: execution-time bars for three systems. Sequential processor: s + p. Symmetric multi-core (N cores): s + p/N. Accelerator-based system: s + p' plus overhead, labeled “Data Transfer Overhead”.]

[Figure: FMAD benchmark. Bars for Discrete GPU, Multi-core CPU (4 cores), and Single-core CPU; x-axis: Time (ms), 0 to 200; legend: Serial Time, Parallel Time, Overhead.]

Page 7

Ideal Efficiency Scenario

[Figure: execution-time bars for three systems. Sequential processor: s + p. Symmetric multi-core (N cores): s + p/N. Accelerator-based system: s + p' plus overhead, labeled “Data Transfer Overhead”.]

Page 8

Ideal Efficiency Scenario

[Figure: execution-time bars for three systems. Sequential processor: s + p. Symmetric multi-core (N cores): s + p/N. Accelerator-based system: s + p'. The “Data Transfer Overhead” caption from the previous build of this slide is gone; only the label “Overhead” remains.]

Page 9

Ideal Placement of CPU and GPU Cores

[Figure: block diagram of a discrete GPU system. Host side: x86 CPU cores and system memory (host). Device side: thread execution control, SIMD engines made up of thread processors, and device memory. The two sides are connected by DMA/PCIe.]

Page 10

Ideal Placement of CPU and GPU Cores

[Figure: block diagram with the x86 CPU cores placed alongside the thread execution control, the SIMD engines (thread processors), and the device memory; the system memory (host) and DMA/PCIe blocks no longer appear.]

Towards a “fused” CPU+GPU…

Page 11

Outline

• Motivation

• AMD Fusion APU – A Fused CPU+GPU

• Revisiting Amdahl’s Law

• Experimental Analysis

– Application Benchmarks

– Results and Discussion

• Conclusions and Future Work

Page 12

AMD Fusion APU – A Fused CPU+GPU

[Figure: block diagram of the AMD Fusion APU. x86 CPU cores, SIMD engines (thread processors) with thread execution control, a Unified Video Decoder, and platform interfaces are attached to a high performance bus and memory controller, which connects to system memory.]

Page 13

State of the Data Transfer

• Discrete GPU
– System Memory (Host) and Device Memory, connected by a PCIe transfer
• AMD Fusion APU (1st Generation)
– A single System Memory with an x86 partition and a SIMD Engines partition; data moves between them with a memcpy
– Figure label: 192 MB

AMD provides high speed block transfer engines that move data between the x86 and SIMD memory partitions.
(AMD, “AMD Fusion Family of APUs: Enabling a Superior, Immersive PC Experience”)

Page 14

Outline

• Motivation

• AMD Fusion APU – A Fused CPU+GPU

• Revisiting Amdahl’s Law

• Experimental Analysis

– Application Benchmarks

– Results and Discussion

• Conclusions and Future Work

Page 15

Revisiting Amdahl’s Law

[Figure: speedup values for different serial fractions; two panels: Symmetric Multi-core and Asymmetric Multi-core.]

Higher Efficiency of Asymmetric Chips

(M. Hill and M. Marty, “Amdahl’s Law in the Multi-core Era”)
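The plotted curves themselves are not recoverable from the transcript, but the symmetric and asymmetric speedup formulas from the cited Hill and Marty paper can be sketched as follows. This is an illustrative sketch, not code from the slides: `f` is the parallelizable fraction (so the serial fraction is `1 - f`), `n` is the chip budget in base-core equivalents (BCEs), `r` is the number of BCEs spent on one large core, and `perf(r) = sqrt(r)` is the performance assumption used in that paper.

```python
import math


def perf(r):
    """Sequential performance of a core built from r BCEs (sqrt assumption)."""
    return math.sqrt(r)


def speedup_symmetric(f, n, r):
    """n BCEs arranged as n/r identical cores of r BCEs each."""
    serial = (1.0 - f) / perf(r)
    parallel = f * r / (perf(r) * n)
    return 1.0 / (serial + parallel)


def speedup_asymmetric(f, n, r):
    """One large core of r BCEs plus (n - r) single-BCE cores."""
    serial = (1.0 - f) / perf(r)
    parallel = f / (perf(r) + n - r)
    return 1.0 / (serial + parallel)


# Example sweep over parallel fractions for a 256-BCE chip with r = 16.
for f in (0.5, 0.9, 0.99):
    sym = speedup_symmetric(f, 256, 16)
    asym = speedup_asymmetric(f, 256, 16)
    print(f"f={f}: symmetric={sym:.1f}, asymmetric={asym:.1f}")
```

With the same budget and the same large-core size, the asymmetric design runs serial code equally fast and parallel code at least as fast, which is the sense in which the slide calls asymmetric chips more efficient.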

Page 16

Revisiting Amdahl’s Law

[Figure: execution-time bars for three systems. Sequential processor: s + p. Symmetric multi-core (N cores): s + p/N. Accelerator-based system: s + p' + o, where o is the overhead.]

Page 17

Revisiting Amdahl’s Law

[Figure: execution-time bars. Sequential processor: s + p. Accelerator-based system: s + p' + o. The slide carries two “?” annotations.]

• o_DiscreteGPU vs. o_Fusion
– Fusion is expected to be better than discrete GPUs
• p'_DiscreteGPU vs. p'_Fusion
– Depends on several factors, such as algorithmic mapping, memory bandwidth, and number of compute units
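The execution-time bars translate into simple speedup expressions. The sketch below is an illustration rather than the authors' model: it takes the sequential time as s + p, the symmetric multi-core time as s + p/N, and the accelerator time as s + p' + o, exactly as labeled in the diagrams. The function names and the sample numbers are assumptions made for the example.

```python
def speedup_symmetric(s, p, n):
    """Speedup of an n-core symmetric chip over the sequential processor."""
    return (s + p) / (s + p / n)


def speedup_accelerator(s, p, p_prime, o):
    """Speedup of an accelerator-based system with overhead o."""
    return (s + p) / (s + p_prime + o)


def break_even_overhead(p, p_prime, n):
    """Largest overhead o at which the accelerator still matches n cores.

    Derived from s + p' + o <= s + p / n.
    """
    return p / n - p_prime


# Hypothetical timings in ms, chosen only to illustrate the trade-off.
s, p, n, p_prime = 10.0, 90.0, 4, 5.0
print(speedup_symmetric(s, p, n))
print(speedup_accelerator(s, p, p_prime, o=20.0))  # large transfer overhead
print(speedup_accelerator(s, p, p_prime, o=0.0))   # ideal, no overhead
print(break_even_overhead(p, p_prime, n))
```

In this reading, a fused design mainly attacks o, while p' may move in either direction, which is why the slide treats the two terms separately.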

Page 18

Implications

• Asymmetric chips always offer better efficiency than symmetric chips…
– …if researchers continue to address scheduling and overhead challenges
• Fusing CPU and GPU cores reduces data transfer overheads to a great extent
• AMD Fusion, Intel Knights Ferry, and NVIDIA Tegra are all steps in the right direction
– Our focus today: AMD Fusion

Page 19

Outline

• Motivation

• AMD Fusion APU – A Fused CPU+GPU

• Revisiting Amdahl’s Law

• Experimental Analysis

– Application Benchmarks

– Results and Discussion

• Conclusions and Future Work

Page 20

Experimental Analysis

• Systems

– AMD Zacate APU

o Engineering sample of AMD Fusion

o Dual CPU cores + 80 GPU cores

– AMD Radeon HD 5870

o High-powered discrete GPU

o 1600 GPU cores

– AMD Radeon HD 5450

o Low-powered discrete GPU

o 80 GPU cores

Page 21

Experimental Setup

Page 22

Experimental Analysis

• Application Benchmarks (SHOC Benchmark)
– Bandwidth Test
o Measures PCIe bandwidth for discrete GPU
o Measures memory bandwidth for APU
– FFT
o Measures performance of a 2-D Fast Fourier Transform
o Computes multiple FFTs of size 512 in parallel
– MD
o Measures performance of pairwise calculation of Lennard-Jones potential
– Scan
o Measures performance of the parallel prefix sum algorithm on a large array of floating point data
– Reduction
o Measures performance of a sum reduction operation using floating point data
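For readers unfamiliar with the last two benchmarks, the sketch below shows what a sum reduction and a prefix sum compute, written in the data-parallel style a GPU kernel would follow. It is not SHOC's implementation; the function names are hypothetical, and each list comprehension stands in for one pass that a GPU would execute across many threads at once.

```python
def tree_reduce_sum(values):
    """Pairwise (tree) sum reduction: each pass halves the array."""
    data = list(values)
    if not data:
        return 0.0
    while len(data) > 1:
        if len(data) % 2:
            data.append(0.0)  # pad odd lengths with the additive identity
        # One pass: every even-indexed element absorbs its right neighbour.
        data = [data[i] + data[i + 1] for i in range(0, len(data), 2)]
    return data[0]


def inclusive_scan(values):
    """Hillis-Steele inclusive prefix sum with a doubling stride."""
    data = list(values)
    stride = 1
    while stride < len(data):
        # One pass: each element adds the element `stride` positions back.
        data = [data[i] + data[i - stride] if i >= stride else data[i]
                for i in range(len(data))]
        stride *= 2
    return data


print(tree_reduce_sum([1.0, 2.0, 3.0, 4.0, 5.0]))
print(inclusive_scan([1.0, 2.0, 3.0, 4.0, 5.0]))
```

Both kernels do very little arithmetic per byte of input, which helps explain why their total time is sensitive to how quickly data reaches the GPU.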

Page 23

Bandwidth Test

[Figure: two plots, Host to Device and Device to Host. x-axis: Size (KB), 1 to 65536 in steps of 4x; y-axis: Bandwidth (GB/s), 0 to 2.5; series: Zacate APU, Radeon HD 5870, Radeon HD 5450.]
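The measurement behind such a plot is a timed copy repeated over a range of buffer sizes. The sketch below is a host-only analogue under stated assumptions: it times an in-memory copy with Python's `bytes(...)` as a stand-in for the host-to-device transfer, because the actual benchmark issues device transfers that cannot be reproduced here. The function names and the `min_bytes` parameter are hypothetical.

```python
import time


def copy_bandwidth_gbs(size_bytes, repeats=5, min_bytes=1 << 20):
    """Best-of-`repeats` bandwidth in GB/s for copying size_bytes."""
    src = bytearray(size_bytes)
    # Copy small buffers several times per sample so that each timing
    # covers at least min_bytes and stays above the timer resolution.
    copies = max(1, min_bytes // size_bytes)
    best = float("inf")
    for _sample in range(repeats):
        start = time.perf_counter()
        for _copy in range(copies):
            dst = bytes(src)  # stand-in for a host-to-device transfer
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return copies * size_bytes / best / 1e9


def sweep(sizes_kb):
    """Measure bandwidth for each buffer size given in KB."""
    return {kb: copy_bandwidth_gbs(kb * 1024) for kb in sizes_kb}


# Sizes follow the 4x steps on the x-axis of the plot, truncated to keep the run short.
for kb, gbs in sweep([1, 4, 16, 64, 256, 1024, 4096]).items():
    print(f"{kb:>6} KB  {gbs:8.2f} GB/s")
```

Taking the best of several repeats is a common way to filter out scheduling noise; the absolute numbers from this host-only version are not comparable to the PCIe or APU figures on the slide.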

Page 24

Fast Fourier Transform (FFT)

[Figure: execution time broken down into Data Transfer and Kernel Execution for the AMD Radeon HD5870, AMD HD5450, and AMD Zacate APU. Grouped by Problem Size (MB): 4, 8, 16, 32, 64; x-axis: Time (ms), 0 to 600.]

• The APU reduces data transfer times for all problem sizes.
• Kernel execution time is higher on the APU because of its lower memory bandwidth.

Page 25

Molecular Dynamics

[Figure: execution time broken down into Data Transfer and Kernel Execution for the AMD Radeon HD5870, AMD HD5450, and AMD Zacate APU. Grouped by Number of Atoms: 12288, 24576, 36864, 73728; x-axis: Time (ms), 0 to 180.]

• The APU reduces data transfer times for all problem sizes.
• The kernel executes fastest on the discrete AMD Radeon HD 5870 because it has more and faster GPU cores. The fused Zacate APU is the next fastest.

Compute-bound

Page 26

Scan

[Figure: execution time broken down into Data Transfer and Kernel Execution for the AMD Radeon HD5870 and AMD Zacate APU. Grouped by Problem Size (MB): 2, 4, 8, 16, 32; x-axis: Time (ms), 0 to 140.]

• Total execution time is equal for discrete and fused GPUs.
• This is stunning given that the discrete GPU has 20 times as many cores.
• These cores are computationally more powerful as well.

I/O-bound
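The “I/O-bound” label follows from how the total time splits between transfer and kernel execution. The sketch below shows that bookkeeping; the timings in it are hypothetical placeholders (the measured values exist only in the figure), picked so that the two totals come out equal as the slide reports.

```python
def total_time(transfer_ms, kernel_ms):
    """Total execution time as plotted: data transfer plus kernel execution."""
    return transfer_ms + kernel_ms


def transfer_fraction(transfer_ms, kernel_ms):
    """Share of the total time spent moving data."""
    return transfer_ms / total_time(transfer_ms, kernel_ms)


def classify(transfer_ms, kernel_ms):
    """Label a run by whichever component dominates its total time."""
    return "I/O-bound" if transfer_ms > kernel_ms else "compute-bound"


# Hypothetical numbers, NOT measurements from the slides.
discrete = {"transfer_ms": 90.0, "kernel_ms": 10.0}  # fast kernel, slow transfer
fused = {"transfer_ms": 20.0, "kernel_ms": 80.0}     # slow kernel, fast transfer

for name, run in (("discrete", discrete), ("fused", fused)):
    print(name, total_time(**run), f"{transfer_fraction(**run):.0%}", classify(**run))
```

The point of the example is only that a large advantage in kernel time can be cancelled by transfer cost when the transfer dominates, which is consistent with the equal totals reported for Scan.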

Page 27

Reduction

[Figure: execution time broken down into Data Transfer and Kernel Execution for the AMD Radeon HD5870 and AMD Zacate APU. Grouped by Vector Size (MB): 4, 8, 16, 32, 64; x-axis: Time (ms), 0 to 100.]

• Total execution time is 3 times better for the APU.
• The efficacy of the APU increases as the problem size increases.

I/O-bound

Page 28

Reduction

[Figure: three plots comparing AMD Fusion and AMD Radeon HD 5870 against Vector Size (MB): 4, 8, 16, 32, 64. Panels: Total Execution Time (Time (ms), 0 to 120), Transfer Time (Time (ms), 0 to 100), and Kernel Execution Time (Time (ms), 0 to 15). The slide carries a “3x” annotation.]

Page 29

Outline

• Motivation

• AMD Fusion APU – A Fused CPU+GPU

• Revisiting Amdahl’s Law

• Experimental Analysis

– Application Benchmarks

– Results and Discussion

• Conclusions and Future Work

Page 30

Future Work

• A more robust model also capturing the computational differences between fused and discrete GPUs
• Power modeling based on AMD’s Power Gating technology

Page 31

Conclusions

• Fused CPU+GPU is a step in the right direction for efficient supercomputers
– Data transfer overhead is largely mitigated (up to 6x)
– Application execution time can be largely sped up (up to 3x in some cases)
– No change is needed in the programming model
• But this is still not a panacea
– GPU cores on the APU are not yet as powerful or as plentiful as those on discrete GPUs
– Device memory bandwidth does not yet match that of discrete GPUs

Contacts

• Mayank Daga ([email protected])
• Ashwin M. Aji ([email protected])
• Dr. Wu-chun Feng ([email protected])

Questions?

