
AMD’s Unified CPU & GPU Processor Concept

Advanced Seminar Computer Engineering

Sven Nobis

Institute of Computer Engineering (ZITI), University of Heidelberg

February 5, 2014


Overview

1 Introduction
2 Background: CPU vs. GPU; Current Platforms: OpenCL & CUDA
3 Related Work
4 The way to HSA: Heterogeneous Unified Memory Access
5 Heterogeneous System Architecture: Concepts; System Components; Development Tools
6 Conclusion / Outlook


Previous: Single-Core Era

[Figure: "Inflections in Processor Design" [8, P. 5]. Single-Core Era: single-thread performance over time ("we are here"); enabled by Moore's Law and voltage scaling; constrained by power and complexity. Multi-Core Era: throughput performance over time (number of processors); enabled by Moore's Law and SMP architecture; constrained by power, parallel SW, and scalability. Heterogeneous Systems Era: modern application performance over time (data-parallel exploitation); enabled by abundant data parallelism and power-efficient GPUs; temporarily constrained by programming models and communication overhead. Along the bottom, the programming models: Assembly, C/C++, Java, ...; pthreads, OpenMP / TBB, ...; Shader, CUDA, OpenCL; C++ and Java.]


Today: Multi-Core Era

[Figure: the same "Inflections in Processor Design" chart as on the previous slide, here with the focus on the Multi-Core Era. [8, P. 5]]


Today and into the future: Heterogeneous Systems Era

[Figure: the same "Inflections in Processor Design" chart, here with the focus on the Heterogeneous Systems Era. [8, P. 5]]


Introduction

Today’s problems in CPU / GPU programming
Programmability barrier
Communication costs

Solution: AMD’s Unified CPU & GPU Processor Concept?
→ Heterogeneous System Architecture (HSA)

[3, P. 4]




CPU vs. GPU

CPU: LCU (Latency Compute Unit)
GPU: TCU (Throughput Compute Unit)


OpenCL & CUDA

Both are well-established platforms for GPU programming.

Compute Unified Device Architecture (CUDA)
Proprietary
Only for NVIDIA GPUs

Open Computing Language (OpenCL)
Open standard
ATI, NVIDIA, Intel, ...
Not only for GPUs


OpenCL: Platform Model

[Figure: the OpenCL platform model. [10]]


OpenCL: Execution Model

[Figure: the OpenCL execution model. [5, P. 11]]
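The execution model in [5] organizes a kernel launch as an NDRange of work-items that are grouped into work-groups. As a rough illustration (not taken from the slides), the following C++ host sketch builds and enqueues a 1-D vector-add kernel through the OpenCL 1.x C API; it assumes a single platform with at least one GPU device and omits all error checking:

#include <CL/cl.h>
#include <vector>
#include <cstdio>

// Kernel source: one work-item per output element.
static const char* kSource =
    "__kernel void vadd(__global const float* a,  \n"
    "                   __global const float* b,  \n"
    "                   __global float* c) {      \n"
    "    size_t i = get_global_id(0);             \n"
    "    c[i] = a[i] + b[i];                      \n"
    "}                                            \n";

int main() {
    const size_t n = 1024;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

    cl_platform_id platform;  clGetPlatformIDs(1, &platform, nullptr);
    cl_device_id   device;    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);
    cl_context       ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, nullptr);
    cl_command_queue q   = clCreateCommandQueue(ctx, device, 0, nullptr);

    // Device buffers: inputs are copied in explicitly, the result is read back.
    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               n * sizeof(float), a.data(), nullptr);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               n * sizeof(float), b.data(), nullptr);
    cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n * sizeof(float), nullptr, nullptr);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &kSource, nullptr, nullptr);
    clBuildProgram(prog, 1, &device, nullptr, nullptr, nullptr);
    cl_kernel k = clCreateKernel(prog, "vadd", nullptr);
    clSetKernelArg(k, 0, sizeof(cl_mem), &da);
    clSetKernelArg(k, 1, sizeof(cl_mem), &db);
    clSetKernelArg(k, 2, sizeof(cl_mem), &dc);

    // NDRange: 1024 work-items in one dimension, work-groups of 64 work-items.
    size_t global = n, local = 64;
    clEnqueueNDRangeKernel(q, k, 1, nullptr, &global, &local, 0, nullptr, nullptr);
    clEnqueueReadBuffer(q, dc, CL_TRUE, 0, n * sizeof(float), c.data(), 0, nullptr, nullptr);

    printf("c[0] = %f\n", c[0]);
    return 0;
}

The explicit clCreateBuffer / clEnqueueReadBuffer traffic in this sketch is exactly the copy overhead that hUMA and HSA, discussed below, aim to remove.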



Related Work

In CUDA [4]
Unified Virtual Addressing (UVA) in CUDA 4
Unified Memory in CUDA 6 (see the sketch below)
→ A unified view of memory for the developer
Implicit copying & pinning

In OpenCL
Shared Virtual Memory
Copying is still necessary (for fast access)
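Of these, Unified Memory in CUDA 6 [4] is the closest analogue to what hUMA later promises in hardware. A minimal hedged sketch (not from the slides) of what it looks like to the developer: one cudaMallocManaged allocation is visible to both CPU and GPU, and the runtime performs the copying and pinning implicitly, so no cudaMemcpy appears in the code.

#include <cuda_runtime.h>
#include <cstdio>

// Ordinary CUDA kernel: multiplies every element by a factor.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float* data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float));      // one pointer for CPU and GPU

    for (int i = 0; i < n; ++i) data[i] = 1.0f;       // written by the CPU

    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);   // read and written by the GPU
    cudaDeviceSynchronize();                          // wait before the CPU touches it again

    printf("data[0] = %f\n", data[0]);                // no explicit cudaMemcpy anywhere
    cudaFree(data);
    return 0;
}

On hardware without hUMA the runtime still migrates the pages behind the scenes: the pointer is unified, the memory is not.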



CPU and GPU cores in a single die

[Figure: die shot of the "Llano" APU: CPU and GPU cores on a single die. [3, P. 2], [7, P. 7]]


hUMA: Heterogeneous Unified Memory Access

Today: Non-Uniform Memory Access
Different / partitioned physical memory per compute unit
Multiple virtual memory address spaces

hUMA: Heterogeneous Unified Memory Access
Same physical memory
Same virtual memory for all compute units

[Figure: shared virtual memory today vs. with HSA [2, P. 7], [2, P. 8]: today, CPU and GPU each have their own virtual address space (VA1→PA1, VA2→PA1) mapping separately onto physical memory; with HSA, all agents share one common virtual address space (VA→PA) over the same physical memory.]



hUMA: Heterogeneous Unified Memory Access (2)

Required: hUMA memory controller

Features
Shared page table support
Same large address space as the CPU
Page faulting
Coherent memory regions
Fully coherent shared memory model (see the sketch below)
Like on today’s SMP CPU systems
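A rough illustration (not from the slides) of what this buys the programmer, using OpenCL 2.0 coarse-grained Shared Virtual Memory as a stand-in for an HSA-specific API: host and kernel use the same pointer, and no separate device buffer or payload copy is created. The context ctx, queue q, and kernel k are assumed to exist already; error handling is omitted.

#include <CL/cl.h>
#include <cstddef>

void svm_example(cl_context ctx, cl_command_queue q, cl_kernel k) {
    const size_t n = 1024;

    // One allocation, one pointer, addressable by host and device alike.
    float* data = static_cast<float*>(
        clSVMAlloc(ctx, CL_MEM_READ_WRITE, n * sizeof(float), 0));

    // Coarse-grained SVM: map before the host writes ...
    clEnqueueSVMMap(q, CL_TRUE, CL_MAP_WRITE, data, n * sizeof(float), 0, nullptr, nullptr);
    for (size_t i = 0; i < n; ++i) data[i] = 1.0f;
    clEnqueueSVMUnmap(q, data, 0, nullptr, nullptr);

    // ... then hand the same pointer to the kernel; no clCreateBuffer, no copy.
    clSetKernelArgSVMPointer(k, 0, data);
    size_t global = n;
    clEnqueueNDRangeKernel(q, k, 1, nullptr, &global, nullptr, 0, nullptr, nullptr);
    clFinish(q);

    clSVMFree(ctx, data);
}

With fine-grained SVM on fully coherent hUMA hardware even the map/unmap step disappears, which is the fully coherent shared memory model promised above.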



Concepts

Unified Address Space

Already mentioned with hUMA

Unified Programming Model

Queuing

HSA Intermediate Language


Concepts: Unified Programming Model

Current programming models
→ treating the GPU as a remote processor

Extending existing concepts to use HSA
Programming languages like C++
Task-parallel and data-parallel APIs like C++ AMP

Stay in the developer’s environment

#include <iostream>
#include <amp.h>
using namespace concurrency;

int main() // "Hello World" in C++ AMP
{
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};

    array_view<int> av(11, v);
    parallel_for_each(av.extent, [=](index<1> idx) restrict(amp)
    {
        av[idx] += 1;
    });

    for (unsigned int i = 0; i < av.extent.size(); i++)
        std::cout << static_cast<char>(av(i));
}

[6]


Concepts: Queuing - Current

[Figure: queuing in current systems. [5, P. 9]]


Concepts: Queuing - New!

[Figure: queuing with HSA. [5, P. 9]]


Concepts: HSA Intermediate Language

HSAIL: HSA Intermediate Language

Bytecode
Designed for data-parallel programming
GPU-independent

Generated by the compilation stack (see later)
Bytecode is compiled at runtime to the hardware instruction set of the current device

The execution model is similar to OpenCL


System Components

APU

Software stack

Compilation Stack
Runtime Stack
System (Kernel) Software


System Components: Compilation Stack

[Figure: the HSA compilation stack. [5, P. 15]]


System Components: Runtime Stack

[Figure: the HSA runtime stack. [5, P. 16]]


Development Tools

OpenCL

C++ AMP: C++ Accelerated Massive Parallelism

BOLT Library

Aparapi


Development Tools: OpenCL

"HSA is an optimized platform architecture for OpenCL - not an alternative to OpenCL" [8, P. 13]

OpenCL on HSA will benefit from its features


Development Tools: BOLT Library

Simple Example:

#include <bolt/sort.h>
#include <vector>
#include <algorithm>

void main()
{
    // generate random data (on host)
    std::vector<int> a(1000000);
    std::generate(a.begin(), a.end(), rand);

    // sort, run on best device
    bolt::sort(a.begin(), a.end());
}

Interface similar to the familiar C++ Standard Template Library
No explicit mention of C++ AMP or OpenCL™ (or the GPU!); more advanced use cases allow the programmer to supply a kernel in C++ AMP or OpenCL™
Direct use of host data structures (i.e. std::vector)
bolt::sort implicitly runs on the platform; the runtime automatically selects the CPU or GPU (or both)

[9, P. 5]


Development Tools: BOLT and C++ AMP

Simple Example:

BOLT for C++ AMP: user-specified functor

#include <bolt/transform.h>
#include <vector>

struct SaxpyFunctor
{
    float _a;
    SaxpyFunctor(float a) : _a(a) {};

    float operator() (const float &xx, const float &yy) restrict(cpu,amp)
    {
        return _a * xx + yy;
    };
};

void main()
{
    SaxpyFunctor s(100);

    std::vector<float> x(1000000); // initialization not shown
    std::vector<float> y(1000000); // initialization not shown
    std::vector<float> z(1000000);

    bolt::transform(x.begin(), x.end(), y.begin(), z.begin(), s);
};

[9, P. 6]



Conclusion

Interesting concept
Simplifies development
Opens up new possibilities
Open platform

In heavy development
Hardware with hUMA still missing → Outlook
Software components not ready

→ A lot of potential


Outlook

Middle of January 2014:
Kaveri APU is available [1]
Desktop APU
Support for hUMA and queuing
Can connect both DDR3 and GDDR5 [11]

Server APUs follow:
Berlin
ARM-based: Seattle

[11]


References I

[1] Benz, Benjamin: AMD fordert mit Kaveri Intels Core i5 heraus. Heise Online. http://heise.de/-2085447. Version: January 2014

[2] Bratt, Ian: HSA Queueing. HOT CHIPS 2013. http://www.slideshare.net/hsafoundation/hsa-queuing-hot-chips-2013. Version: August 2013

[3] Fröning, Holger: Lecture 02 – CUDA Programming. Lecture: GPU Computing, 2013

[4] Harris, Mark: Unified Memory in CUDA 6. http://devblogs.nvidia.com/parallelforall/unified-memory-in-cuda-6/. Version: November 2013


References II

[5] Kyriazis, George: A Heterogeneous System Architecture: Technical Review / HSA Foundation. AMD, August 2012. – Research report. – Rev. 1.0

[6] Moth, Daniel: "Hello world" in C++ AMP. http://blogs.msdn.com/b/nativeconcurrency/archive/2012/03/04/quot-hello-world-quot-in-c-amp.aspx. Version: March 2012

[7] Rogers, Phil: The Programmer's Guide to the APU Galaxy. AMD Fusion Developer Summit. http://www.slideshare.net/hsafoundation/afds-keynote-the-programmers-guide-to-the-apu-galaxy. Version: June 2011


References III

[8] Rogers, Phil: Heterogeneous System Architecture Overview. HOT CHIPS 2013. http://de.slideshare.net/hsafoundation/hsa-intro-hot-chips2013-final. Version: August 2013

[9] Sander, Ben: BOLT: A C++ Template Library for HSA. AMD Fusion Developer Summit. http://www.slideshare.net/hsafoundation/bolt-for-hsa-by-ben-sanders. Version: June 2012

[10] Staff, AMD: OpenCL™ and the AMD APP SDK v2.4. http://developer.amd.com/resources/documentation-articles/articles-whitepapers/opencl-and-the-amd-app-sdk-v2-4/. Version: April 2011


References IV

[11] Windeck, Christof: AMD Kaveri: Feinheiten aus den Datenblättern. Heise Online. http://heise.de/-2088349. Version: January 2014
