The GPGPU Continuum

Ofer Rosenberg

The GPU continuum workshop, April 25 2013


CONTENT

• Intel’s Compute Continuum

• GPGPU Evolution

• The GPGPU Continuum

• Mobile GPGPU challenges

• GPGPU Continuum challenges

• Towards the Continuum

INTEL’S “COMPUTE CONTINUUM” FROM IDC 2010


GPGPU EVOLUTION

2004 – Stanford University: Brook for GPUs

2006 – AMD releases CTM; NVIDIA releases CUDA

2008 – OpenCL 1.0 released

G80: 346 GFLOPS; R580: 375 GFLOPS

GPGPU EVOLUTION

Nov 2009 – First hybrid supercomputer in the Top 10: China's Tianhe-1

• 1,024 Intel Xeon E5450 CPUs

• 5,120 Radeon HD 4870 X2 GPUs

Nov 2010 – First hybrid supercomputer to reach #1 on the Top500 list: Tianhe-1A

• 14,336 Intel Xeon X5670 CPUs

• 7,168 NVIDIA Tesla M2050 GPUs

Tianhe-1: 563 TFLOPS

Tianhe-1A: 2577 TFLOPS

Source: http://www.top500.org/lists/

GPGPU EVOLUTION

2013 – OpenCL on:

• Nexus 4 (Qualcomm Adreno 320)

• Nexus 10 (ARM Mali T604)

Android 4.2 adds GPU support for RenderScript

2014 – NVIDIA Tegra 5 will support CUDA

2013 – The GPGPU Continuum becomes a reality

THE GPGPU CONTINUUM

Apple A6 GPU: 25 GFLOPS, < 2 W

AMD G-T16R: 46 GFLOPS*, 4.5 W

Intel i7-3770: 511 GFLOPS*, 77 W

NVIDIA GTX Titan: 4500 GFLOPS, 250 W

ORNL Titan supercomputer: 27 PFLOPS, 8200 kW

* GFLOPS of CPU+GPU
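The figures above can be turned into peak GFLOPS per watt to see how wide the continuum is. This is a minimal Python sketch using only the numbers listed on this slide; it assumes the "< 2 W" of the Apple A6 GPU is exactly 2 W (so that entry is a lower bound on efficiency) and works with peak figures only, not sustained performance.

```python
# Peak GFLOPS and power (watts) as listed on the slide.
# Assumption: the Apple A6 GPU's "< 2 W" is taken as exactly 2 W, so its
# GFLOPS/W value is a lower bound. Entries marked * on the slide are CPU+GPU.
DEVICES = {
    "Apple A6 GPU": (25.0, 2.0),
    "AMD G-T16R": (46.0, 4.5),
    "Intel i7-3770": (511.0, 77.0),
    "NVIDIA GTX Titan": (4500.0, 250.0),
    "ORNL Titan (SC)": (27e6, 8200e3),  # 27 PFLOPS, 8200 kW
}


def gflops_per_watt(gflops: float, watts: float) -> float:
    """Peak GFLOPS divided by power draw in watts."""
    return gflops / watts


def peak_span(devices: dict) -> float:
    """Ratio between the largest and smallest peak GFLOPS in the table."""
    peaks = [gflops for gflops, _ in devices.values()]
    return max(peaks) / min(peaks)


def main() -> None:
    for name, (gflops, watts) in DEVICES.items():
        print(f"{name:18s} {gflops_per_watt(gflops, watts):6.2f} GFLOPS/W")
    print(f"Peak-performance span: {peak_span(DEVICES):.2e}x")


if __name__ == "__main__":
    main()
```

With these inputs the GTX Titan comes out at 18 GFLOPS/W and the ORNL Titan at about 3.3 GFLOPS/W, while peak performance spans roughly six orders of magnitude (27 PFLOPS / 25 GFLOPS ≈ 1.08 × 10^6).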

Take Intel's vision of the Compute Continuum and aspire to the same for the GPGPU continuum:

a common ecosystem, built on a common (SW) architecture.

INTRO TO LEADING MOBILE GPU VENDORS

Qualcomm Adreno 320

• Part of Snapdragon S4

• Unified Shader

• SIMD4 (?)

• Supports OpenCL 1.1 Embedded Profile

• 50 GFLOPS

http://kyokojap.myweb.hinet.net/gpu_gflops/

Imagination PowerVR SGX543

• Used by Apple, Samsung, Motorola, Intel

• Unified Shaders

• Supports OpenCL 1.1 Embedded Profile

• 38 GFLOPS (Apple's MP4 version)

ARM Mali T604

• 4 Cores

• Multiple “pipes” per core

• Supports OpenCL 1.1

• 68 GFLOPS

Vivante GC4000

• Unified Shaders

• 4 Cores, SIMD4 each

• Supports OpenCL 1.2

• 48 GFLOPS

NVIDIA Tegra 4

• 6 × 4-wide vertex shaders

• 4 × 4-wide pixel shaders

• No GPGPU support

• 74 GFLOPS

MOBILE GPGPU CHALLENGES

• Many different GPU architectures

    • Optimizing for each one sets a high bar on development costs

• Development tools

    • Immature (stability, performance)

    • No common SDK / debugger / profiler (different per vendor)

• Ecosystem

    • Lack of libraries, wizards, middleware → slow & expensive development

• Distribution model

    • Driver updates are part of the OS distribution (no more monthly updates…)

    • End users are less likely to update versions → higher standards for the stability & performance of each driver release

• Security – the unspoken issue (hole) …

GPGPU CONTINUUM CHALLENGES

• Many different GPU architectures

    • Optimizing for each one sets a high bar on development costs

• Development tools

    • Immature (stability, performance)

    • No common SDK / debugger / profiler (different per vendor)

• Ecosystem

    • Lack of libraries, wizards, middleware → slow & expensive development

• Distribution model

    • End users are less likely to update versions → higher standards for the stability & performance of each driver release

• Security – the unspoken issue (hole) …

These challenges are a barrier to GPGPU adoption across the continuum

TOWARDS THE CONTINUUM (1) - LANGUAGES

• Welcome to the GPGPU (SW) jungle …

Figure: the GPU at the center, surrounded first by OpenCL, CUDA, DirectCompute and RenderScript, then also by OpenACC, C++ AMP, Fortran, Aparapi (Java), PyOpenCL, NumbaPro (Python) and WebCL

A jungle of languages… but are these the right ones?


• Current GPGPU languages are C/C++ based

• There are “bindings” to Python, Java and JavaScript, but kernels are still written in C/C++

• Current developer trends:

    • Managed languages (Java, C#)

    • Scripting languages (Python, PHP)

• Higher abstraction & manageability:

    • More room for tools to excel at optimization

    • Mitigate differences between GPU architectures

Data from CodeEval.com, based on 100K+ code samples

https://sites.google.com/site/pydatalog/pypl/PyPL-PopularitY-of-Programming-Language

GPGPU languages need to evolve

TOWARDS THE CONTINUUM (2) - SOFTWARE STACK

Figure: GPGPU languages (OpenCL, CUDA, RenderScript, OpenACC) → LLVM IR → Vendor X IL → Vendor X GPU

• Most GPGPU languages already use the LLVM compilation framework

    • Slight “flavors” of LLVM IR

• Most languages also possess a similar set of “API capabilities”

• Defining a common stack based on LLVM & a common API will:

    • Improve the compiler

    • Increase driver quality & stability

    • Enable a unified debugger / profiler

    • …

Define a GPGPU Virtual Machine based on LLVM
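One way to see why a shared LLVM-based stack pays off is to count toolchain paths. The Python sketch below is a deliberately simplified model, assumed here rather than taken from the slides: without a common IR, every (language, GPU vendor) pair is treated as needing its own compiler path; with one, each language needs a single front end and each vendor a single back end. The language and vendor lists are the ones named earlier in this deck.

```python
# Hypothetical toolchain-count model (an assumption, not from the slides):
# no shared IR -> one compiler path per (language, vendor) pair;
# shared IR    -> one front end per language plus one back end per vendor.
LANGUAGES = [
    "OpenCL", "CUDA", "DirectCompute", "RenderScript", "OpenACC",
    "C++ AMP", "Fortran", "Aparapi", "PyOpenCL", "NumbaPro", "WebCL",
]
VENDORS = ["Qualcomm", "Imagination", "ARM", "Vivante", "NVIDIA"]


def paths_without_common_ir(n_langs: int, n_vendors: int) -> int:
    """Every language/vendor combination gets its own compiler path."""
    return n_langs * n_vendors


def paths_with_common_ir(n_langs: int, n_vendors: int) -> int:
    """Front ends and back ends meet at a single shared IR."""
    return n_langs + n_vendors


def main() -> None:
    n, m = len(LANGUAGES), len(VENDORS)
    print(f"{n} languages x {m} vendors")
    print("without a common IR:", paths_without_common_ir(n, m))
    print("with a common IR:   ", paths_with_common_ir(n, m))


if __name__ == "__main__":
    main()
```

For the 11 languages and 5 vendors listed, this gives 55 paths versus 16. The model ignores that several of these languages already sit on top of OpenCL or CUDA, so real counts differ; the point is the N × M versus N + M growth.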

TAKEAWAYS

• The GPGPU Continuum is here – from mobile devices to HPC

• Vision: a common ecosystem built on a common (SW) architecture

• Challenges: many architectures, immature tools, ecosystem

QUESTIONS

• Q: What about “Heterogeneous Computing”?

• A: Go back, replace each “GPGPU” with “Heterogeneous Computing”, and it all fits…

• More?

SOME SOURCES:

• http://www.nordichardware.com/CPU-Chipset/intel-core-i7-3770k-ivy-bridge-and-the-3d-transistor-is-here/New-graphics-the-biggest-news-in-Ivy-Bridge.html

• http://elrond.informatik.tu-freiberg.de/papers/WorldComp2012/PDP2833.pdf

• http://www.anandtech.com/show/6787/nvidia-tegra-4-architecture-deep-dive-plus-tegra-4i-phoenix-hands-on/5

• http://www.anandtech.com/show/5077/arms-malit658-gpu-in-2013-up-to-10x-faster-than-mali400

• http://www.chipdesignmag.com/pallab/2011/06/30/arm-mali-gpu-unifying-graphics-across-platforms/

• http://en.wikipedia.org/wiki/Adreno#Renaming_to_Adreno

• http://en.wikipedia.org/wiki/PowerVR#Series_5_.28SGX.29

• http://en.wikipedia.org/wiki/Mali_(GPU)

• http://johndayautomotivelectronics.com/?p=12412

• http://www.cnx-software.com/2013/01/19/gpus-comparison-arm-mali-vs-vivante-gcxxx-vs-powervr-sgx-vs-nvidia-geforce-ulp/

• http://www.brightsideofnews.com/print/2013/1/30/rise-of-vivante-fastest-tablet-gpu-on-the-market.aspx

• https://www.uplinq.com/2012/schedule/accelerating-your-android-application-renderscript-and-llvm-0

• http://www.androidauthority.com/adreno-320-features-performance-benchmarks-103269/
