THE FUTURE OF HETEROGENEOUS PROCESSING
Yaki Tebeka, AMD Fellow | Feb 7, 2012
2 | The Future of Heterogeneous Processing | Feb 7, 2012 | Public
TOP500.ORG - Top 20 Nov ‘11
3 “Jaguar”
6 “Cielo”
8 “Hopper”
11 “Kraken XT5”
12 “HERMIT” (AMD Opteron™ 6200 Series processors)
19 “HECToR” (AMD Opteron™ 6200 Series processors)
20 “Gaea C2” (AMD Opteron™ 6200 Series processors)
AMD TECHNOLOGY IN THE TOP500
- AMD Opteron™ processors power supercomputers in 14 countries, including the fastest
supercomputers in 11 countries, and retain the crown of fastest computer in the USA
- 12 OEMs in the TOP500 with AMD technology
- 63 systems on the TOP500 are powered by AMD Opteron™ processors, including 7
systems powered by AMD Opteron™ 6200 Series processors
Source: www.top500.org
“BULLDOZER” MODULE TECHNOLOGY
Full Performance From Each Core
Leadership Multi-Threaded Micro-Architecture
Shared Double-sized FPU
Amortizes a very powerful 256-bit unit across
both cores
Improved IPC
Micro-architecture and ISA enhancements
SSE4.1/4.2, AVX, FMA4, SSSE3, XOP
Virtualization Enhancements
Faster switching between VMs
AMD-V™ extended migration support
High Frequency / Low-Power Design
Core Performance Boost
“Boosts” frequency of cores when available
power allows
No idle core requirement
Power efficiency enhancements
Significantly reduced leakage power
More aggressive dynamic power management
Dedicated execution units per core
No shared execution units as with SMT
[Diagram legend: dedicated components / shared at the module level / shared at the chip level]
WELCOME TO THE REVOLUTION:
RISE OF GPUS AS COMPUTE DEVICES
1st ERA: FIXED FUNCTION
2nd ERA: SIMPLE SHADER
3rd ERA: GRAPHICS PARALLEL CORE
3rd ERA EVOLVES: GPU COMPUTE
EMERGING GENERAL COMPUTE
Cloud-based Computing
Commercial cloud, cloud-based gaming,
and virtual desktop
Massive Data Mining
Image, video, audio processing
Pattern analytics and search
Research
Research clusters with mixed workloads
Production HPC
Seismic and financial analysis, pharmaceutical
* Wattage figures are TDP; GFLOPS figures are theoretical peaks
[Image: AMD Radeon™ HD 7970]
VECTOR UNIT
A vector unit is a multi-precision SIMD unit built from 16 single-precision ALUs
Each ALU can perform one 32-bit single-precision IEEE float or integer
operation per cycle
Each set of four ALUs can together perform one 64-bit double-precision IEEE float or
integer operation
All ALUs share the same instruction pointer and 256 registers
[Diagram: 4 × 4 grid of ALUs]
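The lock-step behavior above can be sketched in plain Python: every lane applies the same instruction (one shared instruction pointer) to its own data element each cycle. This is an illustrative model, not AMD code.

```python
# Toy model of a 16-lane SIMD vector unit: one shared instruction
# pointer, per-lane data. Illustrative sketch only.
LANES = 16

def run_simd(program, lane_data):
    """Execute `program` in lock-step: each cycle, every lane applies
    the same instruction to its own element of `lane_data`."""
    for instruction in program:                      # one shared instruction pointer
        lane_data = [instruction(x) for x in lane_data]  # all 16 lanes each cycle
    return lane_data

# Two "vector instructions" applied across all 16 lanes.
program = [lambda x: x + 1, lambda x: x * 2]
result = run_simd(program, list(range(LANES)))
print(result)  # lane i holds (i + 1) * 2
```

Divergent branches would break this lock-step picture, which is why the hardware pairs the vector units with a scalar unit for control flow (next slide).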
COMPUTE UNIT
A compute unit has
– Four Vector Units (total of 64 ALUs)
– A sequencer that issues instructions for the wavefronts on each vector unit
– A scalar unit, used for branching and pointer arithmetic
– Local data storage (64 KB)
– L1 Cache (16 KB) with load/store/fetch address units and filters
[Diagram: sequencer, scalar unit, local data storage, and L1 cache alongside four 16-ALU vector units (64 ALUs total)]
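One consequence of this layout: a wavefront (64 work-items on this architecture) executes on a 16-ALU vector unit over four cycles. The mapping from a dispatch to wavefronts can be sketched as follows (illustrative arithmetic only; `dispatch_shape` is a made-up helper, not an AMD API):

```python
import math

WAVEFRONT = 64        # work-items per wavefront
ALUS_PER_VU = 16      # so one wavefront instruction takes 64 / 16 = 4 cycles
VECTOR_UNITS = 4      # vector units per compute unit

def dispatch_shape(global_work_items):
    """Return (wavefronts, cycles per wavefront instruction) for a dispatch."""
    wavefronts = math.ceil(global_work_items / WAVEFRONT)
    cycles_per_instruction = WAVEFRONT // ALUS_PER_VU
    return wavefronts, cycles_per_instruction

print(dispatch_shape(1000))  # (16, 4): 16 wavefronts, 4 cycles per instruction
```

The sequencer keeps the four vector units busy by interleaving wavefronts, which is how the hardware hides memory latency without the out-of-order machinery of a CPU core.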
TAHITI GPU
The “Tahiti” GPU can be found in the AMD Radeon™ HD 7970
32 compute units (total of 2048 ALUs)
Up to 925 MHz core clock speed yields
– 3.8 TFLOPS of 32-bit math
– 947 GFLOPS of double-precision math
L1 cache offers about 2 TB/s of bandwidth at this card’s clock rate,
backed by a larger 768 KB L2 cache
3 GB GDDR5 memory, 264 GB/s bandwidth
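These peak numbers follow from simple arithmetic, assuming each ALU retires one fused multiply-add (2 FLOPs) per cycle and double precision runs at one quarter of the single-precision rate; both rates are assumptions about this part, not stated on the slide:

```python
ALUS = 2048           # 32 compute units x 64 ALUs
CLOCK_GHZ = 0.925     # up to 925 MHz core clock
FLOPS_PER_CYCLE = 2   # one fused multiply-add per ALU per cycle (assumed)
DP_RATIO = 4          # double precision at 1/4 the SP rate (assumed)

sp_gflops = ALUS * CLOCK_GHZ * FLOPS_PER_CYCLE   # peak single precision
dp_gflops = sp_gflops / DP_RATIO                 # peak double precision
print(round(sp_gflops, 1), round(dp_gflops, 1))  # 3788.8 947.2
```

The result matches the slide's quoted 3.8 TFLOPS single-precision and 947 GFLOPS double-precision figures.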
A NEW ERA OF COMPUTING
[Figure: three performance-over-time curves, each marked “we are here”]

Single-Core Era (single-thread performance over time)
Enabled by: Moore’s Law, voltage scaling
Constrained by: power, complexity

Multi-Core Era (throughput performance over time, i.e. # of processors)
Enabled by: Moore’s Law, SMP architecture
Constrained by: power, parallel SW, scalability

Heterogeneous Systems Era (modern application performance over time, via data-parallel exploitation)
Enabled by: abundant data parallelism, power-efficient GPUs
Temporarily constrained by: programming models, communication overhead
APU
THE BENEFITS OF HETEROGENEOUS COMPUTING
x86 CPU owns
the Software World
• Windows®, Mac OS, and Linux® franchises
• Thousands of apps
• Established
Programming and
memory model
• Mature tool chain
• Extensive backward
compatibility for
applications and OSs
• High barrier to entry
GPU Optimized for
Modern Workloads
• Enormous parallel
computing capacity
• Outstanding performance-per-watt-per-dollar
• Very efficient hardware
threading
• SIMD architecture well
matched to modern
workloads: video, audio,
graphics
APU (Accelerated
Processing Unit)
MAINSTREAM A-SERIES AMD APU: “LLANO”
Up to four x86 CPU cores
AMD Turbo CORE frequency acceleration
Array of Radeon™ cores
Discrete-class DirectX® 11 performance
3rd Generation Unified Video Decoder
Blu-ray 3D stereoscopic support
PCIe® Gen2
Dual-channel DDR3
35W - 100W TDP
AMD A-Series APU
Performance:
Up to 29 GB/s system memory bandwidth
Up to 500 GFLOPS of single-precision compute
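The bandwidth figure is roughly what dual-channel DDR3 delivers: channels × transfer rate × bytes per transfer. A quick check, assuming a DDR3-1866 speed grade (an assumption; the slide only says dual-channel DDR3):

```python
CHANNELS = 2                    # dual-channel DDR3
TRANSFERS_PER_SEC = 1866.67e6   # DDR3-1866 (assumed speed grade)
BYTES_PER_TRANSFER = 8          # 64-bit channel width

bandwidth_gb_s = CHANNELS * TRANSFERS_PER_SEC * BYTES_PER_TRANSFER / 1e9
print(round(bandwidth_gb_s, 1))  # ~29.9, consistent with "up to 29 GB/s"
```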
THE OPPORTUNITY WE ARE SEIZING
Make the unprecedented processing capability of the GPU and
the APU as accessible to programmers as the CPU is today
COMMITTED TO OPEN STANDARDS
AMD drives open and de-facto
standards
Open standards are the basis for
large ecosystems
Open standards win over time
– SW developers want their
applications to run on multiple
platforms from multiple
hardware vendors
EVOLUTION OF HETEROGENEOUS COMPUTING
[Figure: architecture maturity & programmer accessibility rising from poor to excellent across three eras]

Proprietary Drivers Era (2002 - 2008): graphics & proprietary driver-based APIs
– “Adventurous” programmers
– Exploit early programmable “shader cores” in the GPU
– Make your program look like “graphics” to the GPU
– CUDA™, Brook+, etc.

Standards Drivers Era (2009 - 2011): OpenCL™, DirectCompute driver-based APIs
– Expert programmers
– C and C++ subsets
– Compute-centric APIs and data types
– Multiple address spaces with explicit data movement
– Specialized work-queue-based structures
– Kernel-mode dispatch

Architected Era (2012 - 2020): Heterogeneous System Architecture
– Mainstream programmers
– Full C++
– GPU as a co-processor
– Unified coherent address space
– Task-parallel runtimes
– Nested data-parallel programs
– User-mode dispatch
– Pre-emption and context switching
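The jump from “multiple address spaces with explicit data movement” to a “unified coherent address space” can be illustrated with a toy Python model (purely schematic, not a real driver API): the driver-era path copies data to and from a separate device buffer, while the HSA-era path operates on the same memory both sides can see.

```python
# Toy contrast: explicit data movement (driver era) vs a unified
# address space (HSA era). Schematic only, not a real API.

def driver_era_dispatch(host_buf, kernel):
    device_buf = list(host_buf)                  # explicit host -> device copy
    device_buf = [kernel(x) for x in device_buf]
    return list(device_buf)                      # explicit device -> host copy

def hsa_era_dispatch(shared_buf, kernel):
    for i, x in enumerate(shared_buf):           # GPU works on the same memory
        shared_buf[i] = kernel(x)                # no copies: coherent shared space
    return shared_buf

data = [1, 2, 3, 4]
print(driver_era_dispatch(data, lambda x: x * x))  # [1, 4, 9, 16]; data unchanged
print(hsa_era_dispatch(data, lambda x: x * x))     # [1, 4, 9, 16]; data updated in place
```

The copies in the first path are exactly the “communication overhead” the earlier slide lists as a temporary constraint of the heterogeneous era.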
HSA FEATURE ROADMAP
Physical Integration
– Integrate CPU & GPU in silicon
– Unified memory controller
– Common manufacturing technology

Optimized Platforms
– Bi-directional power management between CPU and GPU
– GPU compute C++ support
– User-mode scheduling

Architectural Integration
– Unified address space for CPU and GPU
– Fully coherent memory between CPU & GPU
– GPU uses pageable system memory via CPU pointers

System Integration
– GPU compute context switch
– GPU graphics pre-emption
– Quality of service
– Extend to discrete GPU
HETEROGENEOUS SYSTEM ARCHITECTURE – AN OPEN
PLATFORM
Open Architecture, published specifications
– HSAIL virtual ISA
– HSA memory model
– HSA dispatch
ISA agnostic for both CPU and GPU
Inviting partners to join us, in all areas
– Hardware companies
– Operating Systems
– Tools and Middleware
– Applications
HSA review committee planned
DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions
and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited
to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product
differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no
obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to
make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.
NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO
RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS
INFORMATION.
ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY
DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL
OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF
EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
AMD, the AMD arrow logo, Radeon, AMD-V and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other
names used in this presentation are for informational purposes only and may be trademarks of their respective owners.
OpenCL and the OpenCL logo are trademarks of Apple, Inc. and are used by permission by Khronos.
© 2012 Advanced Micro Devices, Inc. All Rights Reserved.