THE FUTURE OF HETEROGENEOUS PROCESSING
Yaki Tebeka, AMD Fellow | Feb 7, 2012
2 | The Future of Heterogeneous Processing | Feb 7, 2012 | Public
TOP500.ORG - Top 20 Nov ‘11
3 “Jaguar”
6 “Cielo”
8 “Hopper”
11 “Kraken XT5”
12 “HERMIT” (AMD Opteron™ 6200 Series processors)
19 “HECToR” (AMD Opteron™ 6200 Series processors)
20 “Gaea C2” (AMD Opteron™ 6200 Series processors)
AMD TECHNOLOGY IN THE TOP500
- AMD Opteron™ processors power supercomputers in 14 countries, including the fastest
supercomputers in 11 countries, and retain the crown of fastest computer in the USA
- 12 OEMs in the TOP500 with AMD technology
- 63 systems on the TOP500 are powered by AMD Opteron™ processors, including 7
systems powered by AMD Opteron™ 6200 Series processors
Source: www.top500.org
“BULLDOZER” MODULE TECHNOLOGY
Full Performance From Each Core
Leadership Multi-Threaded Micro-Architecture
Shared Double-sized FPU
Amortizes a very powerful 256-bit unit across
both cores
Improved IPC
Micro-architecture and ISA enhancements
SSE4.1/4.2, AVX, FMA4, SSSE3, XOP
Virtualization Enhancements
Faster switching between VMs
AMD-V™ extended migration support
High Frequency / Low-Power Design
Core Performance Boost
“Boosts” frequency of cores when available
power allows
No idle core requirement
Power efficiency enhancements
Significantly reduced leakage power
More aggressive dynamic power management
Dedicated execution units per core
No shared execution units as with SMT
[Diagram legend: dedicated components / shared at the module level / shared at the chip level]
WELCOME TO THE REVOLUTION:
RISE OF GPUS AS COMPUTE DEVICES
1st ERA: FIXED FUNCTION
2nd ERA: SIMPLE SHADER
3rd ERA: GRAPHICS PARALLEL CORE
3rd ERA EVOLVES: GPU COMPUTE
EMERGING GENERAL COMPUTE
Cloud-based Computing
Commercial cloud, cloud-based gaming,
and virtual desktop
Massive Data Mining
Image, video, audio processing
Pattern analytics and search
Research
Research clusters with mixed workloads
Production HPC
Seismic and financial analysis, pharmaceutical
* Wattage figures are TDP; GFLOPS figures are theoretical peaks
[Image: AMD Radeon™ HD 7970]
VECTOR UNIT
A vector unit is a multi-precision SIMD unit built from 16 single-precision ALUs
Each ALU can perform one 32-bit single-precision IEEE float or integer
operation per cycle
Each set of four ALUs can together perform one 64-bit double-precision IEEE float or
integer operation
All ALUs share the same instruction pointer and 256 registers
[Diagram: 4 × 4 grid of ALUs]
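The lock-step behavior above can be sketched in plain Python: every lane applies the same instruction (one shared instruction pointer) to its own data element each cycle. This is an illustrative model, not AMD code.

```python
# Toy model of a 16-lane SIMD vector unit: one shared instruction
# pointer, per-lane data. Illustrative sketch only.
LANES = 16

def run_simd(program, lane_data):
    """Execute `program` in lock-step: each cycle, every lane applies
    the same instruction to its own element of `lane_data`."""
    for instruction in program:                      # one shared instruction pointer
        lane_data = [instruction(x) for x in lane_data]  # all 16 lanes each cycle
    return lane_data

# Two "vector instructions" applied across all 16 lanes.
program = [lambda x: x + 1, lambda x: x * 2]
result = run_simd(program, list(range(LANES)))
print(result)  # lane i holds (i + 1) * 2
```

Divergent branches would break this lock-step picture, which is why the hardware pairs the vector units with a scalar unit for control flow (next slide).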
COMPUTE UNIT
A compute unit has
– Four Vector Units (total of 64 ALUs)
– A sequencer that issues instructions for the wavefronts on each vector unit
– A scalar unit, used for branching and pointer arithmetic
– Local data storage (64 KB)
– L1 Cache (16 KB) with load/store/fetch address units and filters
[Diagram: sequencer, scalar unit, local data storage, and L1 cache alongside four 16-ALU vector units (64 ALUs total)]
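One consequence of this layout: a wavefront (64 work-items on this architecture) executes on a 16-ALU vector unit over four cycles. The mapping from a dispatch to wavefronts can be sketched as follows (illustrative arithmetic only; `dispatch_shape` is a made-up helper, not an AMD API):

```python
import math

WAVEFRONT = 64        # work-items per wavefront
ALUS_PER_VU = 16      # so one wavefront instruction takes 64 / 16 = 4 cycles
VECTOR_UNITS = 4      # vector units per compute unit

def dispatch_shape(global_work_items):
    """Return (wavefronts, cycles per wavefront instruction) for a dispatch."""
    wavefronts = math.ceil(global_work_items / WAVEFRONT)
    cycles_per_instruction = WAVEFRONT // ALUS_PER_VU
    return wavefronts, cycles_per_instruction

print(dispatch_shape(1000))  # (16, 4): 16 wavefronts, 4 cycles per instruction
```

The sequencer keeps the four vector units busy by interleaving wavefronts, which is how the hardware hides memory latency without the out-of-order machinery of a CPU core.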
TAHITI GPU
The “Tahiti” GPU can be found in the AMD Radeon™ HD 7970
32 compute units (total of 2048 ALUs)
Up to 925 MHz core clock speed yields
– 3.8 TFLOPS of 32-bit math
– 947 GFLOPS of double-precision math
L1 cache offers about 2 TB/s of bandwidth at this card’s clock rate,
backed by a larger 768 KB L2 cache
3 GB GDDR5 memory, 264 GB/s bandwidth
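These peak numbers follow from simple arithmetic, assuming each ALU retires one fused multiply-add (2 FLOPs) per cycle and double precision runs at one quarter of the single-precision rate; both rates are assumptions about this part, not stated on the slide:

```python
ALUS = 2048           # 32 compute units x 64 ALUs
CLOCK_GHZ = 0.925     # up to 925 MHz core clock
FLOPS_PER_CYCLE = 2   # one fused multiply-add per ALU per cycle (assumed)
DP_RATIO = 4          # double precision at 1/4 the SP rate (assumed)

sp_gflops = ALUS * CLOCK_GHZ * FLOPS_PER_CYCLE   # peak single precision
dp_gflops = sp_gflops / DP_RATIO                 # peak double precision
print(round(sp_gflops, 1), round(dp_gflops, 1))  # 3788.8 947.2
```

The result matches the slide's quoted 3.8 TFLOPS single-precision and 947 GFLOPS double-precision figures.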
A NEW ERA OF COMPUTING
[Figure: three performance-over-time curves, each marked “we are here”]

Single-Core Era (single-thread performance over time)
Enabled by: Moore’s Law, voltage scaling
Constrained by: power, complexity

Multi-Core Era (throughput performance over time, i.e. # of processors)
Enabled by: Moore’s Law, SMP architecture
Constrained by: power, parallel SW, scalability

Heterogeneous Systems Era (modern application performance over time, via data-parallel exploitation)
Enabled by: abundant data parallelism, power-efficient GPUs
Temporarily constrained by: programming models, communication overhead
APU
THE BENEFITS OF HETEROGENEOUS COMPUTING
x86 CPU owns
the Software World
• Windows®, Mac OS, and Linux® franchises
• Thousands of apps
• Established
Programming and
memory model
• Mature tool chain
• Extensive backward
compatibility for
applications and OSs
• High barrier to entry
GPU Optimized for
Modern Workloads
• Enormous parallel
computing capacity
• Outstanding performance-per-watt-per-dollar
• Very efficient hardware
threading
• SIMD architecture well
matched to modern
workloads: video, audio,
graphics
APU (Accelerated
Processing Unit)
MAINSTREAM A-SERIES AMD APU: “LLANO”
Up to four x86 CPU cores
AMD Turbo CORE frequency acceleration
Array of Radeon™ cores
Discrete-class DirectX® 11 performance
3rd Generation Unified Video Decoder
Blu-ray 3D stereoscopic support
PCIe® Gen2
Dual-channel DDR3
35W - 100W TDP
AMD A-Series APU
Performance:
Up to 29 GB/s system memory bandwidth
Up to 500 GFLOPS of single-precision compute
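The bandwidth figure is roughly what dual-channel DDR3 delivers: channels × transfer rate × bytes per transfer. A quick check, assuming a DDR3-1866 speed grade (an assumption; the slide only says dual-channel DDR3):

```python
CHANNELS = 2                    # dual-channel DDR3
TRANSFERS_PER_SEC = 1866.67e6   # DDR3-1866 (assumed speed grade)
BYTES_PER_TRANSFER = 8          # 64-bit channel width

bandwidth_gb_s = CHANNELS * TRANSFERS_PER_SEC * BYTES_PER_TRANSFER / 1e9
print(round(bandwidth_gb_s, 1))  # ~29.9, consistent with "up to 29 GB/s"
```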
THE OPPORTUNITY WE ARE SEIZING
Make the unprecedented processing capability of the GPU and
the APU as accessible to programmers as the CPU is today
COMMITTED TO OPEN STANDARDS
AMD drives open and de-facto
standards
Open standards are the basis for
large ecosystems
Open standards win over time
– SW developers want their
applications to run on multiple
platforms from multiple
hardware vendors
EVOLUTION OF HETEROGENEOUS COMPUTING
[Figure: architecture maturity & programmer accessibility rising from poor to excellent across three eras]

Proprietary Drivers Era (2002 - 2008): graphics & proprietary driver-based APIs
– “Adventurous” programmers
– Exploit early programmable “shader cores” in the GPU
– Make your program look like “graphics” to the GPU
– CUDA™, Brook+, etc.

Standards Drivers Era (2009 - 2011): OpenCL™, DirectCompute driver-based APIs
– Expert programmers
– C and C++ subsets
– Compute-centric APIs and data types
– Multiple address spaces with explicit data movement
– Specialized work-queue-based structures
– Kernel-mode dispatch

Architected Era (2012 - 2020): Heterogeneous System Architecture
– Mainstream programmers
– Full C++
– GPU as a co-processor
– Unified coherent address space
– Task-parallel runtimes
– Nested data-parallel programs
– User-mode dispatch
– Pre-emption and context switching
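The jump from “multiple address spaces with explicit data movement” to a “unified coherent address space” can be illustrated with a toy Python model (purely schematic, not a real driver API): the driver-era path copies data to and from a separate device buffer, while the HSA-era path operates on the same memory both sides can see.

```python
# Toy contrast: explicit data movement (driver era) vs a unified
# address space (HSA era). Schematic only, not a real API.

def driver_era_dispatch(host_buf, kernel):
    device_buf = list(host_buf)                  # explicit host -> device copy
    device_buf = [kernel(x) for x in device_buf]
    return list(device_buf)                      # explicit device -> host copy

def hsa_era_dispatch(shared_buf, kernel):
    for i, x in enumerate(shared_buf):           # GPU works on the same memory
        shared_buf[i] = kernel(x)                # no copies: coherent shared space
    return shared_buf

data = [1, 2, 3, 4]
print(driver_era_dispatch(data, lambda x: x * x))  # [1, 4, 9, 16]; data unchanged
print(hsa_era_dispatch(data, lambda x: x * x))     # [1, 4, 9, 16]; data updated in place
```

The copies in the first path are exactly the “communication overhead” the earlier slide lists as a temporary constraint of the heterogeneous era.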
HSA FEATURE ROADMAP
Physical Integration
– Integrate CPU & GPU in silicon
– Unified memory controller
– Common manufacturing technology

Optimized Platforms
– Bi-directional power management between CPU and GPU
– GPU compute C++ support
– User-mode scheduling

Architectural Integration
– Unified address space for CPU and GPU
– Fully coherent memory between CPU & GPU
– GPU uses pageable system memory via CPU pointers

System Integration
– GPU compute context switch
– GPU graphics pre-emption
– Quality of service
– Extend to discrete GPU
HETEROGENEOUS SYSTEM ARCHITECTURE – AN OPEN
PLATFORM
Open Architecture, published specifications
– HSAIL virtual ISA
– HSA memory model
– HSA dispatch
ISA agnostic for both CPU and GPU
Inviting partners to join us, in all areas
– Hardware companies
– Operating Systems
– Tools and Middleware
– Applications
HSA review committee planned
DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions
and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited
to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product
differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no
obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to
make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.
NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO
RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS
INFORMATION.
ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY
DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL
OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF
EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
AMD, the AMD arrow logo, Radeon, AMD-V and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other
names used in this presentation are for informational purposes only and may be trademarks of their respective owners.
OpenCL and the OpenCL logo are trademarks of Apple, Inc. and are used by permission by Khronos.
© 2012 Advanced Micro Devices, Inc. All Rights Reserved.