Date post: | 11-May-2015 |
Category: |
Technology |
Upload: | amd-developer-central |
THE PROGRAMMER’S GUIDE TO REACHING FOR THE CLOUD
PHIL ROGERS, CORPORATE FELLOW, AMD
NOV. 11, 2013
3 | AMD DEVELOPER SUMMIT | NOVEMBER 2013
MODERN CLOUD WORKLOADS ARE HETEROGENEOUS
Video is expected to represent two thirds of mobile data traffic by 2017 ‒ Video is continuously being captured, uploaded, transcoded and streamed ‒ Video processing is inherently parallel … and can be accelerated
Big data growing exponentially with Exabytes of data crawled monthly ‒ Indexing the web and extracting high definition information ‒ Map reduce is a heterogeneous workload
Natural User Interfaces are still in their infancy ‒ Accurate extraction of meaning from gesture and voice ‒ Getting to the fingertips and voice inflections
SCALAR CONTENT WITH A GROWING MIX OF PARALLEL CONTENT
NEED TO SIMULTANEOUSLY INCREASE PERFORMANCE AND REDUCE POWER
FUTURE TECHNOLOGY GROWTH WILL ACCELERATE THE TREND
Rapid growth of Sensor Networks ‒ Drives exponential increase in data
Internet of Everything (IoE) results in explosion of data sources ‒ Another exponential growth in data at local and cloud level
Context Aware Computing is a Huge Big-Data Problem ‒ Both local and cloud compute must get faster/lower power
DRIVING FUTURE DEMAND FOR LOCAL AND CLOUD PARALLEL EFFICIENCY
HOW MUCH VALUE IS AT STAKE IN THE IOE ECONOMY?
[Figure: $14.4 trillion in total, made up of $9.5 trillion from industry-specific use cases and $4.9 trillion from cross-industry use cases. Source: Cisco IBSG, 2013]
RAPID GROWTH OF THE NUMBER OF THINGS CONNECTED TO THE INTERNET
[Chart: timeline from 1995 to 2020 with a 50B label, spanning four eras: “Fixed” Computing (you go to the device); Mobility / BYOD (the device goes with you); Internet of Things (age of devices); Internet of Everything (people, process, data, things)]
HSA APU PROCESSORS OPERATE HARMONIOUSLY AT LOW POWER
Techniques include: ‒ Image Stabilization, Super Resolution, Deblur, Deinterlace, Lighting & Contrast
Enhancements examine pixels from a large number of video frames ‒ Super-resolution based on information from surrounding frames
Algorithms can be run on multiple processors in the APU ‒ CPU, GPU, DSPs, Fixed Function Accelerators ‒ Convolutions, motion estimation, histograms, format conversions, etc. ‒ Processing flows freely between processors for best efficiency
EXAMPLE: VIDEO ENHANCEMENT
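Histograms are one of the building blocks named above. The following is a minimal sketch of a data-parallel luma histogram in plain Java 8 streams, running on CPU cores only. The class name `FrameHistogram` and the single-plane 8-bit `byte[]` frame layout are assumptions made for illustration; they do not come from the slides or from any AMD library.

```java
import java.util.stream.IntStream;

// Hypothetical helper: counts how often each 8-bit luma value occurs in a frame.
class FrameHistogram {
    static int[] histogram(byte[] luma) {
        return IntStream.range(0, luma.length)
                .parallel()
                .collect(
                        () -> new int[256],                  // one private bin array per worker
                        (bins, i) -> bins[luma[i] & 0xFF]++, // treat each byte as unsigned 0..255
                        (a, b) -> {                          // merge partial histograms
                            for (int k = 0; k < 256; k++) {
                                a[k] += b[k];
                            }
                        });
    }
}
```

Each worker fills its own bin array and the partial results are merged at the end, so no atomics or locks are needed on the bins.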
HETEROGENEOUS PROCESSORS - EVERYWHERE SMARTPHONES TO SUPER-COMPUTERS
Phone
Tablet
Notebook
Workstation
Dense Server
Super computer
A SINGLE SCALABLE ARCHITECTURE FOR THE WORLD’S PROGRAMMERS IS DEMANDED AT THIS POINT
HOW DOES HSA MAKE THIS ALL WORK?
Enables acceleration of languages like Java, C++ AMP and Python
All processors use the same addresses, and can share data structures in place
Heterogeneous computing can use all of virtual and physical memory
Extends multicore coherency to the GPU and other processors
Pass work quickly between the processors
Enables quality of service
HSA FOUNDATION – BUILDING THE ECOSYSTEM
HSA in 2013
HSA FOUNDATION AT LAUNCH BORN IN JUNE 2012
Founders
HSA FOUNDATION TODAY – NOVEMBER 2013 A GROWING AND POWERFUL FAMILY
Founders
Promoters
TBA at APU-13
Supporters
Contributors
Universities: NTHU Programming Language Lab, NTHU System Software Lab
HSA FOUNDATION PROGRESS
Membership growing rapidly ‒ 2-3 new members per month ‒ Universities enrolling
Four working groups generating specifications ‒ HSA Programmers Reference Manual published ‒ HSA System Architecture spec going to ratification by the end of the year ‒ Runtime WG and Tools WG will publish early next year
HSA Development platforms to ship in early 2014
WHAT AN AMAZING FIRST YEAR
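[Diagram: HSA software stack. Components shown: HSAIL; Kernel Fusion Driver (KFD); HSA Core Runtime; HSA Finalizer; HSA Helper Libraries. Applications and their runtimes: OpenCL™ App with OpenCL Runtime; Java App with Java JVM (Sumatra); Python App with Fabric Engine RT; C++ AMP App with Various Runtimes]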
PROGRAMMING LANGUAGES PROLIFERATING ON HSA
Workloads
HIGH EFFICIENCY VIDEO CODEC – HEVC (H.265) VALUE PROPOSITION
30% TO 50% MORE EFFICIENT THAN H.264 AT 1080P RESOLUTION
HEVC VISUAL QUALITY IS SIGNIFICANTLY BETTER THAN H.264 AT ANY GIVEN BIT RATE
H.264 @ 500 kbps
H.265 @ 500 kbps
4K VIDEO BENEFITS ARE EVEN MORE SIGNIFICANT WITH HEVC
4K Ultra HDTV Sony XBR $4999
4K Video Cameras GoPro $399
HIGH EFFICIENCY VIDEO CODEC – HEVC (H.265)
[Chart: mobile traffic share in exabytes per month (0 to 12), 2012 to 2017. Shares shown: Mobile Video 66.5%, Mobile Web/Data 24.9%, Mobile M2M 5.1%, Mobile File Sharing 3.5%. Source: Cisco VNI Mobile Forecast, 2013]
WHY HEVC WILL PROLIFERATE
The next generation MPEG video encoding standard
Significantly higher efficiency (up to 50% lower bit rates at given quality) than AVC (H.264)
Highly beneficial for HD video (1080p or below)
Especially beneficial for 4K video
Scales to 8K Ultra High Definition video (up to 8192×4320)
Computationally complex, but by design easier to parallelize than H.264
CLOUD VIDEO PROVIDERS NEED THE HIGHER COMPRESSION FOR QUALITY OF SERVICE
ALL STAGES OF HEVC ARE ACCELERATED ON THE APU
Decrypt → Decode and decompress → Scaling and Enhancement → Encode and compress → Encrypt
HEVC (H.265) ACCELERATION EFFICIENT CLOUD DEPLOYMENT
ENCODE IS THE HEAVIEST STAGE
Leverage point for compression
Highly parallel
Algorithms improve monthly
Must stay programmable
H.265 ENCODING IS 5 – 10X MORE COMPUTATIONALLY COMPLEX THAN H.264
Picture can be divided into Macroblock regions with a much wider range of sizes and shapes
Intra prediction has 33 directional modes compared to 8 for H.264
OVERVIEW OF B+ TREES
B+ Trees are a special case of B Trees
Fundamental data structure used in several popular database management systems ‒ SQLite ‒ CouchDB
A B+ Tree … ‒ is a dynamic, multi-level index ‒ is efficient for retrieval of data stored in a block-oriented context
Order (b) of a B+ Tree measures the capacity of its nodes
[Figure: example B+ Tree. Leaf nodes hold keys 1 to 8 with data pointers d1 to d8; internal nodes hold keys that index the leaves]
APPLICATIONS THAT USE B/B+ TREES
http://www.sqlite.org/famous.html http://wiki.apache.org/couchdb/CouchDB_in_the_wild
primary data store on the client-side
Mail, Safari, iPhone, iPod, iTunes
Firefox and Thunderbird
Android, Chrome
multi-data center key-value store
market-data framework
large hadron collider
HOW WE ACCELERATE
Utilize coarse-grained parallelism in B+ Tree searches ‒ Perform many queries in parallel ‒ Increase memory bandwidth utilization with parallel reads ‒ Increase throughput (transactions per second for OLTP)
B+ Tree searches on an HSA enabled APU ‒ Allows much larger B+ Trees to be searched than with traditional GPU compute ‒ Eliminates data copies, since CPU and GPU cores can access the same memory
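The coarse-grained approach described above, many independent queries issued at once against one shared tree, can be sketched in plain Java. This is an illustration under stated assumptions, not the implementation measured in the talk: `BPlusBatchSearch` is a hypothetical, read-only, bulk-loaded index over distinct sorted `int` keys, and the batch runs on CPU threads through a parallel stream rather than through OpenCL on HSA.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical static B+ tree: level 0 holds the sorted keys (the leaves);
// each higher level holds the first key of every node one level below.
class BPlusBatchSearch {
    private final int order;      // maximum number of keys per node
    private final int[][] levels; // levels[0] = leaves, last entry = root

    BPlusBatchSearch(int[] sortedKeys, int order) {
        if (order < 2) {
            throw new IllegalArgumentException("order must be at least 2");
        }
        this.order = order;
        List<int[]> built = new ArrayList<>();
        built.add(sortedKeys);
        int[] current = sortedKeys;
        while (current.length > order) { // add levels until the root fits in one node
            int nodes = (current.length + order - 1) / order;
            int[] parent = new int[nodes];
            for (int n = 0; n < nodes; n++) {
                parent[n] = current[n * order];
            }
            built.add(parent);
            current = parent;
        }
        levels = built.toArray(new int[0][]);
    }

    // Returns the position of key in the leaf level, or -1 if it is absent.
    int search(int key) {
        if (levels[0].length == 0) {
            return -1;
        }
        int node = 0;
        for (int l = levels.length - 1; l >= 0; l--) {
            int[] keys = levels[l];
            int start = node * order;
            int end = Math.min(start + order, keys.length);
            int pos = start;
            while (pos + 1 < end && keys[pos + 1] <= key) {
                pos++; // last entry in this node that is <= key
            }
            if (l == 0) {
                return keys[pos] == key ? pos : -1;
            }
            node = pos; // entry pos on this level is node pos one level down
        }
        return -1;
    }

    // Coarse-grained parallelism: every query is an independent read-only lookup.
    int[] searchAll(int[] queries) {
        return Arrays.stream(queries).parallel().map(this::search).toArray();
    }
}
```

Because the tree is never modified during a batch, the lookups need no synchronization, which is what makes the one-query-per-worker mapping straightforward.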
1M search queries in parallel
Input B+ Tree contains 112 million keys and uses 6GB of memory
Hardware: AMD “Kaveri” APU with Quad Core CPU and 8 GCN Compute Units at 35W TDP
Software: OpenCL on HSA
RESULTS
[Chart: speedup (y-axis, 0 to 7) versus order of B+ Tree (x-axis: 8, 16, 32, 64, 128)]
Baseline: 4-core OpenMP + hand-tuned SSE CPU implementation
Results measured in AMD Labs on “Kaveri” APU, 35W TDP, 16GB DRAM
REVERSE TIME MIGRATION (RTM)
A technique for creating images based on sensor data to improve seismic interpretations done by geophysicists
A memory-intensive and highly parallel algorithm
RTM is run on massive data sets
A natural scale out algorithm
Often run today on 100K node CPU systems
Bringing this to HSA and APU based supercomputing will increase performance for current sensor arrays, and allow more sensors and accuracy in the future.
Marine crews Land crews
HOWEVER, SPEED OF PROCESSING AND INTERPRETATION IS A CRITICAL BOTTLENECK IN MAKING FULL USE OF ACQUISITION ASSETS
TEXT ANALYTICS – HADOOP TERASORT AND BIG DATA SEARCH
MINING BIG DATA
Multi-stage pipeline or parallel processing stages
Traditional GPU Compute is challenged by copies
APU with HSA accelerates each stage in place ‒ Sort ‒ Compression ‒ Regular expression parsing ‒ CRC generation
Acceleration of large data search scales out across the cluster of APU nodes
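CRC generation is one of the stages listed above. As a hedged sketch of how such a stage parallelizes, the snippet below checksums fixed-size blocks of an in-memory buffer independently, using the standard `java.util.zip.CRC32` class. The class name `BlockChecksums` and the block-per-task split are illustrative assumptions, and this runs on CPU cores only; it is not the Hadoop or HSA code path.

```java
import java.util.stream.IntStream;
import java.util.zip.CRC32;

// Hypothetical helper: one CRC-32 value per fixed-size block of a buffer.
class BlockChecksums {
    static long[] crcPerBlock(byte[] data, int blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        int blocks = (data.length + blockSize - 1) / blockSize;
        return IntStream.range(0, blocks)
                .parallel()
                .mapToLong(b -> {
                    int offset = b * blockSize;
                    int length = Math.min(blockSize, data.length - offset);
                    CRC32 crc = new CRC32(); // separate instance per block, so no shared state
                    crc.update(data, offset, length);
                    return crc.getValue();
                })
                .toArray();
    }
}
```

The buffer is read in place by every worker and never copied, which mirrors the point of the slide, although here the sharing is only between CPU threads.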
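[Diagram: MapReduce data flow. Input HDFS is divided into split 0, split 1 and split 2, each feeding a map task; map output is sorted, copied and merged into the reduce tasks, which write part 0 and part 1 to output HDFS, each with HDFS replication]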
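The split, map, sort, merge and reduce flow shown in the diagram can be imitated in a single process with Java 8 streams. The word-count sketch below is an assumption-laden illustration: `WordCount` is a hypothetical class, the input "splits" are plain strings rather than HDFS blocks, and nothing here uses the Hadoop API.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical single-process stand-in for a map/sort/reduce word count.
class WordCount {
    static Map<String, Long> count(String[] splits) {
        return Arrays.stream(splits)
                .parallel()                                                // one task per split
                .flatMap(s -> Arrays.stream(s.toLowerCase().split("\\W+"))) // map: emit words
                .filter(w -> !w.isEmpty())
                .collect(Collectors.groupingBy(                             // reduce: count per key
                        w -> w,
                        TreeMap::new,                                       // keeps keys sorted
                        Collectors.counting()));
    }
}
```

For example, `WordCount.count(new String[] {"a b a", "b c"})` maps `a` to 2, `b` to 2 and `c` to 1.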
Programming Languages
PROGRAMMING MODELS EMBRACING HSAIL AND HSA THE RIGHT LEVEL OF ABSTRACTION
UNDER DEVELOPMENT
Java: Project Sumatra OpenJDK 9
OpenMP from SuSE
C++ AMP, based on CLANG/LLVM
Python and KL from Fabric Engine
NEXT
DSLs: Halide, Julia, Rust
Fortran
JavaScript
Open Shading Language
R
HSA ENABLES DEVELOPERS TO LEVERAGE HC … EASILY & NATURALLY
PREFERRED PROGRAMMING LANGUAGES
Java, C++, OpenMP, Python *
SVM, Coherence, GPU Enqueue
OpenJDK/Sumatra, Fabric Engine
TRANSPARENT CALLS TO POPULAR LIBRARIES
OpenCV, SciPy, NumPy, ImageMagick, Bolt, …
Arbitrary data structures, SVM, Coherence, User mode queueing
OpenCV API, Bolt STL library
USING CONVENTIONAL METHODS
Arbitrary data structures, malloc, function pointers, call-backs, recursion, semaphores, atomics
SVM, Coherence, User-mode queueing, GPU Enqueue, HSAIL
Linked-list/tree traversal + other complex shared host data structures
* Java 8, C++ AMP, OpenMP 4.0 next generation standards and extensions
C++ AMP ACCELERATION GOES MULTI-PLATFORM
Herb Sutter announced C++ AMP for the Windows® Platform at ADS 2011
We very much liked the single source model of development, and decided to extend it to be multi-platform
Today we are announcing C++ AMP is moving beyond Microsoft® Windows to embrace Linux. We will offer this acceleration on both our APUs and our discrete GPUs
We are also bringing Bolt STL Library support to C++ AMP
AVAILABLE IN OPEN SOURCE 1H-2014
[Diagram: C++ AMP compilation flow. Components shown: C++ AMP source; CLANG Front-end; LLVM Compiler; LLVM-IR or SPIR 1.2; HSAIL for any HSA implementation; SPIR 1.2 for any OpenCL™+SPIR implementation]
HSA ENABLEMENT OF JAVA
JAVA 8 – HSA ENABLED APARAPI
Java 8 brings Stream + Lambda API. ‒ More natural way of expressing data parallel algorithms ‒ Initially targeted at multi-core.
APARAPI will: ‒ Support Java 8 Lambdas ‒ Dispatch code to HSA enabled devices at runtime via HSAIL
[Stack diagram components: Java Application; APARAPI + Lambda API; JVM; HSAIL; HSA Finalizer & Runtime; CPU ISA and GPU ISA; CPU and GPU]
JAVA 7 – OpenCL ENABLED APARAPI
AMD initiated Open Source project
APIs for data parallel algorithms ‒ GPU accelerate Java applications ‒ No need to learn OpenCL™
Active community captured mindshare ‒ ~20 contributors ‒ >7000 downloads ‒ ~150 visits per day
[Stack diagram components: Java Application; APARAPI API; JVM; OpenCL™; OpenCL™ Compiler & Runtime; CPU ISA and GPU ISA; CPU and GPU]
JAVA 9 – HSA ENABLED JAVA (SUMATRA)
Adds native GPU acceleration to Java Virtual Machine (JVM)
Developer uses JDK Lambda, Stream API
JVM uses GRAAL compiler to generate HSAIL
JVM decides at runtime to execute on either CPU or GPU depending on workload characteristics.
[Stack diagram components: Java Application; Java JDK Stream + Lambda API; JVM; Java GRAAL JIT backend; HSAIL; HSA Finalizer & Runtime; CPU ISA and GPU ISA; CPU and GPU]
We will provide HSA Enabled Aparapi on Java 8 to bridge between Aparapi on Java 7 and HSA/Sumatra on Java 9
JAVA DEMO WELCOME GARY FROST TO THE STAGE
NBODY REVISITED
NBody problem: ‒ Calculate the position of ‘N’ bodies in 3D space by computing the gravitational effect each has on all of the others and updating its position.
A Java sequential NBody implementation would start with an Object for each Body.
Then we would iterate over all bodies updating the position of each
A pre Java 8 Java ‘parallel’ version would not fit so nicely on this slide ;)
public class Body {
    // State of object
    private float x, y, z, m, vx, vy, vz;
    // Method to update position relative to other bodies
    void updatePosition(Body[] bodies) { /* code omitted */ }
}
for (Body b : bodies) { b.updatePosition(bodies); }
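For comparison, here is roughly what a pre-Java 8 'parallel' version of the loop involves. This is a sketch only: `PreJava8Parallel`, the thread-count parameter and the chunking scheme are assumptions for illustration, and `updatePosition` is a trivial placeholder (a constant drift) rather than the gravitational update that the slide omits.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PreJava8Parallel {
    static class Body {
        float x, vx;
        // Placeholder physics (assumption): each body only drifts by its own velocity.
        void updatePosition(Body[] bodies) { x += vx; }
    }

    // Java 7 style: split the array into chunks and hand each chunk to a pool thread.
    static void updateAll(final Body[] bodies, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<Future<?>>();
            int chunk = (bodies.length + threads - 1) / threads;
            for (int t = 0; t < threads; t++) {
                final int from = t * chunk;
                final int to = Math.min(bodies.length, from + chunk);
                pending.add(pool.submit(new Runnable() {
                    public void run() {
                        for (int i = from; i < to; i++) {
                            bodies[i].updatePosition(bodies);
                        }
                    }
                }));
            }
            for (Future<?> f : pending) {
                f.get(); // wait for every chunk and surface any failure
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

The thread pool, chunk arithmetic and exception handling are all boilerplate that the one-line sequential loop does not need.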
JAVA 8’S ‘PROJECT LAMBDA’ SIMPLIFIES PARALLEL PROGRAMMING
Offers an alternate syntax for processing arrays/collections of data
To process a stream in parallel we just tag the stream with the parallel() modifier
In Java 8 a parallel stream executes across all CPU cores.
In Java 9 (Sumatra) a parallel stream executes across all CPU and GPU cores
for (Body b : bodies) b.updatePosition(bodies);
Arrays.stream(bodies) // wrap array in a stream
    .forEach(b -> b.updatePosition(bodies));
Arrays.stream(bodies) // wrap array in a stream
    .parallel() // tag the stream as parallel
    .forEach(b -> b.updatePosition(bodies));
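A fuller, runnable version of the parallel-stream NBody step might look like the sketch below. Several things here are assumptions rather than the demo's code: the class name `NBodyStreams`, the constants `G` and `SOFTENING`, and the split of the update into two phases. The split is a deliberate design choice: if each body moved while other threads were still reading its position, the parallel loop would race.

```java
import java.util.Arrays;

class NBodyStreams {
    static final float G = 1.0f;          // hypothetical gravitational constant
    static final float SOFTENING = 1e-3f; // hypothetical term that avoids division by zero

    static final class Body {
        float x, y, z, m, vx, vy, vz;

        Body(float x, float y, float z, float m) {
            this.x = x; this.y = y; this.z = z; this.m = m;
        }

        // Phase 1: reads every body's position, writes only this body's velocity.
        void updateVelocity(Body[] bodies, float dt) {
            float ax = 0f, ay = 0f, az = 0f;
            for (Body o : bodies) {
                if (o == this) {
                    continue;
                }
                float dx = o.x - x, dy = o.y - y, dz = o.z - z;
                float distSq = dx * dx + dy * dy + dz * dz + SOFTENING;
                float scale = (float) (G * o.m / (distSq * Math.sqrt(distSq)));
                ax += dx * scale;
                ay += dy * scale;
                az += dz * scale;
            }
            vx += ax * dt;
            vy += ay * dt;
            vz += az * dt;
        }

        // Phase 2: touches only this body's own state.
        void move(float dt) {
            x += vx * dt;
            y += vy * dt;
            z += vz * dt;
        }
    }

    static void step(Body[] bodies, float dt) {
        // Each forEach finishes before the next starts, so phase 2 never overlaps phase 1.
        Arrays.stream(bodies).parallel().forEach(b -> b.updateVelocity(bodies, dt));
        Arrays.stream(bodies).parallel().forEach(b -> b.move(dt));
    }
}
```

On Java 8 both phases run across the CPU cores of the common fork/join pool.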
JAVA DEMO
JAVA AND THE CLOUD
Java 8 and Java 9 provide parallel acceleration
Parallel workloads are proliferating in the cloud
Hadoop framework for scale out
HSA APUs provide workload acceleration
THE RIGHT LANGUAGE WITH ACCELERATION ON CLOUD APUS
“THE ROLE OF JAVA™ IN HETEROGENEOUS COMPUTING, AND HOW YOU CAN HELP”
DON’T MISS THE KEYNOTE TOMORROW FROM ORACLE’S NANDINI RAMANI
Programming Tools
ANNOUNCING AMD’S UNIFIED SDK
Access to AMD APU and GPU programmable components
Component installer - choose just what you need
APP SDK 2.9
Web-based sample browser
Supports programming standards: OpenCL™, C++ AMP
Code samples for accelerated open source libraries: ‒ OpenCV, OpenNI, Bolt, Aparapi
OpenCL™ source editing plug-in for Visual Studio
Now supports CMake
AMD Unified SDK
MEDIA SDK 1.0 BETA
GPU accelerated video pre/post processing library
Leverage AMD's media encode/decode acceleration blocks
Library for low latency video encoding
Supports both Windows Store and Classic desktop
Initial release includes: ‒ APP SDK v2.9 ‒ Media SDK 1.0 Beta
ANNOUNCING AMD CODEXL V1.3
AMD’s comprehensive heterogeneous developer tool suite including: ‒ CPU and GPU Profiling ‒ GPU kernel Debugging ‒ GPU kernel analysis
New features in version 1.3: ‒ Supports Java ‒ Integrated static kernel analysis ‒ Remote debugging/profiling ‒ Supports latest AMD APU and GPU products
CPU PROFILER
Time-based profiling
Analyze call-chain relationships
Java profiling with inline function support
Cache-line utilization profiling
Supports latest AMD processors
GPU PROFILER
OpenCL™ Application Trace
Profile OpenCL kernels
Timeline visualization of GPU counter data
Kernel Occupancy Viewer
Remote GPU Profiling
GPU DEBUGGER
Real-time OpenCL kernel debugging with stepping and variable display
OpenCL and OpenGL API Statistics
Object visualization
Remote GPU debugging
STATIC KERNEL ANALYZER
Compile, analyze and disassemble OpenCL Kernels
View kernel compilation errors/warnings
Estimate kernel performance
View generated ISA code
View registers
OPEN SOURCE LIBRARIES ACCELERATED BY AMD
OpenCV
Most popular computer vision library
Now with many OpenCL™ accelerated functions
Bolt
C++ template library
Provides GPU off-load for common data-parallel algorithms
Now with cross-OS support and improved performance/functionality
clMath
AMD released APPML as open source to create clMath
Accelerated BLAS and FFT libraries
Accessible from Fortran, C and C++
Aparapi
OpenCL™ accelerated Java 7
Java APIs for data parallel algorithms (no need to learn OpenCL™)
AMD APUS, HSA – CLIENT TO THE CLOUD
Parallel workloads are booming ‒ Acceleration where the data is ‒ On the client for a snappy user experience ‒ In the cloud for scalable services
HSA enabled APUs in the cloud ‒ Big data analytics ‒ Video processing ‒ Science, imaging, genomics ‒ Unleashing the Java development community
Acceleration at all tiers of the cloud ‒ Data centers, media hubs, cloud periphery
A CONVERGENCE AT THE RIGHT TIME
A SPECIAL GUEST
Gary Campbell Infrastructure Technology Strategy CTO HP
DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. OpenCL is a trademark of Apple Inc. and Microsoft and Windows are trademarks of Microsoft Corp. Other names are for informational purposes only and may be trademarks of their respective owners.
GARY CAMPBELL INFRASTRUCTURE TECHNOLOGY STRATEGY CTO
HP
MOONSHOT SERVER CARTRIDGE WITH AMD FUTURE AVAILABILITY
Cartridge config
• 4 x Quad-core 1.5 GHz
• 8 x 1GbE NICs
• 4 x 8GB Memory
• 32GB iSSD per SOC
Chassis config
• 45 AMD Opteron X2150 cartridges
• Dual 180 x 1GbE switch modules
• Dual 40GbE uplink modules
• 4 x 1500 watt platinum PS (N+1)
• Chassis management module
• 5 Dual-rotor, hot plug fans (N+1)
* Future availability
AMD + HP MOONSHOT = BEST SOLUTION FOR HOSTED DESKTOPS
Up to 90% faster deployment*
Up to 6x faster graphics frames per second*
Up to 44% better TCO & 12% less power*
• Built on HP Moonshot technology for 45% of remote desktop market
• Dedicated CPU and GPU support for 180 users in a single chassis
• Predictable cost, scaling, and performance with pre-determined sizing
SIMPLIFIED DEPLOYMENT
REDUCED TCO
CONSISTENT USER PERFORMANCE
* Based on HP internal estimates compared to traditional desktops
HP INVESTING IN INNOVATION ACROSS THE ECOSYSTEM OFFERING THE RESOURCES AND SCALE TO HELP DESIGNERS REACH MAINSTREAM MARKETS
Discovery Labs in U.S., France, China and Singapore - plus HP expertise and services
Moonshot Concierge Support
HP Pathfinder Innovation Ecosystem
Select technology partnerships with the industry’s “best of the best” innovators
[Diagram labels: Solution Builder Program; Leading Technology Partnerships; 3x faster time to innovation; Service & Consulting; HP Discovery Lab; Acquire on your terms]
Watch Discovery Lab video http://www.youtube.com/watch?v=ZuO-zcmjvgw Email Discovery Lab ([email protected]) to find out more