Post on 30-Mar-2018
1 | Stanford HPC Workshop 2011
Agenda
• AMD Opteron 6200 and 4200 Series Processor Overview
• AMD HPC Ecosystem and Developer Information
• Resources for Developers and Customer Benchmarking
2 | Stanford HPC Workshop 2011
INTRODUCING THE NEW AMD OPTERON 6200 AND 4200 SERIES PROCESSORS
AMD Opteron™ 6200 Series Processor (“Interlagos”) and AMD Opteron™ 4200 Series Processor (“Valencia”)
Scalability
  6200 Series: Up to 4 sockets with up to 16 cores
  4200 Series: Up to 2 sockets with up to 8 cores
Memory
  6200 Series: 4 memory channels, up to 1600 MHz memory
  4200 Series: 2 memory channels, up to 1600 MHz memory
Frequency
  Both series: Up to 3.3 GHz base frequency and up to 3.7 GHz using AMD Turbo CORE technology*
Cache
  6200 Series: L1 - 16KB data per core + 64KB instruction per module; L2 - 1MB per core; L3 - 16MB per socket
  4200 Series: L1 - 16KB data per core + 64KB instruction per module; L2 - 1MB per core; L3 - 8MB per socket
I/O
  6200 Series: Four x16 HyperTransport™ technology 3.0 links @ up to 6.4GT/s per link
  4200 Series: Three x16 HyperTransport™ technology 3.0 links @ up to 6.4GT/s per link
Power
  6200 Series: 85W to 140W TDP (consistent with AMD Opteron 6100 Series)
  4200 Series: 35W to 95W TDP (consistent with AMD Opteron 4100 Series)
The world’s first x86 16-core processor¹
The world’s lowest x86 power-per-core²
¹ Based on 16-core AMD Opteron 6200 Series processor compared to 6-core Intel Xeon 5600 Series and 10-core Intel Xeon E7 processors.
² As of Nov 1, 2011, AMD Opteron™ processor Models 4200 EE have the lowest known power per core of any x86 server processor, at 35W TDP (35W / 8 = 4.375W/core). Intel's lowest power per core server processor, L5630, is 40W TDP (40W / 4 = 10W/core). See http://www.intel.com/Assets/PDF/prodbrief/323501.pdf. Previous record held by AMD Opteron processor Models 4100 EE at 35W TDP / 6 cores = 5.83W/core.
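The power-per-core figures in the footnote are simple divisions of TDP by core count. A minimal sketch that reproduces them is shown here; the function name and the list layout are illustrative choices, and only the wattages and core counts come from the footnote.

```python
def watts_per_core(tdp_watts: float, cores: int) -> float:
    """Return TDP divided evenly across cores, in watts per core."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return tdp_watts / cores


# (label, TDP in watts, core count), all taken from the footnote above.
PARTS = [
    ("AMD Opteron 4200 EE", 35, 8),
    ("Intel Xeon L5630", 40, 4),
    ("AMD Opteron 4100 EE", 35, 6),
]

for label, tdp, cores in PARTS:
    print(f"{label}: {watts_per_core(tdp, cores):.3f} W/core")
```

TDP is a design envelope rather than measured draw, so this is a comparison of rated limits, not of actual consumption.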
3 | Stanford HPC Workshop 2011
“BULLDOZER” MODULE TECHNOLOGY
Leadership Multi-Threaded Micro-Architecture
Full Performance From Each Core
• Dedicated execution units per core
• No shared execution units as with SMT
Shared Double-sized FPU
• Amortizes very powerful 256-bit unit across both cores
Improved IPC
• Micro-architecture and ISA enhancements
• SSE4.1/4.2, AVX, FMA4, SSSE3, XOP
Virtualization Enhancements
• Faster switching between VMs
• AMD-V extended migration support
High Frequency / Low-Power Design
• Core Performance Boost: “boosts” frequency of cores when available power allows; no idle core requirement
• Power efficiency enhancements: significantly reduced leakage power; more aggressive dynamic power management
Diagram legend: Dedicated Components; Shared at the module level; Shared at the chip level
4 | Stanford HPC Workshop 2011
FLEX FP 256-BIT FPU AND NEW "BULLDOZER" INSTRUCTIONS
• A flexible floating point unit shared between 2 integer cores
• Simultaneously executes two 128-bit instructions or one 256-bit instruction
• Saves die space and conserves power for majority of non-FP applications
• Dedicated floating point scheduler, which minimizes latency for floating point applications
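The "two 128-bit or one 256-bit" statement can be turned into a rough peak-throughput estimate. The sketch below is a back-of-the-envelope model, not a measured result: it assumes double precision (64-bit lanes) and that every lane retires one fused multiply-add (2 flops) per cycle. The 8-module, 2.3 GHz example corresponds to the 16-core model 6276 mentioned elsewhere in this deck; the function names are hypothetical.

```python
def dp_lanes_per_cycle(instr_width_bits: int, instrs_per_cycle: int) -> int:
    """Double-precision (64-bit) lanes processed per cycle by one shared FPU."""
    return (instr_width_bits // 64) * instrs_per_cycle


def peak_dp_gflops(modules: int, clock_ghz: float, flops_per_lane: int = 2) -> float:
    """Theoretical peak GFLOPS per socket.

    Assumption (not from the slides): each lane completes one fused
    multiply-add, counted as 2 flops, on every cycle.
    """
    lanes = dp_lanes_per_cycle(256, 1)
    return modules * lanes * flops_per_lane * clock_ghz


# Both issue modes described above cover the same number of lanes.
print(dp_lanes_per_cycle(128, 2), dp_lanes_per_cycle(256, 1))

# 16 cores = 8 two-core modules, at the 2.3 GHz base clock.
print(f"{peak_dp_gflops(8, 2.3):.1f} GFLOPS theoretical peak")
```

Real applications land well below this bound, since it ignores memory bandwidth, instruction mix, and clock changes under load.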
Instructions and Applications/Use Cases
SSSE3, SSE4.1, SSE4.2 (AMD and Intel)
• Video encoding and transcoding
• Biometrics algorithms
• Text-intensive applications
AESNI, PCLMULQDQ (AMD and Intel)
• Applications using AES encryption
• Secure network transactions
• Disk encryption (MSFT BitLocker)
• Database encryption (Oracle)
• Cloud security
AVX (AMD and Intel)
Floating point intensive applications:
• Signal processing / Seismic
• Multimedia
• Scientific simulations
• Financial analytics
• 3D modeling
FMA4 (AMD Unique)
• Vector and matrix multiplications
• Polynomial evaluations
• Chemistry, physics, quantum mechanics and digital signal processing
XOP (AMD Unique)
• Numeric applications
• Multimedia applications
• Algorithms used for audio/radio
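Before relying on any of these instruction sets, it helps to confirm that the target machine reports them. The sketch below assumes a Linux host, where `/proc/cpuinfo` lists CPU features on a `flags` line using names such as `sse4_1`, `fma4`, and `xop`. The sample text and the function names are hypothetical; on a real system the input would be the contents of `/proc/cpuinfo`.

```python
# Linux flag names for the instruction sets listed on this slide.
WANTED = ("ssse3", "sse4_1", "sse4_2", "aes", "pclmulqdq", "avx", "fma4", "xop")


def parse_cpu_flags(cpuinfo_text: str) -> set:
    """Return the feature flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def feature_report(cpuinfo_text: str) -> dict:
    """Map each wanted instruction-set flag to True/False."""
    flags = parse_cpu_flags(cpuinfo_text)
    return {name: name in flags for name in WANTED}


# Hypothetical excerpt; on Linux use open("/proc/cpuinfo").read() instead.
SAMPLE = (
    "processor\t: 0\n"
    "flags\t\t: fpu sse sse2 ssse3 sse4_1 sse4_2 aes pclmulqdq avx fma4 xop\n"
)

for name, present in feature_report(SAMPLE).items():
    print(f"{name:10s} {'yes' if present else 'no'}")
```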
5 | Stanford HPC Workshop 2011
DESIGNED TO DRIVE DOWN POWER REQUIREMENTS
More Low Power Memory Choices
Low and Ultra Low Voltage Memory
1.35V DIMMs reduce voltage by 10%; 1.25V DIMMs reduce voltage by 16%¹
Reduces Idle CPU Power By Up to 46%²
C6 power state
Shuts down clocks and power to idle cores
Intelligent Circuit Design
All New Design
Minimizes the number of active transistors for lower power and better performance
Enables More Power Control for IT
TDP Power Cap
Flexibility to set power limits without capping frequency
Up to 56% better power-per-core than Xeon³
¹ Regular voltage = 1.5V, low voltage = 1.35V, ultra-low voltage = 1.25V.
² Based on internal testing as of 8/2011: AMD Opteron™ processor model 6174 (12-core 2.2GHz) consumes 11.7W in active idle C1E power state, while AMD Opteron™ processor model 6276 (16-core 2.3GHz) consumes only 6.4W in the active idle C1E power state with new C6 power gating employed. System configuration: “Drachma” reference design kit, 32GB (8 x 4GB DDR3-1333) memory, 500GB SATA disk drive, Microsoft® Windows Server® 2008 x64 Enterprise Edition R2. SVR-60.
³ Based on AMD Opteron 4200 Series processor with 8 cores at 35W TDP versus lowest wattage, highest core count Intel Xeon processor with 6 cores at 60W TDP.
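The percentages on this slide follow from the footnote values by the usual relative-reduction formula. The sketch below only redoes that arithmetic; the helper name is an illustrative choice. From the footnote numbers alone, the idle-power reduction works out to roughly 45%, in line with the "up to 46%" headline.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from 'before' to 'after', in percent."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100.0


# DIMM voltage versus the regular 1.5 V level (footnote 1).
print(f"1.35 V DIMMs: {percent_reduction(1.5, 1.35):.1f}% lower voltage")
print(f"1.25 V DIMMs: {percent_reduction(1.5, 1.25):.1f}% lower voltage")

# Active-idle power, model 6174 versus model 6276 (footnote 2).
print(f"Idle power:   {percent_reduction(11.7, 6.4):.1f}% lower")

# Power per core, 60 W / 6 cores versus 35 W / 8 cores (footnote 3).
print(f"W per core:   {percent_reduction(60 / 6, 35 / 8):.1f}% lower")
```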
6 | Stanford HPC Workshop 2011
AMD TURBO CORE TECHNOLOGY
Base frequency with TDP headroom
All core boost activated (up to 500 MHz)
Max turbo activated (up to 1 GHz+, half cores)
All Core Boost
When there is TDP headroom in a given workload, AMD Turbo CORE technology is automatically activated and can increase clock speeds by 300-500 MHz* across all cores.
Max Turbo Boost
When a lightly threaded workload sends half the “Bulldozer” modules into C6 sleep state but also requests max performance, AMD Turbo CORE technology can increase clock speeds by up to 1 GHz+* across half the cores.
*Based on AMD Opteron™ 6200 Series processors with up to 300 MHz in P1 boost state and up to 1 GHz+ in P0 boost state over base P2 clock frequency.
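The two boost modes can be summarized as a small decision rule. The sketch below is a simplified model for illustration only, not AMD's actual power-management logic: it assumes boost is a yes/no function of TDP headroom and of how many modules are awake, and the clock values in the example are hypothetical round numbers in the ranges quoted on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TurboModel:
    base_mhz: int             # P2: base clock
    all_core_boost_mhz: int   # P1: boost available with all cores active
    max_turbo_boost_mhz: int  # P0: boost available with half the modules asleep


def effective_clock_mhz(model: TurboModel, active_modules: int,
                        total_modules: int, tdp_headroom: bool) -> int:
    """Pick a clock from the three states described on the slide."""
    if not 0 <= active_modules <= total_modules:
        raise ValueError("active_modules must be between 0 and total_modules")
    if not tdp_headroom:
        return model.base_mhz
    if active_modules * 2 <= total_modules:
        # At least half the modules are in C6: max turbo on the rest.
        return model.base_mhz + model.max_turbo_boost_mhz
    return model.base_mhz + model.all_core_boost_mhz


# Illustrative values only: 2.3 GHz base, +300 MHz all-core, +1 GHz max turbo.
example = TurboModel(base_mhz=2300, all_core_boost_mhz=300, max_turbo_boost_mhz=1000)
print(effective_clock_mhz(example, 8, 8, tdp_headroom=True))
print(effective_clock_mhz(example, 4, 8, tdp_headroom=True))
print(effective_clock_mhz(example, 8, 8, tdp_headroom=False))
```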
7 | Stanford HPC Workshop 2011
THE HPC LEADER
¹⁻³ See complete benchmark data on slides 27-29.
Customer Requirements:
Scalable performance
Strong floating point performance
High memory throughput
More cores for highly threaded apps
Wide range of technical instructions
Linux OS
Open64
GCC
PGI Compilers
Superior Performance¹
24-88% better performance at significantly lower price¹
Greatest FLOPs per Sq. Foot
With almost twice the FLOPs per sq. ft. with “Interlagos”, it would take Intel almost 2 racks to match AMD in density and performance²
[Chart: HPC benchmark comparison, Xeon 5670 vs. AMD Opteron 6276, across SPECFP, STREAM, LINPACK, LAMMPS, NAMD and WRF; axis ticks from 0 to 200]
73GB/s memory throughput³
73% more memory bandwidth than Intel³
Maximum cores per rack²
More FLOPs per sq. foot²
33% lower cost per core⁴
HPC ISV ECOSYSTEM & DEVELOPER INFORMATION
Scot Schultz, Sr. Strategic Alliance Manager, HPC
Scot.Schultz@amd.com
9 | Stanford HPC Workshop 2011
AMD STRATEGIC HPC ISV COMMERCIAL PARTNER ECOSYSTEM INCLUDES LEADING GLOBAL FORTUNE 500 COMPANIES…
10 | Stanford HPC Workshop 2011
DESIGN/SIMULATION – EVERY CUSTOMER IS UNIQUE
Ansys Mechanical
Workbench
Algor
Moldflow
CFDesign
NX Mechatronics
NX Nastran
CATIA/DELMIA SIMULIA
Abaqus/Standard
MSC Nastran
LS-DYNA Implicit
LS-DYNA Explicit
Marc, Dytran, Adams
STAR products
HyperWorks RADIOSS Acusolve
PAM-series
PowerFLOW
XFlow
Fluent CFX
COMSOL Multiphysics
CoreTech Moldex3D
DEM Solutions EDEM
Flow Science Flow3D
Impetus AFEA
Metariver Technology SAMADAII
Wolfram Mathematica
From conceptual design to simulation – customers depend on many ISV packages
Adoption of AMD technology ensures optimized multi-disciplinary, concurrent workflows and accelerated design practices.
11 | Stanford HPC Workshop 2011
HPC ISV ECOSYSTEM ENGAGEMENT STRATEGY
AMD is engaged with commercial HPC ISV partners and Open Source Software applications worldwide
• Focus is on initial performance, tuning and identifying optimization opportunities, with ISVs receiving development platforms
AMD is engaged with interconnect partners, such as Mellanox Technologies (ConnectX® technology)
• Focus is on ensuring drivers and OFED releases are interoperable and optimized
AMD is engaged with the community of middleware, MPI, job scheduler and virtual SMP technology companies
• Focus is on ensuring successfully tuned and optimized solutions on AMD hardware
12 | Stanford HPC Workshop 2011
OPTIMIZING HPC APPLICATIONS… START HERE!
Application optimization by recompiling with optimized compiler and tuning for new architecture
Application optimization by linking to ACML 5.x (AMD Core Math Library)
Open Source Software optimization by recompiling with optimized compiler
13 | Stanford HPC Workshop 2011
APPLICATION OPTIMIZATION BY LINKING TO ACML
ACML 5.0 Overview
Full implementation of Level 1, 2 and 3 Basic Linear Algebra Subroutines (BLAS), with key routines optimized for high performance on AMD Opteron™ processors.
A full suite of Linear Algebra (LAPACK) routines. As well as taking advantage of the highly-tuned BLAS kernels, a key set of LAPACK routines has been further optimized to achieve considerably higher performance than standard LAPACK implementations.
A comprehensive suite of Fast Fourier Transforms (FFTs) in single-, double-, single-complex and double-complex data types.
Random Number Generators in both single- and double-precision.
Compiler Support
• Absoft Pro Fortran
• GFORTRAN
• Intel Fortran (Linux, Windows)
• NAG Fortran
• Open64
• PGI Fortran (Linux, Windows)
For more information on ACML, go to: http://developer.amd.com/libraries/acml/pages/default.aspx
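For readers unfamiliar with Level 3 BLAS, its central routine is the general matrix multiply (GEMM), which computes C ← αAB + βC. The sketch below is a plain-Python reference of that operation, written only to show what a tuned library routine computes. It is not ACML's interface, and the function name and list-of-lists layout are illustrative assumptions.

```python
def gemm(alpha, a, b, beta, c):
    """Return alpha * (a x b) + beta * c for matrices given as lists of rows.

    a is m x k, b is k x n, c is m x n. Works for real or complex entries,
    mirroring the S/D (real) and C/Z (complex) GEMM variants in BLAS.
    """
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions of a and b do not match")
    if len(c) != m or any(len(row) != n for row in c):
        raise ValueError("c must be m x n")
    return [
        [
            alpha * sum(a[i][p] * b[p][j] for p in range(k)) + beta * c[i][j]
            for j in range(n)
        ]
        for i in range(m)
    ]


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = [[1.0, 1.0], [1.0, 1.0]]
print(gemm(2.0, A, B, 1.0, C))
```

An optimized library performs the same arithmetic with blocking, vector instructions, and threading, which is where the performance difference comes from.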
14 | Stanford HPC Workshop 2011
APPLICATION OPTIMIZATION BY LINKING TO ACML
ACML 5.0 (Aug 2011)
  Linear Algebra: SGEMM (single precision); DGEMM (double precision); L1 BLAS
  Fast Fourier Transforms (FFT): Complex-to-Complex (C-C) single precision FFTs
  Others: Random Number Generators; AVX compiler switch for Fortran
  Compiler Support: Absoft; GCC 4.6; Open64 4.2.5; PGI 11.8, 11.9; ICC 12; Cray to begin deployment of ACML with their compiler with ACML 5.0
ACML 5.1 (Dec 2011)
  Linear Algebra: CGEMM (complex single precision); ZGEMM (complex double precision)
  Fast Fourier Transforms (FFT): Real-to-complex (R-C) single precision FFTs; double precision C-C and R-C FFTs
  Compiler Support: All compilers listed for ACML 5.0 will be supported
For additional information on ACML, go to: http://developer.amd.com/libraries/acml/pages/default.aspx
15 | Stanford HPC Workshop 2011
APPLICATION OPTIMIZATION BY RECOMPILE
“Bulldozer” compiler optimizations enabled by -march=bdver1*
• Support for all new instructions (SSSE3, SSE4.1, SSE4.2, AVX, FMA4, and XOP)
• Automatically selects instructions to improve performance (intrinsics and inline)
• Automatic calls to libM (math library) functions that use these new instructions
• Code generation tuned for microarchitecture, e.g. instruction latencies, cache sizes
• Adjusted to take advantage of the improved hardware prefetcher
• Improvements in code layout and alignment to take advantage of shared compute unit, e.g. “dispatch scheduling”
Open64: Production quality code generation tool designed for high performance parallel computing workloads and enabling the developer to build and optimize C, C++, and Fortran applications targeting x86 Linux platforms
* Additional information: http://developer.amd.com/tools/open64/Documents/open64.html
16 | Stanford HPC Workshop 2011
APPLICATION OPTIMIZATION BY RECOMPILING
[Chart: Relative Improvements, Polyhedron 2011 Benchmark using Open64 4.2.5.2 Compiler and AMD Opteron™ Model 6272 processor*. Series: O3, Ofast. Configurations: 4.2.3 barcelona, +bdver1, +AVX, +FMA4. Axis ticks from 95% to 125%.]
More info on Polyhedron 2011: http://www.polyhedron.com
More info on Open64: http://developer.amd.com/open64
17 | Stanford HPC Workshop 2011
APPLICATION OPTIMIZATION BY RECOMPILING
“Bulldozer” compiler optimizations enabled by -march=bdver1
• Support for all new instructions (SSSE3, SSE4.1, SSE4.2, AVX, FMA4, and XOP)
• Automatically selects instructions to improve performance (intrinsics and inline)
• Scalar and vector libm calls available with AMD Libm
• Code generation tuned for microarchitecture, e.g. instruction latencies, cache sizes
• Memset/Memcpy inliner heuristics
• Defaults to 128-bit vectorization
• Improvements in code layout and alignment
Available from the Free Software Foundation (FSF) and offering support for the latest AMD processor-based platforms. GCC 4.6 includes support for AMD Opteron 4200 and 6200 Series processors.
Additional information: http://developer.amd.com/tools/gnu/pages/default.aspx
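In a build script, recompiling for "Bulldozer" amounts to adding the `-march=bdver1` switch to the GCC command line. The sketch below assembles such a command as an argument list without running it. The helper name, the `-O3` default, and the file names are illustrative assumptions; only `-march=bdver1` comes from the slide.

```python
import shlex


def bulldozer_gcc_command(sources, output="a.out", opt_level="-O3", extra_flags=()):
    """Build a GCC argument list targeting the bdver1 microarchitecture."""
    if not sources:
        raise ValueError("at least one source file is required")
    return ["gcc", opt_level, "-march=bdver1", *extra_flags, "-o", output, *sources]


# Hypothetical source and output names, for illustration only.
cmd = bulldozer_gcc_command(["solver.c"], output="solver")
print(shlex.join(cmd))

# To actually compile, pass the list to subprocess.run(cmd, check=True)
# on a machine with GCC 4.6 or later installed.
```

A binary built this way may use AVX, FMA4, and XOP instructions, so it should be run only on processors that support them.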
18 | Stanford HPC Workshop 2011
APPLICATION OPTIMIZATION BY RECOMPILING
Compiler Support Summary
Compiler status for SSSE3, SSE4.1-.2, AVX, FMA4 and XOP support
GCC 4.6.2
  Status: Available
  Comments: GCC 4.4 is included in RHEL 6.0 distribution and should be updated to GCC 4.6.2 for optimized support
Microsoft Visual Studio 2010 SP1
  Status: Available
  Auto Generates Code: No
  Comments: Supports new instructions but does not auto generate code
Open64 4.2.5
  Status: Available
  Comments: http://developer.amd.com/open64
Open64 4.5
  Status: Planned for Dec 2011
  Comments: Will provide incremental performance and functionality improvements
PGI 11.9
  Status: Available
  Comments: PGI Unified Binary™ technology combines into a single executable or object file code optimized for multiple AMD and Intel processors
Compiler Optimization Quick Guide: http://developer.amd.com/Assets/CompilerOptQuickRef-62004200.pdf
19 | Stanford HPC Workshop 2011
AMD DEVELOPER CENTRAL // CODE FASTER, FASTER CODE
The central resource for tools, technologies, best practices, and expert guidance to optimize your software solution performance on AMD platforms.
RESOURCES FOR ISV DEVELOPERS AND CUSTOMER BENCHMARKING
21 | Stanford HPC Workshop 2011
LATEST AVAILABLE AMD OPTERON™ CLUSTER RESOURCES
Vesta
(11) Dell™ PowerEdge R815 Compute Nodes, AMD Opteron™ 6276 Compute Nodes
704 CPU Cores, 128GB DDR3-1333/node, Dual Mellanox ConnectX®-2 40Gb/s IB
Request access at http://www.hpcadvisorycouncil.com
Dodecas
(8) AMD 6000 Series Platforms, AMD Opteron™ 6276 Compute Nodes + AMD ATI FirePro 7800 GPU
256 CPU Cores, 32GB DDR3-1066/node, Mellanox ConnectX ® 40Gb/s InfiniBand
Request access at http://www.hpcadvisorycouncil.com
Mercury
(6) Dell™ C6145 Compute Nodes, AMD Opteron™ 6276 Compute Nodes
384 CPU Cores, 128GB DDR3-1333/node, Dual Mellanox ConnectX®-2 40Gb/s IB
Request access at http://www.hpcadvisorycouncil.com
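The node and core counts above give a quick view of how each cluster is built. The sketch below divides total cores by node count and then by 16, the core count of the Opteron 6276 given earlier in this deck, to estimate sockets per node. The dictionary layout and function name are illustrative.

```python
CORES_PER_6276 = 16  # the 6276 is described as a 16-core part earlier in the deck

# name -> (compute nodes, total CPU cores), as listed on this slide.
CLUSTERS = {
    "Vesta": (11, 704),
    "Dodecas": (8, 256),
    "Mercury": (6, 384),
}


def cores_per_node(nodes: int, total_cores: int) -> int:
    """Cores in each node, assuming identical nodes."""
    per_node, remainder = divmod(total_cores, nodes)
    if remainder:
        raise ValueError("total cores do not divide evenly across nodes")
    return per_node


for name, (nodes, total) in CLUSTERS.items():
    per_node = cores_per_node(nodes, total)
    sockets = per_node // CORES_PER_6276
    print(f"{name}: {per_node} cores per node, {sockets} sockets per node")
```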
22 | Stanford HPC Workshop 2011
REFERENCES
x86 Compiler Quick Reference Guide for “Bulldozer” processors
http://developer.amd.com/Assets/CompilerOptQuickRef-62004200.pdf
Using the x86 Open64 Compiler Suite
http://developer.amd.com/tools/open64/Documents/open64.html
x86 Open64 4.2.5.2 Release Notes
http://developer.amd.com/tools/open64/assets/ReleaseNotes.txt
ACML 5.0 Information
http://developer.amd.com/libraries/acml/features/pages/default.aspx
Software Optimization Guide for “Bulldozer” processors
http://support.amd.com/us/Processor_TechDocs/47414.pdf
AMD64 Architecture Programmer’s Manual: 128-Bit and 256-Bit XOP and FMA4 Instructions
http://support.amd.com/us/Embedded_TechDocs/43479.pdf
23 | Stanford HPC Workshop 2011
Trademark Attribution
AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names used in this presentation are for identification purposes only and may be trademarks of their respective owners.
©2011 Advanced Micro Devices, Inc. All rights reserved.