Productive Parallel Programming for Intel® Xeon Phi™ Coprocessors
Bill Magro, Director and Chief Technologist, Technical Computing Software, Intel Software & Services Group
Still an Insatiable Need For Computing
The PetaFlop systems of today are the client and handheld systems of 10 years later.
[Chart: Top500 performance trend and forecast, 1993-2029, rising from 100 MFlops through 1 TFlops, 1 PFlops, and 1 EFlops toward 1 ZFlops; application drivers include weather prediction, medical imaging, and genomics research]
Source: www.top500.org
Approaching Exascale
Some believe…
• Virtually none of today's hardware or software technologies can be improved or modified to reach exascale
• A complete revolution is needed
We believe…
• Evolution of today's technologies, plus hardware and software innovation, can get us there
• A systems approach - with co-design - is critical
Moore's Law: Alive and Well
The foundation for all computing… including exascale.
[Process timeline: 2003 (90 nm, invented SiGe strained silicon), 2005 (65 nm, 2nd-gen SiGe strained silicon), 2007 (45 nm, invented gate-last high-k metal gate), 2009 (32 nm, 2nd-gen gate-last high-k metal gate), 2011 (22 nm, first to implement tri-gate)]
22 nm: a revolutionary leap in process technology
• 37% performance gain at low voltage*
• >50% active power reduction at constant performance*
Source: Intel. *Compared to Intel 32nm Technology
Intel® Xeon Phi™ Coprocessor [Knights Corner]: Power Efficiency
Performance per watt of a prototype Knights Corner cluster compared to the two top graphics-accelerated clusters (MFLOPS/Watt; higher is better). Source: www.green500.org
• Intel Corp Knights Corner (Top500 #150, June 2012, 72.9 kW): 1381
• Nagasaki Univ. ATI Radeon (Top500 #456, June 2012, 47 kW): 1380
• Barcelona Supercomputing Center Nvidia Tesla 2090 (Top500 #177, June 2012, 81.5 kW): 1266
Copyright © 2012 Intel Corporation. All rights reserved. Visual and Parallel Computing Group
Myth: explicitly managed locality is in and caches are out.
Reality: caches remain the path to high performance and efficiency.
[Chart: relative bandwidth and relative bandwidth/Watt for memory BW, L2 cache BW, and L1 cache BW]
#1 Green500 Cluster - WORLD RECORD
"Beacon" at NICS: an Intel® Xeon® processor + Intel® Xeon Phi™ coprocessor cluster, the most power efficient on the list at 2.449 GigaFLOPS/Watt with 70.1% efficiency
Other brands and names are the property of their respective owners. Source: www.green500.org as of Nov 2012
Reaching Exascale Power Goals Requires Architectural & Systems Focus
• Memory (2x-5x)
  - New memory interfaces (optimized memory control and transfer)
  - Extend DRAM with non-volatile memory
• Processor (10x-20x)
  - Reducing data movement (functional reorganization, >20x)
  - Domain/core power gating and aggressive voltage scaling
• Interconnect (2x-5x)
  - More interconnect on package
  - Replace long-haul copper with integrated optics
• Data center energy efficiencies (10%-20%)
  - Higher operating temperature tolerance
  - 480V to the rack and free air/water cooling efficiencies
Reliability Is the Primary Force Driving Next-Generation Designs
Reliability of these machines requires a systems approach:
• Transparent process migration
• Holistic fault detection and recovery
• Reliable end-to-end communications
• Integrated memory in the storage layer for fast checkpoint and workflow
• N+1-scale reliable architectures consistent with stacked memory constraints
• System-wide power management and dynamic optimization
• Must design for system-level debug capability
Source: Exascale Computing Study: Technology Challenges in Achieving Exascale Systems (2008)
[Charts: top system concurrency grew from roughly 1E+02 in 1993 to beyond 1E+06 by 2009 (extreme parallelism); at 0.1 failures per socket per year, DRAM chip count and socket count (log-scale counts from 1E+04 to 1E+08) grow from 2004 to 2016 while MTTI (hours) falls until it is measured in minutes; the crossover point comes when the time to save a global checkpoint exceeds MTTI]
Foundation of Performance: Computing
Architecture for Discovery
Intel® Xeon® processor:
• Ground-breaking real-world application performance
• Industry-leading energy efficiency
• Meets a broad set of HPC challenges
Intel® Xeon Phi™ product family:
• Based on Intel® Many Integrated Core (MIC) architecture
• Leading performance for highly parallel workloads
• Common Intel Xeon programming model
• Productive solution for highly-parallel computing
1 Over previous generation Intel® processors. Intel internal estimate. For more legal information on performance forecasts go to http://www.intel.com/performance
Intel® Xeon® E5-2600 processors
• Up to 73% performance boost vs. prior generation1 on an HPC suite of applications
• Over 2X improvement on key industry benchmarks
• Significantly reduced compute time on large, complex data sets with Intel® Advanced Vector Extensions
• Integrated I/O cuts latency while adding capacity & bandwidth
• Up to 8 cores, up to 20 MB cache
• Up to 4 channels of DDR3 1600 memory
• Integrated PCI Express*
Introducing Intel® Xeon Phi™ Coprocessors Highly-parallel Processing for Unparalleled Discovery
Groundbreaking differences:
• Up to 61 IA cores / 1.1 GHz / 244 threads
• Up to 8 GB memory with up to 352 GB/s bandwidth
• 512-bit SIMD instructions
• Linux operating system, IP addressable
• Standard programming languages and tools
Leading to groundbreaking results:
• Up to 1 TeraFlop/s double-precision peak performance1
• Up to 2.2x higher memory bandwidth than an Intel® Xeon® processor E5 family-based server2
• Up to 4x more performance per watt than an Intel® Xeon® processor E5 family-based server3
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance Notes 1, 2 & 3, see backup for system configuration details.
BIG GAINS FOR SELECT APPLICATIONS
[Chart: theoretical performance as a function of fraction parallel (0%-100%) and % vector - parallelizing raises performance with the parallel fraction, vectorizing multiplies it further, and scaling to many-core pays off only at high parallel fractions]
* Theoretical acceleration using a highly-parallel Intel® Xeon Phi™ coprocessor versus a standard multi-core Intel® Xeon® processor
Performance Potential of Intel® Xeon Phi™ Coprocessors
Copyright © 2013 Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners
Synthetic Benchmark Summary (Intel® MKL) - higher is better

Benchmark             2S Intel® Xeon® processor   1 Intel® Xeon Phi™ coprocessor   Speedup
SGEMM (GF/s)          640                         1,860 (86% efficient)            Up to 2.9X
DGEMM (GF/s)          309                         883 (82% efficient)              Up to 2.8X
HPLinpack (GF/s)      303                         803 (75% efficient)              Up to 2.6X
STREAM Triad (GB/s)   79                          175 (ECC on) / 181 (ECC off)     Up to 2.2X

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Source: Intel Measured results as of October 26, 2012. Configuration Details: Please reference slide speaker notes. For more information go to http://www.intel.com/performance
Notes: 1. Intel® Xeon® Processor E5-2670 used for all: SGEMM matrix = 13824 x 13824, DGEMM matrix 7936 x 7936, SMP Linpack matrix 30720 x 30720. 2. Intel® Xeon Phi™ coprocessor SE10P (ECC on) with "Gold Release Candidate" SW stack: SGEMM matrix = 15360 x 15360, DGEMM matrix 7680 x 7680, SMP Linpack matrix 26872 x 28672.
PARALLELIZING FOR HIGH PERFORMANCE - Example: SAXPY
STARTING POINT: typical serial code running on multi-core Intel® Xeon® processors - 67.097 seconds (current performance)
STEP 1. OPTIMIZE CODE: parallelize and vectorize the code and continue to run on multi-core Intel Xeon processors - 0.46 seconds (145X faster)
STEP 2. USE COPROCESSORS: run all or part of the optimized code on Intel® Xeon Phi™ coprocessors - 0.197 seconds (a further 2.3X faster, and 340X faster than the starting point)
Performance Proof-Point: Government and Academic Research
WEATHER RESEARCH AND FORECASTING (WRF)
[Chart: speedup, higher is better - 1.0 for a 2S Intel® Xeon® processor E5-2670 four-node cluster configuration; 1.45 for the same cluster with Intel® Xeon Phi™ coprocessors (pre-production HW/SW)]
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Source: Intel or third party measured results as of December, 2012. Configuration Details: Please see backup slides. For more information go to http://www.intel.com/performance Any difference in system hardware or software design or configuration may affect actual performance.
• Application: Weather Research and Forecasting (WRF)
• Status: WRF v3.5 coming soon
• Code optimization: approximately two dozen files with less than 2,000 lines of code were modified (out of approximately 700,000 lines of code in about 800 files, all Fortran standard compliant); most modifications improved performance for both the host and the coprocessors
• Performance measurements: V3.5Pre and the NCAR-supported CONUS2.5KM benchmark (a high-resolution weather forecast)
• Acknowledgments: there were many contributors to these results, including the National Renewable Energy Laboratory and The Weather Channel Companies
PROVEN PERFORMANCE BENEFITS: Intel® Xeon Phi™ Coprocessor
• Up to 2.23X - Acceleware 8th-order isotropic variable velocity1 (seismic)
• Up to 2X - Sandia National Labs MiniFE2 (finite element analysis)
• Up to 2.05X - China Oil & Gas Geoeast pre-stack time migration3
Notes: 1. 2S Xeon* vs. 1 Xeon Phi* (preproduction HW/SW; application running 100% on coprocessor unless otherwise noted). 2. 8-node cluster, each node with 2S Xeon* (comparison is cluster performance with and without 1 Xeon Phi* per node) (hetero). 3. 2S Xeon* vs. 2S Xeon* + 2 Xeon Phi* (offload).
PROVEN PERFORMANCE BENEFITS: Intel® Xeon Phi™ Coprocessor
• Up to 10.75X - Monte Carlo SP2 (finance)
• Up to 7X - Black-Scholes SP2
• Up to 2.7X - Jefferson Lab Lattice QCD (physics)
• 1.8X speed-up - Intel Labs Embree ray tracing3
Notes: 1. 2S Xeon* vs. 1 Xeon Phi* (preproduction HW/SW; application running 100% on coprocessor unless otherwise noted). 2. Includes additional FLOPS from the transcendental function unit. 3. Intel measured, Oct. 2012.
Achieving Productive Parallelism with Intel® Xeon Phi™ Coprocessors
• Industry-leading performance from advanced compilers
• Comprehensive libraries
• Parallel programming models
• Insightful analysis tools
More Cores. Wider Vectors. Performance Delivered.
Intel® Parallel Studio XE 2013 and Intel® Cluster Studio XE 2013
[Diagram: from serial performance to scaling performance efficiently - multicore (128-bit, then 256-bit vectors) to many-core (512-bit vectors, 50+ cores); task & data parallel performance; distributed performance]
Parallel Performance Potential
• If your performance needs are met by an Intel Xeon® processor, they will be achieved with fewer threads than on a coprocessor
• On a coprocessor:
  - You need more threads to achieve the same performance
  - The same thread count can yield less performance
Intel® Xeon Phi™ excels on highly parallel applications
Maximizing Parallel Performance
• Two fundamental considerations:
  - Scaling: does the application already scale to the limits of Intel® Xeon® processors?
  - Vectorization and memory usage: does the application make good use of vectors, or is it memory bandwidth bound?
• If both are true for an application, then the highly parallel and power-efficient Intel Xeon Phi coprocessor is most likely to be worth evaluating.
Intel® Family of Parallel Programming Models
• Intel® Cilk™ Plus: C/C++ language extensions to simplify parallelism; open sourced and also an Intel product
• Intel® Threading Building Blocks: widely used C++ template library for parallelism; open sourced and also an Intel product
• Domain-specific libraries: Intel® Math Kernel Library
• Established standards: Message Passing Interface (MPI), OpenMP*, Coarray Fortran, OpenCL*
• Research and development: Intel® Concurrent Collections, offload extensions, Intel® SPMD Parallel Compiler
A choice of high-performance parallel programming models, applicable to both multi-core and many-core programming
Single-Source Approach to Multi- and Many-Core
[Diagram: one source, built with common compilers, libraries, and parallel models, targets multicore CPUs, Intel® MIC architecture coprocessors, multicore clusters, and clusters with both multicore and many-core nodes]
"Unparalleled productivity… most of this software does not run on a GPU" - Robert Harrison, NICS, ORNL
R. Harrison, "Opportunities and Challenges Posed by Exascale Computing - ORNL's Plans and Perspectives", National Institute of Computational Sciences, Nov 2011
Intel® Parallel Studio XE
• Intel® C/C++ and Fortran Compilers w/OpenMP
• Intel® MKL, Intel® Cilk Plus, Intel® TBB, and Intel® IPP
• Intel® Inspector XE, Intel® VTune™ Amplifier XE, Intel® Advisor
Intel® Trace Analyzer and Collector
Intel® MPI Library
Intel® Xeon Phi™ Coprocessor - Game Changer for HPC
Build your applications on a known compute platform… and watch them take off sooner.
With restrictive special-purpose hardware: complex code porting and new learning.
With Intel® Xeon Phi™ coprocessors: familiar tools & runtimes.
"We ported millions of lines of code in only days and completed accurate runs. Unparalleled productivity… most of this software does not run on a GPU and never will." - Robert Harrison, National Institute for Computational Sciences, Oak Ridge National Laboratory7
7 Harrison, Robert. "Opportunities and Challenges Posed by Exascale Computing - ORNL's Plans and Perspectives." National Institute of Computational Sciences (NICS), 2011.
Achieving Parallelism in Applications
IA benefit: a wide range of development options, spanning ease of use to fine control.
Parallelization options (ease of use first):
• Intel® Math Kernel Library
• MPI*
• Intel® Threading Building Blocks
• Intel® Cilk™ Plus
• OpenMP*
• Pthreads*
Vector options (ease of use first):
• Intel® Math Kernel Library
• Array notation: Intel® Cilk™ Plus
• Auto vectorization
• Semi-auto vectorization: #pragma (vector, ivdep, simd)
• OpenCL*
• C/C++ vector classes (F32vec16, F64vec8)
• Intrinsics
Spectrum of Programming Models and Mindsets
A range of models to meet application needs, from multi-core centric to many-core centric:
• Multi-core hosted: general-purpose serial and parallel computing (runs entirely on the CPU)
• Offload: codes with highly-parallel phases (main program on the CPU, parallel phases offloaded to the coprocessor)
• Symmetric: codes with balanced needs (runs on both CPU and coprocessor)
• Many-core hosted: highly-parallel codes (runs entirely on the coprocessor)
Operating Environment View
A flexible, familiar, compatible operating environment: the Intel® Xeon Phi™ coprocessor runs Linux and is IP addressable, attached to the Intel® Xeon® processor host over PCIe.
• Host: Linux, standard ABI and file I/O, platform runtimes
• Coprocessor: Linux Standard Base ("LSB") environment with IP, SSH, and NFS
• Communication: SCIF, sockets / OFED
Programming View
Same parallel models for processor and coprocessor:
• Available on both the Intel® Xeon® processor host and the Intel® Xeon Phi™ coprocessor: C++/Fortran, OpenCL, Pthreads, OpenMP, Cilk Plus, TBB, and MKL (including automatic-offload MKL)
• Language Extensions for Offload move work from host to coprocessor over SCIF / OFED / IP across PCIe
• MKL: intra-node parallel, for node performance and offload
• MPI: intra- and inter-node parallel
Programming Intel® Xeon Phi™ Based Systems (MPI + Offload)
• MPI ranks on Intel® Xeon® processors (only)
• All messages into/out of the Xeon processors
• Offload models used to accelerate MPI ranks
• TBB, OpenMP*, Cilk Plus, Pthreads within the coprocessor
• Homogeneous network of hybrid nodes: each node's CPU runs MPI over the network and offloads data to its MIC coprocessor
Offload Code Examples

• C/C++ offload pragma:
    #pragma offload target (mic)
    #pragma omp parallel for reduction(+:pi)
    for (i=0; i<count; i++) {
      float t = (float)((i+0.5)/count);
      pi += 4.0/(1.0+t*t);
    }
    pi /= count;

• Function offload example:
    #pragma offload target(mic) in(transa, transb, N, alpha, beta) \
      in(A:length(matrix_elements)) \
      in(B:length(matrix_elements)) \
      inout(C:length(matrix_elements))
    sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N, &beta, C, &N);

• Fortran offload directive:
    !dir$ omp offload target(mic)
    !$omp parallel do
    do i=1,10
      A(i) = B(i) * C(i)
    enddo

• C/C++ language extension:
    class _Shared common {
      int data1;
      char *data2;
      class common *next;
      void process();
    };
    _Shared class common obj1, obj2;
    _Cilk_spawn _Offload obj1.process();
    _Cilk_spawn obj2.process();
Programming Intel® Xeon Phi™ Based Systems (MIC Native)
• MPI ranks on Intel MIC coprocessors (only)
• All messages into/out of the coprocessor
• TBB, OpenMP*, Cilk Plus, Pthreads used directly within MPI processes
• Programmed as a homogeneous network of many-core CPUs
Programming Intel® MIC-Based Systems (Symmetric)
• MPI ranks on both the coprocessors and the Intel® Xeon® processors
• Messages to/from any core
• TBB, OpenMP*, Cilk Plus, Pthreads used directly within MPI processes
• Programmed as a heterogeneous network of homogeneous nodes
A GROWING ECOSYSTEM: Developing today on Intel® Xeon Phi™ coprocessors
Software, Drivers, Tools & Online Resources
Tools & Software Downloads
Getting Started Development Guides
Video Workshops, Tutorials, & Events
Code Samples & Case Studies
Articles, Forums, & Blogs
Associated Product Links
http://software.intel.com/mic-developer
Keys to Productive Parallel Performance
• Determine the best platform target for your application: Intel® Xeon® processors, Intel® Xeon Phi™ coprocessors, or both
• Choose the right Xeon-centric or MIC-centric model for your application
• Vectorize your application
• Parallelize your application: with MPI (or another multi-process model) and with threads (via Pthreads, TBB, Cilk Plus, OpenMP, etc.)
• Go asynchronous: overlap computation and communication
• Maintain unified source code for CPU and coprocessors
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Copyright © 2012 , Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Xeon Phi, Xeon Phi logo, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others.
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
intel.com/software/products
Legal Disclaimer & Optimization Notice
Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families: Go to: Learn About Intel® Processor Numbers
Legal Disclaimers • All products, computer systems, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.
• Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. Go to: http://www.intel.com/products/processor_number
• Intel, processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.
• Intel® Virtualization Technology requires a computer system with an enabled Intel® processor, BIOS, virtual machine monitor (VMM). Functionality, performance or other benefits will vary depending on hardware and software configurations. Software applications may not be compatible with all operating systems. Consult your PC manufacturer. For more information, visit http://www.intel.com/go/virtualization
• No computer system can provide absolute security under all conditions. Intel® Trusted Execution Technology (Intel® TXT) requires a computer system with Intel® Virtualization Technology, an Intel TXT-enabled processor, chipset, BIOS, Authenticated Code Modules and an Intel TXT-compatible measured launched environment (MLE). Intel TXT also requires the system to contain a TPM v1.s. For more information, visit http://www.intel.com/technology/security
• Requires a system with Intel® Turbo Boost Technology capability. Consult your PC manufacturer. Performance varies depending on hardware, software and system configuration. For more information, visit http://www.intel.com/technology/turboboost
• Intel® AES-NI requires a computer system with an AES-NI enabled processor, as well as non-Intel software to execute the instructions in the correct sequence. AES-NI is available on select Intel® processors. For availability, consult your reseller or system manufacturer. For more information, see http://software.intel.com/en-us/articles/intel-advanced-encryption-standard-instructions-aes-ni/
• Intel product is manufactured on a lead-free process. Lead is below 1000 PPM per EU RoHS directive (2002/95/EC, Annex A). No exemptions required
• Halogen-free: Applies only to halogenated flame retardants and PVC in components. Halogens are below 900ppm bromine and 900ppm chlorine.
• Intel, Intel Xeon, Intel Core microarchitecture, the Intel Xeon logo and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
• Copyright © 2011, Intel Corporation. All rights reserved.
50
Copyright © 2013 Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners
Legal Disclaimers: Performance
• Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, go to: http://www.intel.com/performance/resources/benchmark_limitations.htm.
• Intel does not control or audit the design or implementation of third party benchmarks or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmarks are reported and confirm whether the referenced benchmarks are accurate and reflect performance of systems available for purchase.
• Relative performance is calculated by assigning a baseline value of 1.0 to one benchmark result, and then dividing the actual benchmark result for the baseline platform into each of the specific benchmark results of each of the other platforms, and assigning them a relative performance number that correlates with the performance improvements reported.
• SPEC, SPECint, SPECfp, SPECrate, SPECpower, SPECjAppServer, SPECjEnterprise, SPECjbb, SPECompM, SPECompL, and SPEC MPI are trademarks of the Standard Performance Evaluation Corporation. See http://www.spec.org for more information.
• TPC Benchmark is a trademark of the Transaction Processing Council. See http://www.tpc.org for more information.
• SAP and SAP NetWeaver are the registered trademarks of SAP AG in Germany and in several other countries. See http://www.sap.com/benchmark for more information.
• INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
• Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference www.intel.com/software/products.
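The relative-performance normalization described in the disclaimers above (baseline assigned 1.0, every other result divided by the baseline result) can be sketched as follows; the platform names and scores are hypothetical:

```python
# Hypothetical benchmark scores (higher is better). The baseline platform
# is assigned 1.0 by definition; each other platform's score is divided
# by the baseline's score to give its relative performance.
scores = {"baseline": 100.0, "platform_a": 130.0, "platform_b": 90.0}

relative = {name: score / scores["baseline"] for name, score in scores.items()}
# baseline -> 1.0, platform_a -> 1.3, platform_b -> 0.9
```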
This slide MUST be used with any slides with performance data removed from this presentation
51
WRF Configuration (Backup)
• Measured by Intel 3/27/2013 (Endeavor Cluster).
• Runs in symmetric mode.
• Citation: WRF graphic and detail: John Michalakes, 11/12/12. Hardware: 2-socket server with Intel® Xeon® processor E5-2670 (8C, 2.6GHz, 115W), each node equipped with one Intel® Xeon Phi™ coprocessor (SE10X B1, 61 cores, 1.1GHz, 8GB @ 5.5 GT/s), in an 8-node FDR cluster.
• WRF is available from the US National Center for Atmospheric Research in Boulder, Colorado: http://www.wrf-model.org/
• All KNC optimizations are in the V3.5 svn today.
• Results obtained under MPSS 3552, compiler rev 146, MPI rev 30 on SE10X B1 KNC (61 cores, 1.1GHz, 5.5 GT/s).
• WRF CONUS2.5km workload available from www.mmm.ucar.edu/wrf/WG2/bench/
• Performance comparison is based on the average timestep; we ignore initialization and post-simulation file operations.
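The timing methodology in the last bullet (average over steady-state timesteps, excluding initialization and post-simulation file I/O) can be sketched as below; the function and the sample timings are illustrative, not taken from the actual WRF runs:

```python
def average_timestep(step_times, warmup_steps=1):
    """Mean wall-clock seconds per simulation timestep.

    Initialization cost is excluded by skipping the first `warmup_steps`
    entries; post-simulation file operations are never timed as steps,
    so they do not appear in step_times at all.
    """
    steady = step_times[warmup_steps:]
    return sum(steady) / len(steady)

# Illustrative per-step timings (seconds): the first step carries
# startup overhead and is dropped from the average.
times = [5.2, 1.10, 1.00, 1.05, 0.98]
print(average_timestep(times))  # prints 1.0325
```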