AMD’S X86 OPEN64 COMPILER
Michael Lai, AMD
3 | AMD’s x86 Open64 Compiler | June 2011
CONTENTS
Brief History
AMD and Open64
Compiler Overview
Major Components of Compiler
Important Optimizations
Recent Releases
Performance
Applications and Libraries
Heterogeneous Computing
More Information
BRIEF HISTORY
Started as the SGI® MIPSpro/Pro64 compiler in the 1990s
Open sourced in 2000 as Pro64 Compiler; later renamed to Open64 Compiler
Has been re-targeted to many architectures (MIPS, IA-64, x86-64, ARM, …)
Popular among industry and academia; used for both production and research
Open64 Steering Group (with members from industry and universities)
Major contributors include: AMD, Intel, HP, PathScale, Tsinghua University, Chinese Academy of Sciences, University of Houston, University of Delaware, SimpLight, …
AMD AND OPEN64
AMD’s x86 Open64 Compiler:
– Pull down from www.open64.net (leverage the open source community)
– Work on bug fixes, new development and infrastructure, advanced optimizations
– Keep in sync with www.open64.net
– Check changes back into www.open64.net (contribute to the open source community)
http://developer.amd.com:
– First AMD release was version 4.2.2 in April 2009
– Most recent AMD release was version 4.2.5 in April 2011
Active participant in the open source community:
– Member of the Open64 Steering Group (OSG)
– Many AMD global and local gatekeepers (design and code discussions and reviews)
– Release management and testing
– Presentations at workshops, tutorials, and forums
COMPILER OVERVIEW
Language standards
– ANSI C99, ISO C++98
Conforms to ISO/IEC 9899:1999, Programming Languages – C
Conforms to ISO/IEC 14882:1998(E), Programming Languages – C++
– Compatible with gcc
– Fortran 77, 90, 95
Conforms to ISO/IEC 1539-1:1997, Programming Languages – Fortran
– Inter-language calling support
– IEEE 754 floating-point support
– OpenMP 2.5 for shared-memory systems
Platform highlights
– x86 32-bit and x86 64-bit code generation
– Large file support on 32-bit systems
– Vector and scalar SSE* code generation
– AVX, XOP, FMA4 code generation
– Optimized C/C++ and math libraries
– Optimized AMD Core Math Library (ACML)
– MPICH2 for distributed and shared-memory systems
COMPILER OVERVIEW
Global optimizations, e.g.
– Partial redundancy elimination
– Constant propagation and code motion
– Strength reduction and expression simplification
– Dead code elimination and common subexpression elimination
Loop-nest optimizations, e.g.
– Loop fusion and distribution
– Loop interchange and cache locality optimization
– Vectorization for SSE*/AVX code generation
– Software prefetching
Code generation and optimizations, e.g.
– Advanced register allocation
– Loop unrolling, peephole optimizations
– Instruction selection and scheduling
Feedback-directed optimizations, e.g.
– Code layout
– Function inlining and de-virtualization
– Register allocation
– Value specialization
Interprocedural analyses and optimizations, e.g.
– Function inlining and cloning
– Alias analysis
– Data re-layout optimizations for structures and arrays
– Constant propagation and dead code elimination
Multi-core scalability optimizations
OpenMP support and automatic parallelization
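To make two of the global optimizations listed above concrete, here is a hand-written before/after pair in C. It is only a sketch of the source-level effect of common subexpression elimination, loop-invariant code motion and strength reduction; Open64 performs these on its WHIRL intermediate representation, not on C source, and the function names `sum_before` and `sum_after` are hypothetical.

```c
#include <stddef.h>

/* Before: (a + b) is evaluated twice per iteration, and the subscript
   i * stride costs a multiplication on every iteration. */
long sum_before(const long *x, size_t n, size_t stride, long a, long b)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += x[i * stride] * (a + b) + (a + b);
    return s;
}

/* After: common subexpression elimination computes (a + b) once, code
   motion hoists it out of the loop because it is loop-invariant, and
   strength reduction replaces i * stride with an index that is advanced
   by one addition per iteration. */
long sum_after(const long *x, size_t n, size_t stride, long a, long b)
{
    const long t = a + b;   /* shared, loop-invariant value */
    size_t idx = 0;         /* tracks i * stride without multiplying */
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        s += x[idx] * t + t;
        idx += stride;
    }
    return s;
}
```

Both functions return the same value for any input; the second simply does less work per iteration, which is the whole point of these transformations.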
MAJOR COMPONENTS OF COMPILER
Frontend
– Generates a WHIRL file from each input source file
Backend
– Generates an object file from each WHIRL file
Linker
– Generates an executable file from the object files
IPA
– Pass 1: ipl
– Pass 2: ipa_link
[Figure: compilation flow without IPA. Each source file → frontend → WHIRL file → backend → .o file; the linker combines the .o files into a.out.]
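[Figure: compilation flow with IPA. Each source file → frontend → WHIRL file → ipl → .o file; ipa_link processes all of these .o files together and produces a new set of WHIRL files (two in the figure, from three sources), each compiled by the backend into a .o file; the linker then produces a.out.]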
IMPORTANT OPTIMIZATIONS
Backend
– LNO (loop nest optimization)
Traditional loop transformations such as loop blocking, interchange, fusion, and distribution
Software prefetching
Vectorization
– WOPT (global optimization)
Build control flow graphs
Data flow analysis
Traditional global scalar optimizations such as constant folding, partial redundancy elimination, etc.
– CG (code generation)
Instruction selection and scheduling
Machine-dependent optimizations such as addressing mode optimization and peephole optimization
Emit instructions for the target machine
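Loop interchange, one of the LNO transformations above, can be illustrated with a small C sketch. The matrix size `N` and the function names are hypothetical; the example only shows the effect on the access pattern, not how LNO decides when interchange is profitable.

```c
#include <stddef.h>

#define N 64

/* Before: the inner loop varies i, so successive accesses to in[i][j] and
   out[i][j] are N * sizeof(double) bytes apart in C's row-major layout. */
void scale_before(double out[N][N], double in[N][N], double k)
{
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            out[i][j] = k * in[i][j];
}

/* After interchange: the inner loop varies j, so accesses are unit-stride,
   each cache line is fully used, and the inner loop is easy to vectorize.
   The swap is legal because no iteration depends on another one. */
void scale_after(double out[N][N], double in[N][N], double k)
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            out[i][j] = k * in[i][j];
}
```

An element-wise operation was chosen deliberately: interchanging a floating-point reduction would reorder additions and could change rounding, which a compiler may only do under relaxed floating-point rules.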
IMPORTANT OPTIMIZATIONS
IPA (interprocedural analysis)
– Pass 1: ipl
Local analysis
– Pass 2: ipa_link
Whole program analysis
Data layout optimizations
Function inlining, cloning
Constant propagation
Dead function elimination
Profile feedback directed optimization
– -fb-create: build an instrumented executable that records a profile when run
– -fb-opt: recompile using the recorded profile
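Interprocedural constant propagation and cloning can be pictured as follows: if whole-program analysis sees a call site that passes a constant, it can create a specialized copy of the callee with that constant folded in and redirect the call. This C sketch shows the source-level equivalent under that assumption; `ipow`, `ipow_clone_deg2` and the callers are hypothetical names, and Open64's real cloning decisions are made by its own heuristics on WHIRL.

```c
/* General routine: computes x raised to a non-negative integer power. */
static long ipow(long x, int degree)
{
    long r = 1;
    for (int k = 0; k < degree; k++)
        r *= x;
    return r;
}

/* Hypothetical clone of ipow specialized for degree == 2: once the
   constant is propagated into the body, the loop is fully unrolled and
   folds down to a single multiplication. */
static long ipow_clone_deg2(long x)
{
    return x * x;
}

/* Caller before IPA: passes the constant 2 to the general routine. */
long caller_before(long v) { return ipow(v, 2) + 1; }

/* Caller after IPA: redirected to the specialized clone. */
long caller_after(long v) { return ipow_clone_deg2(v) + 1; }
```

If no call site still needs the general `ipow`, dead function elimination can then remove it.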
RECENT RELEASES
Release 4.2.2 (April 2009)
– Support for 2 MB huge pages
– Improved loop fusion (proactive loop fusion) and loop unrolling
– Improved head/tail duplication, if-merging, scalar replacement and constant folding optimizations
– Improved interprocedural alias analysis
– Improved partial inlining and inlining of virtual functions
– More advanced re-layout optimization for structure members
– Improved instruction selection and instruction scheduling
– Improved tuning of library functions
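Basic loop fusion, combined with scalar replacement (both mentioned in this release), is sketched below in C. It illustrates plain fusion only; the slide does not describe how the "proactive" variant chooses its candidates. The `restrict` qualifiers state the assumption that the arrays do not overlap, which is what makes the fusion legal; the function names are hypothetical.

```c
#include <stddef.h>

/* Before: two separate loops over the same range; a[] is read twice and
   b[] is written by the first loop and re-read by the second. */
void fuse_before(double *restrict b, double *restrict c,
                 const double *restrict a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        b[i] = a[i] * 2.0;
    for (size_t i = 0; i < n; i++)
        c[i] = b[i] + a[i];
}

/* After fusion: iteration i of the second loop only needs b[i] from
   iteration i of the first, so the bodies can be merged. Scalar replacement
   then keeps a[i] and b[i] in locals instead of reloading them from memory. */
void fuse_after(double *restrict b, double *restrict c,
                const double *restrict a, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        const double ai = a[i];
        const double bi = ai * 2.0;
        b[i] = bi;
        c[i] = bi + ai;
    }
}
```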
RECENT RELEASES
Release 4.2.3 (December 2009)
– Improved interprocedural analysis to include structure array copy optimization and array remapping optimization
– Improved loop optimizations: loop unrolling, loop unroll and jam, triangular loops, proactive loop interchange, loop distribution, loop peeling
– Improved redundancy elimination optimizations for stores and memory initialization; better integration of reassociation and common subexpression elimination; enhanced expression factorization
– Improved instruction selection and addressing code generation
– Improved vectorization
– Extended prefetching to include arrays with inductive base addresses
– Enhanced loop multi-versioning
– Improved OpenMP and auto-parallelization code generation
– Improved tuning of OpenMP and parallel runtime library functions
– Introduced advanced optimizations to improve scalability/bandwidth utilization of multi-core processors (-mso)
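Loop multi-versioning means emitting two versions of a loop plus a run-time test that picks one. A common use is aliasing: when the compiler cannot prove that two pointers never overlap, it can check at run time and use the fast (for example, vectorized) version only when they do not. The C sketch below writes that check out by hand; the function names are hypothetical, and in practice the compiler generates the check and both versions itself.

```c
#include <stddef.h>
#include <stdint.h>

/* Fast version: restrict promises no overlap, so it may be vectorized. */
static void add_noalias(float *restrict dst, const float *restrict src,
                        size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] += src[i];
}

/* Conservative version: element-by-element, correct for any overlap. */
static void add_scalar(float *dst, const float *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] += src[i];
}

/* Versioned loop: a run-time overlap test selects the version. */
void add_versioned(float *dst, const float *src, size_t n)
{
    const uintptr_t d = (uintptr_t)dst, s = (uintptr_t)src;
    const size_t bytes = n * sizeof(float);
    if (d + bytes <= s || s + bytes <= d)   /* ranges are disjoint */
        add_noalias(dst, src, n);
    else
        add_scalar(dst, src, n);
}
```

With overlapping arguments such as `add_versioned(buf + 1, buf, 5)`, the scalar path runs and produces a running prefix sum, which a vectorized loop would not reproduce.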
RECENT RELEASES
Release 4.2.4 (June 2010)
– Improved function inlining heuristics and enhanced inline expansion of library functions
– Enhanced framework for multi-versioning
– Improved inductive expression simplification and if-merging optimization
– Improved code generation for the % operator
– Improved interprocedural analysis for indirect function calls, virtual functions, and functions with the "noreturn" attribute
– Optimized exception handling
– Optimized processing of Fortran 90 temporary arrays
– Improved processor affinity mapping in the OpenMP and parallel runtime library
– Added support for 1 GB huge pages
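The slide does not say which `%` patterns were improved in 4.2.4. As one classic example of the kind of rewrite involved, a remainder by a constant power of two can replace the division with a bit mask. For unsigned operands the mask alone is enough; for signed operands C99 requires the result to take the sign of the dividend, so a small correction is needed. The function names below are hypothetical.

```c
/* Unsigned x % 8: the remainder is just the low three bits. */
unsigned umod8(unsigned x)
{
    return x & 7u;
}

/* Signed x % 8 with C99 truncating semantics. Converting to unsigned is
   well defined (modulo 2^N), so the low three bits are exact; for a
   negative x with a nonzero remainder, subtract 8 to restore the sign. */
int smod8(int x)
{
    int r = (int)((unsigned)x & 7u);
    if (x < 0 && r != 0)
        r -= 8;
    return r;
}
```

For example, `smod8(-9)` takes the low bits 7 and corrects them to -1, matching `-9 % 8` in C.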
RECENT RELEASES
Release 4.2.5 (March 2011)
– Optimized code generation for the new AMD Opteron Family 15h processors ("Bulldozer" core), including the SSE*, AVX, XOP, and FMA4 instruction groups (-march=bdver1)
– Support for iso_c_binding, a Fortran 2003 feature
– Enhanced framework to support better vectorization
– Improved vectorization for outer loops and loops containing conditionals
– Enhanced framework to support better alias analysis
– Modified -O3 to enable more powerful floating-point optimizations by default
– Improved compatibility with newer versions of gcc for function prototype definitions under OpenMP
– Compiler build infrastructure enhanced to follow the usual Linux application build flow: configure, make, and make install
– Incremental improvements to many generic optimizations such as loop fusion, dead code elimination, if merging, if conversion, function inlining, register pressure tuning, structure splitting, etc.
– Incremental improvements for C++ applications such as function de-virtualization, exception handling, etc.
– General correctness improvements, including bug fixes for problems in Fortran intrinsics, the Fortran frontend, Fortran I/O, x86 alignment, and OpenMP
– General improvements to reduce the compilation times of large C++/Fortran applications
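If conversion is one of the techniques that lets a vectorizer handle loops containing conditionals: the branch is replaced by a mask or select, giving a straight-line loop body that maps onto SIMD compare and bitwise instructions. The C sketch below shows the transformation by hand for a clamp-to-zero loop; the names are hypothetical, and a compiler would perform this internally rather than in source.

```c
#include <stddef.h>

/* Before: the store is control-dependent on a branch, which blocks
   straightforward SIMD code generation. */
void clamp_branchy(int *v, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (v[i] < 0)
            v[i] = 0;
}

/* After if conversion: the branch becomes a mask and every element is
   stored unconditionally. This is only legal because v points to writable
   memory that no other thread touches, and the stored value equals the old
   one whenever the condition is false. */
void clamp_ifconv(int *v, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        const int x = v[i];
        /* all ones when x >= 0, zero when x < 0 */
        const unsigned keep = 0u - (unsigned)(x >= 0);
        v[i] = (int)((unsigned)x & keep);
    }
}
```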
PERFORMANCE
Used in benchmark submissions by, for example:
– HP®
– Dell™
– IBM®
– Sun® (Oracle®)
– SGI®
Performance on AMD platforms:
– Best-performing compiler
Both integer and floating-point benchmark suites
Performance on Intel platforms:
– Among the best-performing compilers
Both integer and floating-point benchmark suites
APPLICATIONS AND LIBRARIES
Libraries and utilities, for example:
– ACML (Fortran)
– BLAST (C/C++)
– Charm++ (C++)
– CLHEP (C++)
– FFTW (C)
– Goto BLAS (Fortran)
– MPICH/MPICH2 (Fortran, C/C++)
– NetCDF (Fortran, C/C++)
– LAM/MPI (Fortran, C/C++)
– OpenMPI (Fortran, C/C++)
– GSL (C/C++)
APPLICATIONS AND LIBRARIES
Large applications, for example:
– GEANT4 (C/C++)
– GROMACS (Fortran, C/C++)
– NAMD (C/C++)
– NWChem (Fortran, C/C++)
– POP (Fortran)
– POV-Ray (C++)
– WRF (Fortran)
Benchmarks, for example:
– HPCC (Fortran, C/C++)
– SPEC CPU2006 (Fortran, C/C++)
– SPEC OMP2001 (Fortran, C/C++)
HETEROGENEOUS COMPUTING
Existing optimizations
– Vectorization, register allocation, IPA, …
New types of optimizations, for example:
– Pointer class analysis
– Variance analysis
– Multi-versioning
Framework and infrastructure already present
MORE INFORMATION
http://developer.amd.com
– Downloads
Source code and binaries
– Documentation
Quick reference guide
User’s guide and developer’s guide
White papers and videos
Knowledge base articles
– Support
Online help
Forum
AMD Developer Central Help Request
QUESTIONS
Disclaimer & Attribution
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.
NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners.
© 2011 Advanced Micro Devices, Inc. All rights reserved.