Intel® Compilers For Xeon™ Processor

Post on 11-Jan-2016

www.intel.com/software/products

Agenda

• General
• Xeon™ processor optimizations
• Loop level optimizations
• Multi-pass optimizations
• Other

General Optimizations

• /Od, -O0: Disable optimizations
• /Zi, -g: Create symbols
• /O1, -O1: Optimizes for speed without increasing code size, i.e. disables library function inlining
• /O2, -O2 (default): Optimize for speed
• /O3, -O3: High-level optimizations

Instruction Scheduling

Schedule instructions to be optimal for specific processor instruction latencies and cache sizes.

Processor                                                        Windows        Linux
Pentium® processors and Pentium processors with MMX™ technology  -G5            -tpp5
Pentium Pro, Pentium II and Pentium III processors               -G6 (Default)  -tpp6 (Default)
Pentium 4 processor                                              -G7            -tpp7

Note: default may change in future compilers

Shift/Multiply Latency

• Pentium
  – Shift has ~1x latency of adds
  – Multiply has ~10x latency of adds
• Pentium Pro, II, and III
  – Shift has ~1x latency of adds
  – Multiply has ~3x latency of adds
• Pentium 4 (may change in future releases)
  – Shift has ~8x latency of adds
  – Multiply has ~26x latency of adds

Under the Covers: P4

Compiler accounts for these differences for you!

    for (int i = 0; i < length; i++) {
        p[i] = q[i] * 32;
    }

    .B1.7:                              # -tpp6
        movl  (%ebx,%edx,4), %eax
        shll  $5, %eax
        movl  %eax, (%esi,%edx,4)
        incl  %edx
        cmpl  %ecx, %edx
        jl    .B1.7

    .B1.7:                              # -tpp7
        movl  (%ebx,%edx,4), %eax
        addl  %eax, %eax
        addl  %eax, %eax
        addl  %eax, %eax
        addl  %eax, %eax
        addl  %eax, %eax
        movl  %eax, (%esi,%edx,4)
        addl  $1, %edx
        cmpl  %ecx, %edx
        jl    .B1.7
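The two listings above compute the same result in different ways: the -tpp6 code uses a single shift by 5, while the -tpp7 code doubles the value five times because adds are cheap relative to shifts on the Pentium 4. A minimal C sketch of that equivalence follows. The function names are hypothetical, and unsigned 32-bit elements are assumed so that overflow wraps in a well-defined way; the slide does not state the element type.

```c
#include <stddef.h>
#include <stdint.h>

/* The loop as written in the source: multiply by 32. */
void scale_mul(uint32_t *p, const uint32_t *q, size_t length) {
    for (size_t i = 0; i < length; i++) {
        p[i] = q[i] * 32u;
    }
}

/* Same result with one shift, mirroring the -tpp6 listing (shll $5). */
void scale_shift(uint32_t *p, const uint32_t *q, size_t length) {
    for (size_t i = 0; i < length; i++) {
        p[i] = q[i] << 5;
    }
}

/* Same result with five doublings, mirroring the -tpp7 listing (five addl). */
void scale_adds(uint32_t *p, const uint32_t *q, size_t length) {
    for (size_t i = 0; i < length; i++) {
        uint32_t v = q[i];
        v += v;  /* x2  */
        v += v;  /* x4  */
        v += v;  /* x8  */
        v += v;  /* x16 */
        v += v;  /* x32 */
        p[i] = v;
    }
}
```

All three forms are interchangeable for the programmer; the point of the slide is that the source can stay as `q[i] * 32` and the scheduling switch picks the instruction sequence.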

Under the Covers: Xeon

Which Processor: [a]x?

To require at least...                                                           Use  Windows*  Linux*
Pentium Pro and Pentium II processors with CMOV and FCMOV instructions           i    Qaxi      axi
Pentium processors with MMX instructions                                         M    QaxM      axM
Pentium III processor with Streaming SIMD Extensions (implies i and M above)     K    QaxK      axK
Pentium 4 processor with Streaming SIMD Extensions 2 (implies i, M and K above)  W    QaxW      axW

Automatic Processor Dispatch

• Single executable
  – Pentium 4 target that runs on all x86 processors.
• For Target Processor it uses:
  – Processor Specific Opcodes
  – Prefetch (Pentium III only)
  – Vectorization
• Low Overhead
  – Some increase in code size
• Can mix and match:
  – -xK -axW together makes Xeon/Pentium 4 the target and Pentium III the default
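As a rough mental model of what processor dispatch does, the hand-written sketch below keeps two versions of one routine and picks between them at run time. This is only an analogy, not the compiler's actual mechanism: the names `scale_generic`, `scale_tuned` and `select_scale` are hypothetical, the `has_sse2` flag stands in for a CPU feature check the compiler would generate itself, and the "tuned" version is plain unrolled C rather than real SSE2 code.

```c
#include <stddef.h>

typedef void (*scale_fn)(float *dst, const float *src, size_t n);

/* Default path: runs on any processor. */
static void scale_generic(float *dst, const float *src, size_t n) {
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i] * 2.0f;
    }
}

/* Stand-in for the processor-specific path (unrolled by four here). */
static void scale_tuned(float *dst, const float *src, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        dst[i]     = src[i]     * 2.0f;
        dst[i + 1] = src[i + 1] * 2.0f;
        dst[i + 2] = src[i + 2] * 2.0f;
        dst[i + 3] = src[i + 3] * 2.0f;
    }
    for (; i < n; i++) {  /* remainder elements */
        dst[i] = src[i] * 2.0f;
    }
}

/* One check selects the version; both live in the same executable,
 * which is where the "some increase in code size" comes from. */
scale_fn select_scale(int has_sse2) {
    return has_sse2 ? scale_tuned : scale_generic;
}
```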

Vectorization

• Automatically converts loops to utilize MMX/SSE/SSE2 instructions and registers.
• Data types: char/short/int/float/double
  – (but not mixed)
• Can Use Short Vector Math Library
• Enabled through -[Q]xW, -[Q]xK, -[Q]axW, -[Q]axK
• -vec_report3 tells you which loops were vectorized, and if not, why not.
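A loop of the following shape is the kind of candidate the vectorizer looks for: one data type throughout (float), unit-stride accesses, and no dependence between iterations. The function name `saxpy` and the use of C99 `restrict` to promise non-overlapping arrays are assumptions made for this sketch; whether a given compiler version actually vectorizes it is best confirmed with the -vec_report3 output.

```c
#include <stddef.h>

/* y[i] = a * x[i] + y[i] for every element.
 * All operands are float, so no type mixing gets in the vectorizer's way. */
void saxpy(float *restrict y, const float *restrict x, float a, size_t n) {
    for (size_t i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```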

High Level Optimizer

• Windows: /O3 or Linux: -O3
• Use with -xW, -xK, -QxW, -QxK, etc.
  – additional loop optimizations
  – more aggressive dependency analysis
  – scalar replacement
  – software prefetch (-xK on Pentium III)
• Loops must meet criteria related to those for vectorization
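Of the items above, scalar replacement is easy to show by hand. The sketch below gives a "before" loop that updates an array element in memory on every inner iteration, and an "after" form in which that element is held in a local scalar (and so, typically, a register) until the inner loop finishes. The function names are hypothetical, and the "after" version is only an illustration of the transformation, not output captured from the compiler.

```c
#include <stddef.h>

/* Before: sum[r] is read and written through memory on each inner iteration. */
void row_sums_before(double *sum, const double *a, size_t rows, size_t cols) {
    for (size_t r = 0; r < rows; r++) {
        sum[r] = 0.0;
        for (size_t c = 0; c < cols; c++) {
            sum[r] += a[r * cols + c];
        }
    }
}

/* After: the running total lives in a scalar and is stored once per row. */
void row_sums_after(double *sum, const double *a, size_t rows, size_t cols) {
    for (size_t r = 0; r < rows; r++) {
        double acc = 0.0;
        for (size_t c = 0; c < cols; c++) {
            acc += a[r * cols + c];
        }
        sum[r] = acc;
    }
}
```

The compiler can only do this on its own when it can show that `sum` and `a` do not overlap, which is one reason the slide pairs this optimization with more aggressive dependency analysis.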

Under the Covers: Xeon

SMP parallelism

• OpenMP
  – Easy multithreading using directives
  – Use KSL tools for Development
  – Use Intel tools to optimize for IA in tandem with OpenMP
• Auto-parallelization
  – Simple loops threaded by compiler alone
• Loops must meet certain criteria…

OpenMP* Support

• OpenMP 1.1 for Fortran & 1.0 for C / C++
  – Debugger info support for OpenMP
  – Assure for Threads supported with Intel Compiler
• OpenMP switches:
  – -Qopenmp, -openmp (or -openmpP)
  – -QopenmpS, -openmpS (serial, for debugging)
  – -openmp_report[n] (diagnostics)
  – works in conjunction with vectorization
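For readers who have not seen OpenMP directives, here is a minimal sketch of a loop threaded with one pragma. The function name `dot` is hypothetical. The directive only takes effect when the code is built with an OpenMP switch such as those listed above; without it the pragma is ignored and the loop runs serially with the same result.

```c
/* Dot product with the loop iterations shared among threads.
 * reduction(+:total) gives each thread a private partial sum and
 * combines the partial sums when the loop ends. */
double dot(const double *x, const double *y, int n) {
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; i++) {
        total += x[i] * y[i];
    }
    return total;
}
```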

Auto Parallelization

• Auto-parallelization: Automatic threading of loops without having to manually insert OpenMP* directives.
  – -Qparallel (Windows*), -parallel (Linux*)
  – -Qpar_report[n], -par_report[n] (diagnostics)
• Better to use OpenMP directives
  – Compiler can identify “easy” candidates for parallelization, but large applications are difficult to analyze.
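To make the idea of an "easy" candidate concrete, the sketch below contrasts a loop whose iterations are independent with one that carries a dependence from each iteration to the next. Both function names are hypothetical, and which loops a particular compiler version actually threads should be checked with the -par_report diagnostics rather than assumed from this example.

```c
/* Easy candidate: each out[i] depends only on in[i], so the
 * iterations can run in any order or on different threads. */
void square_all(double *out, const double *in, int n) {
    for (int i = 0; i < n; i++) {
        out[i] = in[i] * in[i];
    }
}

/* Not a candidate as written: a[i] needs the a[i - 1] produced by
 * the previous iteration, so the iterations must run in order. */
void prefix_sum(double *a, int n) {
    for (int i = 1; i < n; i++) {
        a[i] += a[i - 1];
    }
}
```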

Agenda

• General and processor optimization
• Loop level optimizations
• Multi-pass optimizations
  – Inter Procedural Optimization
  – Profile Guided Optimization
• Other

Inter-Procedural Optimizations (IPO)

• -Qip, -ip: Enables interprocedural optimizations for single file compilation.
• -Qipo, -ipo: Enables interprocedural optimizations across files.

Inter-Procedural Optimizations (IPO)

• More benefits than just inlining
  – Partial inlining
  – Interprocedural constant propagation
  – Passing arguments in registers
  – Loop-invariant code motion
  – Dead code elimination
  – Helps vectorization, memory disambiguation
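The sketch below illustrates two of the listed benefits, inlining and interprocedural constant propagation. Imagine `clamp_to` living in func1.c and `sum_clamped` in main.c; they are shown in one block only so the example is self-contained. All names are hypothetical, and `sum_clamped_inlined` is a hand-written picture of what the optimizer could produce, not actual compiler output.

```c
/* Would live in func1.c: a small helper called from another file. */
int clamp_to(int value, int limit) {
    return value > limit ? limit : value;
}

/* Would live in main.c: every call passes the same constant limit. */
int sum_clamped(const int *v, int n) {
    int total = 0;
    for (int i = 0; i < n; i++) {
        total += clamp_to(v[i], 100);
    }
    return total;
}

/* What cross-file optimization could reduce the loop to: the call is
 * gone and the constant 100 has been propagated into the comparison. */
int sum_clamped_inlined(const int *v, int n) {
    int total = 0;
    for (int i = 0; i < n; i++) {
        total += v[i] > 100 ? 100 : v[i];
    }
    return total;
}
```

Once the call is removed, the loop body is simple enough to be considered for the loop-level optimizations described earlier.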

IPO Usage: 2 Step Process

Pass 1: Compiling (produces virtual .obj and .il files)
    Windows*: icl -c /Qipo main.c func1.c func2.c
    Linux*:   icc -c -ipo main.c func1.c func2.c

Pass 2: Linking (produces the executable)
    Windows*: icl /Qipo main.obj func1.obj func2.obj
    Linux*:   icc -ipo main.o func1.o func2.o

Windows* Hint: LINK=link.exe should be replaced with LINK=xilink.exe
    i.e.: xilink /Qipo <link commands> main.obj func1.obj func2.obj

Profile-Guided Optimizations (PGO)

• Use execution-time feedback to guide opt
• Helps I-cache, paging, branch-prediction
• Enabled Optimizations:
  – Basic block ordering
  – Better register allocation
  – Better decision of functions to inline
  – Function ordering
  – Switch-statement optimization
  – Better vectorization decisions

PGO Usage: 3 Step Process

Step 1: Instrumented Compilation
    Windows: icl /Qprof_gen prog.c
    Linux:   icc -prof_gen prog.c
    Produces the instrumented executable: prog.exe

Step 2: Instrumented Execution
    prog.exe (on a typical dataset)
    Produces a DYN file containing dynamic info: .dyn

Step 3: Feedback Compilation
    Windows: icl /Qprof_use prog.c
    Linux:   icc -prof_use prog.c
    Uses the merged DYN summary file: .dpi
    Delete old dyn files if you don’t want their info included too

When To Use PGO

• Applications with lots of functions, calls, or branching that are not loop-bound
  – Examples: Databases, Decision-support (enterprise), MCAD
  – Apps with computation spread throughout; not confined to kernels
• Considerations:
  – Different paradigm for builds: 3 steps
  – Schedule time in final stages of development when code is more stable.
  – Use representative data set(s) (not for corner cases)

Programs That Benefit

• Consistent hot paths
• Many if statements or switches
• Nested if statements or switches

[Figure: scale of PGO benefit, from "Significant Benefit" to "Little Benefit"]
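A small, hypothetical example of branch-heavy code of this kind: a character classifier built from a chain of if statements, called once per input byte. Which test succeeds most often depends entirely on the data (mostly letters for prose, mostly digits for numeric logs), and that is the information an instrumented run can record. The names `classify` and `histogram` are assumptions for this sketch.

```c
enum kind { KIND_DIGIT, KIND_ALPHA, KIND_SPACE, KIND_OTHER };

/* A chain of data-dependent branches; the best test order and code
 * layout depend on which kinds dominate in typical input. */
enum kind classify(unsigned char c) {
    if (c >= '0' && c <= '9') {
        return KIND_DIGIT;
    }
    if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) {
        return KIND_ALPHA;
    }
    if (c == ' ' || c == '\t' || c == '\n') {
        return KIND_SPACE;
    }
    return KIND_OTHER;
}

/* Counts how many characters of each kind appear in a string.
 * counts[] is indexed by enum kind and must be zeroed by the caller. */
void histogram(const char *text, int counts[4]) {
    for (; *text != '\0'; text++) {
        counts[classify((unsigned char)*text)]++;
    }
}
```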

Indirect Branches

• Indirect Branches not as predictable
  – Compared with conditional branches
  – Usually generated for switch statements
  – Have much larger relative latency than Direct Branches
• Intel Compiler does:
  – Optimizes likely cases to use conditional branches
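The transformation in the last bullet can be pictured by hand. In the sketch below, `apply_switch` is an ordinary switch that a compiler may implement with an indirect jump through a table, and `apply_peeled` tests the likely case first with a plain conditional branch before falling back to the switch. The function names and the choice of `op == 0` as the likely case are assumptions for illustration; a real compiler would pick the case from profile data or its own heuristics.

```c
/* Ordinary switch: may be compiled to an indirect jump via a table. */
int apply_switch(int op, int a, int b) {
    switch (op) {
    case 0:  return a + b;
    case 1:  return a - b;
    case 2:  return a & b;
    case 3:  return a | b;
    default: return a ^ b;
    }
}

/* Likely case peeled off into a conditional branch; the remaining
 * cases still go through the switch. */
int apply_peeled(int op, int a, int b) {
    if (op == 0) {
        return a + b;
    }
    return apply_switch(op, a, b);
}
```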

Under the Covers: P4

Agenda

• General and processor optimization
• Loop level optimizations
• Multi-pass optimizations
• Other
  – Floating point precision
  – Math Libraries
  – Other

Floating Point Precision

Windows        Linux         Description
-Op            -mp           Strict ANSI C and IEEE 754 Floating Point (subset of -Za/-ansi)
-Za            -Xc           Strict ANSI C and IEEE 754
-Qlong_double  -long_double  long double=80, not the default of 64
-Qprec*        -mp1          Precision closer to, but not quite, ANSI; faster than ANSI
-Qprec_div*    -prec_div*    Turn off division into reciprocal multiply
-Qpcn*         -pcn*         Round to n precision. n={32,64,80}
-Qrcd*         -rcd*         Remove code that truncates during float to integer conversions

* Only available on IA32
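The prec_div row refers to replacing a division with a multiplication by the reciprocal. The sketch below writes both forms out by hand so the difference is visible: the reciprocal form rounds twice (once for 1/y, once for the product), so it can differ from a true division in the last bits. The function names are hypothetical, and this shows the idea only, not the compiler's generated code.

```c
/* True division: one correctly rounded operation. */
float div_exact(float x, float y) {
    return x / y;
}

/* Reciprocal multiply: often faster, but rounds twice, so the result
 * may differ slightly from x / y for some inputs. */
float div_recip(float x, float y) {
    float r = 1.0f / y;
    return x * r;
}
```

When the divisor is a power of two, its reciprocal is exact and the two forms agree; for other divisors the results are close but not guaranteed to be bit-identical.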

Math Libraries

• Intel’s LIBM (libimf on Linux)
• Short Vector Math Library (SVML)
  – Used when vectorizing loops which have math functions in them
• Automatically used when needed
  – LIB (Windows), LD_LIBRARY_PATH (Linux) environment variables
• Common math functions
  – sin/cos/tan/exp/sqrt/log, etc.
• Processor dispatch for every IA processor

Libraries on Linux

• -i_dynamic: link to shared libraries (default)
• -static: link to static libraries
• -shared: create a shared object
• -Vaxlib: link to portability library

Other Switches

• More Switches
• Pragmas
  – #pragma IVDEP: hints to compiler that loops are independent and can be vectorized
• See Compiler User’s Guide and Reference
• icc -help | icl -help
• http://www.intel.com/software/products
• Intel Developer Forum
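As an illustration of the IVDEP pragma listed above, consider a loop whose safety depends on a value the compiler cannot see. In the sketch below, iterations are independent only if `k` is non-negative; the pragma is the programmer's promise that this holds. The function name `shift_add` is hypothetical, the lowercase spelling `#pragma ivdep` is the form used in Intel's compiler documentation, and compilers that do not recognize the pragma simply ignore it.

```c
/* a[i] = a[i + k] + b[i]. With k >= 0 each iteration reads an element
 * that has not yet been overwritten, so vectorizing is safe. With k < 0
 * it would not be, and the pragma must not be used.
 * The caller must provide at least n + k elements in a. */
void shift_add(float *a, const float *b, int n, int k) {
#pragma ivdep
    for (int i = 0; i < n; i++) {
        a[i] = a[i + k] + b[i];
    }
}
```

The pragma does not check anything; if the promise is wrong, the vectorized loop can silently produce different results from the serial one.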

Summary

• Presented the major optimization switches of the Intel Compiler
  – General Switches
  – Vectorization & High Level Optimizations
  – Profile Guided Optimizations
  – InterProcedural Optimizations
• Explained how the Intel Compiler takes advantage of current IA
• Optimized PovRay using the Intel Compiler