Michael J. Voss and Rudolf Eigenmann, PPoPP '01
(Presented by Kanad Sinha)
Motivation
General choices for adaptive optimization
ADAPT
  The Architecture
  The Language
  An example
Results
There’s only so much optimization that can be performed at compile-time.
Have to generate code for generic system models – make compile-time assumptions that may be sensitive to input, unknown till runtime.
Convergence of technologies – difficult to generate common binary to exploit individual system characteristics.
Possible solution?
“Use of adaptive and dynamic optimization paradigms, where optimization is performed at runtime when complete system and input knowledge is available.”
Choose from statically generated code variants
+ Easy
- May not result in max possible optimization
- Can result in code explosion
Parameterization
+ Single copy of source
- May still not result in max possible optimization
Dynamic compilation
+ Complete input and system knowledge: max optimization possible
- Considerable runtime overhead
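The three choices above can be contrasted on a toy summation loop. This is only an illustrative sketch in Python, not code from the paper; the function names (sum_generic, sum_unrolled4, select_variant, compile_specialized) are hypothetical, and exec stands in for a real dynamic compiler.

```python
# Illustrative only: three ways to adapt a loop to a trip count `n`
# that is known only at run time. All names here are hypothetical.

def sum_generic(data, n):
    """Parameterization: a single copy of the code, driven by n."""
    total = 0
    for i in range(n):
        total += data[i]
    return total

def sum_unrolled4(data, n):
    """A statically generated variant, valid only when n % 4 == 0."""
    total = 0
    for i in range(0, n, 4):
        total += data[i] + data[i + 1] + data[i + 2] + data[i + 3]
    return total

def select_variant(n):
    """Choose among the statically generated variants at run time."""
    return sum_unrolled4 if n % 4 == 0 else sum_generic

def compile_specialized(n):
    """Dynamic compilation: generate source with n baked in as a constant."""
    terms = " + ".join(f"data[{i}]" for i in range(n)) or "0"
    source = f"def sum_specialized(data):\n    return {terms}\n"
    namespace = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["sum_specialized"]

data = list(range(8))
variant = select_variant(len(data))
specialized = compile_specialized(len(data))
results = (sum_generic(data, 8), variant(data, 8), specialized(data))
```

All three paths compute the same sum; they differ in how much runtime knowledge they exploit and in how much work is deferred to run time.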
Automated De-Coupled Adaptive Program Optimization
Generic framework, which leverages existing tools
Uses a domain-specific language, AL, by which adaptive techniques can be specified
…
Supports dynamic compilation and parameterization
Enables optimizations through “runtime sampling”
Facilitates an iterative modification and search approach
3 functions of a dynamic/adaptive optimization system
Evaluate effectiveness of particular optimization for current input & system information
Apply optimization if profitable
Re-evaluate applied optimizations and tune according to current runtime conditions
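A minimal sketch of these three functions as one tuning step, assuming a hypothetical measure(candidate) probe that returns a cost (lower is better) under the conditions at the time of the call. The toy cost model and the name tune_once are assumptions for illustration, not part of ADAPT.

```python
def tune_once(best, candidates, measure):
    """One tuning step: re-evaluate, evaluate, apply if profitable."""
    best_cost = measure(best)            # re-evaluate the applied choice
    for cand in candidates:
        cost = measure(cand)             # evaluate under current conditions
        if cost < best_cost:             # apply only if profitable
            best, best_cost = cand, cost
    return best

# Hypothetical cost model: the ideal unroll factor depends on conditions
# that may change while the program runs.
conditions = {"ideal": 4}

def measure(factor):
    return abs(factor - conditions["ideal"])

factors = [1, 2, 4, 8]
first = tune_once(1, factors, measure)        # tuned for the initial conditions
conditions["ideal"] = 8                       # runtime conditions shift
second = tune_once(first, factors, measure)   # re-tuned after the shift
```

In a real system the probe would be a timer or hardware counter rather than a formula; injecting it as a function keeps the sketch deterministic.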
Runtime system consists of:
Modified version of application
Remote optimizer
  has source code, description of target machine, stand-alone tools & compilers
Local optimizer
  agent of remote optimizer on system
  detects hot-spots
  tracks multiple interval contexts (here, loop bounds)
  runs in separate thread
Optimization and execution truly asynchronous
LO invokes RO, when hotspot detected
RO tunes the interval using available tools, according to user-specified heuristics
RPC returns
If new code available, dynamically link to application as the new best/experimental version, depending on RO’s message
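The decoupling described above can be sketched with a worker thread and a queue. This is a simplified stand-in, assuming an in-process thread in place of the RPC to the remote optimizer and an inline call counter in place of the local optimizer's hot-spot detection; HOT_THRESHOLD, Interval, baseline, and closed_form are hypothetical names.

```python
import queue
import threading

HOT_THRESHOLD = 3  # hypothetical: calls before an interval counts as hot

class Interval:
    """A candidate code section together with its current best version."""
    def __init__(self, code):
        self.best = code
        self.calls = 0
        self.requested = False

def baseline(n):
    return sum(range(n))

def closed_form(n):
    # Stand-in for a tuned version handed back by the optimizer.
    return n * (n - 1) // 2

def remote_optimizer(requests, done):
    """Runs in its own thread, outside the application's critical path."""
    while True:
        interval = requests.get()
        if interval is None:           # shutdown signal
            break
        interval.best = closed_form    # "dynamically link" the new version
        done.set()

def run_interval(interval, n, requests):
    """Application side: count calls, request tuning once when hot."""
    interval.calls += 1
    if interval.calls >= HOT_THRESHOLD and not interval.requested:
        interval.requested = True
        requests.put(interval)         # hand off; execution continues
    return interval.best(n)

requests = queue.Queue()
done = threading.Event()
worker = threading.Thread(target=remote_optimizer, args=(requests, done))
worker.start()

loop = Interval(baseline)
outputs = [run_interval(loop, 10, requests) for _ in range(5)]

done.wait()          # for the demo only: wait until the swap has happened
requests.put(None)
worker.join()
```

The application never blocks on the optimizer: whichever version is installed when the interval runs is the one that executes, and both versions return the same result.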
Candidate code sections have 2 control flow paths:
  through best known version
  through experimental version
Each of these can be replaced dynamically
Flag indicates which version to execute
Monitor experimental versions of each context
  collected data used as feedback
  if better, swap with best known version
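One way to picture this two-path structure is a small dispatcher object. The sketch below is an assumption-laden illustration, not ADAPT's generated code: CodeSection, install_experimental, review, and the injected cost_of probe are hypothetical, and the context key is simply the loop bound.

```python
class CodeSection:
    """A candidate code section with a best-known and an experimental path."""

    def __init__(self, best, cost_of):
        self.best = best
        self.experimental = None
        self.use_experimental = False   # flag: which version to execute
        self.cost_of = cost_of          # hypothetical probe: code -> cost
        self.samples = {}               # (context, path) -> list of costs

    def install_experimental(self, code):
        self.experimental = code
        self.use_experimental = True

    def run(self, context, *args):
        if self.use_experimental and self.experimental is not None:
            path, code = "experimental", self.experimental
        else:
            path, code = "best", self.best
        result = code(*args)
        # Monitoring: record the observed cost for this context and path.
        self.samples.setdefault((context, path), []).append(self.cost_of(code))
        return result

    def review(self, context):
        """Use collected data as feedback; swap if the experiment won."""
        exp = self.samples.pop((context, "experimental"), [])
        best = self.samples.get((context, "best"), [])
        if exp and best and min(exp) < min(best):
            self.best = self.experimental
            self.samples[(context, "best")] = exp
        self.experimental = None
        self.use_experimental = False

def slow(n):
    return sum(range(n))

def fast(n):
    return n * (n - 1) // 2

costs = {slow: 10.0, fast: 1.0}   # hypothetical measured costs
section = CodeSection(slow, cost_of=lambda code: costs[code])

r1 = section.run(100, 100)        # context 100 (the loop bound), best path
section.install_experimental(fast)
r2 = section.run(100, 100)        # same context, experimental path
section.review(100)               # experimental was cheaper, so it is promoted
```

A real implementation would time the executions instead of looking costs up in a table; the lookup keeps the example deterministic.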
Optimization process outside critical path/decoupled from execution
ADAPT Language (AL) *
Features:
Uses an LL(1) grammar => simple parser
Domain-specific language with C-style format
Defines reserved words that at runtime contain useful input data and system information
* “A full description of ADAPT language is beyond the scope of this paper”, and by extension, this presentation.
[Example AL program: figure not reproduced. Callouts: initialize some variables; constraints; interface to tool to be used; this block defines the heuristic]
Statement | Description
constraint(compile-time constraint) | Supplies a compile-time constraint
apply_spec(condition, type, syntax[, params]) | A description of a tool or flag
collect (event list) execute; | Initiates the monitoring of an experimental code version
mark_as_best | Specifies that the code variant that would be generated under the current runtime conditions is a new best known version
end_phase | Denotes the end of an optimization phase
Test Machines: 6 core Sun ULTRA Enterprise 4000, single-core Pentium II Linux workstation
Experiment | Result
Useless Copying: run a dynamically compiled version of code without applying any optimization | Overhead less than ~5%; some cases show a speed-up!
Specialization: loop bounds replaced as constants by their runtime value | Average improvement: E4000 13.6%, Pentium 2.2%
Flag Selection: experiment with various combinations of compiler flags | Average improvement: E4000 35%, Pentium 9.2%; identified some non-intuitive choices
Loop Unrolling: loop unrolled by factors that evenly divide the number of iterations of the innermost loop, to a maximum factor of 10 | Average improvement: E4000 18%, Pentium 5%
Loop Tiling: loops deemed appropriate tiled for ½, ¼, .., 1/16 of L2 cache size | Average improvement: E4000 13.5%, Pentium 9.8%
Parallelization: loops deemed appropriate by Polaris parallelized | Average improvement: E4000 51.8%
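The search spaces for the unrolling and tiling experiments are small enough to enumerate. The sketch below shows one plausible way to generate the candidates; the function names, the exclusion of factor 1 (no unrolling), the 8-byte element size, and the 512 KiB cache size are assumptions for illustration, not values from the paper.

```python
def unroll_factors(trip_count, max_factor=10):
    """Factors in 2..max_factor that evenly divide the trip count."""
    return [f for f in range(2, max_factor + 1) if trip_count % f == 0]

def tile_sizes(l2_bytes, element_bytes=8):
    """Tile footprints of 1/2, 1/4, 1/8 and 1/16 of L2, in elements."""
    return [l2_bytes // divisor // element_bytes for divisor in (2, 4, 8, 16)]

factors = unroll_factors(100)        # innermost loop runs 100 iterations
tiles = tile_sizes(512 * 1024)       # hypothetical 512 KiB L2 cache
```

Each candidate would then be built and run as an experimental version, with the measured results deciding which one becomes the best known version.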
There’s advantage in doing runtime optimization
Can be applied to general-purpose programs as well
For full-blown runtime optimization, need to move optimization process outside the critical path
if (questions(“?!”) == 1)
delay();
THANK_YOU(“Have a great weekend!”);