Dynamic Optimization using ADORE Framework
10/22/2003
Wei Hsu
Computer Science and Engineering Department
University of Minnesota
Background
• Compiler Optimization:
The phases of compilation that generate good code, making as efficient use of the target machine as possible.
• Static Optimization:
Compile-time optimization: a one-time, fixed optimization that will not change after distribution.
• Dynamic Optimization:
Optimization performed at program execution time, adaptive to the execution environment.
Examples of Compiler Optimizations
• Instruction scheduling
Before:
    Ld R1,(R2)
    Add R3,R1,R4
    Ld R5,(R6)
    Add R7,R5,R4
After:
    Ld R1,(R2)
    Ld R5,(R6)
    Add R3,R1,R4
    Add R7,R5,R4
• Cache prefetching (frequent data cache misses !!)
Before:
    Ld R1,(R2)
    Addi R2,R2,64
    Add R3,R1,R4
After:
    Ld R1,(R2)
    prefetch 256(R2)
    Addi R2,R2,64
    Add R3,R1,R4
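The effect of the prefetch in the example above can be explored with a toy timing model. The Python sketch below is an illustration only, not ADORE or compiler code: the 64-byte cache line, the 20-cycle miss latency, the 5 cycles of work per iteration, and the helper name count_stalls are all assumptions chosen for the example.

```python
LINE_SIZE = 64  # bytes per cache line (assumed; matches the Addi stride of 64)


def count_stalls(iterations, prefetch_distance=None, miss_latency=20, work_cycles=5):
    """Return (demand_misses, total_cycles) for a loop that loads with stride 64.

    prefetch_distance is the byte offset of the prefetch, e.g. 256 for
    `prefetch 256(R2)`; None means no prefetch is issued.
    """
    ready_at = {}  # cache line number -> cycle at which the line is available
    cycle = 0
    misses = 0
    addr = 0
    for _ in range(iterations):
        line = addr // LINE_SIZE
        arrival = ready_at.get(line)
        if arrival is None:
            # Demand miss: the line was never requested, so pay the full latency.
            misses += 1
            cycle += miss_latency
            ready_at[line] = cycle
        elif arrival > cycle:
            # A prefetch is still in flight: wait only for the remaining time.
            cycle = arrival
        if prefetch_distance is not None:
            target = (addr + prefetch_distance) // LINE_SIZE
            # Request the future line now unless it is already known.
            ready_at.setdefault(target, cycle + miss_latency)
        cycle += work_cycles  # the Addi/Add work of one iteration
        addr += LINE_SIZE
    return misses, cycle


if __name__ == "__main__":
    print("no prefetch:   ", count_stalls(100))
    print("prefetch 256:  ", count_stalls(100, prefetch_distance=256))
```

Under these assumed parameters, only the first four loads remain demand misses once the prefetch is added. A longer miss latency would need a larger prefetch distance, which is exactly the kind of machine-specific value a static compiler has to guess.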
Is Compiler Optimization Important ?
• In the last 15 years, computer performance has increased by ~1000X.
– Clock rate increased by ~100X
– Micro-architecture contributed ~5X (the number of transistors doubles every 18 months)
– Compiler optimization added ~2-3X for single processors
(There is some overlap between clock rate and micro-architecture, and some overlap between micro-architecture and compiler optimizations.)
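As a quick check of the arithmetic, the sketch below multiplies the three contributions as if they were fully independent. That independence is an assumption made only for this example; the slide itself notes that the contributions overlap, so the product is an upper estimate rather than a measured figure. The name combined_speedup is hypothetical.

```python
import math


def combined_speedup(*factors):
    """Multiply individual speedup factors, assuming they are independent."""
    return math.prod(factors)


CLOCK_RATE = 100         # ~100X from clock rate
MICROARCHITECTURE = 5    # ~5X from micro-architecture
COMPILER_RANGE = (2, 3)  # ~2-3X from compiler optimization

low = combined_speedup(CLOCK_RATE, MICROARCHITECTURE, COMPILER_RANGE[0])
high = combined_speedup(CLOCK_RATE, MICROARCHITECTURE, COMPILER_RANGE[1])

if __name__ == "__main__":
    print(f"combined speedup if independent: {low}X to {high}X")
```

The product brackets the quoted ~1000X from above, which is consistent with the remark that the individual contributions partly overlap.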
Speed up from Compiler Optimization
[Figure: SPEC95Int (running on HP-PA8000). Y-axis: speedup (0 to 6). Bars: O1, O2, O3, O4, O4 + PBO.]
Speed up from Compiler Optimization
[Figure: SPEC95fp (running on HP-PA8000). Y-axis: speedup (0 to 40). Bars: O1, O2, O3, O4, O4 + PBO.]
Excellent Benchmark Performance
[Figure: SPEC2000Int (running on HP/Intel Itanium). Y-axis: speedup (0 to 18). Bars: O2, O3, O3 + PBO.]
Mediocre Application Performance
• Many application binaries are not optimized by compilers.
• An ISV releases one binary for all machines of the same architecture (e.g. P5), but the binary may not run efficiently on the user’s machine (e.g. P6).
• The ISV might have optimized the code with profiles that exercise different parts of the application than what is actually executed.
• Applications are built from many shared libraries, but there is no cross-library optimization.
Performance not effectively delivered for end-users!!
Examples of Compiler Optimizations
• Instruction scheduling: what if the load latency is 4 clocks instead of 2?
Before:
    Ld R1,(R2)
    Add R3,R1,R4
    Ld R5,(R6)
    Add R7,R5,R4
After:
    Ld R1,(R2)
    Ld R5,(R6)
    Add R3,R1,R4
    Add R7,R5,R4
• Cache prefetching: does the compiler know where the data cache misses are?
Before:
    Ld R1,(R2)
    Addi R2,R2,64
    Add R3,R1,R4
After:
    Ld R1,(R2)
    prefetch 256(R2)
    Addi R2,R2,64
    Add R3,R1,R4
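The question about load latency can be made concrete with a small timing model. The sketch below assumes a single-issue, in-order machine where an instruction waits until its source registers are ready, loads take load_latency cycles and every other instruction takes 1 cycle. This machine model and the name run_cycles are assumptions for illustration, not a description of any real processor.

```python
def run_cycles(program, load_latency):
    """Return the cycle at which the last result is ready.

    program is a list of (opcode, destination, sources) tuples executed on an
    assumed single-issue, in-order pipeline.
    """
    ready = {}  # register -> cycle at which its value becomes available
    cycle = 0   # earliest cycle at which the next instruction may issue
    for op, dest, srcs in program:
        # Stall until every source register has been produced.
        issue = max([cycle] + [ready.get(reg, 0) for reg in srcs])
        latency = load_latency if op == "Ld" else 1
        ready[dest] = issue + latency
        cycle = issue + 1
    return max(ready.values())


# The two instruction orders from the scheduling example.
ORIGINAL = [
    ("Ld", "R1", ("R2",)),
    ("Add", "R3", ("R1", "R4")),
    ("Ld", "R5", ("R6",)),
    ("Add", "R7", ("R5", "R4")),
]
SCHEDULED = [
    ("Ld", "R1", ("R2",)),
    ("Ld", "R5", ("R6",)),
    ("Add", "R3", ("R1", "R4")),
    ("Add", "R7", ("R5", "R4")),
]

if __name__ == "__main__":
    for latency in (2, 4):
        print(latency, run_cycles(ORIGINAL, latency), run_cycles(SCHEDULED, latency))
```

In this model the scheduled order hides a 2-cycle load completely, but with a 4-cycle load the first Add still stalls, so a schedule tuned for one latency is no longer the best one for the other.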
A Case for Dynamic Optimization
The execution environment can be quite different from the assumptions made at compile time.
• Code should be optimized for the machine it runs on
• Code should be optimized according to how it is used
• Code should be optimized when all executables are available
• Only the part of the code that matters should be optimized
ADORE: ADaptive Object code RE-optimization
• The goal of ADORE is to create a system that transparently finds and optimizes performance-critical code at runtime.
– Adapting to new micro-architectures
– Adapting to different user environments
– Adapting to dynamic program behavior
– Optimizing shared library calls
• A prototype of ADORE has been implemented on the Itanium/Linux platform.
Framework of ADORE
[Figure: block diagram with the following components]
• Main Thread: Main Program, Optimized Trace Pool
• DynOpt Thread: Phase Detector, Trace Selector, Optimizer, Patcher, User Event Buffer (UEB)
• Kernel Space: System Sample Buffer (SSB)
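One way to picture how the components named above could hand work to each other is the toy model below. It is a sketch under stated assumptions, not the ADORE implementation: the slide gives only component names, so the data flow (kernel samples copied to the user buffer, hot addresses selected by a simple count, traces stored in a pool and patched in), the class name DynOptSketch, and the hot_threshold parameter are all hypothetical. Phase detection is left out entirely.

```python
from collections import Counter, deque


class DynOptSketch:
    """Toy model of possible component hand-offs; every detail is assumed."""

    def __init__(self, hot_threshold=3):
        self.system_sample_buffer = deque()  # SSB: filled by the simulated kernel
        self.user_event_buffer = []          # UEB: samples copied to user space
        self.optimized_trace_pool = {}       # start address -> optimized trace
        self.patches = {}                    # original address -> pool entry
        self.hot_threshold = hot_threshold   # assumed tuning knob

    def kernel_sample(self, pc):
        """Simulated kernel side: record one program-counter sample."""
        self.system_sample_buffer.append(pc)

    def drain_samples(self):
        """Copy all pending samples from the SSB into the UEB."""
        while self.system_sample_buffer:
            self.user_event_buffer.append(self.system_sample_buffer.popleft())

    def select_traces(self):
        """Trace selector stand-in: addresses sampled at least hot_threshold times."""
        counts = Counter(self.user_event_buffer)
        return [pc for pc, n in counts.items() if n >= self.hot_threshold]

    def optimize_and_patch(self):
        """One pass of the optimizer thread: select, 'optimize', and patch."""
        self.drain_samples()
        for pc in self.select_traces():
            if pc not in self.patches:
                # Placeholder for real optimization work.
                self.optimized_trace_pool[pc] = f"optimized trace for {pc:#x}"
                self.patches[pc] = pc
        self.user_event_buffer.clear()
        return sorted(self.patches)


if __name__ == "__main__":
    opt = DynOptSketch()
    for pc in (0x400, 0x400, 0x400, 0x500):
        opt.kernel_sample(pc)
    print([hex(pc) for pc in opt.optimize_and_patch()])
```

Keeping the sampling buffer separate from the optimizer state mirrors the split between kernel space and the DynOpt thread in the diagram, so the main program is only touched at the final patching step.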
Current Optimizations in ADORE
• We have implemented
– Data cache prefetching
– Trace selection and layout
• We are investigating and testing the following optimizations
– Instruction scheduling with control and data speculation
– Instruction cache prefetching
– Partial dead code elimination
Performance Impact of O2/O3 Binary
[Figure: speedup of O2 + Runtime Prefetching. Y-axis: speedup (-10% to 70%). Benchmarks: bzip2, gzip, mcf, vpr, parser, gap, vortex, gcc, ammp, art, applu, equake, facerec, fma3d, lucas, mesa, swim, Blast.]
Mcf
[Figure: CPI (0 to 9) over execution time. Series: Original Program, with ADORE.]
Art
[Figure: CPI (0 to 4.5) over execution time. Series: Original Program, with ADORE.]
Optimizing BLAST with ADORE
• BLAST is the most popular tool used in bioinformatics. Several faculty members and research colleagues are using it.
• Used as a benchmark by companies to test their latest systems and processors
• The performance of BLAST matters.
Speedup from BLAST queries
[Figure: speedup of queries from ADORE and ECC over ORC. Y-axis: percentage speedup (-15 to 35). X-axis: queries (blastn ntnt.1, blastn ntnt.45min, blastp nraa.1, blastp nraa.10, tblastn ntaa.q1, blastx nrnt.1, blastx nrnt.45min, blastx nrnt.10). Series: Speedup from ADORE, Speedup from ECC.]
Cycle Accounting for Various Queries
[Figure: stacked bars. Y-axis: percentage of total cycles (0% to 100%). X-axis: queries, listed below. Categories: Support register dependency stalls, Integer register dependency stalls, RSE stalls, FPU stalls, Branch misprediction stalls, I-Cache stalls, D-Cache stalls, Unstalled cycles.]
• blastn, nt, nt.1 (single short query)
• blastn, nt, nt.45min (single long query)
• blastp, nr, aa.1 (single short query)
• blastp, nr, aa.10 (multiple short queries)
• tblastn, nt, aa.q1 (single medium query)
• blastx, nr, nt.1 (single short query)
• blastx, nr, nt.45min (single medium-short query)
• blastx, nr, nt.10 (multiple short queries)
Observations from BLAST
• ADORE is robust. It can handle real, large application code.
• ADORE does not speed up all queries, since the code is already running quite efficiently on Itanium systems. It adds about 1-2% of profiling and optimization overhead.
• ADORE does speed up one long query by 30%.
• It is difficult to further improve the performance of BLAST with static compilers.
Future Direction of ADORE
• Show more performance on more real applications
• Make ADORE more transparent
– Compiler independent
– Exception handling
• Study the impact of compiler annotations
• Study architectural/micro-architectural support for ADORE
ADORE Group
• Professors
– Prof. Wei-Chung Hsu
– Prof. Pen-Chung Yew
– Dr. Bobbie Othmer
• Graduate Students
– Howard Chen
– Jiwei Lu
– Jinpyo Kim
– Sagar Dalvi
– Rao Fu
– WeiChuan Dong
– Abhinav Das
– Dwarakanath Rajagopal
– Ananth Lingamneni
– Vijayakrishna Griddaluru
– Amruta Inamdar
– Aditya Saxena
Summary
• Dynamic Binary Optimization customizes performance delivery.
• The ADORE project at the U. of Minnesota is a research dynamic binary optimizer. It demonstrates good performance potential.
• With architecture/micro-architecture and static compiler support, a future dynamic optimizer could be more effective, more adaptive and more applicable.
Conclusion
Be Adaptive !!
Be Dynamic !!
Dynamic Translation
• Fast Simulation
– SimOS (Stanford), SHADE (SUN)
• Migration
– DAISY, BOA (IBM), Virtual PC, ARIES (HP), Crusoe (Transmeta)
• Internet applications
– Java HotSpot, MS dot NET
• Performance Tools (dynamic instrumentation)
– Paradyn and EEL (UW), Caliper (HP)
• Optimization
– Dynamo, Tinker (NCSU), Morph (Harvard), DyC (UW)