Post on 14-Dec-2015
Binary Translation Using Peephole Superoptimizers
Sorav Bansal, Alex Aiken
Stanford University
Binary Translation
• Allow one ISA to run on another
• Applications
– Portability (e.g., running legacy software)
– Virtualization
– Backward and Forward Compatibility
– On-chip binary translation
– Java Virtual Machines
[Diagram: x86 hardware, hypervisor, x86 OS, x86 apps, and a binary translator hosting a powerpc OS and powerpc app]
[Diagram: x86 hardware, OS, x86 apps, and a binary translator hosting a powerpc app]
Binary Translation Wish-list
Performance
Large Complex ISAs
Retargetability
OS Compatibility
Talk Outline
Superoptimization
Peephole Superoptimization
Application to Binary Translation
Implementation & Experimental Results
Conclusion
Superoptimization
• A superoptimizer is a code generator that uses brute-force search to try to find the optimal code
Eg.
  int signum(int x) {
    if (x > 0) return 1;
    if (x < 0) return -1;
    else return 0;
  }
On Motorola 68020:
  add.l d0, d0
  subx.l d1, d1
  negx.l d0
  addx.l d1, d1
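To see why the four-instruction sequence computes signum without branches, it can help to trace the extend (X) flag by hand. The following is a minimal Python sketch, not part of the talk. It assumes the input arrives in d0 and the result is left in d1, and it models only the X flag as carry or borrow out of a 32-bit operation. The function name `signum_68020` is made up for illustration.

```python
MASK = 0xFFFFFFFF  # 32-bit registers


def signum_68020(x):
    """Simulate add.l/subx.l/negx.l/addx.l on a signed 32-bit input x."""
    d0 = x & MASK
    d1 = 0  # initial d1 is irrelevant: subx.l d1,d1 computes d1 - d1 - X

    # add.l d0,d0: d0 = d0 + d0, X = carry out (the sign bit of x)
    total = d0 + d0
    xflag = total >> 32
    d0 = total & MASK

    # subx.l d1,d1: d1 = d1 - d1 - X, X = borrow
    diff = d1 - d1 - xflag
    xflag = 1 if diff < 0 else 0
    d1 = diff & MASK

    # negx.l d0: d0 = 0 - d0 - X, X = borrow (set unless d0 and X are both 0)
    diff = 0 - d0 - xflag
    xflag = 1 if diff < 0 else 0
    d0 = diff & MASK

    # addx.l d1,d1: d1 = d1 + d1 + X
    d1 = (d1 + d1 + xflag) & MASK

    # Reinterpret the 32-bit result as a signed integer.
    return d1 - (1 << 32) if d1 & 0x80000000 else d1
```

For a positive input, d1 is 0 after `subx` and the final `addx` adds the borrow from `negx`, giving 1. For a negative input, d1 is -1 and doubling it plus the borrow gives -1 again.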
Superoptimization
• Enumerate all sequences up to a certain length, and
• Compare each enumerated sequence with target function for equivalence
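The enumerate-and-compare loop can be sketched on a toy machine. The example below is an illustration under stated assumptions, not the superoptimizer from the talk: the "ISA" is five made-up operations on a single 8-bit register, and equivalence is checked by running all 256 inputs, which is only feasible because the state is tiny.

```python
from itertools import product

MASK = 0xFF  # toy 8-bit register

# Hypothetical single-register instruction set.
OPS = {
    "inc": lambda x: (x + 1) & MASK,
    "dec": lambda x: (x - 1) & MASK,
    "neg": lambda x: (-x) & MASK,
    "not": lambda x: ~x & MASK,
    "shl": lambda x: (x << 1) & MASK,
}


def run(seq, x):
    """Execute a sequence of opcode names on input x."""
    for name in seq:
        x = OPS[name](x)
    return x


def superoptimize(target, max_len):
    """Return the first shortest sequence equal to target on every input."""
    for length in range(max_len + 1):          # shortest sequences first
        for seq in product(OPS, repeat=length):
            if all(run(seq, x) == target(x) for x in range(MASK + 1)):
                return seq
    return None
```

For example, `superoptimize(lambda x: (2 * x + 2) & 0xFF, 3)` finds `("inc", "shl")`. The search space grows exponentially with sequence length, which is why the length bound matters.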
Peephole Superoptimization
Use a superoptimizer to automatically infer peephole optimizations

Table of Peephole Optimizations
pattern          replace-with
add $1, reg      inc reg
mul $2, reg      shl reg
…                …

[S. Bansal, A. Aiken. Automatic Generation of Peephole Superoptimizers, ASPLOS 2006]
Peephole Superoptimizer
Step 1
[Diagram: the bytes of a binary (a.out) are scanned for instruction sequences]
Harvest instruction sequences that can potentially be optimized.
Canonicalize and store them.

Target Sequences
mov %eax, %ecx; mov %ecx, %eax
sub $123, %eax; add $456, %eax
movl (%eax), %ecx; inc %ecx; movl %ecx, (%eax)
…
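One plausible reading of "canonicalize" is that sequences differing only in register names or constant values are stored once. The sketch below illustrates that idea and is an assumption rather than the talk's exact scheme: registers are renamed to `%r0`, `%r1`, … and immediates to `$c0`, `$c1`, … in order of first appearance. The function name and placeholder syntax are made up.

```python
import re

# Matches AT&T-style register operands (%eax) and immediates ($123).
TOKEN = re.compile(r"%\w+|\$-?\w+")


def canonicalize(seq):
    """Rename registers and constants by order of first appearance."""
    names = {}
    counts = {"%": 0, "$": 0}

    def rename(match):
        tok = match.group(0)
        if tok not in names:
            kind = tok[0]
            prefix = "%r" if kind == "%" else "$c"
            names[tok] = f"{prefix}{counts[kind]}"
            counts[kind] += 1
        return names[tok]

    return [TOKEN.sub(rename, insn) for insn in seq]
```

With this renaming, `mov %eax, %ecx; mov %ecx, %eax` and `mov %ebx, %edx; mov %edx, %ebx` map to the same canonical sequence, so only one of them needs to be superoptimized.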
Peephole Superoptimization
Step 2
Brute-force optimization: Target Sequences → Optimized Sequences
mov %eax, %ecx; mov %ecx, %eax                    →  mov %eax, %ecx
sub $123, %eax; add $456, %eax                    →  add $333, %eax
movl (%eax), %ecx; inc %ecx; movl %ecx, (%eax)    →  inc (%eax)
…                                                    …
Equivalence Test
[Diagram: two sequences enter the Execution Test; on fail they are not-equivalent; on pass they go to the Boolean Test; on fail they are not-equivalent; on pass they are equivalent]
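The two-stage test is a cheap filter followed by an exact check. The sketch below illustrates that structure only. It assumes a toy 8-bit machine with two registers, models each sequence as a Python function from `(eax, ecx)` to `(eax, ecx)`, and uses exhaustive enumeration as a stand-in for the boolean test, which in the talk is done with a SAT solver. All names are hypothetical.

```python
import random
from itertools import product

MASK = 0xFF  # toy 8-bit registers


def execution_test(f, g, trials=16, seed=0):
    """Cheap filter: run both sequences on a few random machine states."""
    rng = random.Random(seed)
    for _ in range(trials):
        state = (rng.randrange(MASK + 1), rng.randrange(MASK + 1))
        if f(*state) != g(*state):
            return False
    return True


def boolean_test(f, g):
    """Exact check. Exhaustive here; a SAT query in a real system."""
    states = product(range(MASK + 1), repeat=2)
    return all(f(a, c) == g(a, c) for a, c in states)


def equivalent(f, g):
    # The boolean test runs only when the execution test passes.
    return execution_test(f, g) and boolean_test(f, g)


# Toy analogue of "sub $123; add $456" versus "add $333".
target = lambda eax, ecx: ((eax - 12 + 45) & MASK, ecx)
good = lambda eax, ecx: ((eax + 33) & MASK, ecx)
bad = lambda eax, ecx: ((eax + 34) & MASK, ecx)
# Differs from target on exactly one state, so random testing may miss it.
tricky = lambda eax, ecx: (0, ecx) if (eax, ecx) == (200, 7) else good(eax, ecx)
```

Most wrong candidates, like `bad`, are rejected by the execution test after one run. The exact check is what catches rare mismatches like `tricky`.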
Peephole Superoptimization
Step 3
Table of Peephole Optimizations
mov %eax, %ecx; mov %ecx, %eax                    →  mov %eax, %ecx
sub $123, %eax; add $456, %eax                    →  add $333, %eax
movl (%eax), %ecx; inc %ecx; movl %ecx, (%eax)    →  inc (%eax)
…                                                    …
Application to Binary Translation
• Our approach: Use lots of peephole transformations
pattern (ppc)      translate-to (x86)    register map (ppc → x86)
addi r1,r1,1       inc %eax              r1 → eax
mullw r1,r1,2      shl %eax              r1 → eax
add r1,r1,r2       add %ecx,%eax         r1 → eax; r2 → ecx
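A translation table of this shape can be represented as a list of (ppc pattern, x86 replacement, register map) triples and applied by matching at the current instruction. The sketch below is illustrative only: it matches literal instruction text, uses just the three rules shown above, and has no fallback for unmatched instructions. The names `RULES` and `translate` are hypothetical.

```python
# (ppc pattern, x86 replacement, register map ppc -> x86)
RULES = [
    (("addi r1,r1,1",), ("inc %eax",), {"r1": "eax"}),
    (("mullw r1,r1,2",), ("shl %eax",), {"r1": "eax"}),
    (("add r1,r1,r2",), ("add %ecx,%eax",), {"r1": "eax", "r2": "ecx"}),
]


def translate(ppc_code):
    """Return a list of (x86 instructions, register map) per matched rule."""
    out = []
    i = 0
    while i < len(ppc_code):
        for pattern, x86, regmap in RULES:
            n = len(pattern)
            if tuple(ppc_code[i:i + n]) == pattern:
                out.append((x86, regmap))
                i += n  # a rule may consume several ppc instructions
                break
        else:
            raise ValueError(f"no rule matches {ppc_code[i]!r}")
    return out
```

Each result carries the register map its rule assumes. Reconciling the maps of adjacent results is the register map selection problem.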
Peephole Binary Translation
source arch. (ppc)                            register map          destination arch. (x86)
mr r1, r2; mr r2, r1                          r1 → eax; r2 → ecx    mov %eax, %ecx
lis r1, 0x12; ori r1, r1, 0x3456              r1 → Mr1              mov $0x123456, Mr1
ldl r2, (r1); addi r2, r2, 1; stl r2, (r1)    r1 → eax; r2 → ecx    inc (%eax)
…                                             …                     …
Register Map Selection
• The best code may require changing the register map from one code point to another
• The choice of register maps affects the choice of instruction selection and vice-versa
Register Map Selection
Example
powerpc sequence:
P0: li r1, 123
P1: addi r2, r2, 1
P2: subf r2, r1, r2
P3: ori r1, r1, 31
exit
x86 sequence: ?

Cost Model
Instruction costs: If accesses memory, 10; Else, 1
Switching costs: R→M or M→R : 10

At entry: r1 → Mr1 ; r2 → Mr2
At exit: r1 → Mr1 ; r2 → Mr2
Register Map Selection
Greedy Strategy
entry: r1 → Mr1 ; r2 → Mr2

      powerpc          x86               register map            switching   instruction
P0:   li r1, 123       movl $123, Mr1    r1 → Mr1                0           10
P1:   addi r2,r2,1     incl Mr2          r2 → Mr2                0           10
P2:   subf r2,r1,r2    subl Mr1, eax     r1 → Mr1 ; r2 → eax     10          10
P3:   ori r1,r1,31     orl $31, Mr1      r1 → Mr1                0           10
exit                                     r1 → Mr1 ; r2 → Mr2     10
Total                                                            20          40
Grand Total 60

Instruction costs: If accesses memory, 10; Else, 1
Switching costs: R→M or M→R : 10
Register Map Selection
Optimal Solution
entry: r1 → Mr1 ; r2 → Mr2

      powerpc          x86               register map            switching   instruction
P0:   li r1, 123       movl $123, eax    r1 → eax                10          1
P1:   addi r2,r2,1     incl ecx          r2 → ecx                10          1
P2:   subf r2,r1,r2    subl eax, ecx     r1 → eax ; r2 → ecx     0           1
P3:   ori r1,r1,31     orl $31, eax      r1 → eax                0           1
exit                                     r1 → Mr1 ; r2 → Mr2     20
Total                                                            40          4
Grand Total 44

Switching costs: R→M or M→R : 10
Instruction costs: If accesses memory, 10; Else, 1
Register Map Selection
• Use Dynamic Programming
– near-optimal solution
– account for translations spanning multiple instructions
– simultaneously perform instruction-selection and register-mapping
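The difference between the greedy and dynamic-programming strategies can be reproduced on the four-instruction example with the slide's cost model. The sketch below is a simplified illustration, not the translator's algorithm. It assumes each powerpc instruction has only the two candidate translations shown on the Greedy Strategy and Optimal Solution slides, tracks each ppc register only as "in memory" (M) or "in some x86 register" (R), and charges 10 per register whose location changes. All names are hypothetical.

```python
REGS = ("r1", "r2")
SWITCH_COST = 10  # moving one ppc register between memory and an x86 register

# Per ppc instruction: (x86 text, required locations, instruction cost).
CANDIDATES = [
    [("movl $123, Mr1", {"r1": "M"}, 10), ("movl $123, eax", {"r1": "R"}, 1)],
    [("incl Mr2", {"r2": "M"}, 10), ("incl ecx", {"r2": "R"}, 1)],
    [("subl Mr1, eax", {"r1": "M", "r2": "R"}, 10),
     ("subl eax, ecx", {"r1": "R", "r2": "R"}, 1)],
    [("orl $31, Mr1", {"r1": "M"}, 10), ("orl $31, eax", {"r1": "R"}, 1)],
]
ENTRY = {"r1": "M", "r2": "M"}
EXIT = {"r1": "M", "r2": "M"}


def switch(state, required):
    """Return (switching cost, new state) to satisfy a required map."""
    moved = sum(1 for reg, loc in required.items() if state[reg] != loc)
    return SWITCH_COST * moved, {**state, **required}


def greedy():
    """Pick the locally cheapest candidate at each instruction."""
    state, total = dict(ENTRY), 0
    for options in CANDIDATES:
        best = min(options, key=lambda o: switch(state, o[1])[0] + o[2])
        cost, state = switch(state, best[1])
        total += cost + best[2]
    return total + switch(state, EXIT)[0]


def dynamic_programming():
    """Keep the cheapest cost of reaching every register map."""
    key = lambda s: tuple(s[r] for r in REGS)
    best = {key(ENTRY): 0}
    for options in CANDIDATES:
        nxt = {}
        for k, so_far in best.items():
            state = dict(zip(REGS, k))
            for _, required, icost in options:
                scost, new_state = switch(state, required)
                total = so_far + scost + icost
                nk = key(new_state)
                if total < nxt.get(nk, float("inf")):
                    nxt[nk] = total
        best = nxt
    return min(c + switch(dict(zip(REGS, k)), EXIT)[0] for k, c in best.items())
```

Under these assumptions `greedy()` returns 60 and `dynamic_programming()` returns 44, matching the two grand totals. Greedy never pays the up-front switching cost that makes the later instructions cheap.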
PowerPC → x86 Translator Implementation
• Superoptimizer
– Use a PPC emulator (Qemu) for execution test
– Use a SAT solver (zChaff) for boolean test
• Static user-level translator
– ELF 32-bit ppc/Linux binary → ELF 32-bit x86/Linux binary
– Translate most (but not all) system calls
Implementation
Endianness: ppc big-endian ; x86 little-endian
– Convert all memory writes to big-endian (source)
– Convert all memory reads to little-endian (dest)
Compiler Optimizations
– Problem: PowerPC optimizer staggers data-dependent instructions to reduce pipeline stalls
– Solution: Cluster data-dependent instructions in basic block before translation
• Many Issues
– Condition Codes, Endianness, System Calls, Stack and Heap, Indirect Jumps, Function Calls and Returns, Register Name Constraints, Untranslated Opcodes, Compiler Optimizations
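The endianness rule can be illustrated with a byte swap around every guest memory access. The sketch below only models the effect on a little-endian host and is not how the translator emits code. The helper names `bswap32`, `guest_store`, and `guest_load` are hypothetical, and memory is a plain `bytearray`.

```python
import struct


def bswap32(value):
    """Reverse the byte order of a 32-bit value."""
    return struct.unpack("<I", struct.pack(">I", value & 0xFFFFFFFF))[0]


def guest_store(memory, addr, value):
    # Swap, then do the host's little-endian store:
    # the bytes land in memory in big-endian (source) order.
    memory[addr:addr + 4] = struct.pack("<I", bswap32(value))


def guest_load(memory, addr):
    # Host little-endian load, then swap back to the value the guest expects.
    (raw,) = struct.unpack("<I", memory[addr:addr + 4])
    return bswap32(raw)
```

Storing `0x12345678` this way leaves the bytes `12 34 56 78` in memory, so data structures keep the layout a PowerPC program expects, while values held in x86 registers stay in native order.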
Experimental Results
• Setup
– Pentium4 3.0 GHz, 1MB Cache, 4GB Memory
– gcc 4.0.1, glibc 2.3.6
– Use soft-float library
– Statically-linked input executables
• Benchmarks
– Microbenchmarks, SPEC CINT2000
• Metrics
– Compare against natively-compiled code
– Compare against other binary translators
• Qemu, Apple’s Rosetta
Experimental Setup
• For our experiments
– there are around 750 translation rules in the peephole table
– the translation table is computed offline and it can take up to a week to compute the peephole rules
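Experimental Results: Setup
[Diagram: the C source is compiled with gcc <options> -arch=ppc into a PowerPC executable and with gcc <options> -arch=x86 into an x86 executable. Peephole Binary Translation turns the PowerPC executable into a second x86 executable. Compare the two x86 executables.]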
Microbenchmarks
emptyloop A bounded for-loop doing nothing
fibo Compute first few Fibonacci numbers
quicksort Quicksort on 64-bit integers
mergesort Mergesort on 64-bit integers
bubblesort Bubblesort on 64-bit integers
hanoi1 Towers of Hanoi Algorithm 1
hanoi2 Towers of Hanoi Algorithm 2
hanoi3 Towers of Hanoi Algorithm 3
traverse Traverse a linked list
binsearch Binary search on a sorted array
Microbenchmarks
[Bar chart: Percentage of native (%) for emptyloop, fibo, quicksort, mergesort, bubsort, hanoi1, hanoi2, hanoi3, traverse, binsearch; series: O0, O2, O2 -omit-frame-pointer]
avg: 90% of native
Experimental Results: Microbenchmarks
• We sometimes outperform native performance on these small benchmarks!
– gcc generates better code for powerpc primarily because it has the luxury of many registers
– Our register-mapping algorithm performs an efficient “re-allocation” of the PowerPC registers to x86 registers.
Experimental Results: SPEC CINT2000
[Bar chart: Percentage of native (%) for bzip2, gap, gzip, mcf, parser, twolf, vortex; series: O0, O2]
Comparisons with Qemu and Rosetta
• Qemu
– Use same PowerPC and x86 executables as used for our own translator
• Rosetta
– Runs on Mac OS X and hence supports only Mac executables
– Recompiled the benchmarks on Mac using the same compiler version (gcc 4.0.1)
– Mac Hardware: Intel Core 2 Duo 1.83GHz processor, 32KB L1-cache, 2MB L2-cache and 2GB memory
Comparisons with Qemu and Rosetta
[Two bar charts (-O0 and -O2): qemu, rosetta, and peep on bzip2, gap, gzip, mcf, parser, twolf, vortex. Chart annotations: "avg: 3% faster than rosetta" and "avg: 12% faster than rosetta"]
Translation Time
• Takes 2-6 minutes to translate a 650KB executable (around 100K instructions)
– majority of time spent in optimal register map computation
• It is possible to reduce this to <10 seconds
– For 98K instructions (<0.01% of time), use any register map. Fast (<1 second)
– For other 2K, use optimal computation
Conclusions and Future Work
• A scheme to perform efficient binary translation using a superoptimizer
– Competitive performance
– Simplified Design
• Other applications
– Just-in-time compilation
– Machine virtualization
Q&A Thank you.
Backup Slides