Binary Translation Using Peephole Superoptimizers Sorav Bansal, Alex Aiken Stanford University.

Binary Translation

• Allow one ISA to run on another
• Applications
– Portability (e.g., running legacy software)
– Virtualization
– Backward and Forward Compatibility
– On-chip binary translation
– Java Virtual Machines

[Diagram: a hypervisor on x86 hardware hosting an x86 OS with an x86 app, and a binary translator hosting a powerpc OS with a powerpc app]

Binary Translation

[Diagram: x86 hardware running x86 apps, with a binary translator running a powerpc app]

Binary Translation Wish-list

• Performance
• Large Complex ISAs
• Retargetability
• OS Compatibility

Talk Outline

• Superoptimization
• Peephole Superoptimization
• Application to Binary Translation
• Implementation & Experimental Results
• Conclusion

Superoptimization

• A superoptimizer is a code generator that uses brute-force search to find the optimal code for a given function

E.g.
int signum(int x) {
    if (x > 0) return 1;
    if (x < 0) return -1;
    else return 0;
}

On Motorola 68020:
add.l d0, d0
subx.l d1, d1
negx.l d0
addx.l d1, d1

Superoptimization

• Enumerate all sequences up to a certain length

• Compare each enumerated sequence with target function for equivalence
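The enumerate-and-compare loop can be sketched in Python. This is a toy illustration, not the superoptimizer from the talk: the one-register 8-bit machine, the `OPS` table, and the `superoptimize` function are all assumptions made for the example, and equivalence is checked simply by running every one of the 256 possible inputs.

```python
from itertools import product

MASK = 0xFF  # assumption: a toy 8-bit, single-register machine

# Hypothetical instruction set: each opcode maps the register value to a new value.
OPS = {
    "inc": lambda x: (x + 1) & MASK,
    "dec": lambda x: (x - 1) & MASK,
    "shl": lambda x: (x << 1) & MASK,
    "neg": lambda x: (-x) & MASK,
    "not": lambda x: (~x) & MASK,
}

def run(seq, x):
    """Execute a sequence of opcodes on the register value x."""
    for op in seq:
        x = OPS[op](x)
    return x

def superoptimize(target, max_len):
    """Return the first shortest opcode sequence that matches target on all inputs."""
    for length in range(1, max_len + 1):          # shortest sequences first
        for seq in product(OPS, repeat=length):   # enumerate every sequence
            if all(run(seq, x) == target(x) for x in range(MASK + 1)):
                return seq
    return None

def target(x):
    """The function to optimize: 2*x + 2 on 8-bit values."""
    return (2 * x + 2) & MASK

print(superoptimize(target, 3))  # prints ('inc', 'shl')
```

The search space grows exponentially with sequence length, which is why the enumeration is bounded by `max_len`.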

Talk Outline

• Superoptimization
• Peephole Superoptimization
• Application to Binary Translation
• Implementation & Experimental Results
• Conclusion

Peephole Superoptimization

Use a superoptimizer to automatically infer peephole optimizations

Table of Peephole Optimizations
pattern          replace-with
add $1, reg      inc reg
mul $2, reg      shl reg
…                …

[S. Bansal, A. Aiken. Automatic Generation of Peephole Superoptimizers, ASPLOS 2006]

Peephole Superoptimization: Step 1

[Diagram: input binary shown as a bit stream]

Harvest instruction sequences that can potentially be optimized. Canonicalize and store them.

Target Sequences
mov %eax, %ecx; mov %ecx, %eax
sub $123, %eax; add $456, %eax
movl (%eax), %ecx; inc %ecx; movl %ecx, (%eax)
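One plausible reading of "canonicalize" is renaming registers in order of first appearance, so that sequences differing only in register names share one table entry. The sketch below assumes that reading; the slides do not spell out the scheme, and the `canonicalize` function and the `%r0`, `%r1` naming are hypothetical.

```python
import re

def canonicalize(seq):
    """Rename registers to %r0, %r1, ... in order of first appearance.

    Assumption: registers are written as %name; constants are left untouched.
    """
    names = {}

    def rename(match):
        # Reuse the canonical name if this register was seen before.
        return names.setdefault(match.group(0), f"%r{len(names)}")

    return tuple(re.sub(r"%\w+", rename, instr) for instr in seq)

a = canonicalize(["mov %eax, %ecx", "mov %ecx, %eax"])
b = canonicalize(["mov %ebx, %edx", "mov %edx, %ebx"])
print(a)       # ('mov %r0, %r1', 'mov %r1, %r0')
print(a == b)  # True
```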

Peephole Superoptimization: Step 2

Target Sequences -> Brute force Optimization -> Optimized Sequences

Target sequence:
mov %eax, %ecx; mov %ecx, %eax

Optimized sequences:
mov %eax, %ecx
add $333, %eax
inc (%eax)
…

Equivalence Test (applied to two sequences)
• Execution Test: fail -> not-equivalent
• Boolean Test: fail -> not-equivalent
• Both pass: equivalent
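The two-stage equivalence test can be illustrated with a small Python sketch. It rests on stated assumptions: values are 8 bits wide, the execution test uses a fixed hypothetical list `TEST_VECTORS`, and the boolean test is replaced by an exhaustive check standing in for a SAT query. The function names are invented for the example.

```python
MASK = 0xFF  # assumption: 8-bit values keep the exhaustive check small

TEST_VECTORS = [0, 1, 127, 128, 255]  # hypothetical fixed test inputs

def execution_test(f, g):
    """Cheap filter: run both sequences on a handful of inputs."""
    return all(f(x) == g(x) for x in TEST_VECTORS)

def boolean_test(f, g):
    """Stand-in for a SAT query: check every possible 8-bit input."""
    return all(f(x) == g(x) for x in range(MASK + 1))

def equivalent(f, g):
    """Run the expensive test only if the cheap one passes."""
    if not execution_test(f, g):
        return False
    return boolean_test(f, g)

def target(x):      # models: sub $123 ; add $200
    return (((x - 123) & MASK) + 200) & MASK

def good(x):        # models: add $77
    return (x + 77) & MASK

def near_miss(x):   # agrees on the test vectors, differs at x = 200
    return 0 if x == 200 else (x + 77) & MASK

print(equivalent(target, good))           # True
print(execution_test(target, near_miss))  # True: the cheap test is fooled
print(equivalent(target, near_miss))      # False: the exhaustive check catches it
```

The cheap test rejects most wrong candidates quickly, so the expensive check runs only on the few that survive.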

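Peephole Superoptimization: Step 3

Table of Peephole Optimizations
pattern                                            replace-with
mov %eax, %ecx; mov %ecx, %eax                     mov %eax, %ecx
sub $123, %eax; add $456, %eax                     add $333, %eax
movl (%eax), %ecx; inc %ecx; movl %ecx, (%eax)     inc (%eax)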
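Once the table exists, applying it is a sliding-window lookup over the instruction stream. The sketch below is a simplified illustration: it matches exact instruction strings rather than canonicalized patterns, the pairing of patterns with replacements is inferred from the sequences shown in Steps 1 and 2, and `TABLE` and `apply_peephole` are hypothetical names.

```python
# Hypothetical table: tuple of instructions -> replacement tuple.
TABLE = {
    ("mov %eax, %ecx", "mov %ecx, %eax"): ("mov %eax, %ecx",),
    ("sub $123, %eax", "add $456, %eax"): ("add $333, %eax",),
    ("movl (%eax), %ecx", "inc %ecx", "movl %ecx, (%eax)"): ("inc (%eax)",),
}

def apply_peephole(code, table):
    """Replace every matching window, trying the longest pattern first."""
    max_len = max(len(pattern) for pattern in table)
    out, i = [], 0
    while i < len(code):
        for n in range(min(max_len, len(code) - i), 0, -1):
            window = tuple(code[i:i + n])
            if window in table:
                out.extend(table[window])
                i += n
                break
        else:
            # No pattern starts here: copy the instruction unchanged.
            out.append(code[i])
            i += 1
    return out

code = ["mov %eax, %ecx", "mov %ecx, %eax", "nop",
        "sub $123, %eax", "add $456, %eax"]
print(apply_peephole(code, TABLE))
# ['mov %eax, %ecx', 'nop', 'add $333, %eax']
```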

Talk Outline

• Superoptimization
• Peephole Superoptimizers
• Application to Binary Translation
• Implementation & Experimental Results
• Conclusion

Application to Binary Translation

• Our approach: Use lots of peephole transformations

pattern (ppc)      translate-to (x86)
addi r1,r1,1       inc %eax
mullw r1,r1,2      shl %eax
add r1,r1,r2       add %ecx,%eax

ppc -> x86 register map
r1 -> eax; r2 -> ecx
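A translation rule can be thought of as a source pattern plus a destination template that is filled in through the register map. The following Python sketch illustrates the idea using the three rules from the slide; the regular-expression encoding, the `RULES` list, and the `translate` function are assumptions for illustration and not the translator's actual data structures.

```python
import re

# Hypothetical encoding: (regex over a ppc instruction, x86 template).
# \1 forces the destination and first source register to be the same.
RULES = [
    (r"addi (\w+),\1,1", "inc %{0}"),
    (r"mullw (\w+),\1,2", "shl %{0}"),
    (r"add (\w+),\1,(\w+)", "add %{1},%{0}"),
]

def translate(ppc_instr, regmap):
    """Translate one ppc instruction to x86 under the given register map."""
    for pattern, template in RULES:
        match = re.fullmatch(pattern, ppc_instr)
        if match:
            x86_regs = [regmap[reg] for reg in match.groups()]
            return template.format(*x86_regs)
    raise ValueError(f"no rule for {ppc_instr!r}")

regmap = {"r1": "eax", "r2": "ecx"}
for instr in ["addi r1,r1,1", "mullw r1,r1,2", "add r1,r1,r2"]:
    print(instr, "->", translate(instr, regmap))
# addi r1,r1,1 -> inc %eax
# mullw r1,r1,2 -> shl %eax
# add r1,r1,r2 -> add %ecx,%eax
```

Changing `regmap` changes the emitted x86 code without touching the rules, which is why the register map is kept separate from the table.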

Peephole Binary Translation

source arch. (ppc)                             register map            destination arch. (x86)
mr r1, r2; mr r2, r1                           r1 -> eax; r2 -> ecx    mov %eax, %ecx
lis r1, 0x12; ori r1, r1, 0x3456               r1 -> Mr1               mov $0x123456, Mr1
ldl r2, (r1); addi r2, r2, 1; stl r2, (r1)     r1 -> eax; r2 -> ecx    inc (%eax)

Register Map Selection

• The best code may require changing the register map from one code point to another

• The choice of register map affects instruction selection, and vice versa

powerpc sequence:
li r1, 123
addi r2, r2, 1
subf r2, r1, r2
ori r1, r1, 31

x86 sequence: ?

Cost Model
• Instruction costs: if the instruction accesses memory, 10; else, 1
• Switching costs: R -> M or M -> R: 10

Program points: P0, P1, P2, P3

At entry: r1 -> Mr1; r2 -> Mr2
At exit: r1 -> Mr1; r2 -> Mr2

Example

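entry: r1 -> Mr1; r2 -> Mr2

ppc instruction    x86 translation    register map             instruction cost
li r1, 123         movl $123, Mr1     r1 -> Mr1                10
addi r2,r2,1       incl Mr2           r2 -> Mr2                10
subf r2,r1,r2      subl Mr1, eax      r1 -> Mr1; r2 -> eax     10
ori r1,r1,31       orl $31, Mr1       r1 -> Mr1                10

exit: r1 -> Mr1; r2 -> Mr2

Instruction cost total: 40
Switching cost total: 20
Grand Total: 60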

Greedy Strategy

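entry: r1 -> Mr1; r2 -> Mr2

ppc instruction    x86 translation    register map             instruction cost
li r1, 123         movl $123, eax     r1 -> eax                1
addi r2,r2,1       incl ecx           r2 -> ecx                1
subf r2,r1,r2      subl eax, ecx      r1 -> eax; r2 -> ecx     1
ori r1,r1,31       orl $31, eax       r1 -> eax                1

exit: r1 -> Mr1; r2 -> Mr2

Instruction cost total: 4
Switching cost total: 40
Grand Total: 44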

Register Map Selection: Optimal Solution

• Use Dynamic Programming
– near-optimal solution
– account for translations spanning multiple instructions
– simultaneously perform instruction-selection and register-mapping
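A dynamic program over register maps can be sketched on the four-instruction example using the cost model from the slides (memory access 10, otherwise 1, and 10 per register switched between R and M). This is a heavily simplified illustration, not the talk's algorithm: each instruction is reduced to the set of registers it touches, a register map is just "R" or "M" per register, and the rule that an x86 instruction may have at most one memory operand is an added assumption. All names are hypothetical.

```python
from itertools import product

INF = float("inf")
REGS = ("r1", "r2")
# Each PowerPC instruction, reduced to the registers it touches.
PROGRAM = [
    ("li r1, 123", {"r1"}),
    ("addi r2, r2, 1", {"r2"}),
    ("subf r2, r1, r2", {"r1", "r2"}),
    ("ori r1, r1, 31", {"r1"}),
]
STATES = list(product("RM", repeat=len(REGS)))  # e.g. ("M", "R")
ENTRY = EXIT = ("M", "M")

def instr_cost(used, state):
    in_memory = sum(1 for reg, loc in zip(REGS, state) if reg in used and loc == "M")
    if in_memory >= 2:
        return INF  # assumption: at most one memory operand per x86 instruction
    return 10 if in_memory else 1

def switch_cost(a, b):
    return 10 * sum(1 for x, y in zip(a, b) if x != y)

def path_cost(states):
    """Total cost of a fixed choice of register map per instruction."""
    total, prev = 0, ENTRY
    for (_, used), state in zip(PROGRAM, states):
        total += switch_cost(prev, state) + instr_cost(used, state)
        prev = state
    return total + switch_cost(prev, EXIT)

def best_register_maps():
    """Dynamic program: cheapest path ending in each map after each instruction."""
    best = {ENTRY: (0, [])}
    for _, used in PROGRAM:
        nxt = {}
        for state in STATES:
            candidates = [
                (cost + switch_cost(prev, state) + instr_cost(used, state), path + [state])
                for prev, (cost, path) in best.items()
            ]
            nxt[state] = min(candidates, key=lambda c: c[0])
        best = nxt
    return min(
        ((cost + switch_cost(state, EXIT), path) for state, (cost, path) in best.items()),
        key=lambda c: c[0],
    )

# The two assignments shown on the slides.
print(path_cost([("M", "M"), ("M", "M"), ("M", "R"), ("M", "R")]))  # 60
print(path_cost([("R", "R")] * 4))                                  # 44
print(best_register_maps())
```

Under these simplified assumptions the search settles on keeping r1 in a register and r2 in memory throughout, at a cost of 42. That figure is a property of this toy model only and says nothing about the result reported in the talk.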

Talk Outline

• Superoptimization
• Peephole Superoptimizers
• Application to Binary Translation
• Implementation & Experimental Results
• Conclusion

PowerPC -> x86 Translator Implementation

• Superoptimizer
– Use a PPC emulator (Qemu) for execution test
– Use a SAT solver (zChaff) for boolean test
• Static user-level translator
– ELF 32-bit ppc/Linux binary -> ELF 32-bit x86/Linux binary
– Translate most (but not all) system calls

Implementation

Endianness: ppc big-endian ; x86 little-endian

– Convert all memory writes to big-endian (source)

– Convert all memory reads to little-endian (dest)
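The endianness handling amounts to a byte swap around every memory access, so that memory always holds data in the source (big-endian) layout. The Python sketch below models this for 32-bit values on a `bytearray`; the helper names `bswap32`, `store32`, and `load32` are hypothetical and only illustrate the idea.

```python
def bswap32(value):
    """Reverse the byte order of a 32-bit value."""
    return int.from_bytes(value.to_bytes(4, "little"), "big")

def store32(mem, addr, value):
    # Swap first, then do the host's native little-endian store,
    # so memory ends up in the source (big-endian) layout.
    mem[addr:addr + 4] = bswap32(value).to_bytes(4, "little")

def load32(mem, addr):
    # Native little-endian load, then swap back to recover the value.
    return bswap32(int.from_bytes(mem[addr:addr + 4], "little"))

mem = bytearray(8)
store32(mem, 0, 0x12345678)
print(mem[0:4].hex())        # 12345678 (big-endian layout in memory)
print(hex(load32(mem, 0)))   # 0x12345678
```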

Compiler Optimizations
– Problem: PowerPC optimizer staggers data-dependent instructions to reduce pipeline stalls
– Solution: Cluster data-dependent instructions in basic block before translation

• Many Issues
– Condition Codes, Endianness, System Calls, Stack and Heap, Indirect Jumps, Function Calls and Returns, Register Name Constraints, Untranslated Opcodes, Compiler Optimizations

Experimental Results

• Setup
– Pentium4 3.0 GHz, 1MB Cache, 4GB Memory
– gcc 4.0.1, glibc 2.3.6
– Use soft-float library
– Statically-linked input executables
• Benchmarks
– Microbenchmarks, SPEC CINT2000
• Metrics
– Compare against natively-compiled code
– Compare against other binary translators: Qemu, Apple’s Rosetta

Experimental Setup

• For our experiments
– there are around 750 translation rules in the peephole table
– the translation table is computed offline and it can take up to a week to compute the peephole rules

Experimental Results: Setup

C source -> gcc <options> -arch=ppc -> PowerPC executable -> Peephole Binary Translation -> x86 executable
C source -> gcc <options> -arch=x86 -> x86 executable

Compare the two x86 executables.

Microbenchmarks

emptyloop A bounded for-loop doing nothing

fibo Compute first few fibonacci numbers

quicksort Quicksort on 64-bit integers

mergesort Mergesort on 64-bit integers

bubblesort Bubblesort on 64-bit integers

hanoi1 Towers of Hanoi Algorithm 1

traverse Traverse a linked list

binsearch Binary search on a sorted array

Microbenchmarks

[Chart: performance relative to native at O0, O2, and O2 -fomit-frame-pointer; avg: 90% of native]

Experimental Results: Microbenchmarks

• We sometimes outperform native performance on these small benchmarks!
– gcc generates better code for powerpc primarily because it has the luxury of many registers
– Our register-mapping algorithm performs an efficient “re-allocation” of the PowerPC registers to x86 registers.

Experimental Results: SPEC CINT2000

[Chart: SPEC CINT2000 performance relative to native]

Comparisons with Qemu and Rosetta

• Qemu
– Use same PowerPC and x86 executables as used for our own translator
• Rosetta
– Runs on Mac OS X and hence supports only Mac executables
– Recompiled the benchmarks on Mac using the same compiler version (gcc 4.0.1)
– Mac Hardware: Intel Core 2 Duo 1.83GHz processor, 32KB L1-cache, 2MB L2-cache and 2GB memory

Comparisons with Qemu and Rosetta

[Charts: qemu, rosetta, and peep compared at -O0 and -O2; captions: "avg: 3% faster than rosetta" and "avg: 12% faster than rosetta"]

Translation Time

• Takes 2-6 minutes to translate a 650KB executable (around 100K instructions)
– majority of time spent in optimal register map computation
• It is possible to reduce this to <10 seconds
– For 98K instructions (<0.01% of time), use any register map. Fast (<1 second)
– For other 2K, use optimal computation

Conclusions and Future Work

• A scheme to perform efficient binary translation using a superoptimizer
– Competitive performance
– Simplified Design
• Other applications
– Just-in-time compilation
– Machine virtualization

Q&A Thank you.

Backup Slides
