
SFO15-208: Improving the Optimizing Compiler in Android Runtime (ART), plans & status

Date post: 19-Feb-2017
Upload: linaro
Diving into the ART Optimizing Compiler

Presented by: Alexandre Rames
Date: Tuesday 22 September 2015
Event: SFO15

1

Agenda

● Compilation process overview
● Development
  ○ Compile, run
  ○ Profile, analyze
  ○ Optimize
● Topics
  ○ Development issues
  ○ What we are working on

2

Compilation process overview

[Diagram: from Java source to native code]

*.java
  ↓ javac + dx, or jack and jill
DEX bytecode
  ↓ dex2oat
HGraph, made of HBasicBlocks:
  HInstruction, HGoto
  ..., HInstruction, HGoto
  ..., HInstruction, HGoto
  HEqual, HIf
  HReturn, HExit
  ↓ optimizations
    ● Inlining
    ● Dead code elimination
    ● Strength reduction
    ● Constant folding
    ● GVN
    ● Loop invariant code motion
    ● Bounds check elimination
  ↓
.oat (ELF file)
  sub sp, sp, #...
  ...
  cmp x0, x1
  b.eq true_target
  ...
  ret

3

[Diagram label: pushed on target]

Compile, run

# Setup and compile AOSP.
cd <aosp>
source build/envsetup.sh
lunch aosp_arm64-eng
make -j40 && cd art && mma -j40

# Compile the java code.
javac Main.java
dx --dex --output=Main.dex Main.class

# Run on target.
export R_PATH=/data/local/tmp
adb push Main.dex $R_PATH
adb shell "ANDROID_DATA=$R_PATH DEX_LOCATION=$R_PATH dalvikvm64 -cp $R_PATH/Main.dex Main"

4

Cross-compile, analyze

● Analyze compilation through the ‘.cfg’ file.

# Cross-compile on host for arm64.
dex2oatd --runtime-arg -Xms64m --runtime-arg -Xmx512m \
  --boot-image=$ANDROID_PRODUCT_OUT/dex_bootjars/system/framework/boot.art \
  --dex-file=`pwd`/Main.dex --oat-file=`pwd`/Main.oat \
  --android-root=$ANDROID_PRODUCT_OUT/system \
  --runtime-arg -Xnorelocate \
  --instruction-set=arm64 --instruction-set-features=default \
  -j1 --dump-cfg=art.cfg

● The output .cfg contains dumps of the graph before and after each pass.
  ○ Also includes interleaved disassembly of the generated code.
● Tools can help visualize the art.cfg file.
  ○ C1Visualizer
  ○ IR Hydra

5

Code sample

● RGB lossy compression
  From: 3 8-bit channels (24-bit)
  To: 5-6-5 R-G-B channels (16-bit)

final int mask_red = [...];
final int shift_right_red = [...];
[...]

public void Compress() {
  for (int i = 0; i < this.array.length; ++i) {
    int res_low_red = (array[i] >>> shift_right_red) & mask_red;
    int res_low_green = (array[i] >>> shift_right_green) & mask_green;
    int res_low_blue = (array[i] >>> shift_right_blue) & mask_blue;

    array[i] = res_low_red | res_low_green | res_low_blue;
  }
}

[Figure: bit layout of array[i]. Before: R in bits 23-16, G in bits 15-8, B in bits 7-0. After: R in bits 15-11, G in bits 10-5, B in bits 4-0.]

6
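The code sample leaves its constants elided. To try it outside the slides, a minimal runnable sketch might look like the following. The shift and mask values are assumptions picked for a standard 5-6-5 layout, and the names `Rgb565Sketch` and `compressPixel` are illustrative only.

```java
// Minimal sketch of the RGB888 -> RGB565 loop from the code sample.
// ASSUMPTION: the shift and mask constants are not given on the slide;
// the values below describe a standard 5-6-5 layout.
final class Rgb565Sketch {
    static final int SHIFT_RIGHT_RED = 8;
    static final int SHIFT_RIGHT_GREEN = 5;
    static final int SHIFT_RIGHT_BLUE = 3;
    static final int MASK_RED = 0xF800;   // bits 15..11
    static final int MASK_GREEN = 0x07E0; // bits 10..5
    static final int MASK_BLUE = 0x001F;  // bits 4..0

    // Compress one 24-bit pixel into 16 bits.
    static int compressPixel(int rgb888) {
        int red = (rgb888 >>> SHIFT_RIGHT_RED) & MASK_RED;
        int green = (rgb888 >>> SHIFT_RIGHT_GREEN) & MASK_GREEN;
        int blue = (rgb888 >>> SHIFT_RIGHT_BLUE) & MASK_BLUE;
        return red | green | blue;
    }

    // Same shape as Compress() in the sample: compress every element in place.
    static void compress(int[] array) {
        for (int i = 0; i < array.length; ++i) {
            array[i] = compressPixel(array[i]);
        }
    }

    public static void main(String[] args) {
        int[] pixels = {0x00FF0000, 0x0000FF00, 0x000000FF, 0x00FFFFFF};
        compress(pixels);
        for (int p : pixels) {
            System.out.println(String.format("0x%04X", p));
        }
    }
}
```

With these assumed constants, pure red, green, and blue inputs map to `0xF800`, `0x07E0`, and `0x001F` respectively.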

Graph in SSA form

B0 // Function entry
v10 Goto

B2 // Loop header

v120 Exit

v119 ReturnVoid

B3 // Loop body
[...]
v118 Goto

7

Graph in SSA form

B0 // Function entry
l6 ParameterValue // class object
i8 IntConstant 0
i38 IntConstant 8
[...]
v10 Goto

B2 // Loop header
i124 Phi [i8,i115]
v123 SuspendCheck
l12 NullCheck [l6]
l13 InstanceFieldGet [l12]
l16 NullCheck [l13]
i17 ArrayLength [l16]
z21 GreaterThanOrEqual [i124,i17]
v22 If [z21]

v120 Exit

v119 ReturnVoid

B3 // Loop body
[...]
v118 Goto

8

Loop body in SSA form

l24 NullCheck [l6]
l25 InstanceFieldGet [l24]
l28 NullCheck [l25]
i30 ArrayLength [l28]
i33 BoundsCheck [i124,i30]
i35 ArrayGet [l28,i33]
i39 UShr [i35,i38]
i45 And [i39,i41]
[... similar code for ‘green’]
[... similar code for ‘blue’]
l92 NullCheck [l6]
l93 InstanceFieldGet [l92]
i97 Or [i45,i67]
i101 Or [i97,i89]
l104 NullCheck [l93]
i106 ArrayLength [l104]
i109 BoundsCheck [i124,i106]
v112 ArraySet [l104,i109,i101]

Corresponding source, annotated:

res_low_red = (array[i] >>> shift_right_red) & mask_red;
  // Get `array` from the class object.
  // Check it is non-null.
  // Get its length.
  // Check the length is greater than `i`.
  // Get `array[i]`.
  // Shift it right.
  // Mask the resulting value.

res_low_green = (array[i] >>> shift_right_green) & mask_green;
res_low_blue = (array[i] >>> shift_right_blue) & mask_blue;

array[i] = res_low_red | res_low_green | res_low_blue;
  // Get `array` from the class object.
  // Compute the compressed value.
  // Check that `array` is non-null.
  // Check its length is greater than `i`.
  // Store the result.

9
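The checks listed in the annotated loop body can be pictured in plain Java. This is only a sketch of what the IR expresses: `checkedGet` is a hypothetical helper, and the compiler represents each check as a separate IR instruction rather than calling anything like it.

```java
// Sketch: the checks behind a single `array[i]` read, written out by hand.
// ASSUMPTION: `checkedGet` is a hypothetical helper, not an ART or Java API.
final class ExplicitChecksSketch {
    static int checkedGet(int[] array, int i) {
        if (array == null) {                        // NullCheck
            throw new NullPointerException();
        }
        int length = array.length;                  // ArrayLength
        if (i < 0 || i >= length) {                 // BoundsCheck
            throw new ArrayIndexOutOfBoundsException(i);
        }
        return array[i];                            // ArrayGet
    }

    public static void main(String[] args) {
        int[] data = {10, 20, 30};
        System.out.println(checkedGet(data, 1));
        try {
            checkedGet(data, 3);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("index 3 is out of bounds");
        }
    }
}
```

Each source-level `array[i]` in the sample carries this whole sequence, which is why the unoptimized loop body repeats the same checks for red, green, and blue.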

// Function entry
l6 ParameterValue // class object
i8 IntConstant 0
i38 IntConstant 8
[...]
v10 Goto

// Loop header
i124 Phi [i8,i115]
v123 SuspendCheck
l13 InstanceFieldGet [l6]
l16 NullCheck [l13]
i17 ArrayLength [l16]
z21 GreaterThanOrEqual [i124,i17]
v22 If [z21]

// code for ‘red’
l25 InstanceFieldGet [l6]
l28 NullCheck [l25]
i30 ArrayLength [l28]
i33 BoundsCheck [i124,i30]
i35 ArrayGet [l28,i33]
i39 UShr [i35,i38]
i45 And [i39,i41]

// code for ‘green’
l49 InstanceFieldGet [l6]
l52 NullCheck [l49]
i54 ArrayLength [l52]
i57 BoundsCheck [i124,i54]
i59 ArrayGet [l52,i57]
i63 UShr [i59,i62]
i67 And [i63,i66]

[... similar code for ‘blue’]

l93 InstanceFieldGet [l6]
i97 Or [i45,i67]
i101 Or [i97,i89]
l104 NullCheck [l93]
[...]

10


Global Value Numbering

// Function entry
l6 ParameterValue // class object
i8 IntConstant 0
i38 IntConstant 8
[...]
v10 Goto

// Loop header
i124 Phi [i8,i115]
v123 SuspendCheck
l13 InstanceFieldGet [l6]
l16 NullCheck [l13]
i17 ArrayLength [l16]
z21 GreaterThanOrEqual [i124,i17]
v22 If [z21]

// code for ‘red’
l25 InstanceFieldGet [l6]
l28 NullCheck [l25]
i30 ArrayLength [l28]
i33 BoundsCheck [i124,i17]
i35 ArrayGet [l13,i33]
i39 UShr [i35,i38]
i45 And [i39,i41]

// code for ‘green’
l49 InstanceFieldGet [l6]
l52 NullCheck [l49]
i54 ArrayLength [l52]
i57 BoundsCheck [i124,i17]
i59 ArrayGet [l13,i57]
i63 UShr [i59,i62]
i67 And [i63,i66]

[... similar code for ‘blue’]

l93 InstanceFieldGet [l6]
i97 Or [i45,i67]
i101 Or [i97,i89]
l104 NullCheck [l93]
[...]

13

Global Value Numbering

// Function entry
l6 ParameterValue // class object
i8 IntConstant 0
i38 IntConstant 8
[...]
v10 Goto

// Loop header
i124 Phi [i8,i115]
v123 SuspendCheck
l13 InstanceFieldGet [l6]
l16 NullCheck [l13]
i17 ArrayLength [l16]
z21 GreaterThanOrEqual [i124,i17]
v22 If [z21]

// code for ‘red’
i33 BoundsCheck [i124,i17]
i35 ArrayGet [l13,i33]
i39 UShr [i35,i38]
i45 And [i39,i41]

// code for ‘green’
i57 BoundsCheck [i124,i17]
i59 ArrayGet [l13,i57]
i63 UShr [i59,i62]
i67 And [i63,i66]

// code for ‘blue’
i79 BoundsCheck [i124,i17]
i81 ArrayGet [l13,i79]
i85 UShr [i81,i84]
i89 And [i85,i88]

l93 InstanceFieldGet [l6]
i97 Or [i45,i67]
i101 Or [i97,i89]
l104 NullCheck [l93]
[...]

14


Global Value Numbering

// Function entry
l6 ParameterValue // class object
i8 IntConstant 0
i38 IntConstant 8
[...]
v10 Goto

// Loop header
i124 Phi [i8,i115]
v123 SuspendCheck
l13 InstanceFieldGet [l6]
l16 NullCheck [l13]
i17 ArrayLength [l16]
z21 GreaterThanOrEqual [i124,i17]
v22 If [z21]

// code for ‘red’
i39 UShr [i35,i38]
i45 And [i39,i41]

// code for ‘green’
i63 UShr [i59,i62]
i67 And [i63,i66]

// code for ‘blue’
i85 UShr [i81,i84]
i89 And [i85,i88]

l93 InstanceFieldGet [l6]
i97 Or [i45,i67]
i101 Or [i97,i89]
l104 NullCheck [l93]
[...]

17
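The redundancy removal shown on the Global Value Numbering slides can be sketched on a toy IR. The code below is a simplified value numbering over a straight-line list of instructions, not ART's implementation: `Instr` and `findRedundant` are hypothetical names, and every instruction is assumed to be side-effect free, which a real pass cannot assume.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy value numbering. ASSUMPTION: all instructions are pure, so two
// instructions with the same operation and the same inputs are equivalent.
final class ValueNumberingSketch {
    // A toy instruction: an id, an operation name, and the ids of its inputs.
    static final class Instr {
        final String id;
        final String op;
        final List<String> inputs;

        Instr(String id, String op, String... inputs) {
            this.id = id;
            this.op = op;
            this.inputs = Arrays.asList(inputs);
        }
    }

    // Maps each redundant instruction id to the earlier equivalent id.
    static Map<String, String> findRedundant(List<Instr> code) {
        Map<String, String> available = new HashMap<>();        // key -> first id
        Map<String, String> replacedBy = new LinkedHashMap<>();
        for (Instr instr : code) {
            // Rewrite inputs first, so chains of redundant values collapse.
            List<String> canonical = new ArrayList<>();
            for (String input : instr.inputs) {
                canonical.add(replacedBy.getOrDefault(input, input));
            }
            String key = instr.op + canonical;
            String earlier = available.putIfAbsent(key, instr.id);
            if (earlier != null) {
                replacedBy.put(instr.id, earlier);
            }
        }
        return replacedBy;
    }

    public static void main(String[] args) {
        List<Instr> code = Arrays.asList(
            new Instr("l13", "InstanceFieldGet", "l6"),
            new Instr("l16", "NullCheck", "l13"),
            new Instr("i17", "ArrayLength", "l16"),
            new Instr("l25", "InstanceFieldGet", "l6"),
            new Instr("l28", "NullCheck", "l25"),
            new Instr("i30", "ArrayLength", "l28"));
        System.out.println(findRedundant(code));
    }
}
```

On the instruction names taken from the slides, the sketch reports `l25`, `l28`, and `i30` as redundant with `l13`, `l16`, and `i17`, which matches the `BoundsCheck [i124,i17]` rewrite shown in the graphs.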

Loop-invariant code motion

// Function entry
l6 ParameterValue // class object
i8 IntConstant 0
i38 IntConstant 8
[...]
v10 Goto

// Loop header
i124 Phi [i8,i115]
v123 SuspendCheck
l13 InstanceFieldGet [l6]
l16 NullCheck [l13]
i17 ArrayLength [l16]
z21 GreaterThanOrEqual [i124,i17]
v22 If [z21]

// common
i33 BoundsCheck [i124,i17]
i35 ArrayGet [l16,i33]

// ‘red’
i39 UShr [i35,i38]
i45 And [i39,i41]

// ‘green’
i63 UShr [i35,i62]
i67 And [i63,i66]

// ‘blue’
i85 UShr [i35,i84]
i89 And [i85,i88]

i97 Or [i45,i67]
i101 Or [i89,i97]
v112 ArraySet [l16,i33,i101]

i115 Add [i124,i114]
v118 Goto

18

Graph after Loop-invariant code motion

// Function entry
l6 ParameterValue // class object
i8 IntConstant 0
i38 IntConstant 8
[...]
l13 InstanceFieldGet [l6]
l16 NullCheck [l13]
i17 ArrayLength [l16]
v10 Goto

// Loop header
i124 Phi [i8,i115]
v123 SuspendCheck
z21 GreaterThanOrEqual [i124,i17]
v22 If [z21]

i33 BoundsCheck [i124,i17]
i35 ArrayGet [l16,i33]

i39 UShr [i35,i38]
i45 And [i39,i41]

i63 UShr [i35,i62]
i67 And [i63,i66]

i85 UShr [i35,i84]
i89 And [i85,i88]

i97 Or [i45,i67]
i101 Or [i89,i97]
v112 ArraySet [l16,i33,i101]

i115 Add [i124,i114]
v118 Goto

19
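At source level, the effect of loop-invariant code motion on the loop header can be pictured as a hand-hoisted loop. This is a sketch under stated assumptions: the class `LicmSketch` and its summing loop are invented for illustration, and the hoisting is only valid because nothing inside the loop writes the `array` field.

```java
// Source-level picture of loop-invariant code motion, hoisted by hand.
// ASSUMPTION: nothing in the loop body writes `this.array`, so its value
// and its length are the same on every iteration.
final class LicmSketch {
    int[] array;

    LicmSketch(int[] array) {
        this.array = array;
    }

    // Before: the field load and the length read sit in the loop header.
    int sumBefore() {
        int sum = 0;
        for (int i = 0; i < this.array.length; ++i) {
            sum += this.array[i];
        }
        return sum;
    }

    // After: the invariant values are computed once, ahead of the loop.
    int sumAfter() {
        int[] local = this.array;   // InstanceFieldGet
        int length = local.length;  // NullCheck + ArrayLength
        int sum = 0;
        for (int i = 0; i < length; ++i) {
            sum += local[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        LicmSketch s = new LicmSketch(new int[] {1, 2, 3, 4});
        System.out.println(s.sumBefore() + " " + s.sumAfter());
    }
}
```

Both methods return the same value; the second simply does less work per iteration, which is what moving `l13`, `l16`, and `i17` into the function entry block achieves.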

Let’s optimize the compiler

● Finding what to do:
  ○ Profile your code and look at hot-spots, if any.
  ○ Optimizations implemented on other archs?
  ○ Look for common or bad code patterns.
● Evaluating an optimization.
  ○ Performance, code size, and compilation impact
  ○ Frequency of the optimization
  ○ Reverse-estimation: instead of implementing the optimization, insert nops or duplicate the code to assess the impact.
● Implementing:
  ○ Start by having a look at other patches upstream.
  ○ The code is clean and nicely modular (passes).

20

Profiling

● Linux perf in AOSP
● Use as usual
  ○ perf record <command>
  ○ perf report
● See additional slides for example commands.
● Streamline also works and has interesting features.

21

Sample perf output

Overhead  Command  Shared Object                 Symbol
67.61%    main     data@local@[email protected]  [.] void Main.Compress()
27.73%    main     data@local@[email protected]  [.] void Main.Init()
[...]

22

Sample perf output

void Main.Compress() /data/local/tmp/dalvik-cache/arm6...
       │ mov w3, #0
  0.01 │ cmp w3, w2
  6.77 │ b.ge 2348 <void Main.Compress()+0x6c>
  0.04 │ add w16, w0, #0xc
 13.71 │ ldr w4, [x16,x3,lsl #2]
 19.97 │ lsr w5, w4, #8
  6.31 │ and w5, w5, #<mask red>
       │ lsr w6, w4, #5
  6.31 │ and w6, w6, #<mask green>
       │ lsr w7, w4, #3
  6.47 │ and w4, w7, #<mask blue>
       │ orr w5, w5, w6
  6.63 │ orr w4, w4, w5
       │ add w16, w0, #0xc
 13.61 │ str w4, [x16,x3,lsl #2]
       │ add w3, w3, #1

23


Sample perf output

Annotations on the listing above:
● `add w16, w0, #0xc`: this instruction is duplicated for every access to the array.
● Each `lsr` followed by an `and`: these should be merged.

25

Sample perf output

The IR corresponding to the loop body in the listing above:

i35 ArrayGet [l16,i124]
i39 UShr [i35,i38]
i45 And [i39,i41]
i63 UShr [i35,i62]
i67 And [i63,i66]
i85 UShr [i35,i84]
i89 And [i85,i88]
i97 Or [i45,i67]
i101 Or [i89,i97]
v112 ArraySet [l16,i124,i101]
i115 Add [i124,i114]

26


Architecture-specific optimizations
Design choices

[Figure: a + b → ADD → HAdd → LAdd ? → add]

● Architecture-specific IRs
● At the same IR level
  ○ Lower the IRs as necessary instead of having a full IR level. This avoids a lot of work for the translation from HIR to LIR.
  ○ We want to reuse existing optimization passes (e.g. GVN and LICM).
● So the framework is very simple.
  ○ New HArm64DoStuff IRs.
  ○ New arch-specific optimization passes.

28

Optimizing array accesses

● New ARM64 instruction simplification pass
● First, simply ‘split’ array accesses.

Before:
i35 ArrayGet [l16,i124]
i39 UShr [i35,i38]
i45 And [i39,i41]
i63 UShr [i35,i62]
i67 And [i63,i66]
i85 UShr [i35,i84]
i89 And [i85,i88]
i97 Or [i45,i67]
i101 Or [i89,i97]
v112 ArraySet [l16,i124,i101]
[...]

After, with the corresponding instruction for each IR:
l128 Arm64IntermediateAddress [l16,i127]    add
i35 ArrayGet [l128,i124]                    ldr
i39 UShr [i35,i38]                          lsr
i45 And [i39,i41]                           and
i63 UShr [i35,i62]                          lsr
i67 And [i63,i66]                           and
i85 UShr [i35,i84]                          lsr
i89 And [i85,i88]                           and
i97 Or [i45,i67]                            orr
i101 Or [i89,i97]                           orr
l129 Arm64IntermediateAddress [l16,i127]    add
v112 ArraySet [l129,i124,i101]              str
[...]

29

Optimizing array accesses

● Then we re-run GVN!

Before:
i35 ArrayGet [l16,i124]
i39 UShr [i35,i38]
i45 And [i39,i41]
i63 UShr [i35,i62]
i67 And [i63,i66]
i85 UShr [i35,i84]
i89 And [i85,i88]
i97 Or [i45,i67]
i101 Or [i89,i97]
v112 ArraySet [l16,i124,i101]
[...]

After, with the corresponding instruction for each IR:
l128 Arm64IntermediateAddress [l16,i127]    add
i35 ArrayGet [l128,i124]                    ldr
i39 UShr [i35,i38]                          lsr
i45 And [i39,i41]                           and
i63 UShr [i35,i62]                          lsr
i67 And [i63,i66]                           and
i85 UShr [i35,i84]                          lsr
i89 And [i85,i88]                           and
i97 Or [i45,i67]                            orr
i101 Or [i89,i97]                           orr
l129 Arm64IntermediateAddress [l16,i127]    add
v112 ArraySet [l128,i124,i101]              str
[...]

30
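The arithmetic behind the split can be checked with a few lines of Java. This is a sketch, not ART code: the class and method names are hypothetical, the data offset of 12 is taken from the `add w16, w0, #0xc` instruction in the listings, and the shift of 2 from the `lsl #2` addressing mode for 4-byte elements.

```java
// Sketch of the address arithmetic behind `Arm64IntermediateAddress`.
// ASSUMPTIONS: DATA_OFFSET (0xc) and ELEMENT_SHIFT (lsl #2) are read off
// the generated code on the slides; all names here are hypothetical.
final class IntermediateAddressSketch {
    static final long DATA_OFFSET = 12;
    static final int ELEMENT_SHIFT = 2;

    // Unsplit form: the whole address is recomputed for every access.
    static long elementAddress(long arrayBase, long index) {
        return arrayBase + DATA_OFFSET + (index << ELEMENT_SHIFT);
    }

    // Split form, part 1: the index-independent part (the `add`).
    static long intermediateAddress(long arrayBase) {
        return arrayBase + DATA_OFFSET;
    }

    // Split form, part 2: what `ldr`/`str [x16, x3, lsl #2]` adds on top.
    static long elementAddressFrom(long intermediate, long index) {
        return intermediate + (index << ELEMENT_SHIFT);
    }

    public static void main(String[] args) {
        long base = 0x1000;
        long intermediate = intermediateAddress(base); // computed once
        for (long i = 0; i < 4; ++i) {
            System.out.println(elementAddress(base, i) == elementAddressFrom(intermediate, i));
        }
    }
}
```

Because the intermediate address depends only on the array base, the load and the store in the loop body produce identical `Arm64IntermediateAddress` instructions, and that is what lets GVN drop the second one.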

Using the shifter operand

● New HArm64DataProcWithShifterOp instruction
● The merging is implemented in the same pass

Before:
i35 ArrayGet [l16,i124]
i39 UShr [i35,i38]
i45 And [i39,i41]
i63 UShr [i35,i62]
i67 And [i63,i66]
i85 UShr [i35,i84]
i89 And [i85,i88]
i97 Or [i45,i67]
i101 Or [i89,i97]
v112 ArraySet [l16,i124,i101]
[...]

After:
l128 Arm64IntermediateAddress [l16,i127]
i35 ArrayGet [l128,i124]
i129 Arm64DataProcWithShifterOp [i41,i35] (And+LSR 8)
i130 Arm64DataProcWithShifterOp [i66,i35] (And+LSR 5)
i131 Arm64DataProcWithShifterOp [i88,i35] (And+LSR 3)
i97 Or [i129,i130]
i101 Or [i131,i97]
l132 Arm64IntermediateAddress [l16,i127]
v112 ArraySet [l132,i124,i101]
[...]

31
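The merge of a shift into the instruction that uses it can be sketched as a pattern match on a toy expression tree. All node classes and the `merge` function below are hypothetical stand-ins for the HIR classes named on the slide, and the sketch skips conditions a real pass would need, such as checking that the shift has no other users.

```java
// Toy pattern match: And(UShr(x, n), y) becomes one node that evaluates
// y & (x >>> n), like `and wd, wn, wm, lsr #n` on ARM64.
// ASSUMPTION: all classes here are illustrative, not ART's HInstructions.
final class ShifterOperandSketch {
    interface Node {
        int eval(int input);
    }

    static final class Input implements Node {
        public int eval(int input) { return input; }
    }

    static final class Const implements Node {
        final int value;
        Const(int value) { this.value = value; }
        public int eval(int input) { return value; }
    }

    static final class UShr implements Node {
        final Node value;
        final int amount;
        UShr(Node value, int amount) { this.value = value; this.amount = amount; }
        public int eval(int input) { return value.eval(input) >>> amount; }
    }

    static final class And implements Node {
        final Node left;
        final Node right;
        And(Node left, Node right) { this.left = left; this.right = right; }
        public int eval(int input) { return left.eval(input) & right.eval(input); }
    }

    // Merged node: left & (shifted >>> amount) as a single operation.
    static final class AndWithLsr implements Node {
        final Node left;
        final Node shifted;
        final int amount;
        AndWithLsr(Node left, Node shifted, int amount) {
            this.left = left;
            this.shifted = shifted;
            this.amount = amount;
        }
        public int eval(int input) { return left.eval(input) & (shifted.eval(input) >>> amount); }
    }

    // Merge a UShr feeding either side of an And; leave other nodes alone.
    static Node merge(Node node) {
        if (node instanceof And) {
            And and = (And) node;
            if (and.left instanceof UShr) {
                UShr shr = (UShr) and.left;
                return new AndWithLsr(and.right, shr.value, shr.amount);
            }
            if (and.right instanceof UShr) {
                UShr shr = (UShr) and.right;
                return new AndWithLsr(and.left, shr.value, shr.amount);
            }
        }
        return node;
    }

    public static void main(String[] args) {
        Node red = new And(new UShr(new Input(), 8), new Const(0xF800));
        Node merged = merge(red);
        System.out.println(merged.eval(0x00FF0000) == red.eval(0x00FF0000));
    }
}
```

The merged node takes the mask and the unshifted value as inputs, mirroring `Arm64DataProcWithShifterOp [i41,i35] (And+LSR 8)`; the mask `0xF800` used in `main` is an assumed value for illustration.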

Generated code for the loop body

l128 Arm64IntermediateAddress [l13,i127]
i35 ArrayGet [l128,i124]
    add w7, w0, #0xc (12)
    ldr w8, [x7, x3, lsl #2]

i129 Arm64DataProcWithShifterOp [i41,i35]
i130 Arm64DataProcWithShifterOp [i66,i35]
i131 Arm64DataProcWithShifterOp [i88,i35]
    and w9, w6, w8, lsr #8
    and w10, w5, w8, lsr #5
    and w8, w4, w8, lsr #3

i97 Or [i129,i130]
i101 Or [i97,i131]
    orr w9, w9, w10
    orr w8, w9, w8

v112 ArraySet [l128,i124,i101]
    str w8, [x7, x3, lsl #2]

i115 Add [i124,i114]
    add w3, w3, #0x1 (1)

v118 Goto
    ldrh w16, [tr] ; state_and_flags
    cbz w16, #-0x30 (addr 0x30)
    b #+0x24 (addr 0x88)

32

Benchmarking the code sample

● Very targeted benchmark
● Higher is better

                     Cortex A53    Cortex A57
base                 (baseline)    (baseline)
+ array split        +9%           +8%
+ shifter operand    +27%          +17%

33

Upstreaming

● Don’t forget to add tests.
● Don’t forget to run all the tests.
● Development happens upstream.
  ○ Our patches first go through Linaro's gerrit.
● Instructions on Linaro’s wiki
● Android documentation for submitting patches

34

Discussion points

● Doing something is easy. Recognizing it can be done is hard.
● Easy to extend
  ○ Compilation passes
  ○ VIXL is used for ARM64 code generation
● Don’t forget the compilation time.
● Multiple architectures (arm, arm64, x86, x86_64, mips64).
  ○ Most of the compilation process is architecture-agnostic.
  ○ Could be an issue for some ideas.
    ■ Condition flags production/consumption as side effects of IRs?

35

Development issues

● Development platform
  ○ Non-Nexus devices don’t always work with upstream.
  ○ Beware frequency scaling and other pitfalls.
● Different behaviours on different CPUs
  ○ Optimizations should be ARM-generic.
  ○ Avoid CPU-specific optimizations.
● Where are my representative benchmarks?
  ○ On the command-line, please!
● Why does Android take so long to compile?
  ○ Things are not that bad with a 20-core machine with hyper-threading.

36

What we are working on

● Command-line Java benchmarks
  ○ LMG hacking room
● More instruction simplification patterns / IR lowering
● Instruction scheduling
● SlowPaths sharing
● Intrinsics
● ARM64 simulator
  ○ To run and debug tests on host.

37

Summary

● Extensible compiler
● Good tools for profiling, analyzing, testing, and debugging
● Architecture-specific optimizations
  ○ At the same IR level
  ○ Allowed to re-use existing passes
● It generates good code, but there is a lot more to do!

38

Additional slides

39

Sample `.cfg` output

begin_compilation
  name "void Compression.Compress()"
begin_cfg
  name "ssa_builder (after)"
  begin_block
    name "B0"
    begin_HIR
      0 5 l6 ParameterValue <|@
      0 1 i8 IntConstant 0 <|@
    end_HIR
  end_block
  begin_block
    name "B3"
    begin_HIR
      0 1 l24 NullCheck [l6] env:[[i124,i17,_,_,_,l6]] <|@
      0 1 l25 InstanceFieldGet [l24] <|@
      0 2 l28 NullCheck [l25] env:[[i124,l25,_,_,_,l6]] <|@

40

Instruction simplification

(The `NullCheck [l6]` instructions are removed; inputs that used them are rewritten to `l6`, shown as "old → new".)

// Function entry
l6 ParameterValue // class object
i8 IntConstant 0
i38 IntConstant 8
[...]
v10 Goto

// Loop header
i124 Phi [i8,i115]
v123 SuspendCheck
l12 NullCheck [l6]
l13 InstanceFieldGet [l12 → l6]
l16 NullCheck [l13]
i17 ArrayLength [l16]
z21 GreaterThanOrEqual [i124,i17]
v22 If [z21]

l24 NullCheck [l6]
l25 InstanceFieldGet [l24 → l6]
l28 NullCheck [l25]
i30 ArrayLength [l28]
i33 BoundsCheck [i124,i30]
i35 ArrayGet [l28,i33]
i39 UShr [i35,i38]
i45 And [i39,i41]

l52 NullCheck [l6]
[... similar code for ‘green’]
l74 NullCheck [l6]
[... similar code for ‘blue’]

l92 NullCheck [l6]
l93 InstanceFieldGet [l92 → l6]
i97 Or [i45,i67]
i101 Or [i97,i89]
l104 NullCheck [l93]
i106 ArrayLength [l104]
i109 BoundsCheck [i124,i106]
v112 ArraySet [l104,i109,i101]

41

Instruction simplification

// Function entry
l6 ParameterValue // class object
i8 IntConstant 0
i38 IntConstant 8
[...]
v10 Goto

// Loop header
i124 Phi [i8,i115]
v123 SuspendCheck
l13 InstanceFieldGet [l6]
l16 NullCheck [l13]
i17 ArrayLength [l16]
z21 GreaterThanOrEqual [i124,i17]
v22 If [z21]

l25 InstanceFieldGet [l6]
l28 NullCheck [l25]
i30 ArrayLength [l28]
i33 BoundsCheck [i124,i30]
i35 ArrayGet [l28,i33]
i39 UShr [i35,i38]
i45 And [i39,i41]

[... similar code for ‘green’]
[... similar code for ‘blue’]

l93 InstanceFieldGet [l6]
i97 Or [i45,i67]
i101 Or [i97,i89]
l104 NullCheck [l93]
i106 ArrayLength [l104]
i109 BoundsCheck [i124,i106]
v112 ArraySet [l104,i109,i101]

[...]

42


Bounds-check elimination

(The `BoundsCheck` is removed; inputs that used it are rewritten to `i124`, shown as "old → new".)

// Function entry
l6 ParameterValue // class object
i8 IntConstant 0
i38 IntConstant 8
[...]
l13 InstanceFieldGet [l6]
l16 NullCheck [l13]
i17 ArrayLength [l16]
v10 Goto

// Loop header
i124 Phi [i8,i115]
v123 SuspendCheck
z21 GreaterThanOrEqual [i124,i17]
v22 If [z21]

i33 BoundsCheck [i124,i17]
i35 ArrayGet [l16,i33 → i124]

i39 UShr [i35,i38]
i45 And [i39,i41]

i63 UShr [i35,i62]
i67 And [i63,i66]

i85 UShr [i35,i84]
i89 And [i85,i88]

i97 Or [i45,i67]
i101 Or [i89,i97]
v112 ArraySet [l16,i33 → i124,i101]

i115 Add [i124,i114]
v118 Goto

44

Loop body before and after

Before:
l24 NullCheck [l6]
l25 InstanceFieldGet [l24]
l28 NullCheck [l25]
i30 ArrayLength [l28]
i33 BoundsCheck [i124,i30]
i35 ArrayGet [l28,i33]
i39 UShr [i35,i38]
i45 And [i39,i41]
l48 NullCheck [l6]
l49 InstanceFieldGet [l48]
l52 NullCheck [l49]
i54 ArrayLength [l52]
i57 BoundsCheck [i124,i54]
i59 ArrayGet [l52,i57]
i63 UShr [i59,i62]
i67 And [i63,i66]
l70 NullCheck [l6]
l71 InstanceFieldGet [l70]
l74 NullCheck [l71]
[... 16 more IRs]

After:
i35 ArrayGet [l16,i124]
i39 UShr [i35,i38]
i45 And [i39,i41]
i63 UShr [i35,i62]
i67 And [i63,i66]
i85 UShr [i35,i84]
i89 And [i85,i88]
i97 Or [i45,i67]
i101 Or [i89,i97]
v112 ArraySet [l16,i124,i101]
i115 Add [i124,i114]
v118 Goto

45

Graph after all optimization passes

// Function entry
l6 ParameterValue // class object
i8 IntConstant 0
i38 IntConstant 8
[...]
l13 InstanceFieldGet [l6]
l16 NullCheck [l13]
i17 ArrayLength [l16]
v10 Goto

// Loop header
i124 Phi [i8,i115]
v123 SuspendCheck
z21 GreaterThanOrEqual [i124,i17]
v22 If [z21]

i35 ArrayGet [l16,i124]

i39 UShr [i35,i38]
i45 And [i39,i41]

i63 UShr [i35,i62]
i67 And [i63,i66]

i85 UShr [i35,i84]
i89 And [i85,i88]

i97 Or [i45,i67]
i101 Or [i89,i97]
v112 ArraySet [l16,i124,i101]

i115 Add [i124,i114]
v118 Goto

46

Generated code for the loop body

i35 ArrayGet [l16,i124]
    add w16, w0, #0xc (12)
    ldr w4, [x16, x3, lsl #2]

i39 UShr [i35,i38]
i45 And [i39,i41]
i63 UShr [i35,i62]
i67 And [i63,i66]
i85 UShr [i35,i84]
i89 And [i85,i88]
    lsr w5, w4, #8
    and w5, w5, #0xf800
    lsr w6, w4, #5
    and w6, w6, #0xfe0
    lsr w7, w4, #3
    and w4, w7, #0x1f

i97 Or [i45,i67]
i101 Or [i89,i97]
    orr w5, w5, w6
    orr w4, w4, w5

v112 ArraySet [l16,i124,i101]
    add w16, w0, #0xc (12)
    str w4, [x16, x3, lsl #2]

i115 Add [i124,i114]
    add w3, w3, #0x1 (1)

v118 Goto
    ldrh w16, [tr]
    cbz w16, #-0x40
    b #+0x24

47

Architecture-specific optimizations
Goals and constraints

We want a framework that is
● Flexible - enabling a wide range of optimizations
● Efficient - speed and code size impact worth the compilation time
● Mergeable upstream
  ○ Works for all architectures
  ○ Not (too) intrusive
  ○ Easy to test

48

Comparison of one example with GCC -O3

int foo(int a, int b) {
  int res = a;
  int temp = b << 5;
  if (a < b)
    res += a + temp;
  else
    res += b + temp;
  return res;
}

GCC -O3:
<foo>:
   0: cmp w0, w1
   4: lsl w2, w1, #5
   8: b.lt 18 <foo+0x18>
   c: add w2, w1, w2
  10: add w0, w0, w2
  14: ret
  18: add w2, w0, w2
  1c: add w0, w0, w2
  20: ret

ART:
0x00: cmp w2, w3
0x04: b.ge #+0x10 (addr 0x14)
0x08: add w0, w2, w3, lsl #5
0x0c: add w0, w2, w0
0x10: b #+0xc (addr 0x1c)
0x14: add w0, w3, w3, lsl #5
0x18: add w0, w2, w0
0x1c: ret

Ooops! I just saw that. This needs to be improved!

49

Profiling with perf (1/2)

# Compile perf and push it on the target.
cd <aosp>/external/linux-tools-perf/
mma -j40
adb sync

cd <work_dir>
javac Main.java
dx --dex --output=Main.dex Main.class

# REMOTE_PATH
export RP=/data/local/tmp
adb push ./Main.dex $RP

50

Profiling with perf (2/2)

# Trigger the build with debug symbols.
# Specifying `ANDROID_DATA` and `DEX_LOCATION` allows using a local dalvik-cache directory.
adb shell "ANDROID_DATA=$RP DEX_LOCATION=$RP dalvikvm64 -Xcompiler-option -g -cp $RP/Main.dex Main"
# Symbolize boot.oat.
adb shell ls $RP/dalvik-cache/*/*boot.oat | xargs -n 1 bash $ANDROID_BUILD_TOP/art/tools/symbolize.sh
# Pull the compiled file.
adb pull $RP/dalvik-cache/arm64/data@local@[email protected] $ANDROID_PRODUCT_OUT/symbols/$RP/dalvik-cache/arm64/data@local@[email protected]

# Now record a run.
adb shell "cd $RP && ANDROID_DATA=$RP DEX_LOCATION=$RP perf record dalvikvm64 -cp $RP/Main.dex Main"
adb pull $RP/perf.data
# And report.
perf report --demangle \
  --objdump=$ANDROID_BUILD_TOP/prebuilts/gcc/linux-x86/aarch64/aarch64-linux-android-4.9/bin/aarch64-linux-android-objdump \
  --symfs $ANDROID_PRODUCT_OUT/symbols

51

Clean code

void LocationsBuilderARM64::VisitArrayLength(HArrayLength* instruction) {
  LocationSummary* locations = new (GetGraph()->GetArena()) LocationSummary(instruction);
  locations->SetInAt(0, Location::RequiresRegister());
  locations->SetOut(Location::RequiresRegister(), Location::kNoOutputOverlap);
}

void InstructionCodeGeneratorARM64::VisitArrayLength(HArrayLength* instruction) {
  BlockPoolsScope block_pools(GetVIXLAssembler());
  __ Ldr(OutputRegister(instruction),
         HeapOperand(InputRegisterAt(instruction, 0), mirror::Array::LengthOffset()));
  codegen_->MaybeRecordImplicitNullCheck(instruction);
}

52

