Post on 19-Feb-2017
Diving into the ART Optimizing Compiler
Presented by: Alexandre Rames
Date: Tuesday 22 September 2015
Event: SFO15
Agenda
● Compilation process overview
● Development
  ○ Compile, run
  ○ Profile, analyze
  ○ Optimize
● Topics
  ○ Development issues
  ○ What we are working on
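Compilation process overview
[Diagram: *.java → (javac + dx, or jack and jill) → DEX bytecode → (pushed on target) → dex2oat → .oat (ELF file)]
Inside dex2oat:
● The DEX bytecode is turned into an HGraph made of HBasicBlocks. Each block holds HInstructions and ends with a control-flow instruction (HGoto, HIf, HReturn, HExit); in the diagram, one block ends with HEqual followed by HIf.
● Optimizations run on the graph:
  ○ Inlining
  ○ Dead code elimination
  ○ Strength reduction
  ○ Constant folding
  ○ GVN
  ○ Loop invariant code motion
  ○ Bounds check elimination
● Native code is generated into the .oat file, for example:
    sub sp, sp, #...
    ...
    cmp x0, x1
    b.eq true_target
    ...
    ret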
Compile, run
# Setup and compile AOSP.
cd <aosp>
source build/envsetup.sh
lunch aosp_arm64-eng
make -j40 && cd art && mma -j40
# Compile the java code.
javac Main.java
dx --dex --output=Main.dex Main.class
# Run on target.
export R_PATH=/data/local/tmp
adb push Main.dex $R_PATH
adb shell "ANDROID_DATA=$R_PATH DEX_LOCATION=$R_PATH dalvikvm64 -cp $R_PATH/Main.dex Main"
Cross-compile, analyze
● Analyze compilation through the ‘.cfg’ file.
# Cross-compile on host for arm64.
dex2oatd --runtime-arg -Xms64m --runtime-arg -Xmx512m \
  --boot-image=$ANDROID_PRODUCT_OUT/dex_bootjars/system/framework/boot.art \
  --dex-file=`pwd`/Main.dex --oat-file=`pwd`/Main.oat \
  --android-root=$ANDROID_PRODUCT_OUT/system \
  --runtime-arg -Xnorelocate \
  --instruction-set=arm64 --instruction-set-features=default \
  -j1 --dump-cfg=art.cfg
● The output .cfg contains dumps of the graph before and after each pass.
  ○ It also includes interleaved disassembly of the generated code.
● Tools can help visualize the art.cfg file.
  ○ C1Visualizer
  ○ IR Hydra
Code sample
● RGB lossy compression
  From: 3 8bit channels (24bit)
  To: 5-6-5 R-G-B channels (16bit)
[Figure: bit layout of array[i]. Before: B in bits 0..7, G in bits 8..15, R in bits 16..23 (of bits 0..31). After: B in bits 0..4, G in bits 5..10, R in bits 11..15 (of bits 0..31).]

final int mask_red = [...];
final int shift_red = [...];
[...]

public void Compress() {
  for (int i = 0; i < this.array.length; ++i) {
    int res_low_red = (array[i] >>> shift_right_red) & mask_red;
    int res_low_green = (array[i] >>> shift_right_green) & mask_green;
    int res_low_blue = (array[i] >>> shift_right_blue) & mask_blue;

    array[i] = res_low_red | res_low_green | res_low_blue;
  }
}
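To try the sample outside the slides, here is a minimal self-contained sketch of the same loop. The shift amounts (8, 5, 3) match the LSR amounts that appear in the generated code shown in this deck. The mask values are an assumption (the conventional RGB565 masks), since the slide elides them with [...]. The class and method names are hypothetical.

```java
// Hypothetical stand-alone version of the slide's Compress() loop.
final class Rgb565Sketch {
    // Shift amounts as seen in the generated code (LSR #8, #5, #3).
    static final int SHIFT_RIGHT_RED = 8;
    static final int SHIFT_RIGHT_GREEN = 5;
    static final int SHIFT_RIGHT_BLUE = 3;
    // ASSUMPTION: conventional RGB565 masks; the slide elides the real values.
    static final int MASK_RED = 0xF800;   // bits 15..11
    static final int MASK_GREEN = 0x07E0; // bits 10..5
    static final int MASK_BLUE = 0x001F;  // bits 4..0

    /** Packs one 24-bit RGB pixel into 16-bit 5-6-5 form. */
    static int compressPixel(int rgb888) {
        int red = (rgb888 >>> SHIFT_RIGHT_RED) & MASK_RED;
        int green = (rgb888 >>> SHIFT_RIGHT_GREEN) & MASK_GREEN;
        int blue = (rgb888 >>> SHIFT_RIGHT_BLUE) & MASK_BLUE;
        return red | green | blue;
    }

    /** Compresses every pixel in place, like the slide's loop. */
    static void compress(int[] array) {
        for (int i = 0; i < array.length; ++i) {
            array[i] = compressPixel(array[i]);
        }
    }

    public static void main(String[] args) {
        int[] pixels = {0xFF0000, 0x00FF00, 0x0000FF, 0xFFFFFF};
        compress(pixels);
        for (int pixel : pixels) {
            System.out.println(String.format("0x%04X", pixel));
        }
    }
}
```

Each channel keeps only its most significant bits: 5 for red, 6 for green, 5 for blue.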
Graph in SSA form
B0 // Function entry
  l6 ParameterValue // class object
  i8 IntConstant 0
  i38 IntConstant 8
  [...]
  v10 Goto
B2 // Loop header
  i124 Phi [i8,i115]
  v123 SuspendCheck
  l12 NullCheck [l6]
  l13 InstanceFieldGet [l12]
  l16 NullCheck [l13]
  i17 ArrayLength [l16]
  z21 GreaterThanOrEqual [i124,i17]
  v22 If [z21]
B3 // Loop body
  [...]
  v118 Goto
Return and exit blocks (unlabelled on the slide)
  v119 ReturnVoid
  v120 Exit
Loop body in SSA form
res_low_red = (array[i] >>> shift_right_red) & mask_red;
  l24 NullCheck [l6]
  l25 InstanceFieldGet [l24]      // Get `array` from the class object.
  l28 NullCheck [l25]             // Check it is non-null.
  i30 ArrayLength [l28]           // Get its length.
  i33 BoundsCheck [i124,i30]      // Check the length is greater than `i`.
  i35 ArrayGet [l28,i33]          // Get `array[i]`.
  i39 UShr [i35,i38]              // Shift it right.
  i45 And [i39,i41]               // Mask the resulting value.
res_low_green = (array[i] >>> shift_right_green) & mask_green;
  [... similar code for ‘green’]
res_low_blue = (array[i] >>> shift_right_blue) & mask_blue;
  [... similar code for ‘blue’]
array[i] = res_low_red | res_low_green | res_low_blue;
  l92 NullCheck [l6]
  l93 InstanceFieldGet [l92]      // Get `array` from the class object.
  i97 Or [i45,i67]                // Compute the compressed value.
  i101 Or [i97,i89]
  l104 NullCheck [l93]            // Check that `array` is non-null.
  i106 ArrayLength [l104]         // Check its length is greater than `i`.
  i109 BoundsCheck [i124,i106]
  v112 ArraySet [l104,i109,i101]  // Store the result.
Global Value Numbering
// Function entry
  l6 ParameterValue // class object
  i8 IntConstant 0
  i38 IntConstant 8
  [...]
  v10 Goto
// Loop header
  i124 Phi [i8,i115]
  v123 SuspendCheck
  l13 InstanceFieldGet [l6]
  l16 NullCheck [l13]
  i17 ArrayLength [l16]
  z21 GreaterThanOrEqual [i124,i17]
  v22 If [z21]
// code for ‘red’
  l25 InstanceFieldGet [l6]
  l28 NullCheck [l25]
  i30 ArrayLength [l28]
  i33 BoundsCheck [i124,i30]
  i35 ArrayGet [l28,i33]
  i39 UShr [i35,i38]
  i45 And [i39,i41]
// code for ‘green’
  l49 InstanceFieldGet [l6]
  l52 NullCheck [l49]
  i54 ArrayLength [l52]
  i57 BoundsCheck [i124,i54]
  i59 ArrayGet [l52,i57]
  i63 UShr [i59,i62]
  i67 And [i63,i66]
  [... similar code for ‘blue’]
  l93 InstanceFieldGet [l6]
  i97 Or [i45,i67]
  i101 Or [i97,i89]
  l104 NullCheck [l93]
  [...]
Global Value Numbering
// Function entry
  l6 ParameterValue // class object
  i8 IntConstant 0
  i38 IntConstant 8
  [...]
  v10 Goto
// Loop header
  i124 Phi [i8,i115]
  v123 SuspendCheck
  l13 InstanceFieldGet [l6]
  l16 NullCheck [l13]
  i17 ArrayLength [l16]
  z21 GreaterThanOrEqual [i124,i17]
  v22 If [z21]
// code for ‘red’
  l25 InstanceFieldGet [l6]
  l28 NullCheck [l25]
  i30 ArrayLength [l28]
  i33 BoundsCheck [i124,i17]
  i35 ArrayGet [l13,i33]
  i39 UShr [i35,i38]
  i45 And [i39,i41]
// code for ‘green’
  l49 InstanceFieldGet [l6]
  l52 NullCheck [l49]
  i54 ArrayLength [l52]
  i57 BoundsCheck [i124,i17]
  i59 ArrayGet [l13,i57]
  i63 UShr [i59,i62]
  i67 And [i63,i66]
  [... similar code for ‘blue’]
  l93 InstanceFieldGet [l6]
  i97 Or [i45,i67]
  i101 Or [i97,i89]
  l104 NullCheck [l93]
  [...]
Global Value Numbering
// Function entry
  l6 ParameterValue // class object
  i8 IntConstant 0
  i38 IntConstant 8
  [...]
  v10 Goto
// Loop header
  i124 Phi [i8,i115]
  v123 SuspendCheck
  l13 InstanceFieldGet [l6]
  l16 NullCheck [l13]
  i17 ArrayLength [l16]
  z21 GreaterThanOrEqual [i124,i17]
  v22 If [z21]
// code for ‘red’
  i33 BoundsCheck [i124,i17]
  i35 ArrayGet [l13,i33]
  i39 UShr [i35,i38]
  i45 And [i39,i41]
// code for ‘green’
  i57 BoundsCheck [i124,i17]
  i59 ArrayGet [l13,i57]
  i63 UShr [i59,i62]
  i67 And [i63,i66]
// code for ‘blue’
  i79 BoundsCheck [i124,i17]
  i81 ArrayGet [l13,i79]
  i85 UShr [i81,i84]
  i89 And [i85,i88]
  l93 InstanceFieldGet [l6]
  i97 Or [i45,i67]
  i101 Or [i97,i89]
  l104 NullCheck [l93]
  [...]
Global Value Numbering
// Function entry
  l6 ParameterValue // class object
  i8 IntConstant 0
  i38 IntConstant 8
  [...]
  v10 Goto
// Loop header
  i124 Phi [i8,i115]
  v123 SuspendCheck
  l13 InstanceFieldGet [l6]
  l16 NullCheck [l13]
  i17 ArrayLength [l16]
  z21 GreaterThanOrEqual [i124,i17]
  v22 If [z21]
// code for ‘red’
  i39 UShr [i35,i38]
  i45 And [i39,i41]
// code for ‘green’
  i63 UShr [i59,i62]
  i67 And [i63,i66]
// code for ‘blue’
  i85 UShr [i81,i84]
  i89 And [i85,i88]
  l93 InstanceFieldGet [l6]
  i97 Or [i45,i67]
  i101 Or [i97,i89]
  l104 NullCheck [l93]
  [...]
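The deduplication shown on the GVN slides can be imitated with a very small value-numbering sketch. This is a simplification, not ART's pass: it handles one straight-line list of instructions that are all assumed to be pure, whereas a real GVN also has to account for side effects and for dominance between blocks. The class `ValueNumberingSketch` and its helpers are hypothetical; the instruction ids are the ones from the slides, with the loop header and the ‘red’ and ‘green’ loads flattened into one list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: value numbering over one list of pure instructions.
final class ValueNumberingSketch {
    static final class Insn {
        final String id;
        final String op;
        final List<String> inputs;

        Insn(String id, String op, String... inputs) {
            this.id = id;
            this.op = op;
            this.inputs = new ArrayList<>(Arrays.asList(inputs));
        }

        @Override
        public String toString() {
            return id + " " + op + " " + inputs;
        }
    }

    /** Returns the surviving instructions, with inputs rewritten to surviving ids. */
    static List<Insn> run(List<Insn> block) {
        Map<String, String> replacedBy = new HashMap<>(); // removed id -> equivalent earlier id
        Map<String, String> available = new HashMap<>();  // "op[inputs]" -> id that computes it
        List<Insn> kept = new ArrayList<>();
        for (Insn insn : block) {
            String[] inputs = new String[insn.inputs.size()];
            for (int i = 0; i < inputs.length; ++i) {
                String input = insn.inputs.get(i);
                inputs[i] = replacedBy.getOrDefault(input, input);
            }
            String key = insn.op + Arrays.toString(inputs);
            String earlier = available.get(key);
            if (earlier != null) {
                replacedBy.put(insn.id, earlier); // same op, same inputs: redundant
            } else {
                available.put(key, insn.id);
                kept.add(new Insn(insn.id, insn.op, inputs));
            }
        }
        return kept;
    }

    /** Loop header plus the 'red' and 'green' array loads from the slides. */
    static List<Insn> slideExample() {
        return Arrays.asList(
            new Insn("l13", "InstanceFieldGet", "l6"),
            new Insn("l16", "NullCheck", "l13"),
            new Insn("i17", "ArrayLength", "l16"),
            new Insn("l25", "InstanceFieldGet", "l6"),
            new Insn("l28", "NullCheck", "l25"),
            new Insn("i30", "ArrayLength", "l28"),
            new Insn("i33", "BoundsCheck", "i124", "i30"),
            new Insn("i35", "ArrayGet", "l28", "i33"),
            new Insn("l49", "InstanceFieldGet", "l6"),
            new Insn("l52", "NullCheck", "l49"),
            new Insn("i54", "ArrayLength", "l52"),
            new Insn("i57", "BoundsCheck", "i124", "i54"),
            new Insn("i59", "ArrayGet", "l52", "i57"));
    }

    public static void main(String[] args) {
        for (Insn insn : run(slideExample())) {
            System.out.println(insn);
        }
    }
}
```

On this input only one field load, one null check, one length read, one bounds check and one array load survive, and the surviving `BoundsCheck` and `ArrayGet` take `i17` and `l16` as inputs.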
Loop-invariant code motion
// Function entry
  l6 ParameterValue // class object
  i8 IntConstant 0
  i38 IntConstant 8
  [...]
  v10 Goto
// Loop header
  i124 Phi [i8,i115]
  v123 SuspendCheck
  l13 InstanceFieldGet [l6]
  l16 NullCheck [l13]
  i17 ArrayLength [l16]
  z21 GreaterThanOrEqual [i124,i17]
  v22 If [z21]
// common
  i33 BoundsCheck [i124,i17]
  i35 ArrayGet [l16,i33]
// ‘red’
  i39 UShr [i35,i38]
  i45 And [i39,i41]
// ‘green’
  i63 UShr [i35,i62]
  i67 And [i63,i66]
// ‘blue’
  i85 UShr [i35,i84]
  i89 And [i85,i88]
  i97 Or [i45,i67]
  i101 Or [i89,i97]
  v112 ArraySet [l16,i33,i101]
  i115 Add [i124,i114]
  v118 Goto
Graph after Loop-invariant code motion
// Function entry
  l6 ParameterValue // class object
  i8 IntConstant 0
  i38 IntConstant 8
  [...]
  l13 InstanceFieldGet [l6]
  l16 NullCheck [l13]
  i17 ArrayLength [l16]
  v10 Goto
// Loop header
  i124 Phi [i8,i115]
  v123 SuspendCheck
  z21 GreaterThanOrEqual [i124,i17]
  v22 If [z21]
// Loop body
  i33 BoundsCheck [i124,i17]
  i35 ArrayGet [l16,i33]
  i39 UShr [i35,i38]
  i45 And [i39,i41]
  i63 UShr [i35,i62]
  i67 And [i63,i66]
  i85 UShr [i35,i84]
  i89 And [i85,i88]
  i97 Or [i45,i67]
  i101 Or [i89,i97]
  v112 ArraySet [l16,i33,i101]
  i115 Add [i124,i114]
  v118 Goto
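At the source level, the effect of hoisting `InstanceFieldGet`, `NullCheck` and `ArrayLength` out of the loop can be pictured as below. This is only an analogy written by hand, not compiler output: the class `LicmSketch`, its methods and the `>>> 3` loop body are hypothetical stand-ins for the sample.

```java
// Hypothetical source-level picture of what the pass does to the sample loop.
final class LicmSketch {
    private final int[] array;

    LicmSketch(int[] array) {
        this.array = array;
    }

    // As written: the field load and the length read sit inside the loop.
    void shiftNaive() {
        for (int i = 0; i < this.array.length; ++i) {
            this.array[i] = this.array[i] >>> 3;
        }
    }

    // After hoisting: loop-invariant values are computed once, before the loop.
    void shiftHoisted() {
        int[] local = this.array;  // InstanceFieldGet
        int length = local.length; // NullCheck + ArrayLength
        for (int i = 0; i < length; ++i) {
            local[i] = local[i] >>> 3;
        }
    }

    int[] contents() {
        return array.clone();
    }

    public static void main(String[] args) {
        LicmSketch naive = new LicmSketch(new int[] {8, 16, 255});
        LicmSketch hoisted = new LicmSketch(new int[] {8, 16, 255});
        naive.shiftNaive();
        hoisted.shiftHoisted();
        System.out.println(java.util.Arrays.equals(naive.contents(), hoisted.contents()));
    }
}
```

The hoisting is valid here because nothing in the loop writes the `array` field itself; the loop only writes the array's elements.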
Let’s optimize the compiler
● Finding what to do:
  ○ Profile your code and look at hot-spots, if any.
  ○ Optimizations implemented on other archs?
  ○ Look for common or bad code patterns.
● Evaluating an optimization:
  ○ Performance, code size, and compilation impact
  ○ Frequency of the optimization
  ○ Reverse-estimation: instead of implementing the optimization, insert nops or duplicate the code to assess the impact.
● Implementing:
  ○ Start by having a look at other patches upstream.
  ○ The code is clean and nicely modular (passes).
Profiling
● Linux perf in AOSP
● Use as usual
  ○ perf record <command>
  ○ perf report
● See additional slides for example commands.
● Streamline also works and has interesting features.
Sample perf output
Overhead  Command  Shared Object            Symbol
  67.61%  main     data@local@tmp@Main.dex  [.] void Main.Compress()
  27.73%  main     data@local@tmp@Main.dex  [.] void Main.Init()
[...]
Sample perf output
void Main.Compress() /data/local/tmp/dalvik-cache/arm6...
       │ mov w3, #0
  0.01 │ cmp w3, w2
  6.77 │ b.ge 2348 <void Main.Compress()+0x6c>
  0.04 │ add w16, w0, #0xc
 13.71 │ ldr w4, [x16,x3,lsl #2]
 19.97 │ lsr w5, w4, #8
  6.31 │ and w5, w5, #<mask red>
       │ lsr w6, w4, #5
  6.31 │ and w6, w6, #<mask green>
       │ lsr w7, w4, #3
  6.47 │ and w4, w7, #<mask blue>
       │ orr w5, w5, w6
  6.63 │ orr w4, w4, w5
       │ add w16, w0, #0xc
 13.61 │ str w4, [x16,x3,lsl #2]
       │ add w3, w3, #1
Annotations on the slide:
● `add w16, w0, #0xc`: this instruction is duplicated for every access to the array.
● Each `lsr` and the `and` that follows it: these should be merged.
Corresponding IR:
  i35 ArrayGet [l16,i124]
  i39 UShr [i35,i38]
  i45 And [i39,i41]
  i63 UShr [i35,i62]
  i67 And [i63,i66]
  i85 UShr [i35,i84]
  i89 And [i85,i88]
  i97 Or [i45,i67]
  i101 Or [i89,i97]
  v112 ArraySet [l16,i124,i101]
  i115 Add [i124,i114]
Compilation process overview
[Diagram, repeated from the start of the talk: *.java → DEX bytecode → HGraph (HBasicBlocks of HInstructions ending in HGoto, HIf, HReturn, HExit) → .oat (ELF file) with code such as `sub sp, sp, #...`, `cmp x0, x1`, `b.eq true_target`, `ret`]
Architecture-specific optimizations
Design choices
[Diagram labels: `a + b`, ADD, HAdd, LAdd ?, `add`]
● Architecture-specific IRs
● At the same IR level
  ○ Lower the IRs as necessary instead of having a full IR level. This avoids a lot of work for the translation from HIR to LIR.
  ○ We want to reuse existing optimization passes (e.g. GVN and LICM).
● So the framework is very simple.
  ○ New HArm64DoStuff IRs.
  ○ New arch-specific optimization passes.
Optimizing array accesses
● New ARM64 instruction simplification pass
● First, simply ‘split’ array accesses.
Before:
  i35 ArrayGet [l16,i124]
  i39 UShr [i35,i38]
  i45 And [i39,i41]
  i63 UShr [i35,i62]
  i67 And [i63,i66]
  i85 UShr [i35,i84]
  i89 And [i85,i88]
  i97 Or [i45,i67]
  i101 Or [i89,i97]
  v112 ArraySet [l16,i124,i101]
  [...]
After (generated instruction on the right):
  l128 Arm64IntermediateAddress [l16,i127]   add
  i35 ArrayGet [l128,i124]                   ldr
  i39 UShr [i35,i38]                         lsr
  i45 And [i39,i41]                          and
  i63 UShr [i35,i62]                         lsr
  i67 And [i63,i66]                          and
  i85 UShr [i35,i84]                         lsr
  i89 And [i85,i88]                          and
  i97 Or [i45,i67]                           orr
  i101 Or [i89,i97]                          orr
  l129 Arm64IntermediateAddress [l16,i127]   add
  v112 ArraySet [l129,i124,i101]             str
  [...]                                      [...]
Optimizing array accesses
● Then we re-run GVN!
Before:
  i35 ArrayGet [l16,i124]
  i39 UShr [i35,i38]
  i45 And [i39,i41]
  i63 UShr [i35,i62]
  i67 And [i63,i66]
  i85 UShr [i35,i84]
  i89 And [i85,i88]
  i97 Or [i45,i67]
  i101 Or [i89,i97]
  v112 ArraySet [l16,i124,i101]
  [...]
After (generated instruction on the right):
  l128 Arm64IntermediateAddress [l16,i127]   add
  i35 ArrayGet [l128,i124]                   ldr
  i39 UShr [i35,i38]                         lsr
  i45 And [i39,i41]                          and
  i63 UShr [i35,i62]                         lsr
  i67 And [i63,i66]                          and
  i85 UShr [i35,i84]                         lsr
  i89 And [i85,i88]                          and
  i97 Or [i45,i67]                           orr
  i101 Or [i89,i97]                          orr
  l129 Arm64IntermediateAddress [l16,i127]   add
  v112 ArraySet [l128,i124,i101]             str
  [...]                                      [...]
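The arithmetic behind the split can be sketched as follows. The data offset `0xc` and the element shift of 2 are read off the `add w16, w0, #0xc` and `lsl #2` operands in the generated code; the class `IntermediateAddressSketch`, its method names and the base address used in `main` are hypothetical and only model the address computation.

```java
// Hypothetical model of the address arithmetic behind Arm64IntermediateAddress.
final class IntermediateAddressSketch {
    static final int DATA_OFFSET = 0xc; // from `add w16, w0, #0xc`
    static final int ELEMENT_SHIFT = 2; // from `lsl #2`: 4-byte int elements

    /** Unsplit access: the offset is re-added for every load or store. */
    static long elementAddress(long arrayBase, int index) {
        return arrayBase + DATA_OFFSET + ((long) index << ELEMENT_SHIFT);
    }

    /** First half of the split: a value that GVN can share between accesses. */
    static long intermediateAddress(long arrayBase) {
        return arrayBase + DATA_OFFSET;
    }

    /** Second half of the split: fits the `[base, index, lsl #2]` addressing mode. */
    static long elementAddressFrom(long intermediate, int index) {
        return intermediate + ((long) index << ELEMENT_SHIFT);
    }

    public static void main(String[] args) {
        long base = 0x1000; // made-up array address
        long shared = intermediateAddress(base);
        // The load and the store of array[5] both reuse the shared value.
        System.out.println(Long.toHexString(elementAddressFrom(shared, 5)));
        System.out.println(Long.toHexString(elementAddress(base, 5)));
    }
}
```

Once the `base + offset` part is its own instruction, two accesses to the same array produce identical instructions, which is what lets the re-run of GVN remove the second one.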
Using the shifter operand
● New HArm64DataProcWithShifterOp instruction
● The merging is implemented in the same pass
Before:
  i35 ArrayGet [l16,i124]
  i39 UShr [i35,i38]
  i45 And [i39,i41]
  i63 UShr [i35,i62]
  i67 And [i63,i66]
  i85 UShr [i35,i84]
  i89 And [i85,i88]
  i97 Or [i45,i67]
  i101 Or [i89,i97]
  v112 ArraySet [l16,i124,i101]
  [...]
After:
  l128 Arm64IntermediateAddress [l16,i127]
  i35 ArrayGet [l128,i124]
  i129 Arm64DataProcWithShifterOp [i41,i35] (And+LSR 8)
  i130 Arm64DataProcWithShifterOp [i66,i35] (And+LSR 5)
  i131 Arm64DataProcWithShifterOp [i88,i35] (And+LSR 3)
  i97 Or [i129,i130]
  i101 Or [i131,i97]
  l132 Arm64IntermediateAddress [l16,i127]
  v112 ArraySet [l132,i124,i101]
  [...]
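The merge of a shift into the instruction that consumes it can be sketched as a small tree rewrite. This is an illustration under stated assumptions, not the ART pass: the `Node` type, the opcode strings and the `AND_LSR` name are hypothetical, and the sketch skips the checks a real pass needs, such as making sure the shift has no other users.

```java
// Hypothetical sketch: merge And(UShr(x, const), y) into one "and with LSR" node.
final class ShifterOperandSketch {
    /** Tiny expression node: VALUE and CONST are leaves, the others are binary. */
    static final class Node {
        final String op;    // "VALUE", "CONST", "USHR", "AND" or "AND_LSR"
        final Node left;
        final Node right;
        final int constant; // payload of a leaf, or shift amount of AND_LSR

        Node(String op, Node left, Node right, int constant) {
            this.op = op;
            this.left = left;
            this.right = right;
            this.constant = constant;
        }
    }

    static Node leaf(String op, int value) {
        return new Node(op, null, null, value);
    }

    static Node binary(String op, Node left, Node right) {
        return new Node(op, left, right, 0);
    }

    /** Rewrites And(UShr(x, Const c), y) into AND_LSR(y, x) with shift c. */
    static Node merge(Node n) {
        if (n.op.equals("AND") && n.left.op.equals("USHR") && n.left.right.op.equals("CONST")) {
            return new Node("AND_LSR", n.right, n.left.left, n.left.right.constant);
        }
        return n; // pattern not matched: leave the node alone
    }

    static int eval(Node n) {
        switch (n.op) {
            case "VALUE":
            case "CONST":
                return n.constant;
            case "USHR":
                return eval(n.left) >>> eval(n.right);
            case "AND":
                return eval(n.left) & eval(n.right);
            case "AND_LSR": // models `and wd, wn, wm, lsr #shift`
                return eval(n.left) & (eval(n.right) >>> n.constant);
            default:
                throw new IllegalArgumentException(n.op);
        }
    }

    public static void main(String[] args) {
        Node pixel = leaf("VALUE", 0xFF0000);
        Node red = binary("AND", binary("USHR", pixel, leaf("CONST", 8)), leaf("CONST", 0xF800));
        Node merged = merge(red);
        System.out.println(merged.op + " " + (eval(red) == eval(merged)));
    }
}
```

The operand order of `AND_LSR` (mask first, shifted value second) follows the `[i41,i35] (And+LSR 8)` notation on the slide.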
Generated code for the loop body
  add w7, w0, #0xc (12)              l128 Arm64IntermediateAddress [l13,i127]
  ldr w8, [x7, x3, lsl #2]           i35 ArrayGet [l128,i124]
  and w9, w6, w8, lsr #8             i129 Arm64DataProcWithShifterOp [i41,i35]
  and w10, w5, w8, lsr #5            i130 Arm64DataProcWithShifterOp [i66,i35]
  and w8, w4, w8, lsr #3             i131 Arm64DataProcWithShifterOp [i88,i35]
  orr w9, w9, w10                    i97 Or [i129,i130]
  orr w8, w9, w8                     i101 Or [i97,i131]
  str w8, [x7, x3, lsl #2]           v112 ArraySet [l128,i124,i101]
  add w3, w3, #0x1 (1)               i115 Add [i124,i114]
  ldrh w16, [tr] ; state_and_flags   v118 Goto
  cbz w16, #-0x30 (addr 0x30)
  b #+0x24 (addr 0x88)
Benchmarking the code sample
● Very targeted benchmark
● Higher is better

                    Cortex A53   Cortex A57
base                (baseline)   (baseline)
+ array split       +9%          +8%
+ shifter operand   +27%         +17%
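A targeted command-line benchmark of this kind could look like the sketch below. It is not the harness used for the numbers above: the class `CompressBenchmarkSketch`, the input pattern, the default sizes and the RGB565 mask values are all assumptions. The `init` plus `compress` split mirrors the `Main.Init()` and `Main.Compress()` symbols in the perf output.

```java
// Hypothetical command-line micro-benchmark for the sample loop.
final class CompressBenchmarkSketch {
    /** Fills the array with a made-up, deterministic 24-bit pattern. */
    static void init(int[] array) {
        for (int i = 0; i < array.length; ++i) {
            array[i] = (0xFFFFFF - i) & 0xFFFFFF;
        }
    }

    // ASSUMPTION: conventional RGB565 masks; the slides elide the real constants.
    static void compress(int[] array) {
        for (int i = 0; i < array.length; ++i) {
            int red = (array[i] >>> 8) & 0xF800;
            int green = (array[i] >>> 5) & 0x07E0;
            int blue = (array[i] >>> 3) & 0x001F;
            array[i] = red | green | blue;
        }
    }

    /**
     * Runs init + compress `iterations` times on `pixels` elements (pixels > 0).
     * Returns a checksum so that the work cannot be optimized away.
     */
    static long run(int pixels, int iterations) {
        int[] array = new int[pixels];
        long checksum = 0;
        for (int n = 0; n < iterations; ++n) {
            init(array);
            compress(array);
            checksum += array[pixels - 1];
        }
        return checksum;
    }

    public static void main(String[] args) {
        int pixels = args.length > 0 ? Integer.parseInt(args[0]) : 1 << 20;
        int iterations = args.length > 1 ? Integer.parseInt(args[1]) : 100;
        long start = System.nanoTime();
        long checksum = run(pixels, iterations);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("checksum=" + checksum + " time=" + elapsedMs + "ms");
    }
}
```

Timings from such a loop are only comparable between runs on the same device with CPU frequency scaling under control.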
Upstreaming
● Don’t forget to add tests.
● Don’t forget to run all the tests.
● Development happens upstream.
  ○ Our patches first go through Linaro's gerrit.
● Instructions on Linaro’s wiki
● Android documentation for submitting patches
Discussion points
● Doing something is easy. Recognizing it can be done is hard.
● Easy to extend
  ○ Compilation passes
  ○ VIXL is used for ARM64 code generation
● Don’t forget the compilation time.
● Multiple architectures (arm, arm64, x86, x86_64, mips64).
  ○ Most of the compilation process is architecture-agnostic.
  ○ Could be an issue for some ideas.
    ■ Condition flags production/consumption as side effects of IRs?
Development issues
● Development platform
  ○ Non-Nexus devices don’t always work with upstream.
  ○ Beware frequency scaling and other pitfalls.
● Different behaviours on different CPUs
  ○ Optimizations should be ARM-generic.
  ○ Avoid CPU-specific optimizations.
● Where are my representative benchmarks?
  ○ On the command-line, please!
● Why does Android take so long to compile?
  ○ Things are not that bad with a 20-core machine with hyper-threading.
What we are working on
● Command-line Java benchmarks
  ○ LMG hacking room
● More instruction simplification patterns / IR lowering
● Instruction scheduling
● SlowPaths sharing
● Intrinsics
● ARM64 simulator
  ○ To run and debug tests on host.
Summary
● Extensible compiler
● Good tools for profiling, analyzing, testing, and debugging
● Architecture-specific optimizations
  ○ At the same IR level
  ○ Allowed to re-use existing passes
● It generates good code, but there is a lot more to do!
Sample `.cfg` output
begin_compilation
  name "void Compression.Compress()"
begin_cfg
  name "ssa_builder (after)"
  begin_block
    name "B0"
    begin_HIR
      0 5 l6 ParameterValue <|@
      0 1 i8 IntConstant 0 <|@
    end_HIR
  end_block
  begin_block
    name "B3"
    begin_HIR
      0 1 l24 NullCheck [l6] env:[[i124,i17,_,_,_,l6]] <|@
      0 1 l25 InstanceFieldGet [l24] <|@
      0 2 l28 NullCheck [l25] env:[[i124,l25,_,_,_,l6]] <|@
Instruction simplification
(`NullCheck [l6]` instructions are removed; `[lX → l6]` marks an input rewritten to use l6 directly.)
// Function entry
  l6 ParameterValue // class object
  i8 IntConstant 0
  i38 IntConstant 8
  [...]
  v10 Goto
// Loop header
  i124 Phi [i8,i115]
  v123 SuspendCheck
  l12 NullCheck [l6]
  l13 InstanceFieldGet [l12 → l6]
  l16 NullCheck [l13]
  i17 ArrayLength [l16]
  z21 GreaterThanOrEqual [i124,i17]
  v22 If [z21]
// Loop body
  l24 NullCheck [l6]
  l25 InstanceFieldGet [l24 → l6]
  l28 NullCheck [l25]
  i30 ArrayLength [l28]
  i33 BoundsCheck [i124,i30]
  i35 ArrayGet [l28,i33]
  i39 UShr [i35,i38]
  i45 And [i39,i41]
  l52 NullCheck [l6]
  [... similar code for ‘green’]
  l74 NullCheck [l6]
  [... similar code for ‘blue’]
  l92 NullCheck [l6]
  l93 InstanceFieldGet [l92 → l6]
  i97 Or [i45,i67]
  i101 Or [i97,i89]
  l104 NullCheck [l93]
  i106 ArrayLength [l104]
  i109 BoundsCheck [i124,i106]
  v112 ArraySet [l104,i109,i101]
Instruction simplification
// Function entry
  l6 ParameterValue // class object
  i8 IntConstant 0
  i38 IntConstant 8
  [...]
  v10 Goto
// Loop header
  i124 Phi [i8,i115]
  v123 SuspendCheck
  l13 InstanceFieldGet [l6]
  l16 NullCheck [l13]
  i17 ArrayLength [l16]
  z21 GreaterThanOrEqual [i124,i17]
  v22 If [z21]
// Loop body
  l25 InstanceFieldGet [l6]
  l28 NullCheck [l25]
  i30 ArrayLength [l28]
  i33 BoundsCheck [i124,i30]
  i35 ArrayGet [l28,i33]
  i39 UShr [i35,i38]
  i45 And [i39,i41]
  [... similar code for ‘green’]
  [... similar code for ‘blue’]
  l93 InstanceFieldGet [l6]
  i97 Or [i45,i67]
  i101 Or [i97,i89]
  l104 NullCheck [l93]
  i106 ArrayLength [l104]
  i109 BoundsCheck [i124,i106]
  v112 ArraySet [l104,i109,i101]
  [...]
Bounds-check elimination
(`BoundsCheck` is removed; `[i33 → i124]` marks an input rewritten to use the index i124 directly.)
// Function entry
  l6 ParameterValue // class object
  i8 IntConstant 0
  i38 IntConstant 8
  [...]
  l13 InstanceFieldGet [l6]
  l16 NullCheck [l13]
  i17 ArrayLength [l16]
  v10 Goto
// Loop header
  i124 Phi [i8,i115]
  v123 SuspendCheck
  z21 GreaterThanOrEqual [i124,i17]
  v22 If [z21]
// Loop body
  i33 BoundsCheck [i124,i17]
  i35 ArrayGet [l16,i33 → i124]
  i39 UShr [i35,i38]
  i45 And [i39,i41]
  i63 UShr [i35,i62]
  i67 And [i63,i66]
  i85 UShr [i35,i84]
  i89 And [i85,i88]
  i97 Or [i45,i67]
  i101 Or [i89,i97]
  v112 ArraySet [l16,i33 → i124,i101]
  i115 Add [i124,i114]
  v118 Goto
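The reasoning that makes `BoundsCheck [i124,i17]` redundant here can be written down as a tiny predicate. This is a deliberately narrow sketch, not ART's analysis: it only recognises a loop of the shape `for (i = start; i < limit; i += 1)` and compares SSA ids as strings. The class and parameter names are hypothetical, and reading `i114` as the constant 1 is an assumption based on the `add w3, w3, #0x1` in the generated code.

```java
// Hypothetical sketch of the reasoning that makes a bounds check redundant.
final class BoundsCheckSketch {
    /**
     * Assumed loop shape: for (i = start; i < <loopLimitId>; i += step) { ... },
     * with `BoundsCheck [i, <lengthId>]` in the body and both ids naming
     * loop-invariant SSA values.
     */
    static boolean isRedundant(int start, int step, String loopLimitId, String lengthId) {
        // i starts non-negative and grows by exactly 1, so it cannot wrap
        // around to a negative value while the header still admits it.
        boolean neverNegative = start >= 0 && step == 1;
        // The loop header has already established i < length for the body.
        boolean belowLength = loopLimitId.equals(lengthId);
        return neverNegative && belowLength;
    }

    public static void main(String[] args) {
        // The sample loop: i124 = Phi [0, i124 + 1], exit when i124 >= i17.
        System.out.println(isRedundant(0, 1, "i17", "i17"));
    }
}
```

The check only becomes this easy after GVN and LICM, once the loop limit and the checked length are the same value `i17`.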
Loop body before and after
Before:
  l24 NullCheck [l6]
  l25 InstanceFieldGet [l24]
  l28 NullCheck [l25]
  i30 ArrayLength [l28]
  i33 BoundsCheck [i124,i30]
  i35 ArrayGet [l28,i33]
  i39 UShr [i35,i38]
  i45 And [i39,i41]
  l48 NullCheck [l6]
  l49 InstanceFieldGet [l48]
  l52 NullCheck [l49]
  i54 ArrayLength [l52]
  i57 BoundsCheck [i124,i54]
  i59 ArrayGet [l52,i57]
  i63 UShr [i59,i62]
  i67 And [i63,i66]
  l70 NullCheck [l6]
  l71 InstanceFieldGet [l70]
  l74 NullCheck [l71]
  [... 16 more IRs]
After:
  i35 ArrayGet [l16,i124]
  i39 UShr [i35,i38]
  i45 And [i39,i41]
  i63 UShr [i35,i62]
  i67 And [i63,i66]
  i85 UShr [i35,i84]
  i89 And [i85,i88]
  i97 Or [i45,i67]
  i101 Or [i89,i97]
  v112 ArraySet [l16,i124,i101]
  i115 Add [i124,i114]
  v118 Goto
Graph after all optimization passes
// Function entry
  l6 ParameterValue // class object
  i8 IntConstant 0
  i38 IntConstant 8
  [...]
  l13 InstanceFieldGet [l6]
  l16 NullCheck [l13]
  i17 ArrayLength [l16]
  v10 Goto
// Loop header
  i124 Phi [i8,i115]
  v123 SuspendCheck
  z21 GreaterThanOrEqual [i124,i17]
  v22 If [z21]
// Loop body
  i35 ArrayGet [l16,i124]
  i39 UShr [i35,i38]
  i45 And [i39,i41]
  i63 UShr [i35,i62]
  i67 And [i63,i66]
  i85 UShr [i35,i84]
  i89 And [i85,i88]
  i97 Or [i45,i67]
  i101 Or [i89,i97]
  v112 ArraySet [l16,i124,i101]
  i115 Add [i124,i114]
  v118 Goto
Generated code for the loop body
  add w16, w0, #0xc (12)        i35 ArrayGet [l16,i124]
  ldr w4, [x16, x3, lsl #2]
  lsr w5, w4, #8                i39 UShr [i35,i38]
  and w5, w5, #0xf800           i45 And [i39,i41]
  lsr w6, w4, #5                i63 UShr [i35,i62]
  and w6, w6, #0xfe0            i67 And [i63,i66]
  lsr w7, w4, #3                i85 UShr [i35,i84]
  and w4, w7, #0x1f             i89 And [i85,i88]
  orr w5, w5, w6                i97 Or [i45,i67]
  orr w4, w4, w5                i101 Or [i89,i97]
  add w16, w0, #0xc (12)        v112 ArraySet [l16,i124,i101]
  str w4, [x16, x3, lsl #2]
  add w3, w3, #0x1 (1)          i115 Add [i124,i114]
  ldrh w16, [tr]                v118 Goto
  cbz w16, #-0x40
  b #+0x24
Architecture-specific optimizations
Goals and constraints
We want a framework that is
● Flexible - enabling a wide range of optimizations
● Efficient - speed and code size impact worth the compilation time
● Mergeable upstream
  ○ Works for all architectures
  ○ Not (too) intrusive
  ○ Easy to test
Comparison of one example with GCC -O3
Source:
int foo(int a, int b) {
  int res = a;
  int temp = b << 5;
  if (a < b)
    res += a + temp;
  else
    res += b + temp;
  return res;
}
First listing (objdump format):
<foo>:
   0: cmp w0, w1
   4: lsl w2, w1, #5
   8: b.lt 18 <foo+0x18>
   c: add w2, w1, w2
  10: add w0, w0, w2
  14: ret
  18: add w2, w0, w2
  1c: add w0, w0, w2
  20: ret
Second listing (same disassembly format as the other ART slides):
  0x00: cmp w2, w3
  0x04: b.ge #+0x10 (addr 0x14)
  0x08: add w0, w2, w3, lsl #5
  0x0c: add w0, w2, w0
  0x10: b #+0xc (addr 0x1c)
  0x14: add w0, w3, w3, lsl #5
  0x18: add w0, w2, w0
  0x1c: ret
Ooops! I just saw that. This needs to be improved!
Profiling with perf (1/2)
# Compile perf and push it on the target.
cd <aosp>/external/linux-tools-perf/
mma -j40
adb sync

cd <work_dir>
javac Main.java
dx --dex --output=Main.dex Main.class

# REMOTE_PATH
export RP=/data/local/tmp
adb push ./Main.dex $RP
Profiling with perf (2/2)
# Trigger the build with debug symbols.
# Specifying `ANDROID_DATA` and `DEX_LOCATION` allows using a local dalvik-cache directory.
adb shell "ANDROID_DATA=$RP DEX_LOCATION=$RP dalvikvm64 -Xcompiler-option -g -cp $RP/Main.dex Main"
# Symbolize boot.oat.
adb shell ls $RP/dalvik-cache/*/*boot.oat | xargs -n 1 bash $ANDROID_BUILD_TOP/art/tools/symbolize.sh
# Pull the compiled file.
adb pull $RP/dalvik-cache/arm64/data@local@tmp@Main.dex $ANDROID_PRODUCT_OUT/symbols/$RP/dalvik-cache/arm64/data@local@tmp@Main.dex

# Now record a run.
adb shell "cd $RP && ANDROID_DATA=$RP DEX_LOCATION=$RP perf record dalvikvm64 -cp $RP/Main.dex Main"
adb pull $RP/perf.data
# And report.
perf report --demangle \
  --objdump=$ANDROID_BUILD_TOP/prebuilts/gcc/linux-x86/aarch64/aarch64-linux-android-4.9/bin/aarch64-linux-android-objdump \
  --symfs $ANDROID_PRODUCT_OUT/symbols
Clean code
void LocationsBuilderARM64::VisitArrayLength(HArrayLength* instruction) {
  LocationSummary* locations = new (GetGraph()->GetArena()) LocationSummary(instruction);
  locations->SetInAt(0, Location::RequiresRegister());
  locations->SetOut(Location::RequiresRegister(), Location::kNoOutputOverlap);
}

void InstructionCodeGeneratorARM64::VisitArrayLength(HArrayLength* instruction) {
  BlockPoolsScope block_pools(GetVIXLAssembler());
  __ Ldr(OutputRegister(instruction),
         HeapOperand(InputRegisterAt(instruction, 0), mirror::Array::LengthOffset()));
  codegen_->MaybeRecordImplicitNullCheck(instruction);
}