Inlining
Replace function call f(X1,…,Xn) with body of f/n
Optimization enabler
– Simplify code
– Specialize code
– Remove "optimization fence"
Standard tool in modern compiler toolbox
Inlining
Main problem: which calls to inline?
– Code growth reduces performance
– Estimate code size growth
– Select the best estimated sites subject to cost
Some static estimations:
– f/n is small? (= inline cost is small)
– Inlining the call to f/n enables optimization
Are we optimizing the important code?
– Or just the convenient code?
Inlining
Dynamic estimation
– Profile the program
– Select the best hot call sites for inlining
Optimize the important code
Our approach
Inlining driven by profiling
Permit cross-module inlining
– Computations often span several modules
– Code growth measured for whole program
Cross-module optimization enabled by (i) module aggregation and (ii) guarded conversion of remote to local calls
(will not describe this further here) [Lindgren 98]
The rest of this talk
Overview of method
Performance measurements
Inline forest
Inlinings to be done represented by forest
Nodes are inlined call sites
Leaves are call sites to be checked
(Example shows nested inlining)
Some sites are not inlined
[Figure: example inline forest over call sites f, g and h]
Priority-based inlining
All call sites (leaves in inline forest) are placed in priority queue
– Priority = estimated number of calls
When a call site f is inlined, the call sites in f are added to the queue
– Priority scaled appropriately
Inlining algorithm
Preprocess code
– call_site and size maps
– Initialize priority queue
– Initialize inline forest
While prio queue not empty
– Take call site (k, f)
– Try to inline it
Preprocessing
for each function visited k times
– for each call site visited k’ times, set ratio(call_site) = (k’/k)
Adjust ratio so that ratio < 1.0
Self-recursive call sites: ratio := 0.0
– (improves code quality)
Result: maps (function -> [{call_site, ratio}])
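The ratio rules above can be sketched as a small Erlang module. This is only an illustration under stated assumptions: the module name `call_ratio`, the function names, the input shape `{Callee, SiteVisits}` and the cap of 0.99 are hypothetical (the slides only say the ratio is adjusted to stay below 1.0 and show 1.0 becoming 0.99). The clause for a caller with zero visits is also an assumption, added to avoid dividing by zero.

```erlang
-module(call_ratio).
-export([ratio/3, site_map/3]).

%% Assumed cap: the slides adjust a ratio of 1.0 down to 0.99.
-define(MAX_RATIO, 0.99).

%% ratio(Caller, {Callee, SiteVisits}, CallerVisits) -> float()
%% Self-recursive call sites get ratio 0.0.
ratio(Caller, {Caller, _SiteVisits}, _CallerVisits) ->
    0.0;
%% Assumption: a caller that was never visited contributes nothing.
ratio(_Caller, {_Callee, _SiteVisits}, 0) ->
    0.0;
%% Otherwise k'/k, capped so that it stays below 1.0.
ratio(_Caller, {_Callee, SiteVisits}, CallerVisits) ->
    min(SiteVisits / CallerVisits, ?MAX_RATIO).

%% site_map(Caller, CallerVisits, Sites) -> {Caller, [{Callee, Ratio}]}
%% Builds one entry of the (function -> [{call_site, ratio}]) map.
site_map(Caller, CallerVisits, Sites) ->
    {Caller,
     [{Callee, ratio(Caller, {Callee, Visits}, CallerVisits)}
      || {Callee, Visits} <- Sites]}.
```

For instance, a call site visited 200,000 times inside a function visited 200,000 times gets 0.99 rather than 1.0, and any call from a function to itself gets 0.0.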
dec_bearer_capability(__X12,__X13) ->
    {visits,200000},
    case {__X12,__X13} of
        {BbcRec,[Octet5|Rest]} ->
            {200000},
            NewBbcRec =
                case Octet5 band 31 of
                    1 -> {200000}(erlang:setelement(3,BbcRec,1));
                    3 -> "...";
                    16 -> "...";
                    24 -> "..."
                end,
            case
                if
                    Octet5 band 128 == 128 -> {200000}, false;
                    true -> "..."
                end
            of
                true -> "...";
                false -> {200000}(dec_bearer_capability_6(NewBbcRec,Rest))
            end
    end.
Original code marked with number of visits
Special attention to function calls
dec_bearer_capability/2 runs 200,000 times
dec_bearer_capability_6 visited 200,000 times
ratio is (200k/200k) = 1.0
adjust ratio to 0.99
Inlining a call site
Bookkeeping phase (code gen later)
Call to f(X1,…,Xn), visited k times
k < minimum frequency? stop
tot_size + size(f) > max_size? skip
Otherwise,
– tot_size += size(f)
– for each call site g of f
  add (k * ratio, g) to priority queue
  extend node f by call sites g1,…,gn
Iterate until no call sites remain
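The bookkeeping loop above can be sketched as follows. This is a minimal illustration, not the implementation behind the talk: the module name `inline_sketch`, the argument shapes, and the use of a sorted list as the priority queue are assumptions, and the sketch records only the list of accepted inlinings rather than building the inline forest.

```erlang
-module(inline_sketch).
-export([run/4]).

%% run(Queue0, Sites, Sizes, {MinFreq, MaxSize}) -> {Inlined, TotSize}
%%   Queue0 : [{K, F}]  call sites to F, each visited K times
%%   Sites  : #{F => [{G, Ratio}]}  call sites inside the body of F
%%   Sizes  : #{F => Size}
%% A descending sorted list stands in for the priority queue (assumption).
run(Queue0, Sites, Sizes, {MinFreq, MaxSize}) ->
    loop(sort_desc(Queue0), Sites, Sizes, MinFreq, MaxSize, 0, []).

sort_desc(Queue) ->
    lists:reverse(lists:sort(Queue)).

%% No call sites remain.
loop([], _Sites, _Sizes, _Min, _Max, Tot, Acc) ->
    {lists:reverse(Acc), Tot};
%% Hottest remaining site is below the minimum frequency: stop.
loop([{K, _F} | _], _Sites, _Sizes, Min, _Max, Tot, Acc) when K < Min ->
    {lists:reverse(Acc), Tot};
loop([{K, F} | Rest], Sites, Sizes, Min, Max, Tot, Acc) ->
    Size = maps:get(F, Sizes),
    case Tot + Size > Max of
        true ->
            %% Too much code growth: skip this site, keep going.
            loop(Rest, Sites, Sizes, Min, Max, Tot, Acc);
        false ->
            %% Accept the inlining; the call sites g of f enter the
            %% queue with priority k * ratio.
            New = [{K * R, G} || {G, R} <- maps:get(F, Sites, [])],
            loop(sort_desc(New ++ Rest), Sites, Sizes, Min, Max,
                 Tot + Size, [{F, K} | Acc])
    end.
```

With a call site visited 200,000 times and an inner call site of ratio 0.99, the inner site enters the queue with priority 198,000. Because every ratio is below 1.0 and self-recursive sites have ratio 0.0, priorities shrink along each chain of nested inlinings, so a positive minimum frequency makes the loop terminate.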
Example
Inlining applied to decode1
– Protocol decoding
– Single module
Prio queue
– decode_ie_coding_1/3 [800k]
– decode_action/1 [800k]
– dec_bearer_capability/2 [200k]
– dec_bearer_capability_6/2 [198k]
– decode_ie_heads_setup/5 [198k]
– …
Inline forest
[Figure: inline forest for decode1]
Call_site mapping (selected parts)
dec_bearer_capability/2 -> [(dec_bearer_capability_6, 1.00)]
decode_ie_heads_setup/5 -> [(decode_action/1, 0.8), (decode_ie_coding/1, 0.8), (dec_bearer_capability, 0.2), (decode_ie_heads_setup/5, 0.2), (decode_ie_heads_setup/5, 0.6)]
…
– Ratio 1.00 for dec_bearer_capability_6: adjust to 0.99
– decode_ie_heads_setup/5 entries are self-recursive, so set to 0.0
Call_site mapping (after adjustment)
dec_bearer_capability/2 -> [(dec_bearer_capability_6, 0.99)]
decode_ie_heads_setup/5 -> [(decode_action/1, 0.8), (decode_ie_coding/1, 0.8), (dec_bearer_capability, 0.2), (decode_ie_heads_setup/5, 0.0), (decode_ie_heads_setup/5, 0.0)]
…
Try to inline
Prio queue (first entry taken)
– decode_action/1 [800k]
– dec_bearer_capability/2 [200k]
– dec_bearer_capability_6/2 [198k]
– decode_ie_heads_setup/5 [198k]
– …
Prio queue (second entry taken)
– dec_bearer_capability/2 [200k]
– dec_bearer_capability_6/2 [198k]
– decode_ie_heads_setup/5 [198k]
– …
[Figure: inline forest for decode1, shown at each step]
Final result:
– Inline dec_bearer_cap_6/2 into dec_bearer_cap/2, yielding (*)
– Inline dec_ie_coding/1, decode_action/1 and (*) into decode_ie_heads_setup/5
– During inlining, one inline was rejected for too much code growth (not shown)
Now time for code generation
Code generation
Walk each inline tree from leaf to root
– Replace inlined calls f(E1,…,En) with (fun(X1,…,Xn) -> E end)(E1,…,En)
– General case: nested inlines
Simplify the resulting function
– Apply fun to arguments (above)
– Case-of-case
– Case-of-if
– …
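The first two code generation steps can be illustrated on a toy function. This is a hand-written sketch, not generator output: the module `codegen_sketch` and the callee `double/1` are hypothetical, and the `_0_` prefix on the renamed parameter simply mirrors the naming visible in the backup slides.

```erlang
-module(codegen_sketch).
-export([original/1, inlined/1, simplified/1]).

%% Hypothetical callee to be inlined.
double(X) -> X + X.

%% Original code: an ordinary call to double/1.
original(N) -> double(N + 1) * 3.

%% Step 1: replace the call f(E1) with (fun(X1) -> E end)(E1).
inlined(N) -> (fun(X) -> X + X end)(N + 1) * 3.

%% Step 2: "apply fun to arguments": bind the renamed parameter to the
%% argument once, then use the body directly.
simplified(N) ->
    _0_X = N + 1,
    (_0_X + _0_X) * 3.
```

All three functions return the same value for any number N. Binding the argument to a fresh variable, rather than substituting the expression textually, keeps it evaluated exactly once, and the fresh name avoids clashes with variables of the enclosing function.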
Measurements
Used five applications
– decode1 (small protocol decoder)
– ldapv2 (ASN.1 encode/decode)
– gen_tcp (send/rcv over socket)
– beam (compiler)
– mnesia (simulate HLR)
Benchmarks
App      Mods  Funcs  Calls  Local  Visited
gen_tcp    13    658   1546    989      202
ldapv2      5    321   1038    616      140
beam       51   2347   9669   7594     2653
mnesia     63   4207  13390   8435      984
Performance
Very preliminary
– Code generation problems for beam and mnesia => unable to measure
– (Probably due to name capture bug)
Did not use outlining, higher-order specialization, apply open-coding [EUC’01]
Tried only emulated code
– Native code compilation failed
Speedup vs baseline
decode1 1.05
gen_tcp 1.04
ldapv2 1.10
Native compilation of inlined decode1 provided a net slowdown
Future work
Integrate with other optimizations
Plenty of opportunities for further source-level simplifications
Suggests new approach to module aggregation
– (do it after inlining instead of before)
Tuning, measurements
– Bugfixing …
Conclusion
Profile-guided inlining speeds up real code
Whole-program, cross-module inlining probably necessary
Backup slides
%% inlined, before simplify
dec_bearer_capability(BbcRec,[Octet5|Rest]) ->
    ...
    case
        if
            Octet5 band 128 == 128 -> false;
            true -> true
        end
    of
        true ->
            dec_bearer_capability_5a(NewBbcRec,Rest);
        false ->
            _0_BbcRec = NewBbcRec,
            [_0_Octet6] = Rest,
            _0_STC = case (_0_Octet6 bsr 5) band 3 of 0 -> 0; 1 -> 1 end,
            _0_UPCC = case _0_Octet6 band 3 of 0 -> 0; 1 -> 1 end,
            _0_NewBbcRec = erlang:setelement(6,erlang:setelement(5,_0_BbcRec,_0_UPCC),_0_STC)
    end.
Case-of-if
%% after simplify:
dec_bearer_capability(BbcRec,[Octet5|Rest]) ->
    ...
    if
        Octet5 band 128 == 128 ->
            _0_BbcRec = NewBbcRec,
            [_0_Octet6] = Rest,
            _0_STC = case (_0_Octet6 bsr 5) band 3 of 0 -> 0; 1 -> 1 end,
            _0_UPCC = case _0_Octet6 band 3 of 0 -> 0; 1 -> 1 end,
            _0_NewBbcRec = erlang:setelement(6,erlang:setelement(5,_0_BbcRec,_0_UPCC),_0_STC);
        true ->
            dec_bearer_capability_5a(NewBbcRec,Rest)
    end.
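The case-of-if rewrite can be checked in isolation on a reduced version of the same shape. In this sketch the module name and the result atoms `more_octets` and `last_octet` are hypothetical stand-ins for the two branch bodies; only the control structure is taken from the slide.

```erlang
-module(case_of_if_sketch).
-export([before_simplify/1, after_simplify/1]).

%% An 'if' that yields a boolean, immediately scrutinised by a 'case'.
before_simplify(Octet) ->
    case
        if
            Octet band 128 == 128 -> false;
            true -> true
        end
    of
        true -> more_octets;
        false -> last_octet
    end.

%% Case-of-if: each 'if' branch is replaced by the 'case' branch that
%% its constant result would have selected.
after_simplify(Octet) ->
    if
        Octet band 128 == 128 -> last_octet;
        true -> more_octets
    end.
```

The two functions agree on every octet value, and the simplified form no longer builds and matches the intermediate boolean.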
Module merging
We want to optimize over several modules at a time
What to do about hot code loading?
– Merge modules to aggregates
– Convert suitable remote calls into local calls
– Guard such calls to preserve code loading semantics
– Annotate code regions with "origin module" to enable precise process purging
Or … extend Erlang appropriately
decode_ie_heads_setup(Bin,TypeOfCall,EprFlag,IEList,BrepFlag)
  when erlang:is_binary(Bin), erlang:size(Bin) >= 4 ->
  {Bin1,Bin2} = erlang:split_binary(Bin,4),
  [Id,F,L1,L0] = erlang:binary_to_list(Bin1),
  _4_Flag = F,
  Action =
    if
      _4_Flag band 16 == 16 ->
        case _4_Flag band 3 of
          0 -> clear_call;
          1 -> discard_proceed;
          2 -> discard_proceed_status;
          _ -> undefined
        end;
      true ->
        false,
        ignore
    end,
  _3_F = F,
  Coding =
    case _3_F band 96 of
      0 -> itu_t_standard;
      96 -> atm_forum_specific;
      _ -> undefined
    end,
  case 256 * L1 + L0 of
    Len when Len > 0 ->
      case catch erlang:split_binary(Bin2,Len) of
        {'EXIT',_} ->
          decode_ie_heads_setup(not_a_binary,TypeOfCall,EprFlag,IEList,BrepFlag);
        {Bin3,Bin4} ->
          IE = {ie,Id,Coding,Action,Len,Bin3},
          case Id of
            94 ->
              BbcRec = {scct_bbc,undefined,undefined,undefined,undefined,undefined},
              case catch begin
                _2_BbcRec = BbcRec,
                [_2_Octet5|_2_Rest] = erlang:binary_to_list(Bin3),
                _2_NewBbcRec =
                  case _2_Octet5 band 31 of
                    1 -> erlang:setelement(3,_2_BbcRec,1);
                    3 -> erlang:setelement(3,_2_BbcRec,3);
                    16 -> erlang:setelement(3,_2_BbcRec,16);
                    24 -> erlang:setelement(3,_2_BbcRec,24)
                  end,
                if
                  _2_Octet5 band 128 == 128 ->
                    _2__0_BbcRec = _2_NewBbcRec,
                    [_2__0_Octet6] = _2_Rest,
                    _2__0_STC = case (_2__0_Octet6 bsr 5) band 3 of 0 -> 0; 1 -> 1 end,
                    _2__0_UPCC = case _2__0_Octet6 band 3 of 0 -> 0; 1 -> 1 end,
                    _2__0_NewBbcRec = erlang:setelement(6,erlang:setelement(5,_2__0_BbcRec,_2__0_UPCC),_2__0_STC);
                  true ->
                    true,
                    dec_bearer_capability_5a(_2_NewBbcRec,_2_Rest)
                end
              end of
                {'EXIT',_} ->
                  CauseRec = {scct_cause,undefined,2,100,[94]},
                  RelCompUniMsg = {release_complete_uni,[CauseRec],[]},
                  {error_throw_relcomp,RelCompUniMsg};
                NewBbcRec ->
                  case erlang:element(5,NewBbcRec) of
                    0 -> decode_ie_heads_setup(Bin4,0,EprFlag,[IE|IEList],BrepFlag);
                    1 -> decode_ie_heads_setup(Bin4,1,EprFlag,[IE|IEList],BrepFlag)
                  end
              end;
            84 ->
              decode_ie_heads_setup(Bin4,TypeOfCall,yes_epr,[IE|IEList],BrepFlag);
            99 ->
              decode_ie_heads_setup(Bin4,TypeOfCall,EprFlag,[IE|IEList],yes_brep);
            _ ->
              decode_ie_heads_setup(Bin4,TypeOfCall,EprFlag,[IE|IEList],BrepFlag)
          end
      end;
    Len when Len == 0 ->
      decode_ie_heads_setup(Bin2,TypeOfCall,EprFlag,IEList,BrepFlag)
  end;
decode_ie_heads_setup(_,1,yes_epr,IEList,no_brep) ->
  {1,IEList};
decode_ie_heads_setup(_,1,yes_epr,IEList,yes_brep) ->
  {1,lists:reverse(IEList)};
decode_ie_heads_setup(_,1,no_epr,_,no_brep) ->
  CauseRec = {scct_cause,undefined,2,96,[84]},
  RelCompUniMsg = {release_complete_uni,[CauseRec],[]},
  {error_throw_relcomp,RelCompUniMsg};
decode_ie_heads_setup(_,0,_,IEList,no_brep) ->
  {0,IEList};
decode_ie_heads_setup(_,0,_,IEList,yes_brep) ->
  {0,lists:reverse(IEList)};
decode_ie_heads_setup(_,no_bbc_ie,_,_,_) ->
  CauseRec = {scct_cause,undefined,2,96,[94]},
  RelCompUniMsg = {release_complete_uni,[CauseRec],[]},
  {error_throw_relcomp,RelCompUniMsg}.