Profile-driven Inlining for Erlang

Thomas Lindgren
thomasl_erlang@yahoo.com
                Inlining
 Replace function call f(X1,…,Xn) with the
  body of f/n
 Optimization enabler
    – Simplify code
    – Specialize code
    – Remove "optimization fence"
   Standard tool in the modern compiler
    toolbox
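
As a minimal illustration of the transformation described above, the sketch below shows a call before and after inlining by hand. The module and function names (inline_demo, square/1, area_call/1, area_inlined/1) are hypothetical and do not come from the talk.

```erlang
-module(inline_demo).
-export([area_call/1, area_inlined/1]).

%% Hypothetical small helper: a typical inlining candidate.
square(X) -> X * X.

%% Before inlining: the body contains a call to square/1.
area_call(R) -> 3 * square(R).

%% After inlining by hand: the call is replaced with the body of
%% square/1, with the parameter X replaced by the argument R.
area_inlined(R) -> 3 * (R * R).
```

Once the call is gone, the whole expression is visible to later simplification passes, which is the "optimization enabler" effect.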
                    Inlining
   Main problem: which calls to inline?
    – Excessive code growth reduces performance
    – Estimate the code size growth of each inlining
    – Select the call sites with the best estimated benefit, subject to a cost limit
   Some static estimations:
    – f/n is small? (= inline cost is small)
    – Inlining the call to f/n enables optimization
   Are we optimizing the important code?
    – Or just the convenient code?
                  Inlining
   Dynamic estimation
    – Profile the program
    – Select the best hot call sites for inlining
   Optimize the important code
               Our approach
   Inlining driven by profiling
   Permit cross-module inlining
    – Computations often span several modules
    – Code growth measured for whole program
   Cross-module optimization enabled by (i)
    module aggregation and (ii) guarded
    conversion of remote to local calls
          (will not describe this further here)
          [Lindgren 98]
    The rest of this talk
 Overview of method
 Performance measurements
             Inline forest
   Inlinings to be done are represented by a
    forest
   Nodes are inlined call sites
   Leaves are call sites to be checked
   (Example shows nested inlining)

[Figure: inline forest with nodes labelled f, g and h; annotation: "Some sites are not inlined"]
    Priority-based inlining
   All call sites (leaves in the inline forest) are
    placed in a priority queue
    – Priority = estimated number of calls
   When a call site f is inlined, the call
    sites in f are added to the queue
    – Priority scaled appropriately
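
A priority queue of this kind can be sketched with the standard gb_sets module. This is an assumption for illustration only; the talk does not say which data structure was used. Elements are {Priority, CallSite} tuples, so the largest element is the hottest call site. The module name prio_demo and the call site g are hypothetical.

```erlang
-module(prio_demo).
-export([new/1, push/3, pop/1, demo/0]).

%% Elements are {Priority, CallSite}; tuples compare on the first
%% element first, so the largest element has the highest priority.
new(Sites) -> gb_sets:from_list(Sites).

push(Prio, Site, Q) -> gb_sets:add({Prio, Site}, Q).

%% Returns empty, or {{Prio, Site}, RestOfQueue}.
pop(Q) ->
    case gb_sets:is_empty(Q) of
        true  -> empty;
        false -> gb_sets:take_largest(Q)
    end.

demo() ->
    Q0 = new([{800000, decode_action}, {200000, dec_bearer_capability}]),
    {{800000, decode_action}, Q1} = pop(Q0),
    %% Assume the inlined body has one call site g with ratio 0.5:
    %% its priority is the caller's count scaled by that ratio.
    Q2 = push(800000 * 0.5, g, Q1),
    {{Prio, Site}, _Q3} = pop(Q2),
    {Prio, Site}.
```

Because gb_sets is a set, two identical {Priority, CallSite} entries collapse into one; a real implementation would need a unique call site identifier in each element.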
       Inlining algorithm
   Preprocess code
    – call_site and size maps
    – Initialize priority queue
    – Initialize inline forest
   While prio queue not empty
    – Take call site (k, f)
    – Try to inline it
           Preprocessing
   For each function visited k times
    – for each call site visited k' times,
         set ratio(call_site) = k'/k
   Adjust the ratio so that it is < 1.0
   Self-recursive call sites: ratio := 0.0
    – (improves code quality)
   Result: call_site map (function -> [{call_site, ratio}])
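
The ratio computation above can be sketched as follows. The cap of 0.99 is taken from the worked example later in the talk; treating it as a general upper bound is an assumption, as are the module name ratio_demo and the {Name, Arity} representation of functions.

```erlang
-module(ratio_demo).
-export([ratio/4, call_site_map/3]).

%% ratio(Caller, Callee, K, KPrime)
%%   K      = number of visits to the calling function
%%   KPrime = number of visits to the call site
%% Functions with K = 0 are never considered (no clause matches).
ratio(Caller, Caller, _K, _KPrime) ->
    0.0;                                  % self-recursive call site
ratio(_Caller, _Callee, K, KPrime) when K > 0 ->
    min(KPrime / K, 0.99).                % keep the ratio below 1.0

%% Build the [{call_site, ratio}] list for one function.
call_site_map(Caller, K, Sites) ->
    [{Callee, ratio(Caller, Callee, K, KPrime)} || {Callee, KPrime} <- Sites].
```

Keeping every ratio below 1.0 means priorities shrink along any chain of nested inlinings, so the later queue-driven loop cannot keep re-adding equally hot sites forever.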
Original code marked with number of visits

dec_bearer_capability(__X12,__X13) ->
    {visits,200000},
    case {__X12,__X13} of
        {BbcRec,[Octet5|Rest]} ->
            {200000},
            NewBbcRec = case Octet5 band 31 of
                            1 ->
                                {200000}(erlang:setelement(3,BbcRec,1));
                            3 ->
                                "...";
                            16 ->
                                "...";
                            24 ->
                                "..."
                        end,
            case if
                     Octet5 band 128 == 128 ->
                         {200000},
                         false;
                     true ->
                         "..."
                 end of
                true ->
                    "...";
                false ->
                    {200000}(dec_bearer_capability_6(NewBbcRec,Rest))
            end
    end.
Special attention to function calls

(Same annotated code as on the previous slide; the annotated call sites are:)
    {200000}(erlang:setelement(3,BbcRec,1))
    {200000}(dec_bearer_capability_6(NewBbcRec,Rest))
(Same annotated code as above)
   dec_bearer_capability/2 runs 200,000 times
   dec_bearer_capability_6 visited 200,000 times
   ratio is (200k/200k) = 1.0
   adjust ratio to 0.99
       Inlining a call site
   Bookkeeping phase (code gen later)
   Call to f(X1,…,Xn), visited k times
   k < minimum frequency? stop
   tot_size + size(f) > max_size? skip
   Otherwise,
    – tot_size += size(f)
    – for each call site g of f
          add (k * ratio, g) to priority queue
          extend node f by call sites g1,…,gn
   Iterate until no call sites remain
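
The bookkeeping loop above can be sketched as below. This is a simplified reading of the slide, not the author's implementation: call sites are identified by callee name only, the inline forest is reduced to a flat list of accepted inlinings, tot_size starts at 0, and the queue is a gb_sets of {K, F} pairs. The module name inline_loop and all argument names are hypothetical.

```erlang
-module(inline_loop).
-export([run/5]).

%% run(Queue, Sizes, Sites, MinFreq, MaxSize) -> {Inlined, TotSize}
%%   Queue : gb_sets of {K, F}, K = estimated number of calls
%%   Sizes : map F => size(F)
%%   Sites : map F => [{G, Ratio}], the call sites inside F
run(Queue, Sizes, Sites, MinFreq, MaxSize) ->
    loop(Queue, Sizes, Sites, MinFreq, MaxSize, 0, []).

loop(Q, Sizes, Sites, MinFreq, MaxSize, Tot, Acc) ->
    case gb_sets:is_empty(Q) of
        true ->
            {lists:reverse(Acc), Tot};
        false ->
            {{K, F}, Q1} = gb_sets:take_largest(Q),
            Size = maps:get(F, Sizes),
            if
                K < MinFreq ->
                    %% All remaining sites are colder still: stop.
                    {lists:reverse(Acc), Tot};
                Tot + Size > MaxSize ->
                    %% Too much code growth: skip this site.
                    loop(Q1, Sizes, Sites, MinFreq, MaxSize, Tot, Acc);
                true ->
                    %% Accept, then queue the call sites of F with
                    %% priority scaled by their ratios.
                    Q2 = lists:foldl(
                           fun({G, R}, QAcc) -> gb_sets:add({K * R, G}, QAcc) end,
                           Q1, maps:get(F, Sites, [])),
                    loop(Q2, Sizes, Sites, MinFreq, MaxSize,
                         Tot + Size, [{K, F} | Acc])
            end
    end.
```

With ratios below 1.0 and a positive minimum frequency, each chain of nested inlinings eventually falls under the threshold, so the loop terminates.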
               Example
   Inlining applied to decode1
    – Protocol decoding
    – Single module
                    decode1
decode_ie_coding_1/3 [800k]
decode_action/1 [800k]
dec_bearer_capability/2 [200k]
dec_bearer_capability_6/2 [198k]
decode_ie_heads_setup/5 [198k]
…
           Prio queue                               Inline forest

dec_bearer_capability/2 -> [(dec_bearer_capability_6, 1.00)]      (adjust to 0.99)
decode_ie_heads_setup/5 ->
  [(decode_action/1, 0.8), (decode_ie_coding/1, 0.8), (dec_bearer_capability, 0.2),
   (decode_ie_heads_setup/5, 0.2), (decode_ie_heads_setup/5, 0.6)]      (self-recursive, so set to 0.0)
…
          Call_site mapping (selected parts)
Try to inline
                          decode1
     decode_ie_coding_1/3 [800k]
     decode_action/1 [800k]
     dec_bearer_capability/2 [200k]
     dec_bearer_capability_6/2 [198k]
     decode_ie_heads_setup/5 [198k]
     …
                 Prio queue                               Inline forest

      dec_bearer_capability/2 -> [(dec_bearer_capability_6, 0.99)]
      decode_ie_heads_setup/5 ->
        [(decode_action/1, 0.8), (decode_ie_coding/1, 0.8), (dec_bearer_capability, 0.2),
         (decode_ie_heads_setup/5, 0.0), (decode_ie_heads_setup/5, 0.0)]
      …
                 Call_site mapping
                    decode1
-
decode_action/1 [800k]
dec_bearer_capability/2 [200k]
dec_bearer_capability_6/2 [198k]
decode_ie_heads_setup/5 [198k]
…
           Prio queue              Inline forest
                    decode1
-
-
dec_bearer_capability/2 [200k]
dec_bearer_capability_6/2 [198k]
decode_ie_heads_setup/5 [198k]
…
           Prio queue              Inline forest
                   decode1


          Prio queue                               Inline forest


Final result:
- Inline dec_bearer_cap_6/2 into dec_bearer_cap/2, yielding (*)
- Inline dec_ie_coding/1, decode_action/1 and (*) into decode_ie_heads_setup/5
- During inlining, one inline was rejected for too much code growth (not shown)

Now time for code generation
             Code generation
   Walk each inline tree from leaf to root
    – Replace inlined calls f(E1,…,En) with
            (fun(X1,…,Xn) -> E end)(E1,…,En)
    – General case: nested inlines
   Simplify the resulting function
    –   Apply fun to arguments (above)
    –   Case-of-case
    –   Case-of-if
    –   …
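
The two steps above can be illustrated on a small hypothetical function (codegen_demo, is_low/1 and the atoms low/high are made up for this sketch and do not appear in the talk). The first version calls is_low/1, the second replaces the call with an applied fun, and the third shows the result after applying the fun to its argument and performing the case-of-if simplification by hand.

```erlang
-module(codegen_demo).
-export([original/1, inlined/1, simplified/1]).

is_low(Octet) ->
    if Octet band 128 == 128 -> false;
       true -> true
    end.

%% Before inlining: a call to is_low/1 inside a case.
original(Octet) ->
    case is_low(Octet) of
        true  -> low;
        false -> high
    end.

%% Step 1: replace the call f(E) with (fun(X) -> Body end)(E).
inlined(Octet) ->
    case (fun(X) ->
                  if X band 128 == 128 -> false;
                     true -> true
                  end
          end)(Octet) of
        true  -> low;
        false -> high
    end.

%% Step 2: apply the fun to its argument, then push the outer case
%% into the branches of the if (case-of-if).
simplified(Octet) ->
    if Octet band 128 == 128 -> high;
       true -> low
    end.
```

All three versions compute the same result; only the last one is free of the intermediate boolean and the extra call.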
         Measurements
   Used five applications
    – decode1 (small protocol decoder)
    – ldapv2 (ASN.1 encode/decode)
    – gen_tcp (send/rcv over socket)
    – beam (compiler)
    – mnesia (simulate HLR)
               Benchmarks
App       Mods   Funcs   Calls   Local   Visited
gen_tcp   13     658     1546    989     202
ldapv2    5      321     1038    616     140
beam      51     2347    9669    7594    2653
mnesia    63     4207    13390   8435    984
             Performance
   Very preliminary
    – Code generation problems for beam and mnesia
      => unable to measure
    – (Probably due to name capture bug)
   Did not use outlining, higher-order
    specialization, apply open-coding [EUC’01]
   Tried only emulated code
    – Native code compilation failed
 Speedup vs baseline

App       Speedup
decode1   1.05
gen_tcp   1.04
ldapv2    1.10

  Native compilation of inlined decode1 provided a net slowdown
             Future work
 Integrate with other optimizations
 Plenty of opportunities for further
  source-level simplifications
 Suggests new approach to module
  aggregation
    – (do it after inlining instead of before)
   Tuning, measurements
    – Bugfixing …
          Conclusion

 Profile-guided inlining speeds up real
  code
 Whole-program, cross-module inlining
  probably necessary
Backup slides
%% inlined, before simplify
dec_bearer_capability(BbcRec,[Octet5|Rest]) ->
    ...
    case if                                  %% case-of-if
             Octet5 band 128 == 128 ->
                 false;
             true ->
                 true
         end of
        true ->
            dec_bearer_capability_5a(NewBbcRec,Rest);
        false ->
            _0_BbcRec = NewBbcRec,
            [_0_Octet6] = Rest,
            _0_STC = case (_0_Octet6 bsr 5) band 3 of
                         0 ->
                             0;
                         1 ->
                             1
                     end,
            _0_UPCC = case _0_Octet6 band 3 of
                          0 ->
                              0;
                          1 ->
                              1
                      end,
            _0_NewBbcRec = erlang:setelement(6,erlang:setelement(5,_0_BbcRec,_0_UPCC),_0_STC)
    end.
%% after simplify:
dec_bearer_capability(BbcRec,[Octet5|Rest]) ->
    ...
    if
         Octet5 band 128 == 128 ->
             _0_BbcRec = NewBbcRec,
             [_0_Octet6] = Rest,
             _0_STC = case (_0_Octet6 bsr 5) band 3 of
                           0 ->
                                0;
                           1 ->
                                1
                      end,
             _0_UPCC = case _0_Octet6 band 3 of
                            0 ->
                                  0;
                            1 ->
                                  1
                       end,
             _0_NewBbcRec = erlang:setelement(6,erlang:setelement(5,_0_BbcRec,_0_UPCC),_0_STC);
         true ->
             dec_bearer_capability_5a(NewBbcRec,Rest)
    end.
         Module merging
   We want to optimize over several modules
    at a time
   What to do about hot code loading?
    – Merge modules into aggregates
    – Convert suitable remote calls into local calls
    – Guard such calls to preserve code loading
      semantics
    – Annotate code regions with "origin module" to
      enable precise process purging
   Or … extend Erlang appropriately
decode_ie_heads_setup(Bin,TypeOfCall,EprFlag,IEList,BrepFlag) when
    erlang:is_binary(Bin), erlang:size(Bin) >= 4 ->
    {Bin1,Bin2} = erlang:split_binary(Bin,4),
    [Id,F,L1,L0] = erlang:binary_to_list(Bin1),
    _4_Flag = F,
    Action = if
                 _4_Flag band 16 == 16 ->
                     case _4_Flag band 3 of
                         0 ->

								