

Optimizing Charm++ Messaging for the Grid

  Gregory A. Koenig (koenig@cs.uiuc.edu)
  Parallel Programming Laboratory
  Department of Computer Science
  University of Illinois at Urbana-Champaign

  2005 Charm++ Workshop
Goals of this work
 Optimize message passing, in terms of:
    Latency
    CPU overhead
 Optimize both single-cluster messages and Grid messages
 Leverage hardware support as much as possible
 Use NCSA Virtual Machine Interface (VMI) messaging
  layer; create a solution that is applicable to other layers
 Primary deployment on TeraGrid (Myrinet)

 [Figure: Cluster A and Cluster B; intra-cluster latency
  (microseconds) vs. inter-cluster latency (milliseconds)]
Message Passing Primitives
 Stream Send
   {Stream Open, Send Fragment, …, Stream Close}
   Message data must be copied an extra time into
    receive buffer (i.e., only good for small messages)
   Easy to use (low management overhead)

 RDMA (Remote Direct Memory Access)
   RDMA Put, RDMA Get
   Message data are written/read directly into receive
    buffer (i.e., good for large messages)
   Harder to use (requires buffer management)
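
The difference between the two primitives can be illustrated with a toy model that only counts the extra receive-side copy. This is a sketch, not VMI code: the function names, the fragment size, and the use of Python byte buffers to stand in for network buffers are all assumptions made for illustration.

```python
def stream_send(message: bytes, fragment_size: int = 1024):
    """Toy stream send: the message travels as fragments, and each fragment
    is copied one more time into the final receive buffer."""
    fragments = [message[i:i + fragment_size]
                 for i in range(0, len(message), fragment_size)]
    receive_buffer = bytearray()
    extra_bytes_copied = 0
    for fragment in fragments:
        receive_buffer += fragment            # the extra copy
        extra_bytes_copied += len(fragment)
    return bytes(receive_buffer), extra_bytes_copied


def rdma_put(message: bytes, receive_buffer: bytearray) -> int:
    """Toy RDMA Put: data land directly in a buffer that the receiver had to
    prepare beforehand (the buffer management cost). The slice assignment
    stands in for the NIC writing into memory; no extra copy is counted."""
    if len(receive_buffer) < len(message):
        raise ValueError("receive buffer too small")
    receive_buffer[:len(message)] = message
    return 0                                  # extra bytes copied
```

The extra copy grows linearly with message size, which is why the stream path is attractive only for small messages.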
RDMA Put (Rendezvous)

 Processor A sends a message to Processor B via RDMA Put:
  1. A sends a short setup message to B indicating an
     upcoming RDMA Put and specifying the message size
  2. B registers (pins into memory) a receive buffer of the
     specified size and responds to A with its address
  3. A does an RDMA Put directly into B’s pinned receive
     buffer (“zero copy”)

Notice that Processor A must actively promote the send for a relatively long
        period of time – time that could be spent computing instead!
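
The three steps of the rendezvous can be sketched as a small simulation that counts network traversals. The class and function names, and the dictionary standing in for pinned memory, are hypothetical and are not part of VMI or Charm++.

```python
class Receiver:
    """Toy Processor B. A dict stands in for registered (pinned) memory."""

    def __init__(self):
        self.pinned = {}
        self.next_address = 0x1000

    def handle_setup(self, size):
        # B pins a receive buffer of the requested size and reports its address.
        address = self.next_address
        self.next_address += max(size, 1)
        self.pinned[address] = bytearray(size)
        return address

    def handle_put(self, address, data):
        # The Put lands directly in the pinned buffer ("zero copy").
        self.pinned[address][:len(data)] = data


def rdma_put_rendezvous(receiver, message):
    """Run the rendezvous from Processor A's side and count traversals."""
    traversals = 0
    traversals += 1                      # A -> B: setup message with the size
    address = receiver.handle_setup(len(message))
    traversals += 1                      # B -> A: reply with the buffer address
    traversals += 1                      # A -> B: the RDMA Put itself
    receiver.handle_put(address, message)
    return address, traversals
```

In this model A has to stay involved across all three traversals, which is the cost the note above points out.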
Pitfalls of RDMA Put
 [Figure panels: Expected Behavior, Unexpected Behavior]
RDMA Get

 Processor A sends a message to Processor B via RDMA Get:
  A registers (pins into memory) a send buffer and sends
     a short setup message to B with its address and size
  B does an RDMA Get directly from the sender’s buffer
     into a receive buffer (“zero copy”)

Notice that once Processor A initiates the Get operation, it is free to do
 other things (such as computing) while hardware promotes the send.
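
The same kind of toy model applies to the Get protocol. As before, the names are hypothetical, and the Get is counted as a single traversal because that is how the slides count it.

```python
class Sender:
    """Toy Processor A. A dict stands in for registered (pinned) memory."""

    def __init__(self):
        self.pinned = {}
        self.next_address = 0x2000

    def register(self, message):
        # A pins the send buffer and remembers where it lives.
        address = self.next_address
        self.next_address += max(len(message), 1)
        self.pinned[address] = bytes(message)
        return address

    def serve_get(self, address, size):
        # In real RDMA the NIC serves this read; A's CPU is free to compute.
        return self.pinned[address][:size]


def rdma_get_send(sender, message):
    """Send a message via RDMA Get and count traversals."""
    traversals = 0
    address = sender.register(message)
    setup = (address, len(message))
    traversals += 1                      # A -> B: setup with address and size
    received = sender.serve_get(*setup)  # B pulls the data from A's buffer
    traversals += 1                      # the Get, counted once as on the slides
    return received, traversals
```

After the setup message leaves, nothing in `rdma_get_send` requires further action from the sender's CPU.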
Benefits of RDMA Get
 Leverages hardware to allow more overlapping of
  communication with computation
 Reduces the number of network traversals required to
  send the message from three to two
 Reduces the chance (by about half) that a busy CPU
  will not acknowledge an RDMA operation in a timely
  manner

 But…
   If the receiver is busy when the setup message
     arrives, the Get can still be delayed
   Two network traversals are required to send the
     message (not so good for Grid computations)

               Can we do better??
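
A quick calculation shows why the traversal count matters more on the Grid than inside a cluster. The two one-way latencies below are assumptions chosen only to match the orders of magnitude named earlier (microseconds within a cluster, milliseconds between clusters); they are not measurements.

```python
# One-way latencies in microseconds (assumed values, not measurements).
INTRA_CLUSTER_US = 10
INTER_CLUSTER_US = 2_000

# Traversal counts as stated on the slides.
TRAVERSALS = {"rdma_put_rendezvous": 3, "rdma_get": 2}


def handshake_latency_us(protocol, one_way_us):
    """Latency spent purely on network traversals, ignoring transfer time."""
    return TRAVERSALS[protocol] * one_way_us


for name in TRAVERSALS:
    print(name,
          handshake_latency_us(name, INTRA_CLUSTER_US), "us intra-cluster,",
          handshake_latency_us(name, INTER_CLUSTER_US), "us inter-cluster")
```

Under these assumed values, removing one traversal saves 10 us inside a cluster but 2,000 us across clusters.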
Eager Communication Channels
 It would be nice if each receiver could dedicate a
  receive buffer to each sender; the sender could then
  just Put data directly into the buffer assigned to it
 Unfortunately, this does not scale
   Buffers must be periodically polled (or serviced by an
      interrupt, which is usually slow)
   Pinned memory is a finite resource

 Solution: Have the message layer observe the
  communication characteristics of the computation and
  set up dedicated buffers (“eager channels”) between
  pairs of processes that communicate frequently
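
One way to picture "observing the communication characteristics" is a per-sender message counter with a threshold and a cap on the number of channels. The slides do not describe the actual heuristic, so the class below, its threshold, and its cap are all hypothetical.

```python
from collections import Counter


class ChannelPolicy:
    """Receiver-side policy: grant an eager channel to frequent senders."""

    def __init__(self, threshold=8, max_channels=4):
        self.threshold = threshold        # messages before a channel is granted
        self.max_channels = max_channels  # pinned memory is a finite resource
        self.counts = Counter()
        self.channels = set()

    def record_message(self, sender):
        """Count one message; return True if the sender now has a channel."""
        self.counts[sender] += 1
        if sender in self.channels:
            return True
        if (self.counts[sender] >= self.threshold
                and len(self.channels) < self.max_channels):
            self.channels.add(sender)
            return True
        return False
```

The cap reflects the scaling concern above: each channel costs pinned memory and polling time, so only a limited number of senders can be served this way.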
Eager Channel Implementation
(Small Messages)
    When a receiver notices that a given process frequently sends to it, the
     receiver dedicates a buffer to the sender and divides it into “slots”
    Receiver polls a sentinel at the end of the active slot -- a changed
     sentinel indicates that a new message is present in that slot
    When a message is received, the address of the slot is returned to the
     application (must intercept subsequent CmiFree() calls!) and polling
     takes place on the next slot in the buffer, round-robin

    [Figure: receive buffer divided into slot 1, slot 2, slot 3, slot 4]

     Sender does an RDMA Put into slots in order
     Every message send requires a “send credit”; if a sender does not
      have a credit, it can still send the message via the slow path
     When the receiver frees a message in a slot, a send credit is returned
      to the sender (frees must happen almost in order; otherwise holes form
      in the buffer)
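
The slot, sentinel, and credit mechanics can be sketched in a single-process model. This is not the vmi-linux implementation: the class, its fields, and the rule that credits are returned only for the oldest contiguous run of freed slots are assumptions, the last one being one simple way to avoid holes.

```python
EMPTY, FULL = 0, 1


class EagerChannel:
    """Toy slotted eager channel shared by one sender and one receiver."""

    def __init__(self, num_slots=4, slot_size=64):
        self.num_slots = num_slots
        self.slot_size = slot_size
        self.payloads = [None] * num_slots
        self.sentinels = [EMPTY] * num_slots
        self.freed = [False] * num_slots
        self.write_index = 0      # sender: next slot to Put into
        self.poll_index = 0       # receiver: active slot being polled
        self.release_index = 0    # oldest slot not yet returned to the sender
        self.credits = num_slots  # sender: one credit per free slot

    def send(self, payload):
        """Sender side: return 'fast' if a slot was used, else 'slow'."""
        if self.credits == 0 or len(payload) > self.slot_size:
            return "slow"
        self.credits -= 1
        slot = self.write_index
        self.payloads[slot] = payload
        self.sentinels[slot] = FULL   # sentinel is written last
        self.write_index = (slot + 1) % self.num_slots
        return "fast"

    def poll(self):
        """Receiver side: return (slot, payload) or None if nothing arrived."""
        slot = self.poll_index
        if self.sentinels[slot] != FULL:
            return None
        self.sentinels[slot] = EMPTY
        self.poll_index = (slot + 1) % self.num_slots
        return slot, self.payloads[slot]

    def free(self, slot):
        """Application frees a message; credits return only in slot order."""
        self.freed[slot] = True
        while self.freed[self.release_index]:
            self.freed[self.release_index] = False
            self.payloads[self.release_index] = None
            self.credits += 1
            self.release_index = (self.release_index + 1) % self.num_slots
```

In this model a slot freed out of order does not yield a credit until every older slot has also been freed, so the sender never overwrites a message the application still holds.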
Eager Channel Implementation
(Large Messages)
 The maximum message size for the slotted
  buffer approach is bounded by the size of a slot
 For larger message sizes, dedicate a small
  number (e.g., three) of larger sized buffers
  (e.g., 1 MB) to the sender
 Instead of polling these large message
  buffers (which uses CPU cycles), service
  them via interrupt; the latency of the
  interrupt is essentially lost in the latency of
  the actual data transfer
Summary of message passing paths in vmi-linux machine layer

                  Slow path            Fast path (Eager)
 Small messages   Sent via Stream      Sent via RDMA Put into slotted
                                       buffers, which are polled
 Large messages   Sent via RDMA Get    Sent via RDMA Put into a small number
                                       of interrupt-serviced buffers
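
The four paths amount to a two-way decision on message size and on whether an eager channel exists. A minimal sketch follows; the function name and the 4,096-byte boundary between small and large messages are assumptions, since the slides do not give the actual cutoff.

```python
SMALL_MESSAGE_LIMIT = 4096  # bytes; hypothetical cutoff, not from the slides


def select_path(size_bytes, has_eager_channel, small_limit=SMALL_MESSAGE_LIMIT):
    """Return (path, mechanism) for a message, following the summary table."""
    is_small = size_bytes <= small_limit
    if has_eager_channel:
        if is_small:
            return ("fast", "RDMA Put into polled slotted buffer")
        return ("fast", "RDMA Put into interrupt-serviced buffer")
    if is_small:
        return ("slow", "Stream")
    return ("slow", "RDMA Get")
```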
Preliminary Results
One-way Pingpong Latency

 Msg Size (bytes)   Slow Path (us)   Fast Path (us)
 16                 11.63            9.20
 64                 11.66            9.33
 256                15.70            10.45
 1,024              23.26            18.64
 4,096              61.59            33.57
 16,384             112.96           85.01
 65,536             318.59           285.15
 262,144            1125.63          1080.32
 1,048,576          4345.29          4260.45
 4,194,304          17132.89         16981.12

 Converse pingpong test running on NCSA Mercury (TeraGrid) cluster
    1.3 GHz IA-64 processors
    Myrinet interconnect
 Row colors on the original slide: yellow = small msg, green = large msg
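
To make the comparison easier to read, the relative latency reduction of the fast path can be computed directly from the table. The numbers below are copied from the slide; the helper name is hypothetical.

```python
# Message size in bytes -> (slow path us, fast path us), from the table.
LATENCY_US = {
    16: (11.63, 9.20),
    64: (11.66, 9.33),
    256: (15.70, 10.45),
    1024: (23.26, 18.64),
    4096: (61.59, 33.57),
    16384: (112.96, 85.01),
    65536: (318.59, 285.15),
    262144: (1125.63, 1080.32),
    1048576: (4345.29, 4260.45),
    4194304: (17132.89, 16981.12),
}


def reduction_percent(slow_us, fast_us):
    """Relative latency reduction of the fast path, in percent."""
    return 100.0 * (slow_us - fast_us) / slow_us


reductions = {size: reduction_percent(slow, fast)
              for size, (slow, fast) in LATENCY_US.items()}
best_size = max(reductions, key=reductions.get)

for size, pct in reductions.items():
    print(f"{size:>9} bytes: {pct:5.1f}% lower latency on the fast path")
```

The largest relative gain is at 4,096 bytes (roughly 45%), and the gain falls below 1% at 4,194,304 bytes, where transfer time dominates.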
Persistent Communication API
 Implemented by Gengbin Zheng and Sameer Kumar
  in elan-linux and net-linux machine layers
 General implementation/usage:
   Programmer initializes the API, specifying the
      maximum size for a message send
   Programmer switches persistence on/off on a per-
      send basis
   In a persistent send, data (up to maxsize bytes) are
      Put directly into a dedicated buffer on receiver
   API exploits programmer’s knowledge of the
      application’s communication patterns – no checks are
      done to ensure message data are not overwritten
 In vmi-linux machine layer, the API is simply used as
  an indicator that an eager channel is highly desirable
  between a pair of processes
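
The "no checks" point can be illustrated with a toy dedicated buffer. This is not the Charm++ persistent communication API: the class and method names are hypothetical, and rejecting an oversized message with an error is an assumption made only for this sketch.

```python
class PersistentBuffer:
    """Toy dedicated receive buffer for one sender (not the Charm++ API)."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.data = bytearray(max_size)
        self.length = 0

    def put(self, message):
        """Write directly into the buffer; earlier data are not protected."""
        if len(message) > self.max_size:
            raise ValueError("message exceeds the size given at initialization")
        self.data[:len(message)] = message
        self.length = len(message)

    def read(self):
        return bytes(self.data[:self.length])


buf = PersistentBuffer(max_size=16)
buf.put(b"first")
buf.put(b"second")     # overwrites "first" before anyone read it
print(buf.read())
```

The second Put silently replaces the first, which is why the programmer has to know that the application's communication pattern makes this safe.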
Future Work
 This is pretty complicated code – a lot more testing is
  needed, additional optimization is most likely possible

 Measure performance in TeraGrid environment

 Measure performance on hardware optimized for
  RDMA (e.g., PCI Express bus, InfiniBand interconnect)

 Implement the ability to discard eager channels if
  they are unused for some period of time (this is
  probably important in conjunction with load balancing
  and migrating objects based on their communication
  patterns)
