Appeared in Proceedings of IEEE SC99: High Performance Networking and Computing Conference, Portland, Oregon, November 1999.

      H-RMC: A Hybrid Reliable Multicast Protocol for the Linux Kernel

                  Philip K. McKinley and Ravi T. Rao
         Department of Computer Science and Engineering
                   Michigan State University
                 East Lansing, Michigan 48824

                        Robin F. Wright
              MIT Laboratory for Computer Science
                    545 Technology Square
                Cambridge, Massachusetts 02139

This paper describes H-RMC, a reliable multicast protocol designed for implementation in the Linux
kernel. H-RMC takes advantage of IP multicast and is primarily a NAK-based protocol. To accommodate
low-loss environments, where feedback in the form of NAKs is scarce, H-RMC receivers return periodic
update messages in the absence of other reverse traffic. H-RMC uses a combination of rate-based and
window-based flow control. The sender maintains minimal information about each receiver so that
buffered data is not released prematurely, and polls receivers in case it has not heard from them at the
time of buffer release. Combined, these techniques produce a reliable multicast data stream with a
relatively low rate of feedback. Performance results show that adequate kernel buffer space, combined
with a two-stage rate control method and polling, are effective in minimizing feedback from receivers and
thereby in maintaining reasonable throughputs.

1. Introduction
Reliable multicast communication, which involves the reliable delivery of a data stream from a single
source to multiple destinations in a computer network, has received a great deal of attention in recent
years [1-15]. Applications that can take advantage of such a service include distributed parallel
processing, computer supported cooperative work, distance education, bulk distribution of software
upgrades, and distributed operating systems.
          The extent and nature of reliability depend on the type of reliable multicast protocol. In ACK-
based protocols [1, 2, 6], each receiver sends a positive acknowledgement (ACK) for every packet that
arrives. When the sender transmits a packet, it starts a timer and expects acknowledgements from all
receivers to arrive before the timer expires. Typically, a sliding window protocol is used to improve
throughput. To improve scalability, NAK-based protocols [3, 4, 5, 7, 9, 10, 11, 15] delegate the
responsibility of detection of losses to the receivers. The receivers monitor the sequence of packets they
receive from the sender and send a negative acknowledgement (NAK) on detection of a gap in the
sequence numbers. In a purely NAK-based protocol with a finite send buffer, it is possible for a receiver
to send a NAK for a packet that the sender has released from its buffer, thereby compromising reliability.
In polling-based protocols [8], instead of relying on receivers to send ACKs or NAKs, the sender expects
receivers simply to receive data and take no other specific action unless prompted by the sender.
Specifically, the sender periodically polls sets of receivers, which return updates indicating which packets
they have received so far. The sender uses this information to retransmit packets either to the entire group
or to a limited set of receivers, depending on the extent of loss.
          The reliable multicast problem is fundamentally difficult for three main reasons. First, providing
reliability on the data stream requires either positive feedback or negative feedback from each of the
receivers; either method creates the potential for feedback processing to overwhelm the sender [12].
Although polling lowers this risk, a high polling rate can reduce throughput. Second, maintaining
membership and per-receiver state information at the sender is potentially costly in terms of memory and
processing overhead. Third, implementing flow control that is acceptable to all receivers is complicated
by the presence of heterogeneous receivers, transient network delays, and buffer space limitations.
         This research project addresses these three issues when the reliable multicast protocol is to be
implemented in the operating system kernel, where limited buffer space complicates the operation of the
protocol. The protocols developed are intended to serve as vehicles for studying reliable multicast
implementation issues. As such, they incorporate several features of existing reliable multicast protocols
and are designed to evolve over time. The first phase of the project involved the design of the RMC
protocol [15], which was implemented and tested as a driver in the Linux kernel. RMC used a purely
NAK-based approach to reliability. This design facilitated the study of flow control when the only
feedback available to the sender is through NAKs and flow control requests. Due to finite kernel buffer
space, however, it is possible for the sending protocol to release data that is later requested for
retransmission (this event is rare and never happened in the RMC experiments). In this case, both the
sending and the receiving applications are informed of the retransmission error and can take appropriate
actions to retrieve or recreate the lost data.
         This paper describes the second phase of the project, in which we have minimally augmented the
RMC protocol with periodic positive feedback and polling in order to remove any dependency on the
application processes for guaranteeing reliability. The resulting protocol, H-RMC (for Hybrid RMC), is
primarily NAK-based, but uses update messages and polling to provide the sender with additional
information about the state of receivers. The main contribution of this work is to demonstrate that, using a
combination of window-based and rate-based flow control, in conjunction with a hybrid reliability
mechanism and minimal per-receiver state information, it is possible to develop an efficient reliable
multicast protocol that fits well into the Linux operating system kernel. Moreover, providing adequate
kernel buffer space is a key to maximizing throughput.
         The remainder of this report is organized as follows. In Section 2, we give an overview of the
original RMC protocol. Section 3 presents the H-RMC architecture, and Section 4 describes its Linux
implementation. Performance results of experiments and simulations are presented in Section 5. Finally,
in Section 6 we present our conclusions and discuss future areas of work.

2. Overview of RMC
In this section, we review the RMC protocol upon which H-RMC is based. RMC is a NAK-based
protocol that uses anonymous group membership and a combination of window-based and rate-based flow
control. The discussion here is intended to provide a brief overview of the architecture and operation of
RMC. For further details, please refer to [15].

Basic Operation. Multicast communication by means of the RMC protocol involves three key entities:
the sending and receiving applications, the RMC protocol itself, and an underlying best-effort multicast
service, such as IP multicast. The application process at the sending host passes a data stream to the
RMC protocol for transmission. The RMC protocol fragments this data stream into a sequence of data
packets, each of which is assigned a sequence number and prefixed with an RMC header. The data
packets are then transmitted using the best-effort network multicast service, which attempts to deliver a
copy of each packet to each of the hosts in the multicast group. As the data packets arrive at the receiving
hosts, the RMC protocol checks the packets for correctness and reassembles the individual packets into a
data stream that is identical to the data stream that was sent from the source application. Finally, the
RMC protocol at each of the hosts delivers the reassembled data stream to the receiving application at that host.

Packet Types. Like TCP, RMC provides a reliable stream service to applications. The byte stream is
divided into a sequence of packets, each containing one or more bytes of data. All RMC segments are
prefixed with a 20-byte header, shown in Figure 1, which is similar in many respects to that used by TCP
[19]. The major difference is the presence of a Rate Advertisement field, which is used for flow control:
the sender uses this field to inform the receivers of the current transmission rate, and the receivers use it in
feedback messages (rate requests) to suggest a lower sending rate. The original protocol used nine types
of packets, listed in Table 1. Two additional types, UPDATE and PROBE, were added in H-RMC, and
are discussed in Section 3.

          [Figure 1. RMC/H-RMC packet header format: Source Port and Destination Port,
          Sequence Number, Rate Advertisement, Checksum, Tries, Type, and the URG and
          FIN flags (remaining flag bits unused).]
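The 20-byte header in Figure 1 can be expressed as a C structure. The sketch below is one plausible packing: the field names come from the figure, but the individual field widths are our assumptions inferred from the TCP-like layout the paper describes, chosen so that the total is 20 bytes.

```c
#include <stdint.h>

/* One plausible packing of the 20-byte RMC/H-RMC header (Figure 1).
 * Field names are from the paper; field widths are assumptions. */
struct hrmc_hdr {
    uint16_t source_port;   /* sending endpoint                          */
    uint16_t dest_port;     /* receiving endpoint                        */
    uint32_t seq;           /* Sequence Number                           */
    uint32_t rate_adv;      /* Rate Advertisement: current (or, in rate
                             * requests, suggested) transmission rate    */
    uint16_t checksum;      /* packet checksum                           */
    uint8_t  tries;         /* transmission attempt count                */
    uint8_t  type;          /* DATA, NAK, JOIN, UPDATE, PROBE, ...       */
    uint16_t flags;         /* URG, FIN bits                             */
    uint16_t unused;        /* pad to 20 bytes                           */
};
```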

Connection Management. Like UDP and TCP, RMC uses port numbers to identify the sending and
receiving processes. To establish connection, the sender need only bind to a local port and establish a
destination endpoint consisting of a multicast IP address and a port number. A receiver is required to
perform two tasks: inform the supporting network layer that it wishes to join the multicast group, and
send a JOIN message to the sender in response to the first data packet that it receives from the data
stream. To confirm receipt of the request, the sender unicasts a JOIN_RESPONSE packet back to the new
receiver. In order to close a connection, a receiver informs the supporting network layer that it wishes to
leave the multicast group and sends a LEAVE message to the sender, which responds with a
LEAVE_RESPONSE packet. Consistent with IP multicast, group membership in the original RMC
protocol is anonymous. Information about receivers is not maintained; the sender uses JOIN and LEAVE
requests from receivers only to approximate the number of receivers in the group and to estimate the
worst round-trip time among receivers.

                                Table 1. RMC and H-RMC Packet Types

      Packet Type                                   Use in RMC and H-RMC
 DATA                     Used by sender for data transmissions and retransmissions.
 NAK                      Used by receiver to request data retransmissions.
 NAK_ERR                  Used by sender to inform a receiver it cannot satisfy retransmission request.
 JOIN                     Used by a receiver to request to join the multicast group.
 JOIN_RESPONSE            Used by sender to confirm that a join request has been accepted.
 LEAVE                    Used by a receiver to inform the sender that it is leaving the multicast group.
 LEAVE_RESPONSE           Used by sender to confirm that a leave request has been received.
 CONTROL                  Used by a receiver to request a reduced transmission rate.
 KEEPALIVE                Used by sender to keep the connection active during idle time.
 UPDATE*                  Used by the receiver to send state information to the sender.
 PROBE*                   Used by the sender to obtain state information from receivers.
* Present in H-RMC only

NAK-Based Reliability. The original RMC protocol uses pure NAK-based reliability. As each receiver
reassembles the data stream, it detects missing segments by monitoring the sequence numbers of
incoming data packets. When a gap occurs, the receiver sends a NAK for the missing data directly to the
sender. To avoid retransmitting NAKs before the sender has had ample opportunity to respond, RMC
receivers use local NAK suppression. Recovery of lost packets is centralized: the sender is solely
responsible for retransmitting data to the receivers. Although many reliable multicast protocols use local
recovery, in which other receivers store and retransmit packets [4], the relatively simple centralized
recovery method of RMC (and H-RMC) facilitates the study of flow control and buffering at the sender.
In a future study, we may extend H-RMC to include local recovery.
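The receiver-side gap detection and local NAK suppression described above might look as follows. This is a sketch under our own assumptions: the function name, the state fields, and the two-RTT holdoff interval are ours, not taken from the RMC source.

```c
#include <stdint.h>

#define NAK_HOLDOFF 2  /* suppress duplicate NAKs for ~2 RTTs (assumed) */

struct rcv_state {
    uint32_t rcv_nxt;       /* next in-order sequence number expected */
    uint32_t last_nak_seq;  /* gap most recently NAKed                */
    long     last_nak_time; /* when that NAK was sent                 */
};

/* Called when a data packet arrives. Returns 1 if a NAK for the gap
 * before arrived_seq should be sent now, 0 if there is no gap or the
 * NAK is locally suppressed because one was sent recently. */
int should_send_nak(struct rcv_state *rs, uint32_t arrived_seq,
                    long now, long rtt)
{
    if (arrived_seq <= rs->rcv_nxt)
        return 0;                       /* in order or duplicate: no gap */
    if (rs->last_nak_seq == rs->rcv_nxt &&
        now - rs->last_nak_time < NAK_HOLDOFF * rtt)
        return 0;                       /* suppressed: recently NAKed    */
    rs->last_nak_seq = rs->rcv_nxt;
    rs->last_nak_time = now;
    return 1;
}
```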

        A potential problem in NAK-based protocols is that the loss of the last packet in a burst of data
may go undetected until the next burst begins. As in other protocols [4, 5], RMC addresses this problem
by transmitting keepalive packets. These packets contain the sequence number of the last packet
transmitted. To avoid congestion of keepalive packets during periods of inactivity, the keepalive packets
are exponentially backed off up to a maximum delay (currently 2 seconds). Since reliability is purely
NAK-based, the throughput obtained can be very high in low loss networks (an experimental throughput
of nearly 8.5 Mbps in a 10 Mbps LAN with 8 receivers has been reported [15]). However, reliability
cannot be guaranteed: it is possible for packets to be released at the sender, which have not been received
by all the receivers. When a NAK is received for packets already released from the buffer, the sender
transmits a NAK_ERR packet to the receiver informing it that the request cannot be fulfilled.
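The exponential backoff of keepalive packets can be sketched as below. The 2-second cap is from the paper; the initial interval and the use of milliseconds are our assumptions.

```c
#define KA_INITIAL_MS 50    /* assumed starting interval            */
#define KA_MAX_MS     2000  /* 2-second cap stated in the paper     */

/* Next keepalive interval given the current one: the interval
 * doubles after each keepalive during an idle period, capped at
 * KA_MAX_MS. It would reset to KA_INITIAL_MS when new data is sent. */
long next_keepalive_interval(long cur_ms)
{
    long next = cur_ms * 2;
    return next > KA_MAX_MS ? KA_MAX_MS : next;
}
```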

Flow Control. The two common methods of flow control used in reliable multicasting are window-based
and rate-based. In window-based flow control, the sender can transmit only data that lies in
its send window. The send window is advanced once sufficient acknowledgements are received from
receivers. The window typically grows exponentially during periods of successful transmission and is
reduced on receipt of NAKs or flow control requests. In rate-based flow control, timers are used to limit
the amount of data that the sender is allowed to transmit. Rates are adjusted based on the loss that is
reported by receivers. Window-based flow control is usually used in ACK-based protocols, whereas rate-
based flow control is typically used in NAK-based protocols, where lack of state information complicates
the advancing of windows.
      RMC (and H-RMC) uses a combination of window-based and rate-based flow control. In the
window-based component, the sender and each of the receivers enforce a set of rules on the local
sequence number space that defines a window of data that may be sent and received, respectively. Since
RMC uses purely NAK-based reliability, sliding of the window (and hence, buffer release) at the sender is
based on when a packet was most recently sent and an estimate of the round-trip time to the most distant
receiver. The minimum time that any data packet must be buffered is MINBUF round trip times (set to 10
in the current implementation). The receive sequence space is depicted in Figure 2. Region R1 refers to
that part of the data stream that has been received and consumed by the receiving application. Region R2
includes data that has been received and is being buffered until it is read by the application. Region R3
includes data that can be immediately received and buffered. Finally, region R4 is the part of the data
stream that is to be received in the future, but which does not fit into the current receive window.

          [Figure 2. Flow control in RMC: the receive sequence space is divided into
          regions R1 through R4, with the buffered portion spanning safe, warning, and
          critical regions; rcv_wnd marks the start of the receive window, rcv_nxt the
          next expected sequence number, and rcv_wnd + rcv_wnd_size the end of the window.]

         The rate-based component of RMC flow control is similar to that of RAMP [5]. The sender
maintains a current transmission rate that defines how quickly it can transmit data from the send window.
The value of the current transmission rate is advertised in every outgoing packet. Receivers use rate
requests, contained in CONTROL packets, to request that the sender reduce the transmission rate. Upon
receipt of a new data packet, the following three rules determine whether to send a rate request: (1) if the
receive window is filled only into the safe region, then no flow control action is taken; (2) if the receive
window is filled into the warning region, the receiver sends a rate request if the amount of data that may
be sent at the advertised rate for the next WARNBUF (currently set to 4) round-trip times is greater than
the empty portion of the receive window; (3) if the receive window is filled into the critical region, the
receiver sends a rate request, with the URGENT flag set, which will stop forward transmission for two
round-trip times, regardless of the advertised rate. At the beginning of data transmission for a new
connection, and any time following an urgent rate request, the sender sets the transmission rate to a
minimum value and uses slow start and congestion avoidance phases, similar to those of TCP [20], to
increase the transmission rate. On receipt of a NAK or a warning rate request, the sender cuts its
transmission rate by half and begins a linear increase in transmission rate.
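The three receiver-side rules above can be sketched as a single decision function. WARNBUF = 4 is from the paper; the exact boundaries of the warning and critical regions are not given, so the fractions below are assumptions, as are all names.

```c
#include <stdint.h>

#define WARNBUF 4  /* round-trip times, from the paper */

enum fc_action { FC_NONE, FC_RATE_REQUEST, FC_URGENT_REQUEST };

/* Decide the flow-control action on arrival of a new data packet.
 * filled = bytes currently buffered in the receive window,
 * wnd_size = total window size, rate_bytes_per_rtt = data the sender
 * may emit per RTT at the advertised rate. Region boundaries (1/2 and
 * 9/10 of the window) are assumed, not from the paper. */
enum fc_action flow_control_action(uint32_t filled, uint32_t wnd_size,
                                   uint32_t rate_bytes_per_rtt)
{
    uint32_t warn_start = wnd_size / 2;         /* assumed boundary */
    uint32_t crit_start = (wnd_size * 9) / 10;  /* assumed boundary */
    uint32_t empty = wnd_size - filled;

    if (filled >= crit_start)
        return FC_URGENT_REQUEST;   /* rule 3: halt sender for 2 RTTs   */
    if (filled >= warn_start &&
        (uint64_t)WARNBUF * rate_bytes_per_rtt > empty)
        return FC_RATE_REQUEST;     /* rule 2: data due in WARNBUF RTTs
                                     * exceeds the remaining space      */
    return FC_NONE;                 /* rule 1: safe region              */
}
```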

Group Membership. Group membership in RMC is anonymous. To reduce the workload at the sender,
the sender maintains only a count of the number of receivers in the multicast group, based on join
messages sent by the receivers when they start listening to the multicast connection. The sender also
calculates the round trip time to the most distant receiver, using Karn's algorithm [18], and continues
updating this value based on incoming NAKs and rate-reduce requests. The round trip time estimates are
used in various places of the protocol, such as advancing the send window, in order to increase the
adaptability of the protocol.
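A worst-case RTT estimator in the spirit of Karn's algorithm, as used above, might look like the following sketch: samples from retransmitted packets are discarded as ambiguous, and the estimate tracks the most distant receiver. The smoothing gain and all names are our assumptions; the paper does not give this code.

```c
struct rtt_est {
    long srtt_ms;  /* smoothed worst-case round-trip time */
};

/* Feed one RTT sample into the estimator. Per Karn's algorithm,
 * samples taken from retransmitted packets are ignored because it is
 * ambiguous which transmission the reply corresponds to. */
void rtt_sample(struct rtt_est *est, long sample_ms, int was_retransmitted)
{
    if (was_retransmitted)
        return;                                       /* ambiguous sample */
    if (sample_ms > est->srtt_ms)
        est->srtt_ms = sample_ms;                     /* more distant receiver */
    else
        est->srtt_ms += (sample_ms - est->srtt_ms) / 8;  /* decay slowly (gain assumed) */
}
```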

3. H-RMC Architecture
The H-RMC protocol is designed and implemented as an extension of RMC. H-RMC is a hybrid of the
three approaches to reliable multicast: ACK-based, NAK-based, and polling. The protocol draws from the
merits of all three methods, while attempting to avoid their drawbacks, in order to ensure complete
reliability in the presence of limited kernel buffers. H-RMC is primarily NAK-based; however, receivers
send periodic updates, similar to the approach followed in ACK-based protocols. The sender maintains
per-receiver state information and polls receivers to guarantee complete reliability. Moreover, H-RMC
achieves throughput similar to that of RMC, indicating that the overhead of adding such functionality to
RMC is minimal. In this section we describe the main features of the H-RMC protocol, using the original
RMC protocol as a reference point.

Membership Maintenance. Whereas the original RMC protocol did not maintain per-receiver state
information at the sender, this model had to be changed in order to guarantee reliability with limited
buffer space. In H-RMC, group membership is maintained in the form of a doubly linked list as well as a
hashed list of all the receivers. The space required is minimal: for each receiver, the sender keeps its
(unicast) IP address and the sequence number that the receiver is expecting next. Since both rate requests
and NAKs carry the next expected sequence number, this field is updated whenever any feedback arrives
from the given receiver. The sender uses this field to determine from which receivers it has insufficient
information at the time of buffer release.
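The per-receiver state above is small enough to sketch directly. mem_hash and RMC_HTABLE_SIZE appear in the hrmc_opt structure shown in Section 4; the remaining names, the table size, and the hash function are our assumptions, and the doubly linked list the kernel code also maintains is omitted here.

```c
#include <stdlib.h>
#include <stdint.h>

#define RMC_HTABLE_SIZE 32  /* table size assumed */

struct mc_member {
    uint32_t ip;             /* receiver's unicast IPv4 address */
    uint32_t expected_seq;   /* next sequence number expected   */
    struct mc_member *next;  /* hash-chain link                 */
};

static struct mc_member *mem_hash[RMC_HTABLE_SIZE];

static unsigned hash_ip(uint32_t ip) { return ip % RMC_HTABLE_SIZE; }

struct mc_member *find_member(uint32_t ip)
{
    struct mc_member *m;
    for (m = mem_hash[hash_ip(ip)]; m; m = m->next)
        if (m->ip == ip)
            return m;
    return NULL;
}

/* Called for every NAK, rate request, or UPDATE from a receiver:
 * both carry the receiver's next expected sequence number. */
void note_feedback(uint32_t ip, uint32_t next_expected)
{
    struct mc_member *m = find_member(ip);
    if (!m) {                              /* first feedback: add member */
        m = calloc(1, sizeof(*m));
        m->ip = ip;
        m->next = mem_hash[hash_ip(ip)];
        mem_hash[hash_ip(ip)] = m;
    }
    m->expected_seq = next_expected;
}
```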

Periodic Updates. In RMC, the sender would advance its send window without checking whether all
receivers had received the packet(s) that it was about to release, since such information was not
maintained. The added group membership data structure enables the sender to keep track of the state of
the receivers. To measure the amount of receiver state information available at the sender, we conducted a
simulation study of 10 receivers in different environments. These simulations use the following loss rates:
0.005% for LAN, 0.5% for MAN, 2% for WAN. The per-socket kernel buffer size was varied from
64Kbytes to 1024Kbytes. Figure 3(a) plots the percentage of time that the sender has information from all
receivers concerning a block of data it is about to release from its buffer, using the original RMC
protocol. In a low-loss environment, few NAKs will be generated and, therefore, the sender will obtain
information from receivers mainly through rate requests. As the loss rate increases, the number of NAKs
increase and, as a result, the sender more often has information from all receivers. As shown in the figure,
larger buffer sizes can improve the situation to some extent.

       (a) without updates (original RMC)                               (b) with updates
   Figure 3. Percentage of time sender has complete receiver information when releasing buffer space.

        In order to increase the amount of feedback from receivers, especially in low-loss environments,
we modified the RMC receiver code so that it periodically sends UPDATE packets to the sender. Each
update contains the highest sequence number received in order. Unlike positive acknowledgements, these
updates are not sent for every packet received, but rather whenever the update timer fires at the receiver.
As seen in Figure 3(b), this approach significantly increases the percentage of time that the sender has
complete information, especially when combined with the use of larger buffer sizes.

Probe Messages. Adding periodic updates increases the available information at the sender, but by itself
is still not sufficient to guarantee complete reliability. Therefore, the H-RMC sender also polls any
receivers from which state information is lacking. Specifically, before releasing buffer space, the sender
checks the state of all the receivers with respect to the sequence number past which it intends to advance
the window. If the sender knows that all receivers have the relevant data, then the release can be executed
safely. If the release is not safe, then the sender unicasts a PROBE message to each receiver from which it
is lacking relevant information. Each targeted receiver checks whether it has received all data up to and
including the sequence number in the PROBE packet. If so, then it immediately sends an UPDATE
packet to the sender. Otherwise, the receiver generates a NAK message for the needed data. The sender
will not advance the window until it knows that all receivers have received the data.
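The buffer-release check above can be sketched as follows. The routine name probe_members is mentioned in Section 4; the member structure, the callback, and the return convention are our assumptions.

```c
#include <stdint.h>
#include <stddef.h>

struct member {
    uint32_t ip;            /* receiver's unicast IPv4 address */
    uint32_t expected_seq;  /* next seq the receiver expects   */
};

/* Before advancing the send window past release_seq, probe every
 * receiver whose last reported expected sequence number does not
 * cover release_seq. Returns the number of receivers probed; 0 means
 * the release is known to be safe. send_probe may be NULL in tests. */
int probe_members(struct member *members, size_t n, uint32_t release_seq,
                  void (*send_probe)(uint32_t ip, uint32_t seq))
{
    int probed = 0;
    for (size_t i = 0; i < n; i++) {
        if (members[i].expected_seq <= release_seq) {
            /* receiver may still be missing data up to release_seq */
            if (send_probe)
                send_probe(members[i].ip, release_seq);
            probed++;
        }
    }
    return probed;
}
```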

Dynamic Update Timers. In our original design, the update period was fixed (0.5 seconds). However, a
fixed update period may be too large for certain environments (such as low-loss local networks, using
small buffers) and too small for others. Therefore, we made the update timer period dynamic, specifically,
based on the number of probe messages received. In high-loss environments, few probe messages are
received because the sender has sufficient information from NAKs, so the update period is increased. In
low-loss environments, there is little feedback from receivers; as a result, more probes are received, and
the update period is decreased. The rate of increase/decrease is linear to ensure that the update period does
not change too quickly. In this manner, the H-RMC protocol adapts to its environment.
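The linear adaptation of the update period can be sketched as below. The paper specifies only that the adjustment is linear and driven by incoming probes; the step size, bounds, and names are our assumptions.

```c
#define UPDATE_STEP_MS 50    /* linear adjustment step (assumed) */
#define UPDATE_MIN_MS  100   /* assumed lower bound              */
#define UPDATE_MAX_MS  2000  /* assumed upper bound              */

/* Recompute the update timer period at the end of each interval.
 * Probes arriving mean the sender lacks feedback, so updates should
 * come faster; no probes mean NAKs and updates already suffice. */
long next_update_period(long cur_ms, int probes_last_interval)
{
    if (probes_last_interval > 0)
        cur_ms -= UPDATE_STEP_MS;   /* sender short on feedback: speed up */
    else
        cur_ms += UPDATE_STEP_MS;   /* feedback sufficient: back off      */
    if (cur_ms < UPDATE_MIN_MS) cur_ms = UPDATE_MIN_MS;
    if (cur_ms > UPDATE_MAX_MS) cur_ms = UPDATE_MAX_MS;
    return cur_ms;
}
```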

Summary: Operating as described above, the H-RMC protocol provides 100% multicast reliability
through the combination of five components: membership state maintenance, NAK-based feedback,
updates, probes, and packet retransmissions. The use of rate requests and dynamic update intervals is not
necessary for reliability, but enables the protocol to better adapt to the network
environment and the behavior of receivers.

4. Linux Implementation
H-RMC is implemented as a network driver in the Linux kernel and builds on the interfaces that RMC
provides. The kernel driver runs at a higher priority and has closer access to the network than user-level
code. Moreover, the user/kernel boundary does not have to be crossed for protocol processing operations.
Linux was chosen as the operating system for our protocol because its source code is openly available and
its install base is growing.

4.1. Configuration and Socket Structures
Linux networking is implemented as a series of connected layers as shown in Figure 4 [21]. User-level
applications access networking capabilities through the BSD socket interface. For each socket that is
created, the BSD layer allocates and initializes a socket structure to maintain high-level information about
the socket. The INET socket layer provides a common interface between BSD sockets and the TCP/IP
protocol suite. INET sockets normally interface with either the TCP or UDP transport layer protocols, but
may also access the IP layer directly, if the application has superuser privileges. The INET layer uses its
own sock data structure to maintain network information that is common among those protocols. The IP
layer implements the Internet Protocol and is supported directly by the network device drivers. Linux uses
socket buffers, defined by the sk_buff structure, to pass packets between different layers of the stack.


          [Figure 4. Linux network protocol stack: user-level applications sit atop the
          BSD socket layer, which interfaces with the INET socket layer; the transport
          protocols (TCP, UDP, and H-RMC) lie below, supported by the network device
          drivers (PPP, SLIP, Ethernet).]

As shown in Figure 4, H-RMC is implemented as a transport-layer protocol located between the BSD
socket layer and the IP layer. Placing H-RMC beneath BSD makes the protocol accessible through a
familiar interface. H-RMC bypasses the INET socket layer and interacts directly with BSD and IP. Doing
so enables the H-RMC implementation to remain separate from the existing implementations of TCP and UDP.
         Application code that uses the H-RMC protocol looks much like any other socket-related code.
Both the sender and the receiver create a socket with address family AF_HRMC, type SOCK_IP and
protocol IPPROTO_HRMC. The sending application binds to a local port, connects to a known multicast
address and port number, and uses the send system call to transmit data. When it has completed its
communication, it calls close. The receiving application uses setsockopt to join the multicast group, and
the recv system call to receive data on the H-RMC socket. When it has finished using the connection, it
calls close.

         Figure 5 shows three major data structures associated with an H-RMC socket. During a socket
system call, the BSD socket structure is allocated and hrmc_create is called, which allocates the INET
sock structure. The BSD socket initialization routine is responsible for completing any address family
specific initialization for the new socket, including setting the BSD socket type field to the connection
type and the BSD ops field to point to the address family’s proto_ops vector. Later, as the network
sockets are used, BSD references the protocol operations vector in the given BSD socket structure to
complete address family specific operations for that socket.

          [Figure 5. H-RMC socket-related data structures: the BSD socket (type SOCK_IP)
          points to a proto_ops vector for family AF_HRMC, whose entries include
          dup (sock_no_dup) and release (hrmc_release); the INET sock structure records
          family AF_HRMC, type SOCK_IP, and protocol IPPROTO_HRMC.]

        A portion of the INET sock structure is shown in Figure 6. Among other things, this structure
includes the IP address and port numbers of the local and remote communication endpoints. It also
includes a series of variables for tracking memory allocated to the socket and queues for incoming and
outgoing packets. The structure hrmc_opt in the union tp_pinfo holds information specific to H-RMC
sockets. Shown in Figure 7, the hrmc_opt structure contains information on both the sender and receiver
windows, variables to track window usage for flow control, the group membership structure, and the
various timers.

struct sock {
/* Socket addressing */
         __u32 daddr;                      /* Foreign IPv4 addr              */
         __u32 rcv_saddr;                  /* Bound local IPv4 addr          */
         __u16 dport;                      /* Destination port               */
         unsigned short num;               /* Local port                     */

/* Transport protocol specific information */
        union {
                 struct tcp_opt af_tcp;            /* Transport level block for TCP         */
                 struct hrmc_opt af_hrmc;          /* Transport level block for H-RMC       */
        } tp_pinfo;

/* Memory allocation information */
       int rcvbuf;                                  /* Size of receive buffer in bytes */
       int sndbuf;                                  /* Size of send buffer in bytes */

/* Queues for packets */
       struct sk_buff_head write_queue;            /* Packet sending queue        */
       struct sk_buff_head back_log;               /* Queue when socket locked */
       struct sk_buff_head receive_queue;          /* Incoming packets */
};
                     Figure 6. Portion of the INET sock structure.
struct hrmc_opt {
/* Receive information */
  __u32 rcv_wnd;       /* Start of receive window */
  __u32 rcv_wnd_size;  /* Size of the window */
  __u32 rcv_nxt;       /* Sequence number expected next */

/* Send information */
 __u32 snd_wnd;        /* Seq. number of first packet in send window */
 __u32 snd_wnd_size;   /* Size of the send window */
 __u32 snd_nxt;        /* First seq. number to use on next send */

/* Group membership structure */
 struct mc_member *mem_hash[RMC_HTABLE_SIZE]; /* Hashed membership information */

/* Send window maintenance */
 __u32 snd_rate_wnd;         /* Sending congestion window */
 __u32 max_snd_rate_wnd;     /* Maximum value for the congestion window */

/* Timers */
 struct timer_list transmit_timer;   /* Transmission timer */
 struct timer_list retrans_timer;    /* Retransmission timer */
 struct timer_list update_timer;     /* Update timer */
 struct timer_list ka_timer;         /* Keepalive timer */
};
                      Figure 7. Portion of the hrmc_opt structure.

4.2. H-RMC Sender

The H-RMC sender architecture, shown in Figure 8, may be thought of as five concurrent tasks. The main
responsibility of the Application Interface (hrmc_sendmsg routine) is to fragment the data into H-RMC
packets and make the packets available for transmission by inserting them in the send window. The send
window is implemented as a queue of packets (sk_buffs). If the packet falls outside the current rate
window, then the packet is queued in a backlog for later transmission; otherwise the packet can be sent immediately.
         The sender maintains a small structure for each receiver, containing the receiver’s (unicast) IP
address and the next expected sequence number. These structures are stored in a doubly linked list, as
well as a hashed list of receivers, so as to reduce search time in the data structure. Before the data can be
sent to the receivers, the receivers first join the multicast group by sending a JOIN message to the sender.
Upon receipt of a JOIN message, the sender adds the receiver to the multicast group by calling the
add_member routine. On receiving a LEAVE message, the sender calls the rm_member routine to remove
the receiver from the list.
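The membership bookkeeping described above can be sketched in C as follows. This is a minimal illustration, not the actual kernel code: the field layout, the hash function, and the find_member helper are assumptions added for the sketch.

```c
#include <stdlib.h>

#define RMC_HTABLE_SIZE 64

/* Per-receiver state kept by the sender: the receiver's unicast
 * address and the next sequence number it is expected to need. */
struct mc_member {
    unsigned int addr;             /* receiver's unicast IP address */
    unsigned int rcv_nxt;          /* next expected sequence number */
    struct mc_member *next, *prev; /* doubly linked hash-chain links */
};

static struct mc_member *mem_hash[RMC_HTABLE_SIZE];

static unsigned int hash_addr(unsigned int addr)
{
    return addr % RMC_HTABLE_SIZE;
}

/* find_member: helper (assumed) to locate a receiver's state. */
struct mc_member *find_member(unsigned int addr)
{
    struct mc_member *m = mem_hash[hash_addr(addr)];
    while (m && m->addr != addr)
        m = m->next;
    return m;
}

/* add_member: called on receipt of a JOIN message. */
struct mc_member *add_member(unsigned int addr)
{
    unsigned int h = hash_addr(addr);
    struct mc_member *m = calloc(1, sizeof(*m));
    if (!m)
        return NULL;
    m->addr = addr;
    m->next = mem_hash[h];
    if (mem_hash[h])
        mem_hash[h]->prev = m;
    mem_hash[h] = m;
    return m;
}

/* rm_member: called on receipt of a LEAVE message. */
void rm_member(unsigned int addr)
{
    struct mc_member *m = find_member(addr);
    if (!m)
        return;
    if (m->prev)
        m->prev->next = m->next;
    else
        mem_hash[hash_addr(addr)] = m->next;
    if (m->next)
        m->next->prev = m->prev;
    free(m);
}
```

Hashing by address keeps the per-packet lookup cost low when feedback arrives, while the linked chains allow membership to change at any time.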

[Figure 8: New data packets flow from the Application Interface (hrmc_sendmsg) into the send window
(write_queue), from which the Transmitter emits new data packets and the Retransmitter (retrans_timer)
services the retransmission request list. The Feedback Processor (hrmc_master_rcv) receives NAKs, rate
requests, UPDATEs, and JOIN/LEAVE requests, updating rate control information and, via add_member
and rm_member, the group membership state. The Keepalive Controller (ka_timer) sends keepalive
packets.]

                                    Figure 8. Architecture of the H-RMC sender.

       The Transmitter (transmit_timer) runs every jiffy (10 msec) and transmits packets from the
backlog queue to the multicast group. During execution of the transmit_timer routine, the sender checks
the state of the send window. If the window is to be advanced, then the sender calls the probe_members
routine with the ending sequence number of the packet being released to ensure that all receivers indeed
have received all data up to and including that sequence number. A PROBE packet is unicast to each
receiver about whom the sender does not have information by using the ip_build_and_send routine. Upon
arrival of a probe at a receiver, the receiver checks to see if it has received packets up to and including the
sequence number in the PROBE packet. If it has, it returns an UPDATE packet; otherwise it immediately
sends a NAK. The send window is advanced only when the sender confirms that all receivers have
received the data to be released.
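The release check performed by the Transmitter can be sketched as follows. The member_state array and the needs_probe output are simplified stand-ins, assumed for illustration, for the sender's membership list and its per-receiver PROBE decision.

```c
#include <stddef.h>

struct member_state {
    unsigned int rcv_nxt;   /* last sequence state reported by this receiver */
};

/* Return nonzero only if every receiver is known to have received all
 * data up to and including release_seq. Receivers whose recorded state
 * does not confirm this are marked for a unicast PROBE instead, and
 * the window is not advanced. */
int can_advance_window(const struct member_state *members, size_t n,
                       unsigned int release_seq, int *needs_probe)
{
    int all_confirmed = 1;
    for (size_t i = 0; i < n; i++) {
        /* rcv_nxt > release_seq means the receiver holds everything
         * up to and including release_seq. */
        if (members[i].rcv_nxt > release_seq) {
            needs_probe[i] = 0;
        } else {
            needs_probe[i] = 1;   /* unicast a PROBE to this receiver */
            all_confirmed = 0;
        }
    }
    return all_confirmed;
}
```

The key invariant is that buffered data is released only after every receiver's state confirms receipt; probing fills in the state of receivers that have been silent.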
         The Feedback Processor (hrmc_master_rcv) handles NAKs, rate requests and UPDATE
messages from receivers. In all cases, the sequence number information contained in the packet is used to
update the state information for that receiver using the update_mem routine. The Retransmitter
(retrans_timer) is called if a retransmission is required. Further details of H-RMC flow control, inherited
from RMC, can be found in [15]. The Keepalive Controller (ka_timer) sends KEEPALIVE messages
when no transmission or retransmission has been sent recently. In addition to application idle times, this
routine gets called after an urgent rate request and during other periods when the window cannot be
advanced because information from all receivers about a packet to be released is not available.

4.3. H-RMC Receiver

The H-RMC receiver, shown in Figure 9, comprises three packet queues and four major functional
components. The Backlog Queue is used to temporarily hold data packets that arrive while the destination
socket is locked. The Out-of-Order Queue holds packets that cannot yet be integrated into the data stream
due to one or more gaps in the data stream. The Receive Queue holds packets that are available to be
delivered in order to the receiving application.
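How an arriving data packet is routed among these queues can be sketched as follows; this is a minimal classification step, assuming 32-bit sequence numbers with no wraparound handling.

```c
enum disposition {
    DELIVER,            /* in order: append to the receive queue */
    HOLD_OUT_OF_ORDER,  /* gap in the stream: hold and schedule a NAK */
    DUPLICATE           /* already delivered: drop */
};

/* Classify an arriving data packet against the next expected sequence
 * number (rcv_nxt). On in-order delivery, rcv_nxt advances; packets
 * already held in the out-of-order queue whose sequence now matches
 * rcv_nxt would then be drained into the receive queue as well. */
enum disposition classify_packet(unsigned int seq, unsigned int *rcv_nxt)
{
    if (seq == *rcv_nxt) {
        (*rcv_nxt)++;
        return DELIVER;
    }
    if (seq > *rcv_nxt)
        return HOLD_OUT_OF_ORDER;  /* missing packet(s): generate a NAK */
    return DUPLICATE;
}
```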
        The Initial Packet Processor receives packets from the IP layer and calls the hrmc_lookup routine
to locate the destination socket in the hash table. The packet is then delivered to the Main Packet
Processor (hrmc_master_rcv), whose responsibilities include reassembling the incoming data packets into
the original data stream, determining when to generate NAKs, and monitoring the receive window to
determine when to send rate requests. The NAK Manager (nak_timer) monitors the NAK list and resends
pending NAKs at appropriate intervals. The Update Generator is responsible for sending periodic updates.
Every update period, which is initially set at 50 jiffies, the update generator calls the update_timer routine
to send an UPDATE packet to the sender. The period of the update generator is varied depending on
whether any probes are received in an update period. If probes are received, the update period is reduced
by one jiffy; otherwise it is increased by one jiffy. In this manner, the update generator tries to find an
optimal period at which a minimum number of probes is sent to the receiver, while at the same time not
being shorter than the environment requires. Finally, the Application Interface implements receive
system calls and delivers data to receiving applications by pulling data packets from the Receive Queue
and copying the requested number of data bytes to user space using the hrmc_recvmsg routine.
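The update-period adaptation can be sketched as follows; the one-jiffy floor is an added safeguard for the sketch, not a detail taken from the paper.

```c
/* Adapt the UPDATE period (in jiffies, initially 50): if any PROBE
 * arrived during the last period, the sender needed feedback sooner,
 * so shorten the period by one jiffy; otherwise lengthen it by one
 * jiffy, reducing unnecessary reverse traffic. */
unsigned int next_update_period(unsigned int period, int probes_seen)
{
    if (probes_seen)
        return period > 1 ? period - 1 : 1;  /* assumed floor of 1 jiffy */
    return period + 1;
}
```

Over time the period oscillates around the point where probes just stop arriving, which is the longest update interval the sender can tolerate.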


5. Performance Evaluation

We have evaluated the performance of the H-RMC protocol through both experimentation and
simulation. An important feature of the simulation program is that it includes portions of various data
structures from the Linux kernel, enabling actual H-RMC kernel code to be inserted directly into the
simulation program with only minor changes. Moreover, the results produced by the simulator are
consistent with those produced by experimental tests in our laboratory. Due to space limitations, many
results are omitted here, but can be found in [16].

[Figure 9: The Initial Packet Processor (hrmc_ip_rcv) passes incoming data packets, via the Backlog
Queue (backlog_queue) when the socket is locked, to the Main Packet Processor (hrmc_rcv_data), which
maintains the Out-of-Order Queue (out_of_order_queue) and the pending NAK list, issues outgoing rate
requests, and places in-order data packets on the Receive Queue. The Update Generator (update_timer)
sends updates in response to probe messages, and the Application Interface (hrmc_recvmsg) delivers data
to the receiving application.]

                                   Figure 9. Architecture of the H-RMC receiver.

5.1 Experimental Study
We used a local network testbed to evaluate the performance of H-RMC. We used 4 PCs running Red Hat
Linux 5.0, kernel version 2.1.103; all were Pentium II 300MHz machines with 256MB of memory. All
the machines were connected to the same Ethernet LAN running at either 10 or 100 Mbps, depending on
the test. To study the effect of different numbers of receivers in a local environment, we conducted tests
for 1, 2, and 3 receivers. For each set of receivers, we collected statistics on a 10MB and 40MB file
transfer, using both memory-to-memory and disk-to-disk tests. In the memory-to-memory tests, the
sender sent data from memory and each of the receivers received data in a memory buffer. In the disk-to-
disk tests, the sender sent a file that it read from the local disk, and each of the receivers stored the
received data to a file on local disk. By performing both the memory and disk-to-disk tests, we were able
to observe the protocol’s behavior when the application was always ready, as well as when it was slowed
by I/O operations.
10Mbps network. The throughput results for each of the experimental tests are shown in Figure 10. Each
plot shows the average throughput over five tests of the given kernel buffer size. From the results shown
in Figure 10, we can make two general observations about H-RMC in a local environment. The first
observation is that kernel buffer size affects throughput for tests run with 64K, 128K, and 256K kernel
buffers, but has a minimal effect on tests run with a 512K buffer or more. This happens because with
increasing buffer size, the sender is more likely to have received state information from receivers, through
NAKs, rate requests and updates, before it is ready to advance its window. As a result, the sender does not
have to probe a receiver (to guarantee reliability) and wait for a response. For instance, in all the test cases
in Figure 10, the throughput with 1024K of buffer is approximately the same. Furthermore, the
throughput results are nearly identical to the results obtained in the original version of RMC [15],
indicating that the overhead of introducing reliability into the protocol has not significantly affected the
throughput.
       Figure 10. Throughput of H-RMC on a 10 Mbps network (experimental): (a) memory-to-memory,
       10 MB; (b) memory-to-memory, 40 MB; (c) disk-to-disk, 10 MB; (d) disk-to-disk, 40 MB.

         To further understand the behavior of H-RMC, we monitored the NAK and rate request feedback
activity during each test. We report this activity in terms of the total number of NAKs and rate requests
that arrive at the sender during each test. In the memory-to-memory tests, there was no data loss and the
receivers’ buffers did not fill past the threshold; therefore there was no feedback. Feedback activity for
the disk tests is shown in Figure 11. Figures 11(a) and (b) plot rate request and NAK activity for disk
tests using 10MB files. As seen, the number of rate requests is inversely related to the size of the kernel
buffer: an increase in the size of the receivers' kernel buffers reduces the likelihood that arriving packets
cross the thresholds of the warning and critical regions. As a result, the number of rate-reduce requests
decreases as the buffer size increases. Data loss was minimal; consequently there were very few NAKs, as
can be seen in Figure 11(b). Figure 11(c) and (d) show the feedback activity for 40MB files. While the
number of NAKs is quite small, the number of rate requests is noticeable and seemingly unpredictable. A
number of different activities in the operating system or I/O delays could have caused the application to
slow and the kernel buffers to fill up. However, this does not cause a significant drop in throughput, as
can be seen in Figure 10(c) and (d).

       Figure 11. Feedback activity in H-RMC on a 10 Mbps network (experimental): (a) rate requests,
       10MB, disk-to-disk; (b) NAKs, 10MB, disk-to-disk; (c) rate requests, 40MB, disk-to-disk;
       (d) NAKs, 40MB, disk-to-disk.

100Mbps network. Memory-to-memory throughput results on the 100Mbps network for each of the tests
are shown in Figure 12. From these results, we see that throughput again increases with increasing kernel
buffer size. Moreover, the number of receivers does not affect the overall throughput as long as there is
sufficient kernel buffer space. Also, the throughput is higher for larger transfers. Since there are very few
flow control requests in memory tests, the rate window grows exponentially with time causing a large
increase in the sending rate. Small files are transferred quickly, so the rate window cannot grow to very
large values. However, the throughput for small buffer sizes is relatively low. This
happens because the protocol is effectively behaving like a stop-and-wait protocol by sending a window
of data and then waiting for probed responses from receivers. An optimization could be to use early
probes, which is discussed later.

       Figure 12. Throughput of H-RMC on a 100 Mbps network (experimental): (a) memory-to-memory,
       10MB; (b) memory-to-memory, 40MB.

         Figure 13 shows the feedback activity for the memory tests. There were no flow control requests
generated from the memory tests and there were no NAKs either for a buffer size up to 1024K, as can be
seen in Figure 13(a) and (b). However, an increase in buffer size beyond 1024K causes some NAKs to be
generated. Since there were no flow control requests, these NAKs could not have been caused by buffer
overflow at the receiver. Nor is it likely that the NAKs were caused by loss in the network, since there
were never any NAKs when the kernel buffer was less than 1024K. Therefore, this seems to indicate that
the NAKs are caused by the network card dropping packets. With large kernel buffers, the send window
is large as well. As a result, the sender can transmit a large amount of data in one jiffy, and it is likely
that the network card cannot accept data at these rates and is dropping packets.

       Figure 13. Feedback activity of H-RMC on a 100 Mbps network (experimental): (a) NAK activity,
       10MB file, memory-to-memory; (b) NAK activity, 40MB file, memory-to-memory.

5.2 Simulation Study
In order to evaluate the performance of H-RMC in environments other than a local area network, we used
a simulation program. The program is based on the CSIM simulation package [22] and uses three types
of CSIM processes: host processes, network interface processes, and router processes. The host processes
are used to simulate each participating host. A host process controls the operation of the H-RMC protocol
and underlying operating system on the host, as well as the sending or receiving application. Each host
process is coupled with a network interface process, which handles incoming packets for the host and
simulates the network delay associated with each packet. The router processes are used to simulate the
network topology through which packets are routed before being delivered to the network interfaces.
Each router is assigned a network speed, a queue size, and a loss rate.
         The simulation of packet flow works as follows. At a given host, outgoing packets are constructed
with a full H-RMC header and a partial IP header, and then passed to the local router. Within a router, the
packets are taken from the local queue, assigned a delay according to the network speed, and passed on to
the next router or to the appropriate network interface, as dictated by the IP destination. Multicast packets
are duplicated within a router as necessary. At the network interface, packets are received one at a time,
held for the assigned delay, and then passed to the host. At the host, incoming packets are passed to the
H-RMC protocol, where normal processing continues. To simulate network loss, each router and network
interface may be assigned a loss rate, which it uses to randomly drop packets. To simulate the protocol,
we imported the H-RMC protocol code directly from the Linux kernel into the simulation. In order to
simulate processing time at a host, we measured the time to complete H-RMC processing and lower layer
network processing in the Linux kernel, and then introduced the appropriate delays in the host process.
For sending and receiving data of length l, the H-RMC delay was (10 + .025 * l) microseconds and the
lower layer delay was 150 microseconds, assuming a 300MHz processor speed.
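The measured per-packet host processing cost can be expressed directly; the function name and the decision to return a combined total are choices made for this sketch.

```c
/* Per-packet host processing delay used in the simulation, for a
 * payload of l bytes on a 300MHz processor: H-RMC processing costs
 * (10 + 0.025 * l) microseconds, and lower-layer (IP and device)
 * processing adds a fixed 150 microseconds. */
double host_delay_us(unsigned int l)
{
    double hrmc_us  = 10.0 + 0.025 * (double)l;
    double lower_us = 150.0;
    return hrmc_us + lower_us;
}
```

For a 1000-byte payload this gives roughly 185 microseconds of simulated processing per packet, dominated by the fixed lower-layer cost for small packets.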
         In order to understand how H-RMC performs when participating hosts are distributed across
different types of networks, we used the router processes to divide receivers into characteristic groups,
where a characteristic group is defined by its network delay and loss properties. We divided the loss rate
between the router process and each receiver's network interface process to simulate both correlated and
uncorrelated loss. As reported in [14], most losses take place in the tail links of the network. The network
backbone and the individual sites are mostly loss free. Thus, we set the loss parameters such that 90% of
the loss was correlated and occurred at the router process and 10% of the loss was uncorrelated and
occurred at the network interface process. Definitions of the characteristic groups that we used are listed
in Figure 14(a). Group A is intended to simulate a local environment, group C is intended to simulate a
wide area environment, and group B simulates something between the two, such as a metropolitan area
network. From these characteristic groups, we formed a set of test cases, which are shown in Figure
14(b). For Tests 1, 2, and 3, the receiver group consisted of the single characteristic groups A, B and C,
while Tests 4 and 5 used receivers from characteristic groups B and C in opposite proportions.

           Group    Delay      Loss Rate              Test 1    All in A
           A        2 ms       0.005%                 Test 2    All in B
           B        20 ms      0.5%                   Test 3    All in C
           C        100 ms     2%                     Test 4    80% in B, 20% in C
                                                      Test 5    20% in B, 80% in C
                (a) characteristic groups                   (b) test cases
                             Figure 14. Simulated groups and characteristics
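The 90/10 loss assignment can be sketched as a simple split of each group's total loss rate; the function name is illustrative.

```c
/* Split a characteristic group's total loss rate between the shared
 * router process (90%, correlated loss seen by all receivers behind
 * it) and each receiver's network interface process (10%,
 * uncorrelated per-receiver loss). */
void split_loss_rate(double total, double *router_loss, double *nic_loss)
{
    *router_loss = 0.90 * total;  /* correlated loss at the router */
    *nic_loss    = 0.10 * total;  /* uncorrelated loss at the NIC  */
}
```

For group C (2% total loss), this places 1.8% of loss at the router and 0.2% at each network interface.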

10 Mbps Network. Figure 15(a) and (b) show the simulation results for a 10Mbps network with 10
receivers. Figure 15(a) shows the throughput of the H-RMC protocol. The results for Test 1 are nearly
identical to our experimental results, indicating that, at least in the local environment, the simulator is very
accurate. Test 1 is followed in throughput by Test 2, which simulates a metropolitan area network. Test 3,
which is the wide area network, yields the lowest throughput. For Test 4, where the receiver group is
primarily composed of medium area receivers, and Test 5, where the receiver group is primarily
composed of wide area receivers, the throughput values are close to that of the wide area environment.
This result is expected, since H-RMC is designed to adapt to the least capable receiver in the multicast
group.
       Figure 15. H-RMC performance on a 10 Mbps network (simulated): (a) throughput, 10 receivers;
       (b) rate reduce requests, 10 receivers; (c) throughput, 100 receivers.

        The number of rate-reduce requests, shown in Figure 15(b), also depends on the loss rate of the
network. As the loss in the network increases, the number of out-of-order packets arriving at the
receivers increases; these packets are placed in the out-of-order queue until the missing packet(s) are
retransmitted. In H-RMC, NAKs are suppressed locally, but rate-reduce requests are not. Consequently,
out-of-order packets that cross over into the warning or critical region cause rate requests to be sent. As a
result, small buffers produce a large number of rate requests, and the number decreases as the buffer size
increases.
        With 100 receivers, as shown in Figure 15(c), the throughput decreases by a small amount. This
happens because there is an increase in the number of updates from receivers, which leads to increased
processing at the sender. However when sufficient buffer space is provided, the throughput improves.

100Mbps Network: Simulation results for the 100Mbps network are shown in Figure 16. Figure 16(a)
shows the throughput of the H-RMC protocol for 10 receivers. The results for Test 1 are also similar to
the experimental results obtained. As would be expected, Test 1 shows the highest throughput, followed
by Test 2 and finally Test 3. The rate reduce requests, shown in Figure 16(b), follow a pattern similar to
the rate requests seen in the 10Mbps network. However, the number of rate requests is relatively larger
than that obtained for the 10Mbps network. An increase in network speed causes an increase in the rate at
which the receivers’ windows fill up, but the rate at which the application reads data from the receive
window does not change from the 10Mbps network to the 100Mbps network. This rate mismatch could
cause the receive window to remain full for a significant fraction of the time, and thereby cause an
increase in the number of rate requests sent by the receivers.

       Figure 16. Performance of H-RMC on a 100 Mbps network (simulated): (a) throughput;
       (b) rate reduce requests.

        For 100 receivers (not shown here) the maximum throughput of H-RMC reduced to
approximately 66Mbps on the 100 Mbps network with large buffers, which is not a significant
decrease. Beyond 100 receivers, structured local processing in a manner similar to that of RMTP
[6] could be used to improve performance.


6. Conclusions

In this research project, we have studied issues related to implementing reliable multicast in an operating
system kernel. We designed the H-RMC reliable multicast protocol as a network driver for the Linux
kernel. The implementation is based on IP multicast and provides a BSD socket interface to user-level
applications. H-RMC is a hybrid reliable multicast protocol for the Linux kernel that draws from the
benefits of three traditional approaches to this problem. Its mechanism of flow control enables the sender
to dynamically adapt the transmission rate depending on the state of the network and receiver buffers.
This feature, combined with local NAK suppression and dynamic adaptation of the update period,
minimizes the amount of reverse traffic. As a result, the throughput we obtained was comparable to that
of TCP and the purely NAK-based approach used in RMC.
         H-RMC is intended to provide a framework to study implementation issues. As such, H-RMC is
continuing to evolve, with various optional features added to enable study of new problems. Examples
include (1) probing receivers prior to buffer release time to avoid a stop-and-wait scenario for small
buffers; (2) multicasting probes when the number of receivers to be probed is greater than some threshold;
(3) use of local recovery to improve the scalability of the protocol; and (4) incorporation of forward error
correction, particularly for wireless environments.

Further Information. A number of related technical reports and papers can be found at the following

Acknowledgements. The authors would like to thank the anonymous referees for their comments and
suggestions on ways to improve this paper. This work was supported in part by NSF grants CCR-
9503838, CDA-9617310, and NCR-9706285.

[1] W. T. Strayer, ed., Xpress Transport Protocol Specification, Revision 4.0, XTP Forum, March 1995.
[2] R. Talpade and M. H. Ammar, “Single Connection Emulation (SCE): An Architecture for Providing a Reliable
    Multicast Transport Service,” in Proceedings of the IEEE International Conference on Distributed Computing
    Systems, (Vancouver, BC, Canada), June 1995.
[3] H. W. Holbrook, S. K. Singhal, and D. R. Cheriton, “Log-based receiver-reliable multicast for distributed
    interactive simulation,” in Proceedings of SIGCOMM '95, (Cambridge, MA, USA), pp. 328-341, 1995.
[4] S. Floyd, V. Jacobson, C.-G. Liu, S. McCanne, and L. Zhang, “A Reliable Multicast Framework for Light-
    weight Sessions and Application Level Framing,” IEEE/ACM Transactions on Networking, December 1996.
[5] A. Koifman and S. Zabele, “RAMP: A Reliable Adaptive Multicast Protocol,” in Proceedings of IEEE
    INFOCOM, pp. 1442-1451, March 1996.
[6] S. Paul, K. K. Sabnani, J. C. Lin, and S. Bhattacharyya, “Reliable Multicast Transport Protocol RMTP,” IEEE
    Journal on Selected Areas in Communications, vol. 15, pp. 407-421, April 1997.
[7] R. Yavatkar, J. Griffioen, and M. Sudan, “A Reliable Dissemination Protocol for Interactive Collaborative
    Applications,” in Proceedings of the ACM Multimedia '95 Conference, November 1995.
[8] M. P. Barcellos and P. D. Ezhilchelvan, “An End-to-End Reliable Multicast Protocol Using Polling for
    Scalability,” in Proceedings of IEEE INFOCOM '98, pp. 1180-1187, March 1998.
[9] J. Crowcroft and K. Paliwoda, “A Multicast Transport Protocol,” in Proceedings of ACM SIGCOMM '88, 1988.
[10] B. Whetten, S. Kaplan, and T. Montgomery, “A High Performance Totally Ordered Multicast Protocol,” in
     Proceedings of INFOCOM '95, 1995.
[11] K. Miller, K. Robertson, A. Tweedly, and M. White, “Multicast File Transport Protocol,” Internet draft, April 1998.
[12] P. Danzig, “Flow Control for Limited Buffer Multicast,” IEEE Transactions on Software Engineering, vol. 20,
     pp. 1-12, January 1994.
[13] Brian N. Levine and J. J. Garcia-Luna-Aceves. “A Comparison of Reliable Multicast Protocols,” ACM
     Multimedia Systems Journal, 1998.
[14] Don Towsley, Jim Kurose, and Sridhar Pingali. “A Comparison of Sender-Initiated and Receiver-Initiated
     Reliable Multicast Protocols.” IEEE Journal on Selected Areas in Communications, 15(3):398-406, April 1997.
[15] P. McKinley and R. F. Wright, “A study of reliable multicast flow control in the Linux kernel,” in Proceedings
     of the 12th International Conference on Parallel and Distributed Computing Systems, August 1999.
[16] P. McKinley, R. T. Rao, and R. F. Wright, “H-RMC: A Hybrid Reliable Multicast Protocol in the Linux
     kernel,” Tech. Rep. MSU-CPS-99-22, Department of Computer Science and Engineering, Michigan State
     University, East Lansing, Michigan, April 1999.
[17] Robel Barrios and P. K. McKinley, “WebClass: A multimedia web-based instructional tool,” Technical Report
     MSU-CPS-98-26, Department of Computer Science and Engineering, Michigan State University, East Lansing,
     Michigan, August 1998.
[18] P. Karn and C. Partridge. “Estimating Round Trip Times in Reliable Transport Protocols,” in Proceedings of
     SIGCOMM '87, August 1987.
[19] Information Sciences Institute, University of Southern California, “Transmission Control Protocol,” Internet
     RFC 793, September 1981.
[20] V. Jacobson and M. J. Karels, “Congestion Avoidance and Control,” in Proceedings of SIGCOMM '88, August
     1988.
[21] M. Beck, H. Bohme, M. Dziadzka, U. Kunitz, R. Magnus, and D. Verworner, Linux Kernel Internals. Addison-
     Wesley, 1996.
[22] H.D. Schwetman, “CSIM: A C-based, process-oriented simulation language,” Tech. Rep. PP-080-85,
     Microelectronics and Computer Technology Corporation, 1985, (currently available at
