Paper review: A Reliable Multicast Framework for Light-weight Sessions and Application Level Framing

Reviewer: Hanlin D. Qian

IP Multicast presents many new problems that are not present in unicast. These problems include but are not limited to the following:

Because there is only one sender but multiple receivers in IP Multicast, ACKs from many receivers can have an implosion effect on the sender.
If the sender is responsible for reliable delivery, as in traditional TCP, the burden becomes huge in multicast because the sender must track the state of many receivers.
It is difficult to obtain the set of all receivers, especially since users can join or leave a multicast group at will.
The round-trip time is difficult to calculate because the receivers are different numbers of hops away from the sender. Without a usable RTT estimate, congestion control calculations become hard to make.

The paper describes Scalable Reliable Multicast (SRM), a basic but operational multicast framework. The framework was designed and implemented around the needs of an application called wb, the LBNL network whiteboard, a network conferencing tool that provides a distributed whiteboard. Below are some of the main ideas behind the design of this framework. Note that there are a lot of these, and because of space constraints, I'm only including the important ones.

Each member of the group is individually responsible for detecting loss and requesting retransmission.
To prevent an implosion of control packets at the sender, control packets are multicast to the whole group, so a receiver that sees a control packet carrying the same information it was about to send suppresses its own.
A timeout value is set to help determine how long the receiver waits before it sends a request to the sender for retransmission. The timer value is determined as a function of the distance from the sender to the receiver. In particular, a receiver closer to the sender times out before other receivers farther away.
There is a tradeoff between the number of duplicate requests and the length of the delay before a request is sent and fulfilled.
The parameters C1 and C2 are two crucial values in the aforementioned timer function. These values can be manipulated to shift the balance between the delay and the number of duplicate requests.
Different network topologies affect the distances of receivers to the sender, relative to each other. Therefore, different assumptions must be made, and appropriate methods must be applied to accommodate different network topologies. These topologies include, but are not limited to, chains, stars, and trees. For chains, deterministic suppression is used to suppress duplicates. For stars, probabilistic suppression is used. For trees, both are used.
Adaptive algorithms can modify C1 and C2 dynamically, depending on the observed number of duplicate requests and the observed request delay. This adjusts the timer toward a good tradeoff between request delay and the number of duplicate requests.
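
To make the timer and suppression ideas above concrete, here is a minimal sketch. It assumes, as I understand the paper, that each receiver draws its request timer uniformly from the interval [C1*d, (C1+C2)*d], where d is its estimated distance (delay) to the sender. Everything else is a simplification of mine and not taken from the paper: the function names are hypothetical, all receivers are at the same distance from the sender, every pair of receivers is the same delay apart (a star-like layout), and a receiver that hears a request before its own timer expires simply stays silent.

```python
import random


def request_timer(distance, c1, c2, rng):
    """Draw a request timer uniformly from [c1*d, (c1+c2)*d].

    Assumption: this interval is my reading of the SRM timer function.
    """
    return rng.uniform(c1 * distance, (c1 + c2) * distance)


def count_requests(expiries, peer_delay):
    """Count how many receivers end up sending a request.

    The receiver with the earliest timer always sends. Any other receiver
    hears that request peer_delay later, so it is suppressed unless its own
    timer expires before the request arrives.
    """
    first = min(expiries)
    return sum(1 for t in expiries if t == first or t < first + peer_delay)


def simulate(num_receivers, distance, peer_delay, c1, c2, seed=0):
    """Return (requests sent, delay until the first request) for one loss."""
    rng = random.Random(seed)
    expiries = [request_timer(distance, c1, c2, rng)
                for _ in range(num_receivers)]
    return count_requests(expiries, peer_delay), min(expiries)


if __name__ == "__main__":
    # Hypothetical setup: 50 receivers, each 1.0 from the sender
    # and 2.0 from each other.
    for c2 in (1.0, 10.0, 100.0):
        requests, delay = simulate(50, 1.0, 2.0, c1=2.0, c2=c2)
        print(f"C2={c2:6.1f}  requests={requests:3d}  first request at {delay:.2f}")
```

Widening the interval with a larger C2 spreads the timers out, which tends to reduce duplicate requests at the cost of a longer wait before the first request; C1 shifts the whole interval. This is the tradeoff that the adaptive adjustment of C1 and C2 is meant to balance.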

I give this paper a rating of 4 for significant contribution. SRM is a multicast framework that is general enough to accommodate different kinds of applications. It leaves processing decisions, such as whether to sort the order of arriving packets, to the applications. This is a plus. This model is scalable and efficient. I think the authors have done a good job in laying out a working framework for IP Multicast. IP multicast is a very complicated problem, and SRM is a good place to start.

The methodology overall is quite convincing. The fact that SRM has already been implemented with wb means that it at least works. It is true that some of the testing is done only in simulations; however, the graphs clearly articulate how to reach the design goals of SRM.

One major limitation of SRM is that it is based on many of the assumptions of one application, wb. The authors argue that SRM can scale to many different applications, but letting a single application influence the design of the framework so heavily is still a limitation. Another limitation is that the paper makes many assumptions that hold only for a simplified version of the problem. For example, the paper ignores the possibility that the control packets themselves can be lost. Multicasting requests after a lost packet may result in further packet loss. Another major issue that this paper ignores is congestion control; much of the methodology in this paper, such as multicasting control packets, may contribute to further congestion in the network.

As the paper points out in section 10, future work needs to be done on scalable session messages, local recovery, and most importantly, congestion control.