ICDCS 2016 Review #157A
---------------------------------------------------------------------------
Paper #157: Riptide: Jump-Starting Back-Office Connections in Cloud Systems
---------------------------------------------------------------------------

Overall merit: 3. Weak accept
Reviewer expertise: 3. Knowledgeable

===== Paper summary =====

The paper describes a technique to improve the performance of TCP flows over the Internet between different PoPs of a cloud provider. A key observation is that new TCP flows start at the default conservative congestion window, despite the fact that previous data transfers between the two endpoints had reached a much larger congestion window. Based on this observation, the paper describes a system (Riptide) that monitors the congestion window sizes of ongoing TCP connections and sets the initial congestion window of new TCP connections based on past observations. By providing new flows with a (potentially) larger congestion window from the very beginning, flow completion times are reduced.

===== Paper strengths =====

The approach requires no modifications to the kernel. It is very lightweight, leveraging standard Linux tools and utilities.

===== Paper weaknesses =====

It is not clear how one would prepare the receiver for such a large influx of traffic within the first receive window.

===== Comments for author =====

The paper is easy to read and explains the pros and cons of using Riptide under various circumstances.

What is worrisome is how the receive window is adjusted to be capable of receiving a large amount of data at the beginning of a flow. Section III.C seems to downplay the issue by stating that increases in the congestion window in Linux are always accompanied by increases in the receive window. However, Riptide does not do a static one-time increase of the congestion window. The increase happens dynamically, while the system is running.
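The receive-window concern above can be made concrete with a little arithmetic. A TCP sender may have at most min(cwnd, rwnd) bytes in flight, so a raised initial congestion window only helps if the receiver advertises a window at least as large. The sketch below is illustrative only: the MSS of 1460 bytes, the window sizes, and the function name are assumptions made for this example, not figures or code from the paper.

```python
def first_flight_bytes(initcwnd_segments: int, mss: int, rwnd_bytes: int) -> int:
    """Bytes the sender may emit in the first round trip.

    TCP limits unacknowledged data to min(cwnd, rwnd), so the first flight
    is capped by whichever of the two windows is smaller.
    """
    cwnd_bytes = initcwnd_segments * mss
    return min(cwnd_bytes, rwnd_bytes)


# Hypothetical values, chosen for illustration only.
MSS = 1460               # assumed maximum segment size, in bytes
SMALL_RWND = 20 * MSS    # assumed receive window of 20 segments
LARGE_RWND = 200 * MSS   # assumed receive window of 200 segments

# A learned initial window of 100 segments is mostly wasted
# when the receiver only advertises 20 segments...
capped = first_flight_bytes(100, MSS, SMALL_RWND)

# ...and is used in full once the receiver advertises enough space.
uncapped = first_flight_bytes(100, MSS, LARGE_RWND)

print(capped, uncapped)
```

Under these assumed numbers, the first flight is limited to the 20-segment receive window in the first case, which is why a measurement of the receive-window effect would strengthen the evaluation.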

Comments on the evaluation section:

* The evaluation section does not measure the impact of the receive window issue in any manner.
* The evaluation measures the scalability of Riptide using probes over a single flow, conducted every hour. It would be very useful to show some performance metrics for normal traffic, where there is a large churn in the set of flows from a single PoP.
* Some important details are also left out. For example, how many times per hour does Riptide sample the congestion window size? How many times per hour does Riptide have to set the congestion window using the ip tool (which has some impact on traffic)?
* While it is clear that Riptide can increase the congestion window of TCP flows using the ip tool, it is unclear what the performance impact of such an action is on day-to-day traffic. How does Riptide scale when there is a large churn in the number of flows between two PoPs (say, 100-1000 flows of varying sizes per hour)?

===========================================================================
ICDCS 2016 Review #157B
---------------------------------------------------------------------------
Paper #157: Riptide: Jump-Starting Back-Office Connections in Cloud Systems
---------------------------------------------------------------------------

Overall merit: 4. Accept
Reviewer expertise: 3. Knowledgeable

===== Paper summary =====

This paper addresses the issue of optimizing the initial TCP congestion window size to avoid traditional TCP slow start when possible. Specifically, it proposes a system called Riptide that determines this initial value by monitoring the behavior of open connections along the same path. The paper describes the design and implementation of Riptide, and presents experimental results for a global CDN system running on a production network.
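To make the mechanism summarized above concrete, here is a minimal sketch of the general idea: read the congestion windows of open connections, derive an initial window from them, and build a per-destination route command. It is not the authors' implementation. The selection policy (median, clamped between a default of 10 and a cap of 100 segments), the function names, the addresses, and the sample text are all assumptions for illustration. The sample imitates the `cwnd:<n>` field printed by `ss -ti`, and the command uses the `initcwnd` route attribute of `ip route`; nothing is executed.

```python
import re
import statistics
from typing import List

# Matches the "cwnd:<segments>" field as printed by `ss -ti`.
CWND_RE = re.compile(r"\bcwnd:(\d+)")


def observed_cwnds(ss_output: str) -> List[int]:
    """Extract congestion window sizes (in segments) from ss-style text."""
    return [int(value) for value in CWND_RE.findall(ss_output)]


def choose_initcwnd(cwnds: List[int], default: int = 10, cap: int = 100) -> int:
    """Pick an initial window from observed windows.

    Assumed policy (hypothetical, not taken from the paper): use the median
    of the observations, never below `default` and never above `cap`.
    """
    if not cwnds:
        return default
    return max(default, min(cap, int(statistics.median(cwnds))))


def route_command(destination: str, gateway: str, device: str,
                  initcwnd: int) -> List[str]:
    """Build (but do not run) an `ip route replace` command with initcwnd."""
    return ["ip", "route", "replace", destination,
            "via", gateway, "dev", device, "initcwnd", str(initcwnd)]


# Hypothetical sample imitating `ss -ti` output for three open connections.
SAMPLE = """\
ESTAB 0 0 10.0.0.1:443 10.0.1.7:51234
     cubic wscale:7,7 rto:204 rtt:3.5/1.2 mss:1448 cwnd:40 ssthresh:30
ESTAB 0 0 10.0.0.1:443 10.0.1.7:51240
     cubic wscale:7,7 rto:204 rtt:3.6/1.0 mss:1448 cwnd:64
ESTAB 0 0 10.0.0.1:443 10.0.1.7:51262
     cubic wscale:7,7 rto:204 rtt:3.4/1.1 mss:1448 cwnd:52
"""

cwnds = observed_cwnds(SAMPLE)
initcwnd = choose_initcwnd(cwnds)
command = route_command("10.0.1.0/24", "10.0.0.254", "eth0", initcwnd)
print(" ".join(command))
```

A real deployment would also have to decide how often to sample and how often to rewrite routes, which the sketch leaves open.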

===== Paper strengths =====

The idea is simple, yet experimental results demonstrate its effectiveness in establishing an initial window that reduces flow completion times for back-office cloud traffic. The system has been deployed and tested in a production network on a global scale. The paper is clearly written with a nice level of detail.

===== Paper weaknesses =====

The core idea is not earth-shattering.

===== Comments for author =====

I found this paper to be a well-done and thorough treatment of the relatively simple idea of using the behavior of existing flows to establish an initial TCP congestion window for new flows. While not feasible in all scenarios, the authors demonstrate that it works well in situations like back-office traffic between DCs and PoPs, where the path is typically predictable and where there are often open flows to monitor. The approach has been implemented in a user-space system called Riptide, and there is a good array of experimental results based on a globally deployed CDN system. While it is not entirely clear whether Riptide itself is used in production (the paper says “in a production network”, which I take as different from “in production”), the paper still makes a good case that the approach is both feasible and effective. Overall, this is a nice contribution, and I recommend that it be accepted.

===========================================================================
ICDCS 2016 Review #157C
---------------------------------------------------------------------------
Paper #157: Riptide: Jump-Starting Back-Office Connections in Cloud Systems
---------------------------------------------------------------------------

Overall merit: 4. Accept
Reviewer expertise: 3. Knowledgeable

===== Paper summary =====

The paper describes Riptide, a system to speed up TCP connections between distributed data centers. It shows the effectiveness of the approach with a series of measurements.

===== Paper strengths =====

+ very simple but effective approach
+ approach has been deployed for more than a year
+ the approach seems to be effective

===== Paper weaknesses =====

- maybe the approach is too simple to be new?

===== Comments for author =====

I enjoyed reading the paper, and in particular I find the measurements helpful for evaluating the effectiveness of this approach. Personally, I very much like the simplicity of the approach. Of course, since it is that simple, I worry that this might have been done before. However, I was not able to find a system that has done this previously. Even if this approach had been proposed previously, I believe the measurements have sufficient value to accept the paper anyhow.

===========================================================================
ICDCS 2016 Review #157D
---------------------------------------------------------------------------
Paper #157: Riptide: Jump-Starting Back-Office Connections in Cloud Systems
---------------------------------------------------------------------------

Overall merit: 2. Weak reject
Reviewer expertise: 3. Knowledgeable

===== Paper summary =====

This paper addresses the performance optimization of network communication between data centers in a wide-area (worldwide) network. The basic problem (TCP slow start) and the solution (increasing the size of the initial congestion window) are well known (and the latter controversial), but the novel twist that the authors provide is to use the congestion windows of existing connections between data centers to set the initial value of the congestion window. The paper explains the problem of slow start well, describes the Riptide design and implementation in detail, and evaluates the solution in a worldwide CDN.

===== Paper strengths =====

The problem is described well, and the solution is described well. The solution has been implemented and evaluated on a worldwide test bed.
The approach shows some improvement, especially for longer-distance links (between continents).

===== Paper weaknesses =====

The solution seems to do the learning only within each VM/host in the data center. If you really think sharing the congestion window size makes sense, wouldn't you want to share the information across all the VMs/hosts in the data center so that all can benefit?

The paper shows that the approach does not lead to message loss. However, this is likely because only this one application uses the more aggressive initial window size and, as a result, the additional traffic initially sent by the VM is an insignificant fraction of the traffic handled by the network. If everybody were using this approach, message losses would likely start happening.

===== Comments for author =====

I enjoyed reading the paper. However, the idea does not quite make it over the acceptance threshold for ICDCS, in my opinion.

Comment on the related work: "Riptide is further intended to operate on data center to data center connections, removing the requirement that it be able to perform on general purpose Internet links". The assumption that there are dedicated data center to data center links spanning the world is not true for most data center providers. Thus, the DC-to-DC traffic has to mingle with normal Internet traffic.