CoNEXT '17 Paper #33 Reviews and Comments
===========================================================================
Paper #33: Drongo: Speeding Up CDNs with Subnet Assimilation from the Client

Review #33A
===========================================================================

Overall merit
-------------
4. Accept

Paper summary
-------------
The paper investigates the problem of replica selection in the CDN
context. The key idea is to leverage client information in addition to
the server-side view of load distribution and content placement.
Specifically, the authors propose that clients use subnet assimilation
so as to declare their own "network location". The idea is implemented
in Drongo, a client-side system that can be deployed today and that
shows large benefits in the node selection problem. A large-scale and
realistic evaluation is presented: 429 clients, 177, 6 CDNs. Results
show benefits for 70% of the clients and up to one order of magnitude
latency reduction.

Strengths
---------
Interesting approach

Solution already deployable today

Paper is well written

Latency valley experiments and results are very valuable

Weaknesses
----------
Some results require further explanation, especially when they are
negative

I think the paper can be improved by discussing how the system would
work at scale

Comments for author
-------------------
I think this is a good paper and it deserves to be presented at the
conference. However, my excitement faded as I was reading, which made me
reconsider my score from 5 to 4. The main reason is that I feel the
second part of the paper is not as strong and complete as the first
part. Specifically, the paper does not discuss and evaluate how the
system would work at scale, both in terms of how the domains to be
tested are chosen and how to handle many clients potentially located in
the same subnet, thus generating many redundant requests. Similarly,
some negative results (e.g., for Alibaba) require more attention. More
detailed comments below.

Missing details for Figure 4 -- I think Figure 4 is quite important, as
I was wondering about the impact of measuring latency just via ping
versus application-layer latency; e.g., a replica might be better
ping-wise, but the server could be overloaded and thus provide worse
performance overall. The figure is not really described, apart from the
simple conclusion that "ping" is enough. The reader needs to know the
differences between post-caching and non-cached measurements, how the
measurements are run (it seems curl for specific objects), etc.
Measuring page load time using a headless browser (PhantomJS, for
instance) would probably also strengthen these results.

Scalability -- Assuming high adoption of the proposed solution, many
clients from the same subnet will be performing such (redundant) tests.
I think the system would benefit from a valley repository where clients
share such data. Whenever a test needs to be run, the client would first
check the repository to see whether a previous valid test (same domain,
within the validity window) is available. If yes, the valley information
is used; if not, a test is run and its result donated. The paper does
not address this issue; similarly, it does not discuss how the domains
to be tested are selected to ensure that fresh information is available
to the client. It seems that a popularity-based approach keyed on client
location might work quite well, because of language and interest
similarities. On top of the overall traffic reduction, sharing the load
among multiple clients should help the system as well.
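To make the repository suggestion concrete, here is a minimal sketch of
the check-then-donate flow. The ValleyRepo interface and the two-hour
validity window are hypothetical illustrations, not something proposed
in the paper itself:

    import time

    VALIDITY_WINDOW = 2 * 3600  # assumed freshness bound in seconds, not from the paper

    class ValleyRepo:
        """Hypothetical shared store of valley test results, keyed by
        (domain, client subnet). In practice this would be a small
        network service rather than an in-process dict."""

        def __init__(self):
            self._entries = {}

        def lookup(self, domain, subnet):
            entry = self._entries.get((domain, subnet))
            if entry and time.time() - entry["ts"] < VALIDITY_WINDOW:
                return entry["valley"]  # fresh result from a nearby client: reuse it
            return None

        def donate(self, domain, subnet, valley):
            self._entries[(domain, subnet)] = {"valley": valley,
                                               "ts": time.time()}

    def replica_for(repo, domain, subnet, run_valley_test):
        """Check the repository first; run the expensive traceroute/ping
        test (and donate the result) only on a miss."""
        valley = repo.lookup(domain, subnet)
        if valley is None:
            valley = run_valley_test(domain)
            repo.donate(domain, subnet, valley)
        return valley

Such a repository would also make the popularity-based domain selection
suggested above easy to bolt on: the service could prioritize refreshing
domains frequently looked up by clients in nearby subnets.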
What about performance penalties? -- The paper states that it evaluates
(Section 5) the "performance gains imposed on clients where subnet
assimilation is applied". Drongo also imposes performance penalties, and
in some cases (>25% for Alibaba, according to Figure 11) they are far
from negligible. The paper does not discuss such results, but they can
be a huge problem for CDNs, so they deserve a discussion and, hopefully,
a way to avoid such negative outcomes. Also, is Figure 6 cropped at 1.0?

[MINOR] "Figure 8 plots the average latency ratio in the same manner as
Figure 7, yet only for queries where Drongo applied subnet
assimilation." -- it would be nice to state directly in the text how
many there are. Yes, we see this in the next plot, but an indication
here would help.

[MINOR] "its peak aggregate performance gains of 5.18%" -- I am not sure
this plays in your favor. I would at least mention how large the gain is
when an opportunity can actually be seized.

[MINOR][typo][Introduction] spe- ed up --> speed up

[MINOR][typo][Section 2.1] to adapt the EDNS0 --> probably "to adopt"?


Review #33B
===========================================================================

Overall merit
-------------
2. Weak reject

Paper summary
-------------
The paper presents a large-scale study of "latency valleys" in popular
CDN providers. A latency valley occurs when a client is not served by
the closest (delay-wise) CDN replica. The paper provides evidence of the
existence of such valleys in the Internet, analyzing properties such as
their frequency, persistence, and predictability. The paper then
proposes Drongo, a tool that leverages ECS for subnet assimilation,
i.e., to search for subnets served by CDN replicas different from those
serving the client's subnet and having shorter latency to the client.
Drongo enables clients to choose closer CDN replicas. The paper
evaluates Drongo and shows that it can provide some increase in
performance.

Strengths
---------
- Providing evidence of the existence of latency valleys, which
  highlights potential room for improving the efficiency of content
  distribution.
- A tool (Drongo) that enables clients to use closer (latency-wise) CDN
  replicas.

Weaknesses
----------
- The search for latency valleys incurs unnecessary complexity in some
  cases.
- The proposed system's (Drongo's) performance improvement seems
  insignificant, and its wide adoption may cause substantial network
  administration burdens.
- The writing flow is hard to follow in some places of the paper.

Comments for author
-------------------
To search for latency valleys, the paper proposes to check the latencies
between the client and the replicas offered to the intermediate hops
along the path from the client to the suggested replica. This can,
however, limit the pool of options, since alternative replicas are only
sought from intermediate hops. The rationale for limiting the options to
hop-replica sets is not justified.

There is also a conflict between Drongo requiring only client-side
changes and the claim that it respects load balancing. If Drongo is
widely used, there is little doubt it can significantly interfere with
the CDN's network mapping and distribution policies, which (as the
authors acknowledge) often involve several factors other than latency,
such as load balancing. The authors suggest that firewall rules can
enforce the CDN's distribution policies, but this means that large-scale
adoption of Drongo will likely be accompanied by increased network
administration complexity to keep the mapping optimized by the CDN.
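For concreteness, here is a minimal sketch of the hop-based search under
discussion, under stated assumptions: dnspython (whose dns.edns.ECSOption
implements the RFC 7871 ECS option) for the subnet-assimilated queries,
the system traceroute binary for the forward path, and TCP connect time
as a crude stand-in for ping. The resolver address is illustrative, and
the paper's hop-exclusion rules (e.g., skipping hops in the client's own
/16 or AS) are omitted here:

    import re
    import socket
    import subprocess
    import time

    import dns.edns
    import dns.message
    import dns.query
    import dns.rdatatype

    RESOLVER = "8.8.8.8"  # illustrative ECS-aware public resolver

    def resolve_with_ecs(domain, subnet_ip, prefix=24):
        """Resolve domain while claiming, via the ECS option, to sit in
        subnet_ip/prefix; returns the A records the CDN hands that subnet."""
        ecs = dns.edns.ECSOption(subnet_ip, prefix)
        query = dns.message.make_query(domain, "A", use_edns=0, options=[ecs])
        answer = dns.query.udp(query, RESOLVER, timeout=3).answer
        return [rdata.address for rrset in answer
                if rrset.rdtype == dns.rdatatype.A for rdata in rrset]

    def tcp_rtt_ms(ip, port=80, timeout=2.0):
        """TCP connect time as a rough RTT proxy (real ping needs raw sockets)."""
        start = time.monotonic()
        try:
            socket.create_connection((ip, port), timeout=timeout).close()
        except OSError:
            return float("inf")
        return (time.monotonic() - start) * 1000.0

    def traceroute_hops(ip):
        """Router addresses on the forward path, via the system traceroute
        (assumes a Unix-like host)."""
        out = subprocess.run(["traceroute", "-n", "-w", "2", ip],
                             capture_output=True, text=True).stdout
        return re.findall(r"^\s*\d+\s+(\d+\.\d+\.\d+\.\d+)", out, re.MULTILINE)

    def find_valley(domain, client_ip):
        """Hop-limited search: candidate subnets come only from routers on
        the path to the default replica, the limitation noted above."""
        default = resolve_with_ecs(domain, client_ip)[0]
        baseline = tcp_rtt_ms(default)
        best_ip, best_rtt = default, baseline
        for hop in traceroute_hops(default):
            for candidate in resolve_with_ecs(domain, hop):
                rtt = tcp_rtt_ms(candidate)
                if rtt < best_rtt:  # latency ratio rtt/baseline below 1: a valley
                    best_ip, best_rtt = candidate, rtt
        return best_ip, best_rtt, baseline

As the sketch makes visible, nothing in the mechanism forces the
candidate subnets to come from the traceroute hops; any subnet could be
assimilated, which is why the restriction to hop-replica sets deserves
justification.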
Drongo's performance gains seem low in many cases (e.g., 5.18% and
5.85%). The statement "an unrestricted adoption of the ECS option [..]
is in the best interest of every CDN" is overly strong and unsupported,
especially given that the authors report Akamai indeed restricts ECS.

The hop-exclusion criteria in Section 3.1 are also unjustified. For
example, if the same /16 belonged to a different AS than the client's,
it might well be served by another CDN replica. Why exclude it in that
case?

In Section 3.2.2, the authors "consider how the latency ratio changes as
[they] increase the time passed between trials". I was waiting to see
that analysis, but I did not find it discussed anywhere later in the
paper. So was the time between trials (within a window) increased? By
how much, and what was the effect? The window size range of 1-15 also
does not indicate whether the time between those trials was again fixed
at 1-2 hours or was increased as suggested. The time between trials can
affect the predictability results, so it is imperative to discuss this
aspect.

In Fig. 4a, Google and CloudFront seem to have the same CDF (y-value) at
a valley frequency (x-value) of 0.5, yet they differ in the last column
of Table 1. If the table data was compiled from multiple sources, please
clarify.

Editorial comments:
- Page 2, 3rd line in the bottom-left paragraph: the word "speed" is
  unnecessarily hyphenated and split.
- Page 4, bottom of the 2nd-to-last paragraph on the left: "this is such
  a case" - awkward grammar.
- Page 4, 1st paragraph under 2.4: "ratios below 1" and "ratio above
  one" - unify the format.
- Page 6, 2nd paragraph under Section 3.2: "being suggested a replica on
  par" - should be "path"?
- Later in the same paragraph: "i.e., the Column 2" - remove "the".
- Page 7, 1st paragraph under Section 3.2.2: "we take a window m of size
  N" - variable "m" is never used again in the paper.
- Page 10, 2nd paragraph in the right column: "We make the several
  insights" - remove "the".
- Next paragraph: "indicating that at even the vt can be too leanient" -
  remove "at"; "leanient" is misspelled.
- Figs. 7, 8, 9: vf cannot be greater than 1, so please fix the three
  legends at "vf>=1.0".


Review #33C
===========================================================================

Overall merit
-------------
3. Weak accept

Paper summary
-------------
This paper presents a system that improves the assignment of a CDN
server to a client. CDNs use the EDNS extension to map clients to a
geographic area and assign them to a specific replica. The authors
propose a methodology in which the client traceroutes the path to the
CDN replica and then checks whether the hops found upstream are directed
to a "closer" replica in terms of latency. Focusing on 6 different CDNs,
the authors show that sub-optimal assignment is a common problem and
that, with their approach, latency decreases by about 25% in the median
case (5.8% on average).

Strengths
---------
- The problem is relevant, and the authors have done a reasonable job of
  explaining that sub-optimal assignment often happens.
- The proposed approach achieves good performance with relatively low
  overhead.
- It is commendable that the authors use two different measurement
  platforms to validate their results.

Weaknesses
----------
- The latency between the CDN and the clients is only one of the
  criteria that CDNs consider, and may be less important than, e.g.,
  content placement, business relationships, bandwidth, etc.
- Most of the gains are expressed in relative terms.
  So the 25% median (5.8% average) gain could be too small in absolute
  terms to have any practical relevance.
- The authors decided to ignore (or hide their thoughts about) the fact
  that paths are generally asymmetric due to policy routing and traffic
  engineering.

Comments for author
-------------------
The paper makes for a nice read: a simple solution to an actual problem,
with fairly conducted experiments. Yet the relevance, applicability, and
prospective impact of the solution, once applied, remain unclear.

For instance, concerning relevance: the authors ignore (or hide their
thoughts about) the fact that paths are generally asymmetric due to
policy routing and traffic engineering. CDNs want to improve their
performance on the backward path, which cannot be studied with
traceroutes from the client side alone. This study fails to consider the
backward path, although it would be interesting to study; this problem
has been studied in the past in [1,2,3,4].

Next, concerning applicability: the latency between the CDN and the
clients is only one of the criteria that CDNs consider (as the authors
mention). Other criteria, e.g., content placement, business
relationships, bandwidth, etc., play an important role in the decision,
and this is not studied in the paper. It is not clear to what extent
applicability could be hampered by such constraints (i.e., without
violating business relationships).

Furthermore, concerning the prospective impact of the results: even
considering delay alone, the extent of the potential improvement for the
end user is unclear. Indeed, all figures (except Fig. 3) show the
latency ratio, and clearly a 50% improvement from 100 ms or from 1 ms
has a very different significance for the user; a consistent
quantitative view (as in Fig. 3) is missing. Additionally, the
quantitative view in Fig. 3 leaves the reader quite skeptical: one can
infer that there is not much to be said for Alibaba (the Drongo and
original selections both fall in the 100 ms-1000 ms range), and it is
unclear whether the approach could be used at all for CDNetworks, since
from the authors' explanation it seems that CDNetworks uses a
combination of anycast and DNS redirection (a solution proposed by
Microsoft [5], though I am not sure whether CDNetworks has adopted it).

Additionally, concerning the results, the authors also consider transfer
time (see Fig. 4); however, the picture is totally unclear. E.g.,
Fig. 4b claims to be a CDF (on the y-axis) of the total download time
(in caption b), but the x-axis reads as valley frequency. It is thus
unlikely that the figure is really a distribution of the download time,
and the associated comment (Sec. 3.2.1) does not clarify matters at all.
Additionally, when comparing Fig. 4b (without caching) to Fig. 4c (where
caching takes place), one cannot see any difference: this can, on the
one hand, imply that the first selection was good (i.e., because the
content was already there) or imply that there is no difference overall
in using Drongo (i.e., when the first request to a Drongo-selected node
implies a miss and a request to the original-CDN-selected node).

Other remarks:
- It is quite difficult to read Figs. 4 and 5.
- Reference 7 is duplicated as reference 8.
- Figures 6 and 11 are not directly comparable, as you explained. It
  would be nice to repeat the experiment from RIPE Atlas with the same
  configuration to compare them.

[1] He, Y., Faloutsos, M., Krishnamurthy, S., Huffaker, B.: On Routing
Asymmetry in the Internet. In: Global Telecommunications Conference
(GLOBECOM 2005), vol. 2, p. 6. IEEE (November 2005).
[2] He, Y., Faloutsos, M., Krishnamurthy, S.V.: Quantifying Routing
Asymmetry in the Internet at the AS Level. In: GLOBECOM 2004, pp.
1474-1479. IEEE (2004).
[3] Paxson, V.: End-to-End Routing Behavior in the Internet. In:
Conference Proceedings on Applications, Technologies, Architectures, and
Protocols for Computer Communications (SIGCOMM 1996), pp. 25-38. ACM,
New York (1996).
[4] Schwartz, Y., Shavitt, Y., Weinsberg, U.: On the Diversity,
Stability and Symmetry of End-to-End Internet Routes. In: INFOCOM IEEE
Conference on Computer Communications Workshops 2010, pp. 1-6 (2010).
[5] Flavel, A., et al.: FastRoute: A Scalable Load-Aware Anycast Routing
Architecture for Modern CDNs. In: NSDI 2015.


Review #33D
===========================================================================

Overall merit
-------------
1. Reject (I will fight against this paper)

Paper summary
-------------
The paper proposes a method to reduce the latency of reaching a CDN
replica: to this end, the client sends multiple DNS requests indicating
different originating subnets, selecting the closest replica through
pings. The authors analyze different metrics to show that it is possible
to decrease the latency, and they test their solution on different CDNs.

Strengths
---------
Interesting idea, since the CDN matching system is complex and does not
always return the best replica for the client.

The evaluation includes tests from different vantage points and multiple
CDNs.

Weaknesses
----------
The authors explore the presence of valleys without focusing on the
stability of such valleys.

The system does not work in real time, so it may have problems with
entries with short TTLs.

Comments for author
-------------------
In Sections 3.2.1 and 3.2.2, the authors consider two questions: whether
valleys persist and whether they can be predicted. Instead, the
fundamental question should be whether the *same* valleys persist, so
that they can also be used in the future. This is actually the essence
of their method: finding alternative CDN replicas during periods of low
client activity, and using this information when needed. If the same
valleys do not persist (e.g., because the content has been removed from
a CDN server), then the approach is useless!

In addition to this issue, there is another fundamental problem. CDNs
make heavy use of short TTLs for their entries. How can the proposed
methodology cope with short TTLs? It seems that Drongo goes in the
opposite direction, i.e., it collects alternative CDN replicas when they
are not needed and uses this information (which could be stale) when
necessary. Overall, therefore, it seems that the system targets a small
set of objects (the ones with long TTLs), and it does so incorrectly (it
does not ensure that the valleys are consistent, i.e., that they are the
same valleys).


Review #33E
===========================================================================

Overall merit
-------------
4. Accept

Paper summary
-------------
The paper seeks to reduce latency between CDN end users and their
servers. The work leverages the ECS (EDNS0 Client Subnet) option, which
enables a DNS client to specify its subnet in name-resolution messages.
The proposal traceroutes the path to the server suggested by the CDN
and, for each router on the path, exercises the ECS option with the
router's subnet to discover CDN servers with lower latency. Drongo is a
specific instance of this main idea and applies it conservatively,
selecting an alternative CDN server only when its latency advantage is
likely to persist. The paper evaluates the main idea and the Drongo
instance on PlanetLab and RIPE Atlas for six CDNs that support the ECS
option.
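To illustrate the conservative criterion just summarized, here is a
minimal sketch of a sliding-window decision rule. The names follow the
terminology used in the reviews (window size N, valley threshold vt,
valley frequency vf), but the exact rule and all default values are
assumptions for illustration, not the paper's:

    from collections import deque

    def make_valley_predictor(N=5, vt=0.95, vf_min=0.8):
        """Apply subnet assimilation only when, over a window of the last
        N trials, the candidate replica beat the default one (latency
        ratio below the valley threshold vt) in at least a fraction
        vf_min of the trials, i.e., the valley looks persistent rather
        than transient. All defaults are assumed, not from the paper."""
        ratios = deque(maxlen=N)

        def observe(candidate_rtt, default_rtt):
            ratios.append(candidate_rtt / default_rtt)
            if len(ratios) < N:
                return False  # not enough history to trust the valley yet
            vf = sum(r < vt for r in ratios) / N
            return vf >= vf_min

        return observe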
Strengths
---------
* The paper proposes a creative idea.
* The extensive measurement-based studies show the idea to be workable
  and effective.
* The solution is simple and easy to deploy.
* Providing false information as a common mode of Internet-protocol
  operation is an emerging pattern that deserves better understanding.

Weaknesses
----------
* The paper does not study the implications of the proposal in scenarios
  where the CDN selects a server with larger latency due to
  load-balancing or business reasons.
* More generally, the paper does not sufficiently position the proposal
  within the general context of Internet content delivery.
* The presentation would benefit from various improvements.

Comments for author
-------------------
The paper proposes a creative idea. While the topic of latency reduction
in content distribution is of high current interest and has attracted
intensive research efforts, the submission is the first to use the ECS
option to tackle the problem. The high creativity alone makes the paper
a strong candidate for acceptance.

The extensive measurement-based studies show the idea to be workable and
effective. The measurement results come from two major, globally
distributed platforms. The evaluation thoughtfully examines a number of
pertinent factors and reveals that the proposal not only works but also
provides surprisingly large performance improvements. Specifying a
different subnet delivers significant latency reductions for substantial
portions of the considered end-user populations.

The solution is simple and easy to deploy. In the presented basic
instance, the solution employs relatively few measurements and modifies
the end-user side only. Not requiring changes or cooperation from the
CDN or ISPs constitutes a solid basis for the proposal's adoption.

Providing false information as a common mode of Internet-protocol
operation is an emerging pattern that deserves better understanding. The
submission cites a couple of recent papers in routing and traffic
engineering where supplying bogus information is shown to be an
effective way of achieving performance benefits. This might mark a
transition to an Internet that resembles the animal and human worlds,
where deception is routine and has to be dealt with in everyday
operation.

The paper does not study the implications of the proposal in scenarios
where the CDN selects a server with larger latency due to load-balancing
or business reasons. While the submission mentions both scenarios, it
lacks an in-depth assessment of whether the proposal can severely
unbalance the CDN server load or make the CDN serve an end user from a
server that is contractually dedicated to users from another subnet.

More generally, the paper does not sufficiently position the proposal
within the general context of Internet content delivery. While latency
is an important metric, CDN cache provisioning also involves minimizing
the CDN operator's cost while satisfying the predicted load: S. Hasan,
S. Gorinsky, C. Dovrolis, and R. Sitaraman, "Trade-offs in Optimizing
the Cache Deployments of CDNs", Proceedings of IEEE INFOCOM 2014, pp.
460-468, April 2014.
Also, because cache provisioning is done on a slower time scale (e.g.,
months), real-time load balancing is needed to address mismatches
between the predicted and actual loads on the CDN server infrastructure.
The paper would benefit from clarifying the role of the proposed method
in the global process of CDN provisioning and operation. What are the
likely reactions of CDNs to the proposed usage of the ECS option? Even
though the paper claims that the ECS option is in the best interest of
every CDN, the reluctance of Akamai to support ECS is a strong
counterargument, with reverse engineering being a reasonable concern.
Also, the paper does not evaluate the overhead imposed by the proposal
on the DNS system.

The presentation would benefit from various improvements. Whereas the
paper makes its individual points clearly, there is somehow a lack of
overall coherence. The two measurement studies are conducted in
different settings: PlanetLab with the median HRM versus RIPE Atlas with
the minimum HRM. Drongo, which is highlighted in the title and hence
expected to be the main contribution, is not discussed until page 9. The
submission does not number its pages. A footnote would help to explain
the Drongo name by connecting it to the bird's tricks. Some text is
repetitive; e.g., the paragraphs "By conducting experiments ..." and "We
implement and evaluate ..." in the Introduction convey similar
sentiments and can be merged. The writing has a number of typos, e.g.,
"are a common phenomena" (thrice), "spe- ed", "after first attempt". The
Introduction and Conclusion highlight the 25% median latency improvement
and omit mentioning the 5% aggregate performance gain, which sounds less
impressive. Overall, the submission would benefit from a complete
rewriting pass to present its exciting material in a more consistent
style.


Comment @A1
===========================================================================
## PC Discussion summary

The committee discussed and accepted the paper at the in-person meeting
in Rome. The PC greatly appreciated the creativity of the submission.
The main idea was judged to be fresh and workable. On the other hand,
the submission would benefit from evaluating and presenting the idea
more coherently, with a clearer positioning of the proposal within the
general context of CDN operation. The impact of the proposal on CDN load
balancing and business obligations was identified as a major concern.
While the PC acknowledged that studying all relevant aspects
comprehensively is impossible, the authors are requested to discuss
these issues more deeply in the final version.