Reviewer #1:

Very interesting paper, and novel idea, however, as stated in Line 60, Page 1, "a subset of this work appears in the Proceedings of ACM Sigcomm'06". We can see that these two versions have almost the same abstract. Can you describe how different between them in content?

(*) Beyond the content of our Sigcomm'06 paper, we added Section VI, which evaluates the proposed one-hop source routing using BitTorrent peers as potential intermediate nodes. We modified the abstract to reflect this.

We found that in different sections of paper, different nodes in PlanetLab testbed are used. For example, to evaluate the server density, Berkeley and Purdue nodes are used, when evaluating the Redirection dynamics (Fig 5), Berkeley, Korea and Brazil nodes are used. While in next figures, MIT cs.vu.nl, mp.br, taiwan and UK nodes are used as well. It somehow confuses me. What's the reason to select different ones? Will the same set of nodes offer more insightful observations and give a consistent view to readers?

(*) This is an interesting point raised by the reviewer. Different PL nodes have different views of the Akamai network. We select different sets of representative nodes to emphasize different characteristics of the Akamai network. For example, the Berkeley node shows apparent time-of-day effects, while the Brazil node experiences long inter-redirection times, and hence we use them to emphasize these particular features. In addition to our case studies, we plot aggregate results in Figures 3, 8, and 10, as well as in Table 1.

Line 4, page 10, you let nodes measure the distance to the Akamai edge server to determine the route. Can you analyze the measurement cost? Since you argued in "related work" section that schemes in [31],[32] are not scalable due to its monitoring or measurement overhead. How about comparing yours with theirs?

(*) The reviewer correctly points out that we use pings (in Section V-D) between nodes. Note, however, that these measurements serve only to validate whether following Akamai's redirections works in a particular case. Regarding the overhead, the results (Figure 14) show that the update interval in our case can be as long as almost 2 hours before performance declines significantly. Moreover, we show that with update intervals on the order of a day, we can still outperform the direct path on average. Thus, our effective measurement overhead per path is one request to Akamai's DNS infrastructure per 20 seconds.

To compare our measurement overhead to that of references [31],[32] (now references [32] and [33]), we proceed as follows. Assume an overlay consisting of N nodes. In the RON case [32], the measurement overhead is O(N^2). In contrast, the overhead of our approach is O(N), because each node queries Akamai's DNS system for redirection hints to discover appropriate detouring nodes. The scheme from [33] is designed for structured and topology-aware p2p systems. In addition to probing neighboring nodes, its nodes need to disseminate routing information throughout the system. Hence, comparing its communication overhead to that of our scheme is not trivial.
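
The scaling argument above can be illustrated with a minimal sketch. It is not taken from the paper: it assumes that all-pairs probing means each node probes every other node once per round, and that our scheme issues one DNS query per node per round. The function names are hypothetical.

```python
def all_pairs_probes_per_round(n: int) -> int:
    """All-pairs probing (assumed model): each of n nodes probes the other n - 1."""
    return n * (n - 1)


def cdn_queries_per_round(n: int) -> int:
    """CDN-driven scheme (assumed model): one DNS query per node per round."""
    return n


if __name__ == "__main__":
    for n in (10, 50, 100):
        print(n, all_pairs_probes_per_round(n), cdn_queries_per_round(n))
```

Under these assumptions, a 50-node overlay would issue 2450 probes per round with all-pairs probing, against 50 DNS queries per round with our scheme.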

Line 50, page 4, we're told that ATT node experienced low server density (up to 340 servers found) while lbnl node has higher density (only 2 servers). This somehow surprised me. As I learned from this paper that Akamai deployed its server on most of major ISPs (ATT would be included), and if the node is very close to the Akamai edge server, it may experience high server density. I'm not sure whether this ATT node is inside its network and close to any of Akamai edge server located in ATT network.

(*) We included an additional table (Table 1 on page 4) to demonstrate the relationship between server diversity and the network distance between PL nodes and edge servers. The results indicate that the larger the network distance between a node and its Akamai servers, the larger the number of associated edge servers. We also investigated in more depth the ATT node issue raised by the reviewer. The ATT node has the IP address '204.178.4.163'. In our measurement dataset, we did not find an edge server that shares the same network with this node (subnet 204.178.4.0/24). The average distance between this node and its edge servers is 55.829 ms. Therefore, this node is apparently not close to Akamai hot spots.
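
The subnet check described above can be sketched as follows. This is an illustrative reconstruction, not the script used for the paper: the function name is hypothetical, and the edge-server addresses in the usage note are placeholders rather than entries from our dataset.

```python
import ipaddress


def edge_servers_in_node_subnet(node_ip, edge_ips, prefix_len=24):
    """Return the edge-server IPs that fall in the node's /prefix_len subnet."""
    # strict=False lets us derive the network from a host address.
    subnet = ipaddress.ip_network(f"{node_ip}/{prefix_len}", strict=False)
    return [ip for ip in edge_ips if ipaddress.ip_address(ip) in subnet]
```

For the ATT node, `edge_servers_in_node_subnet("204.178.4.163", observed_edge_ips)` returning an empty list corresponds to the finding that no observed edge server shares the subnet 204.178.4.0/24, where `observed_edge_ips` stands for the list of edge-server addresses seen by that node.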

Some other minor issues: The captions of Fig.1, 6, 11 are not consistent, some use "illustration", some don't. The font in Fig 1, 6, 11 is small and hard to read. In page 7, we expect to find Fig 9, but there's Fig 10, while Fig 9 is in page 8. line 26, page 9, why using 91 pairs instead of 91x2=182? You have said that the one-hop routing is asymmetric, so it's better to compare all of them. In footnote 1 of Page 6, since you have known that CNN is supported by Akamai currently, so it would be better to add one more curve to reflect its current normalized rank for all PL node.

(*) We fixed the figures mentioned by the reviewer and rearranged Figures 9 and 10 so that they appear on the proper pages. The reviewer correctly points out that there are 182 possible unidirectional paths between the 91 pairs. For each pair, we randomly assigned one node as the source and the other as the destination, and measured the resulting unidirectional path. We did this in order to scale our experiments and to ensure timely responses from our measurement processes on PlanetLab nodes with varied load conditions. Figure 8 reflects the state of things at the time of measurement. By the time we submitted this paper for ToN review, CNN had switched back to Akamai. However, we have learned that CNN has since switched its CDN again, to Limelight [4]. We opted not to add a curve for the current CNN, since the Limelight CDN is out of the scope of this paper. We modified the footnote accordingly to reflect the current status of CNN.



Reviewer #2:

The paper describes how to exploit the rich measurements of Akamai for clients to detect the network latency changes between the clients and the Akamai Edge servers. Basically, one can send DNS queries to an Akamai low-level DNS server and see whether the DNS server returns Akamai Edge Servers different from the previous one. If so, it means that the network latency from the host to the previous Akamai Edge server has been degraded. These findings suggest that when we choose an one-hop relay node, we may want to choose one that is near the suggested Akamai Edge Server.

The paper shows rich evidence that the Akamai redirection and the network conditions are correlated. Furthermore, in 50% of the investigated scenarios, the suggested Akamai edge server can be used as a relay node as it shows smaller latency than that of the direct path. To exploit these findings in a practical setting, they propose a mechanism that can map the overlay nodes to Akamai Edge Servers.

The idea is very useful. But I have several questions. I think that the paper has roughly two contributions. The first one is the measurement part, which shows that the Akamai redirection is correlated with the network latency. The second one is the proposed detouring scheme, which reduces the amount of the active measurements among overlay nodes to find the one hop relay node.

The first part is OK. However, for the second part, they should compare their results with other schemes that are to reduce active measurements among overlay nodes. For example, the following paper discusses scalable network diagnosis. [1] Y. Chen, D. Bindel, H. Song, and R. H. Katz, An Algebraic Approach to Practical and Scalable Overlay Network Monitoring, in Proceedings of ACM SIGCOMM, 2004 It would have been more interesting if they had shown the comparison with other schemes.

(*) The focus of our paper is on studying the feasibility of reusing Akamai's CDN measurement results to discover high-quality Internet paths instead of conducting measurements ourselves. Chen et al. present a linear algebraic approach to efficient monitoring of overlay paths. Their method measures k linearly independent paths and infers the packet loss rate of all other paths. Our work differs from theirs in two fundamental aspects. First, our CDN-based detouring does not require measurements among overlay nodes, beyond validating Akamai's performance at a very low frequency (e.g., once every 2 hours, Section V-D). Second, Chen et al.'s work is intended to improve system reliability by avoiding lossy network paths, while our goal is to improve client performance by selecting low-latency paths as recommended by Akamai. We included these comments as part of the related work section (page 13).

Despite the above differences, if we would like to compare the overhead of the two schemes, we can proceed as follows. Assume an overlay consisting of N nodes. As indicated above, the overhead of our approach is O(N), because each node queries Akamai's DNS system for redirection hints once every 20 seconds to discover appropriate detouring nodes. In contrast, the overhead of Chen et al.'s algorithm is O(N log N), which exceeds O(N). Moreover, to measure packet loss effectively, the probing frequency should be higher than once per 20 seconds.

Another thing is about the characteristics of the best peers. It seems that most of the best peers are near the source. I'm wondering whether this fact can be exploited to filter the candidate relay nodes in other schemes such as RON. Some comments on this direction would be helpful.

(*) Indeed, an overlay network like RON can take advantage of our methodology when selecting intermediate nodes. In particular, the source node can select potential detouring nodes by finding overlay nodes that are redirected to the same edge servers (an indication that the relay nodes are close to the source). We explore this topic in more detail in another publication:
D. R. Choffnes and F. E. Bustamante, SideStep - An Open, Scalable Detouring Service, NWU-EECS-07-08 http://www.eecs.northwestern.edu/docs/techreports/2007_TR/NWU-EECS-07-08.pdf
We addressed this point in our related work section on page 12.
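
As a rough illustration of the filtering idea described in this response (a hypothetical sketch, not the SideStep implementation), a source node could rank candidate relays by how many edge servers they share with it. The data layout and function name below are assumptions made for the example.

```python
def candidate_relays(source, seen_edge_servers):
    """Rank overlay nodes by the number of edge servers they share with `source`.

    seen_edge_servers: {node_name: set of edge-server IPs the node was redirected to}
    Nodes with no shared edge server are filtered out.
    """
    mine = seen_edge_servers[source]
    overlap = {
        node: len(mine & servers)
        for node, servers in seen_edge_servers.items()
        if node != source
    }
    # Largest overlap first; ties broken by node name for a stable order.
    return sorted(
        (node for node, shared in overlap.items() if shared > 0),
        key=lambda node: (-overlap[node], node),
    )
```

The choice of "at least one shared edge server" as the filter threshold is arbitrary here; a real deployment would likely tune it.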

Finally, one of the experiments uses 305 CDN clusters. It would be more interesting if they had shown the distribution of the distances between the peers and the nearest Akamai Edge Servers to see how well the mapping mechanism works. If some of the nodes are not close enough to the Edge Servers, those nodes are effectively removed from the candidate relay node sets by the filtering heuristic, which might make the selection suboptimal because those nodes are not considered as candidate relay nodes.

(*) The reviewer correctly points out that the mapping mechanism can potentially affect the system's effectiveness. This problem is an instance of a general networking problem: closest node selection (i.e., how to select the node that is closest to another node in the network sense). We have studied this subject in the context of network positioning. Our findings show that the relative distance between nodes can be inferred from the degree of overlap between their observed edge servers. The details of our CDN-based network positioning system can be found in the following publication:
A-J. Su, D. R. Choffnes, F. E. Bustamante and A. Kuzmanovic. “Relative Network Positioning via CDN Redirections”, Proc. of the International Conference on Distributed Computing Systems (ICDCS), June 2008.

One minor comment is about Fig. 10. The Fig. 10 is quite confusing. It might be better to use the real delay difference instead of the absolute one. Furthermore, it might be better to use linear scale instead of the log scale. In short, the style of Fig. 13 would be better fit in this situation.

(*) As suggested by the reviewer, we modified the figure to show the real difference instead of absolute values and to use a linear scale. We changed the text accordingly.



Reviewer #3:

This paper uses 140 PlanetLab nodes to conduct an extensive active measurement study of a large CDN (Akamai's) DNS redirection mechanism and explores the idea for measuring path performance via observing how a CDN distributes its content. They observe that CDNs like Akamai perform extensive network performance measurements for determining how clients are assigned to servers. While the CDN's measurements themselves are not available, the redirection actions that are based on these measurements can be observed. The idea then is to piggyback estimation of network path properties on these CDN measurements --- infer the quality of different network paths based on the observed redirection actions of the CDN. Their measurements show that in many cases, using the nodes returned by Akamai's DNS redirection results in low latency network paths between end-users and Akamai's edge nodes.

This is a fairly, large and detailed measurement study of an interesting performance question. The methodology and the results are quite interesting. The attractiveness of the main idea is that , if it is shown to work, it would obviate the need for large-scale dedicated probing by other overlay so long as (i) they have nodes that are very near to Akamai nodes (ii) the decisions from Akamai match their performance needs

One concern about the current submission is that : conclusions are made at places that are too strong and not quite backed by the results. Often point examples (e.g.., results for 1 or 2 PL nodes, or results for a single point in time) are used to make quite general statements and conclusions. The authors should make a detailed pass and either substantiate these or tone down the claims. Below are some specific suggestions/questions.

"The Berkeley nodes is served by fewer than 20 unique edge servers indicating that this node and its Akamai servers are nearby" . While this is a plausible explanation, it is not the only one. There could be other reasons why it is served by these many nodes. Have you confirmed that this is indeed true? What methodology was used to confirm the claim that "PL nodes with low server diversity typically share the network with Akamai servers "?

(*) We added a table to illustrate the above point. The table depicts the relationship between the number of edge servers seen by a PL node and the average RTT to those servers. We clustered edge servers in the same class C subnet, as they exhibit essentially identical network characteristics (e.g., RTTs to their clients). Each row in the table lists the average number of edge server clusters seen by a PL node within a particular RTT range. For instance, when PL nodes are on average less than 5 ms away from their edge servers, they see a low number of edge server clusters (2.18 on average). Since Akamai tries to direct its clients to their closest edge servers for better performance, we are not surprised to see that nodes located within Akamai hot spots are redirected to their de facto edge servers most of the time. On the other hand, when PL nodes are farther than 150 ms away from their edge servers, they see an order of magnitude more edge server clusters on average (45.25).
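
The aggregation behind such a table can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not our analysis script: the input layout, the function names, and the RTT bin boundaries passed by the caller are all hypothetical.

```python
import ipaddress
from collections import defaultdict
from statistics import mean


def cluster_count(edge_ips):
    """Number of distinct /24 (class C) subnets among the given edge-server IPs."""
    return len({ipaddress.ip_network(f"{ip}/24", strict=False) for ip in edge_ips})


def avg_clusters_by_rtt(observations, bin_edges_ms):
    """Average number of edge-server clusters per node, grouped by the node's mean RTT.

    observations: {node: [(edge_ip, rtt_ms), ...]}
    bin_edges_ms: ascending upper bounds of the RTT ranges, e.g. [5, 150]
    """
    bins = defaultdict(list)
    for samples in observations.values():
        avg_rtt = mean([rtt for _, rtt in samples])
        clusters = cluster_count([ip for ip, _ in samples])
        # First bin whose upper bound exceeds the node's mean RTT, else the overflow bin.
        label = next(
            (f"<{edge}ms" for edge in bin_edges_ms if avg_rtt < edge),
            f">={bin_edges_ms[-1]}ms",
        )
        bins[label].append(clusters)
    return {label: mean(counts) for label, counts in bins.items()}
```

Each key of the returned dictionary corresponds to one row of the table: an RTT range and the average number of clusters seen by the nodes in that range.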

The text suggests that long time intervals between redirection updates are undesirable. There seems to be a hidden presumption here that longer redirection intervals imply that performance for clients is less. Isn't it possible that some longer intervals between redirections were exactly the right choice - because no changes were needed in between?

(*) The reviewer is correct. Long redirection intervals do not necessarily mean worse performance when the network conditions are stable and the redirections are the right choices. In fact, this does not affect our proposed one-hop source routing, since right choices essentially provide good hints for finding intermediate nodes for detouring. We toned down our statements on page 5 to clarify this point.

Please clarify if each measurement point in Fig 7 corresponds to a 5 sec or 20 sec window. It seems it should be 5 sec, since you run pings every 5 sec, and a new ping round can potentially change the relative rankings.

(*) In each 20 sec round, we averaged the 4 pings to obtain the measured latency between a client and each edge server. The servers were then ranked by this averaged latency. We modified the text on page 5 to clarify our methodology.
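
The per-round ranking can be sketched as below. This is a minimal illustration rather than our measurement code; it assumes all four pings in a round succeed, and the function name and input layout are hypothetical.

```python
from statistics import mean


def rank_servers(pings_by_server):
    """Rank edge servers within one 20 sec round by their averaged ping RTT.

    pings_by_server: {server_ip: [rtt_ms, rtt_ms, rtt_ms, rtt_ms]}
    Returns {server_ip: rank}, where rank 1 is the lowest averaged RTT.
    """
    avg_rtt = {server: mean(rtts) for server, rtts in pings_by_server.items()}
    ordered = sorted(avg_rtt, key=avg_rtt.get)
    return {server: rank for rank, server in enumerate(ordered, start=1)}
```

Because the ranking is computed once per round from the averaged value, a single 5 sec ping cannot change the relative ranks on its own.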

Fig 8: it is not clear if you consider one particular time interval and look across PL nodes or if you consider across multiple time intervals in this graph. If the former, then how was that interval selected? - and how does it generalize to other intervals ? If the latter, then explain how you combined results from the different time intervals. The observations about CNN are interesting and quite neat. The conclusion that "Akamai redirection overwhelmingly correlate with network conditions does seem too strong given that the actual results are somewhat more mixed." The paper uses the example of the customer with the best performance to make their point. But all that says is for that particular customer the statement holds, nothing more. Even ignoring CNN, others like Yahoo have mixed results.

(*) Our experiment looked across 140 PL nodes and was executed every 20 seconds for 7 days. Since Akamai's low-level DNS servers refresh every 20 seconds, we set the duration of each round to 20 seconds in order to capture every redirection change in the Akamai network. In addition, we spaced pings 5 seconds apart in order to observe network conditions at a finer granularity while still being able to scale our experiments.

We changed the text on page 5 to clarify our methodology as follows: “As in the above experiments, each of the 140 nodes sends a DNS request for one of the Akamai customers every 20 seconds and records the IP addresses of the edge servers returned by Akamai. This enables us to capture every redirection change in our monitored Akamai network since Akamai’s low level DNS servers are set to refresh in every 20 seconds as we discussed in section II. In addition, every 5 seconds, each PL node pings a set of the 10 best Akamai edge servers in order to gain a finer granularity of network conditions between a PL node and Akamai edge servers. The 4 pings measurements in every 20 second are then averaged to be the estimated RTT.”

The reviewer is correct that Yahoo does not show better results for the first 40 PL nodes. However, other customers do show a high correlation between Akamai's redirections and network conditions. We modified the text on page 6 and toned down our conclusion appropriately.

Best delay computation for a 20 sec interval - is it the lowest RTT among the 4 5-sec ping sets in that 20 sec ?

(*) In each round, the four 5-sec pings are averaged to obtain the estimated RTT between the client and the edge server. We changed the description of our measurement methodology on page 5 (same as above) to clarify this point.

Fig 10. It’s not clear how the host ids were ordered on the y-axis. A better way would be to plot the CDF of the absolute difference. Also, why consider the average difference ? Wouldn't some other measure of the distribution like the median difference be more appropriate?

(*) The host IDs on the x-axis are sorted by the difference of the average delay. We modified the figure to use real values instead of absolute values, as suggested by Reviewer #2. We use the average because it represents a random selection of one of the 10 best edge servers, and the difference of RTTs then represents the gain or loss compared to that random selection. Using the median as the baseline of comparison would blur the worse paths in the result. This would in turn mask Akamai's ability to avoid worse paths (e.g., paths that were once good but later became congested), which is valuable for our one-hop source routing.
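
A small numerical sketch may help show why the choice of baseline matters. The RTT values below are invented for illustration and do not come from our dataset; the function name is hypothetical.

```python
from statistics import mean, median


def gain_vs_baselines(akamai_rtt_ms, best10_rtts_ms):
    """Return (gain over the mean baseline, gain over the median baseline) in ms.

    A positive gain means the Akamai-selected server is faster than the baseline.
    """
    return (
        mean(best10_rtts_ms) - akamai_rtt_ms,
        median(best10_rtts_ms) - akamai_rtt_ms,
    )


if __name__ == "__main__":
    # Nine healthy paths and one congested path (illustrative values only).
    rtts = [20, 21, 22, 23, 24, 25, 26, 27, 28, 300]
    print(gain_vs_baselines(22, rtts))
```

With these invented values, the gain relative to the mean is roughly 29.6 ms, while the gain relative to the median is only 2.5 ms: the median hides the congested path that Akamai's redirection avoids.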

One application described in the paper is that of using the Akamai-based info for routing between overlay nodes. [...] However for this to be useful, the overlay would need to have a relay node that is close to the said Akamai server. The question then is that if you already have an overlay infrastructure with nodes near the Akamai servers, why not run direct measurements between the overlays nodes instead of trying to use inferences via Akamai ? Saving some probe traffic overhead clearly? Akamai's redirection decisions may potentially be tailored to the type of content being delivered and to other business decisions. These may be different from the overlays needs. How would the overlay determine when it can use Akamai and when it needs to run its own measurements?

(*) An overlay network is hard to scale if it needs to perform measurements between all of its nodes. For example, RON provides near-optimal performance at the cost of measurement overhead that is quadratic in the number of nodes in the system. In this paper, we focus on the problem of how to select a detouring node by reusing Akamai's measurement results instead of measuring the overlay ourselves. Our proposed methodology and evaluation results show that, by following Akamai's recommendations, an overlay node can find a good detouring node without extensively probing nodes in the overlay. The reviewer is correct that Akamai's redirections may be tailored to its business decisions. In Section IV, our measurement results demonstrate that Akamai's redirections are highly correlated with network conditions. Moreover, in Section V-D, we provide a path pruning algorithm to decide whether to follow Akamai's recommendation or use the direct path. In particular, if Akamai's redirections for a particular network region are driven by Akamai's business (or any other) decisions, our algorithm is capable of detecting and filtering out such events.

It seems like this piggybacking may put additional load (query resolution) on the DNS infrastructure - specifically Akamai's DNS infrastructure. Why is that a fair solution as opposed to running your own measurements? Arguably Akamai does not expect to be facilitating third party overlays. Could Akamai start using countermeasures, e.g. in the case of iterative DNS resolution? The paper addresses this question, but this is a concern which will remain. It is not enough to just state that such attempts will have "non-negligible false positives and negatives". What if a certain fraction of the numbers start getting polluted? An interesting approach could be to add some uncertainty to the accuracy of the measurements obtained from Akamai.

(*) We are well aware of the issues raised by the reviewer. Still, we note that our proposed methodology does nothing but keep track of Akamai's DNS redirections, which is exactly what a web client normally does. In addition, for an overlay node such as a BitTorrent client operated by an individual Internet user, passively monitoring the user's browser activity can eliminate the additional load on Akamai's DNS infrastructure. The reviewer is correct that Akamai could start using countermeasures. In that case, however, Akamai's regular clients would suffer from the polluted results as well. In any case, we do recognize that following Akamai's redirections is not optimal in all scenarios. In Section V-D, we provide a path pruning algorithm that handles scenarios where Akamai's recommendations are not optimal.

For a given pair of overlay nodes A, B. which DNS CNAME name resolution is the most appropriate one that A should ask for? As the paper itself observes, different content providers may be served from different edge servers. DNS resolution returned would be a function of both the name in the query, and different queries may impact performance to different extents.

(*) The reviewer correctly points out that the selection of a CNAME can potentially affect performance. We purposely do not address this issue in our current design, because the use of different CNAMEs can help avoid potential synchronization issues and can help balance the traffic load, as we discuss in more detail in Section VII (Implications of widespread adoption). In practice, individual nodes can use history to select a subset of CNAMEs that delivers the best performance.

The measurement methodology should be clarified with more details. How do you "employ 10000 randomly selected hosts"?

(*) In our experiment, we randomly selected, from the peers we collected, 10,000 BitTorrent peers whose DNS servers support recursive queries. We changed the text on page 10 to clarify this methodology.