Reviewer #1:

NSDI '11 Review #32A
Updated Friday 22 Oct 2010 1:10:13pm PDT
Paper #32: Towards Street-Level Client-Independent IP Geolocation

Overall merit: 4. Accept
Reviewer expertise: 2. Some familiarity
Is the paper easy to understand?: 4. easy

===== Paper summary =====

The paper presents an IP geo-location system that achieves accuracy below 1km. The key element that differentiates this system from prior work is that it mines the web to obtain "landmarks", i.e., {IP address, geographic location} pairs. This information is combined with delay measurements from multiple vantage points to identify the landmark that appears to be closest to the target IP address.

===== Reasons to accept (1-3 lines) =====

+ Presents a complete and practical solution
+ Significantly improves geo-location accuracy compared to prior work

===== Reasons to reject (1-3 lines) =====

- Weak motivation
- The proposed system assumes that institutions host their web-sites locally, which is becoming increasingly unlikely.

===== Comments for author =====

This is a nice piece of work that clearly improves upon prior work. I only have a couple of complaints:

- The motivation presented in the introduction is not compelling. First, to claim that one could rely on a geo-location system for location-based access restrictions or context-aware security requires showing that the geo-location system is resistant to attacks (which the authors do not do). Second, it is not clear why online advertising needs geo-location accuracy better than tens of km. Similarly, it is not clear that such fine-grained geo-location would improve the scalability and availability of cloud-computing services.

- The proposed system relies on the assumption that institutions host their web-sites locally. As institutions move their sites to the cloud (or web-hosting providers, in general) this assumption will hold in fewer and fewer areas of the world. Why not simply acknowledge that this is a fundamental limitation of the proposed system? I found the assertion in Section 5 that "... the remaining percent of websites will always create a reliable and accurate backbone for our method" unnecessary, unsupported, and provocative (in the wrong way).

Some minor comments:

- Nice evaluation methodology -- particularly the derivation of the "online maps" dataset.

- Could one verify landmarks in the following way: (1) geo-locate the landmark's IP address using some existing technique, e.g., CBG; (2) if the geographic location obtained in this way is compatible with (e.g., within tens of km from) the postal address obtained from the web, then this is a valid landmark. This would fail only if an institution happens to have its web-site hosted by a provider that is geographically located close to the postal address of the institution.
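  For concreteness, the verification idea above could be sketched as follows. This is only an illustration of the check, not the paper's method: the CBG geolocation and the geocoding of the postal address are assumed to have already produced (lat, lon) pairs, and `haversine_km`, `is_valid_landmark`, and the 50 km tolerance are invented names/values.

  ```python
  import math

  def haversine_km(lat1, lon1, lat2, lon2):
      """Great-circle distance between two (lat, lon) points, in km."""
      r = 6371.0  # mean Earth radius in km
      p1, p2 = math.radians(lat1), math.radians(lat2)
      dphi = math.radians(lat2 - lat1)
      dlmb = math.radians(lon2 - lon1)
      a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
      return 2 * r * math.asin(math.sqrt(a))

  def is_valid_landmark(cbg_estimate, postal_coords, tolerance_km=50.0):
      """Accept the landmark only if the CBG estimate for its IP and the
      geocoded postal address from its web page agree within tolerance."""
      dist = haversine_km(*cbg_estimate, *postal_coords)
      return dist <= tolerance_km
  ```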

- I am curious: How much worse would the results be if one (1) collected landmarks from the web, (2) identified which of these landmarks share a common IP address prefix or belong to the same Autonomous System with the target IP address, and (3) picked the landmark that appears closest to the target as the authors do in Tier 2 and 3? I.e., are CBG and multilateration necessary?
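  The simplified pipeline this question imagines might look like the toy sketch below: filter the web-mined landmarks to those sharing a prefix with the target (a /16 string match stands in for "same prefix or same AS"), then return the one with the smallest measured delay. All data structures and names here are hypothetical.

  ```python
  def shares_prefix(ip_a, ip_b, octets=2):
      """Crude stand-in for a common-prefix/same-AS test."""
      return ip_a.split(".")[:octets] == ip_b.split(".")[:octets]

  def pick_landmark(target_ip, landmarks, delay_ms):
      """landmarks: {ip: (lat, lon)}; delay_ms: {ip: measured delay to target}.
      Returns the location of the seemingly closest same-prefix landmark."""
      candidates = [ip for ip in landmarks if shares_prefix(ip, target_ip)]
      if not candidates:
          return None
      best = min(candidates, key=lambda ip: delay_ms[ip])
      return landmarks[best]
  ```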

- I think the fact that the authors were able to build this system should raise privacy concerns. Perhaps this work should be viewed as a wake-up call for looking into ways to prevent such accurate geo-location systems from working.

Reviewer #2:

NSDI '11 Review #32B
Updated Monday 25 Oct 2010 1:21:18am PDT
Paper #32: Towards Street-Level Client-Independent IP Geolocation

Overall merit: 5. Strong accept
Reviewer expertise: 3. Knowledgeable
Is the paper easy to understand?: 4. easy

===== Paper summary =====

This paper achieves astonishingly good IP geo-location by using, as landmarks, web servers that publish address information and are suspected of being co-located with that address. They describe the approach, and show convincing results based on three datasets.

===== Reasons to accept (1-3 lines) =====

Amazing results... a couple of orders of magnitude better than previous approaches, and in my mind good enough results to be used for, say, location-targeted advertising.

===== Reasons to reject (1-3 lines) =====


===== Comments for author =====

Well done guys! I was completely skeptical of your approach until I got to figure 5 (and of course figure 8). I would not have guessed that there would be such a nice correlation between delay and distance. This guess is based on the kind of reasoning in the IMC paper "On the difficulty of finding the nearest peer in p2p systems". This paper argued that the collection of hosts from the hosts' nearest access POP would radiate outwards from that POP, and in some sense all look the same distance from an external measuring point. For instance, if two locations were east and west of a POP by the same distance, they should have similar delay characteristics, which should confound your approach. Of course the line in figure 5 is not perfectly straight, but it is surprisingly good. I do wonder if there could be a problem of having too many landmarks near a target, so that the noise in figure 5 has you pick the wrong one, but in any event your results speak for themselves.

I share your concern about web sites moving to the cloud (and find your argument that there will always be enough self-hosters to be too speculative), but for now you have great results.

Personally I think you should state the median result of all three of your datasets in the abstract, rather than your best data set... the result is still very impressive, and it's more honest.

A few typos:

"do play the role" should be "do play a role".

"targets within few kilometers" should be "targets within a few kilometers"

"information from US." should perhaps be "information from within the US."

Reviewer #3:

NSDI '11 Review #32C
Updated Monday 29 Nov 2010 9:30:19pm PST
Paper #32: Towards Street-Level Client-Independent IP Geolocation

Overall merit: 2. Weak reject
Reviewer expertise: 2. Some familiarity
Is the paper easy to understand?: 4. easy

===== Paper summary =====

The paper presents a new IP geolocation method that accurately maps a given IP to a geographic location. They determine the rough location of the target with a set of public ping/traceroute servers whose locations are known. Based on the intersecting measurement rings from the vantage points, they find a subset of ZIP codes near the target location. Then, they use Web-based landmarks in those ZIP codes (whose locations are again fixed) to estimate the delay (and distance) between the landmarks and the target via traceroutes. They find the ZIP codes near the center of the measurement rings from the landmarks, and recursively find the landmarks in those ZIP codes until no points in a round belong to the intersection. They finally determine the closest landmark to the target and derive the location from it.
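For intuition only, the recursive narrowing described above can be caricatured in a few lines. Everything here is invented for illustration: the data, the squared-distance stand-in for measured delay, and the keep-the-closer-half pruning are not the paper's actual algorithm.

```python
def narrow(target, zip_landmarks, max_rounds=10):
    """zip_landmarks: {zip: [(lat, lon), ...]}; target: (lat, lon).
    Each round ranks candidate ZIPs by their closest landmark's
    'delay' to the target and keeps the closer half, stopping when
    the candidate set no longer shrinks."""
    candidates = set(zip_landmarks)
    for _ in range(max_rounds):
        # delay proxy: squared distance from each landmark to the target
        ranked = sorted(
            candidates,
            key=lambda z: min(
                (lat - target[0]) ** 2 + (lon - target[1]) ** 2
                for lat, lon in zip_landmarks[z]
            ),
        )
        shrunk = set(ranked[: max(1, len(ranked) // 2)])
        if shrunk == candidates:  # no further shrinking possible
            break
        candidates = shrunk
    return ranked[0]  # ZIP of the seemingly closest landmark
```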

===== Reasons to accept (1-3 lines) =====

A methodical IP geolocation with given restrictions. Using Web server locations as landmarks is interesting, and the results are very accurate at least for the datasets in the paper.

===== Reasons to reject (1-3 lines) =====

1. Validation datasets are too small to be generalized. Most GeoIP DB companies boast of 80-90% accuracy (though at the city level) for the entire IP network. I don't think validation with a few hundred IPs in the U.S. (and the PlanetLab dataset is far too easy) warrants generalizing the accuracy to comprehensive datasets.

2. ZIP code based geolocation is limited to the U.S. while many services (CDNs, international company sites, etc.) need geolocation for non-U.S. IPs. The paper does mention international coverage in the discussion section, but ZIP code DBs and webserver locations may not be available in all countries.

3. This approach could face a scalability issue. Keeping the location information for a large number of landmarks up to date for the entire IP network (or even just the U.S. network) would require periodically crawling the Web sites.

===== Comments for author =====

Street-level IP geolocation might be a useful feature, but I am not sure where it is critical to have this information. The intro mentions location-based advertisement as one example, but I don't really want to be spammed with such ads. If I need to find businesses around me, I could provide my location to services like Google Maps. If I'm in the street, the GPS of my cell phone correctly determines my current location as well. Other examples that the paper suggests are location-based access restriction and context-aware security. What are these exactly? If you meant IP geolocation for a CDN system, city-level information should be more than enough. I didn't understand why cloud computing requires street-level IP geolocation, either. In short, I wasn't convinced by the motivation of this paper.

The paper extensively uses ZIP code information to locate nearby Web-based landmarks. This approach limits the applicability to the U.S., as noted in the paper. Even if we assume such information is available for the whole world, keeping the landmark database up to date could be challenging at such a large scale. How many Web-based landmarks are needed to cover the entire U.S. and the whole world?
You verify whether a site is hosted by a CDN or a Web hosting service, and whether a site has multiple branches, which is good, but verifying a large number of sites periodically would not scale.

How accurate are Google Maps and Bing Maps? I did see some errors that are not fixed for many months (even over a year) with Google Maps. Fortunately, section 3.3 says the approach is resilient to error, but have you tried injecting faults and measured the level of inaccuracy?

The last para of the intro says "all the measurements could be done within 1-2 seconds", and the discussion section says the measurements are done simultaneously. I see hundreds of ping servers in section 2, and 930 Web-based landmarks in 257 distinctive ZIP codes in section 2.2. Hundreds of ping measurements alone could trigger IDS alarms very frequently (note the PlanetLab experiments). Also, I would be interested in seeing the measurement latency distribution for your datasets.


A few sentences in the introduction are copied and pasted from [3]. You should be careful about using the phrases and sentences from other text.

Your second para of intro: ... of location-based access restrictions and context-aware security[3]. Also of rising importance is cloud computing. In an attempt to seamlessly use public and private cloud implementations for the sake of scalability and availability, hybrid architectures can leverage a highly accurate geolocation system to enable a broader spectrum of functionality and options.

Intro in [3]:
As the accuracy of geolocation technology has improved, there are more use cases for location-based networking than ever before. Advertising and performance- related implementations are still a valid use cases, but the enforcement of location-based access restrictions and context-aware security is quickly becoming more important, especially among an increasingly mobile user base.

Also of rising importance is cloud computing, which introduces new challenges to IT in terms of global load balancing configurations. Hybrid architectures that attempt to seamlessly use public and private cloud implementations for scalability, disaster recovery, and availability purposes can leverage accurate geolocation data to enable a broader spectrum of functionality and options.

Reviewer #4:

NSDI '11 Review #32D
Updated Tuesday 16 Nov 2010 9:27:46am PST
Paper #32: Towards Street-Level Client-Independent IP Geolocation

Overall merit: 1. Reject
Reviewer expertise: 2. Some familiarity
Is the paper easy to understand?: 3. mostly easy

===== Paper summary =====

This paper presents a new approach to IP geolocation. The idea is simple:
scrape actual addresses off web servers that have relatively well-known locations (e.g., a business located at some address and actually hosting their own web page). The authors do a lot of analysis to show that this idea improves geolocation over previous approaches, sometimes quite substantially.

===== Reasons to accept (1-3 lines) =====

- kind of a clever idea
- actually does improve on state of the art
- the analysis is reasonable

===== Reasons to reject (1-3 lines) =====

- Didn't really build anything (no I in NSDI)
- Is this an interesting problem?
- If so, will it be in a few years?

===== Comments for author =====


I have a hard time arguing for acceptance for this paper, for the following reasons:

- Is geolocation still a problem worth investigating? A number of techniques
already exist, and some work reasonably well. Do we need an even better
approach? The authors would be better served by building some real
applications and showing that they only work when the fidelity of location
is as provided by their system and not others.

- The approach seems likely to get worse over time, as the address of a
business will increasingly have nothing to do with where it is hosted. The
trend here is in the wrong direction, alas, with
clouds/virtualization/etc. making it less and less likely that such an
approach will make any sense.

- The world is changing in other ways too, with an increasing number of
clients with GPS inside of them. This is likely the dominant platform for
services that need location information (who needs location info about a PC
in your house?) and thus the problem of location already has a better
solution which is widely deployed.

For these reasons, I unfortunately argue for reject. The authors do a reasonable job given the problem, and the idea seems novel (if not robust), but I just cannot believe that there is much more to do in this space.

Reviewer #5:

NSDI '11 Review #32E
Updated Wednesday 17 Nov 2010 6:16:20am PST
Paper #32: Towards Street-Level Client-Independent IP Geolocation

Overall merit: 3. Weak accept
Reviewer expertise: 3. Knowledgeable
Is the paper easy to understand?: 4. easy

===== Paper summary =====

The paper presents techniques for very precise IP2Geo location, based on the location of known passive landmarks that (i) publish an exact geographic location on their web site and (ii) are verified as hosting their own Web site rather than using a CDN or running their Web site at some other location of their company. To infer the location of another host, the proposed technique performs traceroute to the target and to nearby (passive) landmarks, based on an initial coarse-grain estimate of the target's location. Comparing the relative traceroute delays for several such landmarks, the technique identifies the "closest" landmark and associates the target's location with that.
The paper presents results for several data sets, including a small set of residential users and query traces from an online maps service.

===== Reasons to accept (1-3 lines) =====

The paper presents several interesting techniques for inferring the geographic location of an IP address. The way the physical street addresses are used to infer the locations of other hosts is quite clever. The technique seems to work reasonably well, though validation of IP2Geo techniques is notoriously difficult.

===== Reasons to reject (1-3 lines) =====

The paper is quite informal and conversational, sometimes blurring (particularly in Section 3) what steps are part of the algorithm vs. part of the validation. The paper could be more concise (particularly in the early parts of the paper) and the related work section could be much shorter, leaving more room for a deeper justification of the algorithm -- and particularly a deeper treatment of the (very interesting) claims about relative delay. Also, the accuracy of the proposed technique will likely decrease over time, as more people outsource their Web hosting to the cloud.

===== Comments for author =====

- In the introduction, this was somewhat cryptic/leading: "In an attempt to seamlessly use public and private cloud implementations for the sake of scalability and availability, hybrid architectures can leverage a highly accurate geolocation system to enable a broader spectrum of functionality and options." What do you have in mind here? Earlier in the intro, how does more accurate geo-location simplify "network management in large-scale systems"?

- "can geolocate IP addresses with a median error distance of 690 meters in an academic environment" -- an academic environment doesn't seem like the right environment for evaluating the accuracy of IP2Geo mapping solutions. These institutions may be more open, cover a tighter geographic region, be better connected to the Internet, and more likely to host their own Web servers (since they typically have a dedicated IT staff and also became connected to the 'net earlier than many institutions). (The introduction later acknowledges less accurate results for other settings -- why not mention all the results together in one place, up front?)

- "To find a subset of ZIP Codes that belong to the given region, we proceed as follows. We first determine the center of the intersection area. Then, we draw a ring centered in the intersection center with a diameter of 5 km." -- why do you take this approach? why not consider all zip codes that intersect with the region?

- the traceroute-based technique in Section 2.2 is neat

- associating a target with the seemingly-closest landmark would work well in geographic regions with a limited number of ISPs, but less well in regions with greater competition -- where two sites might be geographically close but still have very high delay.

- the explanation in the last paragraph of Section 2 is pretty informal. given that the observation about relative distance is an important point in the paper, it would be worthwhile to elaborate on why relative distance is a good indicator and perhaps to evaluate the claim with a larger data set (i.e., more than 13 landmarks).

- in section 3, why not generate a list of known CDN IP addresses? the last paragraph of 3.2.2 seems to suggest some (manual?) checking of whether an IP address corresponds to a hosting platform, but is this automated in some way?

- it would be helpful if section 3 summarized the entire algorithm in one place.
the section ultimately feels like a lot of (admittedly interesting) discussion, but in the end it is not quite clear what steps happen in what order, and which parts of section 3 are validation techniques vs. parts of the algorithm.

- section 4.1.1 refers to manually verifying the locations of PlanetLab nodes.
how? by e-mailing the administrators at each site?

- the related work discussion, while interesting, is quite long. with a shorter related work section, the paper would have more space to delve into the evaluation of the algorithm.

Reviewer #6:

NSDI '11 Review #32F
Updated Wednesday 24 Nov 2010 4:56:08am PST
Paper #32: Towards Street-Level Client-Independent IP Geolocation

Overall merit: 3. Weak accept
Reviewer expertise: 2. Some familiarity
Is the paper easy to understand?: 2. somewhat difficult

===== Paper summary =====

The paper describes a new, more accurate approach for IP geo-location. The key idea is to use external information about a large number of local landmarks to help improve localization. The paper describes how to select candidate local landmarks and further prune them to determine the likely geo-location. The approach is much better than the state of the art -- its median localization error is under a kilometer!

===== Reasons to accept (1-3 lines) =====

- Important topic

- Cool idea

- Reasonable evaluation

===== Reasons to reject (1-3 lines) =====

- Reads more like a measurement paper. System benchmarking not performed.

- Some parts are really difficult to understand.

- No experimental comparison with prior work.

===== Comments for author =====

I really like the main idea in this paper. It's very simple, yet remarkably effective.

The main concerns I have are:

- The paper does not delve into important aspects of how the geo-location service is designed. It focuses more on the heuristics, corner cases and measurements. Reads like an IMC paper. I wanted to see things like: how fast can you compute the geo-location of a new IP address? The paper's explanation that this is "somewhere in the ballpark of 8 RTTs" is highly unsatisfactory. You should show how expensive the multi-tier "multilateration" and pruning etc are. What are their relative costs?

- More importantly, what latencies do other systems like geoping or Octant impose? Are they faster at the expense of being less accurate?
There is no way of telling. It does seem like latency may be important for context-dependent online services.

In general, the paper could benefit from concrete example of services that need a highly accurate approach like this, as well as the constraints for such services in terms of how fast geo-location needs to be performed. Without this, the main idea lacks proper motivation.

- A more head-to-head comparison with Octant is needed. In particular, I would like to know if Octant, when applied to the *same* datasets as your approach would still result in a 60km median error, as you claim.

- Several parts of the paper are too cryptic. Section 2.2, where you address the issue of underestimation, section 3.3 where you explain resilience to errors and section 4.2.2 where you explain landmark density were impossible to follow.

Some other comments:

p3: "an viable" --> a viable

p3: "geolocates a collected target" --> what is a "collected target"?

p4: Don't you need to deal with router aliasing in determining last common router along a pair of paths?

p4: I did not understand either of the paragraphs starting "Routers in the ..." and "Through this process,...". Without a good understanding of this, it is very hard to evaluate the accuracy of your approach.

p5: The notion of using relative vs absolute delays appears here first. You claim that this is a crucial aspect that helps you perform better than techniques that use absolute delays. This statement needs some more explanation. One way to do this is to talk about an example prior approach that uses absolute delays (e.g., Geoping?) and show how it would perform geo-location for the scenario you have in Figure 4.

p6: The "solutions" in section 3.2 are too ad hoc, but I guess this is unavoidable given the nature of external information you have chosen to leverage.

p7: I did not follow section 3.3. In particular:

"However, since the upper bound is used to measure the delay and convert it to distance," --> what does this mean?

and right after this you say

" such underestimates can be counteracted" --> how exactly??

The final case in Figure 3(d) was interesting and I was curious how you dealt with it, but you relegated this interesting case to a technical report. This was quite disappointing.

p8: "our social networks"?? What are these? you mean your friends?

The problem with the residential data set is that you still don't know the ground truth as you have to trust user input. This is not such a big issue, probably, given the users are your friends (I hope!).

p9: properties do play the role --> a role

p10: rarely populated --> sparsely populated

p11: The discussion section seemed largely bogus.

Your explanation of latency being roughly 8 RTTs is not satisfactory as I said earlier. Even so, I don't think this delay is universally acceptable! Also, to provide context, it would be good to show how long Octant, for example, would take.

"Migrating web services..." this paragraph is bogus. Your explanation of why your approach would continue to work is totally unconvincing. Just admit that migration is a problem and move on, or come up with a more concrete explanation.

"Reliance on.." this para also seemed bogus. You say that you can collect the information directly from web sites, but for this you need to crawl the entire Web and parse/mine the information. Google would probably always do a better job of this :-)

"highly insensitive" --> drop highly.

Really awesome related work section. I loved it.