Reviewer(s)' Comments to Author:
Reviewer: 1
Comments to the Author
The idea of leveraging Google to mine end-host properties is cute.
However, it's odd that the paper does not recognize the main drawback of this approach -- there is no
characterization of the data that's available on Google. For example, if one considers traces gathered at
a particular organization, even though the information it provides in contrast with your approach is
limited, one can at least make the statement that the information learnt from the trace is true for users
in the local organization. Though biased, the nature of the bias is known. Similarly, if you consider
another area where the notion of bias frequently crops up -- Internet measurement -- a common
complaint is that results are presented based on measurements between a set of vantage points (e.g.,
PlanetLab nodes) and so are not applicable to the Internet at large. However, even in that case, one can
state that results are true for paths between PlanetLab nodes.
---
(*) We agree with the reviewer that it is hard to directly and
comprehensively characterize the information about IPs that is
available on Google. Thus, we proceed in two directions.
First, in order to analyze the accuracy of our approach, we now compare
it against a signature-based traffic classification technique in
Section III.C. The two results show a high degree of overlap, which
indirectly validates the accuracy of the information obtained via
Google.
Second, in order to further understand the related issue of staleness
of information, we have conducted an analysis of how frequently the
websites that collect and publish information about Internet endpoints
update their records. The results are presented in Section VI:
In order to further analyze this aspect, we have conducted a study on
the staleness of the information that we have used. We took the top
300 websites that generated the most hits used in the traffic
classification part (using only the tags taken from these websites, we
managed to classify 91% of the traffic). For this set we retrieved the
Last Update date. We found that we could retrieve such a date for 88%
of these websites, and that 71% of the websites analyzed had been
updated in the last month. This measure shows a fairly high freshness
of the IP-address-related information available on the Web.
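As a side note, the retrieval of Last Update dates can be automated. Below is a
minimal sketch, assuming the date is exposed via the HTTP Last-Modified header
(an assumption; some sites instead print the date in the page body, which would
require scraping):

```python
from email.utils import parsedate_to_datetime
from datetime import datetime, timedelta, timezone
from urllib.request import urlopen

def freshness(urls, days=30):
    # For each site, fetch the HTTP Last-Modified header and count how
    # many sites expose a date and how many were updated recently.
    dated, fresh = 0, 0
    cutoff = datetime.now(timezone.utc) - timedelta(days=days)
    for url in urls:
        try:
            header = urlopen(url, timeout=10).headers.get("Last-Modified")
        except OSError:
            continue
        if header:
            dated += 1
            if parsedate_to_datetime(header) > cutoff:
                fresh += 1
    # Fractions analogous to the 88% (dated) and 71% (fresh) figures.
    return dated / len(urls), fresh / len(urls)
```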
---
Whereas, with UEP, the best one can claim is that the end-host profiling information gathered is true for
"those IP addresses seen on Google" -- a pretty arbitrary set since its membership is dependent on
several factors: the webpages crawled by Google, the addresses that do make it onto webpages, the
fraction of those for which application information can be inferred, etc. (Given all of these
constraints,
you never explain why your approach is "unconstrained".) Though UEP has the limitation that it does not
provide comprehensive data across the whole Internet, I do not think that's a big problem. The bigger
problem is that no statistically rigorous claim can be made about any subset of Internet addresses.
---
(*) Regarding the concern that not all information about IP addresses
is present on Google, we agree with the reviewer and we also recognize
this in the first paragraph of Section II.
Of course, not all active endpoints appear on the web, and not all
communication leaves a public trace.
We agree with the reviewer that it is hard to provide any
statistically rigorous claim about a subset of Internet addresses
available on Google, for the reasons presented by the reviewer. This
is one of the main reasons why we were unable to fully match the
results from network traces in Section III.B. We change the text in
Section III.B to address this issue explicitly. Still, despite the
necessary bias, it is important to note that we were able to
accurately detect the most popular applications in different
scenarios. Finally, we agree with the reviewer that the largest gain
of our approach is in complementing the existing packet- and flow-level
classification schemes in Section III.C (details below).
Our approach is unconstrained in the sense that it looks for and
uses external information that could help in predicting application
trends and classifying traffic. For example, packet- and flow-level
approaches such as BLINC are constrained in the sense that they only
use information that they observe in the traffic flows. This is
explained at the beginning of Section III.C, and we also change the
text in the Introduction to clarify this issue up front.
---
But, even with the drawback that information yielded by UEP cannot be characterized, the UEP
approach does add significant value in adding information to packet- and flow-level traces. I would
suggest that you consider writing the paper with a focus on this angle. Instead of contrasting the
value-add from UEP with that from traces, you should present the two approaches as complementary to
each other (which you do, but more as a secondary claim at the moment).
---
(*) In the text, we indeed emphasize the fact that our methodology
(Section II) uses information beyond that available in network traces
because it is the key distinction between our and other classification
approaches. At the same time, we agree with the reviewer and we also
suggested that our approach is complementary to other approaches. We
address this issue now in the Introduction by acknowledging that UEP
is complementary to the existing traffic classification schemes. Also,
we make this clear in Section III.C:
Note that we regard our unconstrained approach as complementary to
other traffic classification approaches. We view the web-crawling part
of UEP as a first and inexpensive pass that gathers services
information that are used to classify traffic. This can be followed by
other more complex techniques which could benefit from the UEP-based
information.
Regarding the presentation of UEP's applications in Section III, the
idea was not to favor one application over another, but rather to show
different UEP applications as the amount of information from traces
changes, i.e., no traces, packet-level traces, sampled packet-level
traces.
---
Questions/comments about the approach and results:
-- Why only search for the dotted decimal representation of the IP address and not the DNS name of the
address?
---
(*) We have addressed this in Section V.B, where we show that only a
small fraction of the IP addresses present in the trace actually
resolve to DNS names; hence, the dotted-decimal approach is justified.
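One way to reproduce this check (a sketch; the IP list is hypothetical):

```python
import socket

def fraction_with_dns_names(ips):
    # Attempt a reverse DNS lookup for each trace IP and return the
    # fraction that actually resolves to a name.
    resolved = 0
    for ip in ips:
        try:
            socket.gethostbyaddr(ip)
            resolved += 1
        except OSError:
            pass
    return resolved / len(ips)

# Example: fraction_with_dns_names(["192.0.2.1", "8.8.8.8"])
```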
---
-- Many of the results are anecdotal, e.g., Google is the top website, orkut is more popular than
wikipedia, etc. Is this because UEP is good at only identifying the most popular applications and not at
ranking the popularity of all applications?
---
(*) We provide a more comprehensive set of results, beyond anecdotal
ones, in reference [34], which we omitted in this paper due to space
constraints. We change the text in Section III.B and point the reader
to the above reference. Also, we did not mean to imply in the paper
that there should be a perfect correlation between the results obtained
using Google and by analyzing network traces. We stated that if an
application is popular and this shows in the network trace, it will
also show on Google. We change the text in Section III.B to clarify
this point. Moreover, we provide additional explanations in Section
III.B about the causes of this effect, as suggested by the reviewer.
---
-- When UEP discovers that a particular IP address is associated with a particular application, it is
most
probably true. However, when UEP discovers k1 IP addresses associated with App1 and k2 addresses
associated with App2, it's not right to infer that App1 is more popular than App2 if k1 > k2. This goes
back to UEP's drawback that there is no characterization of the IP addresses it covers.
---
(*) We agree with the reviewer: we cannot categorically draw the above
conclusion, and we have included the above comment in Section III.B.
Nevertheless, we were able to accurately detect the topmost
application. Our claim regarding this can be found in Section III.B:
Apparently, when an application is strongly present in a given area,
this result shows up consistently in both network traces and on the
web.
This argument is supported by our analysis of top applications in
the mentioned regions.
---
-- How do you discover IP addresses of clients participating in chat applications? For example, in one
dataset you claim that MSN is the most popular chat application. I assume MSN does not have its chat
logs public (at least I hope so!), and I do not see how else the addresses of chat clients could make it
onto the web.
---
(*) Information about clients participating in chat applications comes
from forums that record the IP address. These forums also ask for
(Yahoo, MSN) Messenger IDs in order to display the online status of
the forum user. When searching for the IP address on Google, one also
finds this related information. We now explain this in Section II.B.
---
-- With regard to the evaluation in III-A, I find it odd that the window size of 17 was determined so as
to
"maximize the overlap between Google- and trace-based active IPs". A more reasonable approach
would be to determine the intersection in /24s covered by the sets of IPs from the two sources.
---
(*) The window range of 17 (roughly corresponding to /28 in the slash
notation) is an empirical result. It was obtained by varying the
window size and observing the size of the overlap between Google- and
trace-based active IPs. Considering the whole range instead of
intersections in /24s gives us a more complete overview of the network
range that we analyze. By considering separate /24s, we could
find some such networks in which the overlap is bigger and others in
which the overlap is smaller. By applying our method over the whole
range instead of separate /24s, we get an average across the
above-mentioned smaller networks.
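To make the procedure concrete, here is a sketch of the empirical search
(GOOGLE_IPS and TRACE_IPS are hypothetical inputs; real data would come from the
Google hit lists and the trace):

```python
import ipaddress

def overlap_fraction(google_ips, trace_ips, window):
    # Mark a window of `window` consecutive addresses around each
    # Google-discovered IP as probably active, then measure what
    # fraction of trace-based active IPs falls inside those windows.
    to_int = lambda ip: int(ipaddress.ip_address(ip))
    half = window // 2
    active = set()
    for ip in map(to_int, google_ips):
        active.update(range(ip - half, ip + half + 1))
    trace = [to_int(ip) for ip in trace_ips]
    return sum(ip in active for ip in trace) / len(trace)

# Vary the window size and keep the one that maximizes the overlap
# (17 in our case):
# best = max(range(1, 33, 2),
#            key=lambda w: overlap_fraction(GOOGLE_IPS, TRACE_IPS, w))
```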
---
-- The graphs in Fig. 6 and 7 need better explanation. You never explain what "Max", "Avg", and "Min"
in Fig. 7 represent. You need to have better figure captions than "Traffic locality". A figure's caption
ought to explain the experiment that produced the figure.
---
(*) We have added in the captions a more comprehensive description of
the experiments carried out to produce the figures.
---
Nits:
-- At a number of places you have a comma before an "and" joining two phrases. A comma before "and"
is required only when the "and" is joining two sentences, ref. Strunk and White.
---
(*) We have corrected this.
---
-- Too much negative vspace above Tables V and VI.
---
(*) We have removed the vspace above these tables.
---
-- In first para of section IV, it should be "explained in Section III-C".
---
(*) We have corrected this.
---
Reviewer: 2
Comments to the Author
This paper describes an interesting approach to network traffic classification in which a search engine
is
used to retrieve information on the web about endpoint behavior and hence infer traffic patterns and
trends within the network. The work addresses an important problem that has received a great deal of
attention in the literature. It's a cute idea that offers a pragmatic solution when traces are not
available.
However the evaluation presented in this paper is unconvincing. The question of timescale is critical:
over what time periods does an endpoint need to be "active" to register a presence in Google's index?
Presumably different types of activity will, in general, take different amounts of time to appear. And
since the information from Google may be historical, effectiveness of UEP surely depends on keeping
track of trends. For this approach to have credibility, we need to see analysis of how traffic trends are
manifested in Google (preferably against ground truth from traces), and over what timescales.
---
(*) We agree with the reviewer that the issue of time scales is an
important one. A time-scale-related issue we were able to address in
this version of the paper is understanding how up to date the web
pages that provide most of the information used in our method are. We
present these results in Section VI. In particular, we took the top
300 websites that generated the most hits used in the traffic
classification part (using only the tags taken from these websites, we
managed to classify 91% of the traffic). For this set we retrieved the
Last Update date. We found that we could retrieve such a date for 88%
of these websites, and that 71% of the websites analyzed had been
updated in the last month.
Given that all these websites are already indexed and crawled by
Google at shorter-than-a-month timescales (typically days up to a
week), it follows that IP-related trend tracking is bounded by these
two timescales: lower-bounded by several days to a week (depending on
Google's crawling agility) and upper-bounded by a month in most cases
(depending on the update frequency for websites that collect IP-level
information).
---
This point about timescales is alluded to in Section III.B, which observes that correlation with
operational traces is somewhat arbitrary due to the fact that the Google sample is averaged over an
unknown amount of time, while the trace covers a single, small, time interval. Now what if my trace
only covered 5 mins - how much correlation would I see? How about 5 days? 5 weeks? It is impossible
to draw any conclusions from these results without more investigation into what they mean.
---
(*) There are three issues here. First, as discussed above, when no
packet traces are available (Section III.B), our approach can track
application trends over longer time scales, i.e., from several days up
to a month.
Second, we compared our results to relatively short-time-scale
network-level traces not because we expected to see a full match
between the two results, but simply because we needed to compare our
results to some version of ground truth. Despite the mismatch in time
scales, we were able to accurately detect the most popular application
in these scenarios. However, this does not mean that our approach is
capable of tracking changes at such short time scales; this happened
because the most popular application stays popular over longer time
scales.
Third, we agree that it would be interesting to perform the
comparisons over different timescales, as proposed by the reviewer.
However, this would require obtaining both traces and Google-based
information over the same time intervals for a longer period, which we
cannot do simply because we do not have access to such network traces.
---
- Since there is no technical reason for favoring one search engine over another, the use of "Google"
throughout feels like product-placement in a movie: it's either mildly distracting or quite annoying
depending on your point of view. And for that matter, have you considered combining (or comparing)
results from different search engines?
---
(*) We have no preference for any particular search engine, and we did
not use Google in order to promote it. We simply reported the fact that
we used this particular search engine. We do plan to use other search
engines, but this is beyond the scope of the current work.
---
- Why use BLINC as the only comparison point? The results in Section III.C should at least also include a
comparison with conventional port-number classification.
---
(*) We now compare against a signature-based traffic classification
approach too and present this in Section III.C.
---
- In what way is BLINC "constrained" and your approach "unconstrained"? This terminology does not
mean anything to me.
---
(*) Our approach is unconstrained in the sense that it looks for and
uses external information that could help in classifying traffic.
Approaches such as BLINC are constrained in the sense that they only
use information that they observe in the traffic flows. We change the
text in the Introduction and at the beginning of Section III.C to
clarify this.
---
- In Section III.A why not present results for all the address ranges for which you have ground truth? In
particular, won't comparing against the one week long flow-level traces for N. America give much more
informative results?
---
(*) We indeed compare against the one week flow-level traces for N.
America and we find that the overlap is between 79% and 84% for the
five sub-networks. We change the text in Section III.A to address
this.
---
- Section III.B must include the numbers as well as more description of the experiments. For example,
what proportion of the traffic is misclassified? How do you define "correlation"? What are the
grounds for calling the correlation "remarkable" and "strong"?
---
(*) We do not show more detailed results in Section III.B due to space
constraints. We now change the text in Section III.B and point the
reader to reference [34], which has more detailed results.
We address the traffic classification problem (accuracy) in Section
III.C. In Section III.B, we aimed only at identifying regional trends,
and we validated them with the topmost applications, which appear both
in the traces and in the Google results. We did not claim a one-to-one
correspondence between the results obtained by Google and those seen
in traces, but only that if an application is popular and this shows in
the network trace, it will also show on Google. Also, we adjust the
terms "remarkable" and "strong" to "high correlation", and we define
what we mean by high correlation:
By high correlation we mean that when an application is strongly
present in a given area, this result shows up consistently in both
network traces and on the web, and this is shown by the topmost
applications that we have analyzed.
---
- Phrases such as "what do people do on the Internet" and "determine what information does the
website contain" are awkward: they are structured like questions and therefore should be terminated
with a question mark. I would prefer "what people do on the Internet", or even better "how people use
the Internet", and "determine what information is contained on a website".
---
(*) We have changed the phrases to the ones suggested by the reviewer.
---
- The phrase "literally fall apart" is slang (and doesn't make sense in this context either - structures
like
buildings might fall apart, a software program will fail to function correctly but won't "fall apart" in
any
literal sense). Another slang phrase is "out of nowhere" - please avoid these vacuous and imprecise
phrases.
---
(*) We have toned down the above phrases. Instead of "fall apart" we
now write "do not perform as well", and instead of "out of nowhere" we
write "unexpected".
---
- The phrase "it can help dramatically outperform" is somewhat evasive (p.2, 2nd para). Does the
approach outperform other traffic classification algorithms or not?
---
(*) We now compare against a signature-based traffic classification
technique too and present this in Section III.C. We have also toned
down the above phrasing to "outperform" and mention BLINC as the tool
we outperform.
---
- In the footnote on page 2 you "find and use the top 60 keywords". How did you choose this number?
Presumably it was a manual process, in which case how much effort was involved in examining
potential keywords and deciding if they could be "meaningfully used for endpoint classification"? Please
give a precise explanation of "meaningfully used".
---
(*) Yes, it was indeed a manual process. It lasted a few hours, in
which we took all the hits that were generated by Google. We then
automatically counted the number of times a certain keyword appeared
and ranked the keywords accordingly. Next, we looked at the keyword
ranking, removed the trivial words, took each remaining word, looked
at where it occurred in the hits, and devised rules to capture those
exact websites. For example, we looked at "counter strike" and saw
that it appears on a website that lists Counter Strike servers. By
"meaningfully used" we mean that the keyword found implies an
application or application class associated with network activity.
We change the text in Section II.A to address this.
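A sketch of the automated counting step (the hit texts and stop-word list are
illustrative placeholders):

```python
import re
from collections import Counter

def rank_keywords(hit_texts, stop_words):
    # Count how often each word appears across the Google hit texts,
    # skipping trivial (stop) words, and rank by frequency. Devising
    # per-website rules from the ranking remained a manual step.
    counts = Counter()
    for text in hit_texts:
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in stop_words:
                counts[word] += 1
    return counts.most_common()

# rank_keywords(["Counter Strike server list at ..."], {"at", "list"})
```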
---
- Change the first sentence of III.D to "Packet-level traces are not always available from the network."
---
(*) We have changed to the phrase as suggested by the reviewer.
---
- In general, the text in the latter sections of the paper (III.D and IV) is rough and needs an editing
pass.
In the earlier sections the writing is good.
---
(*) We made an effort to improve the text.
---
Reviewer: 3
Comments to the Author
Summary of Paper: The paper presents a new method to profile end points on the Internet. The paper
starts with the observation that most of the information to profile many of the internet endpoints is on
the Internet itself. This information can therefore be used. The paper then presents primarily a tool to
validate this idea - by collection measurements from 4 networks, validating the efficacy of the technique
and then using this to shed some insight into global endpoint behavior by clustering endpoints and by
examining traffic locality properties.
Main Comments:
This reviewer found the idea novel and well executed, and the results well presented. Some major
quibbles which need to be addressed:
a) Line 53, pg 3, col 2: A threshold of 2 is used to gain confidence that domains contain info which
can be used to profile endpoints. Why 2? Why not 3 or 4? Similarly:
---
(*) We have chosen this threshold in order to filter out websites that
carry information about a single IP address only. At the same time,
this approach maximizes the amount of traffic that we can classify
while filtering out the above sites. We make this clear now in the
text.
---
Line 33, pg 4, col 2: The confidence threshold for hit-text-based tagging is 1. Why? Some discussion is
warranted here.
---
(*) We have chosen this threshold in order to maximize the amount of
traffic that we can manage to classify. We make this clear now in the
text.
---
b) Potential pitfalls/weaknesses: One can think of ways to bias the profiling tool. While the very
important issue of accuracy of information is mentioned, it is glossed over. I would be more
comfortable with more information about this.
---
(*) We have now addressed the question of accuracy indirectly by
comparing against a signature-based traffic classification tool and
present this in Section III.C. A related issue is how up to date the
information about IP addresses on the web is. We have addressed this
in Section VI.
---
c) Suggestion: On the traffic locality set of results, it may be interesting to record how much of the
P2P
access patterns are local.
---
(*) We have conducted the above study, and we are thankful to the
reviewer for suggesting it. The results are shown in Section IV.B and
also below:
When considering p2p traffic locality, we notice a slight increase in
the intra-AS traffic (compared to the average): 23% for the Asian
ISP, 27% for the S. American ISP, and 14% for the N. American ISP.
However, notice that a large volume of p2p traffic, i.e., 73% to 86%,
is inter-AS, which creates well-known problems for ISPs.
---
Some aesthetic issues:
Figures: Please have all figures at the top. This includes Figures 1, 3, 4, 5, and 8, as well as the tables.
---
(*) We have moved the figures and tables to the top.
---
Figure 2 is unreadable - please make it bigger.
---
(*) We have increased the figure size.
---
While Fig 4,5 convey useful information, I would rather see the authors devote more real estate to the
question of accuracy.
---
(*) We have addressed the question of accuracy by comparing against a
signature-based traffic classification tool and present this in
Section III.C.
---
Line 60, pg 2, col 1 : 'do stay'-> 'does stay'
---
(*) We have changed this.
---
Reviewer: 4
Comments to the Author
The paper presents a methodology to profile Internet endpoints, namely IP addresses, through search
engines and specifically Google. The authors argue that this endpoint profiling is successful as
information about IP addresses exists publicly in the web providing easy characterization of the nature
of the applications this IP engages in.
Googling IPs is a potentially interesting idea and this idea is the main contribution of the paper.
However, the whole paper appears as a "cool idea" in search of a real application, and presents several
shortcomings and misconceptions.
First, the authors argue throughout the paper that information about IP addresses should be publicly
available in the WWW. While this might be true for certain applications such as web services or
streaming services, for other applications, such a statement is certainly exaggerated and not true. The
obvious case here is peer-to-peer applications, which while they might expose a limited number of
server IPs (e.g., emule servers or bittorrent trackers), all the peer IPs that contribute the vast
majority of
the traffic will never appear in public logs. If this were true, RIAA and other agencies would not have
to
go the extra mile of monitoring and eavesdropping on the p2p networks to identify p2p users. The same
holds for gaming traffic or even VoIP (e.g., Skype). While some server IPs will be publicly available,
most of the actual gaming traffic is p2p, thus hiding peer IPs. This is indeed evident in the paper's
results, where the authors need to employ crawling to identify extra p2p client IPs. As is widely known,
p2p IPs are very dynamic, which casts even more doubt on the proposed idea for certain types of
applications.
---
(*) There are several issues here. First, we define unconstrained
endpoint profiling (UEP) as a methodology that uses external
information about endpoints, beyond that available from network
traces, to predict application trends or complement the existing
traffic classification schemes. As such, in this paper we focus on the
Web-based UEP approach and explore the potential of a P2P UEP approach
(Section V). Indeed, these UEP approaches are fully compatible with
each other. We originally explained this at the beginning of Section
V. We now change the text in the Introduction to better explain this.
Second, regarding the above comments that not all applications are
available on WWW, we agree with this. We never stated that all
information about IP addresses should be available on WWW and
therefore on Google. We have originally explained this at the
beginning of Section II:
Of course, not all active endpoints appear on the web, and not all
communication leaves a public trace.
Third, regarding the reviewer's discussion about P2P systems, we agree
that not all P2P information stays publicly available. We clarify
this now in Section II.B:
Consequently, an IP's involvement in p2p applications ... becomes
visible when contacting the first point of entry into the system.
Further, this first point of entry is known and typically available on
the web. Example websites are emule-project.net, edonkey2000.cn, or
cache.vagaa.com, which lists torrent nodes.
Fourth, regarding the comment that the p2p UEP approach from Section V
contradicts the WWW UEP approach explained in Sections II, III, and
IV, we again emphasize that the two approaches are complementary, as
we explained above.
Fifth, regarding the reviewer's comment about RIAA, such agencies
cannot use our approach to detect p2p users that share copyrighted
files. This is because affiliation with a given p2p system, e.g.,
BitTorrent, is not illegal per se. Indeed, not all files on p2p
systems are illegal. Hence, the association of an IP with p2p systems
is insufficient for RIAA and other agencies, and hence they must go
the extra mile and directly identify peers that download copyrighted
files.
---
Second, the authors base a significant part of their motivation on the fact that the proposed methodology
does not require packet traces. Yet, when comparing with classification techniques, they do have to use
several packet traces negating their argument. In fact, the authors overstate several of their arguments
especially when trying to refute past work.
---
(*) Our methodology (Section II) is independent of network traces.
However, it can be used in multiple scenarios: when network traces are
not available (Sections III.A and III.B) and when they are available
(Sections III.C and III.D).
Our understanding is that the reviewer's above comment refers to
one of the applications of UEP (Section III.B) and not to a part of
the methodology (Section II). In Section III.B, our methodology can
indeed be used without network traces to identify application trends.
There, we use network-level traces from the given networks as the
ground truth in order to validate our approach. Thus, we are not
negating our arguments, but rather validating one of the applications
of our approach.
---
Third, even the idea of comparing the proposed methodology with traffic classification techniques is
misguided. UEP is more suitable to uncover publicly available information to identify general trends
about applications, and as such cannot be applied for traffic classification purposes and is not directly
comparable to traffic classification techniques. Traffic classification techniques attempt to tag flows
with a particular application, which is a level of detail UEP cannot achieve. UEP tries to characterize
endpoints
with a general application, rather than uncover all applications for a specific endpoint by examining all
its traffic (for example traffic classification techniques can tag an endpoint with a multitude of
applications as it would be expected from a traffic classification technique).
---
(*) We disagree. UEP (Section II) has several applications (Section
III). One of the applications (Section III.B) is identifying trends
in the absence of packet traces by considering publicly available
information about that specific network range. Another application
(Section III.C) is a "classical" traffic classification application
when packet traces are available. In both scenarios (Sections III.B
and III.C), our approach can uncover multiple applications associated
with a given IP, not a general application only. Next, in the case of
the application in Section III.C, our approach attempts to tag each
and every observed flow in a trace, and as such it is fully comparable
to other traffic classification approaches. Indeed, the traffic
classification techniques that the reviewer implies use a variety of
techniques, such as port numbers, packet content, packet sizes,
inter-arrival times, etc., to achieve the exact same goal: tagging a
flow with a specific application type. The key difference between
these classification approaches and our approach from Section III.C is
that we use external information, beyond network traces only, to
classify traffic.
---
Fourth, the evaluation of the paper is very weak, not examining several aspects that affect the method's
performance (details below). Further, the authors do not have ground truth, and thus all comparisons
and results presented could be false positives/negatives or indeed stale information in the web.
---
(*) We have addressed the question of accuracy by comparing our
results with a signature-based traffic classification approach (which
examines packet payload) in Section III.C. The two results show a high
degree of overlap, which indirectly validates the accuracy of the
information obtained via Google. We address the issue of stale
information on the web in Section VI.
---
Finally, the methodology presented has limited novelty (besides the main idea) and its main difficulty is
parsing HTTP content. Retrieving relevant context from web pages has been extensively studied for
example in data mining.
---
(*) We agree with the reviewer that extracting relevant context from
web pages has been extensively studied in other areas, such as data
mining. Still, we claim novelty in two respects. First, to the best of
our knowledge, we are the first to mine data about IP addresses from
the web. We address this issue explicitly in Section VI (vertical
search engines). Second, we claim that our contribution is in using
this information in a novel way for the four proposed applications
(Section III).
---
Detailed comments:
Introduction: The authors overstate their arguments throughout their introduction and abstract. This is
especially true when stating arguments that are not "common knowledge" or not shown in previous
work (e.g., "IP address must be publicly available", "endpoint becomes publicly visible", "packet-level
classification tools are inapplicable"). While these arguments might hold in some cases, generalizing and
making absolute comments without evidence or citations does not help the authors make their case.
---
(*) We agree with the reviewer, and we have toned down the above
statements. Instead of "packet-level classification tools are
inapplicable" we now write "some of the packet-level classification
tools do not perform as well"; instead of "endpoint becomes publicly
visible" we write "association between some of the endpoints and such
an application becomes publicly visible"; and instead of "IP address
must be publicly available" we write "which IP address one must
contact in order to proceed is publicly available (not necessarily on
Google)".
---
Section II: There are several parameters of the methodology presented which simply appear as random
choices without any intuition offered. For example, why 60 keywords, why a threshold of 2, what is the
sample "seed set"? Are these parameters validated and do the authors examine their effect in the
results?
---
(*) Selecting the 60 keywords was a manual process. We explain this in
detail in a response to Reviewer 2 above. We also change the text in
Section II.A. We have chosen the threshold of 2 in order to filter out
websites that carry information about a single IP address only. At the
same time, this approach maximizes the amount of traffic that we can
classify while filtering out the above sites. We change the text in
Section II.A to address this. Regarding the sample "seed set", we
change the text at the beginning of the first paragraph of Section
III.B to clarify this.
---
Similarly, is the assumption that URLs in the same domain contain similar content true? Is it validated?
(we could imagine web pages hosting streaming or online radio stations in same domain IPs.)
---
(*) In order to avoid this, we also parse out the corresponding
resource identifier in front of the IP address, e.g., mms://,
rtsp://, etc.
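A minimal sketch of this parsing step (the regular expression is illustrative,
not the exact one we use):

```python
import re

# An optional resource identifier (e.g., mms://, rtsp://) followed by
# a dotted-decimal IP address.
ENDPOINT = re.compile(r"(?:([a-z][a-z0-9+.-]*)://)?(\d{1,3}(?:\.\d{1,3}){3})")

def extract_endpoints(text):
    # Returns (scheme, ip) pairs; scheme is None when absent, so a
    # streaming URL and a bare IP on the same page stay distinct.
    return [(m.group(1), m.group(2)) for m in ENDPOINT.finditer(text)]

# extract_endpoints("live feed at rtsp://192.0.2.5/cam1")
# -> [("rtsp", "192.0.2.5")]
```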
---
As expected, most of the tags in Table 1 relate to web-based services or servers. This relates to the
comment in the previous paragraph that certain applications cannot be identified using UEP. Further,
what does a proxy server tag mean? What is the application associated with it?
---
(*) We have discussed this issue above. We repeat the key points here.
First, while most of the tags in Table 1 necessarily relate to
web-based services, the table also contains non-web-based tags such as
gaming or p2p. Second, for applications that cannot be accurately
identified using the Web-based UEP approach, we apply complementary
p2p-based UEP approaches (Section V.A).
A proxy server tag identifies an anonymous web proxy server, and the
application associated with it is browsing.
---
In page 5 (p2p communications): The arguments here are false (or at least not proven to be true by the
authors). A BitTorrent peer IP or an eMule peer IP very rarely (if ever) becomes public in the WWW.
---
(*) We did not intend to suggest that each and every p2p communication
is visible on the WWW and captured on Google. We just wanted to state
that when one sees an IP address communicating with a node which is
known as a p2p server or a torrent tracker, we can infer its
affiliation with that specific system. We have changed the text in
Section II.B accordingly:
Consequently, an IP's involvement in p2p applications such as eMule,
gnutella, edonkey, kazaa, torrents, p2p streaming software, etc.,
becomes visible when contacting the first point of entry into the
system. Further, this first point of entry is known and typically
available on the web. Example websites are emule-project.net,
edonkey2000.cn, or cache.vagaa.com, which lists torrent nodes.
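A sketch of this inference (the entry-point addresses below are placeholders,
not real servers):

```python
# Known p2p entry points gathered from the web.
P2P_ENTRY_POINTS = {
    "203.0.113.7": "eMule server",
    "198.51.100.2": "torrent tracker",
}

def tag_p2p_affiliation(flows):
    # flows: iterable of (src_ip, dst_ip) pairs from a trace. A source
    # contacting a known entry point is tagged as affiliated with that
    # p2p system, even though its own IP never appears on the web.
    tags = {}
    for src, dst in flows:
        if dst in P2P_ENTRY_POINTS:
            tags.setdefault(src, set()).add(P2P_ENTRY_POINTS[dst])
    return tags
```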
---
Section III: Sections III.A and III.B particularly stress why UEP cannot be applied to traffic
classification
as its main functionality is characterizing the "publicly" available information for an IP to uncover
only
part of the applications this IP engages in.
---
(*) We disagree with the reviewer's comment that Web-based UEP cannot
be applied to traffic classification. In particular, Sections III.A
and III.B are two separate applications of UEP. The first one is
identifying probable active ranges of IP addresses, and the second one
is identifying the application usage trends inside a network range. By
using information about the services that UEP manages to identify, we
are able to perform classical traffic classification too, and this is
presented in Section III.C. For this application (III.C), publicly
available information from the Web (see Figure 4) is essential in
achieving highly accurate traffic classification results (see Tables V
and VI).
---
As in the previous section, there are several numbers that appear out of nowhere with no evaluation or
validation provided (for example, selecting 3,659 active IP addresses or 2,120 IPs from the trace - how?).
---
(*) We believe that the word "extract" is the source of the confusion.
In particular, we did not select 3,659 active IP addresses, etc.;
these are simply empirical results. We have corrected this in Section
III.A. In addition, we explained the process better.
We determine 3,659 active IPs using Google (IP addresses which
generate hits on Google). At the same time, we determine 2,120 IPs
from the trace (the IP addresses belonging to that specific range
which are observed sending traffic). The overlap is 593 addresses, or
28% (593/2120).
---
How does the window of 17 influence the results presented (without the window, only 28% correlation
between the trace and Google is observed!)?
---
(*) The window of 17 is an empirical result that maximizes the overlap
between the IPs available in traces and those from the Google-based
approach. This is explained in footnote 4 in Section III.A. Yes,
without the window approach, the overlap is only 28%. This is not a
surprise given the different timescales at which these two datasets
are obtained. We have discussed the timescale issue in Section III.B.
---
Why should all neighboring IP addresses be active?
---
(*) We provide a reason for this in Section III.A: Indeed, to ease
network management, network administrators typically assign contiguous
IP addresses to hosts in the same network.
---
What happens with a window of 16 or 10?
---
(*) As we explained above, a window of 17 is an empirical result. The
windows of sizes 16 and 10 both give a lower overlap.
---
Even with the current window choice, almost one out of three IP addresses appears not to be active in both
the trace and Google.
---
(*) There are two potential reasons why this is the case. First, these
IP addresses might simply not be active. As we explained in the text,
the key point of this approach is not to accurately infer if a given
IP address is active or not, but rather to hint at the highly probable
active IP ranges and ease methodologies that require such information
(e.g., [27]). Second, these IP addresses might be active, but they
might not be captured either by Google (Section II, paragraph 1) or
in traces (because they are relatively short-lived; see Section
III.B).
---
Why is only one /17 examined here and not the others?
---
(*) We indeed compare against other networks as well, and we find that
the overlap is even larger, between 79% and 84%. We change the text in
Section III.A to address this.
---
When correlating the results with operational traces, the authors never provide ground truth and
present no concrete numbers, rather only vague statements, e.g., "Gnutella is the leading p2p system".
This argument also makes this reviewer very skeptical since BitTorrent and Emule are widely known to
be the most popular p2p applications. This probably further highlights the problems with p2p
application detection of the methodology. How is the "behavior inevitably reflected in the traces" with
no ground truth?
---
(*) There are several issues here. First, we omit the concrete numbers
here due to space constraints. We have presented the concrete numbers
in reference [34]. We change the text in Section III.B to address
this.
Second, we agree with the reviewer that BitTorrent and Emule are
globally the most popular p2p applications. However, different
applications are popular in different parts of the world, and the
strength of our approach lies in its ability to detect such trends
(Section III.B). As such, both our Google-based approach and network
traces from the given ISP in S. America imply that Gnutella is the
leading p2p system in that particular ISP (the comment in the paper
applies to S. America's ISP only, as implied in the paragraph prior to
this one).
Third, we obtain ground truth by applying a signature-based traffic
classification technique on the network traces. We change the text in
Section III.B to address this explicitly.
---
In summary, this reviewer has no clue as to whether the results in Section III.B are to be trusted or are
just what the methodology presents, which is probably a significantly "biased" view for all the
aforementioned reasons.
---
(*) We hope that we have addressed the reviewer's concerns regarding
Section III.B. We have additionally changed this section in response
to other reviewers' comments.
---
Section III.C presents several misconceptions of the authors and is falsely used to stress the UEP
success.
It is known that methodologies like BLINC need a sufficient traffic mix (e.g., traffic volume, IP
addresses
present) in order to build the associated detection heuristics or statistics used by the various
methodologies. The authors only compare against limited-scale (/17, /18) networks for which they also
provide no traffic statistics. As such the reader is unable to judge if even the comparison is fair or if
simply BLINC fails due to the limited traffic mix. This is indeed obvious in table VI where the authors
show that there are only 20 endpoints for gaming traffic or 160 for streaming! Of course with such a
small sample of IP addresses BLINC is bound to fail.
---
(*) There are several issues here. First, we agree with the reviewer
that methodologies like BLINC need a sufficient traffic mix in order
to build the associated detection heuristics. We now cite a newer
study ([25]) which gives more insight into the performance of BLINC.
We also note, as mentioned in that reference, that BLINC might have
given mixed results due to the place where the traces were collected.
We change the text in Section III.C to address this explicitly:
The reasons for the performance obtained by using BLINC can be
found in [25]. The authors evaluate BLINC on several traces with mixed
results and conclude that BLINC is not recommended for backbone links
but mostly for border links of a single-homed edge network. Moreover,
it is known that methodologies like BLINC need a sufficient traffic
mix (e.g., traffic volume, IP addresses present) in order to build the
associated detection heuristics or statistics used by the various
methodologies. Because our network traces are short-lived, they are
necessarily unfriendly to BLINC-like approaches.
Second, with respect to the above issue, we have toned down the
remarks about BLINC as suggested by the reviewer. In particular, we
believe that UEP and BLINC are complementary approaches, and we
emphasize this in the introduction and in Section III.C.
Third, in light of the above discussion, we believe that it is still
very useful to have tools like Web-based UEP that are capable of
accurately classifying traffic using short-lived traces, limited-scale
networks, small numbers of endpoints, etc., where approaches such as
BLINC are bound to fail, as implied by the reviewer. This is another
reason why we believe that BLINC and UEP are complementary approaches.
[25] H. Kim, K. Claffy, M. Fomenkov, D. Barman, M. Faloutsos, and K.
Lee. Internet Traffic Classification Demystified: Myths, Caveats, and
the Best Practices. In ACM CoNEXT 2008, Madrid, Spain, December 2008.
---
The traces examined are only 2-hours in duration. Adaptive methodologies like BLINC go through
learning phases during which they accumulate samples (i.e., comparison during the first intervals
possibly the first hour of the trace is not fair). Similarly, no ground truth is provided here and thus
no
results can be evaluated by the reader. This section further shows problems with p2p-like applications
as it would be expected (see above).
---
(*) First, the BLINC performance issue has been addressed above.
Second, regarding the ground truth issue and the question of accuracy,
we now address it by comparing UEP's results against a signature-based
traffic classification technique and present this in Section III.C.
Third, regarding p2p-like applications, we addressed this issue above.
---
Section III.D suffers similar problems since the authors sample the already limited traffic mix
available.
---
(*) There are two issues here. First, the main point of Section III.D
is that UEP's performance is independent of traffic sampling, which is
a very important feature because backbone networks can typically
provide only sampled traces due to scalability issues. Second, we
believe that the comparison with BLINC in III.D is justified. This is
because BLINC does not share the same property, since it depends on
the associated detection heuristics or statistics used by the various
methodologies, which are necessarily distorted when traffic sampling
is used.
---
Sections IV and V present general results that offer limited contributions to the rest of the paper. In
fact,
it appears that the paper is disconnected here. The authors could have spent more space evaluating
their methodology.
---
(*) Sections IV and V are connected to the rest of the paper in the
following respects: Section IV applies the proposed methodology to
reveal traffic characteristics in different parts of the world. To the
best of our knowledge, this is the first attempt to compare results
from different world regions in this respect. Section V provides
non-Web-based UEP approaches, such as P2P- and reverse-DNS-based UEP.
As we explained in the Introduction and at the beginning of Section V,
UEP means using external information beyond network traces to classify
endpoints; hence the connection with the Web-based UEP.
---
Even in these sections the results presented are hard to trust. For example, "browsing, chat and mail
seems to be the most common behavior globally". This is definitely not true (how about p2p or
streaming?), but rather reflects the services that UEP can successfully identify.
---
(*) We presented the results in terms of detected flows, not bytes. We
agree with the reviewer that p2p or streaming accounts for more of the
traffic in terms of bytes, but not flows.
---
Section IV.B is confusing since it characterizes traffic locality. How can UEP examine this, since it
does
not examine traffic in the first place?
---
(*) Here, we use the approach of Section III.B to characterize traffic
locality. Instead of traffic from the given region, we considered
endpoint proofs-of-access, e.g., accesses to a certain forum obtained
from Google. Next, for each such destination (e.g., a forum), we
resolve the AS and the country membership and compute the AS-level
distance to the given destination. Finally, we compute the percentage
of accesses that goes to a given AS and country. We have included a
better description of the experiments carried out in the captions of
Figures 6 and 7.
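A rough sketch of the per-destination aggregation (resolve is a placeholder for
an AS/country lookup, e.g., against BGP tables and a geolocation database):

```python
from collections import Counter

def locality_breakdown(access_ips, resolve):
    # access_ips: destination IPs observed as proofs-of-access (e.g.,
    # forums found via Google); resolve maps an IP to (AS, country).
    by_as, by_country = Counter(), Counter()
    for ip in access_ips:
        asn, country = resolve(ip)
        by_as[asn] += 1
        by_country[country] += 1
    total = len(access_ips)
    # Percentage of accesses going to each AS and each country.
    return ({a: 100.0 * n / total for a, n in by_as.items()},
            {c: 100.0 * n / total for c, n in by_country.items()})
```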
---
In section V, the authors seem to acknowledge that some applications cannot be identified through UEP
and argue that their approach could be coupled with crawling. If that is the case, why not use just
crawling in the first place? Further, crawling requires significant effort which is against the authors
motivation throughout the paper.
---
(*) There are several questions here, some of which we addressed
above. Nevertheless, we reiterate some of them here. First, yes, not
all IPs are available on the WWW, and hence other approaches, such as
crawling p2p systems, are needed to obtain this information. Second,
as we explain at the beginning of this section and in the
Introduction, these approaches are complementary and part of the same
UEP methodology. Third, regarding the question of why not use crawling
in the first place, there are two answers: (a) because crawling the
WWW is a huge task that we cannot achieve on our own, and (b) because
Google and other search engines already provide this service, and
hence it is practical and justified to exploit the existing systems.
Fourth, crawling p2p systems for the sake of collecting peer IPs is a
fairly simple and scalable task relative to web crawling.
---
Finally, the discussion skims through and tries to hide important issues such as the staleness of the
information in the web. The authors never make any attempt to evaluate to what extent information
they uncover is potentially fresh or stale and thus not valid. Logs and web forums may relate to events
in the past not reflecting current application usage especially regarding non-server IP addresses.
---
(*) In order to address the issue of staleness of information, we have
conducted an analysis of how frequently the websites that we have
identified are updated. The results are presented in Section VI, and
we also mention them here:
In order to further analyze this aspect, we have conducted a study on
the staleness of the information that we have used. We took the top
300 websites that generated the most hits used in the traffic
classification part (using only the tags taken from these websites, we
managed to classify 91% of the traffic). For this set we retrieved the
Last Update date. We found that we could retrieve such a date for 88%
of these websites, and that 71% of the websites analyzed had been
updated in the last month. This measure shows a fairly high freshness
of the IP-address-related information available on the Web.
---
Another question is what happens if historical traces need to be examined. Obviously, re-running UEP on
the same trace at different time instances might yield different results.
---
(*) This can be addressed by creating databases of UEP results at
specific time instances.
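A minimal sketch of such a database (the schema is hypothetical):

```python
import sqlite3
import time

def store_snapshot(db_path, tags):
    # tags: mapping of IP address -> set of UEP tags at query time.
    # Each run appends a timestamped snapshot, so a historical trace
    # can later be matched against the UEP state of its own era.
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS uep_tags "
                 "(ip TEXT, tag TEXT, queried_at REAL)")
    now = time.time()
    conn.executemany(
        "INSERT INTO uep_tags VALUES (?, ?, ?)",
        [(ip, t, now) for ip, tag_set in tags.items() for t in tag_set])
    conn.commit()
    conn.close()
```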
---
The paper appears the same as the SIGCOMM paper from the same authors. What are the new
contributions here?
---
(*) The novelties relative to the SIGCOMM paper are the following:
(i) We compared against a signature-based traffic classification
approach which addresses the important issue of accuracy (Tables V and
VI in Section III.C).
(ii) We provide additional and comprehensive results regarding traffic
locality (Figure 7 in Section IV).
(iii) We explored a p2p-based UEP approach and presented the results
(this is now Section V.A).
(iv) We explored a DNS-based UEP approach and presented the results
(this is now Section V.B).
(v) We explored the issue of the staleness of the IP-address-related
information available on the Web (now in Section VI).
---








Reviewer(s)' Comments to Author:
Reviewer: 1
Comments to the Author

The paper benefits greatly with the revisions with regard to
a) toning down the claims about UEP's utility in isolation and a better explanation of how UEP
complements existing traffic classification approaches
b) better evaluation of UEP's accuracy by comparing with a payload signature based tool

Below are a list of things that need fixing:

-- The captions of almost all figures need more detail. You can't have "traffic destinations" (Fig. 3)
and "IP addresses" (Fig. 4) as figure captions! The caption of a graph needs to outline the experiment
that produced the graph so that the reader can interpret the graph without having to go through all the
text. The following URL has a few guidelines that are worth following when captioning figures:
http://www.physics.ohio-state.edu/~wilkins/writing/Handouts/fig-captions.html
---
(*) We agree with the reviewer that it would be useful for the captions to contain more information, and we
have modified them accordingly.
---

-- For the results in Table V and VI, it would be valuable to have the absolute number of flows in
addition to the percentages cited. For example, for streaming, mail, and FTP traffic, though the absolute
difference in percentages between UEP and signature-based classification is small, there is a significant
relative difference. It's hard to say how significant the relative difference is without access to the
absolute number of flows.
---
(*) The authors agree with the suggestion of the reviewer. We have added the absolute number of flows in
Table V. Table VI already contains the absolute number of endpoints.
---

-- The study on staleness of information discovered by UEP (second column of page 13) is plain incorrect.
What you study is the staleness of the webpages from which you extract information about IP addresses.
This does not tell you anything about whether the information you extract from that page is still valid.
For example, on an online forum, the page corresponding to a particular thread may be updated every day
as new users post to that thread. However, information you initially extracted from this page, e.g., that
the IP address of the first poster on this thread is associated with say gaming, may no longer be valid.

To perform a valid study of staleness, you need ground truth, for which you need a trace spanning a long
interval. If you are unable to gather such a trace, please tone down your claims about the staleness of
data yielded by UEP.
---
(*) We have toned down the claims about the staleness of the information present on the Web, and we have
added the following line: "While via this method we are unable to estimate the staleness of a particular
IP address, we are still able to show a high level of freshness for the websites that provide information
about IP addresses."
---

-- In the second column on page 8, it should be "only by the signature based traffic classification (S-U)
and only by our approach (U-S)"
---
(*) We have addressed this.
---

Reviewer: 2
Comments to the Author
I am Reviewer 3 from the previous set of reviewers. I would like to state that all my major concerns have
been addressed, though a few concerns (mostly editorial) remain. First, let me mention a few high-level
concerns.
concerns.

- I believe the confusion regarding the word "unconstrained" is due to people confusing "different" with
"better". BLINC etc. used network flow data by design. Your solution uses information that can be
searched via the WWW to achieve the same task, by design. Your solution is different, but that doesn't
automatically make it better. Just because your solution is "unconstrained" doesn't make your solution
better. Although I do realize you never adopted such a stance, unfortunately
it seems to come out that way.
---
(*) The authors understand the confusion caused by the use of the word "unconstrained". We never intended
to suggest that unconstrained might mean better. In order to address this, we have added the following
sentences in the introduction: "Hence our approach is different by design (not necessarily better) from
other traffic classification approaches (e.g., BLINC). We compare these approaches for given networks
later in the paper."
---
- Regarding traffic locality, if you have the information, it might be nice to show what percentage of
the total traffic in areas with high locality is P2P. For instance in China, where you show high
locality, is most of the traffic P2P or other apps? This would be very interesting for multiple reasons.
If a high percentage of total traffic is local, and if a high percentage of the traffic that is local is
not P2P, this can have fundamental consequences in how traffic patterns in the Internet are modeled now.
For instance the gravity model to model traffic matrices will not hold under locality.
---
(*) The authors would like to perform such a study. However, given that none of the approaches used
managed to classify a certain percentage of the traffic, making absolute claims about this would be
hazardous.
---

Some minor comments:

- Pg 3, Col 2, Line 9: Sentence incomplete!
---
(*) We have addressed this.
---
- Pg 3, Col 2, Line 17: ".. extract the ones which.." -> 'which' to 'that'. There are other places also
where "which" should be changed to "that" - please go through the text carefully.
---
(*) We have addressed this.
---
- Pg 8, Col 1, Line 39: "The first reason.." -> Rephrase, very awkward sentence construction.
---
(*) We have addressed this. The phrase now reads: "The UEP approach is online capable because of its
ability to classify traffic based on a single observed packet for which one of the endpoints is revealed
(e.g., a web server). Furthermore, there is a huge bias of traffic destinations ..."
---
- Pg 8, Col 1, Line 54: "..most popular 5%.." -> popular where? 5% of the most popular IP addresses in the
whole known universe? Please be specific.
---

(*) We have addressed this: "most popular 5% of IP addresses from the traces".
---

Reviewer: 3
Comments to the Author
The authors seem to have addressed some of the issues raised by the reviewers.

However a number of issues still remain:

The limitations of the UEP methodology are never clearly identified and discussed properly.  For example,
the authors discuss the need/absence of packet traces in two paragraphs in the introduction as a
limitation for previous work. But to provide sufficient classification for a number of applications, UEP
also needs a lot of "side" information such as crawling p2p networks. This is never discussed in the
introduction.

Thus, the authors should clearly state in the introduction (since most of the focus is towards
application classification), that googling the Internet alone can only provide limited information about
certain applications such as p2p networks, and significant effort might be needed to overcome such
limitations (e.g., crawling a p2p network). The authors only vaguely allude to this fact and this can
lead to misconceptions when reading the paper.
---
(*) We have incorporated the following paragraph in the introduction: "Still, not all information is
available on the Web. Hence, results may be improved by using additional sources of information, some of
which come at a high cost (e.g., joining and crawling a p2p network)."
---

Also, when discussing comparisons with other classification methodologies, they should make clear that
UEP outperforms other methodologies in the particular cases and networks examined (which is again only
alluded to in later sections).
The authors should devote a couple of paragraphs clearly discussing several choices made throughout the
paper and the limitations that these might entail (This might be towards the end of the paper in the
discussion section or a separate limitations section). These include the assumption of neighboring IPs,
the timescale issue which is only touched upon, the fact that only particular networks are
examined, browsing through anonymization proxies or networks, etc. This is needed so readers can clearly
identify the strengths and weaknesses of UEP.
---
(*) Regarding the assumption of neighboring IP addresses, this is based on the fact that IP address
assignment on the Internet at different points in the assignment hierarchy is done on a block basis.
Indeed, assignment agencies assign specific contiguous blocks to ISPs, and higher-tier ISPs also assign
contiguous blocks to lower-tier ISPs. It is also easier to specify blocks in access control lists and
filters than to specify individual hosts. We have added the following paragraphs in the paper to account
for this: "Indeed, IP address assignment on the Internet at different points in the assignment hierarchy,
except the end user, is done on a block basis. Also, to ease network management, network administrators
typically assign contiguous IP addresses to hosts in the same network."
We have added paragraphs in the introduction stating that our comparison holds for the particular
networks examined:
"Hence our approach is different by design (not necessarily better) from other traffic classification
approaches (e.g., BLINC). We compare these approaches for given networks later in the paper."

"It should be noted that the examined networks belong to tier-1 ISPs, which is an unfriendly environment
for one of the compared approaches [25]."

We have toned down the claims about the timescales and staleness of the information present on the Web,
and we have added the following line: "Although we cannot validate that the information about a particular
IP address is accurate, this measure still shows a fairly high freshness of the websites that display IP
address information."

Regarding browsing through anonymization proxies, our paper did not draw any conclusions regarding the
particular websites that people access through these proxies, but only with regard to the browsing
behavior, which is still captured.
---


Especially for the timescale issue, the data provided are not convincing. The authors show that 30%-40%
(depending on how one interprets the 88% and 71% numbers, 88%*71%=62%) of the top 300 websites do contain
stale information, which is a significant fraction and might lead to misclassifications or false
findings. The authors should make clear that this might be a limitation (it would be interesting to
examine this in future work).
---

(*) We have toned down the claims (see the comment from the previous reviewer) about the staleness of the
information present on the Web, and we have added the following line: "Although we cannot validate that
the information about a particular IP address is accurate, this measure still shows a fairly high
freshness of the websites that display IP address information."
---

Similarly for p2p networks. The authors added text mentioning that the entry point of p2p will be known
publicly. They also need to make clear that these entry points are however only a small fraction of the
whole p2p network (or provide evidence against this).
---
(*) We agree with the reviewer and have added the following phrase to acknowledge this: "The number of
such endpoints in a p2p network is relatively small; however, whenever a client wants to retrieve a file,
he typically goes through such an access point."
---

The authors added ground truth analysis comparing with a signature based classifier. This is nice but on
the other hand, it clearly highlights the points above, namely that for applications besides web where
information is public, UEP fails with no other "side" information. The authors should clearly identify
this, namely that false negatives compared to signature based classification are roughly 50% for p2p and
streaming applications, 23% for email and 36% for ftp. Since results are presented in flows, one would
expect the overall classification to be significantly low in terms of bytes (taking into account that p2p
and streaming would contribute most in terms of bytes). This is only vaguely discussed by the authors.
---
(*) We agree with the reviewer, and we have added this in the text (e.g., for mail, 22,365 flows for UEP
compared to 30,426 flows for signature-based classification, and for FTP, 1,338 flows for UEP compared to
2,278 flows for signature-based classification). We have also added the number of flows in Table V to
illustrate this.
Our paper was never targeted at identifying traffic volumes, and consequently we did not make claims
regarding this.
---