Elsevier

Computer Communications

Volume 29, Issue 16, 12 October 2006, Pages 3295-3305

Estimation of network distances using off-line measurements

https://doi.org/10.1016/j.comcom.2006.05.010

Abstract

Several emerging large-scale Internet applications, such as Content Distribution Networks and Peer-to-Peer networks, could potentially benefit from knowing the underlying Internet topology and the distances (i.e., round-trip times) between different hosts. Most existing techniques for distance estimation either use a dedicated infrastructure or rely on on-line measurements in which probe packets are injected into the network during estimation. Our goal in this paper is to study off-line techniques for distance estimation that do not require a dedicated infrastructure. To this end, we propose a metric termed “depth” and observe that, together with a quadratic function of the geographic distance, it can predict the network distance with high accuracy using multi-variable regression. When used for closest-server selection, our approach performs much better than random server selection and comparably to the on-line metrics. Our approach incurs low overhead and can be deployed easily with some DNS extensions.

Introduction

In recent years, more and more popular Web items are being replicated to facilitate their fast retrieval. Web caching schemes, in which items are cached and delivered to users from a local cache, are evolving into global Content Distribution Networks (CDNs). A Content Distribution Network [1], [2], [3] is an infrastructure that distributes the content of popular items to multiple geographically dispersed servers. When a client requests an item, the CDN directs it to the “best” replica, that is, the closest replica in terms of user-observed latency. One of the main factors in finding this replica is the network distance, in terms of RTT (round-trip time), between the client and the candidate content servers. Thus, it is important for content distribution service providers to know these distances in order to make good choices.

Other examples where multiple copies of the same item are stored in the Internet are peer-to-peer file-sharing applications such as Gnutella and Napster. Here again, while searching for a specific file, the user can be directed to one of several currently on-line copies, and it is beneficial to direct the user to the “closest” one. Emerging peer-to-peer overlay network applications can also use distance information to make the overlay network “distance aware”: in such an overlay it makes very little sense to connect clients in, say, San Francisco to peers in, say, New York. Thus, it is important to take distance information into account when constructing peer-to-peer overlay networks [4].

The importance of network distance information, as indicated by the above examples, has resulted in several approaches toward collecting it. Because measuring distances independently incurs substantial overhead, in terms of both delay and network traffic, a number of projects that collect network distance information and distribute it to various applications have gained popularity [5], [6], [7]. The information provided by such mechanisms is somewhat less accurate [8], [6], and they require a complex dedicated infrastructure. To address these drawbacks, a light-weight approach based on “landmarks” has recently been proposed [4], [9]. In this approach, the distance information is computed from measured distances between the client and a small number of well-known landmarks. Although this approach provides good distance information, it also requires the construction and maintenance of a dedicated infrastructure of landmarks, as well as some on-line measurements. Techniques that require messaging during distance estimation are termed “on-line”, and the others are referred to as “off-line”. On-line techniques are typically time consuming and incur high message overhead.

Our goal in this paper is to propose and study a network distance estimation technique that involves neither on-line measurements nor a dedicated infrastructure. The main idea is to use static information for this purpose. One well-known type of such information is the geographic location of the host. As geographic distance alone is believed to be insufficient for predicting network latency [10], we propose a new off-line metric called DEPTH, which is the average RTT from a given network element to the nearest backbone network (a precise definition is given later in the paper). Our approach is to use multi-variable regression techniques that exploit the combined information contained in these metrics to predict the minimum RTT. To contrast our results with on-line metrics, we also use the number of hops and the number of autonomous systems on the route, both of which require on-line traffic to measure.
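
For concreteness, the kind of predictor this suggests can be written as follows; the coefficient symbols below are illustrative, and the actual fitted form is presented in the regression analysis section.

```latex
% One plausible form of the off-line predictor combining geographic distance
% (with a quadratic term) and DEPTH; \beta_i denote regression coefficients.
\widehat{\mathrm{RTT}}_{\min}
  = \beta_0 + \beta_1\,\mathrm{DIST} + \beta_2\,\mathrm{DIST}^2 + \beta_3\,\mathrm{DEPTH}
```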

The first step towards answering our question was to collect real Internet data. This was done using traceroute servers. Overall, we performed more than 600,000 traceroute operations worldwide, using more than 3000 hosts and 24 servers (10 sets of traceroutes between each host-server pair). The data was split into training and test sets: the training data was used for the regression analysis, and the test data was used to study the performance of closest-server selection. Based on the statistical analysis, we design an algorithmic framework that computes an approximate network RTT from the given static data. The metrics were evaluated by their performance on topologically aware operations such as server selection. We observe that DEPTH together with quadratic distance is the best off-line metric, and that it performs similarly to the best on-line metric.

It is important to note that our technique does not require any infrastructure, and it can be easily deployed using an existing extension of the DNS service known as DNS LOC [11]. This is a record format defined in RFC 1876 [12] that allows location information to be added to the DNS. Using similar methods or extensions to the fields defined in RFC 1876, one can add further information such as geographic location and depth. Then, either the host, the local DNS server, or the application (CDN, P2P) server can run our algorithm (which is a very simple computation) and choose the desired server.
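
As a rough illustration of how a client or resolver could consume such records, the sketch below queries DNS LOC records and computes the great-circle distance between two hosts. It assumes the dnspython library; the hostnames are hypothetical, and a “depth” field would require a further extension as discussed above.

```python
# Sketch: reading geographic coordinates from DNS LOC records (RFC 1876) and
# computing the great-circle distance between two hosts. Assumes dnspython;
# hostnames are hypothetical examples with published LOC records.
import math
import dns.resolver

def loc_coordinates(hostname):
    """Return (latitude, longitude) in degrees from the host's DNS LOC record."""
    answer = dns.resolver.resolve(hostname, "LOC")
    rdata = answer[0]
    return rdata.float_latitude, rdata.float_longitude

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in kilometres between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Example (hypothetical names):
#   lat_c, lon_c = loc_coordinates("client.example.net")
#   lat_s, lon_s = loc_coordinates("server1.example.net")
#   dist_km = great_circle_km(lat_c, lon_c, lat_s, lon_s)
```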

The main contributions of this work are threefold.

  • We identify the network depth as an important off-line metric that provides (in combination with location information) enough information to enable accurate topologically aware operations.

  • We provide a detailed statistical analysis of various off-line and on-line metrics, and provide an analytical study of the correlation of these variables.

  • We show that topologically aware server selection can be done without any need for dedicated infrastructure or on-line measurements. Furthermore, the accuracy of such server selection is similar to the best on-line methods, and the deployment of our scheme is immediate.

The rest of this paper is organized as follows. In the next section, we further discuss the different approaches for obtaining topological distance data and explain our proposed framework. Then, in Section 4, we describe our data collection process; in Section 5 we provide the statistical analysis of the data; and in Section 6 we present and discuss the performance of our method for the problem of closest server selection. Finally, we conclude with pointers to future research in Section 7.

Section snippets

Related work

The correlation between different network metrics, such as the number of hops on the route and the RTT, has recently been revisited due to the availability of new measurement infrastructures and tools. Skitter [13] is a tool that measures the forward path and round-trip time (RTT) to a set of destinations by sending probe packets through the Internet. Based on this tool, several reports have been published that address the correlation between different network parameters [14], [15], [16]. However,

Distance estimation metrics: off-line vs. on-line

We classify the techniques for Internet distance measurement into on-line and off-line. On-line techniques require messaging during the computation of network distance, whereas off-line techniques rely only on measurements taken at other times. By definition, off-line techniques cannot be aware of dynamic factors such as server load and network congestion. However, the low message complexity and the quick computation of network distances make off-line estimation an attractive choice. This

Methodology

Since the US is densely populated with machines and networks compared to the rest of the world, we wanted to study the US data in isolation. The measurements that we collected between end hosts in the US are collectively referred to as the US-data. However, to get a broader picture and observe relations across the globe, we also collected and studied a second set of data, called the World-data, for which the hosts are spread across the globe outside of the US.

We used traceroute as our tool for data
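
As an illustration of the per-pair measurement that such a traceroute-based collection involves, the sketch below runs traceroute to a target and extracts the hop count and the minimum RTT sample on the final hop. This is only an approximation of the collection step, not the authors' actual scripts, and it assumes the common Linux traceroute output format.

```python
# Sketch: extracting hop count and minimum RTT from one traceroute run.
# Assumes a Linux-style traceroute whose hop lines contain "<rtt> ms" samples.
import re
import subprocess

def traceroute_metrics(target, queries=3):
    """Return (hop_count, min_rtt_ms) parsed from `traceroute target`."""
    out = subprocess.run(
        ["traceroute", "-q", str(queries), target],
        capture_output=True, text=True, timeout=120,
    ).stdout
    hop_lines = [l for l in out.splitlines()[1:] if l.strip()]  # skip header
    if not hop_lines:
        return None, None
    # RTT samples on the final hop line look like "12.345 ms"
    rtts = [float(m) for m in re.findall(r"([\d.]+) ms", hop_lines[-1])]
    return len(hop_lines), (min(rtts) if rtts else None)

# Example: hops, min_rtt = traceroute_metrics("www.example.com")
```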

Regression analysis

In this section we analyze the training data with the goal of finding a predictive model for the minimum RTT. First, we build a model based on DIST, HOPS, and AS as predictors. Second, we build a model based only on the off-line metrics, i.e., DIST and DEPTH. We then compare these two models from a statistical standpoint. The statistical tool Splus [27] was used for all the analysis presented in this section. A model is called off-line if it involves off-line metrics only; otherwise it is
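
In the same spirit, the two fits can be compared with ordinary least squares as sketched below. The paper's analysis was carried out in Splus; this numpy version is only an illustration, and the array names are assumptions.

```python
# Sketch: comparing the on-line model (DIST, HOPS, AS) and the off-line model
# (DIST, DIST^2, DEPTH) by their coefficient of determination R^2.
import numpy as np

def fit_r2(X, y):
    """Fit y ~ X by least squares (with intercept) and return R^2."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1.0 - resid.var() / y.var()

def compare_models(dist, hops, as_count, depth, rtt):
    """dist, hops, as_count, depth, rtt: 1-D arrays over the training pairs."""
    r2_online = fit_r2(np.column_stack([dist, hops, as_count]), rtt)
    r2_offline = fit_r2(np.column_stack([dist, dist ** 2, depth]), rtt)
    return r2_online, r2_offline
```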

Closest server selection

In the previous section we analyzed the data statistically to come up with a predictive model for RTT. Based on the regression results, we found that DEPTH along with DIST (quadratic) is the best off-line model for distance estimation. Now we want to evaluate the quality of these models with respect to the “goodness” of the topology-aware operations, particularly the closest server selection. Here, we are interested in identifying the closest server from a given set of servers without taking
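
A hedged sketch of how such off-line closest-server selection could proceed is given below: each candidate's RTT is predicted from its geographic distance to the client and its depth using previously fitted coefficients, and the minimizer is chosen. The helper names and the way depth is obtained are illustrative assumptions, not the paper's implementation.

```python
# Sketch: off-line closest-server selection using the DIST (quadratic) + DEPTH
# model. `beta` holds previously fitted coefficients (b0, b1, b2, b3).
def predict_rtt(dist_km, depth_ms, beta):
    b0, b1, b2, b3 = beta
    return b0 + b1 * dist_km + b2 * dist_km ** 2 + b3 * depth_ms

def closest_server(client, servers, beta, distance_km, depth_of):
    """servers: candidate server ids; distance_km(client, s) -> geographic
    distance in km; depth_of(s) -> server depth in ms (e.g. published via a
    DNS extension as discussed earlier)."""
    return min(
        servers,
        key=lambda s: predict_rtt(distance_km(client, s), depth_of(s), beta),
    )
```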

Conclusions

In this paper, we studied the ability to perform accurate topology-aware operations, such as server selection, using only static information. Our study was based on four metrics, namely geographic distance, number of hops, number of autonomous systems (AS), and depth, of which only geographic distance and depth are off-line parameters. The other two were included for the sake of comparison. Based on our detailed study of various combinations of these parameters using regression analysis,

References (28)

  • J. Dilley, B. Maggs, J. Parikh, H. Prokop, R. Sitaraman, B. Weihl, Globally distributed content delivery, in:...
  • B. Krishnamurthy, C. Willis, Y. Zhang, On the use and performance of Content Distribution Networks, in: Proceedings of...
  • C. Gkantsidis, P. Rodriguez, Network coding for large scale content distribution, in: Proceedings of INFOCOM, vol. 4,...
  • S. Ratnasamy, M. Handley, R. Karp, S. Shenker, Topologically-aware overlay construction and server selection, in:...
  • P. Francis, S. Jamin, V. Paxson, L. Zhang, D.F. Gryniewicz, Y. Jin, An architecture for a global internet host distance...
  • P. Francis et al., IDMaps: a global internet host distance estimation service, IEEE/ACM Transactions on Networking (2001)
  • S. Srinivasan, E. Zegura, Network Measurement as a Cooperative Enterprise, Lecture Notes In Computer Science, 2002...
  • E. Cronin et al., Constrained mirror placement on the internet, IEEE Journal on Selected Areas in Communications (2002)
  • T.S.E. Ng et al., Towards global network positioning
  • R.L. Carter, M. Crovella, Server selection using dynamic path characterization in wide-area networks, in: INFOCOM,...
  • C. Davis, H. Rose, DNS LOC: geo-enabling the domain name system,...
  • C. Davis, P. Vixie, T. Goodwin, I. Dickinson, A means for expressing location information in the domain name system,...
  • Skitter, Tool for topology and performance analysis for the internet,...
  • B. Huffaker, M. Fomenkov, D. Moore, k. claffy, Macroscopic Analyses of the infrastructure: measurement and...
    Prasun Sinha received his PhD from University of Illinois, Urbana-Champaign in 2001, MS from Michigan State University in 1997, and B. Tech. from IIT Delhi in 1995. He worked at Bell Labs, Lucent Technologies as a Member of Technical Staff from 2001 to 2003. Since 2003 he is an Assistant Professor in Department of Computer Science and Engineering at Ohio State University. His research focuses on design of network protocols for sensor networks and mesh networks. He served on the program committees of various conferences including INFOCOM (2004-2006) and MOBICOM (2004-2005). He has won several awards including Ray Ozzie Fellowship (UIUC, 2000), Mavis Memorial Scholarship (UIUC, 1999), and Distinguished Academic Achievement Award (MSU, 1997). He received the prestigious NSF CAREER award in 2006.

    Danny Raz received his doctoral degree from the Weizmann Institute of Science, Israel, in 1995. From 1995 to 1997 he was a post-doctoral fellow at the International Computer Science Institute, (ICSI) Berkeley, CA, and a visiting lecturer at the University of California, Berkeley. Between 1997 and 2001 he was a Member of Technical Staff at the Networking Research Laboratory at Bell Labs, Lucent Technologies. In October 2000, Danny Raz joined the faculty of the computer science department at the Technion, Israel. His primary research interest is the theory and application of management related problems in IP networks. Danny Raz served as the general chair of OpenArch 2000, and as a TPC member for many conferences including INFOCOM 2002-2003, OpenArch 2000-2001-2003, IM-NOMS 2001-2005, and as an Editor of the IEEE/ACM Transactions on Networking (ToN).

    Nidhan Choudhuri received his PhD from Michigan State University, his M. Stat. and B. Stat. degrees from Indian Statistical Institute, Calcutta, India. He was a visiting faculty at University of Michigan Ann Arbor from 1999 to 2000. Since 2000 he has been an assistant professor at Case Western Reserve University. His areas of interest are nonparametric function estimation, Bayesian methods, empirical likelihood, and statistical computing.
