Elsevier

Computer Networks

Volume 55, Issue 3, 21 February 2011, Pages 873-889

Prediction models for long-term Internet prefix availability

https://doi.org/10.1016/j.comnet.2010.10.005

Abstract

The Border Gateway Protocol (BGP) maintains inter-domain routing information by announcing and withdrawing IP prefixes. These routing updates can cause prefixes to be unreachable for periods of time, reducing prefix availability observed from different vantage points on the Internet. The observed prefix availability values may not meet the standards promised by Service Level Agreements (SLAs).

In this paper, we develop a framework for predicting long-term availability of prefixes, given short-duration prefix information from publicly available BGP routing databases like RouteViews, and prediction models constructed from information about other Internet prefixes. We compare three prediction models and find that machine learning-based prediction methods outperform a baseline model that predicts the future availability of a prefix to be the same as its past availability. Our results show that mean time to failure is the most important attribute for predicting availability. We also quantify how prefix availability is related to prefix length and update frequency. Our prediction models achieve 82% accuracy and 0.7 ranking quality when predicting for a future duration equal to the learning duration. We can also predict for a longer future duration, with graceful performance reduction. Our models allow ISPs to adjust BGP routing policies if predicted availability is low, and are generally useful for cloud computing systems, content distribution networks, P2P, and VoIP applications.

Introduction

The Border Gateway Protocol (BGP), the de facto Internet inter-domain routing protocol, propagates reachability information by announcing paths to prefixes, which are aggregates of IP addresses. Autonomous Systems (ASes) maintain paths to prefixes in their routing tables, and (conditionally) update this information when route update messages are received. These update messages can be announcements, which announce an AS path to a prefix, or withdrawals, which indicate that no path is available to the prefix. Continuous prefix reachability over time is crucial for the smooth operation of the Internet. This is captured using the metric of availability, defined as the time duration when the prefix is deemed reachable divided by the total time duration we are interested in. While typical system availability metrics for telephone networks exceed five 9s, i.e., 99.999%, computer networks are known to have lower availability [1], [2], [3]. The five 9s availability value amounts to the system being down for about five minutes per year and is usually too stringent a requirement for Internet prefixes.
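To make the availability definition concrete, the following is a minimal sketch, not the implementation used in this work. It assumes a hypothetical input layout: a time-ordered list of (timestamp, kind) pairs for one prefix as seen from one vantage point, where kind is "A" for an announcement and "W" for a withdrawal. The function names and the `initially_up` parameter are illustrative assumptions. The second function reproduces the five 9s arithmetic.

```python
from typing import Iterable, Tuple

SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def availability(events: Iterable[Tuple[float, str]],
                 start: float, end: float,
                 initially_up: bool = False) -> float:
    """Fraction of [start, end] during which a path to the prefix is advertised.

    events: time-ordered (timestamp, kind) pairs, kind "A" (announcement)
    or "W" (withdrawal). This layout is an assumption for illustration,
    not a RouteViews file format.
    initially_up: assumed state of the prefix at the start of the window.
    """
    if end <= start:
        raise ValueError("window must have positive length")
    up = initially_up
    last = start
    up_time = 0.0
    for ts, kind in events:
        if ts < start or ts > end:
            continue  # ignore events outside the observation window
        if up:
            up_time += ts - last  # close the reachable interval so far
        last = ts
        up = (kind == "A")  # an announcement makes the prefix reachable
    if up:
        up_time += end - last  # prefix still reachable at the window end
    return up_time / (end - start)


def downtime_minutes_per_year(avail: float) -> float:
    """Yearly downtime in minutes implied by an availability value."""
    return (1.0 - avail) * SECONDS_PER_YEAR / 60.0
```

For an availability of 0.99999, `downtime_minutes_per_year` gives roughly 5.26 minutes, which matches the "about five minutes" figure above.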

Prefixes belonging to highly popular services such as CNN, Google, and YouTube need to be highly available, and a disruption of more than a few minutes is generally unacceptable. Internet Service Providers (ISPs) such as AT&T and Sprint usually provide availability guarantees on their backbone network through Service Level Agreements (SLAs) [4], [5]. However, content providers are more interested in their website availability as observed from various points in the Internet, and a routing path being advertised is critical to maintaining traffic flow to their data centers. Attempts at defining policies so that SLAs can be extended to several ISPs [6] and at defining and estimating service availability between two end points [7] in the Internet have had limited success. Meanwhile, several reachability problems have occurred, such as the YouTube prefix hijack, which lasted about two hours [8], and several undersea cable cuts, e.g., [9], [10], which caused significant disruptions and increased web latencies in much of the Middle East, Asia, and North Africa for a period of several weeks [11].

Measuring prefix availability is non-trivial without an extensive measurement infrastructure comprising many vantage points. Additionally, data plane measurements are inherently discontinuous, as they take reachability samples at periodic time instants. The reachability estimate they compute increases in accuracy as the sampling interval is made smaller, at the cost of increased burden on the prober and elevated network traffic. Moreover, the observations need to be made over a long period of time to obtain a reasonable estimate. A shortfall in measured availability requires a reactive approach that corrects the problem after the fact. Our work takes a predictive approach to solve the availability prediction problem, i.e., predicting the advertised availability of prefixes, as observed from multiple vantage points in the Internet.

Our framework predicts long-term control plane availability, i.e., the availability of the paths to prefixes as advertised by BGP. However, previous work has shown that paths advertised in the control plane are not always usable in the data plane [12], [13], [14]. Wang et al. [14] studied the correlation between control plane and data plane events and found that control plane changes mostly result in data plane performance degradation, showing that the two planes are correlated. BGP routing dynamics have been used to predict data plane failures in previous work [13], [15]. Zhang et al. [13] found that data plane failures can be predicted using routing updates with about 80–90% accuracy for about 60–70% of the prefixes. Feamster et al. [15] predict end-to-end path failure using the number of BGP messages observed during a 15-minute window. This indicates that the control plane does indeed have a positive correlation with the data plane.

Transient events like routing convergence and forwarding loops result in temporary reachability loss in the data plane, most of which last less than 300 seconds [13]. However, since we are concerned with the long-term availability metric considering at least a few days at a time, the percentage of time that the control plane and data plane paths mismatch should be insignificant compared to the time over which our availability values are computed. For example, a single 300-second transient within a three-day window (259,200 seconds) accounts for about 0.12% of the window.

Data plane reachability can exist even when control plane paths are withdrawn due to the presence of default routes [12]. However, it is not possible to predict the existence of default routes, as they depend on intermediate ASes between the source and the destination. There is no agreed-upon method to detect the existence of default routes, though some initial efforts have been made by the authors of [12] by controlling announcements and withdrawals of certain prefixes allocated to their ASes. Our work considers only control plane availability and hence actual prefix availability could be higher in the data plane if default routes are present. As can be seen from the discussion above, establishing the correlation between the two planes is by itself a challenging topic [12], and a detailed study of it is beyond the scope of this work.

In this work, we compute attributes during a short duration observation period of publicly available routing information (e.g., from RouteViews [16]) and develop a prediction model based on information on other Internet prefixes. Thus, our approach does not need additional measurement infrastructure apart from RouteViews [16], which has been maintained by the University of Oregon for several years.

A predicted long-term advertised availability value which falls short of requirements could lead to changes in BGP policies of the ISP regulating the advertisement of these prefixes to the rest of the Internet. For example, one can increase the penalty threshold associated with route flap damping for the routes to a high availability requirement prefix (like a business customer) to ensure higher availability [17]. Changing BGP attributes such as MED and community, or aggregating prefixes, can increase the perceived prefix availability or aid traffic engineering [17]. We will make our prediction tool publicly available through a web page so it can be used for monitoring the predicted availability, e.g., of prefixes of an ISP.

Our work can optimize Hubble [18], a system that studies black holes in the Internet by issuing traceroutes to potentially problematic prefixes, and then analyzing the results to identify data plane reachability problems. Currently, Hubble uses BGP updates for a prefix as one of the potential indicators of problems, focusing on withdrawals and AS path changes. We can enhance this technique by treating the prefixes whose predicted availability falls below a threshold as the potentially problematic prefixes. This will increase detection accuracy of black holes. Our work also complements a data plane loss rate prediction system such as iPlane [19].

Applications of our work include content distribution networks (CDNs), cloud computing applications, VoIP applications, and P2P networks. CDNs and cloud computing applications can redirect clients to the replica or server with the highest predicted availability. VoIP implementations can use predicted availability of relay nodes along with latency and loss rate estimates for better performance. Our work can also be applied to peer networks, where ensuring content availability is a primary concern amid extensive peer churn. One can modify the incentive mechanisms of BitTorrent [20] by unchoking the BitTorrent peers that are part of a highly available prefix, in addition to considering their download rate and latency/loss rate estimates. Our system eliminates the need for storing information about peers at clients that are not currently downloading from these peers but may do so in the future.

The key premise in this paper is that Internet prefix characteristics convey valuable information about prefix availability. We argue that prediction models are viable even if prefixes whose availability is to be predicted and prefixes used for learning prediction models are unrelated (e.g., learning and predicted prefixes are not in the same AS). This is because an important factor causing paths to prefixes from various vantage points to go up or down is BGP path convergence, caused by BGP reaction to path failure or policy changes. This, combined with the fact that operator reaction to path failures is relatively standard, and that AS policy changes, e.g., AS de-peering, typically affect several prefixes at a time, supports this premise. We therefore use randomly selected prefixes from RouteViews to learn models, and then predict availability of other prefixes. This theme is common in other disciplines, such as medicine, where one uses known symptoms of patients with a diagnosed disease to try to diagnose patients with an unknown condition. To the best of our knowledge, no other work has exploited the similarity of prefixes in the Internet; a few studies, e.g., [13], applied predictive modeling in the context of BGP, but they only examined problem ASes in the path to a particular prefix.

While we focus on predicting prefix availability using observed routing updates, our prediction framework can be easily extended to predict other prefix properties of interest. We formulate hypotheses about how attributes of a prefix such as prefix length and update frequency relate to its availability, and prove or refute them based on our data. We show that past availability of a prefix is inadequate for accurately predicting future availability. Our availability predictions from three models are compared to measured availability values from RouteViews.
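The baseline against which the learned models are compared simply carries past availability forward. The sketch below illustrates that idea under stated assumptions: availability is discretized into two classes using a hypothetical threshold (`HIGH_THRESHOLD` is illustrative, not a value taken from this paper), and accuracy is the fraction of prefixes whose class in the future period equals their class in the past period. The dictionary keys and function names are likewise assumptions.

```python
from typing import Dict, Hashable

# Hypothetical class boundary, chosen only for illustration.
HIGH_THRESHOLD = 0.99999


def to_class(avail: float, threshold: float = HIGH_THRESHOLD) -> str:
    """Map an availability value to a coarse class label."""
    return "high" if avail >= threshold else "low"


def baseline_accuracy(past: Dict[Hashable, float],
                      future: Dict[Hashable, float],
                      threshold: float = HIGH_THRESHOLD) -> float:
    """Accuracy of predicting each future class as equal to the past class.

    past, future: availability per key (for example, a prefix string)
    measured over the learning and prediction periods respectively.
    Only keys present in both periods are scored.
    """
    keys = past.keys() & future.keys()
    if not keys:
        raise ValueError("no common keys to evaluate")
    hits = sum(
        to_class(past[k], threshold) == to_class(future[k], threshold)
        for k in keys
    )
    return hits / len(keys)
```

A learned model is useful only if it scores above this number on the same prefixes, which is why the baseline is a natural point of comparison.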

This paper extends our previous work [21] as follows:

  • (1) In addition to varying the ratio of the learning duration to the prediction duration as in [21], we vary the learning duration itself. This is important because the availability distribution depends on the duration over which it is computed, and hence this impacts prediction performance.

  • (2) We consider an additional machine learning model, namely the Naïve Bayes model, for availability prediction. This is a popular model in the machine learning literature [22], known to be simpler than the bagged decision trees considered in [21] but potentially less accurate. We find that the performance of this model is better or worse than bagged decision trees, depending on the learning duration (Section 6.4). We also conduct a more thorough investigation of the prediction models used in [21].

  • (3) We study the distribution of the prefix attributes and show, using statistical tests, that the attributes indeed demarcate availability classes (Section 5.3).

  • (4) We predict availability of a large number of prefixes, thereby showing that the prediction models are scalable (Section 7).

  • (5) All results presented in this paper are for the time period of January to October 2009, as opposed to [21], where one month of data was considered at a time. This leads to higher diversity of the visible prefixes, since some prefixes are only visible for short time periods. As availability is a long-term metric, this 10-month evaluation of the prediction models is more realistic.

The remainder of this paper is organized as follows. Section 2 summarizes related work. We define the problem that we study in Section 3. Section 4 describes our datasets, and Section 5 describes our methodology and metrics. In Section 6, we compare results from three prediction models and study the effect of classification attributes and using certain more predictable prefixes on prediction results. Section 7 describes our results of applying prediction models to large sets of combinations. Section 8 concludes the paper and discusses future work.

Related work

Rexford et al. [23] find that highly popular prefixes have relatively stable BGP routes, and experience fewer and shorter update events. Their results fit into our prediction framework, with the prefix popularity being a feature that can be used to predict stability, specifically the number of update events associated with a prefix. Our work goes a step further by predicting prefix availability, not just the events associated with a prefix, using easily computable attributes. Prefix attributes

Problem definition

We define the availability prediction problem to be the prediction of the BGP-advertised availability of a prefix, given its attributes computed by observing BGP updates (for example, through RouteViews), and the availability and attribute information of other prefixes, collected for a short duration of time. Advertised availability is critical in maintaining smooth traffic flow to these prefixes. Going back to our patient analogy, given the symptoms and known diseases of some patients, one can

Datasets

The routing tables (RIB files) and updates available from RouteViews [16] are in .bz2 format with typical sizes of 0.8 GB per day of RIB files (sampled every two hours) and about 25 MB per day of update files (written every minutes), which total about 25 GB per month of data. We preprocess the data using libbgpdump version 1.4.99.7 [31] to convert the files from the MRT format to text. We reduce the storage space required by removing unused fields. We only keep the timestamp, peer IP, prefix, and

Methodology

We define a combination as a (peer, prefix) tuple, which implies that the prefix was observed by the peer in the RouteViews dataset. We compute the availability of these combinations and use that for building our prediction models. The notion of availability of a prefix is with reference to an observation point in the Internet. For the RouteViews data, these observation points are the peers. They are fairly well spread out over the world, enabling one to observe the availability of prefixes

Model evaluation

In this section, we study three prediction models using the metrics in Section 5.5. As mentioned in Section 5.2, we work with 10,000 combinations and their attributes, downsampled from all the combinations in each of four different months. We do 10-fold incremental cross-validation as described in Section 5.4; thus n = 10. We conduct k = 5 runs, generating a different set of 10 folds each time. Hence, we have 50 performance measures for each model averaged to give an output measurement.
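The evaluation protocol above (n = 10 folds, k = 5 runs, 50 measures per model) can be sketched as follows. This is a plain repeated k-fold loop and does not reproduce the incremental variant of Section 5.4, whose details are not given here. The `evaluate` callback, the seed, and the function names are assumptions for illustration.

```python
import random
from statistics import mean
from typing import Callable, List, Sequence


def make_folds(n_items: int, n_folds: int, rng: random.Random) -> List[List[int]]:
    """Shuffle item indices and deal them into n_folds disjoint folds."""
    idx = list(range(n_items))
    rng.shuffle(idx)
    return [idx[i::n_folds] for i in range(n_folds)]


def repeated_cv(items: Sequence,
                evaluate: Callable[[list, list], float],
                n_folds: int = 10, n_runs: int = 5,
                seed: int = 0) -> List[float]:
    """Return n_folds * n_runs scores, one per (run, held-out fold) pair.

    evaluate(train, test) is a caller-supplied function that trains a
    model on train and returns a performance measure on test.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_runs):
        folds = make_folds(len(items), n_folds, rng)  # new folds every run
        for held_out in range(n_folds):
            test = [items[i] for i in folds[held_out]]
            train = [items[i]
                     for f, fold in enumerate(folds) if f != held_out
                     for i in fold]
            scores.append(evaluate(train, test))
    return scores
```

With the default arguments, `mean(repeated_cv(combinations, evaluate))` averages 50 measures into one output measurement per model, as described above.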

We start

Larger test datasets

So far in this paper, we have used training and test sets which are constructed out of a sample of 10,000 combinations using 10-fold cross-validation. We now investigate the scalability of our models, where we apply the learned models to a large number of combinations. This may be required of a typical prediction application, if one is interested in predicting the availability of a set of prefixes from a large number of vantage points in the Internet.

To evaluate scalability, we learn Naïve

Conclusions and future work

In this paper, we have developed a long-term availability prediction framework that uses mean time to recovery, mean time to failure, prefix length, and update frequency as attributes. These attributes are easily computable from public RouteViews data observed for a short period of time. Our framework learns a prediction model from a set of Internet prefixes, and uses that model to predict availability of other prefixes. To the best of our knowledge, this is the first work that uses the

References (46)

  • P. Pongpaibool et al., Providing end-to-end service level agreements across multiple ISP networks, Computer Networks, 2004.
  • John Shepler, The Holy Grail of five-nines reliability, 2005....
  • M. Dahlin et al., End-to-end WAN service availability, IEEE/ACM Transactions on Networking, 2003.
  • V. Paxson, End-to-end routing behavior in the Internet, IEEE/ACM Transactions on Networking, 1997.
  • AT&T, AT&T High Speed Internet Business Edition Service Level Agreements, http://www.att.com/gen/general?pid=6622...
  • Sprint, Sprint service level agreements. <http://www.sprintworldwide.com/english/solutions/sla/> (accessed April...
  • R. Keralapura, C.-N. Chuah, G. Iannaccone, S. Bhattacharyya, Service availability: a new approach to characterize IP...
  • E. Zmijewski, Threats to Internet routing and global connectivity, in: Proceedings of the 20th Annual FIRST Conference,...
  • Cable News Network (CNN), Internet failure hits two continents, 2008....
  • Fox News, Severed Cables Cut Egypt’s Internet Access Once Again, 2008....
  • Akamai, Mideast outage, 2008....
  • R. Bush, O. Maennel, M. Roughan, S. Uhlig, Internet optometry: assessing the broken glasses in internet reachability,...
  • Y. Zhang, Z.M. Mao, J. Wang, A framework for measuring and predicting the impact of routing changes, in: INFOCOM, 2007,...
  • F. Wang, Z.M. Mao, J. Wang, L. Gao, R. Bush, A measurement study on the impact of routing events on end-to-end internet...
  • N. Feamster, D.G. Andersen, H. Balakrishnan, M.F. Kaashoek, Measuring the effects of internet path faults on reactive...
  • University of Oregon, Route Views Project, http://www.routeviews.org/ (accessed April...
  • M. Caesar et al., BGP routing policies in ISP networks, IEEE Network Magazine, 2005.
  • E. Katz-Bassett, H.V. Madhyastha, J.P. John, A. Krishnamurthy, D. Wetherall, T. Anderson, Studying blackholes in the...
  • H.V. Madhyastha, T. Isdal, M. Piatek, C. Dixon, T. Anderson, A. Krishnamurthy, A. Venkataramani, iPlane: an information...
  • B. Cohen, Incentives Build Robustness in BitTorrent, 2003....
  • R. Khosla, S. Fahmy, Y.C. Hu, J. Neville, Predicting prefix availability in the Internet, in: INFOCOM,...
  • I.H. Witten et al., Data Mining: Practical Machine Learning Tools and Techniques, 2005.
  • J. Rexford, J. Wang, Z. Xiao, Y. Zhang, BGP routing stability of popular destinations, in: Proceedings of the ACM IMW,...

    Ravish Khosla is a graduate student pursuing a Ph.D. in the Electrical and Computer Engineering Department at Purdue University under the supervision of Prof. Sonia Fahmy and Prof. Y. Charlie Hu. His research interests lie in routing protocols in the Internet, specifically the Border Gateway Protocol (BGP). He is currently working on evaluating BGP resilience by studying the availability of Internet prefixes. He is a student member of IEEE. He has an MS in ECE from Purdue University, with a thesis titled “Reliable Data Dissemination in Energy Constrained Sensor Networks”, and a B.Tech (H) in Electrical Engineering from IIT Kharagpur, India.

    Sonia Fahmy is an Associate Professor of Computer Science at Purdue University. She received her Ph.D. degree from the Ohio State University in 1999. Her current research interests lie in Internet measurement and tomography, network testbeds, network security, and wireless sensor networks. She received the National Science Foundation CAREER award in 2003, and the Schlumberger technical merit award in 2000. She is a member of the ACM and a senior member of the IEEE. For more information, please see: http://www.cs.purdue.edu/~fahmy/.

    Y. Charlie Hu is an Associate Professor of Electrical and Computer Engineering at Purdue University. He received his Ph.D. degree in Computer Science from Harvard University in 1997 and was a research scientist at Rice University from 1997 to 2001. His research interests include wireless networking, overlay networking, operating systems, and distributed systems. He has published more than 120 papers in these areas. Dr. Hu received the NSF CAREER Award in 2003. He served as a TPC vice chair for ICPP 2004, ICDCS 2007, and SBAC PAD 2009. He is a senior member of ACM and IEEE.

    Jennifer Neville is an Assistant Professor of Computer Science and Statistics at Purdue University. She received her Ph.D. degree from University of Massachusetts, Amherst, in 2006. Her research interests lie in data mining and machine learning techniques for relational data. She focuses on the development and analysis of relational learning algorithms and the application of those algorithms to real-world tasks. For more information, please see: http://www.cs.purdue.edu/~neville/.
