Abstract
Overlay networks have emerged as a powerful and flexible platform for developing new disruptive network applications. The attractive characteristics of overlay networks, such as routing flexibility and dynamic overlay topology, bring new challenges to overlay fault diagnosis: dynamic overlay symptom-fault correlation, multi-layer (i.e., underlay vs. overlay) abstraction, and unregulated overlay symptoms. To address these challenges, we propose ProFis, a novel user-level probabilistic and reactive fault diagnosis technique for overlay networks that seamlessly integrates passive and active fault reasoning into an optimal fault diagnosis framework. ProFis uses observable overlay symptoms reported by overlay applications to dynamically correlate overlay symptoms and faults. It diagnoses overlay faults passively and, whenever necessary, selects optimal (i.e., least-cost) actions to enhance the passive diagnosis. Our evaluation study shows that ProFis can efficiently (i.e., with low latency) and accurately localize the root causes of overlay faults, even when the symptom loss rate is high.
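The interplay between passive reasoning and least-cost action selection described above can be illustrated with a minimal sketch. This is not the ProFis algorithm itself, whose details are not given in the abstract. It assumes a single-fault hypothesis, a hypothetical table of symptom likelihoods, a hypothetical confidence threshold of 0.8, and made-up fault, symptom, and action names. Unobserved symptoms are ignored rather than counted as negative evidence, which is one simple way to tolerate symptom loss.

```python
from math import prod

# Hypothetical model: all names and numbers below are illustrative assumptions.
priors = {"link_A": 0.5, "node_B": 0.5}
likelihood = {  # P(symptom observed | fault)
    "link_A": {"s1": 0.9, "s2": 0.1},
    "node_B": {"s1": 0.8, "s2": 0.9},
}
actions = [  # active checks, each with a cost and the faults it can test
    {"name": "probe_link_A", "cost": 3, "tests": {"link_A"}},
    {"name": "ping_node_B", "cost": 1, "tests": {"node_B"}},
]

def passive_posterior(priors, likelihood, observed):
    """Posterior over single-fault hypotheses, using observed symptoms only."""
    scores = {
        fault: priors[fault]
        * prod(likelihood[fault].get(s, 0.01) for s in observed)
        for fault in priors
    }
    total = sum(scores.values())
    return {fault: score / total for fault, score in scores.items()}

def cheapest_action(actions, candidates):
    """Least-cost action that tests at least one candidate fault, or None."""
    useful = [a for a in actions if a["tests"] & candidates]
    return min(useful, key=lambda a: a["cost"]) if useful else None

def diagnose(priors, likelihood, observed, actions, threshold=0.8):
    """Return (best_fault, action); action is None if passive reasoning suffices."""
    post = passive_posterior(priors, likelihood, observed)
    ranked = sorted(post, key=post.get, reverse=True)
    best = ranked[0]
    if post[best] >= threshold:
        return best, None
    # Not confident enough: pick a cheap action on the top two candidates.
    return best, cheapest_action(actions, set(ranked[:2]))

# Two symptoms are enough for a passive diagnosis; one symptom is not.
confident = diagnose(priors, likelihood, {"s1", "s2"}, actions)
uncertain = diagnose(priors, likelihood, {"s1"}, actions)
```

With both symptoms observed, the posterior for `node_B` exceeds the threshold and no action is requested. With only `s1` observed, the two hypotheses stay close, so the sketch falls back to the cheapest action that tests one of them.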
Acknowledgements
This research was supported in part by the National Grand Fundamental Research 973 Program of China under Grant No. 2009CB320505, and by the National Natural Science Foundation of China under Grant No. 60973123.
About this article
Cite this article
Tang, Y., Cheng, G. & Xu, Z. Probabilistic and reactive fault diagnosis for dynamic overlay networks. Peer-to-Peer Netw. Appl. 4, 439–452 (2011). https://doi.org/10.1007/s12083-010-0100-4
DOI: https://doi.org/10.1007/s12083-010-0100-4