Probabilistic and reactive fault diagnosis for dynamic overlay networks

Peer-to-Peer Networking and Applications

Abstract

Overlay networks have emerged as a powerful and flexible platform for developing new, disruptive network applications. The same characteristics that make overlay networks attractive, such as routing flexibility and dynamic overlay topologies, create new challenges for overlay fault diagnosis: dynamic correlation between overlay symptoms and faults, multi-layer (i.e., underlay vs. overlay) abstraction, and unregulated overlay symptoms. To address these challenges, we propose ProFis, a novel user-level probabilistic and reactive fault diagnosis technique for overlay networks that seamlessly integrates passive and active fault reasoning into an optimal fault diagnosis framework. ProFis uses the observable symptoms reported by overlay applications to dynamically correlate overlay symptoms and faults. It diagnoses overlay faults passively and, whenever necessary, selects optimal (i.e., least-cost) actions to enhance the passive diagnosis. Our evaluation study shows that ProFis can localize the root causes of overlay faults efficiently (i.e., with low latency) and accurately, even when the symptom loss rate is high.
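
The interplay of passive and active reasoning described above can be illustrated with a minimal sketch. It is not the ProFis algorithm, whose details are not given in this abstract. It assumes a single active fault, a naive-Bayes symptom-fault model, and a fixed confidence threshold; the names `Action`, `posterior`, `cheapest_action`, `diagnose`, the leak probability of 0.01, and all numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A hypothetical active probe that can confirm or refute one symptom."""
    name: str
    symptom: str
    cost: float


def posterior(priors, likelihood, observed):
    """Passive step: single-fault Bayesian posterior from observed symptoms.

    Only observed symptoms are scored, so a lost (unreported) symptom does
    not count against a fault. Unmodelled fault-symptom pairs get a small
    leak probability (an assumption of this sketch).
    """
    scores = {}
    for fault, prior in priors.items():
        p = prior
        for symptom in observed:
            p *= likelihood.get((fault, symptom), 0.01)
        scores[fault] = p
    total = sum(scores.values())
    return {fault: score / total for fault, score in scores.items()}


def cheapest_action(actions, likelihood, fault, observed):
    """Active step: least-cost action that checks an unobserved symptom of `fault`."""
    candidates = [
        a for a in actions
        if a.symptom not in observed and (fault, a.symptom) in likelihood
    ]
    return min(candidates, key=lambda a: a.cost, default=None)


def diagnose(priors, likelihood, observed, actions, threshold=0.9):
    """Return (best fault, its posterior, action to run or None)."""
    post = posterior(priors, likelihood, observed)
    best = max(post, key=post.get)
    if post[best] >= threshold:
        return best, post[best], None  # passive diagnosis is confident enough
    return best, post[best], cheapest_action(actions, likelihood, best, observed)


# Hypothetical model: two candidate faults that share symptom "s1".
priors = {"link_AB": 0.5, "node_C": 0.5}
likelihood = {
    ("link_AB", "s1"): 0.9, ("link_AB", "s2"): 0.8,
    ("node_C", "s1"): 0.9, ("node_C", "s3"): 0.7,
}
actions = [
    Action("probe_s2", "s2", 2.0),
    Action("probe_s2_cheap", "s2", 1.0),
    Action("probe_s3", "s3", 0.5),
]

# "s1" alone cannot separate the two faults, so an action is requested.
fault, confidence, action = diagnose(priors, likelihood, {"s1"}, actions)

# After the action confirms "s2", passive reasoning alone is sufficient.
fault2, confidence2, action2 = diagnose(priors, likelihood, {"s1", "s2"}, actions)
```

In this sketch an action is requested only when the passive posterior falls below the threshold, which mirrors the "whenever necessary" condition in the abstract; how ProFis itself defines confidence and action cost is not specified here.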

Acknowledgements

This research was supported in part by the National Grand Fundamental Research 973 Program of China under Grant No. 2009CB320505, and the National Natural Science Foundation of China under Grant No. 60973123.

Author information

Corresponding author

Correspondence to Yongning Tang.

About this article

Cite this article

Tang, Y., Cheng, G. & Xu, Z. Probabilistic and reactive fault diagnosis for dynamic overlay networks. Peer-to-Peer Netw. Appl. 4, 439–452 (2011). https://doi.org/10.1007/s12083-010-0100-4

  • DOI: https://doi.org/10.1007/s12083-010-0100-4
