Abstract
In Internet service fault management based on active probing, uncertainty and noise affect fault detection and diagnosis. To reduce their impact, this paper analyzes the challenges of Internet service fault management. A bipartite Bayesian network is chosen to model the dependency relationships between faults and probes, a binary symmetric channel is chosen to model noise, and a service fault management approach using active probing is proposed for such an environment. The approach consists of two phases: fault detection and fault diagnosis. In the first phase, we propose a greedy approximation probe selection algorithm (GAPSA), which selects a minimal set of probes while maintaining a high probability of fault detection. In the second phase, we propose a fault diagnosis probe selection algorithm (FDPSA), which selects probes that yield additional system information based on the symptoms observed in the first phase. To deal with the dynamic fault set caused by the fault recovery mechanism, we propose a hypothesis inference algorithm based on the fault persistent time statistic (FPTS). Simulation results demonstrate the validity and efficiency of our approach.
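The abstract does not spell out how GAPSA works, so the following is only a minimal sketch of a generic greedy probe selection heuristic under the stated model, not the authors' algorithm. It assumes a bipartite dependency map (each probe fails if any fault it covers is present), independent observation flips with probability `eps` (the binary symmetric channel), and a per-fault detection threshold `target`. All names (`greedy_probe_selection`, `detection_prob`, `probes`, `priors`) are hypothetical.

```python
from typing import Dict, List, Set


def detection_prob(fault: str, selected: List[str],
                   probes: Dict[str, Set[str]], eps: float) -> float:
    """Probability that at least one selected probe covering `fault`
    reports a failure when the fault is present.

    Assumption: each covering probe truly fails, and its observation is
    flipped independently with probability eps, so all k covering probes
    miss the fault with probability eps ** k.
    """
    k = sum(1 for p in selected if fault in probes[p])
    return 1.0 - eps ** k


def greedy_probe_selection(probes: Dict[str, Set[str]],
                           priors: Dict[str, float],
                           eps: float, target: float) -> List[str]:
    """Greedily add probes until every fault is detected with
    probability >= target, or no remaining probe helps."""

    def shortfall(sel: List[str]) -> float:
        # Prior-weighted gap between the target and the achieved
        # detection probability, summed over all faults.
        return sum(
            priors[f] * max(0.0, target - detection_prob(f, sel, probes, eps))
            for f in priors
        )

    selected: List[str] = []
    remaining = set(probes)
    current = shortfall(selected)
    while current > 1e-12 and remaining:
        # Pick the probe that leaves the smallest shortfall
        # (sorted() makes tie-breaking deterministic).
        best = min(sorted(remaining), key=lambda p: shortfall(selected + [p]))
        new = shortfall(selected + [best])
        if new >= current - 1e-12:
            break  # no remaining probe improves detection
        selected.append(best)
        remaining.discard(best)
        current = new
    return selected


# Hypothetical example data: four probes over three faults.
example_probes = {
    "p1": {"f1", "f2"},
    "p2": {"f2", "f3"},
    "p3": {"f3"},
    "p4": {"f1", "f2", "f3"},
}
example_priors = {"f1": 0.1, "f2": 0.1, "f3": 0.1}

if __name__ == "__main__":
    print(greedy_probe_selection(example_probes, example_priors,
                                 eps=0.1, target=0.85))
```

A greedy rule like this gives no optimality guarantee for the probe set size. It also shows why noise raises probing cost: as `target` approaches 1, each fault must be covered by more probes to offset observation flips.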
Additional information
Supported by the National Basic Research Program of China (973 Program) (Grant No. 2003CB314806), the National High-Tech Research & Development Program of China (863 Program) (Grant Nos. 2007AA12Z321 and 2007AA01Z206), and the National Natural Science Foundation of China (Grant Nos. 60603060, 60502037 and 90604019)
Cite this article
Chu, L., Zou, S., Cheng, S. et al. Active probing based Internet service fault management in uncertain and noisy environment. Sci. China Ser. F-Inf. Sci. 51, 1857–1870 (2008). https://doi.org/10.1007/s11432-008-0143-9
DOI: https://doi.org/10.1007/s11432-008-0143-9