Abstract
This paper introduces a novel approach to failure prediction for mission-critical distributed systems that is black-box, non-intrusive, and online. The approach combines Complex Event Processing (CEP) and Hidden Markov Models (HMM) to analyze symptoms of impending failures, which appear as anomalous values of performance metrics selected for this purpose. The paper describes an architecture named CASPER, based on CEP and HMM, that relies only on information sniffed from the communication network of a mission-critical system to predict anomalies that can lead to software failures. An instance of CASPER has been implemented, trained, and tuned to monitor a real Air Traffic Control (ATC) system, and an extensive experimental evaluation of it is presented. The results show (i) a very low percentage of false positives under both normal and stress conditions, and (ii) a failure prediction time high enough to allow the system to apply appropriate recovery procedures.
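To make the HMM side of the approach concrete, the following is a minimal sketch of online forward filtering over a stream of discretized symptoms. It is not CASPER's implementation: the two hidden states, the `normal`/`anomalous` symbols, every probability, the alert threshold, and the function names are illustrative assumptions. The symbols are assumed to come from an upstream step (for example, a CEP engine comparing network-level performance metrics against thresholds).

```python
# Illustrative two-state HMM. All names and numbers below are assumptions
# chosen for the example; none are taken from the paper.
STATES = ("safe", "unsafe")
INITIAL = {"safe": 0.95, "unsafe": 0.05}
TRANSITION = {
    "safe": {"safe": 0.9, "unsafe": 0.1},
    "unsafe": {"safe": 0.2, "unsafe": 0.8},
}
EMISSION = {
    "safe": {"normal": 0.9, "anomalous": 0.1},
    "unsafe": {"normal": 0.3, "anomalous": 0.7},
}


def forward_filter(observations):
    """Return P(hidden state | observations so far) after each observation."""
    belief = dict(INITIAL)
    history = []
    for step, symbol in enumerate(observations):
        if step > 0:
            # Prediction: push the previous belief through the transition model.
            belief = {
                s: sum(belief[p] * TRANSITION[p][s] for p in STATES)
                for s in STATES
            }
        # Update: weight each state by how likely it is to emit this symbol,
        # then normalize so the belief sums to one.
        weighted = {s: belief[s] * EMISSION[s][symbol] for s in STATES}
        total = sum(weighted.values())
        belief = {s: w / total for s, w in weighted.items()}
        history.append(belief)
    return history


def first_alert(observations, threshold=0.8):
    """Index of the first observation at which P(unsafe) reaches the threshold."""
    for index, belief in enumerate(forward_filter(observations)):
        if belief["unsafe"] >= threshold:
            return index
    return None


if __name__ == "__main__":
    stream = ["normal", "normal", "anomalous", "anomalous", "anomalous"]
    for symbol, belief in zip(stream, forward_filter(stream)):
        print(symbol, round(belief["unsafe"], 3))
```

Because the belief is updated one observation at a time, a filter of this kind can run online. In this toy model, the threshold trades false positives against how early an alert is raised.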
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Baldoni, R., Lodi, G., Montanari, L., Mariotta, G., Rizzuto, M. (2012). Online Black-Box Failure Prediction for Mission Critical Distributed Systems. In: Ortmeier, F., Daniel, P. (eds) Computer Safety, Reliability, and Security. SAFECOMP 2012. Lecture Notes in Computer Science, vol 7612. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33678-2_16
DOI: https://doi.org/10.1007/978-3-642-33678-2_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33677-5
Online ISBN: 978-3-642-33678-2
eBook Packages: Computer Science, Computer Science (R0)