Online Black-Box Failure Prediction for Mission Critical Distributed Systems

  • Conference paper
Computer Safety, Reliability, and Security (SAFECOMP 2012)

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 7612)

Abstract

This paper introduces a novel approach to failure prediction for mission critical distributed systems whose distinctive features are that it is black-box, non-intrusive, and online. The approach combines Complex Event Processing (CEP) and Hidden Markov Models (HMM) to analyze symptoms of failures, which may appear as anomalous conditions in performance metrics identified for this purpose. The paper describes an architecture named CASPER, based on CEP and HMM, that relies only on information sniffed from the communication network of a mission critical system to predict anomalies that can lead to software failures. An instance of CASPER has been implemented, trained, and tuned to monitor a real Air Traffic Control (ATC) system. An extensive experimental evaluation of CASPER is presented. The results show (i) a very low percentage of false positives under both normal and stress conditions, and (ii) a failure prediction time long enough to allow the system to apply appropriate recovery procedures.
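As a rough illustration of the HMM half of such a pipeline, the sketch below runs a forward-algorithm filter over a stream of discretized metric symbols and flags the points at which the estimated probability of an unsafe state crosses a threshold. It is a minimal sketch under stated assumptions, not the CASPER implementation: the two hidden states, the two observation symbols, every probability value, the threshold, and the names `filter_step` and `predict_failures` are hypothetical and chosen only for illustration. In a setup like the one the abstract describes, the symbols would come from a CEP stage that aggregates sniffed network traffic into performance metrics, and the model parameters would be learned during training.

```python
from typing import Dict, List, Sequence

# Hypothetical hidden states and observation symbols (not taken from the paper).
STATES = ("safe", "unsafe")
SYMBOLS = ("normal", "anomalous")

# Made-up model parameters; a real deployment would learn these from traces.
# INITIAL is the belief held before the first observation arrives.
INITIAL: Dict[str, float] = {"safe": 0.95, "unsafe": 0.05}
TRANSITION: Dict[str, Dict[str, float]] = {
    "safe": {"safe": 0.9, "unsafe": 0.1},
    "unsafe": {"safe": 0.2, "unsafe": 0.8},
}
EMISSION: Dict[str, Dict[str, float]] = {
    "safe": {"normal": 0.9, "anomalous": 0.1},
    "unsafe": {"normal": 0.3, "anomalous": 0.7},
}


def filter_step(belief: Dict[str, float], symbol: str) -> Dict[str, float]:
    """One forward-algorithm step: propagate the belief through the
    transition model, weight it by the emission likelihood of the observed
    symbol, and renormalize so the result is a probability distribution."""
    predicted = {
        s: sum(belief[p] * TRANSITION[p][s] for p in STATES) for s in STATES
    }
    weighted = {s: predicted[s] * EMISSION[s][symbol] for s in STATES}
    total = sum(weighted.values())
    return {s: weighted[s] / total for s in STATES}


def predict_failures(symbols: Sequence[str], threshold: float = 0.6) -> List[int]:
    """Return the indices at which P(unsafe | observations so far) exceeds
    the (hypothetical) alert threshold."""
    belief = dict(INITIAL)
    alerts: List[int] = []
    for index, symbol in enumerate(symbols):
        belief = filter_step(belief, symbol)
        if belief["unsafe"] > threshold:
            alerts.append(index)
    return alerts


if __name__ == "__main__":
    stream = ["normal", "normal", "anomalous", "anomalous", "anomalous"]
    print(predict_failures(stream))
```

Because each step uses only the previous belief and the newest symbol, the filter can run online with constant memory. Lowering the threshold trades more false positives for earlier warnings, which is the same tension between false-positive rate and prediction time that the abstract reports on.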

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Baldoni, R., Lodi, G., Montanari, L., Mariotta, G., Rizzuto, M. (2012). Online Black-Box Failure Prediction for Mission Critical Distributed Systems. In: Ortmeier, F., Daniel, P. (eds) Computer Safety, Reliability, and Security. SAFECOMP 2012. Lecture Notes in Computer Science, vol 7612. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33678-2_16

  • DOI: https://doi.org/10.1007/978-3-642-33678-2_16

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-33677-5

  • Online ISBN: 978-3-642-33678-2

  • eBook Packages: Computer Science (R0)
