Analyzing and Predicting Failure in Hadoop Clusters Using Distributed Hidden Markov Model

  • Conference paper
Cloud Computing and Big Data (CloudCom-Asia 2015)

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 9106)

Abstract

In this paper, we propose a novel approach to analyzing and predicting failures in Hadoop clusters. We enumerate several key challenges that hinder failure prediction in such systems: heterogeneity of the system, hidden complexity, time limitation, and scalability. First, a clustering approach is applied to group similar error sequences, which makes training of the model effective. Subsequently, Hidden Markov Models (HMMs) are used to predict failures, using the MapReduce programming framework. The effectiveness of the failure prediction algorithm is measured by precision, recall, and accuracy metrics. Our algorithm can predict failures with an accuracy of 91% two days in advance, using 87% of the data as the training set. Although the model presented in this paper focuses on Hadoop clusters, it can be generalized to other cloud computing frameworks as well.
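The three evaluation metrics named in the abstract can be illustrated with a minimal Python sketch. The function name `prediction_metrics` and the boolean label lists are assumptions made for illustration; the paper's own evaluation code is not shown on this page, and the sample labels are not the paper's data.

```python
def prediction_metrics(predicted, actual):
    """Return (precision, recall, accuracy) for boolean failure labels.

    predicted[i] is True when a failure was predicted for window i;
    actual[i] is True when a failure really occurred in that window.
    Both the windowing and the names are hypothetical.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")

    # Count the four confusion-matrix cells.
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)

    # Guard against empty denominators by returning 0.0 in those cases.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return precision, recall, accuracy


# Made-up labels for five prediction windows.
predicted = [True, True, False, False, True]
actual = [True, False, False, True, True]
precision, recall, accuracy = prediction_metrics(predicted, actual)
```

Precision is the fraction of predicted failures that really happened, recall is the fraction of real failures that were predicted, and accuracy is the fraction of all windows labeled correctly.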

Author information

Corresponding author

Correspondence to Bikash Agrawal.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Agrawal, B., Wiktorski, T., Rong, C. (2015). Analyzing and Predicting Failure in Hadoop Clusters Using Distributed Hidden Markov Model. In: Qiang, W., Zheng, X., Hsu, CH. (eds) Cloud Computing and Big Data. CloudCom-Asia 2015. Lecture Notes in Computer Science, vol 9106. Springer, Cham. https://doi.org/10.1007/978-3-319-28430-9_18

  • DOI: https://doi.org/10.1007/978-3-319-28430-9_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-28429-3

  • Online ISBN: 978-3-319-28430-9

  • eBook Packages: Computer Science (R0)
