Analyzing and Predicting Failure in Hadoop Clusters Using Distributed Hidden Markov Model

Agrawal, Bikash; Wiktorski, Tomasz; Rong, Chunming

doi:10.1007/978-3-319-28430-9_18

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 9106)

Included in the following conference series:

Second International Conference on Cloud Computing and Big Data in Asia

Abstract

In this paper, we propose a novel approach to analyzing and predicting failures in Hadoop clusters. We enumerate several key challenges that hinder failure prediction in such systems: system heterogeneity, hidden complexity, time limitations, and scalability. First, a clustering approach is applied to group similar error sequences, which makes training of the model effective. Subsequently, Hidden Markov Models (HMMs) are used to predict failures, using the MapReduce programming framework. The effectiveness of the failure prediction algorithm is measured by precision, recall, and accuracy. Our algorithm can predict failures with an accuracy of 91% two days in advance, using 87% of the data as the training set. Although the model presented in this paper focuses on Hadoop clusters, it can be generalized to other cloud computing frameworks as well.
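As a rough illustration of how an HMM can score an error sequence, the sketch below implements the standard scaled forward algorithm for a discrete HMM and compares the likelihood of a sequence under two models. It is not the authors' implementation: the two-model comparison, the symbol encoding (0 = INFO, 1 = WARN, 2 = ERROR), the names `NORMAL_MODEL`, `FAILURE_MODEL`, and `looks_failure_prone`, and all probabilities are assumptions made up for this example. The clustering and MapReduce stages described in the abstract are omitted.

```python
import math


def forward_log_likelihood(obs, start, trans, emit):
    """Return log P(obs | model) for a discrete HMM (scaled forward algorithm).

    obs   : list of observation symbol indices
    start : start[i]    = P(first state is i)
    trans : trans[i][j] = P(next state is j | current state is i)
    emit  : emit[i][k]  = P(symbol k | state i)
    """
    n_states = len(start)
    # Initialisation: alpha_1(i) = start_i * b_i(o_1)
    alpha = [start[i] * emit[i][obs[0]] for i in range(n_states)]
    log_likelihood = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Induction: alpha_t(j) = (sum_i alpha_{t-1}(i) * a_ij) * b_j(o_t)
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(n_states)) * emit[j][obs[t]]
                for j in range(n_states)
            ]
        # Rescale at every step to avoid underflow on long sequences;
        # the log of the scale factors sums to the log-likelihood.
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")
        alpha = [a / scale for a in alpha]
        log_likelihood += math.log(scale)
    return log_likelihood


# Hypothetical parameters, invented for illustration only.
NORMAL_MODEL = {
    "start": [0.9, 0.1],
    "trans": [[0.9, 0.1], [0.5, 0.5]],
    "emit": [[0.8, 0.15, 0.05], [0.3, 0.4, 0.3]],
}
FAILURE_MODEL = {
    "start": [0.5, 0.5],
    "trans": [[0.6, 0.4], [0.2, 0.8]],
    "emit": [[0.4, 0.4, 0.2], [0.05, 0.25, 0.7]],
}


def looks_failure_prone(obs):
    """Flag a sequence when the failure model explains it better than the normal one."""
    ll_fail = forward_log_likelihood(obs, **FAILURE_MODEL)
    ll_norm = forward_log_likelihood(obs, **NORMAL_MODEL)
    return ll_fail > ll_norm
```

In practice the model parameters would be estimated from labeled log sequences (for example with Baum-Welch) rather than written by hand, and the decision rule would be tuned against the precision, recall, and accuracy metrics mentioned in the abstract.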

Author information

Authors and Affiliations

Department of Computer and Electrical Engineering, University of Stavanger, Stavanger, Norway
Bikash Agrawal, Tomasz Wiktorski & Chunming Rong

Corresponding author

Correspondence to Bikash Agrawal.

Editor information

Editors and Affiliations

School of Computer Science and Tech., Huazhong Univ. of Science and Technology, Wuhan, China
Weizhong Qiang
College of Mathematics and Computer Sci., Fuzhou University, Fuzhou, China
Xianghan Zheng
Dept. of Computer Science and Information Engineering, Chung Hua University, Hsinchu, Taiwan
Ching-Hsien Hsu

About this paper

Cite this paper

Agrawal, B., Wiktorski, T., Rong, C. (2015). Analyzing and Predicting Failure in Hadoop Clusters Using Distributed Hidden Markov Model. In: Qiang, W., Zheng, X., Hsu, CH. (eds) Cloud Computing and Big Data. CloudCom-Asia 2015. Lecture Notes in Computer Science, vol 9106. Springer, Cham. https://doi.org/10.1007/978-3-319-28430-9_18

DOI: https://doi.org/10.1007/978-3-319-28430-9_18
Published: 10 January 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-28429-3
Online ISBN: 978-3-319-28430-9
eBook Packages: Computer Science, Computer Science (R0)
