Abstract
Early detection of faults is essential to maintaining the reliability of a distributed system. Although many fault detection solutions exist, making accurate detections from high-dimensional and uncertain system observations remains a challenge. In this paper, we address this challenge with a two-dimensional convolutional neural network, in the form of a denoising autoencoder combined with recurrent neural networks, that performs simultaneous fault detection and diagnosis based on real-time metrics from a given distributed system (e.g. CPU usage and memory consumption). The model provides a unified way to automatically learn useful features and make adaptive inferences about the onset of faults, without hand-crafted feature extraction or human diagnostic expertise. In addition, we develop a Bayesian change-point detection approach for fault localization to support the fault recovery process. We conducted extensive experiments in a real distributed environment on Amazon EC2, and the results demonstrate that our proposal outperforms a variety of state-of-the-art machine learning algorithms used for fault detection and diagnosis in distributed systems.
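To make the change-point idea concrete, the following is a minimal sketch of Bayesian online change-point detection on a single metric stream. It is not the authors' implementation, which the abstract does not specify. It assumes a constant hazard rate, Gaussian observations with known noise variance, and a conjugate Gaussian prior on the mean. The function names, default parameters, and synthetic series are all hypothetical.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a Gaussian with the given mean and variance at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def map_run_lengths(series, hazard=0.01, noise_var=1.0,
                    prior_mean=0.0, prior_var=10.0):
    """Return the most probable run length after each observation.

    Assumptions (illustrative only): constant hazard, known noise variance,
    Gaussian prior on the unknown mean of each regime.
    """
    run_probs = [1.0]          # P(run length = r), starting with r = 0
    means = [prior_mean]       # posterior mean of the regime mean, per run length
    variances = [prior_var]    # posterior variance of the regime mean, per run length
    result = []
    for x in series:
        # Predictive probability of x under each run-length hypothesis.
        pred = [normal_pdf(x, m, v + noise_var) for m, v in zip(means, variances)]
        joint = [p * q for p, q in zip(run_probs, pred)]
        # Either a change point occurs (run length resets) or every run grows by one.
        change = hazard * sum(joint)
        growth = [(1.0 - hazard) * j for j in joint]
        run_probs = [change] + growth
        total = sum(run_probs)
        run_probs = [p / total for p in run_probs]
        # Conjugate Gaussian update of each hypothesis with the new observation.
        new_vars = [1.0 / (1.0 / v + 1.0 / noise_var) for v in variances]
        new_means = [nv * (m / v + x / noise_var)
                     for nv, m, v in zip(new_vars, means, variances)]
        means = [prior_mean] + new_means
        variances = [prior_var] + new_vars
        result.append(max(range(len(run_probs)), key=run_probs.__getitem__))
    return result


def change_points(series, **kwargs):
    """Indices at which the most probable run length drops."""
    runs = map_run_lengths(series, **kwargs)
    return [t for t in range(1, len(runs)) if runs[t] < runs[t - 1]]


# Synthetic metric: 50 samples around 0, then 50 samples around 5.
metric = [0.1 * ((i % 5) - 2) for i in range(50)]
metric += [5.0 + 0.1 * ((i % 5) - 2) for i in range(50)]
print(change_points(metric))
```

With these defaults the detector flags index 50, the first sample of the shifted regime. In a fault localization setting, one such detector could plausibly be run per metric or per node so that the earliest flagged stream hints at where a fault originated. That usage is an assumption here, not something the abstract states.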
Acknowledgements
The research of Lina Yao and Guangyang Qi was funded by the Defence Science Institute (DSI), an initiative of the State Government of Victoria, as part of a collaborative project between DST Group and UNSW Sydney under the DSI CERA program.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Qi, G., Yao, L., Uzunov, A.V. (2017). Fault Detection and Localization in Distributed Systems Using Recurrent Convolutional Neural Networks. In: Cong, G., Peng, WC., Zhang, W., Li, C., Sun, A. (eds) Advanced Data Mining and Applications. ADMA 2017. Lecture Notes in Computer Science, vol 10604. Springer, Cham. https://doi.org/10.1007/978-3-319-69179-4_3
DOI: https://doi.org/10.1007/978-3-319-69179-4_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-69178-7
Online ISBN: 978-3-319-69179-4
eBook Packages: Computer Science, Computer Science (R0)