Fault Detection and Localization in Distributed Systems Using Recurrent Convolutional Neural Networks

Qi, Guangyang; Yao, Lina; Uzunov, Anton V.

doi:10.1007/978-3-319-69179-4_3

Conference paper
First Online: 14 October 2017

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10604)

Abstract

Early detection of faults is essential to maintaining the reliability of a distributed system. Although many fault-detection solutions exist, handling the high dimensionality and uncertainty of system observations well enough to detect faults accurately remains a challenge. In this paper, we address this challenge with a two-dimensional convolutional neural network, structured as a denoising autoencoder and combined with recurrent neural networks, that performs simultaneous fault detection and diagnosis based on real-time metrics from a given distributed system (e.g., CPU usage and memory consumption). The model provides a unified way to learn useful features automatically and to make adaptive inferences about the onset of faults, without hand-crafted feature extraction or human diagnostic expertise. In addition, we develop a Bayesian change-point detection approach for fault localization to support the fault recovery process. We conducted extensive experiments in a real distributed environment on Amazon EC2; the results demonstrate that our proposal outperforms a variety of state-of-the-art machine learning algorithms used for fault detection and diagnosis in distributed systems.
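
The abstract does not say how the Bayesian change-point detector is constructed. As a rough illustration only, the sketch below applies a generic online run-length recursion to a single metric, assuming a constant hazard rate and Gaussian segments with unknown mean and known variance. The function name `online_changepoints`, every parameter value, and the synthetic signal are assumptions made for this example; they are not the authors' method or data.

```python
import numpy as np


def gaussian_pdf(x, mean, var):
    """Density of N(mean, var) at x, elementwise over arrays."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)


def online_changepoints(series, hazard=0.01, prior_mean=0.0,
                        prior_var=100.0, obs_var=0.25):
    """Return (change_indices, map_run_lengths) for a 1-D metric series.

    Assumed model (not from the paper): each segment is Gaussian with an
    unknown mean and known variance obs_var, and a new segment starts at
    every step with constant probability `hazard`.
    """
    run_probs = np.array([1.0])        # P(run length = r | data so far)
    means = np.array([prior_mean])     # posterior mean of the level, per r
    variances = np.array([prior_var])  # posterior variance of the level, per r
    changes, map_runs = [], []
    prev_map = 0

    for t, x in enumerate(series):
        # Predictive density of x under each run-length hypothesis.
        pred = gaussian_pdf(x, means, variances + obs_var)
        weighted = run_probs * pred
        growth = weighted * (1.0 - hazard)  # run continues: r -> r + 1
        restart = weighted.sum() * hazard   # run resets:    r -> 0
        run_probs = np.concatenate(([restart], growth))
        run_probs /= run_probs.sum()

        # Conjugate Gaussian update of every hypothesis with x.
        post_var = 1.0 / (1.0 / variances + 1.0 / obs_var)
        post_mean = post_var * (means / variances + x / obs_var)
        means = np.concatenate(([prior_mean], post_mean))
        variances = np.concatenate(([prior_var], post_var))

        # A drop in the most probable run length signals a new segment.
        map_run = int(np.argmax(run_probs))
        if map_run < prev_map:
            changes.append(t - map_run + 1)  # first sample of the new segment
        map_runs.append(map_run)
        prev_map = map_run

    return changes, map_runs


# Synthetic CPU-like metric: level near 0 for 50 samples, then near 5.
steps = np.arange(100)
signal = 0.1 * np.sin(steps) + np.where(steps < 50, 0.0, 5.0)
changes, map_runs = online_changepoints(signal)
print(changes)
```

On this synthetic series the most probable run length grows by one per sample and resets when the level shifts at index 50. In a fault-localization setting, a reset of this kind on a per-node metric would be one hypothetical way to estimate when, and on which node, abnormal behaviour began.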

Acknowledgements

The research of Lina Yao and Guangyang Qi was funded by the Defence Science Institute (DSI), an initiative of the State Government of Victoria, as part of a collaborative project between DST Group and UNSW Sydney under the DSI CERA program.

Author information

Authors and Affiliations

University of New South Wales, Sydney, Australia
Guangyang Qi & Lina Yao
Defence Science and Technology Group, Adelaide, Australia
Anton V. Uzunov

Corresponding authors

Correspondence to Guangyang Qi, Lina Yao or Anton V. Uzunov.

Editor information

Editors and Affiliations

Nanyang Technological University, Singapore, Singapore
Gao Cong
National Chiao Tung University, Hsinchu, Taiwan
Wen-Chih Peng
Macquarie University, Sydney, New South Wales, Australia
Wei Emma Zhang
Wuhan University, Wuhan, China
Chengliang Li
Nanyang Technological University, Singapore, Singapore
Aixin Sun

About this paper

Cite this paper

Qi, G., Yao, L., Uzunov, A.V. (2017). Fault Detection and Localization in Distributed Systems Using Recurrent Convolutional Neural Networks. In: Cong, G., Peng, WC., Zhang, W., Li, C., Sun, A. (eds) Advanced Data Mining and Applications. ADMA 2017. Lecture Notes in Computer Science, vol 10604. Springer, Cham. https://doi.org/10.1007/978-3-319-69179-4_3

DOI: https://doi.org/10.1007/978-3-319-69179-4_3
Published: 14 October 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-69178-7
Online ISBN: 978-3-319-69179-4
eBook Packages: Computer Science, Computer Science (R0)
