Fault Detection and Localization in Distributed Systems Using Recurrent Convolutional Neural Networks

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10604)

Abstract

Early detection of faults is essential to maintaining the reliability of a distributed system. Although many fault detection solutions exist, handling the high dimensionality and uncertainty of system observations accurately remains a challenge. In this paper, we address this challenge with a two-dimensional convolutional neural network, in the form of a denoising autoencoder combined with recurrent neural networks, that performs simultaneous fault detection and diagnosis from real-time metrics of a given distributed system (e.g., CPU usage and memory consumption). The model provides a unified way to automatically learn useful features and make adaptive inferences about the onset of faults, without hand-crafted feature extraction or human diagnostic expertise. In addition, we develop a Bayesian change-point detection approach for fault localization to support the fault recovery process. We conducted extensive experiments in a real distributed environment on Amazon EC2, and the results demonstrate that our proposal outperforms a variety of state-of-the-art machine learning algorithms used for fault detection and diagnosis in distributed systems.
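To make the change-point idea concrete, the following is a minimal sketch of Bayesian online change-point detection applied to a single metric stream. It is not the paper's implementation: the abstract does not specify the formulation, so the sketch assumes a constant hazard rate, a Gaussian observation model with known variance, and a Gaussian prior on the segment mean. The function names (`bocpd_map_run_lengths`, `locate_last_change`) and all parameter defaults are hypothetical and chosen only for illustration.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def bocpd_map_run_lengths(xs, hazard=0.05, mu0=0.0, var0=10.0, obs_var=0.1):
    """Return the most probable run length after each observation.

    Assumptions (illustrative, not taken from the paper):
      - a change occurs at each step with constant probability `hazard`;
      - observations are Gaussian with known variance `obs_var`;
      - each segment mean has a Gaussian prior N(mu0, var0).
    """
    run_probs = [1.0]      # posterior over run lengths; starts at run length 0
    means = [mu0]          # posterior mean of the segment mean, per run length
    variances = [var0]     # posterior variance of the segment mean, per run length
    map_run_lengths = []

    for x in xs:
        # Predictive density of x under each run-length hypothesis.
        pred = [gaussian_pdf(x, m, v + obs_var) for m, v in zip(means, variances)]

        # Either the current run grows by one, or a change resets it to zero.
        growth = [p * q * (1.0 - hazard) for p, q in zip(run_probs, pred)]
        change = sum(p * q * hazard for p, q in zip(run_probs, pred))

        joint = [change] + growth
        total = sum(joint)
        run_probs = [j / total for j in joint]

        # Conjugate Gaussian update of each hypothesis with the new point;
        # the run-length-0 hypothesis restarts from the prior.
        new_vars = [1.0 / (1.0 / v + 1.0 / obs_var) for v in variances]
        new_means = [
            nv * (m / v + x / obs_var)
            for m, v, nv in zip(means, variances, new_vars)
        ]
        means = [mu0] + new_means
        variances = [var0] + new_vars

        best = max(range(len(run_probs)), key=run_probs.__getitem__)
        map_run_lengths.append(best)

    return map_run_lengths


def locate_last_change(xs, **kwargs):
    """Index of the first observation in the most probable current run."""
    run_lengths = bocpd_map_run_lengths(xs, **kwargs)
    return len(xs) - run_lengths[-1]


if __name__ == "__main__":
    # Synthetic metric: 30 samples near 0, then 30 samples near 4.
    metric = [0.1, -0.1, 0.05, -0.05, 0.0] * 6 + [4.1, 3.9, 4.05, 3.95, 4.0] * 6
    print(locate_last_change(metric))
```

On this toy series the level shift begins at index 30, which is the index the sketch should report. In a localization setting, one such detector could be run per node or per metric, and the earliest detected change used as a hint for where the fault originated; that usage is an assumption here, not a description of the authors' procedure.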

Notes

  1. http://insdata.org/opensource/faultpredition

  2. http://insdata.org/opensource/faultpredition

  3. https://kafka.apache.org/

  4. https://github.com/hyperic/sigar

  5. http://insdata.org/opensource/faultpredition

Acknowledgements

The research of Lina Yao and Guangyang Qi was funded by the Defence Science Institute (DSI), an initiative of the State Government of Victoria, as part of a collaborative project between DST Group and UNSW Sydney under the DSI CERA program.

Author information

Correspondence to Guangyang Qi, Lina Yao or Anton V. Uzunov.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Qi, G., Yao, L., Uzunov, A.V. (2017). Fault Detection and Localization in Distributed Systems Using Recurrent Convolutional Neural Networks. In: Cong, G., Peng, WC., Zhang, W., Li, C., Sun, A. (eds) Advanced Data Mining and Applications. ADMA 2017. Lecture Notes in Computer Science, vol 10604. Springer, Cham. https://doi.org/10.1007/978-3-319-69179-4_3

  • DOI: https://doi.org/10.1007/978-3-319-69179-4_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-69178-7

  • Online ISBN: 978-3-319-69179-4

  • eBook Packages: Computer Science, Computer Science (R0)