DOI: 10.1145/3318216.3363305
research-article

Infrastructure fault detection and prediction in edge cloud environments

Published: 07 November 2019

Abstract

As an emerging component of 5G systems, the edge cloud has become one of the key enablers for services such as mission-critical, IoT, and content delivery applications. However, because fail-over mechanisms in edge clouds are limited, faults (e.g., CPU or HDD faults) are highly undesirable. When infrastructure faults occur in edge clouds, they can accumulate and propagate, leading to severe degradation of system and application performance. It is therefore crucial to identify these faults early and mitigate them. In this paper, we propose a framework to detect and predict several faults at the infrastructure level of edge clouds using supervised machine learning and statistical techniques. The proposed framework is composed of three main components responsible for (1) data pre-processing, (2) fault detection, and (3) fault prediction. The results show that the framework can detect and predict several faults online in a timely manner. For instance, using Support Vector Machine (SVM), Random Forest (RF), and Neural Network (NN) models, the framework detects non-fatal CPU and HDD overload faults with an F1 score of more than 95%. For prediction, the Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) models have comparable accuracy: 96.47% vs. 96.88% for the CPU-overload fault and 85.52% vs. 88.73% for the network fault.
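The abstract reports detection quality as an F1 score over windows of infrastructure metrics. The following is a minimal sketch of how such a score could be computed. It is not the authors' pipeline. The CPU series, the window width, the labeling rule, and the mean-threshold detector (a placeholder for the trained SVM, RF, and NN models) are all assumptions made for illustration.

```python
from statistics import mean


def sliding_windows(series, width):
    """Split a metric series into overlapping windows of `width` samples."""
    return [series[i:i + width] for i in range(len(series) - width + 1)]


def detect_overload(window, threshold=90.0):
    """Placeholder detector (assumption): flag a window whose mean exceeds
    `threshold`. The paper uses trained SVM, RF, and NN models instead."""
    return mean(window) > threshold


def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when there are no positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Synthetic CPU utilization in percent, invented for this example.
cpu = [35, 40, 38, 92, 97, 99, 96, 45, 41, 39]
windows = sliding_windows(cpu, width=3)

# Hypothetical ground truth: a window is faulty if at least two samples are >= 90.
labels = [sum(v >= 90 for v in w) >= 2 for w in windows]
preds = [detect_overload(w) for w in windows]

score = f1_score(labels, preds)
print(f"F1 = {score:.3f}")
```

On this toy series the placeholder detector misses the windows at the edges of the overload period, which lowers recall and therefore F1. A trained classifier would replace `detect_overload` while the windowing and scoring steps stay the same.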


Published In

SEC '19: Proceedings of the 4th ACM/IEEE Symposium on Edge Computing
November 2019
455 pages
ISBN:9781450367332
DOI:10.1145/3318216
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

In-Cooperation

  • IEEE-CS\DATC: IEEE Computer Society

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article

Conference

SEC '19: The Fourth ACM/IEEE Symposium on Edge Computing
November 7 - 9, 2019
Arlington, Virginia

Acceptance Rates

SEC '19 Paper Acceptance Rate 20 of 59 submissions, 34%;
Overall Acceptance Rate 40 of 100 submissions, 40%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 92
  • Downloads (Last 6 weeks): 11
Reflects downloads up to 08 Mar 2025

Citations

Cited By

  • (2024) Context-Aware Fault Classification for Multi-Access Edge Computing. IEEE Transactions on Network and Service Management 21:6 (6290-6300). DOI: 10.1109/TNSM.2024.3438828. Online publication date: Dec-2024
  • (2024) Labeling Cloud Metrics Data for Fault Detection in Cloud Using Active Learning With Test Suite. IEEE Transactions on Network and Service Management 21:3 (2837-2853). DOI: 10.1109/TNSM.2024.3355310. Online publication date: Jun-2024
  • (2024) A Dynamic Trusted Monitoring Method for Cloud Applications Based on CCA. 2024 9th International Conference on Signal and Image Processing (ICSIP) (346-350). DOI: 10.1109/ICSIP61881.2024.10671423. Online publication date: 12-Jul-2024
  • (2024) EINS: Edge-Cloud Deep Model Inference with Network-Efficiency Schedule in Serverless. 2024 27th International Conference on Computer Supported Cooperative Work in Design (CSCWD) (1376-1381). DOI: 10.1109/CSCWD61410.2024.10580052. Online publication date: 8-May-2024
  • (2024) Self-organising Approach to Anomaly Mitigation in the Cloud-to-Edge Continuum. Cooperative Information Systems (263-279). DOI: 10.1007/978-3-031-81375-7_15. Online publication date: 20-Nov-2024
  • (2023) APRENDIZADO DE MÁQUINA EM AMBIENTES HOSPITALARES: UM ESTUDO DE ANÁLISE DE TENDÊNCIAS DE SOBRECARGA EM SISTEMAS DE TECNOLOGIAS DA INFORMAÇÃO E COMUNICAÇÃO. Revista Contemporânea 3:9 (15866-15893). DOI: 10.56083/RCV3N9-127. Online publication date: 27-Sep-2023
  • (2023) An Empirical Study of Resource-Stressing Faults in Edge-Computing Applications. Proceedings of the 6th International Workshop on Edge Systems, Analytics and Networking (54-59). DOI: 10.1145/3578354.3592873. Online publication date: 8-May-2023
  • (2023) Prioritized Fault Recovery Strategies for Multi-Access Edge Computing Using Probabilistic Model Checking. IEEE Transactions on Dependable and Secure Computing 20:1 (797-812). DOI: 10.1109/TDSC.2022.3143877. Online publication date: 1-Jan-2023
  • (2023) Data Labeling for Fault Detection in Cloud: A Test Suite-Based Active Learning Approach. 2023 IEEE 9th International Conference on Network Softwarization (NetSoft) (262-266). DOI: 10.1109/NetSoft57336.2023.10175492. Online publication date: 19-Jun-2023
  • (2023) Cooperative Resource Allocation for Computation-Intensive IIoT Applications in Aerial Computing. IEEE Internet of Things Journal 10:11 (9295-9307). DOI: 10.1109/JIOT.2022.3222340. Online publication date: 1-Jun-2023
