Failure prediction of tasks in the cloud at an earlier stage: a solution based on domain information mining

Computing

Abstract

In a large-scale data center, it is vital to recognize the termination statuses of applications accurately at an early stage. In recent years, many machine learning techniques have been applied to this problem, since early prediction helps optimize the scheduling policy and improve the efficiency of resource utilization. However, if an application’s dynamic information is insufficient at the early stage, the generalization performance of the machine learning model degrades, and the prediction accuracy can be low. To overcome this problem, this paper proposes a novel failure prediction method that uses the association relationships between similar jobs to jointly predict tasks’ termination statuses at an earlier stage. Among similar jobs whose tasks show similar patterns of change in consumed resources, an inherent structural correlation may exist, and this correlation information is significant for improving the prediction model’s generalization performance. First, a job clustering algorithm is proposed for identifying highly similar jobs among jobs that have varying numbers of tasks. Second, based on the job clustering results, the robust multi-task learning algorithm is introduced to exploit the domain information among jobs (i.e., the interactions among jobs with respect to the termination statuses of their tasks). Experiments are conducted on a Google cluster workload trace dataset. The results show that, at 1/3 of the tasks’ running time, the proposed method achieves higher prediction accuracy, a lower misjudgment rate, and higher predictive stability than several state-of-the-art methods.
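The first step described in the abstract, grouping jobs that have different numbers of tasks by how their early resource consumption behaves, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names (`early_window`, `job_signature`, `kmeans`), the toy CPU-usage samples, the mean/standard-deviation signature, and the use of plain k-means as a stand-in. It is not the job clustering algorithm proposed in the paper, and the robust multi-task learning step is not shown.

```python
from statistics import mean, pstdev

def early_window(samples, parts=3):
    """Return the first 1/parts of a task's usage samples (at least one)."""
    return samples[:max(1, len(samples) // parts)]

def job_signature(tasks, parts=3):
    """Summarize a job with any number of tasks as a fixed-length vector:
    (mean, population std) of the per-task early-window average usage."""
    early_means = [mean(early_window(t, parts)) for t in tasks]
    return (mean(early_means), pstdev(early_means))

def nearest(sig, centers):
    """Index of the center closest to sig (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(sig, centers[i])))

def kmeans(signatures, k, iters=10):
    """Plain k-means with a deterministic init (the first k signatures)."""
    centers = list(signatures[:k])
    for _ in range(iters):
        labels = [nearest(s, centers) for s in signatures]
        for c in range(k):
            members = [s for s, lab in zip(signatures, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(mean(dim) for dim in zip(*members))
    return [nearest(s, centers) for s in signatures]

# Hypothetical jobs: each job is a list of tasks, each task a list of
# CPU-usage samples over its running time. Task counts differ per job.
jobs = [
    [[0.1, 0.1, 0.2, 0.9, 0.9, 0.9], [0.2, 0.1, 0.1, 0.8, 0.9, 0.9]],
    [[0.8, 0.9, 0.9, 0.1, 0.1, 0.1], [0.9, 0.8, 0.7, 0.2, 0.1, 0.1],
     [0.9, 0.9, 0.8, 0.1, 0.2, 0.1]],
    [[0.2, 0.2, 0.3, 0.9, 0.8, 0.9]],
    [[0.9, 1.0, 0.9, 0.2, 0.1, 0.1], [0.8, 0.8, 0.9, 0.1, 0.1, 0.2]],
]

signatures = [job_signature(job) for job in jobs]
labels = kmeans(signatures, k=2)
print(labels)  # prints [0, 1, 0, 1]
```

Because each job is reduced to a fixed-length signature computed only from the first third of each task's samples, jobs with one, two, or three tasks can be compared directly. In a pipeline like the one the abstract outlines, each resulting cluster would then supply the group of related jobs whose task outcomes are predicted jointly.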

Acknowledgements

This work was supported by the National Natural Science Foundation of China (No. U1704158), the China Postdoctoral Science Foundation Special Support (No. 2016T90944), the Doctoral Research Project of Henan Normal University (No. 5101119170145), and the Science and Technology Research Project of Henan Province (No. 172102210045).

Author information

Corresponding authors

Correspondence to Chunhong Liu or Wentao Mao.

About this article

Cite this article

Liu, C., Dai, L., Lai, Y. et al. Failure prediction of tasks in the cloud at an earlier stage: a solution based on domain information mining. Computing 102, 2001–2023 (2020). https://doi.org/10.1007/s00607-020-00800-1

  • DOI: https://doi.org/10.1007/s00607-020-00800-1
