Abstract
In a large-scale data center, it is vital to recognize the termination statuses of applications precisely and at an early stage. In recent years, many machine learning techniques have been applied to this problem, because accurate early prediction helps optimize the scheduling policy and improve the efficiency of resource utilization. However, when little of the application’s dynamic information is available at the early stage, the generalization performance of the machine learning model degrades and the prediction accuracy can be low. To overcome this problem, this paper proposes a novel failure prediction method that uses the association relationships between similar jobs to jointly predict the termination statuses of tasks at an earlier stage. Among similar jobs, whose tasks show similar patterns of change in consumed resources, an inherent structural correlation may exist, and this correlation information is significant for improving the prediction model’s generalization performance. First, a job clustering algorithm is proposed to identify highly similar jobs among jobs that have different numbers of tasks. Second, based on the job clustering results, a robust multi-task learning algorithm is introduced to exploit the domain information among jobs (i.e., the interactions among jobs with respect to the termination statuses of their tasks). Experiments are conducted on a Google cluster workload trace dataset. The results show that the proposed method achieves higher prediction accuracy, a lower misjudgment rate, and higher predictive stability than several state-of-the-art methods at 1/3 of the running time of the tasks.
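The first step, grouping jobs that have different numbers of tasks, can be illustrated with a minimal sketch. It is not the paper's clustering algorithm, which the abstract does not specify. It assumes that each job is summarized by a fixed-length descriptor (mean and spread of per-task usage level and trend over the early part of the run) and swaps in plain k-means with farthest-point initialization. The names `job_descriptor`, `kmeans`, and the toy usage series are all hypothetical.

```python
from statistics import mean, pstdev

def job_descriptor(task_series):
    """Summarize a job with any number of tasks as a 4-value vector.

    task_series: one early-stage resource-usage series (list of floats)
    per task. The choice of features is an assumption for illustration.
    """
    levels = [mean(s) for s in task_series]
    trends = [(s[-1] - s[0]) / max(len(s) - 1, 1) for s in task_series]
    return [mean(levels), pstdev(levels), mean(trends), pstdev(trends)]

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20):
    """Plain k-means, a stand-in for the paper's job clustering step."""
    # Farthest-point initialization keeps the sketch deterministic.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        )
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every job to its nearest center.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        # Move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [mean(col) for col in zip(*members)]
    return labels

# Toy data: two flat-usage jobs and two ramping-usage jobs,
# each with a different number of tasks.
jobs = {
    "job_a": [[0.10, 0.12, 0.11], [0.09, 0.10, 0.12]],
    "job_b": [[0.11, 0.10, 0.12], [0.10, 0.11, 0.10], [0.12, 0.11, 0.13]],
    "job_c": [[0.20, 0.50, 0.80], [0.25, 0.55, 0.90]],
    "job_d": [[0.22, 0.48, 0.85], [0.18, 0.52, 0.79], [0.21, 0.50, 0.88]],
}
names = sorted(jobs)
labels = kmeans([job_descriptor(jobs[n]) for n in names], k=2)
clusters = dict(zip(names, labels))
```

In the method described above, the jobs placed in the same cluster would then be treated as related tasks for the multi-task learner, so that a job with little early-stage data can borrow information from its neighbors.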
Acknowledgements
This work was supported by the National Natural Science Foundation of China (No. U1704158), the China Postdoctoral Science Foundation Special Support (No. 2016T90944), the Doctoral Research Project of Henan Normal University (No. 5101119170145), and the Science and Technology Research Project of Henan Province (No. 172102210045).
Cite this article
Liu, C., Dai, L., Lai, Y. et al. Failure prediction of tasks in the cloud at an earlier stage: a solution based on domain information mining. Computing 102, 2001–2023 (2020). https://doi.org/10.1007/s00607-020-00800-1
DOI: https://doi.org/10.1007/s00607-020-00800-1