Predicting Job Failures in AuverGrid Based on Workload Log Analysis

Saadatfar, Hamid; Fadishei, Hamid; Deldari, Hossein

doi:10.1007/s00354-012-0105-z

Published: 07 February 2012

Volume 30, pages 73–94, (2012)

New Generation Computing

Hamid Saadatfar¹,
Hamid Fadishei¹ &
Hossein Deldari¹

Abstract

Grid systems are popular today because they can solve large problems in business and science. Job failures, which are inherent in any computational environment, are more common in grids because of their dynamic and complex nature. Furthermore, traditional methods of job failure recovery have proven costly, so such systems need to shift toward proactive and predictive management strategies. In this paper, we make a novel attempt to predict the fate of jobs in a production grid environment. First, we investigate the relationship between workload characteristics and job failures by analyzing workload traces of AuverGrid, which is part of the EGEE (Enabling Grids for E-science) project. After recognizing failure patterns, we predict the success or failure status of jobs during 6 months of AuverGrid activity with approximately 96% accuracy. Integrating the results of this work into management services such as scheduling and monitoring can improve the quality of service on the grid.
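
As a rough illustration of the general idea of predicting job status from workload history, the following minimal sketch estimates failure rates per (user, queue) pair from past jobs and predicts failure when the historical rate exceeds a threshold. This is not the method used in the paper: the record fields, the toy data, the 0.5 threshold, and the function names `failure_rates`, `predict`, and `accuracy` are all assumptions made for illustration, and the toy trace does not reflect the AuverGrid trace format.

```python
from collections import defaultdict

# Hypothetical trace records: (user, queue, status). The field names and
# values are illustrative assumptions, not the AuverGrid trace format.
TRACE = [
    ("u1", "short", "ok"),
    ("u1", "short", "failed"),
    ("u1", "long", "failed"),
    ("u2", "short", "ok"),
    ("u2", "short", "ok"),
    ("u2", "long", "ok"),
    ("u1", "long", "failed"),
    ("u2", "short", "failed"),
]


def failure_rates(records):
    """Return the observed failure fraction per (user, queue) pair."""
    counts = defaultdict(lambda: [0, 0])  # key -> [failures, total]
    for user, queue, status in records:
        entry = counts[(user, queue)]
        entry[0] += status == "failed"
        entry[1] += 1
    return {key: failed / total for key, (failed, total) in counts.items()}


def predict(rates, user, queue, default=False, threshold=0.5):
    """Predict failure (True) when the historical rate exceeds the threshold."""
    rate = rates.get((user, queue))
    if rate is None:
        return default  # unseen combination: fall back to a fixed guess
    return rate > threshold


def accuracy(rates, records):
    """Fraction of records whose actual status matches the prediction."""
    hits = sum(
        predict(rates, user, queue) == (status == "failed")
        for user, queue, status in records
    )
    return hits / len(records)
```

Calling `failure_rates` on an earlier slice of the trace and `accuracy` on a later slice mimics, in a very simplified way, training on past activity and evaluating on subsequent jobs.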

Author information

Authors and Affiliations

Parallel & Distributed Processing Lab, Computer Engineering Department, Ferdowsi University of Mashhad, Azadi Sq., Mashhad, Iran
Hamid Saadatfar, Hamid Fadishei & Hossein Deldari

Corresponding author

Correspondence to Hamid Saadatfar.

About this article

Cite this article

Saadatfar, H., Fadishei, H. & Deldari, H. Predicting Job Failures in AuverGrid Based on Workload Log Analysis. New Gener. Comput. 30, 73–94 (2012). https://doi.org/10.1007/s00354-012-0105-z

Received: 06 March 2010
Revised: 19 April 2011
Published: 07 February 2012
Issue Date: January 2012
DOI: https://doi.org/10.1007/s00354-012-0105-z
