Skip to main content
Log in

A Large-Scale Study of Failures on Petascale Supercomputers

  • Regular Paper
  • Published:
Journal of Computer Science and Technology Aims and scope Submit manuscript

Abstract

With the rapid development of supercomputers, the scale and complexity are ever increasing, and the reliability and resilience are faced with larger challenges. There are many important technologies in fault tolerance, such as proactive failure avoidance technologies based on fault prediction, reactive fault tolerance based on checkpoint, and scheduling technologies to improve reliability. Both qualitative and quantitative descriptions on characteristics of system faults are very critical for these technologies. This study analyzes the source of failures on two typical petascale supercomputers called Sunway BlueLight (based on multi-core CPUs) and Sunway TaihuLight (based on heterogeneous manycore CPUs). It uncovers some interesting fault characteristics and finds unknown correlation relationship among main components’ faults. Finally the paper analyzes the failure time of the two supercomputers in various grains of resource and different time spans, and builds a uniform multi-dimensional failure time model for petascale supercomputers.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Similar content being viewed by others

References

  1. Cappello F. Resilience: One of the main challenges for exascale computing. Technical Report of the INRIA-Illinois Joint Laboratory, 2011.

  2. Kusnezov D, Binkley S, Harrod B, Meisner B. DOE exascale initiative. Technical Report of US Department of Energy (DOE), 2013. https://energy.gov/downloads/doe-exascaleinitiative, Dec. 2017.

  3. Kogge P, Bergman K, Borkar S et al. Exascale computing study: Technology challenges in achieving exascale systems. 2008. http://www.cse.nd.edu/Reports/2008/TR-2008-13.pdf, Dec. 2017.

  4. Schroeder B, Gibson G A. A large-scale study of failures in high-performance computing systems. IEEE Transactions on Dependable and Secure Computing, 2010 7(4): 337-350

  5. Liang Y, Zhang Y, Jette M, Sivasubramaniam A, Sahoo R. BlueGene/L failure analysis and prediction models. In Proc. the 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), June 2006, pp.425-434.

  6. Zheng Z, Lan Z, Park B H et al. System log pre-processing to improve failure prediction. In Proc. IEEE/IFIP International Conference Dependable Systems and Networks, June 29-July 2, 2009.

  7. Zheng Z, Yu L, Tang W et al. Co-analysis of RAS log and job log on Blue Gene/P. In Proc. the 2011 IEEE International Parallel & Distributed Processing Symposium, May 2011 pp.840-851.

  8. Zheng Z, Lan Z. Reliability-aware scalability models for high performance computing. In Proc. IEEE International Conference Cluster Computing and Workshops, Aug. 31-Sept. 4, 2009.

  9. Heien E, LaPine D, Kondo D et al. Modeling and tolerating heterogeneous failures in large parallel systems. In Proc. the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, Nov. 2011, Article No. 45.

  10. Nie B, Tiwari D, Gupta S et al. A large-scale study of softerrors on GPUs in the field. In Proc. the 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), March 2016, pp.519-530.

  11. Schroeder B, Pinheiro E, Weber W. DRAM errors in the wild: A large-scale field study. In Proc. the 11th International Joint Conference on Measurement and Modeling of Computer Systems, June 2009, pp.193-204.

  12. Pinheiro E, Weber W, Barroso L A. Failure trends in a large disk drive population. In Proc. the 5th USENIX Conference on File and Storage Technologies, February 2007, pp.17-28.

  13. Gunawi H S, Hao M, Suminto R O et al. Why does the cloud stop computing?: Lessons from hundreds of service outages. In Proc. the 7th ACM Symposium on Cloud Computing, October 2016, pp.1-16.

  14. Gunawi H S, Hao M, Leesatapornwongsa T et al. What bugs live in the cloud? A study of 3000+ issues in cloud systems. In Proc. the ACM Symposium on Cloud Computing, November 2014, pp.1-14.

  15. Huang P, Guo C, Zhou L et al. Gray failure: The Achilles’ heel of cloud-scale systems. In Proc. the 16th Workshop on Hot Topics in Operating Systems, May 2017, pp.150-155.

  16. Zheng Z, Lan Z, Gupta R et al. A practical failure prediction with location and lead time for Blue Gene/P. In Proc. the 2010 International Conference Dependable Systems and Networks Workshops (DSN-W), June 28-July 1, 2010.

  17. Sahoo R K, Oliner A J, Rish I et al. Critical event prediction for proactive management in large-scale computer clusters. In Proc. the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2003, pp.426-435.

  18. Gu J, Zheng Z, Lan Z et al. Dynamic meta-learning for failure prediction in large-scale systems: A case study. In Proc. the International Conference on Parallel Processing, Sept. 2008.

  19. Gainaru A, Cappello F, Snir M et al. Fault prediction under the microscope: A closer look into HPC systems. In Proc. the International Conference on High Performance Computing, Networking, Storage and Analysis, November 2012, Article No. 77.

  20. Lu X, Wang H Q, Zhou R J et al. Autonomic failure prediction based on manifold learning for large-scale distributed systems. The Journal of China Universities of Posts and Telecommunications, 2010, 17(4): 116-124.

    Article  Google Scholar 

  21. Srikant R, Agrawal R. Mining sequential patterns: Generalizations and performance improvements. In Lecture Notes in Computer Science 1057, Apers P, Bouzeghoub M, Gardarin G (eds.), June 2005.

  22. Mannila H, Toivonen H, Verkamo A I. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1997, 1(3): 259-289.

    Article  Google Scholar 

  23. Joshi M, Karypis G, Kumar V. A universal formulation of sequential patterns. Technical Report, No.99-021, University of Minnesota. https://www.cs.umn.edu/research/technical reports/view/99-021, Dec. 2017.

  24. Fournier-Viger P,Wu CW, Tseng V S et al. Mining sequential rules common to several sequences with the window size constraint. In Proc. the 25th Conference on Advances in Artificial Intelligence, May 2012, pp.299-304.

  25. Fournier-Viger P, Wu C W, Tseng V S et al. Mining partially-ordered sequential rules common to multiple sequences. IEEE Transactions on Knowledge and Data Engineering, 27(8): 2203-2216.

  26. Zhang Z. Reliability Theory and Engineering Application. Beijing: Science Press, 2012. (in Chinese)

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Rui-Tao Liu.

Electronic supplementary material

Below is the link to the electronic supplementary material.

ESM 1

(PDF 98 kb)

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Liu, RT., Chen, ZN. A Large-Scale Study of Failures on Petascale Supercomputers. J. Comput. Sci. Technol. 33, 24–41 (2018). https://doi.org/10.1007/s11390-018-1806-7

Download citation

  • Received:

  • Revised:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11390-018-1806-7

Keywords

Navigation