A Large-Scale Study of Failures on Petascale Supercomputers

Liu, Rui-Tao; Chen, Zuo-Ning

doi:10.1007/s11390-018-1806-7

Regular Paper
Published: 26 January 2018

Volume 33, pages 24–41, (2018)

Journal of Computer Science and Technology

Rui-Tao Liu¹ & Zuo-Ning Chen²

Abstract

As supercomputers grow in scale and complexity, reliability and resilience become harder to achieve. Fault tolerance relies on several important techniques, including proactive failure avoidance based on fault prediction, reactive fault tolerance based on checkpointing, and reliability-aware scheduling. All of these depend on both qualitative and quantitative descriptions of system fault characteristics. This study analyzes the sources of failures on two typical petascale supercomputers, Sunway BlueLight (based on multi-core CPUs) and Sunway TaihuLight (based on heterogeneous many-core CPUs). It uncovers several fault characteristics and identifies previously unknown correlations among faults of the main components. Finally, the paper analyzes the failure times of the two supercomputers at various resource granularities and over different time spans, and builds a uniform multi-dimensional failure time model for petascale supercomputers.

Author information

Authors and Affiliations

State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi, 214215, China
Rui-Tao Liu
National Research Center of Parallel Computer Engineering and Technology, Beijing, 100190, China
Zuo-Ning Chen

Corresponding author

Correspondence to Rui-Tao Liu.

Electronic supplementary material

Below is the link to the electronic supplementary material.

ESM 1 (PDF 98 kb)

About this article

Cite this article

Liu, RT., Chen, ZN. A Large-Scale Study of Failures on Petascale Supercomputers. J. Comput. Sci. Technol. 33, 24–41 (2018). https://doi.org/10.1007/s11390-018-1806-7

Received: 29 July 2017
Revised: 07 December 2017
Published: 26 January 2018
Issue Date: January 2018
DOI: https://doi.org/10.1007/s11390-018-1806-7
