ABSTRACT
Fault tolerance remains a key challenge for current high performance computing systems. Scheduling mitigation methods effectively and efficiently continues to be a critical issue because error rates on many systems are dynamic and difficult to predict. Using failure data from the Astra supercomputer, we examine the efficacy of a simple method for determining whether a sliding window of recent failures contains an unusual pattern of errors. Specifically, we investigate using Benford’s Law to predict the likelihood that the system is currently in a period of unusual failure occurrences. While still in its initial stages, this work provides a critical analysis of failure status for extreme-scale systems and a simple form of prediction for determining when the current schedule of failure mitigation may be suboptimal and should be reevaluated because of the unusual pattern of errors.
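As a rough illustration of the idea in the abstract, the sketch below checks whether the leading digits of the values in a sliding window follow Benford's Law, which gives the probability of leading digit d as log10(1 + 1/d). Several choices here are assumptions made for illustration and are not taken from the paper: the quantity tested is the gap between consecutive failure timestamps, the goodness-of-fit measure is Pearson's chi-squared statistic at the 0.05 level, and the function names and the window size of 100 are hypothetical.

```python
import math
from collections import Counter

# Benford's Law: P(d) = log10(1 + 1/d) for leading digit d in 1..9.
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Chi-squared critical value for 8 degrees of freedom at alpha = 0.05.
# (Assumed significance level; the paper's threshold may differ.)
CHI2_CRIT_DF8_05 = 15.507


def leading_digit(x):
    """Return the first significant digit of a positive number."""
    # Scientific notation with 17 significant digits puts that digit first.
    return int(f"{x:.16e}"[0])


def chi2_vs_benford(values):
    """Pearson chi-squared statistic of leading digits against Benford's Law."""
    digits = [leading_digit(v) for v in values if v > 0]
    n = len(digits)
    if n == 0:
        raise ValueError("need at least one positive value")
    counts = Counter(digits)
    return sum((counts.get(d, 0) - n * p) ** 2 / (n * p)
               for d, p in BENFORD.items())


def unusual_windows(timestamps, window=100):
    """Yield (end_index, statistic, is_unusual) for each sliding window.

    Hypothetical helper: the tested values are the gaps between
    consecutive failure timestamps, which is an assumption.
    """
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    for end in range(window, len(gaps) + 1):
        stat = chi2_vs_benford(gaps[end - window:end])
        yield end, stat, stat > CHI2_CRIT_DF8_05
```

A window whose statistic exceeds the critical value would be treated as a candidate period of unusual failures, which is the point at which a mitigation schedule might be revisited. Note that the chi-squared test becomes unreliable when a window holds only a few failures.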
Index Terms
- Using Benford's Law to Identify Unusual Failure Regions