DOI: 10.1145/3624062.3624121
research-article

Using Benford's Law to Identify Unusual Failure Regions

Published: 12 November 2023

ABSTRACT

Fault tolerance remains a key challenge for current high-performance computing systems. Scheduling mitigation methods effectively and efficiently continues to be a critical issue because error rates on many systems are dynamic and difficult to predict. Using failure data from the Astra supercomputer, we examine the efficacy of a simple method for determining whether a sliding window of recent failures contains an unusual pattern of errors. Specifically, we investigate using Benford’s Law to predict the likelihood that the system is currently in a period of unusual failure occurrences. Although still in its initial stages, this work provides a critical analysis of failure status for extreme-scale systems and a simple form of prediction for determining when the current scheduling of failure mitigation may be suboptimal and should be reevaluated because of the unusual pattern of errors occurring.
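
The sliding-window idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not say which quantity's leading digits are tested or which goodness-of-fit measure is used, so the choices here are assumptions: the sketch takes the leading digits of inter-arrival times between failures, compares them with the Benford distribution P(d) = log10(1 + 1/d) using Pearson's chi-squared statistic, and flags a window when the statistic exceeds the 5% critical value for 8 degrees of freedom. The names `leading_digit`, `benford_chi2`, and `unusual_windows`, the window length, and the threshold are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

# Benford's Law: probability that the leading digit is d, for d = 1..9.
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Chi-squared critical value for 8 degrees of freedom at alpha = 0.05.
# Using this as the "unusual" threshold is an assumption of this sketch.
CHI2_CRIT_DF8_P05 = 15.507


def leading_digit(x):
    """Return the first significant digit of a positive number."""
    # Scientific notation places the first significant digit first.
    return int(f"{x:.15e}"[0])


def benford_chi2(values):
    """Pearson chi-squared statistic of leading digits against Benford."""
    digits = [leading_digit(v) for v in values if v > 0]
    n = len(digits)
    if n == 0:
        raise ValueError("need at least one positive value")
    counts = Counter(digits)
    return sum(
        (counts.get(d, 0) - n * p) ** 2 / (n * p) for d, p in BENFORD.items()
    )


def unusual_windows(timestamps, window=100, threshold=CHI2_CRIT_DF8_P05):
    """Yield (end_index, statistic, flagged) for each sliding window.

    timestamps: failure times in increasing order (any consistent unit).
    Each window covers `window` consecutive inter-arrival times.
    """
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    for end in range(window, len(gaps) + 1):
        stat = benford_chi2(gaps[end - window:end])
        yield end, stat, stat > threshold
```

Calling `list(unusual_windows(times, window=100))` on a sorted list of failure timestamps returns one tuple per window position; a run of flagged windows would suggest that the recent failure pattern departs from the Benford baseline. With small windows the expected counts for the higher digits are low, so the chi-squared threshold should be treated as a rough guide only.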

            • Published in

              SC-W '23: Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis
              November 2023
              2180 pages
              ISBN:9798400707858
              DOI:10.1145/3624062

              Copyright © 2023 ACM

              Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

              Publisher

              Association for Computing Machinery

              New York, NY, United States

              Qualifiers

              • research-article
              • Research
              • Refereed limited