DOI: 10.1145/3357526.3357527
short-paper

Combining error statistics with failure prediction in memory page offlining

Published: 30 September 2019

Abstract

Memory errors constitute a large fraction of total hardware failures on modern systems, negatively impacting their reliability, availability, and serviceability. Error correcting codes have been developed to detect and correct the errors. Error-prevention mechanisms such as soft page offlining have been implemented in modern operating systems. Common offlining policies are based on error rate statistics over a certain past period. However, future error occurrences depend on both the wear level of the faulty memory cells and the implicit runtime context, which may not be reflected in the statistics for that period. This paper proposes a new offlining policy which combines error statistics with failure prediction of the memory cells in a page. Using a failure prediction component, the new approach is more effective in mining the combined effect of both the hardware wear level and the runtime context. By scoring pages with both the error statistics and the prediction confidence, the new policy is more comprehensive in offlining error-prone pages and thus avoids more errors. Empirical evaluation demonstrates that the new policy outperforms the baseline policies based on historical error statistics.
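The abstract does not give the scoring formula, so the following is only an illustrative sketch of how a policy might score pages using both the error statistics and the prediction confidence. The weighted sum, the `PageStats` fields, and the `weight`, `max_errors`, `threshold`, and `budget` parameters are all assumptions made for this example and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageStats:
    page_addr: int             # physical page address (illustrative)
    errors_in_window: int      # corrected errors seen in the recent observation window
    failure_confidence: float  # output of a failure predictor, assumed to lie in [0, 1]


def combined_score(page: PageStats, max_errors: int = 10, weight: float = 0.5) -> float:
    """Blend a normalized error count with the predictor confidence.

    The weighted sum is an assumption for illustration, not the paper's formula.
    """
    error_term = min(page.errors_in_window, max_errors) / max_errors
    return weight * error_term + (1.0 - weight) * page.failure_confidence


def pages_to_offline(pages: List[PageStats], threshold: float = 0.5, budget: int = 2) -> List[int]:
    """Return up to `budget` page addresses whose score reaches `threshold`,
    highest score first."""
    scored = [(combined_score(p), p.page_addr) for p in pages]
    eligible = [item for item in scored if item[0] >= threshold]
    eligible.sort(key=lambda item: item[0], reverse=True)
    return [addr for _, addr in eligible[:budget]]


if __name__ == "__main__":
    pages = [
        PageStats(0x1000, errors_in_window=8, failure_confidence=0.1),
        PageStats(0x2000, errors_in_window=3, failure_confidence=0.9),
        PageStats(0x3000, errors_in_window=10, failure_confidence=0.8),
        PageStats(0x4000, errors_in_window=0, failure_confidence=0.2),
    ]
    print([hex(addr) for addr in pages_to_offline(pages)])
```

In this toy input, a policy based purely on error counts would prefer page `0x1000` (8 errors) over page `0x2000` (3 errors). The combined score reverses that order because the hypothetical predictor is far more confident that `0x2000` will fail.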


      Published In

      MEMSYS '19: Proceedings of the International Symposium on Memory Systems
      September 2019
      517 pages
      ISBN:9781450372060
      DOI:10.1145/3357526

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. DRAM failure prediction
      2. DRAM reliability
      3. software page offlining

      Qualifiers

      • Short-paper

      Conference

      MEMSYS '19
      MEMSYS '19: The International Symposium on Memory Systems
      September 30 - October 3, 2019
      District of Columbia, Washington, USA
