DOI: 10.1145/3240302.3240309
research-article

Memory failure prediction using online learning

Published: 01 October 2018

Abstract

Dynamic random access memory (DRAM) errors occur frequently in datacenters and are the leading cause of failures among hardware components. DRAM failure analysis is one of the most important topics in hardware reliability, availability, and serviceability. Although prior work has studied DRAM failure modes comprehensively, no mechanism for predicting future failures of DRAM components is available today.
In this paper, we address the problem of predicting failures of micro-level DRAM components, including cells, rows, and columns. A DRAM failure is the combined effect of the wear level of a DRAM fault and the implicit runtime context. Correctly predicting DRAM failures quantifies their impact on DRAM reliability and enables advanced error-prevention mechanisms such as efficient page retirement or dynamic substitution with spare DRAM components.
We propose an online learning method that repeatedly takes the historical memory failure data of an individual server as input to predict its failure occurrences in the near future. The learning algorithm embeds a kernel function to evaluate how well the current error observation follows certain previous observations in its failure history. It then performs the prediction by discovering the implicit patterns online. The algorithm strengthens failure confidence with the history of repeated errors from hard faults while washing out occasional errors from soft faults. It also adapts to the unobservable time-varying context by penalizing changes in failure observations across cycles. We further augment the algorithm with a mechanism that propagates failure confidence scores to nearby cells, including those without error history. Empirical evaluation demonstrates that the failure prediction approach consistently outperforms baseline methods based on historical error statistics.
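The ideas in the paragraph above can be made concrete with a toy sketch. The abstract does not specify the kernel, the update rule, or the propagation scheme, so everything here is an assumption chosen for illustration and not the authors' algorithm: the class name `OnlineFailureScorer`, the Gaussian kernel on the time gap between cycles, exponential smoothing as a stand-in for penalizing change across cycles, and a four-neighbor spread of scores.

```python
import math
from collections import defaultdict


class OnlineFailureScorer:
    """Toy per-cell failure-confidence tracker (illustrative assumptions only)."""

    def __init__(self, bandwidth=2.0, smoothing=0.5, spread=0.25):
        self.bandwidth = bandwidth  # hypothetical kernel width, in cycles
        self.smoothing = smoothing  # weight kept from the previous score
        self.spread = spread        # fraction of a score shared with neighbors
        self.history = defaultdict(list)  # cell -> cycles in which it erred
        self.score = defaultdict(float)   # cell -> confidence in [0, 1)

    def _kernel(self, now, past):
        # Gaussian kernel on the time gap: recent errors weigh more.
        return math.exp(-((now - past) ** 2) / (2 * self.bandwidth ** 2))

    def update(self, cycle, error_cells):
        """Consume one observation cycle; error_cells is a set of (row, col)."""
        for cell in error_cells:
            self.history[cell].append(cycle)
        for cell, cycles in self.history.items():
            # Repeated errors accumulate support; a lone old error fades.
            support = sum(self._kernel(cycle, c) for c in cycles)
            target = 1.0 - math.exp(-support)  # squash support into [0, 1)
            # Smoothing damps abrupt score changes between cycles.
            self.score[cell] = (self.smoothing * self.score[cell]
                                + (1 - self.smoothing) * target)

    def predict(self):
        """Return scores, including propagated scores for adjacent cells."""
        out = dict(self.score)
        for (r, c), s in self.score.items():
            for nb in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                # Neighbors with no error history still receive a score.
                out[nb] = max(out.get(nb, 0.0), self.spread * s)
        return out


scorer = OnlineFailureScorer()
for cycle in range(5):
    # Cell (3, 4) errs every cycle (hard-fault-like); (10, 10) errs once.
    errors = {(3, 4)} | ({(10, 10)} if cycle == 0 else set())
    scorer.update(cycle, errors)
scores = scorer.predict()
```

In this sketch, the repeatedly failing cell ends with a higher score than the cell that erred once, and the neighbors of the failing cell receive a smaller, nonzero score despite having no error history of their own.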

Cited By

  • (2024) Removing obstacles before breaking through the memory wall. Proceedings of the 2024 USENIX Conference on Usenix Annual Technical Conference. DOI: 10.5555/3691992.3692044 (851-867). Online publication date: 10-Jul-2024
  • (2024) DRAM Errors and Cosmic Rays: Space Invaders or Science Fiction? 2024 IEEE 36th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). DOI: 10.1109/SBAC-PAD63648.2024.00025 (194-205). Online publication date: 13-Nov-2024
  • (2024) Investigating Memory Failure Prediction Across CPU Architectures. 2024 54th Annual IEEE/IFIP International Conference on Dependable Systems and Networks - Supplemental Volume (DSN-S). DOI: 10.1109/DSN-S60304.2024.00033 (88-95). Online publication date: 24-Jun-2024

Information & Contributors

Information

Published In

MEMSYS '18: Proceedings of the International Symposium on Memory Systems
October 2018
361 pages
ISBN: 9781450364751
DOI: 10.1145/3240302

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. DRAM failure prediction
  2. DRAM reliability
  3. failure propagation
  4. kernel method
  5. online learning

Qualifiers

  • Research-article

Conference

MEMSYS '18
MEMSYS '18: The International Symposium on Memory Systems
October 1 - 4, 2018
Alexandria, Virginia, USA

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 55
  • Downloads (Last 6 weeks): 9
Reflects downloads up to 27 Feb 2025

Citations

Cited By

  • (2023) Exploring Error Bits for Memory Failure Prediction: An In-Depth Correlative Study. 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD). DOI: 10.1109/ICCAD57390.2023.10323692 (01-09). Online publication date: 28-Oct-2023
  • (2023) HiMFP: Hierarchical Intelligent Memory Failure Prediction for Cloud Service Reliability. 2023 53rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). DOI: 10.1109/DSN58367.2023.00031 (216-228). Online publication date: Jun-2023
  • (2023) EXPERT: EXPloiting DRAM ERror Types to Improve the Effective Forecasting Coverage in the Field. 2023 53rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks - Supplemental Volume (DSN-S). DOI: 10.1109/DSN-S58398.2023.00022 (35-41). Online publication date: Jun-2023
  • (2023) An Optical Transceiver Reliability Study based on SFP Monitoring and OS-level Metric Data. 2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid). DOI: 10.1109/CCGrid57682.2023.00011 (1-12). Online publication date: May-2023
  • (2023) Review of Memory RAS for Data Centers. IEEE Access 11. DOI: 10.1109/ACCESS.2023.3329984 (124782-124796). Online publication date: 2023
  • (2022) From correctable memory errors to uncorrectable memory errors. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. DOI: 10.5555/3571885.3571986 (1-14). Online publication date: 13-Nov-2022
  • (2022) An In-Depth Correlative Study Between DRAM Errors and Server Failures in Production Data Centers. 2022 41st International Symposium on Reliable Distributed Systems (SRDS). DOI: 10.1109/SRDS55811.2022.00032 (262-272). Online publication date: Sep-2022
