Fault Tolerance Technique Offlining Faulty Blocks by Heap Memory Management

Authors:
Jaeyung Jun, Korea University, Seoul, Korea (ORCID 0000-0002-1840-8114)
Yoonah Paik, Korea University, Seoul, Korea
Gyeong Il Min, Korea University, Seoul, Korea
Seon Wook Kim, Korea University, Seoul, Korea
Youngsun Han, Kyungil University, Gyeongsan, Korea

ACM Transactions on Design Automation of Electronic Systems, Volume 24, Issue 4, Article No. 47, pp. 1–25. https://doi.org/10.1145/3329079

Published: 05 June 2019

Abstract

As dynamic random access memory (DRAM) cells are scaled down for higher density and capacity, they exhibit more faults, and DRAM reliability has become a major concern in computer systems. Previous studies have proposed many techniques that preserve reliability in various system components, such as DRAM internals, the memory controller, caches, and the operating system. Reviewing these techniques, we identified two considerations. First, recovering from faults with reasonable overhead at a high fault rate is possible only if the recovery unit is fine-grained. Second, because hardware modification adds cost to deploying a technique, a pure software-based recovery technique is preferable. However, the recovery unit of the existing software-based technique is too coarse-grained to tolerate a high fault rate.
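The effect of recovery-unit size can be illustrated with a short calculation. The sketch below is not taken from the article: it assumes independent, uniformly distributed bit faults, a 4 KiB page, and a hypothetical 64-byte block, and it uses the bit error rate of 5.12e-5 reported later in this abstract. It estimates the probability that a unit of each size contains no faulty bit.

```python
def fault_free_probability(unit_bytes: int, ber: float) -> float:
    """Probability that a unit of `unit_bytes` bytes has no faulty bit.

    Assumes every bit fails independently with probability `ber`
    (a simplifying assumption, not the article's fault model).
    """
    return (1.0 - ber) ** (unit_bytes * 8)


BER = 5.12e-5        # bit error rate quoted in the abstract
PAGE_BYTES = 4096    # assumed page size
BLOCK_BYTES = 64     # assumed (hypothetical) block size

page_ok = fault_free_probability(PAGE_BYTES, BER)
block_ok = fault_free_probability(BLOCK_BYTES, BER)

print(f"fault-free fraction of {PAGE_BYTES}-byte pages:  {page_ok:.3f}")
print(f"fault-free fraction of {BLOCK_BYTES}-byte blocks: {block_ok:.3f}")
```

Under these assumptions most pages contain at least one faulty bit while most blocks do not, which is why a coarse recovery unit discards far more memory than a fine one. The figures are illustrative only and are not the article's evaluation results.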

In this article, we propose a pure software-based recovery technique with fine granularity. Our key idea builds on the fact that the system library manages heap segments as variable-sized chunks to serve dynamic allocation in user applications. In our technique, faulty blocks in pages are offlined by marking them as allocated chunks. Thus, not only fault-free pages but also the remaining clean blocks in faulty pages stay usable. Our technique is implemented by modifying the operating system and the system library. Because the implementation needs no hardware assistance, we evaluated our method on a real machine. Our evaluation shows that the technique has negligible performance overhead at a high bit error rate (BER) of 5.12e-5, which a hardware-based recovery technique could not tolerate without unacceptable area overhead. At the same BER, our method provides 5.22× the usable space of page-offline, the state-of-the-art pure software-based technique.
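The idea of offlining faulty blocks by marking them as allocated can be sketched with a toy allocator. This is a minimal illustration, not the authors' implementation, which modifies the operating system and the system library. The class `ToyHeapPage`, its methods, the 64-byte block size, and the first-fit policy are all hypothetical choices made for the example.

```python
from dataclasses import dataclass, field

PAGE_BYTES = 4096                    # assumed page size
BLOCK_BYTES = 64                     # assumed (hypothetical) block size
BLOCKS_PER_PAGE = PAGE_BYTES // BLOCK_BYTES


@dataclass
class ToyHeapPage:
    """One heap page tracked at block granularity (hypothetical model)."""

    allocated: list = field(default_factory=lambda: [False] * BLOCKS_PER_PAGE)

    def offline_faulty_blocks(self, faulty_blocks):
        """Mark each faulty block as allocated so it is never handed out."""
        for index in faulty_blocks:
            self.allocated[index] = True

    def usable_bytes(self):
        """Bytes still available in blocks that are not marked allocated."""
        return self.allocated.count(False) * BLOCK_BYTES

    def allocate(self, size_bytes):
        """First-fit: return the first block index of a free run, or None."""
        if size_bytes <= 0:
            return None
        needed = -(-size_bytes // BLOCK_BYTES)  # ceiling division
        run_start, run_length = 0, 0
        for index, taken in enumerate(self.allocated):
            if taken:
                # A taken (or offlined) block breaks the current free run.
                run_start, run_length = index + 1, 0
                continue
            run_length += 1
            if run_length == needed:
                for i in range(run_start, run_start + needed):
                    self.allocated[i] = True
                return run_start
        return None


page = ToyHeapPage()
page.offline_faulty_blocks([3, 40])   # hypothetical faulty block indices
usable_after_offlining = page.usable_bytes()
start_block = page.allocate(256)      # needs four contiguous clean blocks
```

In this toy model, the page with two faulty blocks keeps 62 of its 64 blocks usable, and the 256-byte request is placed in a clean run that skips the offlined block. A page-granularity scheme would instead retire the entire page.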

Index Terms

1. Computer systems organization
  1. Dependable and fault-tolerant systems and networks
    1. Reliability
2. Software and its engineering
  1. Software organization and properties
    1. Contextual software domains
      1. Operating systems
        1. Memory management
          1. Allocation / deallocation strategies

Published in
ACM Transactions on Design Automation of Electronic Systems Volume 24, Issue 4
July 2019
258 pages
ISSN:1084-4309
EISSN:1557-7309
DOI:10.1145/3326461
Editor:
Naehyuck Chang
Korea Advanced Institute of Science and Technology, Korea
Copyright © 2019 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States

Journal Family
ACM Journals for the Design of Smart and Connected Systems
Publication History
- Published: 5 June 2019
- Revised: 1 April 2019
- Accepted: 1 April 2019
- Received: 1 July 2018
Author Tags
DRAM fault recovery
Qualifiers
- research-article
- Research
- Refereed