research-article

Fault Tolerance Technique Offlining Faulty Blocks by Heap Memory Management

Published: 05 June 2019

Abstract

As dynamic random access memory (DRAM) cells continue to be scaled down for higher density and capacity, they become more prone to faults. DRAM reliability has therefore become a major concern in computer systems. Previous studies have proposed many techniques for preserving reliability in various system components, such as DRAM internals, the memory controller, caches, and the operating system. By reviewing these techniques, we identified two considerations. First, at a high fault rate, recovering from faults with reasonable overhead is possible only if the recovery unit is fine-grained. Second, since hardware modification adds cost to deploying a technique, a pure software-based recovery technique is preferable. However, in the existing software-based recovery technique, the recovery unit is too coarse-grained to tolerate a high fault rate.
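The effect of recovery-unit size can be illustrated with a back-of-the-envelope estimate. The sketch below is not taken from the article: it assumes bit faults are independent and uniformly distributed, a 4 KiB page, and a hypothetical 64-byte block, and it uses the BER of 5.12e-5 evaluated in the article. Under those assumptions, fewer than one page in five is entirely fault-free, while the large majority of 64-byte blocks are.

```python
# Back-of-the-envelope model; the page size, block size, and the
# independence of bit faults are assumptions, not values from the article.
BER = 5.12e-5            # bit error rate evaluated in the article
PAGE_BITS = 4096 * 8     # assumed 4 KiB page
BLOCK_BITS = 64 * 8      # assumed 64-byte recovery block (hypothetical)


def fault_free_fraction(unit_bits, ber):
    """Probability that a unit of `unit_bits` bits contains no faulty bit."""
    return (1.0 - ber) ** unit_bits


page_ok = fault_free_fraction(PAGE_BITS, BER)
block_ok = fault_free_fraction(BLOCK_BITS, BER)
print(f"fault-free pages:  {page_ok:.3f}")
print(f"fault-free blocks: {block_ok:.3f}")
```

A coarse-grained scheme that discards every page containing a fault keeps only the first fraction of memory, whereas a scheme that discards individual blocks keeps roughly the second.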

In this article, we propose a pure software-based recovery technique with fine granularity. Our key idea builds on the fact that heap segments are managed by the system library as variable-sized chunks to serve dynamic allocation in user applications. In our technique, faulty blocks in pages are offlined by marking them as allocated chunks. Thus, not only fault-free pages but also the remaining clean blocks in faulty pages remain usable. Our technique is implemented by modifying the operating system and the system library. Since the implementation needs no hardware assistance, we evaluated our method on a real machine. Our evaluation results show that our technique has negligible performance overhead at a high bit error rate (BER) of 5.12e-5, which a hardware-based recovery technique could not tolerate without unacceptable area overhead. Also, at the same BER, our method provides 5.22× the usable space of page-offline, the state-of-the-art pure software-based technique.
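The idea of offlining faulty blocks by marking them as allocated chunks can be sketched with a toy allocator. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Chunk` class, `build_heap`, `malloc`, the 64-byte block size, and the first-fit policy are all hypothetical, and the heap size is assumed to be a multiple of the block size.

```python
from dataclasses import dataclass

BLOCK = 64    # assumed block size in bytes (hypothetical)
PAGE = 4096   # assumed page size in bytes


@dataclass
class Chunk:
    start: int
    size: int
    allocated: bool
    faulty: bool = False


def build_heap(heap_size, faulty_addrs):
    """Build a chunk list in which every block holding a faulty address
    is created as an already-allocated chunk, so it is never handed out."""
    bad_blocks = sorted({a // BLOCK for a in faulty_addrs if 0 <= a < heap_size})
    chunks, cursor = [], 0
    for b in bad_blocks:
        start = b * BLOCK
        if start > cursor:  # clean space before the faulty block stays free
            chunks.append(Chunk(cursor, start - cursor, allocated=False))
        chunks.append(Chunk(start, BLOCK, allocated=True, faulty=True))
        cursor = start + BLOCK
    if cursor < heap_size:
        chunks.append(Chunk(cursor, heap_size - cursor, allocated=False))
    return chunks


def malloc(chunks, size):
    """First-fit allocation: split a free chunk and return its start
    address, or None if no free chunk is large enough."""
    size = -(-size // BLOCK) * BLOCK  # round up to a multiple of BLOCK
    for i, c in enumerate(chunks):
        if not c.allocated and c.size >= size:
            if c.size > size:  # keep the remainder as a new free chunk
                chunks.insert(i + 1, Chunk(c.start + size, c.size - size, False))
            c.size = size
            c.allocated = True
            return c.start
    return None


def usable_bytes(chunks):
    """Bytes that are not inside an offlined (faulty) block."""
    return sum(c.size for c in chunks if not c.faulty)


# Two pages, each with one faulty byte: offlining whole pages would leave
# nothing usable, while offlining blocks loses only 2 * BLOCK bytes.
heap = build_heap(2 * PAGE, faulty_addrs=[100, 5000])
addr = malloc(heap, 200)
```

Because the faulty blocks look like ordinary allocated chunks, the allocation path needs no special case for them; it simply never finds them free.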



    • Published in

      ACM Transactions on Design Automation of Electronic Systems, Volume 24, Issue 4
      July 2019
      258 pages
      ISSN: 1084-4309
      EISSN: 1557-7309
      DOI: 10.1145/3326461
      • Editor: Naehyuck Chang

      Copyright © 2019 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 5 June 2019
      • Revised: 1 April 2019
      • Accepted: 1 April 2019
      • Received: 1 July 2018

      Qualifiers

      • research-article
      • Research
      • Refereed
