Abstract
As dynamic random access memory (DRAM) cells are scaled down for higher density and capacity, they become more prone to faults, making DRAM reliability a major concern in computer systems. Previous studies have proposed many techniques for preserving reliability at various levels of the system, including DRAM internals, the memory controller, caches, and the operating system. Reviewing these techniques, we identified two considerations. First, recovering from faults with reasonable overhead at a high fault rate is possible only if the recovery unit is fine-grained. Second, because hardware modification adds cost to deploying a technique, a pure software-based recovery technique is preferable. However, the recovery unit of the existing software-based technique is too coarse-grained to tolerate a high fault rate.
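The effect of recovery-unit size can be illustrated with a short calculation. The sketch below assumes independent, uniformly distributed bit faults, which is a simplification made for this example and not necessarily the fault model used in the evaluation. The 4096-byte page and 64-byte block sizes are illustrative choices, and the function name is hypothetical; the BER is the value the abstract reports results for.

```python
def fault_free_fraction(ber: float, unit_bytes: int) -> float:
    """Expected fraction of recovery units that contain no faulty bit.

    Assumes every bit fails independently with probability `ber`
    (an assumption of this sketch, not taken from the article).
    """
    bits = unit_bytes * 8
    return (1.0 - ber) ** bits


if __name__ == "__main__":
    ber = 5.12e-5  # BER at which the abstract reports results
    # Illustrative unit sizes: a 4 KiB page versus a 64-byte block.
    for name, size in (("page", 4096), ("block", 64)):
        frac = fault_free_fraction(ber, size)
        print(f"{name:5s} ({size:4d} B): {frac:.3f} of units fault-free")
```

Under these assumptions, most whole pages contain at least one faulty bit while most small blocks do not, which is why discarding entire pages wastes far more clean memory than discarding only the affected blocks.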
In this article, we propose a pure software-based recovery technique with fine granularity. Our key idea builds on the fact that the system library manages heap segments as variable-sized chunks to serve dynamic allocation in user applications. In our technique, faulty blocks in pages are offlined by marking them as allocated chunks. Thus, not only fault-free pages but also the remaining clean blocks in faulty pages remain usable. We implement our technique by modifying the operating system and the system library. Because the implementation requires no hardware assistance, we evaluated our method on a real machine. The evaluation results show that our technique incurs negligible performance overhead at a high bit error rate (BER) of 5.12e-5, which a hardware-based recovery technique could not tolerate without unacceptable area overhead. At the same BER, our method also provides 5.22× the usable space of page-offline, the state-of-the-art pure software-based technique.
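The idea of offlining faulty blocks by marking them as allocated chunks can be sketched with a toy heap model. This is only an illustration under stated assumptions, not the authors' implementation: the class `ToyHeap`, the `Chunk` record, the fixed block granularity, and the first-fit policy are all hypothetical, and a real system library allocator is considerably more involved.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    start: int   # index of the first block in the chunk
    length: int  # number of blocks in the chunk
    state: str   # "free", "used", or "faulty" (permanently allocated)


class ToyHeap:
    """Toy heap over n_blocks blocks; faulty blocks are pre-marked as allocated."""

    def __init__(self, n_blocks: int, faulty_blocks):
        faulty = sorted(set(faulty_blocks))
        if any(not 0 <= b < n_blocks for b in faulty):
            raise ValueError("faulty block index out of range")
        self.chunks = []
        cursor = 0
        for b in faulty:
            if b > cursor:  # clean run before the faulty block stays usable
                self.chunks.append(Chunk(cursor, b - cursor, "free"))
            # The faulty block becomes a chunk that is never handed out or freed.
            self.chunks.append(Chunk(b, 1, "faulty"))
            cursor = b + 1
        if cursor < n_blocks:
            self.chunks.append(Chunk(cursor, n_blocks - cursor, "free"))

    def malloc(self, n_blocks: int):
        """First-fit allocation; returns the starting block index or None."""
        for i, c in enumerate(self.chunks):
            if c.state == "free" and c.length >= n_blocks:
                if c.length > n_blocks:  # split off the unused tail
                    tail = Chunk(c.start + n_blocks, c.length - n_blocks, "free")
                    self.chunks.insert(i + 1, tail)
                    c.length = n_blocks
                c.state = "used"
                return c.start
        return None

    def free(self, start: int) -> None:
        """Release a used chunk and merge it with adjacent free chunks."""
        for c in self.chunks:
            if c.start == start and c.state == "used":
                c.state = "free"
                self._coalesce()
                return
        raise ValueError("no used chunk starts at this block")

    def _coalesce(self) -> None:
        merged = []
        for c in self.chunks:
            if merged and merged[-1].state == "free" and c.state == "free":
                merged[-1].length += c.length
            else:
                merged.append(c)  # faulty chunks act as barriers between free runs
        self.chunks = merged

    def usable_blocks(self) -> int:
        return sum(c.length for c in self.chunks if c.state != "faulty")
```

Because the faulty chunks look like ordinary allocated chunks that are never freed, the allocation path needs no special case: it simply never returns a range that overlaps them, and only the faulty blocks themselves are lost rather than the whole page.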
Index Terms
- Fault Tolerance Technique Offlining Faulty Blocks by Heap Memory Management