Live-Out Register Fencing: Interrupt-Triggered Soft Error Correction Based on the Elimination of Register-to-Register Communication

Published: 11 May 2016

Abstract

This article introduces Live-Out Register Fencing (LoRF), a soft error correction mechanism that uses the novel Spill Register File as a container of checkpointing data. LoRF’s Spill Register File holds the values shared among basic blocks in the program, and, coupled with a new compilation strategy, LoRF allows for error correction in the same basic block where the error was detected. In LoRF, error correction is triggered by a hardware interrupt that restores the registers of a basic block from the Spill Register File. After these registers are restored, the basic block where the error was detected can just be re-executed, thus reducing the costs of error recovery. LoRF’s error correction policy eliminates the need for expensive architectural support for checkpointing and rollback, reducing the performance overhead of online soft error correction. LoRF relies on both a modified processor architecture and a corresponding compiler. The architecture was implemented in synthesizable VHDL, whereas the compiler was developed as an extension of the LLVM framework. Fault injection experiments support an error correction coverage of 99.35% and a mean performance overhead of 1.33 for the entire life cycle of an error from its occurrence to its elimination from the system.
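
The recovery flow described in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the authors' VHDL architecture or LLVM extension: the names `Block`, `run_block`, `flip_bit`, and `demo` are hypothetical, error detection is assumed to happen by some unspecified mechanism (modeled here as a Python exception standing in for the hardware interrupt), and the Spill Register File is modeled as a plain dictionary. The idea it tries to show is that a block reads its inputs only from the Spill Register File and publishes its live-out values only after it completes, so a faulty run can be discarded and the same block re-executed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Regs = Dict[str, int]


class SoftErrorDetected(Exception):
    """Stands in for the hardware interrupt raised when an error is detected."""


@dataclass
class Block:
    name: str
    live_in: List[str]            # values the block reads from the SRF
    live_out: List[str]           # values the block publishes to the SRF
    body: Callable[[Regs], None]  # computation on the working registers


def flip_bit(reg: str, bit: int) -> Callable[[Regs], None]:
    """Build a transient fault that flips one bit of one working register."""
    def inject(regs: Regs) -> None:
        regs[reg] ^= 1 << bit
    return inject


def run_block(block: Block, srf: Regs,
              fault: Optional[Callable[[Regs], None]] = None) -> int:
    """Run one block; return how many times it had to be re-executed."""
    retries = 0
    while True:
        # Restore step: working registers are loaded from the SRF only.
        regs = {r: srf[r] for r in block.live_in}
        try:
            if fault is not None:
                fault(regs)        # corrupt a working register
                fault = None       # transient: the fault fires only once
                block.body(regs)   # block computes on corrupted data
                raise SoftErrorDetected(block.name)  # assumed detection
            block.body(regs)
        except SoftErrorDetected:
            retries += 1
            continue               # discard working registers, re-execute
        # Fence: live-out values reach the SRF only after a clean run.
        for r in block.live_out:
            srf[r] = regs[r]
        return retries


def demo(inject_fault: bool):
    """Two-block toy program: b0 produces a and b, b1 computes c = a * b."""
    blocks = [
        Block("b0", [], ["a", "b"], lambda r: r.update(a=6, b=7)),
        Block("b1", ["a", "b"], ["c"], lambda r: r.update(c=r["a"] * r["b"])),
    ]
    srf: Regs = {}
    retries = 0
    for blk in blocks:
        fault = flip_bit("a", 3) if inject_fault and blk.name == "b1" else None
        retries += run_block(blk, srf, fault)
    return srf, retries
```

Calling `demo(inject_fault=True)` corrupts register `a` inside block `b1`. Because the corrupted result is never committed to the modeled Spill Register File, the single re-execution of `b1` yields the same final state as a fault-free run. The model ignores everything the abstract does not describe, including how errors are detected and how the Spill Register File itself is protected.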

Published In

ACM Transactions on Embedded Computing Systems, Volume 15, Issue 3
July 2016
520 pages
ISSN:1539-9087
EISSN:1558-3465
DOI:10.1145/2899033
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 11 May 2016
Accepted: 01 December 2015
Revised: 01 December 2015
Received: 01 July 2015
Published in TECS Volume 15, Issue 3

Author Tags

  1. checkpointing
  2. compiler
  3. error correction
  4. fault recovery
  5. hardening
  6. liveness
  7. register file
  8. soft error

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq)
  • European Commission through the LoRelei
