Edge-TM: Exploiting Transactional Memory for Error Tolerance and Energy Efficiency

Published: 27 September 2017

Abstract

Scaling of semiconductor devices has enabled higher levels of integration and performance, but at the price of making devices more susceptible to the effects of static and dynamic variability. Adding safety margins (guardbands) on the operating frequency or supply voltage prevents timing errors, but has a negative impact on performance and energy consumption. We propose Edge-TM, an adaptive hardware/software error management policy that (i) optimistically scales the voltage beyond the edge of safe operation for better energy savings and (ii) works in combination with a Hardware Transactional Memory (HTM)-based error recovery mechanism. The policy applies dynamic voltage scaling (DVS), while keeping frequency fixed, based on the feedback provided by HTM, which makes it simple and generally applicable. Experiments on an embedded platform show that our technique achieves a 57% energy improvement compared to using voltage guardbands and an extra 21-24% improvement over existing state-of-the-art error tolerance solutions, at a nominal area and time overhead.
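
To make the idea of voltage scaling driven by HTM feedback more concrete, the following is a minimal sketch of one possible feedback loop. It is not the Edge-TM policy itself, whose details are not given in the abstract. The class `VoltageController`, its voltage limits, step size, and abort-rate threshold are all hypothetical values chosen for illustration. The helper `relative_dynamic_energy` relies on the first-order approximation that dynamic energy at a fixed frequency scales with the square of the supply voltage, and it ignores leakage and recovery costs.

```python
from dataclasses import dataclass


@dataclass
class VoltageController:
    """Hypothetical feedback controller; every constant here is illustrative."""
    v_min: float = 0.60           # lowest voltage the regulator may be set to (V)
    v_max: float = 1.10           # guardbanded, "safe" voltage (V)
    step: float = 0.01            # voltage change per adjustment (V)
    max_abort_rate: float = 0.05  # tolerated fraction of error-induced aborts
    voltage: float = 1.10         # current supply voltage (V)

    def update(self, transactions: int, error_aborts: int) -> float:
        """Adjust the voltage from one epoch of HTM feedback and return it."""
        if transactions == 0:
            return self.voltage  # no feedback this epoch, so hold the voltage
        abort_rate = error_aborts / transactions
        if abort_rate > self.max_abort_rate:
            # Too many rollbacks: back off toward the safe voltage.
            self.voltage = min(self.v_max, self.voltage + self.step)
        else:
            # Errors are rare enough: keep scaling down optimistically.
            self.voltage = max(self.v_min, self.voltage - self.step)
        self.voltage = round(self.voltage, 4)  # avoid floating-point drift
        return self.voltage


def relative_dynamic_energy(voltage: float, v_ref: float) -> float:
    """Dynamic energy relative to v_ref at fixed frequency (E ~ V^2)."""
    return (voltage / v_ref) ** 2
```

In this sketch, a call such as `VoltageController().update(transactions=100, error_aborts=0)` lowers the voltage by one step, while an epoch with many error-induced aborts raises it again. The frequency never changes, which mirrors the fixed-frequency DVS described in the abstract.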

Published In

ACM Transactions on Embedded Computing Systems, Volume 16, Issue 5s
Special Issue ESWEEK 2017, CASES 2017, CODES + ISSS 2017 and EMSOFT 2017
October 2017
1448 pages
ISSN: 1539-9087
EISSN: 1558-3465
DOI: 10.1145/3145508

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 September 2017
Accepted: 01 July 2017
Revised: 01 May 2017
Received: 01 March 2017
Published in TECS Volume 16, Issue 5s

Author Tags

  1. Energy efficiency
  2. Error Tolerance
  3. Reliability
  4. Transactional Memory
  5. Variability

Qualifiers

  • Research-article
  • Research
  • Refereed

Cited By

  • (2020) Transaction-Based Core Reliability. 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 168-179. DOI: 10.1109/IPDPS47924.2020.00027. Online publication date: May 2020.
  • (2019) IgnoreTM: Opportunistically Ignoring Timing Violations for Energy Savings using HTM. 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE), 1571-1574. DOI: 10.23919/DATE.2019.8715139. Online publication date: Mar 2019.
  • (2019) Special Session: Does Approximation Make Testing Harder (or Easier)? 2019 IEEE 37th VLSI Test Symposium (VTS), 1-9. DOI: 10.1109/VTS.2019.8758649. Online publication date: Apr 2019.
  • (2018) Energy-Quality Scalable Integrated Circuits and Systems: Continuing Energy Scaling in the Twilight of Moore’s Law. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 8:4, 653-678. DOI: 10.1109/JETCAS.2018.2881461. Online publication date: Dec 2018.
  • (2017) Evaluating critical bits in arithmetic operations due to timing violations. 2017 IEEE High Performance Extreme Computing Conference (HPEC), 1-7. DOI: 10.1109/HPEC.2017.8091090. Online publication date: Sep 2017.
