DOI: 10.1145/2463209.2488857
research-article

Reliable on-chip systems in the nano-era: lessons learnt and future trends

Published: 29 May 2013

Abstract

Reliability concerns due to technology scaling have been a major focus of researchers and designers for several technology nodes. As a result, many new techniques for enhancing and optimizing reliability have emerged, particularly within the last five to ten years. This perspective paper introduces the most prominent reliability concerns from today's point of view and briefly recapitulates the community's progress so far. It focuses on prospective trends, from both industrial and academic points of view, that suggest ways of coping with reliability challenges in upcoming technology nodes.

Published In

DAC '13: Proceedings of the 50th Annual Design Automation Conference
May 2013
1285 pages
ISBN: 9781450320719
DOI: 10.1145/2463209

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States

Conference

DAC '13

Acceptance Rates

Overall acceptance rate: 1,770 of 5,499 submissions (32%)

Cited By

• (2024) Fast Cell Library Characterization for Design Technology Co-Optimization Based on Graph Neural Networks. Proceedings of the 29th Asia and South Pacific Design Automation Conference, pp. 472-477. DOI: 10.1109/ASP-DAC58780.2024.10473933. Online publication date: 22-Jan-2024
• (2023) Microfluidic Actuated and Controlled Systems and Application for Lab-on-Chip in Space Life Science. Space: Science & Technology, vol. 3. DOI: 10.34133/space.0008. Online publication date: Jan-2023
• (2023) Understanding the Effects of Permanent Faults in GPU's Parallelism Management and Control Units. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-14. DOI: 10.1145/3581784.3607086. Online publication date: 12-Nov-2023
• (2023) Aging-Aware Critical Path Selection via Graph Attention Networks. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 42, no. 12, pp. 5006-5019. DOI: 10.1109/TCAD.2023.3276944. Online publication date: Dec-2023
• (2023) FLEA - FIT-Aware Heuristic for Application Allocation in Many-Cores based on Q-Learning. 2023 XIII Brazilian Symposium on Computing Systems Engineering (SBESC), pp. 1-6. DOI: 10.1109/SBESC60926.2023.10324296. Online publication date: 21-Nov-2023
• (2023) Adversarial Testing: A Novel On-Line Testing Method for Deep Learning Processors. 2023 IEEE 32nd Asian Test Symposium (ATS), pp. 1-6. DOI: 10.1109/ATS59501.2023.10317994. Online publication date: 14-Oct-2023
• (2023) Dependable DNN Accelerator for Safety-Critical Systems: A Review on the Aging Perspective. IEEE Access, vol. 11, pp. 89803-89834. DOI: 10.1109/ACCESS.2023.3300376. Online publication date: 2023
• (2022) Reliable Circuit Design Using a Fast Incremental-Based Gate Sizing Under Process Variation. IEEE Transactions on Device and Materials Reliability, vol. 22, no. 3, pp. 371-380. DOI: 10.1109/TDMR.2022.3175914. Online publication date: Sep-2022
• (2022) Quantitative Analysis of Sparsely Synchronized Fail-Safe Processors. 2022 IEEE 22nd International Conference on Software Quality, Reliability and Security (QRS), pp. 1057-1068. DOI: 10.1109/QRS57517.2022.00109. Online publication date: Dec-2022
• (2022) Software Product Reliability Based on Basic Block Metrics Recomposition. 2022 IEEE 28th International Symposium on On-Line Testing and Robust System Design (IOLTS), pp. 1-5. DOI: 10.1109/IOLTS56730.2022.9897289. Online publication date: 12-Sep-2022
