research-article

Combating the Reliability Challenge of GPU Register File at Low Supply Voltage

Authors:

Jingweijia Tan,

Shuaiwen Leon Song,

Andres Marquez,

Darren Kerbyson

PACT '16: Proceedings of the 2016 International Conference on Parallel Architectures and Compilation

Pages 3 - 15

https://doi.org/10.1145/2967938.2967951

Published: 11 September 2016

Abstract

Supply voltage reduction is an effective way to significantly reduce GPU energy consumption. Because of process variation effects, the GPU register file, the largest on-chip storage structure, becomes the reliability hotspot that prevents the supply voltage from being reduced below the safe limit ($V_{min}$). This work addresses the reliability challenge of the GPU register file at low supply voltages, an essential first step toward aggressive supply voltage reduction for the entire GPU chip. To better understand the reliability issues posed by undervolting and its energy-saving potential, we first rigorously model and analyze the impact of process variation on the GPU register file at different voltages. By further analyzing the GPU architecture, we make a key observation: GPU registers spend a long time holding useless data (i.e., dead time), which provides a unique opportunity to enhance register reliability. We then propose GR-Guard, an architectural solution that leverages long register dead time to enable reliable operation from an unreliable register file at low voltages. GR-Guard is both effective and low-cost, and it does not affect normal (i.e., non-faulty) register accesses. Experimental results show that for a 28 nm baseline GPU under aggressive voltage reduction, GR-Guard can maintain register file reliability with less than 2% overall performance degradation, while achieving an average energy reduction of 31% across various applications.
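
The notion of register dead time mentioned in the abstract can be illustrated with a small sketch. This is not GR-Guard and is not taken from the paper. It assumes a hypothetical access trace of `(cycle, register, op)` tuples and a liveness-style definition in which a register holds dead data from the last access to a value until that value is overwritten or the trace ends. The function name `dead_time_fraction` and the trace format are illustrative assumptions.

```python
from collections import defaultdict


def dead_time_fraction(trace, total_cycles):
    """Estimate, per register, the fraction of cycles spent holding dead data.

    trace: list of (cycle, reg, op) tuples sorted by cycle, op is "R" or "W".
    total_cycles: length of the observed window, in cycles.
    Assumption (not from the paper): data is dead from its last access
    until the next write to the same register, or until the window ends.
    """
    last_access = {}          # reg -> cycle of the most recent access
    dead = defaultdict(int)   # reg -> accumulated dead cycles

    for cycle, reg, op in trace:
        if op == "W":
            # The old contents were never read after the previous access,
            # so the whole gap before this write counts as dead time.
            prev = last_access.get(reg, 0)
            dead[reg] += cycle - prev
        last_access[reg] = cycle

    # No further reads occur in the trace after the final access.
    for reg, cycle in last_access.items():
        dead[reg] += total_cycles - cycle

    return {reg: dead[reg] / total_cycles for reg in last_access}


if __name__ == "__main__":
    # Hypothetical trace: r1 is written twice and read shortly after each write.
    example = [
        (0, "r1", "W"), (2, "r1", "R"),
        (5, "r2", "W"), (6, "r2", "R"),
        (10, "r1", "W"), (11, "r1", "R"),
    ]
    for reg, frac in sorted(dead_time_fraction(example, 20).items()):
        print(reg, frac)
```

In this toy trace each value is consumed within a cycle or two of being produced, so the registers spend most of the window holding data that will never be read again. That is the kind of slack the abstract describes as an opportunity for protecting register contents.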

Index Terms

Combating the Reliability Challenge of GPU Register File at Low Supply Voltage
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Single instruction, multiple data
  2. Dependable and fault-tolerant systems and networks
    1. Reliability
2. Hardware
  1. Power and energy
  2. Robustness
    1. Design for manufacturability
      1. Process variations

Published In

PACT '16: Proceedings of the 2016 International Conference on Parallel Architectures and Compilation

September 2016

474 pages

ISBN:9781450341219

DOI:10.1145/2967938

General Chairs: Ayal Zaks (Intel, Israel), Bilha Mendelson (Optitura, Israel)

Program Chairs: Lawrence Rauchwerger (Texas A&M University, USA), Wen-mei W. Hwu (University of Illinois at Urbana-Champaign, USA)

Copyright © 2016 ACM.

© 2016 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Sponsors

IFIP WG 10.3: IFIP WG 10.3
IEEE TCCA: IEEE Computer Society Technical Committee on Computer Architecture
SIGARCH: ACM Special Interest Group on Computer Architecture
IEEE CS TCPP: IEEE Computer Society Technical Committee on Parallel Processing

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Department of Energy

Conference

PACT '16

PACT '16: International Conference on Parallel Architectures and Compilation

September 11 - 15, 2016

Haifa, Israel

Acceptance Rates

PACT '16 Paper Acceptance Rate 31 of 119 submissions, 26%

Overall Acceptance Rate 121 of 471 submissions, 26%

Bibliometrics

Total Citations: 25
Total Downloads: 228
Downloads (Last 12 months): 20
Downloads (Last 6 weeks): 8

Reflects downloads up to 26 Dec 2024

Citations

Cited By

Shoushtary M, Arnau J, Murgadas J, Gonzalez A (2024). Memento: An Adaptive, Compiler-Assisted Register File Cache for GPUs. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pages 978-990. Online publication date: 29-Jun-2024.
https://doi.org/10.1109/ISCA59077.2024.00075
Toca-Díaz Y, Gran Tejero R, Valero A (2024). Shift-and-Safe: Addressing permanent faults in aggressively undervolted CNN accelerators. Journal of Systems Architecture, 157 (103292). Online publication date: Dec-2024.
https://doi.org/10.1016/j.sysarc.2024.103292
Toca-Díaz Y, Hernández Palacios R, Gran Tejero R, Valero A (2024). Flip-and-Patch: A fault-tolerant technique for on-chip memories of CNN accelerators at low supply voltage. Microprocessors and Microsystems, 106 (105023). Online publication date: Apr-2024.
https://doi.org/10.1016/j.micpro.2024.105023
Tan J, Chen K, Wang W, Yan K, Wei X (2023). MCM-GPU Voltage Noise Characterization and Architecture-Level Mitigation. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 42:12, pages 5084-5097. Online publication date: Dec-2023.
https://doi.org/10.1109/TCAD.2023.3279304
Toca-Díaz Y, Muñoz N, Tejero R, Valero A (2023). On Fault-Tolerant Microarchitectural Techniques for Voltage Underscaling in On-Chip Memories of CNN Accelerators. 2023 26th Euromicro Conference on Digital System Design (DSD), pages 138-145. Online publication date: 6-Sep-2023.
https://doi.org/10.1109/DSD60849.2023.00029
Zhang H, Li L, Liu H, Zhuang D, Liu R, Huan C, Song S, Tao D, Liu Y, He C, Wu Y, Song S, Rauchwerger L, Cameron K, Nikolopoulos D, Pnevmatikatos D (2022). Bring orders into uncertainty. Proceedings of the 36th ACM International Conference on Supercomputing, pages 1-14. Online publication date: 28-Jun-2022.
https://dl.acm.org/doi/10.1145/3524059.3532379
Ze T, Xianglong R, Jun Z, Feihu F, Yue C (2022). Design of Shared Register File of GPU Unified Shader Array. 2022 IEEE 9th International Conference on Cyber Security and Cloud Computing (CSCloud)/2022 IEEE 8th International Conference on Edge Computing and Scalable Cloud (EdgeCom), pages 123-128. Online publication date: Jun-2022.
https://doi.org/10.1109/CSCloud-EdgeCom54986.2022.00030
Mendes F, Tomás P, Roma N (2022). Decoupling GPGPU voltage-frequency scaling for deep-learning applications. Journal of Parallel and Distributed Computing. Online publication date: Mar-2022.
https://doi.org/10.1016/j.jpdc.2022.03.004
Ranganath K, Suetterlein J, Manzano J, Song S, Wong D, de Supinski B, Hall M, Gamblin T (2021). MAPA. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-14. Online publication date: 14-Nov-2021.
https://dl.acm.org/doi/10.1145/3458817.3480853
Maroudas E, Lalis S, Bellas N, Antonopoulos C, Palesi M, Tumeo A, Goumas G, Almudever C (2021). Exploring the potential of context-aware dynamic CPU undervolting. Proceedings of the 18th ACM International Conference on Computing Frontiers, pages 73-82. Online publication date: 11-May-2021.
https://dl.acm.org/doi/10.1145/3457388.3458658