DOI: 10.1145/2967938.2967951
research-article

Combating the Reliability Challenge of GPU Register File at Low Supply Voltage

Published: 11 September 2016

ABSTRACT

Supply voltage reduction is an effective approach to significantly reduce GPU energy consumption. As the largest on-chip storage structure, the GPU register file becomes the reliability hotspot that prevents further supply voltage reduction below the safe limit ($V_{min}$) due to process variation effects. This work addresses the reliability challenge of the GPU register file at low supply voltages, which is an essential first step toward aggressive supply voltage reduction of the entire GPU chip. To better understand the reliability issues posed by undervolting and its energy-saving potential, we first rigorously model and analyze the impact of process variation on the GPU register file at different voltages. By further analyzing the GPU architecture, we make a key observation: the time during which GPU registers contain useless data (i.e., dead time) is long, providing a unique opportunity to enhance register reliability. We then propose GR-Guard, an architectural solution that leverages long register dead time to enable reliable operation from an unreliable register file at low voltages. GR-Guard is both effective and low-cost, and does not affect normal (i.e., non-faulty) register accesses. Experimental results show that for a 28 nm baseline GPU under aggressive voltage reduction, GR-Guard can maintain register file reliability with less than 2% overall performance degradation, while achieving an average of 31% energy reduction across various applications.
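To make the notion of register dead time concrete, the following is a minimal sketch of how it could be estimated from a register access trace. It is not the paper's methodology: the trace format, the function name `dead_time_fraction`, and the exact definition used (a register is dead before its first write, and from its last read, or from an unread write, until the next write) are assumptions made for illustration only.

```python
from collections import defaultdict


def dead_time_fraction(trace, total_cycles):
    """Return {register: fraction of cycles in which it holds dead data}.

    trace: iterable of (cycle, op, reg) tuples sorted by cycle, where op is
    "W" (write) or "R" (read). This format is hypothetical.
    Assumed definition: a value is dead from its last read (or from the write
    itself if it is never read) until the next write to the same register.
    """
    events = defaultdict(list)
    for cycle, op, reg in trace:
        events[reg].append((cycle, op))

    result = {}
    for reg, accesses in events.items():
        dead = 0
        last_use = 0  # cycle after which the current value is no longer needed
        for cycle, op in accesses:
            if op == "W":
                dead += cycle - last_use  # old value sat unused until overwritten
            last_use = cycle              # a write or read marks the value as in use
        dead += total_cycles - last_use   # tail after the final access
        result[reg] = dead / total_cycles
    return result


# Illustrative trace: r1 is written at cycle 10, read at 12 and 30,
# then not touched again until it is overwritten at cycle 90.
trace = [(10, "W", "r1"), (12, "R", "r1"), (30, "R", "r1"),
         (90, "W", "r1"), (95, "R", "r1")]
print(dead_time_fraction(trace, 100))  # {'r1': 0.75}
```

In this toy trace the register holds useful data for only 25 of 100 cycles; the long gap between the last read at cycle 30 and the overwrite at cycle 90 is the kind of interval the abstract refers to as dead time.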


          Published in

            PACT '16: Proceedings of the 2016 International Conference on Parallel Architectures and Compilation
            September 2016
            474 pages
            ISBN: 9781450341219
            DOI: 10.1145/2967938

            Copyright © 2016 ACM

            © 2016 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

            Publisher

            Association for Computing Machinery

            New York, NY, United States


            Acceptance Rates

            PACT '16 Paper Acceptance Rate: 31 of 119 submissions, 26%. Overall Acceptance Rate: 121 of 471 submissions, 26%.
