ABSTRACT
Supply voltage reduction is an effective way to significantly reduce GPU energy consumption. As the largest on-chip storage structure, the GPU register file becomes the reliability hotspot that, because of process variation effects, prevents the supply voltage from being lowered below the safe limit ($V_{min}$). This work addresses the reliability challenge of the GPU register file at low supply voltages, an essential first step toward aggressive supply voltage reduction for the entire GPU chip. To better understand the reliability issues posed by undervolting and its energy-saving potential, we first rigorously model and analyze the impact of process variation on the GPU register file at different voltages. By further analyzing the GPU architecture, we make the key observation that GPU registers hold useless data for long periods (i.e., their dead time is long), which provides a unique opportunity to enhance register reliability. We then propose GR-Guard, an architectural solution that leverages long register dead time to enable reliable operation of an unreliable register file at low voltages. GR-Guard is both effective and low-cost, and it does not affect normal (i.e., non-faulty) register accesses. Experimental results show that for a 28nm baseline GPU under aggressive voltage reduction, GR-Guard maintains register file reliability with less than 2% overall performance degradation while achieving an average energy reduction of 31% across various applications.
Index Terms
- Combating the Reliability Challenge of GPU Register File at Low Supply Voltage