Research article, DOI: 10.1145/3579371.3589039

R2D2: Removing ReDunDancy Utilizing Linearity of Address Generation in GPUs

Published: 17 June 2023

ABSTRACT

In a common GPU programming pattern, adjacent threads access data at neighboring or fixed-stride memory addresses and perform computations on the fetched data. This paper demonstrates that these memory addresses often exhibit a simple linear value pattern across GPU threads, because each thread computes them from built-in variables and constant values. However, since the threads compute their context data individually, GPUs incur a heavy instruction overhead to calculate the memory addresses, even though the addresses follow a simple pattern. We propose a GPU architecture called Removing ReDunDancy Utilizing Linearity of Address Generation (R2D2), which substantially reduces the dynamic instruction count by detecting such linear patterns in the memory addresses and exploiting them for kernel computations. R2D2 detects linearity in the memory addresses with software support and pre-computes the addresses before the threads execute the instructions. With the proposed scheme, each thread computes its memory addresses with fewer dynamic instructions than on conventional GPUs. In our evaluation, R2D2 reduces the dynamic instruction count by 28%, achieves a 1.25x speedup, and reduces energy consumption by 17% over a baseline GPU.
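
The linear pattern described in the abstract can be illustrated with a small sketch. It is not R2D2's detection or pre-computation mechanism, which the abstract does not specify; it only emulates, in plain Python, how per-thread addresses derived from built-in indices form an arithmetic sequence. The base address, element size, block size, and all function names are assumptions chosen for illustration.

```python
# Illustrative only: emulates how each GPU thread might derive a load
# address from built-in indices. All names and constants are assumptions.

ELEM_SIZE = 4        # bytes per element (assumed 32-bit values)
BASE_ADDR = 0x1000   # hypothetical base address of the array
BLOCK_DIM = 8        # hypothetical number of threads per block


def per_thread_address(block_idx, thread_idx):
    """Address as each thread would compute it individually."""
    global_idx = block_idx * BLOCK_DIM + thread_idx  # one multiply, one add
    return BASE_ADDR + global_idx * ELEM_SIZE        # one multiply, one add


def linear_addresses(block_idx):
    """Same addresses from one per-block base plus a constant stride."""
    block_base = BASE_ADDR + block_idx * BLOCK_DIM * ELEM_SIZE  # computed once
    return [block_base + t * ELEM_SIZE for t in range(BLOCK_DIM)]


def is_linear(addresses):
    """True if consecutive addresses differ by a single constant stride."""
    strides = {b - a for a, b in zip(addresses, addresses[1:])}
    return len(strides) <= 1


# Every thread of block 2 computes its own address...
individual = [per_thread_address(2, t) for t in range(BLOCK_DIM)]
# ...yet the result equals a shared base plus thread_idx * stride.
shared = linear_addresses(2)
```

Here `individual` and `shared` hold the same values, so the multiply and add work repeated by every thread could in principle be replaced by one shared base and a per-thread offset, which is the kind of redundancy the abstract refers to.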


    • Published in

      ISCA '23: Proceedings of the 50th Annual International Symposium on Computer Architecture
      June 2023
      1225 pages
      ISBN: 9798400700958
      DOI: 10.1145/3579371

      Copyright © 2023 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Qualifiers

      • research-article

      Acceptance Rates

      Overall acceptance rate: 543 of 3,203 submissions, 17%
