research-article

R2D2: Removing ReDunDancy Utilizing Linearity of Address Generation in GPUs

Authors:
Dongho Ha, Yonsei University, Seoul, Republic of Korea (https://orcid.org/0009-0005-4090-4025)
Yunho Oh, Korea University, Seoul, Republic of Korea (https://orcid.org/0000-0001-6442-3705)
Won Woo Ro, Yonsei University, Seoul, Republic of Korea (https://orcid.org/0000-0001-5390-6445)

ISCA '23: Proceedings of the 50th Annual International Symposium on Computer Architecture, June 2023, Article No.: 4, Pages 1–14. https://doi.org/10.1145/3579371.3589039

Published: 17 June 2023

ABSTRACT

A common GPU programming practice is for adjacent threads to access data at neighboring or fixed-stride memory addresses and perform computations on the fetched data. This paper demonstrates that these memory addresses often follow a simple linear pattern across GPU threads, because each thread computes them from built-in variables and constant values. However, since each thread computes its context data individually, GPUs incur a heavy instruction overhead to calculate the memory addresses, even though the addresses follow a simple pattern. We propose a GPU architecture called Removing ReDunDancy Utilizing Linearity of Address Generation (R2D2), which removes a large share of the dynamic instruction count by detecting such linear patterns in the memory addresses and exploiting them for kernel computations. R2D2 detects linearity in the memory addresses with software support and pre-computes them before the threads execute the instructions. With the proposed scheme, each thread computes its memory addresses with fewer dynamic instructions than on conventional GPUs. In our evaluation, R2D2 reduces the dynamic instruction count by 28%, achieves a 1.25x speedup, and reduces energy consumption by 17% over a baseline GPU.
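
The linear pattern described in the abstract can be illustrated with a small host-side sketch. This is not R2D2's mechanism, only a toy model of the observation behind it: when every thread derives an address from its block index, block size, thread index, and constants, the addresses within a block take the form offset + stride × thread index. The function names, the 4-byte element size, and the one-dimensional indexing idiom are assumptions made for the illustration.

```python
"""Illustrative sketch only: linear address generation across GPU threads."""

ELEM_SIZE = 4  # assumed element size in bytes (e.g. a 32-bit value)


def per_thread_address(base, block_idx, block_dim, thread_idx, elem_size=ELEM_SIZE):
    """Address as a single thread would compute it on its own.

    Models the common 1-D indexing idiom
    i = blockIdx.x * blockDim.x + threadIdx.x, followed by base + i * elem_size.
    """
    global_id = block_idx * block_dim + thread_idx  # repeated by every thread
    return base + global_id * elem_size             # repeated by every thread


def linear_form(base, block_idx, block_dim, elem_size=ELEM_SIZE):
    """Fold the thread-invariant terms into (offset, stride) once per block.

    For every thread in the block:
    address(thread_idx) == offset + stride * thread_idx
    """
    offset = base + block_idx * block_dim * elem_size
    stride = elem_size
    return offset, stride


if __name__ == "__main__":
    base, block_idx, block_dim = 0x1000, 2, 128  # hypothetical launch values
    offset, stride = linear_form(base, block_idx, block_dim)
    for tid in range(block_dim):
        # The individually computed address matches the shared linear form.
        assert per_thread_address(base, block_idx, block_dim, tid) == offset + stride * tid
    print(f"offset={offset:#x}, stride={stride}")
```

In this toy model the offset and stride are computed once per block, and each thread then needs a single multiply-add to obtain its address, which is the kind of redundancy the abstract refers to.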

Index Terms

Computer systems organization → Architectures → Parallel architectures → Single instruction, multiple data

Published in
ISCA '23: Proceedings of the 50th Annual International Symposium on Computer Architecture
June 2023
1225 pages
ISBN: 9798400700958
DOI: 10.1145/3579371
Chair: Yan Solihin
General Chair: Mark Heinrich
University of Central Florida
Copyright © 2023 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
GPU
single instruction multiple thread
redundant instruction