DOI: 10.1145/3330345.3330362

Address-stride assisted approximate load value prediction in GPUs

Published: 26 June 2019

Abstract

Value prediction promises significant improvements in performance and energy efficiency. However, when values are predicted incorrectly, execution rollbacks impose significant performance overheads. Value approximation was introduced to address these overheads: it leverages the observation that rollbacks are unnecessary as long as the application-level loss in quality caused by value misprediction is acceptable to the user. In the context of Graphics Processing Units (GPUs), however, our evaluations show that existing approximate value predictors achieve suboptimal prediction accuracy because they do not consider memory request order, a key characteristic in determining the accuracy of value prediction. As a result, the overall data movement reduction benefits are capped, because the percentage of predicted values (i.e., prediction coverage) must be limited to keep the application-level error at an acceptable level.
To this end, we propose a new Address-Stride Assisted Approximate Value Predictor (ASAP) that explicitly considers memory addresses and their request order to provide high value prediction accuracy. We take advantage of our new observation that, in several applications, the stride between memory request addresses and the stride between their corresponding data values are highly correlated. ASAP therefore predicts values only for those requests that have regular strides in their addresses. We evaluate ASAP on a diverse set of GPGPU applications. The results show that ASAP can significantly improve value prediction accuracy over previously proposed mechanisms at the same coverage, or can achieve higher coverage (leading to higher performance/energy improvements) under a fixed error threshold.
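The idea of predicting only when address strides are regular can be illustrated with a small software model. The sketch below is not the hardware design from the paper; it is a toy under stated assumptions. The table indexed by a load's program counter (`pc`), the rule that a stride counts as "regular" once it has repeated, the choice to carry the guessed value forward after a prediction, and the names `StrideEntry`, `AddressStridePredictor`, and `run` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class StrideEntry:
    """Last (address, value) pair seen for one load, plus the observed strides."""
    last_addr: int
    last_value: float
    addr_stride: Optional[int] = None
    value_stride: Optional[float] = None
    confident: bool = False  # assumption: True once the same address stride repeats


class AddressStridePredictor:
    """Toy model: approximate a load only when its address stride is regular."""

    def __init__(self) -> None:
        # Assumption: one entry per load PC; the paper's table layout may differ.
        self.table: Dict[int, StrideEntry] = {}

    def predict(self, pc: int, addr: int) -> Optional[float]:
        e = self.table.get(pc)
        if e is None or not e.confident or e.value_stride is None:
            return None  # no regular stride observed yet
        if addr - e.last_addr != e.addr_stride:
            return None  # this request breaks the stride, so do not predict
        return e.last_value + e.value_stride

    def train(self, pc: int, addr: int, value: float) -> None:
        e = self.table.get(pc)
        if e is None:
            self.table[pc] = StrideEntry(addr, value)
            return
        new_stride = addr - e.last_addr
        e.confident = e.addr_stride is not None and new_stride == e.addr_stride
        e.addr_stride = new_stride
        e.value_stride = value - e.last_value
        e.last_addr, e.last_value = addr, value


def run(trace: List[Tuple[int, int, float]]) -> Tuple[float, float]:
    """Replay (pc, addr, true_value) requests; return (coverage, mean absolute error)."""
    predictor = AddressStridePredictor()
    predicted = 0
    abs_err = 0.0
    for pc, addr, true_value in trace:
        guess = predictor.predict(pc, addr)
        if guess is None:
            # The load is served by memory, so the real value is available for training.
            predictor.train(pc, addr, true_value)
        else:
            predicted += 1
            abs_err += abs(guess - true_value)
            # Assumption: the real value is never fetched, so the guess is carried forward.
            predictor.train(pc, addr, guess)
    coverage = predicted / len(trace)
    mean_err = abs_err / predicted if predicted else 0.0
    return coverage, mean_err
```

In this model, `coverage` is the fraction of requests that were approximated rather than fetched, which corresponds to the prediction coverage discussed in the abstract. A trace whose addresses jump irregularly never becomes confident, so it is never approximated and contributes no error; that is the trade the address-stride filter is meant to make.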

Cited By

  • (2023) Approximate Computing: Hardware and Software Techniques, Tools and Their Applications. Journal of Circuits, Systems and Computers, 33:04. DOI: 10.1142/S0218126624300010. Online publication date: 20-Sep-2023.
  • (2021) Analyzing and Leveraging Decoupled L1 Caches in GPUs. 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 467-478. DOI: 10.1109/HPCA51647.2021.00047. Online publication date: Feb-2021.

Published In

ICS '19: Proceedings of the ACM International Conference on Supercomputing
June 2019
533 pages
ISBN: 9781450360791
DOI: 10.1145/3330345
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. GPU
  2. approximation
  3. scheduling
  4. value prediction

Qualifiers

  • Research-article

Conference

ICS '19
Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%
