Address-stride assisted approximate load value prediction in GPUs

Authors: Mohamed Ibrahim, Adwait Jog

ICS '19: Proceedings of the ACM International Conference on Supercomputing

Pages 184 - 194

https://doi.org/10.1145/3330345.3330362

Published: 26 June 2019

Abstract

Value prediction holds the promise of significantly improving performance and energy efficiency. However, when values are predicted incorrectly, execution rollbacks incur significant performance overheads. Value approximation was introduced to address these overheads: it leverages the observation that rollbacks are unnecessary as long as the application-level loss in quality due to value misprediction is acceptable to the user. In the context of Graphics Processing Units (GPUs), however, our evaluations show that existing approximate value predictors achieve suboptimal prediction accuracy because they do not consider memory request order, a key factor in determining the accuracy of value prediction. As a result, the overall data movement reduction benefits are capped, because the percentage of predicted values (i.e., prediction coverage) must be limited to keep the application-level error acceptable.

To this end, we propose a new Address-Stride Assisted Approximate Value Predictor (ASAP) that explicitly considers memory addresses and their request order to provide high value prediction accuracy. ASAP builds on our new observation that, in several applications, the stride between memory request addresses and the stride between their corresponding data values are highly correlated. ASAP therefore predicts values only for requests whose addresses have regular strides. We evaluate ASAP on a diverse set of GPGPU applications. The results show that ASAP can significantly improve value prediction accuracy over previously proposed mechanisms at the same coverage, or can achieve higher coverage (leading to higher performance/energy improvements) under a fixed error threshold.
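The idea of predicting values only for requests with regular address strides can be illustrated with a small software sketch. This is not the hardware design evaluated in the paper: the class names, the single-entry table, the confidence counter, and the threshold of 2 are all assumptions made for illustration. The sketch predicts a value only when the incoming address continues a stride seen on consecutive requests, and otherwise declines to predict.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StrideEntry:
    """State of one hypothetical predictor entry (illustrative, not from the paper)."""
    last_addr: Optional[int] = None
    last_value: Optional[float] = None
    addr_stride: Optional[int] = None
    value_stride: Optional[float] = None
    confidence: int = 0  # consecutive requests that repeated the address stride


class StrideAssistedPredictor:
    """Predicts a load value only when the address stride has been regular."""

    def __init__(self, confidence_threshold: int = 2):
        # The threshold is an assumed tuning knob, not a value from the paper.
        self.threshold = confidence_threshold
        self.entry = StrideEntry()

    def predict(self, addr: int) -> Optional[float]:
        """Return an approximate value, or None if the request should not be predicted."""
        e = self.entry
        if e.last_addr is None or e.addr_stride is None or e.value_stride is None:
            return None  # not enough history yet
        if e.confidence < self.threshold:
            return None  # address stride not yet shown to be regular
        if addr - e.last_addr != e.addr_stride:
            return None  # this request breaks the stride pattern
        # Extrapolate the value using the stride observed between past values.
        return e.last_value + e.value_stride

    def train(self, addr: int, value: float) -> None:
        """Update the entry with the actual value returned for a request."""
        e = self.entry
        if e.last_addr is not None:
            new_addr_stride = addr - e.last_addr
            if new_addr_stride == e.addr_stride:
                e.confidence += 1
            else:
                e.confidence = 0
            e.addr_stride = new_addr_stride
            e.value_stride = value - e.last_value
        e.last_addr, e.last_value = addr, value


# Usage: addresses advance by 4 bytes while values advance by 2.0.
predictor = StrideAssistedPredictor(confidence_threshold=2)
for addr, value in [(0, 10.0), (4, 12.0), (8, 14.0), (12, 16.0)]:
    predictor.train(addr, value)

print(predictor.predict(16))   # 18.0: the address continues the regular stride
print(predictor.predict(100))  # None: irregular stride, so no prediction is made
```

Declining to predict irregular requests lowers coverage for those requests, but it keeps the predictions that are made tied to a pattern that has already been observed, which is the trade-off the abstract describes between coverage and application-level error.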

Cited By

M. Raza, S. Javed, M. Kazmi, A. Aziz, M. Ul Haque, and S. Qazi, "Approximate Computing: Hardware and Software Techniques, Tools and Their Applications," Journal of Circuits, Systems and Computers, vol. 33, no. 04, published online 20 Sep. 2023. https://doi.org/10.1142/S0218126624300010

M. Ibrahim, O. Kayiran, Y. Eckert, G. Loh, and A. Jog, "Analyzing and Leveraging Decoupled L1 Caches in GPUs," in 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 467-478, Feb. 2021. https://doi.org/10.1109/HPCA51647.2021.00047

Published In

ICS '19: Proceedings of the ACM International Conference on Supercomputing

June 2019, 533 pages

ISBN: 9781450360791

DOI: 10.1145/3330345

General Chair: Rudolf Eigenmann (University of Delaware)

Program Chairs: Chen Ding (University of Rochester), Sally A. McKee (Clemson University)

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

National Science Foundation

Conference

ICS '19: 2019 International Conference on Supercomputing

Sponsor: SIGARCH

June 26 - 28, 2019

Phoenix, Arizona

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%
