DOI: 10.1145/3368826.3377917 — CGO Conference Proceedings, research article

PreScaler: an efficient system-aware precision scaling framework on heterogeneous systems

Published: 22 February 2020

Abstract

Graphics processing units (GPUs) are commonly used to accelerate emerging applications such as big data processing and machine learning. While GPUs are effective on their own, approximate computing, which trades accuracy for performance, is one of the most common approaches to further performance improvement. Precision scaling, which converts originally high-precision values into lower-precision values, has recently become the most widely used GPU-side approximation technique, aided by hardware-level half-precision support. Although several approaches have been introduced to find the optimal mixed-precision configuration of GPU-side kernels, the total program performance gain is often low because total execution time combines data transfer, type conversion, and kernel execution; kernel-level scaling can therefore incur high type-conversion overhead on the kernel input/output data. To address this problem, this paper proposes an automatic precision scaling framework called PreScaler that maximizes program performance at the memory-object level by considering whole OpenCL program flows. The main difficulty is that the best configuration cannot easily be predicted, owing to various application- and system-specific characteristics. PreScaler solves this problem with search-space minimization followed by a decision-tree-based search. First, it minimizes the number of test configurations using information from system inspection and dynamic profiling. Then, it finds the best memory-object-level mixed-precision configuration using a decision-tree-based search. PreScaler achieves an average performance gain of 1.33x over the baseline while maintaining the target output quality level.
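The trade-off the abstract describes can be illustrated with a toy cost model. The sketch below is not PreScaler's actual algorithm; all object names, costs, and the quality check are hypothetical. It shows why per-memory-object decisions matter: demoting a buffer to half precision shrinks its transfer time and speeds up the kernel, but adds a one-off type-conversion cost, so a greedy, decision-tree-like search over memory objects (rather than whole kernels) can find the configuration that minimizes the combined transfer + conversion + kernel time while a quality constraint still holds.

```python
# Toy cost model (all numbers hypothetical) for memory-object-level
# mixed-precision search. Each object has a size and a fixed
# fp32 -> fp16 conversion cost.
OBJECTS = {
    "A": {"mb": 64, "conv_ms": 0.5},
    "B": {"mb": 64, "conv_ms": 0.5},
    "C": {"mb": 32, "conv_ms": 0.3},
}
TRANSFER_MS_PER_MB = 0.1            # assumed PCIe transfer cost
KERNEL_MS_FP32 = 10.0               # assumed all-fp32 kernel time
KERNEL_SPEEDUP_PER_HALF_OBJ = 0.12  # assumed fractional kernel speedup


def total_time(half_set):
    """End-to-end time: transfer + conversion + kernel execution.

    Objects in `half_set` move half as many bytes but pay a
    conversion cost; each demotion also speeds up the kernel.
    """
    transfer = sum(
        o["mb"] * TRANSFER_MS_PER_MB * (0.5 if name in half_set else 1.0)
        for name, o in OBJECTS.items())
    conversion = sum(
        o["conv_ms"] for name, o in OBJECTS.items() if name in half_set)
    kernel = KERNEL_MS_FP32 * (
        1.0 - KERNEL_SPEEDUP_PER_HALF_OBJ * len(half_set))
    return transfer + conversion + kernel


def quality_ok(half_set, budget=2):
    # Stand-in for an output-quality check: here we simply cap the
    # number of demoted objects; a real system would compare output
    # error against a target quality level.
    return len(half_set) <= budget


def greedy_search():
    """Decision-tree-like descent over memory objects.

    Repeatedly demote whichever object improves total time the most,
    as long as the quality check still passes.
    """
    chosen = set()
    best = total_time(chosen)
    while True:
        cands = [(total_time(chosen | {n}), n)
                 for n in OBJECTS
                 if n not in chosen and quality_ok(chosen | {n})]
        if not cands:
            break
        t, n = min(cands)
        if t >= best:
            break
        chosen.add(n)
        best = t
    return chosen, best
```

With these made-up numbers, the all-fp32 baseline costs 26.0 ms, while demoting the two large buffers brings the total to about 18.2 ms, even after paying their conversion costs; the search stops there because the quality budget is exhausted.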


Cited By

  • (2025) Approximate Computing Survey, Part II: Application-Specific & Architectural Approximation Techniques and Applications. ACM Computing Surveys, 57(7):1–36. DOI: 10.1145/3711683. Online publication date: 20 Feb 2025.
  • (2024) Approximate Computing: Concepts, Architectures, Challenges, Applications, and Future Directions. IEEE Access, 12:146022–146088. DOI: 10.1109/ACCESS.2024.3467375. Online publication date: 2024.
  • (2023) HPAC-Offload: Accelerating HPC Applications with Portable Approximate Computing on the GPU. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–14. DOI: 10.1145/3581784.3607095. Online publication date: 12 Nov 2023.

Published In

CGO '20: Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization
February 2020
329 pages
ISBN:9781450370479
DOI:10.1145/3368826
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication Notes

Badge change: Article originally badged under Version 1.0 guidelines https://www.acm.org/publications/policies/artifact-review-badging


Author Tags

  1. Compiler
  2. HSA
  3. Precision Scaling
  4. Profile-guided
  5. Runtime


Funding Sources

  • Samsung Research Funding & Incubation Center

Conference

CGO '20

Acceptance Rates

Overall Acceptance Rate 312 of 1,061 submissions, 29%

Article Metrics

  • Downloads (last 12 months): 19
  • Downloads (last 6 weeks): 1

Reflects downloads up to 01 Mar 2025
