DOI: 10.1145/3368826.3377917 — CGO Conference Proceedings, research article

PreScaler: an efficient system-aware precision scaling framework on heterogeneous systems

Published: 22 February 2020

Abstract

Graphics processing units (GPUs) are commonly used to accelerate emerging applications such as big data processing and machine learning. While GPUs are effective on their own, approximate computing, which trades accuracy for performance, is one of the most common approaches to further performance improvement. Precision scaling, which converts originally high-precision values into lower-precision values, has recently become the most widely used GPU-side approximation technique, aided by hardware-level half-precision support. Although several approaches have been introduced to find the optimal mixed-precision configuration of GPU-side kernels, the total program performance gain is often low because total execution time combines data transfer, type conversion, and kernel execution; kernel-level scaling can therefore incur high type-conversion overhead on the kernel input/output data. To address this problem, this paper proposes an automatic precision scaling framework called PreScaler that maximizes program performance at the memory-object level by considering whole OpenCL program flows. The main difficulty is that the best configuration cannot easily be predicted, owing to various application- and system-specific characteristics. PreScaler solves this problem with search-space minimization followed by a decision-tree-based search. First, it minimizes the number of test configurations using information from system inspection and dynamic profiling. Then, it finds the best memory-object-level mixed-precision configuration using a decision-tree-based search. PreScaler achieves an average performance gain of 1.33x over the baseline while maintaining the target output quality level.
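The trade-off the abstract describes can be illustrated with a toy cost model. The sketch below is not PreScaler's actual algorithm; all object names, costs, and the quality check are hypothetical. It shows why per-memory-object decisions matter: demoting a buffer to half precision shrinks its transfer time and speeds up the kernel, but adds a one-off type-conversion cost, so a greedy, decision-tree-like search over memory objects (rather than whole kernels) can find the configuration that minimizes the combined transfer + conversion + kernel time while a quality constraint still holds.

```python
# Toy cost model (all numbers hypothetical) for memory-object-level
# mixed-precision search. Each object has a size and a fixed
# fp32 -> fp16 conversion cost.
OBJECTS = {
    "A": {"mb": 64, "conv_ms": 0.5},
    "B": {"mb": 64, "conv_ms": 0.5},
    "C": {"mb": 32, "conv_ms": 0.3},
}
TRANSFER_MS_PER_MB = 0.1            # assumed PCIe transfer cost
KERNEL_MS_FP32 = 10.0               # assumed all-fp32 kernel time
KERNEL_SPEEDUP_PER_HALF_OBJ = 0.12  # assumed fractional kernel speedup


def total_time(half_set):
    """End-to-end time: transfer + conversion + kernel execution.

    Objects in `half_set` move half as many bytes but pay a
    conversion cost; each demotion also speeds up the kernel.
    """
    transfer = sum(
        o["mb"] * TRANSFER_MS_PER_MB * (0.5 if name in half_set else 1.0)
        for name, o in OBJECTS.items())
    conversion = sum(
        o["conv_ms"] for name, o in OBJECTS.items() if name in half_set)
    kernel = KERNEL_MS_FP32 * (
        1.0 - KERNEL_SPEEDUP_PER_HALF_OBJ * len(half_set))
    return transfer + conversion + kernel


def quality_ok(half_set, budget=2):
    # Stand-in for an output-quality check: here we simply cap the
    # number of demoted objects; a real system would compare output
    # error against a target quality level.
    return len(half_set) <= budget


def greedy_search():
    """Decision-tree-like descent over memory objects.

    Repeatedly demote whichever object improves total time the most,
    as long as the quality check still passes.
    """
    chosen = set()
    best = total_time(chosen)
    while True:
        cands = [(total_time(chosen | {n}), n)
                 for n in OBJECTS
                 if n not in chosen and quality_ok(chosen | {n})]
        if not cands:
            break
        t, n = min(cands)
        if t >= best:
            break
        chosen.add(n)
        best = t
    return chosen, best
```

With these made-up numbers, the all-fp32 baseline costs 26.0 ms, while demoting the two large buffers brings the total to about 18.2 ms, even after paying their conversion costs; the search stops there because the quality budget is exhausted.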


Cited By

  • (2025) Approximate Computing Survey, Part II: Application-Specific & Architectural Approximation Techniques and Applications. ACM Computing Surveys, 57(7):1–36. DOI: 10.1145/3711683. Online publication date: 20 Feb 2025.
  • (2024) Approximate Computing: Concepts, Architectures, Challenges, Applications, and Future Directions. IEEE Access, 12:146022–146088. DOI: 10.1109/ACCESS.2024.3467375. Online publication date: 2024.
  • (2023) HPAC-Offload: Accelerating HPC Applications with Portable Approximate Computing on the GPU. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–14. DOI: 10.1145/3581784.3607095. Online publication date: 12 Nov 2023.

Published In

CGO '20: Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization
February 2020
329 pages
ISBN:9781450370479
DOI:10.1145/3368826
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication Notes

Badge change: Article originally badged under Version 1.0 guidelines https://www.acm.org/publications/policies/artifact-review-badging


Author Tags

  1. Compiler
  2. HSA
  3. Precision Scaling
  4. Profile-guided
  5. Runtime


Funding Sources

  • Samsung Research Funding & Incubation Center

Conference

CGO '20

Acceptance Rates

Overall Acceptance Rate 312 of 1,061 submissions, 29%

Article Metrics

  • Downloads (last 12 months): 19
  • Downloads (last 6 weeks): 1

Reflects downloads up to 01 Mar 2025
