PAQSIM: Fast Performance Model for Graphics Workload on Mobile GPUs

Authors:

Chu-Cheow Lim

LCTES '20: The 21st ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems

Pages 3 - 14

https://doi.org/10.1145/3372799.3394359

Published: 16 June 2020

Abstract

As GPUs become more popular in embedded systems, demand is growing for performance models that support rapid estimation and tuning. A major challenge in developing a GPU performance model is balancing accuracy against speed. The two prevailing kinds of performance model, analytical and architectural, each have weaknesses. An analytical model is fast to execute and simple to implement but usually suffers from low simulation accuracy. A cycle-level architectural model, on the other hand, can offer high accuracy, but often at the expense of execution time.

In this work, we present a hybrid performance model for core-level performance studies. Our model combines the speed of the analytical model with the accuracy of the cycle-level architectural model. We model resource contention as traditional architectural models do, but reduce the modeled pipeline stages when no contention is expected. Graphics workloads show uniform characteristics, which allows us to replace some detailed simulation with analytical models for latency estimation in key events such as memory accesses, texture fetches, and synchronizations. This design greatly reduces simulation time while maintaining decent simulation accuracy.
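The hybrid idea can be illustrated with a minimal sketch. It is not the PAQSIM implementation: the latency table, the single shared port, and the names `LATENCY` and `simulate` are all assumptions made for illustration. Contention on the shared port is tracked explicitly, while memory and texture events take a fixed analytical latency instead of being simulated in detail.

```python
import heapq

# Hypothetical analytical latencies in cycles; these values are
# illustrative assumptions, not figures from the paper.
LATENCY = {"alu": 1, "mem": 100, "tex": 200}


def simulate(traces, port_service=4):
    """Estimate the total cycles needed to run a list of instruction traces.

    Each trace is a list of op names ("alu", "mem", "tex") for one
    wavefront. ALU ops are assumed contention-free, so they only add a
    fixed latency. Memory and texture ops share one port (a hypothetical
    resource) that accepts a new request every `port_service` cycles.
    """
    port_free_at = 0
    # Heap entries: (cycle the wavefront is ready, wavefront id, next op index)
    ready = [(0, wid, 0) for wid in range(len(traces))]
    heapq.heapify(ready)
    end = 0
    while ready:
        now, wid, pc = heapq.heappop(ready)
        if pc == len(traces[wid]):
            end = max(end, now)  # this wavefront has finished
            continue
        op = traces[wid][pc]
        if op == "alu":
            # No contention expected: skip detailed pipeline modeling.
            done = now + LATENCY[op]
        else:
            # Contention is modeled: wait until the shared port is free.
            start = max(now, port_free_at)
            port_free_at = start + port_service
            # Analytical latency replaces a detailed memory simulation.
            done = start + LATENCY[op]
        heapq.heappush(ready, (done, wid, pc + 1))
    return end


if __name__ == "__main__":
    one = simulate([["alu", "mem", "alu"]])
    two = simulate([["alu", "mem", "alu"], ["alu", "mem", "alu"]])
    print(one, two)
```

With two identical wavefronts, the second one finishes later than the first only because its memory request has to wait for the shared port; everything else is resolved by table lookup.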

We evaluate our performance model against commercial mobile GPUs. Experiments using graphics workloads from popular games show high simulation speed and high accuracy in predicting GPU performance. In the aggressive mode, the simulator achieves an average slowdown of 4.1x, with an average error rate of 6% and a peak error rate of 27.9%.
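For readers who want to reproduce this kind of summary, the following sketch shows one common way such metrics are computed. The definitions are assumptions, since the abstract does not state them: slowdown is taken as simulation time divided by native execution time, and error rate as the relative absolute difference between predicted and measured performance. The function names and sample numbers are hypothetical.

```python
def slowdown(sim_seconds, native_seconds):
    """Ratio of simulation wall-clock time to native execution time (assumed definition)."""
    return sim_seconds / native_seconds


def error_rates(predicted, measured):
    """Return (average, peak) relative error over paired workloads (assumed definition)."""
    errors = [abs(p - m) / m for p, m in zip(predicted, measured)]
    return sum(errors) / len(errors), max(errors)


if __name__ == "__main__":
    # Hypothetical per-workload frame times in milliseconds, for illustration only.
    predicted = [95.0, 110.0, 100.0]
    measured = [100.0, 100.0, 100.0]
    average, peak = error_rates(predicted, measured)
    print(f"average error {average:.1%}, peak error {peak:.1%}")
    print(f"slowdown {slowdown(8.0, 2.0):.1f}x")
```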

Supplementary Material

MP4 File (3372799.3394359.mp4): Presentation Video, 26.12 MB

Cited By

Bi W, Ma Y, Han Y, Chen Y, Tian D, Du J, Chua T, Ngo C, Ka-Wei Lee R, Kumar R, Lauw H (2024). FusionRender: Harnessing WebGPU's Power for Enhanced Graphics Performance on Web Browsers. Proceedings of the ACM Web Conference 2024 (2890-2901). Online publication date: 13-May-2024.
https://dl.acm.org/doi/10.1145/3589334.3645395

Lee J, Ha Y, Lee S, Woo J, Lee J, Jang H, Kim Y, Salapura V, Zahran M, Chong F, Tang L (2022). GCoM. Proceedings of the 49th Annual International Symposium on Computer Architecture (424-436). Online publication date: 18-Jun-2022.
https://dl.acm.org/doi/10.1145/3470496.3527384

Index Terms

PAQSIM: Fast Performance Model for Graphics Workload on Mobile GPUs
1. Computer systems organization
  1. Architectures
    1. Other architectures
      1. Heterogeneous (hybrid) systems
  2. Embedded and cyber-physical systems
    1. System on a chip
2. Computing methodologies
  1. Computer graphics
    1. Graphics systems and interfaces
      1. Graphics processors
  2. Modeling and simulation

Published In

LCTES '20: The 21st ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems

June 2020

163 pages

ISBN:9781450370943

DOI:10.1145/3372799

General Chair: Jingling Xue (UNSW Sydney, Australia)
Program Chair: Changhee Jung (Purdue University, USA)

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

LCTES '20

LCTES '20: 21st ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems

June 16, 2020

London, United Kingdom

Acceptance Rates

Overall Acceptance Rate 116 of 438 submissions, 26%

Bibliometrics & Citations

Total Citations: 2
Total Downloads: 267
Downloads (Last 12 months): 32
Downloads (Last 6 weeks): 4

Reflects downloads up to 20 Feb 2025
