DOI: 10.1145/3372799.3394359
research-article

PAQSIM: Fast Performance Model for Graphics Workload on Mobile GPUs

Published: 16 June 2020

Abstract

As GPUs grow more popular in embedded systems, demand is growing for performance models that support rapid estimation and tuning. One major challenge in developing a GPU performance model is balancing accuracy against speed. The two prevailing kinds of performance model, analytical and architectural, both have weaknesses. An analytical model is fast to execute and simple to implement but usually suffers from low simulation accuracy. A cycle-level architectural model, on the other hand, can offer high accuracy, but often at the expense of execution time.
In this work, we present a hybrid performance model for core-level performance studies. Our model combines the speed of the analytical model with the accuracy of the cycle-level architectural model. We model resource contention as traditional architectural models do, but reduce the number of simulated pipeline stages when no contention is expected. Graphics workloads show uniform characteristics, which allows us to replace some detailed simulation with analytical models for latency estimation in key events such as memory accesses, texture fetches, and synchronizations. This design greatly reduces simulation time while maintaining decent simulation accuracy.
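The hybrid idea can be illustrated with a minimal sketch. This is not PAQSIM's implementation: the instruction kinds, the `SharedUnit` class, the oldest-warp-first scheduling, and every latency and hit-rate constant are hypothetical placeholders chosen for illustration. ALU instructions are charged a single collapsed latency, memory accesses use an analytical expected latency, and only the shared texture unit is simulated with contention.

```python
from __future__ import annotations

from dataclasses import dataclass

# All constants are hypothetical placeholders, not values from the paper.
ALU_LATENCY = 4          # collapsed pipeline: one fixed cost, no per-stage simulation
HIT_RATE = 0.9           # assumed average cache hit rate
HIT_LATENCY = 20
MISS_LATENCY = 200


def analytical_mem_latency() -> float:
    """Expected memory latency from an average hit rate (analytical part)."""
    return HIT_RATE * HIT_LATENCY + (1 - HIT_RATE) * MISS_LATENCY


@dataclass
class SharedUnit:
    """A contended resource that serves one request at a time."""
    service_time: float
    busy_until: float = 0.0

    def issue(self, now: float) -> float:
        """Return the completion time of a request arriving at `now`."""
        start = max(now, self.busy_until)   # wait while the unit is occupied
        self.busy_until = start + self.service_time
        return self.busy_until


def simulate(warps: list[list[str]]) -> float:
    """Return the finish time of the slowest warp.

    Each warp is a list of instruction kinds: "alu", "mem", or "tex".
    """
    tex_unit = SharedUnit(service_time=8)
    clocks = [0.0] * len(warps)   # per-warp local time
    pcs = [0] * len(warps)        # per-warp next instruction index
    while True:
        ready = [i for i in range(len(warps)) if pcs[i] < len(warps[i])]
        if not ready:
            return max(clocks, default=0.0)
        # Advance the warp with the smallest local time, so requests reach
        # the shared unit in nondecreasing time order.
        i = min(ready, key=lambda w: clocks[w])
        kind = warps[i][pcs[i]]
        if kind == "alu":
            clocks[i] += ALU_LATENCY
        elif kind == "mem":
            clocks[i] += analytical_mem_latency()
        elif kind == "tex":
            clocks[i] = tex_unit.issue(clocks[i])   # contention modeled here
        else:
            raise ValueError(f"unknown instruction kind: {kind}")
        pcs[i] += 1
```

In this sketch, two warps that each issue one texture fetch at time 0 finish at different times because the second request queues behind the first, while ALU and memory instructions never interact across warps.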
We evaluate our performance model against commercial mobile GPUs. Experiments using graphics workloads from popular games show high simulation speed and high accuracy in predicting GPU performance. In the aggressive mode, the simulator achieves an average 4.1x slowdown, with an average error rate of 6% and a peak error rate of 27.9%.
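The abstract does not define how slowdown and error rate are computed. The sketch below assumes the common definitions: slowdown is simulator wall-clock time divided by execution time on real hardware, and error rate is the absolute relative difference between predicted and measured performance. The function names are hypothetical and no numbers from the paper are reproduced.

```python
def slowdown(sim_seconds: float, hardware_seconds: float) -> float:
    """Assumed definition: simulator wall-clock time over real execution time."""
    return sim_seconds / hardware_seconds


def error_rate(predicted: float, measured: float) -> float:
    """Assumed definition: absolute relative error of the prediction."""
    return abs(predicted - measured) / measured


def summarize(pairs: list[tuple[float, float]]) -> tuple[float, float]:
    """Return (average error, peak error) over (predicted, measured) pairs."""
    errors = [error_rate(p, m) for p, m in pairs]
    return sum(errors) / len(errors), max(errors)
```

Under these assumptions, an average error well below the peak error, as the abstract reports, means that most workloads are predicted closely and a few outliers account for the worst case.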

Supplementary Material

MP4 File (3372799.3394359.mp4)
Presentation Video

Cited By

  • (2024) FusionRender: Harnessing WebGPU's Power for Enhanced Graphics Performance on Web Browsers. Proceedings of the ACM Web Conference 2024, 2890-2901. DOI: 10.1145/3589334.3645395. Online publication date: 13-May-2024
  • (2022) GCoM. Proceedings of the 49th Annual International Symposium on Computer Architecture, 424-436. DOI: 10.1145/3470496.3527384. Online publication date: 18-Jun-2022

Published In

LCTES '20: The 21st ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems
June 2020
163 pages
ISBN:9781450370943
DOI:10.1145/3372799

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. gpu
  2. graphics
  3. performance model
  4. simulation
  5. soc

Qualifiers

  • Research-article

Conference

LCTES '20

Acceptance Rates

Overall Acceptance Rate 116 of 438 submissions, 26%
