
Concurrent warp execution: improving performance of GPU-likely SIMD architecture by increasing resource utilization

The Journal of Supercomputing

Abstract

Hardware parallelism should be exploited to improve the performance of computing systems. Single instruction multiple data (SIMD) architecture has been widely used to maximize throughput by exploiting hardware parallelism. Unfortunately, branch divergence caused by branch instructions leads to underutilization of computational resources, degrading the performance of SIMD architecture. The graphics processing unit (GPU) is a representative parallel architecture based on SIMD. In recent computing systems, GPUs can process general-purpose applications as well as graphics applications with the help of convenient APIs. Contrary to graphics applications, however, general-purpose applications include many branch instructions, so GPUs suffer serious performance degradation from branch divergence when executing them. In this paper, we propose the concurrent warp execution (CWE) technique, which reduces this performance degradation by increasing resource utilization. CWE selects co-warps so that more threads in a warp become active, enabling the combined warps to execute concurrently. According to our simulation results, the proposed architecture provides a significant performance improvement (5.85% over post-dominator reconvergence (PDOM) and 91% over dynamic warp formation (DWF)) with little hardware overhead.
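The problem the abstract describes can be made concrete. Branch divergence occurs when threads within the same warp take different control-flow paths, forcing the SIMD hardware to execute each path serially with some lanes masked off. Below is a minimal CUDA sketch of a divergent kernel; the kernel name and data layout are illustrative examples, not taken from the paper:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative divergent kernel: within each 32-thread warp, even- and
// odd-valued elements take different branches, so the hardware serializes
// the two paths with complementary active masks, leaving roughly half of
// the SIMD lanes idle during each phase.
__global__ void divergent_kernel(const int *in, int *out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    if (in[tid] % 2 == 0) {
        out[tid] = in[tid] * 2;   // path A: even elements
    } else {
        out[tid] = in[tid] + 1;   // path B: odd elements
    }
}

int main() {
    const int n = 1024;
    int h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = i;

    int *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

    divergent_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);

    printf("out[0]=%d out[1]=%d\n", h_out[0], h_out[1]);  // prints 0 and 2
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

The CWE idea sketched in the abstract is to fill those idle lanes by issuing a co-warp whose active threads occupy the lanes the first warp leaves unused. A hypothetical eligibility check over 32-bit active masks follows; this is a simplification for illustration, and the paper's actual co-warp selection logic is given in the full text:

```cuda
// Hypothetical helper (not from the paper): two warps can share one SIMD
// issue slot only if no lane is active in both of their masks.
static inline bool can_combine(unsigned maskA, unsigned maskB) {
    return (maskA & maskB) == 0u;   // e.g. 0x0000FFFF and 0xFFFF0000
}
```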



References

  1. Lee R (1999) Efficiency of microSIMD architectures and index-mapped data for media processing. In: Proceedings of IS&T/SPIE symposium on electronic imaging, pp 34–46

  2. Flynn M (1972) Some computer organizations and their effectiveness. IEEE Trans Comput C-21(9):948–960


  3. Luebke D, Humphreys G (2007) How GPUs work. J Comput 40(2):96–100


  4. Lee VW, Kim CK, Chhugani J, Deisher M, Kim DH, Nguyen AD, Satish N, Smelyanskiy M, Chennupaty S, Hammarlund P, Singhal R, Dubey P (2010) Debunking the 100× GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU. In: Proceedings of international symposium on computer architecture, pp 451–460

  5. General-purpose computation on graphics hardware, available at http://www.gpgpu.org/. Accessed 2 Jul 2011

  6. Buck I, Foley T, Horn D, Sugerman J, Fatahalian K, Houston M, Hanrahan P (2004) Brook for GPUs: stream computing on graphics hardware. In: Proceedings of conference on computer graphics and interactive techniques (SIGGRAPH), pp 777–786

  7. Owens JD, Luebke D, Govindaraju N, Harris M, Kruger J, Lefohn AE, Purcell TJ (2005) A survey of general-purpose computation on graphics hardware. In: Euro-graphics 2005, state of the art reports, pp 21–51

  8. CUDA Programming Guide Version 3.0, available at https://developer.nvidia.com/cuda-toolkit-30-downloads/. Accessed 11 Aug 2011

  9. ATI Stream Technology, available at http://developer.amd.com/tools-and-sdks/. Accessed 28 Sep 2011

  10. Khronos Group, OpenCL, available at http://www.khronos.org/opencl/. Accessed 1 Feb 2012

  11. Cg, available at https://developer.nvidia.com/cg-toolkit. Accessed 9 Apr 2012

  12. HLSL, available at http://msdn2.microsoft.com/en-us/library/bb509638.aspx. Accessed 13 Apr 2012

  13. OpenGL, available at http://www.opengl.org/registry/doc/GLSLangSpec.Full.1.20.8.pdf. Accessed 27 Jun 2012

  14. Rhu M, Erez M (2012) CAPRI: prediction of compaction-adequacy for handling control-divergence in GPGPU architectures. In: Proceedings of international symposium on computer architecture, pp 61–71

  15. Gilani SZ, Kim NS, Schulte MJ (2013) Power-efficient computing for compute-intensive GPGPU applications. In: Proceedings of international symposium on high performance computer architecture, pp 412–423

  16. Levinthal A, Porter T (1984) Chap—a SIMD graphics processor. In: Proceedings of conference on computer graphics and interactive techniques (SIGGRAPH), pp 77–82

  17. Moy S, Lindholm E (2005) US patent 6,947,047: method and system for programmable pipelined graphics processing with branching instructions, available at http://www.google.com/patents/US6947047. Accessed 7 Jan 2012

  18. Lorie RA, Strong HR (1984) US patent 4,435,758: method for conditional branch execution in SIMD vector processors, available at http://www.google.com/patents/US4435758. Accessed 10 Jan 2012

  19. Fung WWL, Sham I, Yuan G, Aamodt TM (2007) Dynamic warp formation and scheduling for efficient GPU control flow. In: Proceedings of international symposium on microarchitecture, pp 407–420

  20. Fung WWL, Aamodt TM (2011) Thread block compaction for efficient SIMT control flow. In: Proceedings of international symposium on high performance computer architecture, pp 25–36

  21. Narasiman V, Shebanow M, Lee CJ, Miftakhutdinov R, Mutlu O, Patt YN (2011) Improving GPU performance via large warps and two-level warp scheduling. In: Proceedings of international symposium on microarchitecture, pp 308–317

  22. Meng J, Tarjan D, Skadron K (2010) Dynamic warp subdivision for integrated branch and memory divergence tolerance. In: Proceedings of international symposium on computer architecture, pp 235–246

  23. Harish P, Narayanan PJ (2007) Accelerating large graph algorithms on the GPU using CUDA. In: Proceedings of international conference on high performance computing, pp 197–208

  24. Giles M (2008) Jacobi iteration for a Laplace discretisation on a 3D structured grid. Technical Report, available at http://people.maths.ox.ac.uk/~gilesm/hpc/NVIDIAlaplace3d.pdf. Accessed 19 Jan 2012

  25. Chen L, Das H, Pan S (2009) An implementation of ray tracing in CUDA. CSE 260 Project Report, available at http://cseweb.ucsd.edu/~baden/classes/Exemplars/260_fa09/ChenDasPan_cse260_fa09.pdf. Accessed 25 Jan 2012

  26. Harris M (2007) Parallel prefix sum (scan) with CUDA. Project Report, available at http://beowulf.lcs.mit.edu/18.337-2008/lectslides/scan.pdf. Accessed 1 Feb 2012

  27. Micikevicius P (2009) 3D finite difference computation on GPUs using CUDA. In: Proceedings of workshop on general purpose processing on graphics processing units, pp 79–84

  28. Matsumoto M, Nishimura T (1998) Mersenne twister: a 623-dimensionally equidistributed uniform pseudo-random number generator. ACM Trans Model Comput Simul 8(1):3–30


  29. NVIDIA Co., Ltd., available at http://www.nvidia.com/. Accessed 22 Jul 2011

  30. AMD (Advanced Micro Devices) Inc., available at http://www.amd.com/. Accessed 18 Oct 2011

  31. Quadro FX 5800, available at http://www.nvidia.com/object/product_quadro_fx_5800_us.html. Accessed 16 Sep 2012

  32. NVIDIA Co., Ltd. (2009) NVIDIA’s next generation CUDA compute architecture: Fermi, White paper, available at http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf. Accessed 11 Dec 2011

  33. Thornton JE (1964) Parallel operation in the Control Data 6600. In: AFIPS proceedings of FJCC, part 2, vol 26, pp 33–40

  34. Coon BW, Lindholm EJ (2008) US patent 7,353,369: system and method for managing divergent threads in a SIMD architecture

  35. Coon BW, Mills PC, Oberman SF, Siu MY (2008) US patent 7,434,032: tracking register usage during multithreaded processing using a scoreboard having separate memory regions and storing sequential register size indicators

  36. Lindholm J, Moy S (2010) US patent application 2005/0138328 A1: across-thread out-of-order instruction dispatch in a multithreaded microprocessor

  37. Woop S, Schmittler J, Slusallek P (2005) RPU: a programmable ray processing unit for realtime ray tracing. In: Proceedings of conference on computer graphics and interactive techniques (SIGGRAPH), pp 434–444

  38. Muchnick S (1997) Advanced compiler design and implementation. Morgan Kaufmann, San Francisco


  39. Bakhoda A, Yuan GL, Fung WWL, Wong H, Aamodt TM (2009) Analyzing CUDA workloads using a detailed GPU simulator. In: Proceedings of international symposium on performance analysis of systems and software, pp 163–174

  40. Burger DC, Austin TM (1997) The SimpleScalar tool set, version 2.0. Comput Archit News 25(3):13–25


  41. Dally WJ, Towles B (2004) Principles and practices of interconnection networks. Morgan Kaufmann, San Francisco

  42. CUDA SDK, available at http://developer.download.nvidia.com/compute/cuda/sdk/website/samples.html. Accessed 30 Sep 2012

  43. Kirk D, Hwu WW (2010) Programming massively parallel processors: a hands-on approach. Morgan Kaufmann, San Francisco


  44. Tarjan D, Thoziyoor S, Jouppi NP (2006) CACTI 4.0. Technical Report HPL-2006-86


Acknowledgments

This work was supported by the National Research Foundation of Korea Grant funded by the Korean Government (2012R1A1B4003492) and the ITRC (Information Technology Research Center) support program (NIPA-2013-H0301-13-3005) supervised by the NIPA (National IT Industry Promotion Agency).

Author information

Corresponding author

Correspondence to Cheol Hong Kim.


Cite this article

Choi, H.J., Son, D.O., Kim, J.M. et al. Concurrent warp execution: improving performance of GPU-likely SIMD architecture by increasing resource utilization. J Supercomput 69, 330–356 (2014). https://doi.org/10.1007/s11227-014-1155-4
