Abstract
CUDA is a data-parallel programming model that supports several key abstractions, including thread blocks, hierarchical memory, and barrier synchronization, for writing applications. This model has proven effective for programming GPUs. In this paper we describe a framework called MCUDA, which allows CUDA programs to be executed efficiently on shared-memory, multi-core CPUs. Our framework consists of a set of source-level compiler transformations and a runtime system for parallel execution. While preserving program semantics, the compiler transforms threaded SPMD functions into explicit loops, performs loop fission to eliminate barrier synchronizations, and replaces scalar references to thread-local data with replicated vector references. We describe an implementation of this framework and demonstrate performance approaching that achievable from manually parallelized and optimized C code. With these results, we argue that CUDA can be an effective data-parallel programming model for more than just GPU architectures.
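The core transformation described above, serializing the implicit per-thread execution of an SPMD kernel into an explicit loop over thread indices, can be sketched as follows. This is a minimal illustration of the technique, not MCUDA's actual output; the kernel (`saxpy`) and the serialized function name are hypothetical:

```c
#include <stddef.h>

/* CUDA-style SPMD kernel (conceptual original):
 *
 *   __global__ void saxpy(float a, float *x, float *y) {
 *       int i = blockIdx.x * blockDim.x + threadIdx.x;
 *       y[i] = a * x[i] + y[i];
 *   }
 *
 * Serialized CPU form: the logical threads of one block become
 * iterations of an explicit loop, so one CPU thread can execute an
 * entire thread block. Implicit thread identifiers (threadIdx,
 * blockIdx, blockDim) become ordinary function parameters and loop
 * variables. */
static void saxpy_block(float a, const float *x, float *y,
                        int blockIdx_x, int blockDim_x) {
    for (int threadIdx_x = 0; threadIdx_x < blockDim_x; ++threadIdx_x) {
        int i = blockIdx_x * blockDim_x + threadIdx_x;
        y[i] = a * x[i] + y[i];
    }
}
```

A host-side driver would then iterate (or parallelize) over `blockIdx_x` values, which is where a runtime system can distribute blocks across CPU cores. Kernels containing `__syncthreads()` cannot be serialized this directly; that is where the loop fission mentioned in the abstract applies, splitting the thread loop at each barrier.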
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
Cite this paper
Stratton, J.A., Stone, S.S., Hwu, Wm.W. (2008). MCUDA: An Efficient Implementation of CUDA Kernels for Multi-core CPUs. In: Amaral, J.N. (eds) Languages and Compilers for Parallel Computing. LCPC 2008. Lecture Notes in Computer Science, vol 5335. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-89740-8_2
DOI: https://doi.org/10.1007/978-3-540-89740-8_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-89739-2
Online ISBN: 978-3-540-89740-8