DOI: 10.1145/1810085.1810106

Research article

Compiler and runtime support for enabling generalized reduction computations on heterogeneous parallel configurations

Published: 02 June 2010

Abstract

Computing platforms have become increasingly heterogeneous, a trend that has attracted much attention. It is now common for a desktop or notebook computer to come equipped with both a multi-core CPU and a GPU. Capitalizing on the full computational power of such architectures, i.e., exploiting the multi-core CPU and the GPU simultaneously, starting from a high-level API is a critical challenge. We believe it is highly desirable to offer programmers a simple way to realize the full potential of today's heterogeneous machines.
This paper describes a compiler and runtime framework that maps a class of applications, namely those characterized by generalized reductions, to a system with a multi-core CPU and a GPU. Starting from simple C functions with added annotations, we automatically generate middleware API code for the multi-core CPU as well as CUDA code for the GPU, so that both can be exploited simultaneously. The runtime system provides efficient schemes for dynamically partitioning the work between the CPU cores and the GPU. Experimental results from two applications, k-means clustering and Principal Component Analysis (PCA), show that effectively harnessing the heterogeneous architecture yields significantly higher performance than using only the GPU or only the multi-core CPU. For k-means, the heterogeneous version with 8 CPU cores and a GPU achieved a speedup of about 32.09x over the 1-thread CPU version, and a performance gain of about 60% over the faster of the CPU-only and GPU-only executions. For PCA, the heterogeneous version attained a speedup of 10.4x over the 1-thread CPU version, and a performance gain of about 63.8% over the faster of the CPU-only and GPU-only versions.




Published In

ICS '10: Proceedings of the 24th ACM International Conference on Supercomputing, June 2010, 365 pages. ISBN 9781450300186. DOI: 10.1145/1810085.

Publisher

Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. GPGPU
    2. dynamic work distribution
    3. generalized reductions
    4. heterogeneous systems
    5. multi-cores


Conference

ICS '10: International Conference on Supercomputing, June 2-4, 2010, Tsukuba, Ibaraki, Japan.

Acceptance Rates

Overall acceptance rate: 629 of 2,180 submissions, 29%.


Cited By

• (2022) Mashing load balancing algorithm to boost hybrid kernels in molecular dynamics simulations. The Journal of Supercomputing, 79(1):1065-1080. DOI: 10.1007/s11227-022-04671-5.
• (2022) Feasibility Study of Molecular Dynamics Kernels Exploitation Using EngineCL. Euro-Par 2021: Parallel Processing Workshops, pages 129-140. DOI: 10.1007/978-3-031-06156-1_11.
• (2020) Compiling generalized histograms for GPU. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-14. DOI: 10.5555/3433701.3433830.
• (2020) Compiling Generalized Histograms for GPU. SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-14. DOI: 10.1109/SC41405.2020.00101.
• (2020) A Heterogeneous Dynamic Scheduling Minimize Energy — HDSME. International Conference on Innovative Computing and Communications, pages 97-115. DOI: 10.1007/978-981-15-5113-0_8.
• (2019) Nested MIMD-SIMD Parallelization for Heterogeneous Microprocessors. ACM Transactions on Architecture and Code Optimization, 16(4):1-27. DOI: 10.1145/3368304.
• (2019) Mozart: Efficient Composition of Library Functions for Heterogeneous Execution. Languages and Compilers for Parallel Computing, pages 182-202. DOI: 10.1007/978-3-030-35225-7_13.
• (2018) WT_DMDA: a new scheduling strategy for conjugate gradient solver on heterogeneous architecture. International Journal of Autonomic Computing, 3(1):52-71. DOI: 10.1504/IJAC.2018.092563.
• (2018) Automatic Matching of Legacy Code to Heterogeneous APIs. ACM SIGPLAN Notices, 53(2):139-153. DOI: 10.1145/3296957.3173182.
• (2018) Automatic Matching of Legacy Code to Heterogeneous APIs. Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, pages 139-153. DOI: 10.1145/3173162.3173182.
