DOI: 10.1145/3299771.3299773

ThrustHetero: A Framework to Simplify Heterogeneous Computing Platform Programming using Design Abstraction

Published: 14 February 2019

Abstract

Heterogeneous compute architectures such as multi-core CPUs, CUDA GPUs, and Intel Xeon Phis have become prevalent over the years. While heterogeneity exposes architecture-specific features to the programmer, it also makes application development difficult: one must plan for optimal use of each architecture's features, suitable partitioning of the workload, and communication and data transfer among the participating devices. A design abstraction that hides these device-level variabilities while still exploiting each device's computing capability can improve developer productivity. In this work, we present "ThrustHetero", a lightweight framework based on NVIDIA's Thrust that provides an abstraction over devices such as GPUs, Xeon Phis, and multi-core CPUs, yet allows developers to easily leverage their full compute capability. We also demonstrate a novel two-stage method for workload distribution: micro-benchmarking during framework installation to find good distribution proportions, and reuse of this information during application execution. We consider four classes of applications, categorized by the amount of branching they contain, since branching largely determines how an application performs on each architecture. We show that the framework produces good workload-distribution proportions for each class, and that it is scalable and portable. Further, we compare the performance and ease of development of the framework against native versions of various benchmarks and obtain favorable results.
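To make the two-stage distribution scheme concrete, the following C++ sketch illustrates the idea. The paper's abstract does not publish ThrustHetero's API, so every name here (Device, measure_throughput, split_workload) is hypothetical; the sketch only captures the mechanism described above: benchmark each device once at installation time, record relative throughputs, and split each later workload in proportion to them.

#include <chrono>
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

struct Device {
    std::string name;
    double throughput = 0.0;  // elements per second from the micro-benchmark
};

// Stage 1 (installation time): time a fixed reference kernel and return
// the device's throughput. A plain CPU loop stands in for a real
// per-device launch here (in real code, keep `data` observable so the
// loop is not optimized away).
double measure_throughput(std::size_t n) {
    std::vector<float> data(n, 1.0f);
    auto start = std::chrono::steady_clock::now();
    for (auto& x : data) x = x * 2.0f + 1.0f;  // reference workload
    auto stop = std::chrono::steady_clock::now();
    double secs = std::chrono::duration<double>(stop - start).count();
    return static_cast<double>(n) / secs;
}

// Stage 2 (application run): split n elements among devices in
// proportion to their measured throughput.
std::vector<std::size_t> split_workload(std::size_t n,
                                        const std::vector<Device>& devs) {
    double total = 0.0;
    for (const auto& d : devs) total += d.throughput;
    std::vector<std::size_t> chunks;
    std::size_t assigned = 0;
    for (std::size_t i = 0; i + 1 < devs.size(); ++i) {
        auto c = static_cast<std::size_t>(n * devs[i].throughput / total);
        chunks.push_back(c);
        assigned += c;
    }
    chunks.push_back(n - assigned);  // remainder goes to the last device
    return chunks;
}

int main() {
    std::vector<Device> devices = {{"cpu"}, {"gpu"}};
    for (auto& d : devices) d.throughput = measure_throughput(1 << 22);
    // A real installation would benchmark the GPU with its own kernel;
    // here we simply scale the CPU number to mimic a faster device.
    devices[1].throughput *= 4.0;

    auto chunks = split_workload(100000000, devices);
    for (std::size_t i = 0; i < devices.size(); ++i)
        std::cout << devices[i].name << " gets " << chunks[i] << " elements\n";
    return 0;
}

In the actual framework, the micro-benchmark results would presumably be persisted at installation time and the per-device chunks dispatched through Thrust-style algorithm calls rather than printed.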


Published In

ISEC '19: Proceedings of the 12th Innovations in Software Engineering Conference (formerly known as India Software Engineering Conference)
February 2019
238 pages
ISBN: 9781450362153
DOI: 10.1145/3299771
Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. CUDA
  2. Design Simplicity
  3. Heterogeneous Computing
  4. High Performance Computing
  5. Programming Abstraction
  6. Thrust

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ISEC'19

Acceptance Rates

Overall Acceptance Rate 76 of 315 submissions, 24%
