DOI: 10.1145/3085158.3086159
Research article

How Effective is Design Abstraction in Thrust?: An Empirical Evaluation

Published: 26 June 2017

Abstract

High performance computing applications are far more difficult to write than conventional software, so practitioners expect a well-tuned application to last long and deliver optimized performance even after the hardware is upgraded. It may also be necessary to write software with sufficient abstraction over the hardware so that it can run on heterogeneous architectures. A good design abstraction paradigm strikes a balance between abstraction and visibility into the hardware, allowing the programmer to write applications without having to understand hardware nuances while still exploiting the available computing power. In this paper we analyze the design abstraction offered by Thrust, a popular design abstraction framework, from both ease-of-programming and performance perspectives. We show that while Thrust describes an algorithm more concisely than the native CUDA or OpenMP version, it has quite a few design limitations: with respect to CUDA, it provides the programmer no abstraction over shared, texture, or constant memory usage. We compare the performance of Thrust application code on the CUDA, OpenMP, and CPP backends against native versions, implementing exactly the same algorithms, written for those backends, and find that the current Thrust version performs poorly in most cases. While we conclude that the framework is not yet ready for writing applications that extract optimal performance from the hardware, we also highlight the improvements needed to make its performance comparable.


Published In

SEM4HPC '17: Proceedings of the 2017 Workshop on Software Engineering Methods for Parallel and High Performance Applications
June 2017
36 pages
ISBN:9781450350006
DOI:10.1145/3085158

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

  1. design abstraction
  2. shared memory
  3. software complexity
  4. thrust

Qualifiers

  • Research-article

Funding Sources

  • Science and Engineering Research Board, Govt. of India

Conference

HPDC '17

Acceptance Rates

SEM4HPC '17 Paper Acceptance Rate: 3 of 5 submissions, 60%
Overall Acceptance Rate: 8 of 16 submissions, 50%

Article Metrics

0 Total Citations; 105 Total Downloads (reflects downloads up to 16 Feb 2025)