DOI: 10.1145/2854038.2854042
Research article

Have abstraction and eat performance, too: optimized heterogeneous computing with parallel patterns

Published: 29 February 2016

Abstract

High performance in modern computing platforms requires programs to be parallel, distributed, and run on heterogeneous hardware. However, programming such architectures is extremely difficult because the application must be implemented using multiple programming models combined in ad hoc ways. To optimize distributed applications both for modern hardware and for modern programmers, we need a programming model that is sufficiently expressive to support a variety of parallel applications, sufficiently performant to surpass hand-optimized sequential implementations, and sufficiently portable to support a variety of heterogeneous hardware. Unfortunately, existing systems tend to fall short of these requirements. In this paper we introduce the Distributed Multiloop Language (DMLL), a new intermediate language based on common parallel patterns that captures the semantic knowledge necessary to efficiently target distributed heterogeneous architectures. We show straightforward analyses that determine what data to distribute based on its usage, as well as powerful transformations of nested patterns that restructure computation to enable distribution and optimize for heterogeneous devices. We present experimental results for a range of applications spanning multiple domains and demonstrate highly efficient execution compared to manually optimized counterparts in multiple distributed programming models.
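
For illustration only: the abstract does not show DMLL syntax, but the following minimal Scala sketch (hypothetical names and data, not DMLL code) shows the kind of nested parallel pattern it refers to: an outer map whose body contains an inner reduce. A pattern-aware compiler of the kind described can distribute the outer map across nodes while keeping each inner reduce local to one node or device.

    // Minimal Scala sketch (illustrative only, not DMLL syntax).
    // Nested parallel pattern: an outer map over rows whose body
    // contains an inner map + reduce computing a dot product.
    object NestedPatternSketch {
      def main(args: Array[String]): Unit = {
        val matrix = Array(Array(1.0, 2.0, 3.0), Array(4.0, 5.0, 6.0))
        val vector = Array(0.5, 1.0, 1.5)

        // Outer pattern: map over rows (a candidate for distribution).
        // Inner pattern: element-wise multiply followed by a reduce,
        // which can stay local to a single node or accelerator.
        val result = matrix.map { row =>
          row.zip(vector).map { case (a, b) => a * b }.reduce(_ + _)
        }

        println(result.mkString(", ")) // prints: 7.0, 16.0
      }
    }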




        Information

        Published In

        CGO '16: Proceedings of the 2016 International Symposium on Code Generation and Optimization
        February 2016
        283 pages
        ISBN: 9781450337786
        DOI: 10.1145/2854038
        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        In-Cooperation

        • IEEE-CS: Computer Society

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 29 February 2016


        Author Tags

        1. parallel patterns
        2. distributed memory
        3. pattern transformations

        Qualifiers

        • Research-article

        Conference

        CGO '16

        Acceptance Rates

        CGO '16 Paper Acceptance Rate: 25 of 108 submissions, 23%
        Overall Acceptance Rate: 312 of 1,061 submissions, 29%

        Bibliometrics & Citations

        Article Metrics

        • Downloads (Last 12 months): 20
        • Downloads (Last 6 weeks): 2
        Reflects downloads up to 16 Feb 2025

        Citations

        Cited By
        • (2025) Composing Distributed Computations Through Task and Kernel Fusion. Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, pages 182-197. DOI: 10.1145/3669940.3707216. Online publication date: 3-Feb-2025.
        • (2022) Taurus: a data plane architecture for per-packet ML. Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pages 1099-1114. DOI: 10.1145/3503222.3507726. Online publication date: 28-Feb-2022.
        • (2022) GraphIt to CUDA compiler in 2021 LOC. Proceedings of the 20th IEEE/ACM International Symposium on Code Generation and Optimization, pages 53-65. DOI: 10.1109/CGO53902.2022.9741280. Online publication date: 2-Apr-2022.
        • (2022) Simplified High Level Parallelism Expression on Heterogeneous Systems through Data Partition Pattern Description. The Computer Journal, 66(6):1400-1418. DOI: 10.1093/comjnl/bxac017. Online publication date: 14-Mar-2022.
        • (2022) OptCL: A Middleware to Optimise Performance for High Performance Domain-Specific Languages on Heterogeneous Platforms. Algorithms and Architectures for Parallel Processing, pages 772-791. DOI: 10.1007/978-3-030-95391-1_48. Online publication date: 23-Feb-2022.
        • (2021) On-stack replacement for program generators and source-to-source compilers. Proceedings of the 20th ACM SIGPLAN International Conference on Generative Programming: Concepts and Experiences, pages 156-169. DOI: 10.1145/3486609.3487207. Online publication date: 17-Oct-2021.
        • (2020) A Survey on Parallel Architectures and Programming Models. 2020 43rd International Convention on Information, Communication and Electronic Technology (MIPRO), pages 999-1005. DOI: 10.23919/MIPRO48935.2020.9245341. Online publication date: 28-Sep-2020.
        • (2020) HFetch: Hierarchical Data Prefetching for Scientific Workflows in Multi-Tiered Storage Environments. 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 62-72. DOI: 10.1109/IPDPS47924.2020.00017. Online publication date: May-2020.
        • (2020) Bigflow: A General Optimization Layer for Distributed Computing Frameworks. Journal of Computer Science and Technology, 35(2):453-467. DOI: 10.1007/s11390-020-9702-3. Online publication date: 27-Mar-2020.
        • (2019) Flare & lantern. Proceedings of the VLDB Endowment, 12(12):1910-1913. DOI: 10.14778/3352063.3352097. Online publication date: 1-Aug-2019.
