DOI: 10.1145/2854038.2854042
Research article

Have abstraction and eat performance, too: optimized heterogeneous computing with parallel patterns

Published: 29 February 2016

Abstract

High performance in modern computing platforms requires programs to be parallel, distributed, and run on heterogeneous hardware. However, programming such architectures is extremely difficult because the application must be implemented using multiple programming models combined in ad hoc ways. To optimize distributed applications both for modern hardware and for modern programmers, we need a programming model that is sufficiently expressive to support a variety of parallel applications, sufficiently performant to surpass hand-optimized sequential implementations, and sufficiently portable to support a variety of heterogeneous hardware. Unfortunately, existing systems tend to fall short of these requirements. In this paper we introduce the Distributed Multiloop Language (DMLL), a new intermediate language based on common parallel patterns that captures the semantic knowledge necessary to efficiently target distributed heterogeneous architectures. We show straightforward analyses that determine what data to distribute based on its usage, as well as powerful transformations of nested patterns that restructure computation to enable distribution and optimize for heterogeneous devices. We present experimental results for a range of applications spanning multiple domains and demonstrate highly efficient execution compared to manually optimized counterparts in multiple distributed programming models.
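
For illustration only: the abstract does not show DMLL syntax, but the following minimal Scala sketch (hypothetical names and data, not DMLL code) shows the kind of nested parallel pattern it refers to: an outer map whose body contains an inner reduce. A pattern-aware compiler of the kind described can distribute the outer map across nodes while keeping each inner reduce local to one node or device.

    // Minimal Scala sketch (illustrative only, not DMLL syntax).
    // Nested parallel pattern: an outer map over rows whose body
    // contains an inner map + reduce computing a dot product.
    object NestedPatternSketch {
      def main(args: Array[String]): Unit = {
        val matrix = Array(Array(1.0, 2.0, 3.0), Array(4.0, 5.0, 6.0))
        val vector = Array(0.5, 1.0, 1.5)

        // Outer pattern: map over rows (a candidate for distribution).
        // Inner pattern: element-wise multiply followed by a reduce,
        // which can stay local to a single node or accelerator.
        val result = matrix.map { row =>
          row.zip(vector).map { case (a, b) => a * b }.reduce(_ + _)
        }

        println(result.mkString(", ")) // prints: 7.0, 16.0
      }
    }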




        Information

        Published In

        CGO '16: Proceedings of the 2016 International Symposium on Code Generation and Optimization
        February 2016
        283 pages
        ISBN: 9781450337786
        DOI: 10.1145/2854038
        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        In-Cooperation

        • IEEE-CS: Computer Society

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 29 February 2016


        Author Tags

        1. parallel patterns
        2. distributed memory
        3. pattern transformations

        Qualifiers

        • Research-article

        Conference

        CGO '16

        Acceptance Rates

        CGO '16 Paper Acceptance Rate: 25 of 108 submissions, 23%
        Overall Acceptance Rate: 312 of 1,061 submissions, 29%

        Bibliometrics & Citations

        Article Metrics

        • Downloads (Last 12 months): 20
        • Downloads (Last 6 weeks): 2
        Reflects downloads up to 16 Feb 2025

        Citations

        Cited By
        • (2025) Composing Distributed Computations Through Task and Kernel Fusion. Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, pages 182-197. DOI: 10.1145/3669940.3707216. Online publication date: 3-Feb-2025.
        • (2022) Taurus: a data plane architecture for per-packet ML. Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pages 1099-1114. DOI: 10.1145/3503222.3507726. Online publication date: 28-Feb-2022.
        • (2022) GraphIt to CUDA compiler in 2021 LOC. Proceedings of the 20th IEEE/ACM International Symposium on Code Generation and Optimization, pages 53-65. DOI: 10.1109/CGO53902.2022.9741280. Online publication date: 2-Apr-2022.
        • (2022) Simplified High Level Parallelism Expression on Heterogeneous Systems through Data Partition Pattern Description. The Computer Journal, 66(6):1400-1418. DOI: 10.1093/comjnl/bxac017. Online publication date: 14-Mar-2022.
        • (2022) OptCL: A Middleware to Optimise Performance for High Performance Domain-Specific Languages on Heterogeneous Platforms. Algorithms and Architectures for Parallel Processing, pages 772-791. DOI: 10.1007/978-3-030-95391-1_48. Online publication date: 23-Feb-2022.
        • (2021) On-stack replacement for program generators and source-to-source compilers. Proceedings of the 20th ACM SIGPLAN International Conference on Generative Programming: Concepts and Experiences, pages 156-169. DOI: 10.1145/3486609.3487207. Online publication date: 17-Oct-2021.
        • (2020) A Survey on Parallel Architectures and Programming Models. 2020 43rd International Convention on Information, Communication and Electronic Technology (MIPRO), pages 999-1005. DOI: 10.23919/MIPRO48935.2020.9245341. Online publication date: 28-Sep-2020.
        • (2020) HFetch: Hierarchical Data Prefetching for Scientific Workflows in Multi-Tiered Storage Environments. 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 62-72. DOI: 10.1109/IPDPS47924.2020.00017. Online publication date: May-2020.
        • (2020) Bigflow: A General Optimization Layer for Distributed Computing Frameworks. Journal of Computer Science and Technology, 35(2):453-467. DOI: 10.1007/s11390-020-9702-3. Online publication date: 27-Mar-2020.
        • (2019) Flare & lantern. Proceedings of the VLDB Endowment, 12(12):1910-1913. DOI: 10.14778/3352063.3352097. Online publication date: 1-Aug-2019.
