ABSTRACT
We propose a methodology for the automatic generation of divide-and-conquer parallel implementations of sequential nested loops. We focus on a class of loops that traverse read-only multidimensional collections (lists or arrays) and compute a function over these collections. Our approach is modular, in that the inner loop nest is abstracted away to produce a simpler loop nest for parallelization; the summarized version of the loop nest is then parallelized. The main challenge addressed by this paper is that, to perform the code transformations required at each step, the loop nest may have to be automatically augmented with extra computation to make the abstraction and/or parallelization tasks possible. We present theoretical results that justify the correctness of our modular approach, and algorithmic solutions for automation. Experimental results demonstrate that our approach can parallelize highly non-trivial loop nests efficiently.
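To make the idea concrete, the following is a minimal sketch (not the paper's actual algorithm or tool output) of the kind of transformation the abstract describes: a sequential loop computing the maximum prefix sum, and a divide-and-conquer version of it. Note that the join is only well-defined after each half is augmented with an auxiliary accumulator (here, the running sum), illustrating the "extra computation" the abstract refers to. All function names are hypothetical.

```python
# Sequential loop: maximum prefix sum of a list.
def mps_seq(xs):
    s, m = 0, 0
    for x in xs:
        s += x
        m = max(m, s)
    return m

# Divide-and-conquer version: each half is summarized by the pair
# (mps, sum). The auxiliary "sum" component is the extra computation
# needed to make the join function well-defined.
def mps_dc(xs):
    if len(xs) <= 1:
        x = xs[0] if xs else 0
        return (max(0, x), x)
    mid = len(xs) // 2
    m1, s1 = mps_dc(xs[:mid])   # in a real implementation, the two
    m2, s2 = mps_dc(xs[mid:])   # recursive calls run in parallel
    # Join: prefix sums of the right half are offset by the left half's sum.
    return (max(m1, s1 + m2), s1 + s2)

xs = [3, -1, 4, -5, 2]
assert mps_seq(xs) == mps_dc(xs)[0]  # both compute 6
```

The two recursive calls are independent and can be executed on separate workers; only the constant-time join is sequential, which is what makes the divide-and-conquer form parallelizable.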