ABSTRACT
We propose a methodology for the automatic generation of divide-and-conquer parallel implementations of sequential nested loops. We focus on a class of loops that traverse read-only multidimensional collections (lists or arrays) and compute a function over these collections. Our approach is modular, in that the inner loop nest is abstracted away to produce a simpler loop nest for parallelization; the summarized version of the loop nest is then parallelized. The main challenge addressed by this paper is that, to perform the code transformations required at each step, the loop nest may have to be automatically augmented with extra computation to make the abstraction and/or parallelization tasks possible. We present theoretical results that justify the correctness of our modular approach, and algorithmic solutions for automation. Experimental results demonstrate that our approach can parallelize highly non-trivial loop nests efficiently.
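To make the idea concrete, the following is a minimal sketch (not the paper's actual algorithm or tool output) of the kind of transformation the abstract describes: a sequential loop computing the maximum prefix sum, and a divide-and-conquer version of it. Note that the join is only well-defined after each half is augmented with an auxiliary accumulator (here, the running sum), illustrating the "extra computation" the abstract refers to. All function names are hypothetical.

```python
# Sequential loop: maximum prefix sum of a list.
def mps_seq(xs):
    s, m = 0, 0
    for x in xs:
        s += x
        m = max(m, s)
    return m

# Divide-and-conquer version: each half is summarized by the pair
# (mps, sum). The auxiliary "sum" component is the extra computation
# needed to make the join function well-defined.
def mps_dc(xs):
    if len(xs) <= 1:
        x = xs[0] if xs else 0
        return (max(0, x), x)
    mid = len(xs) // 2
    m1, s1 = mps_dc(xs[:mid])   # in a real implementation, the two
    m2, s2 = mps_dc(xs[mid:])   # recursive calls run in parallel
    # Join: prefix sums of the right half are offset by the left half's sum.
    return (max(m1, s1 + m2), s1 + s2)

xs = [3, -1, 4, -5, 2]
assert mps_seq(xs) == mps_dc(xs)[0]  # both compute 6
```

The two recursive calls are independent and can be executed on separate workers; only the constant-time join is sequential, which is what makes the divide-and-conquer form parallelizable.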