ABSTRACT
Task parallel models supporting dynamic and hierarchical parallelism are believed to offer a promising direction toward higher performance and programmability. Divide-and-conquer is the most frequently used idiom in task parallel models; it decomposes a problem instance into smaller ones until they become "trivial" to solve. However, creating a task for every subproblem incurs a high tasking overhead. To reduce this overhead, a "cut-off" is commonly applied, which eliminates task creations where they are unlikely to be beneficial. A manual cut-off typically enlarges leaf tasks by stopping task creation once a subproblem falls below a threshold, and may further transform the enlarged leaf tasks into specialized versions for solving small instances (e.g., using loops instead of recursive calls); this duplicates coding work and hinders productivity.
In this paper, we describe a compiler that performs an effective cut-off method, called a static cut-off. Roughly speaking, it achieves the effect of a manual cut-off, but automatically. The compiler tries to identify a condition under which the recursion stops within a constant number of steps and, when such a condition is found, eliminates task creations at compile time, which enables further compiler optimizations. Based on this termination condition analysis, two more optimization methods are developed to optimize the resulting leaf tasks beyond replacing task creations with function calls: the first eliminates those function calls without exponential code growth; the second transforms the resulting leaf task into a loop, which further reduces the overhead and even promotes vectorization. The evaluation shows that our proposed cut-off optimization achieves a geometric-mean speedup of 8.0x over the original task-parallel programs.
A Static Cut-off for Task Parallel Programs