DOI: 10.1145/2967938.2967968

A Static Cut-off for Task Parallel Programs

Published: 11 September 2016

ABSTRACT

Task parallel models that support dynamic and hierarchical parallelism are believed to offer a promising path to higher performance and programmability. Divide-and-conquer is the most frequently used idiom in these models: it decomposes a problem instance into smaller ones until they become "trivial" to solve. However, creating a task for each subproblem incurs a high tasking overhead. To reduce this overhead, a "cut-off" is commonly used, which eliminates task creations where they are unlikely to be beneficial. A manual cut-off typically enlarges leaf tasks by stopping task creation when a subproblem becomes smaller than a threshold, and possibly transforms the enlarged leaf tasks into specialized versions for solving small instances (e.g., using loops instead of recursive calls). Doing this by hand duplicates coding work and hinders productivity.
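As a rough illustration of the manual cut-off described above, the following sketch sums an array by divide-and-conquer, once with a task per subproblem and once with a threshold-based cut-off whose leaf is a plain loop. Nothing here comes from the paper: the helper `run_as_task`, the function names, and the threshold of 64 are assumptions made for the example, and Python threads merely stand in for the lightweight tasks of a real task-parallel runtime.

```python
import threading


def run_as_task(fn, *args):
    """Stand-in for task creation (an assumption): run fn(*args) on a new thread.

    Returns a function that waits for the task and returns its result.
    """
    result = []
    worker = threading.Thread(target=lambda: result.append(fn(*args)))
    worker.start()

    def wait():
        worker.join()
        return result[0]

    return wait


def sum_per_task(a, lo, hi):
    """Create a task at every division step. Returns (sum, tasks_created)."""
    if hi - lo <= 1:
        return (a[lo] if hi > lo else 0), 0
    mid = (lo + hi) // 2
    left = run_as_task(sum_per_task, a, lo, mid)  # task creation
    right_sum, right_tasks = sum_per_task(a, mid, hi)
    left_sum, left_tasks = left()
    return left_sum + right_sum, left_tasks + right_tasks + 1


def sum_manual_cutoff(a, lo, hi, threshold=64):
    """Stop creating tasks below `threshold` (must be >= 1) and use a loop leaf."""
    if hi - lo <= threshold:
        total = 0
        for i in range(lo, hi):  # specialized leaf: loop instead of recursion
            total += a[i]
        return total, 0
    mid = (lo + hi) // 2
    left = run_as_task(sum_manual_cutoff, a, lo, mid, threshold)
    right_sum, right_tasks = sum_manual_cutoff(a, mid, hi, threshold)
    left_sum, left_tasks = left()
    return left_sum + right_sum, left_tasks + right_tasks + 1


data = list(range(256))
print(sum_per_task(data, 0, len(data)))       # (32640, 255)
print(sum_manual_cutoff(data, 0, len(data)))  # (32640, 3)
```

Both versions compute the same sum, but the cut-off version creates 3 tasks instead of 255 for this input. The cost is the duplication the abstract points to: the programmer has to write and maintain the loop leaf in addition to the recursive definition.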

In this paper, we describe a compiler that performs an effective cut-off method, called a static cut-off. Roughly speaking, it achieves the effect of a manual cut-off, but automatically. The compiler tries to identify a condition under which the recursion stops within a constant number of steps and, when that condition holds, eliminates task creations at compile time, which enables further compiler optimizations. Based on this termination condition analysis, we develop two further methods that optimize the resulting leaf tasks beyond replacing task creations with function calls: the first eliminates those function calls without exponential code growth; the second transforms the resulting leaf task into a loop, which further reduces overhead and even promotes vectorization. Our evaluation shows that the proposed cut-off optimization achieves significant speedups, with a geometric mean of 8.0x over the original programs.
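The abstract does not spell out the analysis itself, so the sketch below only illustrates, by hand, the shape of code such a transformation might produce for a divide-and-conquer array sum. All names (`stops_within`, `leaf_calls`, `leaf_loop`, `HEIGHT`, `run_as_task`) and the specific condition are assumptions for illustration: a range of at most `2**HEIGHT` elements is fully divided after at most `HEIGHT` halving steps, so below that size task creation is dropped. The leaf is shown in two of the forms the abstract mentions, plain function calls and a loop; the intermediate call-elimination step is not shown. Python threads stand in for tasks.

```python
import threading

HEIGHT = 6  # hypothetical constant bound on the remaining recursion steps


def run_as_task(fn, *args):
    """Stand-in for task creation (an assumption): run fn(*args) on a new thread."""
    result = []
    worker = threading.Thread(target=lambda: result.append(fn(*args)))
    worker.start()

    def wait():
        worker.join()
        return result[0]

    return wait


def stops_within(lo, hi, height):
    """Assumed termination condition: at most 2**height elements remain."""
    return hi - lo <= 2 ** height


def leaf_calls(a, lo, hi):
    """Leaf with task creations replaced by ordinary function calls."""
    if hi - lo <= 1:
        return a[lo] if hi > lo else 0
    mid = (lo + hi) // 2
    return leaf_calls(a, lo, mid) + leaf_calls(a, mid, hi)


def leaf_loop(a, lo, hi):
    """The same leaf rewritten as a loop."""
    total = 0
    for i in range(lo, hi):
        total += a[i]
    return total


def sum_static_cutoff(a, lo, hi, leaf=leaf_loop):
    """Create tasks only above the condition. Returns (sum, tasks_created)."""
    if stops_within(lo, hi, HEIGHT):
        return leaf(a, lo, hi), 0
    mid = (lo + hi) // 2
    left = run_as_task(sum_static_cutoff, a, lo, mid, leaf)
    right_sum, right_tasks = sum_static_cutoff(a, mid, hi, leaf)
    left_sum, left_tasks = left()
    return left_sum + right_sum, left_tasks + right_tasks + 1


def halving_steps(size):
    """Halving steps until a range of `size` elements becomes trivial."""
    steps = 0
    while size > 1:
        size = (size + 1) // 2  # follow the larger half
        steps += 1
    return steps
```

Calling `sum_static_cutoff(list(range(256)), 0, 256)` returns the sum together with the number of tasks created, and `halving_steps` can be used to check that every range accepted by `stops_within` is indeed fully divided within `HEIGHT` steps. In the paper's setting the compiler derives the condition and generates the leaf automatically, whereas here both are written by hand.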

Published in
PACT '16: Proceedings of the 2016 International Conference on Parallel Architectures and Compilation
September 2016, 474 pages
ISBN: 9781450341219
DOI: 10.1145/2967938

Copyright © 2016 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Acceptance rates: PACT '16 accepted 31 of 119 submissions (26%); the overall acceptance rate is 121 of 471 submissions (26%).
