ABSTRACT
Task parallel models supporting dynamic and hierarchical parallelism are believed to offer a promising direction toward higher performance and programmability. Divide-and-conquer is the most frequently used idiom in task parallel models; it decomposes a problem instance into smaller ones until they become "trivial" to solve. However, creating a task for every subproblem incurs a high tasking overhead. To reduce this overhead, a "cut-off" is commonly applied, which eliminates task creations where they are unlikely to be beneficial. A manual cut-off typically enlarges leaf tasks by stopping task creation once a subproblem falls below a threshold, and may further transform the enlarged leaf tasks into specialized versions for solving small instances (e.g., using loops instead of recursive calls); this duplicates coding work and hinders productivity.
In this paper, we describe a compiler that performs an effective cut-off method, called a static cut-off. Roughly speaking, it achieves the effect of a manual cut-off, but automatically. The compiler tries to identify a condition under which the recursion stops within a constant number of steps and, when such a condition is found, eliminates task creations at compile time, which enables further compiler optimizations. Based on this termination condition analysis, two more optimization methods are developed to optimize the resulting leaf tasks beyond replacing task creations with function calls: the first eliminates those function calls without exponential code growth; the second transforms the resulting leaf task into a loop, which further reduces the overhead and even promotes vectorization. The evaluation shows that our proposed cut-off optimization achieves a geometric-mean speedup of 8.0x over the original task-parallel programs.
A Static Cut-off for Task Parallel Programs