DOI: 10.1145/3087556.3087586
Research Article · Public Access

Provably Efficient Scheduling of Cache-oblivious Wavefront Algorithms

Published: 24 July 2017

Abstract

Iterative wavefront algorithms for evaluating dynamic programming recurrences exploit optimal parallelism but show poor cache performance. Tiled-iterative wavefront algorithms achieve optimal cache complexity and high parallelism, but they are cache-aware and hence neither portable nor cache-adaptive. Standard cache-oblivious recursive divide-and-conquer algorithms, on the other hand, have optimal serial cache complexity but often low parallelism due to artificial dependencies among subtasks. Recently, we introduced cache-oblivious recursive wavefront (COW) algorithms, which have no artificial dependencies but are too complicated to develop, analyze, implement, and generalize. Although COW algorithms are based on fork-join primitives, they make extensive use of atomic operations to ensure correctness, and as a result the performance guarantees (i.e., parallel running time and parallel cache complexity) that state-of-the-art schedulers (e.g., the randomized work-stealing scheduler) provide for fork-join programs do not apply. Moreover, the extensive use of atomic operations can add significant overhead to implementations.
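
To make the parallelism gap described above concrete, consider a representative LCS-type recurrence in which cell $(i,j)$ depends on cells $(i-1,j)$, $(i,j-1)$, and $(i-1,j-1)$. Under the usual work-span analysis (an illustrative calculation, not a bound taken from this paper), an iterative wavefront that sweeps the $2n-1$ anti-diagonals, evaluating each diagonal fully in parallel, has

\[
T_1(n) = \Theta(n^2), \qquad T_\infty(n) = \Theta(n)
\]

(ignoring parallel-loop scheduling overheads), whereas the standard cache-oblivious divide-and-conquer evaluation, with quadrant order $Q_{11}$; then $Q_{12} \parallel Q_{21}$; then $Q_{22}$ (illustrated in the sketch following the next paragraph), obeys

\[
T_\infty(n) = 3\,T_\infty(n/2) + \Theta(1) = \Theta\!\left(n^{\log_2 3}\right) \approx \Theta\!\left(n^{1.585}\right),
\]

so its parallelism drops from $\Theta(n)$ to $\Theta\!\left(n^{\,2-\log_2 3}\right) \approx \Theta\!\left(n^{0.415}\right)$ even though the work stays $\Theta(n^2)$.
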
In this paper, we show how to systematically transform standard cache-oblivious recursive divide-and-conquer algorithms into recursive wavefront algorithms that achieve optimal parallel cache complexity and high parallelism under state-of-the-art schedulers for fork-join programs. Unlike COW algorithms, these new algorithms use no atomic operations. Instead, they use closed-form formulas to compute the time at which each divide-and-conquer function must be launched in order to achieve high parallelism without losing cache performance. The resulting implementations are arguably much simpler than those of known COW algorithms. We present theoretical analyses as well as experimental performance and scalability results showing the superiority of these new algorithms over existing ones.
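
The closed-form launch-time formulas are specific to the paper, but the baseline they start from, a standard cache-oblivious recursive divide-and-conquer evaluation of such a DP, is standard, and a small sketch helps show where the artificial dependencies come from. The C++ below is an illustrative sketch only (the names Table, fill_base, solve, the base-case size BASE, the test strings, and the use of std::async for fork-join are assumptions of this sketch, not the paper's code): it evaluates an LCS-style recurrence by quadrants, computing Q11, then Q12 and Q21 in parallel, then Q22.

#include <algorithm>
#include <future>
#include <iostream>
#include <string>
#include <vector>

using Table = std::vector<std::vector<int>>;

// Hypothetical base-case size; a real implementation would tune this.
constexpr int BASE = 64;

// Iteratively fill the sub-rectangle [ilo, ihi) x [jlo, jhi) of the LCS table:
//   T[i][j] = T[i-1][j-1] + 1              if x[i-1] == y[j-1]
//           = max(T[i-1][j], T[i][j-1])    otherwise.
void fill_base(Table& T, const std::string& x, const std::string& y,
               int ilo, int ihi, int jlo, int jhi) {
  for (int i = ilo; i < ihi; ++i)
    for (int j = jlo; j < jhi; ++j)
      T[i][j] = (x[i - 1] == y[j - 1]) ? T[i - 1][j - 1] + 1
                                       : std::max(T[i - 1][j], T[i][j - 1]);
}

// Standard cache-oblivious divide-and-conquer: split into quadrants and
// evaluate Q11; then Q12 and Q21 in parallel; then Q22.  Q12 and Q21 write
// disjoint quadrants and only read cells that are already computed, so they
// need no atomics; the join before Q22 is the "artificial dependency".
void solve(Table& T, const std::string& x, const std::string& y,
           int ilo, int ihi, int jlo, int jhi) {
  if (ihi - ilo <= BASE || jhi - jlo <= BASE) {
    fill_base(T, x, y, ilo, ihi, jlo, jhi);
    return;
  }
  int im = (ilo + ihi) / 2, jm = (jlo + jhi) / 2;
  solve(T, x, y, ilo, im, jlo, jm);                   // Q11
  auto q12 = std::async(std::launch::async, solve,    // spawn Q12
                        std::ref(T), std::cref(x), std::cref(y),
                        ilo, im, jm, jhi);
  solve(T, x, y, im, ihi, jlo, jm);                   // Q21
  q12.wait();                                         // join
  solve(T, x, y, im, ihi, jm, jhi);                   // Q22
}

int main() {
  std::string x = "ACCGGTCGAGTG", y = "GTCGTTCGGAATGC";
  const int n = static_cast<int>(x.size()), m = static_cast<int>(y.size());
  Table T(n + 1, std::vector<int>(m + 1, 0));  // row 0 and column 0 stay 0
  solve(T, x, y, 1, n + 1, 1, m + 1);
  std::cout << "LCS length: " << T[n][m] << "\n";
  return 0;
}

The forced join before Q22 is precisely the kind of artificial dependency the abstract refers to: cells near Q22's top-left corner are ready as soon as a thin border of Q12 and Q21 is done, yet the recursion makes them wait for both quadrants to finish entirely. Roughly speaking, the transformation proposed in the paper replaces such structural waits with a precomputed (closed-form) launch time for each recursive call, so execution approaches the wavefront order while remaining a plain fork-join program without atomics; a full cache-oblivious implementation would also tune the base case and, for the best cache bounds, avoid materializing the entire table, which this sketch keeps for simplicity.
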

Published In

SPAA '17: Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures
July 2017
392 pages
ISBN:9781450345934
DOI:10.1145/3087556
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. cache-oblivious
  2. divide-and-conquer
  3. dynamic programming
  4. parallel
  5. parallelism
  6. recursive
  7. wavefront

Qualifiers

  • Research-article

Conference

SPAA '17

Acceptance Rates

SPAA '17 paper acceptance rate: 31 of 127 submissions (24%)
Overall acceptance rate: 447 of 1,461 submissions (31%)


