Abstract
We argue that the recursive divide-and-conquer paradigm is highly suited for designing algorithms that run efficiently in both shared-memory (multicore and manycore) and distributed-memory settings. The depth-first recursive decomposition of tasks and data is known to allow computations with potentially high temporal locality, and automatic adaptivity when resource availability (e.g., available space in shared caches) changes at runtime. Higher data locality leads to better intra-node I/O and cache performance and lower inter-node communication complexity, which in turn can reduce running times and energy consumption. Indeed, we show that a class of grid-based parallel recursive divide-and-conquer algorithms (for dynamic programs) can be run with provably optimal or near-optimal performance bounds on fat cores (cache complexity), thin cores (data movements), and purely distributed-memory machines (communication complexity) without changing the algorithm’s basic structure.
Two-way recursive divide-and-conquer algorithms are known for solving dynamic programming (DP) problems on shared-memory multicore machines. In this paper, we show how to extend them to run efficiently on manycore GPUs and distributed-memory machines as well.
Our GPU algorithms work efficiently even when the data is too large to fit into the host RAM. These are external-memory algorithms based on recursive \(r\)-way divide and conquer, where \(r\) (\(\ge 2\)) varies with the depth of the recursion. Our distributed-memory algorithms are likewise based on multi-way recursive divide and conquer, which extends naturally inside each shared-memory multicore/manycore compute node. We show that these algorithms are work-optimal and have low latency and bandwidth bounds.
We also report empirical results for our GPU and distributed-memory algorithms.
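To make the class of algorithms concrete, below is a minimal serial C++ sketch of the two-way recursive divide-and-conquer formulation of Floyd-Warshall all-pairs shortest paths (Kleene's recursion, in the style of the R-Kleene and cache-oblivious GEP work cited in the references), one representative DP from this class. It is an illustration under simplifying assumptions — \(n\) a power of two, finite edge weights, serial execution, and a plain triple-loop min-plus kernel — whereas the algorithms in the paper are parallel and recurse on the products as well to obtain the stated cache and communication bounds.

```cpp
#include <algorithm>
#include <cstdint>

// A lightweight view of a square submatrix inside a row-major array.
struct View {
  int64_t* p; int n; int stride;
  int64_t& at(int i, int j) const { return p[i * stride + j]; }
  View quad(int qi, int qj) const {            // quadrant (qi, qj), each n/2 x n/2
    int h = n / 2;
    return {p + qi * h * stride + qj * h, h, stride};
  }
};

// C <- min(C, A (min,+) B). Shown as a triple loop for brevity; the recursive
// algorithms split this product too. In-place aliasing (C == A or C == B) is
// safe here because entries only decrease to lengths of real paths.
void mulmin(View C, View A, View B) {
  for (int i = 0; i < C.n; ++i)
    for (int k = 0; k < C.n; ++k)
      for (int j = 0; j < C.n; ++j)
        C.at(i, j) = std::min(C.at(i, j), A.at(i, k) + B.at(k, j));
}

// Two-way recursive Floyd-Warshall (Kleene's recursion). X holds pairwise
// distances with X.at(i,i) == 0; on return it holds all-pairs shortest path
// lengths. Assumes X.n is a power of two.
void kleene(View X) {
  if (X.n <= 32) {                 // base case: plain iterative Floyd-Warshall
    for (int k = 0; k < X.n; ++k)
      for (int i = 0; i < X.n; ++i)
        for (int j = 0; j < X.n; ++j)
          X.at(i, j) = std::min(X.at(i, j), X.at(i, k) + X.at(k, j));
    return;
  }
  View X11 = X.quad(0, 0), X12 = X.quad(0, 1);
  View X21 = X.quad(1, 0), X22 = X.quad(1, 1);
  kleene(X11);            // shortest paths through the first half of the vertices
  mulmin(X12, X11, X12);  // extend those paths into the second half
  mulmin(X21, X21, X11);
  mulmin(X22, X21, X12);
  kleene(X22);            // shortest paths through the second half
  mulmin(X12, X12, X22);  // propagate them back
  mulmin(X21, X22, X21);
  mulmin(X11, X12, X21);
}
```

Each recursive call and each min-plus product works on a quadrant repeatedly before moving on, which is the source of the temporal locality that the cache- and communication-efficiency arguments exploit; replacing the quadrant split by an \(r\)-way split of each dimension yields the multi-way variants described above for GPUs and distributed memory.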
Notes
1. As of November 2018, the supercomputers ranked 1 (Summit), 2 (Sierra), 6 (ABCI), 7 (Piz Daint), and 8 (Titan) in order of Rpeak (TFlop/s) are networks of hybrid CPU+GPU nodes [4].
2. Temporal locality — whenever a block of data is brought into a faster level of cache/memory from a slower level, as much useful work as possible is performed on this data before removing the block from the faster level. (A small code sketch illustrating this follows these notes.)
3. I.e., faster and closer to the processing core(s).
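To illustrate note 2 concretely, the C++ fragment below is an illustrative sketch, not code from the paper; the tile size BT is a hypothetical parameter chosen so that three BT × BT blocks fit in the faster memory level. It reorders a matrix product so that each block of data brought into cache is reused many times before eviction.

```cpp
// Hypothetical tile size: three BT x BT blocks of doubles should fit in the
// faster memory level.
constexpr int BT = 64;

// C += A * B on row-major n x n matrices; n divisible by BT for simplicity.
void matmul_tiled(const double* A, const double* B, double* C, int n) {
  for (int ii = 0; ii < n; ii += BT)
    for (int kk = 0; kk < n; kk += BT)
      for (int jj = 0; jj < n; jj += BT)
        // Each element of the three blocks touched here is reused BT times
        // while the blocks are cache-resident, instead of once per load as
        // in the naive i-k-j loop over the full matrices.
        for (int i = ii; i < ii + BT; ++i)
          for (int k = kk; k < kk + BT; ++k)
            for (int j = jj; j < jj + BT; ++j)
              C[i * n + j] += A[i * n + k] * B[k * n + j];
}
```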
References
Standard Template Library for Extra Large Data Sets (STXXL). http://stxxl.sourceforge.net/
The Stampede Supercomputing Cluster. https://www.tacc.utexas.edu/stampede/
The Stampede2 Supercomputing Cluster. https://www.tacc.utexas.edu/systems/stampede2/
Top 500 Supercomputers of the World. https://www.top500.org/lists/2018/06/
Agarwal, R.C., Balle, S.M., Gustavson, F.G., Joshi, M., Palkar, P.: A three-dimensional approach to parallel matrix multiplication. IBM J. Res. Dev. 39(5), 575–582 (1995)
Aggarwal, A., Chandra, A.K., Snir, M.: Communication complexity of PRAMs. Theor. Comput. Sci. 71(1), 3–28 (1990)
Aho, A.V., Hopcroft, J.E.: The Design and Analysis of Computer Algorithms. Pearson Education India, Noida (1974)
Ballard, G., Carson, E., Demmel, J., Hoemmen, M., Knight, N., Schwartz, O.: Communication lower bounds and optimal algorithms for numerical linear algebra. Acta Numer. 23, 1–155 (2014)
Ballard, G., Demmel, J., Holtz, O., Lipshitz, B., Schwartz, O.: Communication-optimal parallel algorithm for Strassen’s matrix multiplication. In: Proceedings of the Twenty-Fourth Annual ACM Symposium on Parallelism in Algorithms and Architectures, pp. 193–204. ACM (2012)
Ballard, G., Demmel, J., Holtz, O., Schwartz, O.: Minimizing communication in numerical linear algebra. SIAM J. Matrix Anal. Appl. 32(3), 866–901 (2011)
Ballard, G., Demmel, J., Holtz, O., Schwartz, O.: Graph expansion and communication costs of fast matrix multiplication. J. ACM 59(6), 32 (2012)
Bellman, R.: Dynamic Programming. Princeton University Press, Princeton (1957)
Bender, M., Ebrahimi, R., Fineman, J., Ghasemiesfeh, G., Johnson, R., McCauley, S.: Cache-adaptive algorithms. In: SODA (2014)
Buluç, A., Gilbert, J.R., Budak, C.: Solving path problems on the GPU. Parallel Comput. 36(5), 241–253 (2010)
Cannon, L.E.: A cellular computer to implement the Kalman filter algorithm. Technical report, Montana State University, Bozeman, Engineering Research Labs (1969)
Carson, E., Knight, N., Demmel, J.: Avoiding communication in two-sided Krylov subspace methods. Technical report, EECS, UC Berkeley (2011)
Cherng, C., Ladner, R.: Cache efficient simple dynamic programming. In: AofA, pp. 49–58 (2005)
Chowdhury, R., Ganapathi, P., Tang, Y., Tithi, J.J.: Provably efficient scheduling of cache-oblivious wavefront algorithms. In: Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures, pp. 339–350. ACM, July 2017
Chowdhury, R., et al.: AUTOGEN: automatic discovery of efficient recursive divide-&-conquer algorithms for solving dynamic programming problems. ACM Trans. Parallel Comput. 4(1), 4 (2017). https://doi.org/10.1145/3125632
Chowdhury, R.A., Ramachandran, V.: Cache-efficient dynamic programming algorithms for multicores. In: SPAA, pp. 207–216 (2008)
Chowdhury, R.A., Ramachandran, V.: The cache-oblivious Gaussian elimination paradigm: theoretical framework, parallelization and experimental evaluation. Theory Comput. Syst. 47(4), 878–919 (2010)
Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms, 3rd edn. The MIT Press, Cambridge (2009)
D’Alberto, P., Nicolau, A.: R-Kleene: a high-performance divide-and-conquer algorithm for the all-pair shortest path for densely connected networks. Algorithmica 47(2), 203–213 (2007)
Dekel, E., Nassimi, D., Sahni, S.: Parallel matrix and graph algorithms. SIAM J. Comput. 10(4), 657–675 (1981)
Demmel, J., Grigori, L., Hoemmen, M., Langou, J.: Communication-optimal parallel and sequential QR and LU factorizations. SIAM J. Sci. Comput. 34(1), A206–A239 (2012)
Diament, B., Ferencz, A.: Comparison of parallel APSP algorithms (1999)
Djidjev, H., Thulasidasan, S., Chapuis, G., Andonov, R., Lavenier, D.: Efficient multi-GPU computation of all-pairs shortest paths. In: IPDPS, pp. 360–369 (2014)
Driscoll, M., Georganas, E., Koanantakool, P., Solomonik, E., Yelick, K.: A communication-optimal n-body algorithm for direct interactions. In: IPDPS, pp. 1075–1084. IEEE (2013)
Frigo, M., Leiserson, C.E., Prokop, H., Ramachandran, S.: Cache-oblivious algorithms. In: FOCS, pp. 285–297 (1999)
Galil, Z., Giancarlo, R.: Speeding up dynamic programming with applications to molecular biology. TCS 64(1), 107–118 (1989)
Galil, Z., Park, K.: Parallel algorithms for dynamic programming recurrences with more than \(O(1)\) dependency. JPDC 21(2), 213–222 (1994)
Gusfield, D.: Algorithms on Strings, Trees and Sequences. Cambridge University Press, New York (1997)
Habbal, M.B., Koutsopoulos, H.N., Lerman, S.R.: A decomposition algorithm for the all-pairs shortest path problem on massively parallel computer architectures. Transp. Sci. 28(4), 292–308 (1994)
Harish, P., Narayanan, P.: Accelerating large graph algorithms on the GPU using CUDA. In: HiPC, pp. 197–208 (2007)
Holzer, S., Wattenhofer, R.: Optimal distributed all pairs shortest paths and applications. In: PODC, pp. 355–364. ACM (2012)
Irony, D., Toledo, S., Tiskin, A.: Communication lower bounds for distributed-memory matrix multiplication. J. Parallel Distrib. Comput. 64(9), 1017–1026 (2004)
Itzhaky, S., et al.: Deriving divide-and-conquer dynamic programming algorithms using solver-aided transformations. In: OOPSLA, pp. 145–164. ACM (2016)
Jenq, J.F., Sahni, S.: All pairs shortest paths on a hypercube multiprocessor (1987)
Johnsson, S.L.: Minimizing the communication time for matrix multiplication on multiprocessors. Parallel Comput. 19(11), 1235–1257 (1993)
Katz, G.J., Kider Jr., J.T.: All-pairs shortest-paths for large graphs on the GPU. In: ACM SIGGRAPH/EUROGRAPHICS, pp. 47–55 (2008)
Kogge, P., Shalf, J.: Exascale computing trends: adjusting to the “new normal” for computer architecture. Comput. Sci. Eng. 15(6), 16–26 (2013)
Krusche, P., Tiskin, A.: Efficient longest common subsequence computation using bulk-synchronous parallelism. In: Gavrilova, M.L., et al. (eds.) ICCSA 2006. LNCS, vol. 3984, pp. 165–174. Springer, Heidelberg (2006). https://doi.org/10.1007/11751649_18
Kumar, V., Grama, A., Gupta, A., Karypis, G.: Introduction to Parallel Computing: Design and Analysis of Algorithms, vol. 400. Benjamin/Cummings, Redwood City (1994)
Kumar, V., Singh, V.: Scalability of parallel algorithms for the all-pairs shortest-path problem. J. Parallel Distrib. Comput. 13(2), 124–138 (1991)
Liu, W., Schmidt, B., Voss, G., Muller-Wittig, W.: Streaming algorithms for biological sequence alignment on GPUs. TPDS 18(9), 1270–1281 (2007)
Liu, W., Schmidt, B., Voss, G., Schroder, A., Muller-Wittig, W.: Bio-sequence database scanning on a GPU. In: IPDPS, 8 pp. (2006)
Lund, B., Smith, J.W.: A multi-stage CUDA kernel for Floyd-Warshall. arXiv preprint arXiv:1001.4108 (2010)
Manavski, S.A., Valle, G.: CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment. BMC Bioinform. 9(2), 1 (2008)
Matsumoto, K., Nakasato, N., Sedukhin, S.G.: Blocked all-pairs shortest paths algorithm for hybrid CPU-GPU system. In: HPCC, pp. 145–152 (2011)
Meyerhenke, H., Sanders, P., Schulz, C.: Parallel graph partitioning for complex networks. IEEE Trans. Parallel Distrib. Syst. 28(9), 2625–2638 (2017)
Nishida, K., Ito, Y., Nakano, K.: Accelerating the dynamic programming for the matrix chain product on the GPU. In: ICNC, pp. 320–326 (2011)
Nishida, K., Nakano, K., Ito, Y.: Accelerating the dynamic programming for the optimal polygon triangulation on the GPU. In: Xiang, Y., Stojmenovic, I., Apduhan, B.O., Wang, G., Nakano, K., Zomaya, A. (eds.) ICA3PP 2012. LNCS, vol. 7439, pp. 1–15. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33078-0_1
Rizk, G., Lavenier, D.: GPU accelerated RNA folding algorithm. In: Allen, G., Nabrzyski, J., Seidel, E., van Albada, G.D., Dongarra, J., Sloot, P.M.A. (eds.) ICCS 2009. LNCS, vol. 5544, pp. 1004–1013. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-01970-8_101
Schulte, M.J., et al.: Achieving exascale capabilities through heterogeneous computing. IEEE Micro 35(4), 26–36 (2015)
Sibeyn, J.F.: External matrix multiplication and all-pairs shortest path. IPL 91(2), 99–106 (2004)
Solomon, S., Thulasiraman, P.: Performance study of mapping irregular computations on GPUs. In: IPDPS Workshops and PhD Forum, pp. 1–8 (2010)
Solomonik, E., Ballard, G., Demmel, J., Hoefler, T.: A communication-avoiding parallel algorithm for the symmetric eigenvalue problem. In: SPAA, pp. 111–121. ACM (2017)
Solomonik, E., Buluc, A., Demmel, J.: Minimizing communication in all-pairs shortest paths. In: IPDPS, pp. 548–559 (2013)
Solomonik, E., Carson, E., Knight, N., Demmel, J.: Trade-offs between synchronization, communication, and computation in parallel linear algebra computations. TOPC 3(1), 3 (2016)
Solomonik, E., Demmel, J.: Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms. In: Jeannot, E., Namyst, R., Roman, J. (eds.) Euro-Par 2011. LNCS, vol. 6853, pp. 90–109. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-23397-5_10
Steffen, P., Giegerich, R., Giraud, M.: GPU parallelization of algebraic dynamic programming. In: Wyrzykowski, R., Dongarra, J., Karczewski, K., Wasniewski, J. (eds.) PPAM 2009. LNCS, vol. 6068, pp. 290–299. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-14403-5_31
Striemer, G.M., Akoglu, A.: Sequence alignment with GPU: performance and design challenges. In: IPDPS, pp. 1–10 (2009)
Tan, G., Sun, N., Gao, G.R.: A parallel dynamic programming algorithm on a multi-core architecture. In: SPAA, pp. 135–144. ACM (2007)
Tang, Y., You, R., Kan, H., Tithi, J., Ganapathi, P., Chowdhury, R.: Improving parallelism of recursive stencil computations without sacrificing cache performance. In: WOSC, pp. 1–7 (2014)
Tiskin, A.: Bulk-synchronous parallel Gaussian elimination. J. Math. Sci. 108(6), 977–991 (2002)
Tiskin, A.: Communication-efficient parallel Gaussian elimination. In: Malyshkin, V.E. (ed.) PaCT 2003. LNCS, vol. 2763, pp. 369–383. Springer, Heidelberg (2003). https://doi.org/10.1007/978-3-540-45145-7_35
Tiskin, A.: Communication-efficient parallel generic pairwise elimination. Future Gener. Comput. Syst. 23(2), 179–188 (2007)
Tiskin, A.: All-pairs shortest paths computation in the BSP model. In: Orejas, F., Spirakis, P.G., van Leeuwen, J. (eds.) ICALP 2001. LNCS, vol. 2076, pp. 178–189. Springer, Heidelberg (2001). https://doi.org/10.1007/3-540-48224-5_15
Tithi, J.J., Ganapathi, P., Talati, A., Aggarwal, S., Chowdhury, R.: High-performance energy-efficient recursive dynamic programming with matrix-multiplication-like flexible kernels. In: IPDPS, pp. 303–312 (2015)
Towns, J., et al.: XSEDE: accelerating scientific discovery. Comput. Sci. Eng. 16(5), 62–74 (2014)
Venkataraman, G., Sahni, S., Mukhopadhyaya, S.: A blocked all-pairs shortest-paths algorithm. JEA 8, 2–2 (2003)
Volkov, V., Demmel, J.: LU, QR and Cholesky factorizations using vector capabilities of GPUs. EECS, UC Berkeley, Technical report UCB/EECS-2008-49, May 2008
Waterman, M.S.: Introduction to Computational Biology: Maps, Sequences and Genomes. Chapman & Hall Ltd., New York (1995)
Wu, C.C., Wei, K.C., Lin, T.H.: Optimizing dynamic programming on graphics processing units via data reuse and data prefetch with inter-block barrier synchronization. In: ICPADS, pp. 45–52 (2012)
Xiao, S., Aji, A.M., Feng, W.-c.: On the robust mapping of dynamic programming onto a graphics processing unit. In: ICPADS, pp. 26–33 (2009)
Acknowledgements
This work is supported in part by NSF grants CCF-1439084, CNS-1553510 and CCF-1725428. Part of this work used the Extreme Science and Engineering Discovery Environment (XSEDE) which is supported by NSF grant ACI-1053575. The authors would like to thank anonymous reviewers for valuable comments and suggestions that have significantly improved the paper.
Copyright information
© 2019 Springer Nature Switzerland AG
Cite this paper
Javanmard, M.M., Ganapathi, P., Das, R., Ahmad, Z., Tschudi, S., Chowdhury, R. (2019). Toward Efficient Architecture-Independent Algorithms for Dynamic Programs. In: Weiland, M., Juckeland, G., Trinitis, C., Sadayappan, P. (eds) High Performance Computing. ISC High Performance 2019. Lecture Notes in Computer Science, vol. 11501. Springer, Cham. https://doi.org/10.1007/978-3-030-20656-7_8
DOI: https://doi.org/10.1007/978-3-030-20656-7_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-20655-0
Online ISBN: 978-3-030-20656-7