
Toward Efficient Architecture-Independent Algorithms for Dynamic Programs

  • Conference paper
High Performance Computing (ISC High Performance 2019)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11501)


Abstract

We argue that the recursive divide-and-conquer paradigm is highly suited for designing algorithms that run efficiently in both shared-memory (multi- and manycore) and distributed-memory settings. The depth-first recursive decomposition of tasks and data is known to allow computations with potentially high temporal locality, and automatic adaptivity when resource availability (e.g., available space in shared caches) changes at runtime. Higher data locality leads to better intra-node I/O and cache performance and lower inter-node communication complexity, which in turn can reduce running times and energy consumption. Indeed, we show that a class of grid-based parallel recursive divide-and-conquer algorithms (for dynamic programs) can be run with provably optimal or near-optimal performance bounds on fat cores (cache complexity), thin cores (data movements), and purely distributed-memory machines (communication complexity) without changing the algorithm’s basic structure.
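
To make the locality claim concrete, consider the classic cache-oblivious matrix transposition of Frigo et al. [29], which already exhibits the depth-first decomposition discussed above. The C++ sketch below is an illustration, not code from the paper (the cutoff kBase and the name rec_transpose are ours): it recursively halves the larger dimension of the current submatrix, so the recursion eventually produces blocks that fit in every level of the memory hierarchy even though no cache size is ever named.

    #include <cstddef>
    #include <vector>

    // Transpose the n x n row-major matrix A into B by recursively halving
    // the larger dimension of the submatrix [i0, i1) x [j0, j1). No cache
    // parameter appears: the blocks eventually fit in every level at once.
    void rec_transpose(const double* A, double* B,
                       std::size_t i0, std::size_t i1,
                       std::size_t j0, std::size_t j1, std::size_t n) {
        const std::size_t kBase = 16;  // assumed cutoff; small enough for any L1
        if (i1 - i0 <= kBase && j1 - j0 <= kBase) {
            for (std::size_t i = i0; i < i1; ++i)
                for (std::size_t j = j0; j < j1; ++j)
                    B[j * n + i] = A[i * n + j];  // all work on the cached block
        } else if (i1 - i0 >= j1 - j0) {
            std::size_t im = (i0 + i1) / 2;       // split the taller dimension
            rec_transpose(A, B, i0, im, j0, j1, n);
            rec_transpose(A, B, im, i1, j0, j1, n);
        } else {
            std::size_t jm = (j0 + j1) / 2;       // split the wider dimension
            rec_transpose(A, B, i0, i1, j0, jm, n);
            rec_transpose(A, B, i0, i1, jm, j1, n);
        }
    }

    int main() {
        const std::size_t n = 1024;
        std::vector<double> A(n * n, 1.0), B(n * n);
        rec_transpose(A.data(), B.data(), 0, n, 0, n, n);
    }

Because no level-specific parameter appears in the code, the same recursion adapts automatically when the space available to it in a shared cache changes at runtime.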

Two-way recursive divide-and-conquer algorithms are known for solving dynamic programming (DP) problems on shared-memory multicore machines. In this paper, we show how to extend them to run efficiently also on manycore GPUs and distributed-memory machines.
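
As a minimal serial sketch of such a two-way recursive DP algorithm, the following C++ code solves all-pairs shortest paths in the R-Kleene style of D'Alberto and Nicolau [23], closely related to the multicore algorithms of Chowdhury and Ramachandran [20, 21]. It assumes n is a power of two; the helper names fw and minplus are ours, and the in-place updates are safe because the (min, +) semiring is idempotent.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Min-plus multiply-accumulate on h x h blocks with row stride N:
    // C[i][j] = min(C[i][j], min over k of A[i][k] + B[k][j]).
    static void minplus(double* C, const double* A, const double* B,
                        std::size_t h, std::size_t N) {
        for (std::size_t i = 0; i < h; ++i)
            for (std::size_t k = 0; k < h; ++k)
                for (std::size_t j = 0; j < h; ++j)
                    C[i * N + j] = std::min(C[i * N + j], A[i * N + k] + B[k * N + j]);
    }

    // Two-way recursive Floyd-Warshall on an n x n block (row stride N).
    // n is assumed to be a power of two.
    static void fw(double* X, std::size_t n, std::size_t N) {
        if (n <= 32) {  // base case: the classic triple loop
            for (std::size_t k = 0; k < n; ++k)
                for (std::size_t i = 0; i < n; ++i)
                    for (std::size_t j = 0; j < n; ++j)
                        X[i * N + j] = std::min(X[i * N + j], X[i * N + k] + X[k * N + j]);
            return;
        }
        std::size_t h = n / 2;
        double *X11 = X, *X12 = X + h, *X21 = X + h * N, *X22 = X + h * N + h;
        fw(X11, h, N);                 // close paths inside the first half
        minplus(X12, X11, X12, h, N);  // these two updates touch disjoint
        minplus(X21, X21, X11, h, N);  // quadrants: independent, parallelizable
        minplus(X22, X21, X12, h, N);
        fw(X22, h, N);                 // close paths inside the second half
        minplus(X21, X22, X21, h, N);  // again mutually independent
        minplus(X12, X12, X22, h, N);
        minplus(X11, X12, X21, h, N);
    }

    int main() {
        const std::size_t n = 256;
        std::vector<double> D(n * n, 1e9);      // 1e9 stands in for "no edge"
        for (std::size_t i = 0; i < n; ++i) D[i * n + i] = 0.0;
        // ... set finite edge weights D[u * n + v] here ...
        fw(D.data(), n, n);                     // D now holds all-pairs distances
    }

The paired minplus calls marked as independent are the source of shared-memory parallelism; the multi-way extensions described in this paper generalize this two-way recursive structure.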

Our GPU algorithms work efficiently even when the data is too large to fit into the host RAM. These are external-memory algorithms based on recursive r-way divide and conquer, where r (≥ 2) varies based on the current depth of the recursion. Our distributed-memory algorithms are also based on multi-way recursive divide and conquer that extends naturally inside each shared-memory multicore/manycore compute node. We show that these algorithms are work-optimal and have low latency and bandwidth bounds.
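
As a rough sketch of how the fan-out r might be chosen (an illustrative size-based policy, not the paper's actual rule; choose_r and kOperands are assumed names): pick the smallest r ≥ 2 for which one subproblem of the r-way split, together with its operand tiles, fits in fast memory. Since subproblems shrink as the recursion descends, this makes r vary with depth.

    #include <cmath>
    #include <cstddef>
    #include <cstdio>

    // Pick the smallest r such that one (n/r) x (n/r) double tile, together
    // with kOperands - 1 operand tiles of the same size, fits in fast_bytes.
    // Subproblems shrink as the recursion descends, so r varies with depth.
    std::size_t choose_r(std::size_t n, std::size_t fast_bytes) {
        const double kOperands = 3.0;  // e.g., one output and two input tiles resident
        double need = kOperands * double(n) * double(n) * sizeof(double);
        if (need <= double(fast_bytes)) return 1;  // already fits: stop splitting
        std::size_t r = static_cast<std::size_t>(std::ceil(std::sqrt(need / fast_bytes)));
        return r < 2 ? 2 : r;
    }

    int main() {
        // Example: DP tables of growing size against a 256 MiB scratch memory.
        for (std::size_t n : {std::size_t(1) << 12, std::size_t(1) << 14, std::size_t(1) << 16})
            std::printf("n = %zu  ->  r = %zu\n", n, choose_r(n, std::size_t(256) << 20));
    }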

We also report empirical results for our GPU and distributed-memory algorithms.


Notes

  1.

    As of November 2018, the supercomputers ranked 1 (Summit), 2 (Sierra), 6 (ABCI), 7 (Piz Daint), and 8 (Titan) in order of Rpeak (TFlop/s) are networks of hybrid CPU+GPU nodes [4].

  2.

    Temporal locality — whenever a block of data is brought into a faster level of cache/memory from a slower level, as much useful work as possible is performed on this data before the block is removed from the faster level (see the sketch following these notes).

  3.

    I.e., faster and closer to the processing core(s).
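
A minimal illustration of the temporal locality defined in note 2, assuming a single fast memory of unspecified size: the tiled matrix multiplication below performs on the order of T³ multiply-adds on each pair of resident tiles before the next tiles are fetched (the tile size T is an assumed tuning constant).

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // C += A * B on n x n row-major matrices, processed in T x T tiles.
    // Each resident pair of tiles is used for about T*T*T multiply-adds
    // before the next tiles are fetched: that reuse is temporal locality.
    void tiled_matmul(const double* A, const double* B, double* C, std::size_t n) {
        const std::size_t T = 64;  // assumed tile size; tune so three tiles fit in cache
        for (std::size_t ii = 0; ii < n; ii += T)
            for (std::size_t kk = 0; kk < n; kk += T)
                for (std::size_t jj = 0; jj < n; jj += T)
                    for (std::size_t i = ii; i < std::min(ii + T, n); ++i)
                        for (std::size_t k = kk; k < std::min(kk + T, n); ++k)
                            for (std::size_t j = jj; j < std::min(jj + T, n); ++j)
                                C[i * n + j] += A[i * n + k] * B[k * n + j];
    }

    int main() {
        const std::size_t n = 512;
        std::vector<double> A(n * n, 1.0), B(n * n, 1.0), C(n * n, 0.0);
        tiled_matmul(A.data(), B.data(), C.data(), n);
    }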

References

  1. Standard Template Library for Extra Large Data Sets (STXXL). http://stxxl.sourceforge.net/

  2. The Stampede Supercomputing Cluster. https://www.tacc.utexas.edu/stampede/

  3. The Stampede2 Supercomputing Cluster. https://www.tacc.utexas.edu/systems/stampede2/

  4. Top 500 Supercomputers of the World. https://www.top500.org/lists/2018/06/

  5. Agarwal, R.C., Balle, S.M., Gustavson, F.G., Joshi, M., Palkar, P.: A three-dimensional approach to parallel matrix multiplication. IBM J. Res. Dev. 39(5), 575–582 (1995)

  6. Aggarwal, A., Chandra, A.K., Snir, M.: Communication complexity of PRAMs. Theor. Comput. Sci. 71(1), 3–28 (1990)

  7. Aho, A.V., Hopcroft, J.E.: The Design and Analysis of Computer Algorithms. Pearson Education India, Noida (1974)

  8. Ballard, G., Carson, E., Demmel, J., Hoemmen, M., Knight, N., Schwartz, O.: Communication lower bounds and optimal algorithms for numerical linear algebra. Acta Numer. 23, 1–155 (2014)

  9. Ballard, G., Demmel, J., Holtz, O., Lipshitz, B., Schwartz, O.: Communication-optimal parallel algorithm for Strassen's matrix multiplication. In: Proceedings of the Twenty-Fourth Annual ACM Symposium on Parallelism in Algorithms and Architectures, pp. 193–204. ACM (2012)

  10. Ballard, G., Demmel, J., Holtz, O., Schwartz, O.: Minimizing communication in numerical linear algebra. SIAM J. Matrix Anal. Appl. 32(3), 866–901 (2011)

  11. Ballard, G., Demmel, J., Holtz, O., Schwartz, O.: Graph expansion and communication costs of fast matrix multiplication. J. ACM 59(6), 32 (2012)

  12. Bellman, R.: Dynamic Programming. Princeton University Press, Princeton (1957)

  13. Bender, M., Ebrahimi, R., Fineman, J., Ghasemiesfeh, G., Johnson, R., McCauley, S.: Cache-adaptive algorithms. In: SODA (2014)

  14. Buluç, A., Gilbert, J.R., Budak, C.: Solving path problems on the GPU. Parallel Comput. 36(5), 241–253 (2010)

  15. Cannon, L.E.: A cellular computer to implement the Kalman filter algorithm. Technical report, Montana State University, Bozeman Engineering Research Labs (1969)

  16. Carson, E., Knight, N., Demmel, J.: Avoiding communication in two-sided Krylov subspace methods. Technical report, EECS, UC Berkeley (2011)

  17. Cherng, C., Ladner, R.: Cache efficient simple dynamic programming. In: AofA, pp. 49–58 (2005)

  18. Chowdhury, R., Ganapathi, P., Tang, Y., Tithi, J.J.: Provably efficient scheduling of cache-oblivious wavefront algorithms. In: Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures, pp. 339–350. ACM, July 2017

  19. Chowdhury, R., et al.: AUTOGEN: automatic discovery of efficient recursive divide-&-conquer algorithms for solving dynamic programming problems. ACM Trans. Parallel Comput. 4(1), 4 (2017). https://doi.org/10.1145/3125632

  20. Chowdhury, R.A., Ramachandran, V.: Cache-efficient dynamic programming algorithms for multicores. In: SPAA, pp. 207–216 (2008)

  21. Chowdhury, R.A., Ramachandran, V.: The cache-oblivious Gaussian elimination paradigm: theoretical framework, parallelization and experimental evaluation. Theory Comput. Syst. 47(4), 878–919 (2010)

  22. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms, 3rd edn. The MIT Press, Cambridge (2009)

  23. D'Alberto, P., Nicolau, A.: R-Kleene: a high-performance divide-and-conquer algorithm for the all-pair shortest path for densely connected networks. Algorithmica 47(2), 203–213 (2007)

  24. Dekel, E., Nassimi, D., Sahni, S.: Parallel matrix and graph algorithms. SIAM J. Comput. 10(4), 657–675 (1981)

  25. Demmel, J., Grigori, L., Hoemmen, M., Langou, J.: Communication-optimal parallel and sequential QR and LU factorizations. SIAM J. Sci. Comput. 34(1), A206–A239 (2012)

  26. Diament, B., Ferencz, A.: Comparison of parallel APSP algorithms (1999)

  27. Djidjev, H., Thulasidasan, S., Chapuis, G., Andonov, R., Lavenier, D.: Efficient multi-GPU computation of all-pairs shortest paths. In: IPDPS, pp. 360–369 (2014)

  28. Driscoll, M., Georganas, E., Koanantakool, P., Solomonik, E., Yelick, K.: A communication-optimal n-body algorithm for direct interactions. In: IPDPS, pp. 1075–1084. IEEE (2013)

  29. Frigo, M., Leiserson, C.E., Prokop, H., Ramachandran, S.: Cache-oblivious algorithms. In: FOCS, pp. 285–297 (1999)

  30. Galil, Z., Giancarlo, R.: Speeding up dynamic programming with applications to molecular biology. TCS 64(1), 107–118 (1989)

  31. Galil, Z., Park, K.: Parallel algorithms for dynamic programming recurrences with more than O(1) dependency. JPDC 21(2), 213–222 (1994)

  32. Gusfield, D.: Algorithms on Strings, Trees and Sequences. Cambridge University Press, New York (1997)

  33. Habbal, M.B., Koutsopoulos, H.N., Lerman, S.R.: A decomposition algorithm for the all-pairs shortest path problem on massively parallel computer architectures. Transp. Sci. 28(4), 292–308 (1994)

  34. Harish, P., Narayanan, P.: Accelerating large graph algorithms on the GPU using CUDA. In: HiPC, pp. 197–208 (2007)

  35. Holzer, S., Wattenhofer, R.: Optimal distributed all pairs shortest paths and applications. In: PODC, pp. 355–364. ACM (2012)

  36. Irony, D., Toledo, S., Tiskin, A.: Communication lower bounds for distributed-memory matrix multiplication. J. Parallel Distrib. Comput. 64(9), 1017–1026 (2004)

  37. Itzhaky, S., et al.: Deriving divide-and-conquer dynamic programming algorithms using solver-aided transformations. In: OOPSLA, pp. 145–164. ACM (2016)

  38. Jenq, J.F., Sahni, S.: All pairs shortest paths on a hypercube multiprocessor (1987)

  39. Johnsson, S.L.: Minimizing the communication time for matrix multiplication on multiprocessors. Parallel Comput. 19(11), 1235–1257 (1993)

  40. Katz, G.J., Kider Jr., J.T.: All-pairs shortest-paths for large graphs on the GPU. In: ACM SIGGRAPH/EUROGRAPHICS, pp. 47–55 (2008)

  41. Kogge, P., Shalf, J.: Exascale computing trends: adjusting to the “new normal” for computer architecture. Comput. Sci. Eng. 15(6), 16–26 (2013)

  42. Krusche, P., Tiskin, A.: Efficient longest common subsequence computation using bulk-synchronous parallelism. In: Gavrilova, M.L., et al. (eds.) ICCSA 2006. LNCS, vol. 3984, pp. 165–174. Springer, Heidelberg (2006). https://doi.org/10.1007/11751649_18

  43. Kumar, V., Grama, A., Gupta, A., Karypis, G.: Introduction to Parallel Computing: Design and Analysis of Algorithms, vol. 400. Benjamin/Cummings, Redwood City (1994)

  44. Kumar, V., Singh, V.: Scalability of parallel algorithms for the all-pairs shortest-path problem. J. Parallel Distrib. Comput. 13(2), 124–138 (1991)

  45. Liu, W., Schmidt, B., Voss, G., Muller-Wittig, W.: Streaming algorithms for biological sequence alignment on GPUs. TPDS 18(9), 1270–1281 (2007)

  46. Liu, W., Schmidt, B., Voss, G., Schroder, A., Muller-Wittig, W.: Bio-sequence database scanning on a GPU. In: IPDPS, 8 pp. (2006)

  47. Lund, B., Smith, J.W.: A multi-stage CUDA kernel for Floyd-Warshall. arXiv preprint arXiv:1001.4108 (2010)

  48. Manavski, S.A., Valle, G.: CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment. BMC Bioinform. 9(2), 1 (2008)

  49. Matsumoto, K., Nakasato, N., Sedukhin, S.G.: Blocked all-pairs shortest paths algorithm for hybrid CPU-GPU system. In: HPCC, pp. 145–152 (2011)

  50. Meyerhenke, H., Sanders, P., Schulz, C.: Parallel graph partitioning for complex networks. IEEE Trans. Parallel Distrib. Syst. 28(9), 2625–2638 (2017)

  51. Nishida, K., Ito, Y., Nakano, K.: Accelerating the dynamic programming for the matrix chain product on the GPU. In: ICNC, pp. 320–326 (2011)

  52. Nishida, K., Nakano, K., Ito, Y.: Accelerating the dynamic programming for the optimal polygon triangulation on the GPU. In: Xiang, Y., Stojmenovic, I., Apduhan, B.O., Wang, G., Nakano, K., Zomaya, A. (eds.) ICA3PP 2012. LNCS, vol. 7439, pp. 1–15. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33078-0_1

  53. Rizk, G., Lavenier, D.: GPU accelerated RNA folding algorithm. In: Allen, G., Nabrzyski, J., Seidel, E., van Albada, G.D., Dongarra, J., Sloot, P.M.A. (eds.) ICCS 2009. LNCS, vol. 5544, pp. 1004–1013. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-01970-8_101

  54. Schulte, M.J., et al.: Achieving exascale capabilities through heterogeneous computing. IEEE Micro 35(4), 26–36 (2015)

  55. Sibeyn, J.F.: External matrix multiplication and all-pairs shortest path. IPL 91(2), 99–106 (2004)

  56. Solomon, S., Thulasiraman, P.: Performance study of mapping irregular computations on GPUs. In: IPDPS Workshops and PhD Forum, pp. 1–8 (2010)

  57. Solomonik, E., Ballard, G., Demmel, J., Hoefler, T.: A communication-avoiding parallel algorithm for the symmetric eigenvalue problem. In: SPAA, pp. 111–121. ACM (2017)

  58. Solomonik, E., Buluc, A., Demmel, J.: Minimizing communication in all-pairs shortest paths. In: IPDPS, pp. 548–559 (2013)

  59. Solomonik, E., Carson, E., Knight, N., Demmel, J.: Trade-offs between synchronization, communication, and computation in parallel linear algebra computations. TOPC 3(1), 3 (2016)

  60. Solomonik, E., Demmel, J.: Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms. In: Jeannot, E., Namyst, R., Roman, J. (eds.) Euro-Par 2011. LNCS, vol. 6853, pp. 90–109. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-23397-5_10

  61. Steffen, P., Giegerich, R., Giraud, M.: GPU parallelization of algebraic dynamic programming. In: Wyrzykowski, R., Dongarra, J., Karczewski, K., Wasniewski, J. (eds.) PPAM 2009. LNCS, vol. 6068, pp. 290–299. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-14403-5_31

  62. Striemer, G.M., Akoglu, A.: Sequence alignment with GPU: performance and design challenges. In: IPDPS, pp. 1–10 (2009)

  63. Tan, G., Sun, N., Gao, G.R.: A parallel dynamic programming algorithm on a multi-core architecture. In: SPAA, pp. 135–144. ACM (2007)

  64. Tang, Y., You, R., Kan, H., Tithi, J., Ganapathi, P., Chowdhury, R.: Improving parallelism of recursive stencil computations without sacrificing cache performance. In: WOSC, pp. 1–7 (2014)

  65. Tiskin, A.: Bulk-synchronous parallel Gaussian elimination. J. Math. Sci. 108(6), 977–991 (2002)

  66. Tiskin, A.: Communication-efficient parallel Gaussian elimination. In: Malyshkin, V.E. (ed.) PaCT 2003. LNCS, vol. 2763, pp. 369–383. Springer, Heidelberg (2003). https://doi.org/10.1007/978-3-540-45145-7_35

  67. Tiskin, A.: Communication-efficient parallel generic pairwise elimination. Future Gener. Comput. Syst. 23(2), 179–188 (2007)

  68. Tiskin, A.: All-pairs shortest paths computation in the BSP model. In: Orejas, F., Spirakis, P.G., van Leeuwen, J. (eds.) ICALP 2001. LNCS, vol. 2076, pp. 178–189. Springer, Heidelberg (2001). https://doi.org/10.1007/3-540-48224-5_15

  69. Tithi, J.J., Ganapathi, P., Talati, A., Aggarwal, S., Chowdhury, R.: High-performance energy-efficient recursive dynamic programming with matrix-multiplication-like flexible kernels. In: IPDPS, pp. 303–312 (2015)

  70. Towns, J., et al.: XSEDE: accelerating scientific discovery. Comput. Sci. Eng. 16(5), 62–74 (2014)

  71. Venkataraman, G., Sahni, S., Mukhopadhyaya, S.: A blocked all-pairs shortest-paths algorithm. JEA 8, 2–2 (2003)

  72. Volkov, V., Demmel, J.: LU, QR and Cholesky factorizations using vector capabilities of GPUs. Technical report UCB/EECS-2008-49, EECS, UC Berkeley, May 2008

  73. Waterman, M.S.: Introduction to Computational Biology: Maps, Sequences and Genomes. Chapman & Hall Ltd., New York (1995)

  74. Wu, C.C., Wei, K.C., Lin, T.H.: Optimizing dynamic programming on graphics processing units via data reuse and data prefetch with inter-block barrier synchronization. In: ICPADS, pp. 45–52 (2012)

  75. Xiao, S., Aji, A.M., Feng, W.-c.: On the robust mapping of dynamic programming onto a graphics processing unit. In: ICPADS, pp. 26–33 (2009)


Acknowledgements

This work is supported in part by NSF grants CCF-1439084, CNS-1553510 and CCF-1725428. Part of this work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by NSF grant ACI-1053575. The authors would like to thank the anonymous reviewers for valuable comments and suggestions that have significantly improved the paper.

Author information

Correspondence to Rezaul Chowdhury.


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Javanmard, M.M., Ganapathi, P., Das, R., Ahmad, Z., Tschudi, S., Chowdhury, R. (2019). Toward Efficient Architecture-Independent Algorithms for Dynamic Programs. In: Weiland, M., Juckeland, G., Trinitis, C., Sadayappan, P. (eds.) High Performance Computing. ISC High Performance 2019. Lecture Notes in Computer Science, vol. 11501. Springer, Cham. https://doi.org/10.1007/978-3-030-20656-7_8


  • DOI: https://doi.org/10.1007/978-3-030-20656-7_8


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-20655-0

  • Online ISBN: 978-3-030-20656-7

  • eBook Packages: Computer Science (R0)
