Batch Matrix Exponentiation

Numerical Computations with GPUs

Abstract

Matrix–matrix multiplication can be considered a linchpin of applied numerical dense linear algebra, as the performance of many common dense linear algebra packages is closely tied to the performance of matrix–matrix multiplication. Batch matrix–matrix multiplication, the matrix–matrix multiplication of a large number of relatively small matrices, is a developing area within dense linear algebra and is relevant to application areas such as phylogenetics, finite element modeling, image processing, fluid dynamics, and hydrodynamics. Using batch matrix–matrix multiplication as the foundation, we have developed an optimized batch matrix exponentiation algorithm in CUDA that outperforms cublasXgemmBatched for small square matrices. After introducing the original motivation for our problem, matrix exponentiation from the phylogenetics domain, we discuss our algorithm in the context of both cublasXgemmBatched and two alternative GPU methods for the numerical computation of matrix exponentiation: Lagrange interpolation and Newton interpolation. All comparisons are done on both the Fermi and the Kepler architectures.
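To make the term concrete, the following is a minimal CPU reference sketch of batch matrix–matrix multiplication in plain Python. It is not the chapter's CUDA algorithm and not the cuBLAS interface; the function names `matmul` and `batched_matmul` and the list-of-rows matrix layout are assumptions chosen for illustration. It only shows the defining property: each small product in the batch is independent of the others, which is what a GPU implementation can exploit.

```python
# Reference (CPU) sketch of batch matrix-matrix multiplication.
# Matrices are plain lists of rows; all names here are illustrative.

def matmul(a, b):
    """Multiply an (r x k) matrix by a (k x c) matrix."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [sum(a[i][p] * b[p][j] for p in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


def batched_matmul(batch_a, batch_b):
    """Compute C[i] = A[i] * B[i] independently for every pair in the batch."""
    if len(batch_a) != len(batch_b):
        raise ValueError("both batches must hold the same number of matrices")
    return [matmul(a, b) for a, b in zip(batch_a, batch_b)]


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    swap = [[0.0, 1.0], [1.0, 0.0]]
    a = [[1.0, 2.0], [3.0, 4.0]]
    products = batched_matmul([a, a], [identity, swap])
    print(products[0])  # a times the identity
    print(products[1])  # a with its two columns exchanged
```

Because no product reads another product's output, the outer loop in `batched_matmul` could be distributed across threads or thread blocks without synchronization between matrices.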

Notes

  1. In this work, we refer to general matrix–matrix multiplication as GEMM, in adherence with the Basic Linear Algebra Subprograms (BLAS) standard [5].

  2. Here, M is the dimension of the probability matrix and the number of states in the model. For example, M = 4 for the nucleotide model.

  3. We use the following flop count throughout this work, regardless of the algorithm, implementation, or architecture:

    $$\text{flops} = n \left(3m^{3} + 2m\right) \tag{3.9}$$

    where n is the number of branch lengths, and m is the dimension of the matrix E from Eq. (3.8). This count comes from Ln. 24 and 32 of Cd. 3.
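The flop count in Eq. (3.9) can be evaluated directly. The short Python sketch below does so and converts a measured run time into a rate. The function names and the timing value are hypothetical, and using m = 4 (the nucleotide value of M from Note 2) for the dimension of E is an assumption made only for the example.

```python
def batch_expm_flops(n, m):
    """Flop count of Eq. (3.9): n * (3*m**3 + 2*m).

    n: number of branch lengths
    m: dimension of the matrix E
    """
    return n * (3 * m**3 + 2 * m)


def gflops_rate(n, m, seconds):
    """Convert the Eq. (3.9) count and a measured time into Gflop/s."""
    return batch_expm_flops(n, m) / seconds / 1e9


if __name__ == "__main__":
    # Assumed example values: 1000 branch lengths, 4 x 4 matrices.
    print(batch_expm_flops(1000, 4))  # 1000 * (3*64 + 2*4) = 200000
    # Hypothetical timing of 1 ms, used only to show the conversion.
    print(gflops_rate(1000, 4, 1e-3))
```

Since the same count is applied regardless of algorithm or architecture, rates computed this way are comparable across implementations even when an implementation performs a different number of actual operations.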

References

  1. AMD Core Math Library (ACML): www.amd.com/acml. Cited 16 Dec 2013

  2. Amestoy, P.R., Duff, I.S., L’Excellent, J.Y.: Multifrontal parallel distributed symmetric and unsymmetric solvers. Comput. Methods Appl. Mech. Eng. (2000). doi: 10.1016/S0045-7825(99)00242-X

  3. Anderson, E., Bai, Z., Bischof, C., Blackford, L.S., Demmel, J.W., Dongarra, J.J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., Sorensen, D.: LAPACK Users’ Guide. SIAM (1992). http://www.netlib.org/lapack/lug/. Cited 16 Dec 2013

  4. Ayres, D.L., Darling, A., Zwickl, D.J., Beerli, P., Holder, M.T., Lewis, P.O., Huelsenbeck, J.P., Ronquist, F., Swofford, D.L., Cummings, M.P., Rambaut, A., Suchard, M.A.: BEAGLE: an application programming interface and high-performance computing library for statistical phylogenetics. Syst. Biol. 61(1), 170–173 (2012)

  5. Basic Linear Algebra Technical Forum: http://www.netlib.org/blas/blast-forum/blas-report.pdf. Cited 16 Dec 2013

  6. Blackford, L.S., Choi, J., Cleary, A., D’Azevedo, E., Demmel, J., Dhillon, I., Dongarra, J.J., Hammarling, S., Henry, G., Petitet, A., Stanley, K., Walker, D., Whaley, R.C.: ScaLAPACK Users’ Guide. SIAM (1997). http://www.netlib.org/scalapack/slug/. Cited 16 Dec 2013

  7. CUBLAS: https://developer.nvidia.com/cuBLAS. Cited 16 Dec 2013

  8. CUBLAS Documentation: http://docs.nvidia.com/cuda/cublas/. Cited 16 Dec 2013

  9. CUDA C Programming Guide: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html. Cited 16 Dec 2013

  10. CUDA Toolkit Documentation: http://docs.nvidia.com/cuda/cuda-samples/. Cited 16 Dec 2013

  11. CULA Tools: http://www.culatools.com/blog/2011/12/09/batched-operations/. Cited 16 Dec 2013

  12. Demmel, J., Volkov, V.: Benchmarking GPUs to tune dense linear algebra. In: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, vol. 31. IEEE Press, Piscataway (2008)

  13. Demmel, J.W., Eisenstat, S.C., Gilbert, J.R., Li, X.S., Liu, J.W.H.: A supernodal approach to sparse partial pivoting. SIAM J. Matrix Anal. Appl. (1999). doi: 10.1137/S0895479895291765

  14. Donfack, S., Dongarra, J., Faverge, M., Gates, M., Kurzak, J., Luszczek, P., Yamazaki, I.: LAPACK working note 280: On Algorithmic Variants of Parallel Gaussian Elimination: Comparison of Implementations in Terms of Performance and Numerical Properties. Innovative Computing Laboratory, University of Tennessee, Knoxville (2013)

  15. Dong, T., Dobrev, V., Kolev, T., Rieben, R., Tomov, S., Dongarra, J.: Hydrodynamic Computation with Hybrid Programming on CPU-GPU Clusters. Innovative Computing Laboratory, University of Tennessee (2013)

  16. Dongarra, J.J., Luszczek, P., Petitet, A.: The LINPACK benchmark: past, present and future. Concurr. Comput. Pract. Exp. (2003). doi: 10.1002/cpe.728

  17. Drummond, A., Rambaut, A.: BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evol. Biol. 7, 214 (2007)

  18. Drummond, A., Suchard, M., Xie, D., Rambaut, A.: Bayesian phylogenetics with BEAUti and the BEAST 1.7. Mol. Biol. Evol. 29(8), 1969–1973 (2012)

  19. Durbin, R., Eddy, S., Mitchison, G.: Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, 1st edn. Cambridge University Press, Cambridge (1997)

  20. Felsenstein, J.: Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17, 368–376 (1981)

  21. Felsenstein, J.: Inferring Phylogenies. Sinauer Associates, Sunderland (2003)

  22. Feng, X., Buell, D., Rose, J., Waddell, P.: Parallel algorithms for Bayesian phylogenetic inference. J. Parallel Distrib. Comput. 63, 707–718 (2003)

  23. Feng, X., Cameron, K., Sosa, C., Smith, B.: Building the tree of life on terascale systems. In: Parallel Distributed Processing Symposium (IPDPS 2007), Washington (2007)

  24. GoToBLAS: Texas Advanced Computing Center. http://www.tacc.utexas.edu/. Cited 16 Dec 2013

  25. Hasegawa, M., Kishino, H., Yano, T.: Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22(2), 160–174 (1985)

  26. Huelsenbeck, J.P., Ronquist, F.: MrBayes: Bayesian inference of phylogenetic trees. Bioinformatics 17, 754–755 (2001)

  27. Huelsenbeck, J.P., Ronquist, F., Nielsen, R., Bollback, J.P.: Bayesian inference of phylogeny and its impact on evolutionary biology. Science 294(5550), 2310–2314 (2001)

  28. IBM: Engineering and Scientific Subroutine Library (ESSL) and parallel ESSL. http://www-03.ibm.com/systems/p/software/essl. Cited 16 Dec 2013

  29. Jhurani, C., Mullowney, P.: A GEMM interface and implementation on NVIDIA GPUs for multiple small matrices. www.ices.utexas.edu/~chetan/preprints/2013-CJ-PM-GEMM.pdf. Cited 16 Dec 2013

  30. Keane, T., Naughton, T., Travers, S., McInerney, J., McCormack, G.: DPRml: distributed phylogeny reconstruction by maximum likelihood. Bioinformatics 21, 969–974 (2005)

  31. Keeneland: http://keeneland.gatech.edu/. Cited 29 Jan 2014

  32. Kepler Whitepaper: http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf. Cited 16 Dec 2013

  33. Kurzak, J., Tomov, S., Dongarra, J.: LAPACK Working Note 245: Autotuning GEMMs for Fermi. Innovative Computing Laboratory, University of Tennessee (2011)

  34. Kurzak, J., Luszczek, P., Tomov, S., Dongarra, J.: LAPACK Working Note 267: Preliminary Results of Autotuning Gemm Kernels for the NVIDIA Kepler Architecture. Innovative Computing Laboratory, University of Tennessee (2012)

  35. Math Kernel Library (MKL): Intel(R). http://www.intel.com/cd/software/products/asmo-na/eng.347757.htm. Cited 16 Dec 2013

  36. Minh, B., Vinh, L., Haeseler, A., Schmidt, H.: pIQPNNI: parallel reconstruction of large maximum likelihood phylogenies. Bioinformatics 21, 3794–3796 (2005)

  37. Moler, C., Van Loan, C.: Nineteen dubious ways to compute the exponential of a matrix, twenty-five years later. SIAM Rev. (2003). doi: 10.1137/S00361445024180

  38. Moret, B., Bader, D., Warnow, T.: High-performance algorithm engineering for computational phylogenetics. J. Supercomput. 22, 99–111 (2002)

  39. Nath, R., Tomov, S., Dongarra, J.: An improved MAGMA GEMM for Fermi GPUs. Int. J. High Perform. Comput. 24(4), 511–515 (2010)

  40. Schmidt, H., Strimmer, K., Vingron, M., Haeseler, A.: TREE-PUZZLE: maximum likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics 18(2), 503–504 (2002)

  41. Stamatakis, A., Meier, L.T.: RAxML-III: a fast program for maximum likelihood-based inference of large phylogenetic trees. Bioinformatics 21(4), 456–463 (2005)

  42. Suchard, M., Rambaut, A.: Many-core algorithms for statistical phylogenetics. Bioinformatics 25, 1370–1376 (2009)

  43. Tierney, L.: Markov chains for exploring posterior distributions. Ann. Stat. 22(4), 1701–1728 (1994)

  44. Whaley, R.C., Petitet, A., Dongarra, J.: Automated empirical optimizations of software and the ATLAS project. Parallel Comput. 27(1–2), 3–35 (2001)

  45. Zwickl, D.: Genetic algorithm approaches for the phylogenetic analysis of large biological sequence datasets under the maximum likelihood criterion. Ph.D. dissertation, University of Texas, Austin (2006)

Corresponding author

Correspondence to Mitchel D. Horton.

Copyright information

© 2014 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Lopez, M.G., Horton, M.D. (2014). Batch Matrix Exponentiation. In: Kindratenko, V. (eds) Numerical Computations with GPUs. Springer, Cham. https://doi.org/10.1007/978-3-319-06548-9_3

  • DOI: https://doi.org/10.1007/978-3-319-06548-9_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-06547-2

  • Online ISBN: 978-3-319-06548-9

  • eBook Packages: Computer Science (R0)
