Batch Matrix Exponentiation

Lopez, M. Graham; Horton, Mitchel D.

doi:10.1007/978-3-319-06548-9_3

Abstract

Matrix–matrix multiplication is a linchpin of applied dense numerical linear algebra: the performance of many common dense linear algebra packages is closely tied to it. Batch matrix–matrix multiplication, the multiplication of a large number of relatively small matrices, is a developing area within dense linear algebra and is relevant to application areas such as phylogenetics, finite element modeling, image processing, fluid dynamics, and hydrodynamics. Using batch matrix–matrix multiplication as the foundation, we have developed an optimized batch matrix exponentiation algorithm in CUDA that outperforms cublasXgemmBatched for small square matrices. After introducing the original motivation for our problem, matrix exponentiation from the phylogenetics domain, we discuss our algorithm in the context of both cublasXgemmBatched and two alternative GPU methods for the numerical computation of matrix exponentiation: Lagrange interpolation and Newton interpolation. All comparisons are performed on both the Fermi and Kepler architectures.

Notes

1. In this work, we refer to general matrix–matrix multiplication as GEMM, in adherence with the Basic Linear Algebra Subprograms (BLAS) standard [5].
2. Here, M is the dimension of the probability matrix and the number of sites in the model. For example, M = 4 for the nucleotide model.
3. We use the following flop count throughout this work, regardless of the algorithm, implementation, or architecture:
$$\mathrm{flops} = n \, (3m^{3} + 2m) \tag{3.9}$$
where n is the number of branch lengths and m is the dimension of the matrix E from Eq. (3.8). This count comes from Ln. 24 and 32 of Cd. 3.
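
As a rough illustration of how the flop count in Eq. (3.9) could be used when reporting performance, the following host-side C++ sketch evaluates the count and converts it to a rate. The function names `batch_expm_flops` and `gflops_per_second` are hypothetical and do not appear in the chapter, and the elapsed time is assumed to come from whatever timer the benchmark uses.

```cpp
#include <cassert>
#include <cstdint>

// Flop count from Eq. (3.9): n * (3m^3 + 2m).
//   n: number of branch lengths
//   m: dimension of the matrix E
// Hypothetical helper name; not taken from the chapter's code.
constexpr std::uint64_t batch_expm_flops(std::uint64_t n, std::uint64_t m) {
    return n * (3 * m * m * m + 2 * m);
}

// Hypothetical helper: turn a flop count and a measured wall time
// (in seconds) into GFLOP/s. The time must be positive.
inline double gflops_per_second(std::uint64_t flops, double seconds) {
    assert(seconds > 0.0);
    return static_cast<double>(flops) / seconds / 1e9;
}

// Compile-time check with illustrative sizes (not from the chapter):
// 1000 * (3 * 4^3 + 2 * 4) = 1000 * 200 = 200000.
static_assert(batch_expm_flops(1000, 4) == 200000,
              "Eq. (3.9) evaluated for n = 1000, m = 4");
```

Because the same count is applied regardless of algorithm, implementation, or architecture, rates computed this way stay comparable across methods even when a method performs a different number of actual operations.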

References

AMD Core Math Library (ACML): www.amd.com/acml. Cited 16 Dec 2013
Amestoy, P.R., Duff, I.S., L’Excellent, J.Y.: Multifrontal parallel distributed symmetric and unsymmetric solvers. Comput. Methods Appl. Mech. Eng. (2000). doi: 10.1016/S0045-7825(99)00242X
Anderson, E., Bai, Z., Bischof, C., Blackford, L.S., Demmel, J.W., Dongarra, J.J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., Sorensen, D.: LAPACK Users’ Guide. SIAM (1992). http://www.netlib.org/lapack/lug/. Cited 16 Dec 2013
Ayres, D.L., Darling, A., Zwickl, D.J., Beerli, P., Holder, M.T., Lewis, P.O., Huelsenbeck, J.P., Ronquist, F., Swofford, D.L., Cummings, M.P., Rambaut, A., Suchard, M.A.: BEAGLE: an application programming interface and high-performance computing library for statistical phylogenetics. Syst. Biol. 61(1), 170–173 (2012)
Basic Linear Algebra Technical Forum: http://www.netlib.org/blas/blast-forum/blas-report.pdf. Cited 16 Dec 2013
Blackford, L.S., Choi, J., Cleary, A., D’Azevedo, E., Demmel, J., Dhillon, I., Dongarra, J.J., Hammarling, S., Henry, G., Petitet, A., Stanley, K., Walker, D., Whaley, R.C.: ScaLAPACK Users’ Guide. SIAM (1997). http://www.netlib.org/scalapack/slug/. Cited 16 Dec 2013
CUBLAS: https://developer.nvidia.com/cuBLAS. Cited 16 Dec 2013
CUBLAS Documentation: http://docs.nvidia.com/cuda/cublas/. Cited 16 Dec 2013
CUDA C Programming Guide: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html. Cited 16 Dec 2013
CUDA Toolkit Documentation: http://docs.nvidia.com/cuda/cuda-samples/. Cited 16 Dec 2013
CULA Tools: http://www.culatools.com/blog/2011/12/09/batched-operations/. Cited 16 Dec 2013
Demmel, J., Volkov, V.: Benchmarking GPUs to tune dense linear algebra. In: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, vol. 31. IEEE Press, Piscataway (2008)
Demmel, J.W., Eisenstat, S.C., Gilbert, J.R., Li, X.S., Liu, J.W.H.: A supernodal approach to sparse partial pivoting. SIAM J. Matrix Anal. Appl. (1999). doi: 10.1137/S0895479895291765
Donfack, S., Dongarra, J., Faverge, M., Gates, M., Kurzak, J., Luszczek, P., Yamazaki, I.: LAPACK Working Note 280: On Algorithmic Variants of Parallel Gaussian Elimination: Comparison of Implementations in Terms of Performance and Numerical Properties. Innovative Computing Laboratory, University of Tennessee, Knoxville (2013)
Dong, T., Dobrev, V., Kolev, T., Rieben, R., Tomov, S., Dongarra, J.: Hydrodynamic Computation with Hybrid Programming on CPU-GPU Clusters. Innovative Computing Laboratory, University of Tennessee (2013)
Dongarra, J.J., Luszczek, P., Petitet, A.: The LINPACK benchmark: past, present and future. Concurr. Comput. Pract. Exp. (2003). doi: 10.1002/cpe.728
Drummond, A., Rambaut, A.: BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evol. Biol. 7, 214 (2007)
Drummond, A., Suchard, M., Xie, D., Rambaut, A.: Bayesian phylogenetics with BEAUti and the BEAST 1.7. Mol. Biol. Evol. 29(8), 1969–1973 (2012)
Durbin, R., Eddy, S., Mitchison, G.: Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, 1st edn. Cambridge University Press, Cambridge (1997)
Felsenstein, J.: Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17, 368–376 (1981)
Felsenstein, J.: Inferring Phylogenies. Sinauer Associates, Sunderland (2003)
Feng, X., Buell, D., Rose, J., Waddell, P.: Parallel algorithms for Bayesian phylogenetic inference. J. Parallel Distrib. Comput. 63, 707–718 (2003)
Feng, X., Cameron, K., Sosa, C., Smith, B.: Building the tree of life on terascale systems. In: Parallel Distributed Processing Symposium (IPDPS 2007), Washington (2007)
GoToBLAS: Texas Advanced Computing Center. http://www.tacc.utexas.edu/. Cited 16 Dec 2013
Hasegawa, M., Kishino, H., Yano, T.: Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22(2), 160–174 (1985)
Huelsenbeck, J.P., Ronquist, F.: MrBayes: Bayesian inference of phylogenetic trees. Bioinformatics 17, 754–755 (2001)
Huelsenbeck, J.P., Ronquist, F., Nielsen, R., Bollback, J.P.: Bayesian inference of phylogeny and its impact on evolutionary biology. Science 294(5550), 2310–2314 (2001)
IBM: Engineering and Scientific Subroutine Library (ESSL) and parallel ESSL. http://www-03.ibm.com/systems/p/software/essl. Cited 16 Dec 2013
Jhurani, C., Mullowney, P.: A GEMM interface and implementation on NVIDIA GPUs for multiple small matrices. www.ices.utexas.edu/~chetan/preprints/2013-CJ-PM-GEMM.pdf. Cited 16 Dec 2013
Keane, T., Naughton, T., Travers, S., McInerney, J., McCormack, G.: DPRml: distributed phylogeny reconstruction by maximum likelihood. Bioinformatics 21, 969–974 (2005)
Keeneland: http://keeneland.gatech.edu/. Cited 29 Jan 2014
Kepler Whitepaper: http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf. Cited 16 Dec 2013
Kurzak, J., Tomov, S., Dongarra, J.: LAPACK Working Note 245: Autotuning GEMMs for Fermi. Innovative Computing Laboratory, University of Tennessee (2011)
Kurzak, J., Luszczek, P., Tomov, S., Dongarra, J.: LAPACK Working Note 267: Preliminary Results of Autotuning Gemm Kernels for the NVIDIA Kepler Architecture. Innovative Computing Laboratory, University of Tennessee (2012)
Math Kernel Library (MKL): Intel(R). http://www.intel.com/cd/software/products/asmo-na/eng.347757.htm. Cited 16 Dec 2013
Minh, B., Vinh, L., Haeseler, A., Schmidt, H.: pIQPNNI: parallel reconstruction of large maximum likelihood phylogenies. Bioinformatics 21, 3794–3796 (2005)
Moler, C., Van Loan, C.: Nineteen dubious ways to compute the exponential of a matrix, twenty-five years later. SIAM Rev. (2003). doi: 10.1137/S00361445024180
Moret, B., Bader, D., Warnow, T.: High-performance algorithm engineering for computational phylogenetics. J. Supercomput. 22, 99–11 (2002)
Nath, R., Tomov, S., Dongarra, J.: An improved MAGMA GEMM for Fermi GPUs. Int. J. High Perform. Comput. 24(4), 511–515 (2010)
Schmidt, H., Strimmer, K., Vingron, M., Haeseler, A.: TREE-PUZZLE: maximum likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics 18(2), 503–504 (2002)
Stamatakis, A., Meier, L.T.: RAxML-III: a fast program for maximum likelihood-based inference of large phylogenetic trees. Bioinformatics 21(4), 456–463 (2005)
Suchard, M., Rambaut, A.: Many-core algorithms for statistical phylogenetics. Bioinformatics 25, 1370–1376 (2009)
Tierney, L.: Markov chains for exploring posterior distributions. Ann. Stat. 22(4), 1701–1728 (1994)
Whaley, R.C., Petitet, A., Dongarra, J.: Automated empirical optimizations of software and the ATLAS project. Parallel Comput. 27(1–2), 3–35 (2001)
Zwickl, D.: Genetic algorithm approaches for the phylogenetic analysis of large biological sequence datasets under the maximum likelihood criterion. Ph.D. dissertation, University of Texas, Austin (2006)

Author information

Authors and Affiliations

Georgia Institute of Technology, Atlanta, GA, 30332, USA
M. Graham Lopez & Mitchel D. Horton

Corresponding author

Correspondence to Mitchel D. Horton.

Editor information

Editors and Affiliations

National Center for Supercomputing Applications, University of Illinois, Urbana, Illinois, USA
Volodymyr Kindratenko

About this chapter

Cite this chapter

Lopez, M.G., Horton, M.D. (2014). Batch Matrix Exponentiation. In: Kindratenko, V. (ed.) Numerical Computations with GPUs. Springer, Cham. https://doi.org/10.1007/978-3-319-06548-9_3

DOI: https://doi.org/10.1007/978-3-319-06548-9_3
Published: 09 June 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-06547-2
Online ISBN: 978-3-319-06548-9
eBook Packages: Computer Science, Computer Science (R0)
