Abstract
Matrix–matrix multiplication can be considered a linchpin of applied dense numerical linear algebra, since the performance of many common dense linear algebra packages is closely tied to it. Batch matrix–matrix multiplication, the multiplication of a large number of relatively small matrices, is a developing area within dense linear algebra and is relevant to application areas such as phylogenetics, finite element modeling, image processing, fluid dynamics, and hydrodynamics. Using batch matrix–matrix multiplication as the foundation, we have developed an optimized batch matrix exponentiation algorithm in CUDA that outperforms cublasXgemmBatched for small square matrices. After introducing the original motivation for our problem, matrix exponentiation from the phylogenetics domain, we discuss our algorithm in the context of both cublasXgemmBatched and two alternative GPU methods for the numerical computation of matrix exponentiation: Lagrange interpolation and Newton interpolation. All comparisons are done on both the Fermi and Kepler architectures.
Notes
- 1.
In this work, we refer to general matrix–matrix multiplication as GEMM, in adherence with the Basic Linear Algebra Subroutines (BLAS) standard [5].
- 2.
Here, M is the dimension of the probability matrix and the number of states in the model. For example, M = 4 for the nucleotide model.
- 3.
We use the following flop count throughout this work, regardless of the algorithm, implementation, or architecture:
$$\displaystyle{ \mathrm{flops} = n \ast (3m^{3} + 2m) } \tag{3.9}$$
where n is the number of branch lengths, and m is the dimension of the matrix E from Eq. (3.8). This count comes from Ln. 24 and 32 of Cd. 3.
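As a quick illustration of Eq. (3.9), the following minimal sketch evaluates the flop count and converts a measured runtime into a GFLOP/s rate. The function names `batch_expm_flops` and `gflops_rate` and the sample timing are hypothetical and chosen only for this example; they do not come from the chapter's code.

```python
def batch_expm_flops(n: int, m: int) -> int:
    """Flop count of Eq. (3.9): n * (3*m^3 + 2*m).

    n: number of branch lengths (one matrix exponential per branch length)
    m: dimension of the square matrix E
    """
    return n * (3 * m**3 + 2 * m)


def gflops_rate(n: int, m: int, seconds: float) -> float:
    """Convert a measured wall-clock time (assumed input) into GFLOP/s."""
    return batch_expm_flops(n, m) / seconds / 1e9


if __name__ == "__main__":
    # Nucleotide model: m = 4, so each branch length costs 3*64 + 2*4 = 200 flops.
    print(batch_expm_flops(1, 4))     # 200
    print(batch_expm_flops(1000, 4))  # 200000
    # Hypothetical timing of 0.5 s for 1000 branch lengths, for illustration only.
    print(gflops_rate(1000, 4, 0.5))
```

Because the same count is used regardless of algorithm, implementation, or architecture, rates computed this way are comparable across methods even when a method performs a different number of actual operations.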
References
AMD Core Math Library (ACML): www.amd.com/acml. Cited 16 Dec 2013
Amestoy, P.R., Duff, I.S., L’Excellent, J.Y.: Multifrontal parallel distributed symmetric and unsymmetric solvers. Comput. Methods Appl. Mech. Eng. (2000). doi: 10.1016/S0045-7825(99)00242-X
Anderson, E., Bai, Z., Bischof, C., Blackford, L.S., Demmel, J.W., Dongarra, J.J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., Sorensen, D.: LAPACK Users’ Guide. SIAM (1992). http://www.netlib.org/lapack/lug/. Cited 16 Dec 2013
Ayres, D.L., Darling, A., Zwickl, D.J., Beerli, P., Holder, M.T., Lewis, P.O., Huelsenbeck, J.P., Ronquist, F., Swofford, D.L., Cummings, M.P., Rambaut, A., Suchard, M.A.: BEAGLE: an application programming interface and high-performance computing library for statistical phylogenetics. Syst. Biol. 61(1), 170–173 (2012)
Basic Linear Algebra Technical Forum: http://www.netlib.org/blas/blast-forum/blas-report.pdf. Cited 16 Dec 2013
Blackford, L.S., Choi, J., Cleary, A., D’Azevedo, E., Demmel, J., Dhillon, I., Dongarra, J.J., Hammarling, S., Henry, G., Petitet, A., Stanley, K., Walker, D., Whaley, R.C.: ScaLAPACK Users’ Guide. SIAM (1997). http://www.netlib.org/scalapack/slug/. Cited 16 Dec 2013
CUBLAS: https://developer.nvidia.com/cuBLAS. Cited 16 Dec 2013
CUBLAS Documentation: http://docs.nvidia.com/cuda/cublas/. Cited 16 Dec 2013
CUDA C Programming Guide: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html. Cited 16 Dec 2013
CUDA Toolkit Documentation: http://docs.nvidia.com/cuda/cuda-samples/. Cited 16 Dec 2013
CULA Tools: http://www.culatools.com/blog/2011/12/09/batched-operations/. Cited 16 Dec 2013
Demmel, J., Volkov, V.: Benchmarking GPUs to tune dense linear algebra. In: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, vol. 31. IEEE Press, Piscataway (2008)
Demmel, J.W., Eisenstat, S.C., Gilbert, J.R., Li, X.S., Liu, J.W.H.: A supernodal approach to sparse partial pivoting. SIAM J. Matrix Anal. Appl. (1999). doi: 10.1137/S0895479895291765
Donfack, S., Dongarra, J., Faverge, M., Gates, M., Kurzak, J., Luszczek, P., Yamazaki, I.: LAPACK Working Note 280: On Algorithmic Variants of Parallel Gaussian Elimination: Comparison of Implementations in Terms of Performance and Numerical Properties. Innovative Computing Laboratory, University of Tennessee, Knoxville (2013)
Dong, T., Dobrev, V., Kolev, T., Rieben, R., Tomov, S., Dongarra, J.: Hydrodynamic Computation with Hybrid Programming on CPU-GPU Clusters. Innovative Computing Laboratory, University of Tennessee (2013)
Dongarra, J.J., Luszczek, P., Petitet, A.: The LINPACK benchmark: past, present and future. Concurr. Comput. Pract. Exp. (2003). doi: 10.1002/cpe.728
Drummond, A., Rambaut, A.: BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evol. Biol. 7, 214 (2007)
Drummond, A., Suchard, M., Xie, D., Rambaut, A.: Bayesian phylogenetics with BEAUti and the BEAST 1.7. Mol. Biol. Evol. 29(8), 1969–1973 (2012)
Durbin, R., Eddy, S., Mitchison, G.: Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, 1st edn. Cambridge University Press, Cambridge (1997)
Felsenstein, J.: Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17, 368–376 (1981)
Felsenstein, J.: Inferring Phylogenies. Sinauer Associates, Sunderland (2003)
Feng, X., Buell, D., Rose, J., Waddell, P.: Parallel algorithms for Bayesian phylogenetic inference. J. Parallel Distrib. Comput. 63, 707–718 (2003)
Feng, X., Cameron, K., Sosa, C., Smith, B.: Building the tree of life on terascale systems. In: Parallel Distributed Processing Symposium (IPDPS 2007), Washington (2007)
GoToBLAS: Texas Advanced Computing Center. http://www.tacc.utexas.edu/. Cited 16 Dec 2013
Hasegawa, M., Kishino, H., Yano, T.: Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22(2), 160–174 (1985)
Huelsenbeck, J.P., Ronquist, F.: MrBayes: Bayesian inference of phylogenetic trees. Bioinformatics 17, 754–755 (2001)
Huelsenbeck, J.P., Ronquist, F., Nielsen, R., Bollback, J.P.: Bayesian inference of phylogeny and its impact on evolutionary biology. Science 294(5550), 2310–2314 (2001)
IBM: Engineering and Scientific Subroutine Library (ESSL) and parallel ESSL. http://www-03.ibm.com/systems/p/software/essl. Cited 16 Dec 2013
Jhurani, C., Mullowney, P.: A GEMM interface and implementation on NVIDIA GPUs for multiple small matrices. www.ices.utexas.edu/~chetan/preprints/2013-CJ-PM-GEMM.pdf. Cited 16 Dec 2013
Keane, T., Naughton, T., Travers, S., McInerney, J., McCormack, G.: DPRml: distributed phylogeny reconstruction by maximum likelihood. Bioinformatics 21, 969–974 (2005)
Keeneland: http://keeneland.gatech.edu/. Cited 29 Jan 2014
Kepler Whitepaper: http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf. Cited 16 Dec 2013
Kurzak, J., Tomov, S., Dongarra, J.: LAPACK Working Note 245: Autotuning GEMMs for Fermi. Innovative Computing Laboratory, University of Tennessee (2011)
Kurzak, J., Luszczek, P., Tomov, S., Dongarra, J.: LAPACK Working Note 267: Preliminary Results of Autotuning Gemm Kernels for the NVIDIA Kepler Architecture. Innovative Computing Laboratory, University of Tennessee (2012)
Math Kernel Library (MKL): Intel(R). http://www.intel.com/cd/software/products/asmo-na/eng.347757.htm. Cited 16 Dec 2013
Minh, B., Vinh, L., Haeseler, A., Schmidt, H.: pIQPNNI: parallel reconstruction of large maximum likelihood phylogenies. Bioinformatics 21, 3794–3796 (2005)
Moler, C., Van Loan, C.: Nineteen dubious ways to compute the exponential of a matrix, twenty-five years later. SIAM Rev. (2003). doi: 10.1137/S00361445024180
Moret, B., Bader, D., Warnow, T.: High-performance algorithm engineering for computational phylogenetics. J. Supercomput. 22, 99–11 (2002)
Nath, R., Tomov, S., Dongarra, J.: An improved MAGMA GEMM for Fermi GPUs. Int. J. High Perform. Comput. 24(4), 511–515 (2010)
Schmidt, H., Strimmer, K., Vingron, M., Haeseler, A.: TREE-PUZZLE: maximum likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics 18(2), 503–504 (2002)
Stamatakis, A., Ludwig, T., Meier, H.: RAxML-III: a fast program for maximum likelihood-based inference of large phylogenetic trees. Bioinformatics 21(4), 456–463 (2005)
Suchard, M., Rambaut, A.: Many-core algorithms for statistical phylogenetics. Bioinformatics 25, 1370–1376 (2009)
Tierney, L.: Markov chains for exploring posterior distributions. Ann. Stat. 22(4), 1701–1728 (1994)
Whaley, R.C., Petitet, A., Dongarra, J.: Automated empirical optimizations of software and the ATLAS project. Parallel Comput. 27(1–2), 3–35 (2001)
Zwickl, D.: Genetic algorithm approaches for the phylogenetic analysis of large biological sequence datasets under the maximum likelihood criterion. Ph.D. dissertation, University of Texas, Austin (2006)
Copyright information
© 2014 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Lopez, M.G., Horton, M.D. (2014). Batch Matrix Exponentiation. In: Kindratenko, V. (eds) Numerical Computations with GPUs. Springer, Cham. https://doi.org/10.1007/978-3-319-06548-9_3
DOI: https://doi.org/10.1007/978-3-319-06548-9_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-06547-2
Online ISBN: 978-3-319-06548-9
eBook Packages: Computer Science (R0)