Abstract
This paper presents a GPU parallelization of the sequential matrix diagonalization (SMD) algorithm, the most recent technique for polynomial eigenvalue decomposition, which diagonalizes polynomial covariance matrices. We first parallelize the calculation of the polynomial covariance matrix with CUDA. Then, following a formal transformation of the polynomial matrix multiplication code, which SMD uses extensively, we insert the cublasDgemm function of the CUBLAS library into this code. Furthermore, a specialized cache memory system is implemented within the GPU to greatly limit the PC-to-GPU transfers of slices of polynomial matrices. The resulting SMD code can be applied efficiently to high-dimensional data. The proposed method is verified using sequences of images of airplanes with varying spatial orientation. The performance of the parallel codes for polynomial covariance matrix generation and SMD is evaluated and reveals speedups of up to 161 and 67, respectively, relative to sequential execution on a PC.
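To illustrate why polynomial matrix multiplication maps naturally onto a dense matrix-product routine, the following is a minimal sequential sketch in plain Python, not the paper's code. It assumes a polynomial matrix is stored as a list of coefficient slices indexed from lag 0; the names `matmul`, `matadd`, and `poly_matmul` are hypothetical. Each slice-by-slice product in the inner loop is the kind of dense product that a GEMM routine such as cublasDgemm computes. How the paper arranges these calls and the GPU cache is not stated in the abstract.

```python
def matmul(a, b):
    """Dense product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def matadd(a, b):
    """Element-wise sum of two matrices of equal shape."""
    return [[x + y for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


def poly_matmul(A, B):
    """Multiply two polynomial matrices given as lists of slices.

    A[p] and B[q] are the coefficient matrices of the p-th and q-th
    powers of the delay variable (an assumed storage convention).
    The product slice at index p + q accumulates A[p] times B[q].
    """
    n_rows, n_cols = len(A[0]), len(B[0][0])
    C = [[[0.0] * n_cols for _ in range(n_rows)]
         for _ in range(len(A) + len(B) - 1)]
    for p, slice_a in enumerate(A):
        for q, slice_b in enumerate(B):
            # One dense matrix product per pair of slices.
            C[p + q] = matadd(C[p + q], matmul(slice_a, slice_b))
    return C


if __name__ == "__main__":
    identity = [[1, 0], [0, 1]]
    swap = [[0, 1], [1, 0]]
    P = [identity, swap]          # identity + swap * z^{-1}
    for index, coeff in enumerate(poly_matmul(P, P)):
        print(index, coeff)
```

The double loop performs one dense product for every pair of slices, so the cost grows with the product of the two polynomial orders; this is the part that benefits most from being delegated to an optimized GEMM.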





















Abbreviations
- PEVD: Polynomial eigenvalue decomposition
- SMD: Sequential matrix diagonalization
- MIMO: Multi-input multi-output (convolution)
- PCA: Principal component analysis
- n: Dimensionality of the data
- \(\mathbf{A}\): Polynomial matrix
- \(\mathbf{R}\): Polynomial covariance matrix
- \(\mathbf{E}\): Matrix of polynomial eigenvectors
- W: Lag window half-length used to calculate \(\mathbf{R}\)
- \(m_r\): Lag window length used to calculate \(\mathbf{R}\)
- \(\mathbf{[R]}_{\mathbf{p}}\): Slice of \(\mathbf{R}\) at lag index p
- GPU: Graphical processing unit
- CUDA: Compute unified device architecture
- CUBLAS: CUDA basic linear algebra subroutines
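The symbols W, \(m_r\), and \(\mathbf{[R]}_{\mathbf{p}}\) can be made concrete with a small sequential sketch of a polynomial covariance estimate in plain Python. Several points are assumptions that the text above does not state: the data are real-valued, the slice at lag p is the average of the outer products of sample t with sample t - p, the average divides by the total number of samples, and \(m_r = 2W + 1\). The function name `poly_covariance` is hypothetical.

```python
def poly_covariance(x, W):
    """Estimate the slices of a polynomial covariance matrix.

    x is a list of T samples, each a list of n real values.
    Returns a dict mapping each lag p in [-W, W] to an n-by-n slice.
    Assumed estimator: slice[p] = (1/T) * sum_t x[t] * x[t - p]^T.
    """
    T, n = len(x), len(x[0])
    R = {}
    for p in range(-W, W + 1):
        acc = [[0.0] * n for _ in range(n)]
        # Keep both t and t - p inside the valid sample range.
        for t in range(max(0, p), min(T, T + p)):
            current, lagged = x[t], x[t - p]
            for i in range(n):
                for j in range(n):
                    acc[i][j] += current[i] * lagged[j]
        R[p] = [[value / T for value in row] for row in acc]
    return R


if __name__ == "__main__":
    samples = [[1, 2], [3, 4], [5, 6]]
    R = poly_covariance(samples, W=1)
    for lag in sorted(R):
        print(lag, R[lag])
```

Under these assumptions the slice at lag -p is the transpose of the slice at lag p, so only about half of the \(m_r\) slices need to be computed explicitly. Every slice is independent of the others, which is what makes this calculation a natural candidate for parallel execution.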
Cite this article
Carcenac, M., Redif, S. & Kasap, S. GPU parallelization of the sequential matrix diagonalization algorithm and its application to high-dimensional data. J Supercomput 73, 3603–3634 (2017). https://doi.org/10.1007/s11227-017-1961-6