GPU parallelization of the sequential matrix diagonalization algorithm and its application to high-dimensional data

Carcenac, Manuel; Redif, Soydan; Kasap, Server

doi:10.1007/s11227-017-1961-6

Published: 18 January 2017

Volume 73, pages 3603–3634, (2017)

The Journal of Supercomputing

Abstract

This paper presents a GPU parallelization of the sequential matrix diagonalization (SMD) algorithm, a method for diagonalizing polynomial covariance matrices and the most recent technique for polynomial eigenvalue decomposition. We first parallelize the calculation of the polynomial covariance matrix with CUDA. Then, following a formal transformation of the polynomial matrix multiplication code, which SMD uses extensively, we insert the cublasDgemm function of the CUBLAS library into this code. Furthermore, a specialized cache memory system is implemented within the GPU to greatly limit the PC-to-GPU transfers of slices of polynomial matrices. The resulting SMD code can be applied efficiently to high-dimensional data. The proposed method is verified using sequences of images of airplanes with varying spatial orientation. The performance of the parallel codes for polynomial covariance matrix generation and SMD is evaluated and reveals speedups of up to 161 and 67, respectively, relative to sequential execution on a PC.
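
As an illustration of why a dense matrix product routine such as cublasDgemm fits the polynomial matrix multiplication mentioned above, the following minimal sketch multiplies two polynomial matrices slice by slice on the CPU. It is not the authors' code: the function names (`matmul`, `mat_add`, `poly_matmul`) and the storage layout (a list of coefficient slices, one per power of z^-1) are assumptions made for this example. Each call to `matmul` marks the kind of slice-by-slice product that could be delegated to a GEMM routine on a GPU.

```python
def matmul(a, b):
    """Plain dense product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def mat_add(a, b):
    """Element-wise sum of two matrices of equal shape."""
    return [[x + y for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


def poly_matmul(A, B):
    """Multiply two polynomial matrices given as lists of coefficient slices.

    A[p] is the matrix coefficient of z^-p (hypothetical layout), so the
    product satisfies C[r] = sum over p + q = r of A[p] * B[q].
    """
    rows, cols = len(A[0]), len(B[0][0])
    n_slices = len(A) + len(B) - 1
    C = [[[0.0] * cols for _ in range(rows)] for _ in range(n_slices)]
    for p, Ap in enumerate(A):
        for q, Bq in enumerate(B):
            # One ordinary matrix product per pair of slices.
            C[p + q] = mat_add(C[p + q], matmul(Ap, Bq))
    return C


# Example: (I + N z^-1)(I - N z^-1) = I when N is nilpotent (N * N = 0).
I2 = [[1, 0], [0, 1]]
N = [[0, 1], [0, 0]]
minus_N = [[0, -1], [0, 0]]
product = poly_matmul([I2, N], [I2, minus_N])
```

The double loop over slice pairs is a matrix-valued convolution, so the cost grows with the product of the two polynomial orders. Keeping slices resident on the GPU rather than transferring them for each product is the concern that the cache system described in the abstract addresses.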

Abbreviations

PEVD:: Polynomial eigenvalue decomposition
SMD:: Sequential matrix diagonalization
MIMO:: Multi-input multi-output (convolution)
PCA:: Principal component analysis
n :: Dimensionality of the data
$\mathbf{A}$ :: Polynomial matrix
$\mathbf{R}$ :: Polynomial covariance matrix
$\mathbf{E}$ :: Matrix of polynomial eigenvectors
W :: Lag window half-length used to calculate $\mathbf{R}$
$m_r$ :: Lag window length used to calculate $\mathbf{R}$
$[\mathbf{R}]_{p}$ :: Slice of $\mathbf{R}$ at lag index p
GPU:: Graphics processing unit
CUDA:: Compute unified device architecture
CUBLAS:: CUDA basic linear algebra subroutines
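
To make the notation above concrete, here is a minimal sequential sketch of how a polynomial covariance matrix could be estimated from n-dimensional data. It rests on several assumptions that are not stated on this page: the data are real and zero-mean, the lag window covers lags -W to W so that the window length is 2W + 1, each slice is normalized by the number of samples, and the function name `poly_covariance` is hypothetical. The paper performs this step in parallel with CUDA; this version only shows the arithmetic.

```python
def poly_covariance(x, W):
    """Estimate the slices of a polynomial covariance matrix.

    x: list of T samples, each a list of n real values (assumed zero-mean).
    W: lag window half-length.
    Returns 2 * W + 1 slices; slice index p corresponds to lag p - W.
    """
    T, n = len(x), len(x[0])
    slices = []
    for lag in range(-W, W + 1):
        R = [[0.0] * n for _ in range(n)]
        # Keep only the time indices for which both t and t - lag are valid.
        for t in range(max(0, lag), min(T, T + lag)):
            for i in range(n):
                for j in range(n):
                    R[i][j] += x[t][i] * x[t - lag][j]
        # Normalizing by T is an assumed convention for this sketch.
        slices.append([[value / T for value in row] for row in R])
    return slices


# Example: four samples of 2-dimensional data, half-length W = 1.
samples = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
R_slices = poly_covariance(samples, 1)
```

For real data the slice at lag -tau is the transpose of the slice at lag tau, so only the non-negative lags strictly need to be computed. Each slice entry is an independent sum over time, which is what makes this step a natural candidate for GPU parallelization.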

Author information

Authors and Affiliations

Via Mersin 10, Boǧaztepe, Turkey
Manuel Carcenac
Department of Electrical and Electronics Engineering, European University of Lefke, Via Mersin 10, Gemikonaǧı, Turkey
Soydan Redif
Department of Computer Engineering, College of Engineering and Technology, American University of the Middle East, Egaila, Kuwait
Server Kasap

Corresponding author

Correspondence to Manuel Carcenac.

About this article

Cite this article

Carcenac, M., Redif, S. & Kasap, S. GPU parallelization of the sequential matrix diagonalization algorithm and its application to high-dimensional data. J Supercomput 73, 3603–3634 (2017). https://doi.org/10.1007/s11227-017-1961-6

Published: 18 January 2017
Issue Date: August 2017
DOI: https://doi.org/10.1007/s11227-017-1961-6
