GPU parallelization of the sequential matrix diagonalization algorithm and its application to high-dimensional data

Published in: The Journal of Supercomputing

Abstract

This paper presents the parallelization on a GPU of the sequential matrix diagonalization (SMD) algorithm, the most recent technique for polynomial eigenvalue decomposition, which diagonalizes polynomial covariance matrices. We first parallelize with CUDA the calculation of the polynomial covariance matrix. Then, following a formal transformation of the polynomial matrix multiplication code, which SMD uses extensively, we insert into this code the cublasDgemm function of the CUBLAS library. Furthermore, a specialized cache memory system is implemented within the GPU to greatly limit the PC-to-GPU transfers of slices of polynomial matrices. The resulting SMD code can be applied efficiently to high-dimensional data. The proposed method is verified using sequences of images of airplanes with varying spatial orientation. The performance of the parallel codes for polynomial covariance matrix generation and SMD is evaluated and reveals speedups of up to 161× and 67×, respectively, relative to sequential execution on a PC.
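As an illustration of why polynomial matrix multiplication lends itself to a GEMM routine, the following minimal sketch (not the authors' code) multiplies two polynomial matrices stored as lists of coefficient slices. It assumes the standard convolution form C[k] = sum over i + j = k of A[i] B[j]; the function names `matmul`, `mat_add` and `poly_matmul` are hypothetical, and everything runs sequentially in plain Python. Each inner slice product followed by accumulation has the form C ← αAB + βC with α = β = 1, which is the operation cublasDgemm computes.

```python
# Sketch only: sequential reference for a polynomial matrix product.
# A polynomial matrix is a list of coefficient matrices (slices);
# the list index is the polynomial power.

def matmul(a, b):
    """Dense product of two matrices stored as lists of rows."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]

def mat_add(a, b):
    """Element-wise sum of two matrices of equal shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def poly_matmul(A, B):
    """Return C with C[k] = sum over i + j = k of A[i] @ B[j]."""
    rows, cols = len(A[0]), len(B[0][0])
    C = [[[0.0] * cols for _ in range(rows)]
         for _ in range(len(A) + len(B) - 1)]
    for i, a_slice in enumerate(A):
        for j, b_slice in enumerate(B):
            # One dense product plus accumulation: the GEMM-shaped step.
            C[i + j] = mat_add(C[i + j], matmul(a_slice, b_slice))
    return C
```

In this form the only heavy operation is the slice-by-slice dense product, so swapping `matmul` plus `mat_add` for a single GEMM call is a local change.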

Abbreviations

PEVD: Polynomial eigenvalue decomposition
SMD: Sequential matrix diagonalization
MIMO: Multi-input multi-output (convolution)
PCA: Principal component analysis
n: Dimensionality of the data
\(\mathbf{A}\): Polynomial matrix
\(\mathbf{R}\): Polynomial covariance matrix
\(\mathbf{E}\): Matrix of polynomial eigenvectors
W: Lag window half-length used to calculate \(\mathbf{R}\)
\(m_r\): Lag window length used to calculate \(\mathbf{R}\)
\(\mathbf{[R]}_{\mathbf{p}}\): Slice of \(\mathbf{R}\) at lag index p
GPU: Graphics processing unit
CUDA: Compute unified device architecture
CUBLAS: CUDA basic linear algebra subroutines
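
To make the symbols W, \(m_r\) and the lag slices of \(\mathbf{R}\) concrete, the sketch below estimates a polynomial covariance matrix from real-valued data on a CPU. It is not the authors' implementation. It assumes the common space-time covariance estimate, in which the slice at lag p has entries (1/T) Σ_t x_i[t] x_j[t − p], and it assumes \(m_r = 2W + 1\); the normalization, the lag sign convention and the function name `poly_covariance` are all assumptions for illustration.

```python
# Sketch only: sequential estimate of the lag slices of a polynomial
# covariance matrix for real-valued data.

def poly_covariance(x, W):
    """x: list of T samples, each a list of n values.
    Returns a dict mapping each lag p in -W..W to an n x n slice with
    entries (1/T) * sum over t of x[t][i] * x[t - p][j]; terms whose
    index t - p falls outside 0..T-1 are dropped."""
    T, n = len(x), len(x[0])
    R = {}
    for lag in range(-W, W + 1):
        acc = [[0.0] * n for _ in range(n)]
        for t in range(T):
            u = t - lag
            if 0 <= u < T:
                for i in range(n):
                    for j in range(n):
                        acc[i][j] += x[t][i] * x[u][j]
        R[lag] = [[v / T for v in row] for row in acc]
    return R
```

Under these assumptions the result holds 2W + 1 slices, and the slice at lag −p is the transpose of the slice at lag p. Every slice entry is an independent sum over t, which is what makes the computation a natural candidate for parallel execution.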

Author information

Corresponding author

Correspondence to Manuel Carcenac.

About this article

Cite this article

Carcenac, M., Redif, S. & Kasap, S. GPU parallelization of the sequential matrix diagonalization algorithm and its application to high-dimensional data. J Supercomput 73, 3603–3634 (2017). https://doi.org/10.1007/s11227-017-1961-6

  • DOI: https://doi.org/10.1007/s11227-017-1961-6
