ABSTRACT
In this paper, we discuss the GPU-based implementation and optimization of Householder bidiagonalization, a matrix factorization method that is an integral part of the full Singular Value Decomposition (SVD), an important algorithm for many problems in the research domain of Multimedia Content Analysis (MMCA). On cluster computers, complex adaptive run-time techniques must often be implemented to overcome the growing negative performance impact of load imbalances and to ensure reasonable speedup. We show that the nature of the many-core platform makes such complex run-time parallelization techniques in software unnecessary, while achieving a performance of 64 gigaflops in double precision on a single-GPU GTX 295, 82% of the theoretical peak performance.
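For readers unfamiliar with the factorization, the following is a minimal sequential sketch of Householder bidiagonalization in plain Python. It is not the GPU implementation discussed in the paper. The function names `householder_vector` and `bidiagonalize` are hypothetical, the input is assumed to be a dense matrix with at least as many rows as columns, and the orthogonal factors are applied in place rather than accumulated.

```python
import math


def householder_vector(x):
    """Return (v, beta) so that (I - beta * v v^T) x has zeros after its first entry."""
    norm = math.sqrt(sum(xi * xi for xi in x))
    if norm == 0.0:
        return list(x), 0.0  # nothing to eliminate; beta = 0 gives the identity
    # Choose the sign that avoids cancellation in v[0].
    alpha = -math.copysign(norm, x[0])
    v = list(x)
    v[0] -= alpha
    return v, 2.0 / sum(vi * vi for vi in v)


def bidiagonalize(a):
    """Reduce an m x n matrix (m >= n, assumed) to upper bidiagonal form.

    Only the bidiagonal matrix B is returned; the reflections that would
    form U and V are applied to a copy of the input and then discarded.
    """
    b = [row[:] for row in a]
    m, n = len(b), len(b[0])
    for k in range(n):
        # Left reflection: zero column k below the diagonal.
        v, beta = householder_vector([b[i][k] for i in range(k, m)])
        for j in range(k, n):
            dot = sum(v[i - k] * b[i][j] for i in range(k, m))
            for i in range(k, m):
                b[i][j] -= beta * dot * v[i - k]
        if k < n - 2:
            # Right reflection: zero row k to the right of the superdiagonal.
            v, beta = householder_vector(b[k][k + 1:])
            for i in range(k, m):
                dot = sum(v[j - k - 1] * b[i][j] for j in range(k + 1, n))
                for j in range(k + 1, n):
                    b[i][j] -= beta * dot * v[j - k - 1]
    return b


if __name__ == "__main__":
    example = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 10.0], [2.0, 1.0, 0.0]]
    for row in bidiagonalize(example):
        print(["%.4f" % value for value in row])
```

Each step applies a reflection to the whole trailing submatrix, and that submatrix shrinks as `k` grows. This is the shrinking workload behind the load imbalances the abstract mentions for cluster computers.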
Index Terms
- GPU-based parallel Householder bidiagonalization