
1 Introduction

There exist several relevant applications that require the computation of an orthonormal basis for a relatively small set of very long vectors. This problem is often tackled via a “tall-and-skinny” QR factorization (TSQR), named after the shape of the matrix containing the vectors. This type of decomposition can be leveraged, among others, for orthogonalization in Krylov subspace methods [11]; for the analysis of big data applications characterized by only a few descriptors (e.g., large datasets with a few variables produced by long data acquisitions from several sensors) [2, 9]; and as a preprocessing step when computing the singular value decomposition (SVD) of a matrix [8] with many more rows than columns.

It is well known that the blocked QR factorization based on Householder reflectors [8], hereafter QRF-H, is not efficient for the factorization of tall-and-skinny (TS) matrices on modern parallel processors; see, e.g., [4]. The reason is that, for matrices with few columns but a large number of rows, the fraction of work of QRF-H that can be cast in terms of kernels from the Level-3 BLAS (basic linear algebra subprograms [5]), as part of the highly parallel and efficient trailing update, cannot compensate for the high cost of the panel factorization, which is performed via the much slower Level-1 and Level-2 BLAS.

TSQR-H [4] is an algorithm specifically designed for the factorization of TS matrices. In this approach, the matrix is split into small “square-like” blocks, and a QRF-H is computed for each block. These small QRF-H are then merged in pairs, using a structure-aware version of QRF-H. This procedure can follow a linear scheme [9] or, on parallel machines, a recursion tree [4], yielding a communication-reducing scheme.

The Cholesky QR factorization [12] (QRF-C) is an alternative that reduces the amount of communication but, unfortunately, often suffers from a high loss of orthogonality. The use of mixed precision (in particular, the introduction of extended precision) in [15] can improve the accuracy of QRF-C, but its implementation cannot benefit from standard BLAS libraries. A simpler solution is to perform a second step of QRF-C to improve the orthogonality [6, 14]. This result connects neatly with the classical “twice is enough” rule for Gram-Schmidt re-orthogonalization [7]. However, as the number of vectors grows, the cost of QRF-C increases cubically. In addition, for very large problems the conditioning of the Gram matrix can become too large and a second “pass” may not be sufficient.
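For reference, one QRF-C step orthogonalizes the columns of a matrix via the Cholesky factor of its Gram matrix, and the “twice is enough” rule amounts to applying that step a second time to the computed factor. The following minimal NumPy sketch is our own illustration of both variants (not the implementation evaluated later in the paper):

```python
import numpy as np

def cholqr(A):
    """One Cholesky-QR pass: R from the Gram matrix A^T A, then Q = A R^{-1}."""
    G = A.T @ A                        # Gram matrix
    R = np.linalg.cholesky(G).T        # upper triangular Cholesky factor
    Q = np.linalg.solve(R.T, A.T).T    # Q = A R^{-1} via a triangular solve
    return Q, R

def cholqr2(A):
    """Two passes ("twice is enough"): re-orthogonalize Q and merge the factors."""
    Q1, R1 = cholqr(A)
    Q2, R2 = cholqr(Q1)
    return Q2, R2 @ R1
```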

In this paper we follow the approach in [16] to combine QRF-C and block Gram-Schmidt (BGS) [10], but propose to perform two steps of QRF-C and Gram-Schmidt instead of resorting to mixed (extended) precision. Furthermore, on the implementation side, when the target platform is equipped with a graphics processing unit (GPU), we propose to introduce a static look-ahead strategy [13] to overlap computations on the CPU, computations on the GPU, and CPU-GPU data transfers. We note that the formulation of a hybrid look-ahead variant of the algorithm shows that it is also possible to develop either a static implementation or a runtime-based version for multicore architectures, where the cost of the panel factorization can be practically amortized; see, e.g., [3] and the references therein.

The rest of the paper is organized as follows. Section 2 presents the details of our “two-pass” algorithm for the QR factorization of TS matrices. Next, Sects. 3 and 4 respectively provide numerical and performance evaluations of the new method in comparison with a state-of-the-art hybrid library for CPU-GPU platforms. Finally, Sect. 5 summarizes the contributions of this work and suggests future lines of research.

2 The QRF-CBGS Algorithm

The QRF-C and BGS algorithms can be combined in several manners, depending on the specific order of the operations. Although these variants are mathematically equivalent, they have different numerical properties. In our work, we follow [12] and interlace the steps of QRF-C and BGS in order to reduce the loss of orthogonality between blocks.

In particular, consider a TS matrix \(A \in \mathbb {R}^{m \times n}\), with \(m \gg n\), and the compact QRF given by

$$\begin{aligned} A = QR, \end{aligned}$$
(1)

where \(Q \in \mathbb {R}^{m \times n}\) has orthonormal columns and \(R \in \mathbb {R}^{n \times n}\) is upper triangular. Furthermore, consider the partitionings

$$\begin{aligned} A = \left[ A_1,~ A_2,\ldots , A_k\right] , \qquad R = \left[ \begin{array}{cccc} R_{1,1} & R_{1,2} & \ldots & R_{1,k} \\ 0 & R_{2,2} & \ldots & R_{2,k} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & R_{k,k} \end{array}\right] , \end{aligned}$$
(2)

where, for simplicity, we assume that n is an integer multiple of the algorithmic block size b, so that all block columns \(A_j\) consist of \(b=n/k\) columns each, and all blocks \(R_{i,j}\) contain \(b \times b\) elements each.

Our approach QRF-CBGS, combining the Cholesky QR factorization with the block Gram-Schmidt method, is then described in Algorithm 1.

[Algorithm 1: QRF-CBGS]

In this formulation of the QRF-CBGS algorithm, upon completion, the input matrix A is replaced by the output matrix Q, which contains the first n orthonormal vectors of the factorization. Furthermore, in this algorithm, the block size b is a key parameter that should be set to a fairly large value, with the optimal value depending on the target platform among other factors, in order to attain high performance. However, a block size that is “too large” can result in a large condition number for the Gram matrix \(A_j^TA_j\), which could cause numerical issues for the Cholesky decomposition.

All the operations in QRF-CBGS, except for the one in line 2, can be realized via direct calls to the efficient Level-3 BLAS, such as the symmetric rank-k update (xSYRK), the triangular solve with multiple right-hand sides (xTRSM), the triangular matrix-matrix multiplication (xTRMM), and the general matrix-matrix multiplication (xGEMM). The Cholesky factorization in line 2 can be realized via a call to the LAPACK routine for the corresponding decomposition (xPOTRF), which is then internally decomposed into Level-3 BLAS routines. However, the Cholesky factorization contributes a minor factor to the total cost, as it requires only \(O(b^3)\) flops per block column and, in practice, \(b \ll n\).

The Cholesky factorization can fail if the condition number of the matrix \(A_j^TA_j\) is too large. Fortunately, this failure is easy to detect thanks to the error value returned by xPOTRF or, more safely, it can be explicitly (and cheaply) identified by using a condition number estimator on \(A_j^TA_j\) [8]. This breakdown is not critical for QRF-CBGS as none of the original data has been modified, and the algorithm can continue by just switching to QRF-H for the ill-conditioned block \(A_j\).
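As a concrete illustration of the per-block steps just described (Gram matrix, Cholesky factor, triangular solve, and block Gram-Schmidt update) and of the fall-back on a Cholesky breakdown, the following NumPy sketch mimics one pass of QRF-CBGS. It is only an illustration under our own naming: the line numbering does not correspond to Algorithm 1, and the hybrid CPU-GPU mapping of the actual implementation is only hinted at in the comments.

```python
import numpy as np

def qrf_cbgs(A, b):
    """Single-pass QRF-CBGS sketch: returns (Q, R) with Q orthonormal (m x n, m >> n).
    Assumes the block size b divides n; the mapping to BLAS kernels is noted in comments."""
    m, n = A.shape
    k = n // b
    Q = A.copy()
    R = np.zeros((n, n))
    for j in range(k):
        cols = slice(j * b, (j + 1) * b)
        Aj = Q[:, cols]
        G = Aj.T @ Aj                                     # Gram matrix (xSYRK)
        try:
            Rjj = np.linalg.cholesky(G).T                 # Cholesky factor (xPOTRF)
            Q[:, cols] = np.linalg.solve(Rjj.T, Aj.T).T   # Aj * Rjj^{-1} (xTRSM)
        except np.linalg.LinAlgError:
            # Cholesky breakdown: fall back to Householder QR for this block only
            Q[:, cols], Rjj = np.linalg.qr(Aj)
        R[cols, cols] = Rjj
        if j + 1 < k:
            rest = slice((j + 1) * b, n)
            Rjrest = Q[:, cols].T @ Q[:, rest]            # projections onto the new block (xGEMM)
            Q[:, rest] -= Q[:, cols] @ Rjrest             # block Gram-Schmidt update (xGEMM)
            R[cols, rest] = Rjrest
    return Q, R
```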

The QRF-CBGS algorithm is a row variant of BGS where the vectors are updated several times during the orthogonalization process, similarly to what is done in the modified Gram-Schmidt method for single vectors. However, the orthogonality level attained by QRF-CBGS, measured as

$$\begin{aligned} \Vert Q^TQ - I\Vert , \end{aligned}$$
(3)

(where I denotes the identity matrix) will often fall short of the maximum accuracy attainable in the given floating-point representation. This effect is due to both the in-block QRF-C and the Gram-Schmidt orthogonalization between blocks.

The classical solution to the lack of orthogonality in Gram-Schmidt is to re-apply the procedure, yielding a re-orthogonalization. Naturally, this technique is also a solution for the possible lack of orthogonality in the QRF-C procedure. Therefore, it seems logical to perform QRF-CBGS twice, applying re-orthogonalization to both QRF-C and BGS, albeit in an interlaced manner. Algorithm 2 (QRF-CBGS2) is a simple solution that applies two steps of QRF-CBGS. Note that an efficient implementation should perform the update of R in line 3 as the algorithm advances, without storing the full matrices \(R_1\) and \(R_2\).

[Algorithm 2: QRF-CBGS2]
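In terms of the earlier single-pass sketch, the two-pass scheme can be expressed as follows; for brevity, this illustration merges the two triangular factors at the end instead of updating R as the algorithm advances.

```python
def qrf_cbgs2(A, b):
    """Two-pass QRF-CBGS2 sketch: re-orthogonalize the output of the first pass
    and merge the two triangular factors."""
    Q1, R1 = qrf_cbgs(A, b)   # first pass (see the previous sketch)
    Q2, R2 = qrf_cbgs(Q1, b)  # second pass over the nearly orthonormal columns
    return Q2, R2 @ R1
```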

An additional advantage of the row variant of BGS is that this algorithm can overlap the orthogonalization inside a block with the orthogonalization of the subsequent blocks. This look-ahead technique can be introduced into the algorithm by a careful reordering of the operations. Concretely, Algorithm 3 is equivalent to QRF-CBGS but, in the new version, the Cholesky decomposition in line 8 can be performed simultaneously with the trailing update in lines 9–13.

[Algorithm 3: QRF-CBGS with look-ahead]

The introduction of (static) look-ahead is of paramount importance to obtain an efficient hybrid CPU-GPU implementation. Concretely, in this version the Cholesky factorization can be computed on the CPU while the GPU proceeds with the trailing updates with respect to the factorization of the previous panel(s). Furthermore, this technique also allows the necessary memory transfers between CPU and GPU to be overlapped with computations on the GPU. The look-ahead strategy also reduces the impact of any Cholesky breakdown, as the QRF-H of the offending block can be performed while updating the rest of the matrix.
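The dependency structure exploited by the look-ahead can be illustrated with a small reordering of the previous sketch: once the next block column has been updated with respect to the current panel, its Gram matrix and Cholesky factorization (the CPU work in the hybrid version) can proceed concurrently with the trailing update of the remaining blocks (the GPU work). The sketch below uses a plain Python thread purely to expose that structure; it omits the breakdown fallback and is, of course, not the hybrid CPU-GPU implementation itself.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def qrf_cbgs_lookahead(A, b):
    """QRF-CBGS sketch with look-ahead: the panel factorization of block j+1
    is overlapped with the trailing update of the remaining blocks w.r.t. block j."""
    m, n = A.shape
    k = n // b
    Q = A.copy()
    R = np.zeros((n, n))

    def panel(j):
        # Gram matrix + Cholesky + triangular solve for block column j (CPU work)
        cols = slice(j * b, (j + 1) * b)
        Aj = Q[:, cols]
        Rjj = np.linalg.cholesky(Aj.T @ Aj).T
        Q[:, cols] = np.linalg.solve(Rjj.T, Aj.T).T
        R[cols, cols] = Rjj

    with ThreadPoolExecutor(max_workers=1) as pool:
        panel(0)
        for j in range(k):
            cols = slice(j * b, (j + 1) * b)
            future = None
            if j + 1 < k:
                nxt = slice((j + 1) * b, (j + 2) * b)
                # update the next block column first ...
                R[cols, nxt] = Q[:, cols].T @ Q[:, nxt]
                Q[:, nxt] -= Q[:, cols] @ R[cols, nxt]
                # ... so its panel factorization can proceed in a separate thread
                future = pool.submit(panel, j + 1)
            if j + 2 < k:
                rest = slice((j + 2) * b, n)
                # trailing update of the remaining blocks, overlapped with panel(j+1)
                R[cols, rest] = Q[:, cols].T @ Q[:, rest]
                Q[:, rest] -= Q[:, cols] @ R[cols, rest]
            if future is not None:
                future.result()
    return Q, R
```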

The implementation of both variants of the QRF-CBGS algorithm on an NVIDIA GPU is straightforward, as all the operations are available in cuBLAS, the GPU-specific implementation of the BLAS. Therefore, compared with other algorithms, an appealing property of QRF-CBGS2 is that no custom GPU kernels are required, as all major building blocks are available as part of cuBLAS. (Note that xPOTRF is computed on the CPU.) This greatly simplifies the implementation and favours high performance on present as well as future GPU architectures.

Table 1. Numerical comparison of QRF-H and QRF-CBGS2

3 Numerical Tests

In this section, we assess the numerical behaviour of the new algorithm QRF-CBGS2 by running some numerical experiments with TS matrices specifically designed to produce a breakdown of the Cholesky decomposition. This stress test is based on [1] and should allow a comparison of the reliability of QRF-CBGS2 with other alternative methods, such as those in [1], whose implementations do not seem to be publicly available.

The test matrices are derived from the QR factorization of an \(m \times n = 1000\times 200\) matrix A with entries following a uniform random distribution in the interval [0, 1). We then set \(R_{100,100}=\rho \) in the upper triangular factor R, and multiply back Q and R to form \(\tilde{A}\). The parameter \(\rho \) controls the condition number of the assembled matrix, which is given by \(\kappa (\tilde{A})\approx 1/\rho \); by varying \(\rho \) from \(10^{-1}\) down to \(10^{-15}\), we obtain matrices with a condition number of up to \(10^{15}\).
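This construction is easy to reproduce; a minimal NumPy sketch follows, assuming the indices in the text are 1-based so that \(R_{100,100}\) maps to position [99, 99] of a 0-based array.

```python
import numpy as np

def make_test_matrix(m=1000, n=200, rho=1e-8, seed=0):
    """Build a test matrix with controlled conditioning, kappa ~ 1/rho."""
    rng = np.random.default_rng(seed)
    A = rng.uniform(0.0, 1.0, size=(m, n))   # entries uniform in [0, 1)
    Q, R = np.linalg.qr(A)                   # compact QR of the random matrix
    R[99, 99] = rho                          # shrink one diagonal entry of R
    return Q @ R                             # assembled matrix \tilde{A}
```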

Table 1 compares the orthogonality loss

$$\begin{aligned} \Vert Q^TQ-I\Vert _F \end{aligned}$$
(4)

and relative residual,

$$\begin{aligned} \frac{\Vert A-QR\Vert _F}{\Vert A\Vert _F} \end{aligned}$$
(5)

of the QR factorizations computed by QRF-H and QRF-CBGS2 for matrices with different values of \(\rho \). All the tests were performed on an Intel Xeon E5-2630 v3 processor using IEEE double-precision arithmetic. The QRF-H implementation corresponds to that in Intel MKL 2017, and the block size of QRF-CBGS2 was set to \(b=16\).
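Both metrics are straightforward to evaluate; a minimal sketch, assuming Q and R have been obtained from any of the factorizations under study:

```python
import numpy as np

def orthogonality_loss(Q):
    """Frobenius-norm deviation of Q^T Q from the identity, Eq. (4)."""
    n = Q.shape[1]
    return np.linalg.norm(Q.T @ Q - np.eye(n), 'fro')

def relative_residual(A, Q, R):
    """Relative factorization residual ||A - Q R||_F / ||A||_F, Eq. (5)."""
    return np.linalg.norm(A - Q @ R, 'fro') / np.linalg.norm(A, 'fro')
```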

Table 1 shows that QRF-CBGS2 offers orthogonality and relative residuals quite similar to those of QRF-H. This excellent numerical behaviour is partially due to the re-orthogonalization approach of QRF-CBGS2 and the Cholesky breakdown detection mechanism. Specifically, in this experiment the fail-safe detection on the Cholesky factorization triggers only once for each of the matrices with \(\rho \le 10^{-8}\). This means that, among the \(k = \lceil n/b \rceil = \lceil 200/16 \rceil = 13\) Cholesky decompositions that have to be computed for each matrix, only one had to be recomputed using QRF-H. We expect that, for real applications, the probability of a Cholesky breakdown will be even smaller.

4 Performance Evaluation

Hardware Setup. In this section we compare the performance of QRF-CBGS2 and QRF-H on two distinct platforms equipped with two representative GPUs: a high-end NVIDIA Tesla P100 (Pascal) and a cost-effective NVIDIA GeForce Titan X (Pascal). The P100 is paired with two Intel Xeon E5-2620 v4 processors (8 + 8 cores), while the Titan X is paired with an Intel Core i7-3770K CPU (4 cores).

Software Setup. All codes are compiled with version 8.0 of the CUDA development environment. The optimized implementations of BLAS and LAPACK are those provided by NVIDIA cuBLAS 8.0 for the GPU and Intel MKL 2017 for the CPU. Hyper-threading is disabled on the Intel architectures, as suggested by the MKL library documentation. To reduce the variance of the execution times, the number of OpenMP and MKL threads is set to the number of physical cores and each thread is mapped statically to one core. The GNU C compiler is the default version provided by the Linux operating system installed on those computers. Nevertheless, the optimizations made by these compilers are not relevant for our study, because all the performance-sensitive code is implemented inside the cuBLAS and MKL libraries. To avoid noise caused by the activity of other processes on the systems, the execution times reported are the median values from 10 executions. Those times are similar, with the exception of the first execution in a batch, which is significantly longer due to the dynamic loading of libraries.

Input Data. The input matrices are created following the same procedure described for the numerical tests in Sect. 3, with random elements in the range [0, 1) and \(\rho = 1\), so that the condition number is kept small. For brevity, we report results in (IEEE) double precision only for the P100; the comparison using single precision on this platform offers similar conclusions. Conversely, as the Titan X offers very low performance in double precision, we only employ single precision on that platform.

Fig. 1. Performance comparison of QRF-CBGS2 with look-ahead over QRF-H for TS matrices on two Intel Xeon E5-2620 v4 CPUs and an NVIDIA P100 GPU (double precision).

QRF Implementations. The baseline GPU implementation of QRF-H is provided by xGEQRF (QR factorization via Householder reflectors) and xORGQR (assembly of the \(m \times n\) orthonormal matrix) available in the MAGMA library (version 2.2.0). Among the three variants of xGEQRF in MAGMA, we choose the first one, as this is the only one with a corresponding implementation of xORGQR. To allow a fair comparison between QRF-H and QRF-CBGS2, as QRF-CBGS2 always produces the orthonormal matrix Q, the execution times of QRF-H reported next correspond to the combined runtimes of xGEQRF and xORGQR from MAGMA. We note that the implementation of xGEQRF in MAGMA employs look-ahead, while xORGQR re-utilizes some intermediate factors computed in xGEQRF (in particular, the triangular factors for the blocked Householder reflectors) and is much faster than xGEQRF. The block size in QRF-CBGS2 was set to 1,024, which we found to be optimal for both platforms.

Fig. 2. Performance comparison of QRF-CBGS2 with look-ahead over QRF-H for TS matrices on an Intel Core i7-3770K and an NVIDIA Titan X GPU (single precision).

Evaluation. Figures 1 and 2 compare the performance of QRF-H and QRF-CBGS2 with look-ahead on the P100 and the Titan X, respectively. The y-axis shows the ratio of the execution time of QRF-H to that of QRF-CBGS2. Each line corresponds to a group of matrices with the same number of rows m, while the number of columns n varies as specified on the x-axis of the plot. The plots show a small performance gap in double precision for matrices with more than 3,000 columns, and a larger gap in single precision for more than 7,000 columns. Both gaps are due to an increase of the block size internally employed by MAGMA (from 64 to 128 in double precision and from 128 to 256 in single precision). As expected, the performance of QRF-CBGS2 is much higher for TS matrices (up to 6 times), but this advantage diminishes as the difference between the number of columns and rows narrows. However, the performance drop is considerably less sharp on the Titan X platform since, in part, the relatively slow CPU on that machine drags down the QRF-H performance while QRF-CBGS2 benefits from the reduced communication. Quite unexpectedly, this effect allows QRF-CBGS2 to outperform QRF-H on the Titan X even for square matrices, as shown in Fig. 3.

Fig. 3. Performance comparison of QRF-CBGS2 with look-ahead over QRF-H for square matrices on an Intel Core i7-3770K and an NVIDIA Titan X GPU (single precision).

5 Conclusions

The new algorithm presented in this work, QRF-CBGS2, is a variant of block Gram-Schmidt that uses the Cholesky decomposition to orthogonalize the columns inside a block. To obtain a satisfactory level of orthogonality, our proposal leverages re-orthogonalization and handles Cholesky breakdowns by falling back to the standard QR factorization via Householder reflectors (QRF-H) for ill-conditioned blocks.

The QRF-CBGS2 algorithm computes the triangular factor by rows, allowing for an effective look-ahead technique that computes the Cholesky decomposition of the “current” block while Gram-Schmidt is applied to re-orthogonalize the remaining columns of the matrix with respect to “previous” blocks. Furthermore, this look-ahead alleviates the extra cost in case of a Cholesky breakdown.

The variant of QRF-CBGS2 with look-ahead can be efficiently implemented on a hybrid CPU-GPU system, with the CPU in charge of the Cholesky decomposition while the rest of the operations (all Level-3 BLAS) are performed on the GPU. One advantage of this approach is the reduced volume of communication between CPU and GPU compared with the blocked Householder QR implementations available in MAGMA. An additional advantage of QRF-CBGS2 is that its implementation is much simpler than that of other methods specifically designed for tall-skinny matrices, as no custom GPU kernels are required. This also favours portability to new GPU architectures or even to different types of accelerators.

The performance of the new approach is very competitive for tall-skinny matrices, and even outperforms MAGMA for square matrices when the memory transfers and CPU computations are the bottleneck of the Householder-based QR.

The stability of QRF-CBGS2 has been analyzed with matrices that explicitly enforce Cholesky breakdowns. While this experiment shows that QRF-CBGS2 offers a level of orthogonality and relative error similar to those of the numerically stable QRF-H, a formal numerical analysis may help to fully understand the behaviour of the algorithm and to devise future strategies for improving the accuracy and reducing the re-orthogonalization cost.