1 Introduction

A hyperspectral image (HSI) contains far more information than a regular image because of its large number of spectral bands, and this spectral information can be regarded as multiview. HSI has therefore been widely applied in remote sensing monitoring, where its high spectral resolution helps distinguish different materials. Low-rank hyperspectral image recovery (LRHSIR) [1] is a popular technique in image processing that can discover the underlying structure of given observations. The underlying insight is that, in practice, even very high-dimensional observations often lie in a low-dimensional subspace. As a theoretical foundation in computer vision, the effectiveness of LRHSIR has been demonstrated on several fundamental tasks [2]. Consequently, recovering low-rank and sparse matrices from corrupted observations has attracted great attention in computer vision, machine learning and statistics, and many computer vision tasks can be formulated as low-dimensional linear models. Typical examples are Robust Principal Component Analysis (RPCA) [3] and Low Rank Representation (LRR) [4]. These two models are widely used for representing images in computer vision tasks such as subspace segmentation [5], image classification [6] and subspace clustering [7]. The advantage of these low-rank algorithms is that data points from different categories are treated as samples from a union of multiple low-rank subspaces, which provides a good representation of the data [8].

Among state-of-the-art models, RPCA is one of the most widely used. Candes et al. [3] proposed RPCA, which recovers a subspace structure from noisy data by decomposing the data matrix into a low-rank matrix and a sparse matrix through the nuclear norm. Singular value decomposition (SVD) is the most widely used technique when all components of the data matrix are observed. To avoid this expensive decomposition, fixed-rank strategies have been proposed to obtain the low-rank representation. For example, Liu et al. [9] proposed the fixed-rank representation (FRR) as a unified framework for unsupervised visual learning, which can reveal the structure of multiple subspaces in closed form when the data are noiseless. Moreover, Wen et al. [10] constructed a nonlinear successive over-relaxation (SOR) algorithm and showed that it is several times faster than many other methods.

Recently, some researchers have proposed another strategy that introduces bilinear factorization to accelerate the computation, benefiting from fast numerical methods for optimization. The calculation of a bilinear factorization of a matrix has therefore become a fundamental operation in many computer vision applications [11]. The factorization approach to low-rank subspace estimation minimizes a loss function between an observed measurement matrix and a bilinear factorization. Unfortunately, in the presence of missing data, the bilinear factorization problem becomes NP-hard [12]. Thus, low-rank matrix factorization with missing data has received significant attention in computer vision research. Many researchers have focused on initialization strategies or on algorithms that are robust to initialization. For example, Aguiar et al. [13] proposed an optimal algorithm in the absence of noise when the missing data follow a Young-diagram pattern. For scenarios that exhibit band patterns, Buchanan et al. [14] showed that alternated minimization algorithms are subject to flattening and proposed a Newton method that optimizes the bilinear factors jointly.

However, the performance of RPCA can degrade when observations are insufficient, and missing data are common in practice. Hence, besides properly modeling the low-rank structure, handling missing data is central to recovery performance. Some researchers [15, 16] therefore explored adding an indicator matrix or a weight matrix to mark the observed entries of the data matrix. In addition, Cabral et al. [17] proposed a unified approach to bilinear factorization and nuclear norm regularization that inherits the benefits of both. Moreover, Lin et al. [18] proposed a new algorithm for robust matrix factorization (RMF) based on the Majorization Minimization (MM) technique. These methods have also been extended to handle outliers, since the L2 norm is optimal only for Gaussian noise and is fragile to outliers. For robustness, Ke and Kanade [19] suggested replacing the least-squares loss with the L1 norm, minimized by alternated linear programming. Guo et al. [20] proposed a method for recovering the low-rank matrix with robust outlier estimation, termed ROUTE, in a unified manner.

In this paper, we present a new method that combines bilinear factorization with recursive sample factoring. Inspired by the theoretical results in [21], many nuclear-norm-regularized problems can be optimized through a bilinear factorization by using the variational definition of the nuclear norm. Furthermore, data points are weighted by introducing a sample scaling factor in BF-RSF. The sample scaling factor is a cosine similarity metric accounting for the angle between each data point and the principal component vector of the low-rank matrix in feature space. In our model, the sample scaling factor not only measures the significance of data points but also imposes a restriction onto the training dataset iteratively to reduce the effect of noise. With this modeling, BF-RSF can learn a better low-rank structure of the clean data, especially in a heavily noisy scenario, while noisy data are suppressed.

We summarize our key contributions in the following:

(1) The advantage of bilinear factorization lies in that it is a fast numerical method for optimization. Meanwhile, the significance of each data point is measured by the sample scaling factor, which is learnt from the relationship between the data point and the principal component vector of the low-rank matrix. Therefore, noisy data points can be detected and suppressed.

(2) A better low-rank structure of the clean data can be learnt through BF-RSF by imposing the sample scaling factor onto the training dataset iteratively.

(3) Extensive experimental results demonstrate that our BF-RSF method significantly improves the performance of image clustering, especially in a heavily noisy scenario.

The remainder of this paper is structured as follows. We present our proposed method of bilinear factorization via recursive sample factoring (BF-RSF) in Sect. 2. Experimental results are presented in Sect. 3. Finally, we draw conclusions in Sect. 4.

2 Methodology

2.1 Problem Formulation

In this subsection, we present a brief review of the formulation of our BF-RSF model. Formally, low-rank matrix recovery with missing data can be directly or indirectly formulated as

$$\begin{aligned} \begin{array}{*{20}{c}} {\mathop {\min }\limits _Z f\left( \mathbf {X - Z} \right) + \lambda {\left\| \mathbf {Z} \right\| _*}}\\ {s.t.}{\mathrm{{rank}}\left( \mathbf {Z} \right) = r} \end{array} \end{aligned}$$
(1)

where \(f\left( \cdot \right) \) denotes a loss function and \({\mathbf {X}} = \left[ {{{\mathbf {x}}_1},{{\mathbf {x}}_2}, \ldots {{\mathbf {x}}_\mathrm{{n}}}} \right] \in {\mathfrak {R}^{\mathrm{{d}} \times \mathrm{{n}}}}\) is the data matrix. \(\mathbf{Z}\) is the low rank matrix. \(\lambda \) is a trade-off parameter between the loss function and the low rank regularization induced by the nuclear norm.

According to [22], \(\mathbf{Z}\) can be decomposed as \({\mathbf{Z}} = \mathbf {U_{d \times r}}\mathbf {V_{n \times r}^T}\), where r is the rank of \(\mathbf{Z}\). As shown in [23], the nuclear norm admits the variational form \(\mathop {\mathrm {min}}\limits _{Z = U{V^T}} \frac{1}{2}\left( {\left\| \mathbf {U} \right\| _F^2 + \left\| \mathbf {V} \right\| _F^2} \right) \). It has also been shown that the loss function can be written with a general \({\left\| \cdot \right\| _{{l_p}}}\) norm. Therefore, Cabral et al. [17] unified bilinear factorization and the nuclear norm, and Eq. 1 can be rewritten as follows:

$$\begin{aligned} \begin{array}{*{20}{c}} {\mathop {\min }\limits _{Z,U,V} {\left\| \mathbf {W \odot \left( {X - Z} \right) } \right\| _{{l_p}}} + \frac{\lambda }{2}\left( {\left\| \mathbf {U} \right\| _F^2 + \left\| \mathbf {V} \right\| _F^2} \right) }\\ {s.t.}{{\mathbf{Z}} = \mathbf{U}\mathbf {V^T}} \end{array} \end{aligned}$$
(2)

where \({\mathbf {X}} \in {\mathfrak {R}^{\mathrm{{d}} \times \mathrm{{n}}}}\) is the data matrix and \({\mathbf {Z}} \in {\mathfrak {R}^{\mathrm{{d}} \times \mathrm{{n}}}}\) is the low-rank matrix. \(\left\| \cdot \right\| \) is a matrix norm. \(\mathbf {W}\) is a weight matrix with the same size as \(\mathbf {X}\): an entry of \(\mathbf {W}\) is 0 if the component at the same position in \(\mathbf {X}\) is missing, and 1 otherwise. The operator \( \odot \) is the Hadamard (element-wise) product. Different types of \({l_p}\) norm can be chosen according to the kind of noise. For instance, it is reasonable to choose the \({l_2}\) norm under a Gaussian noise assumption, or more precisely the Frobenius norm \({\left\| \cdot \right\| _F}\) in matrix form. However, the \({l_2}\) norm is no longer appropriate in the presence of outliers, for which the \({l_1}\) norm has long been recommended [24].
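The variational identity behind this reformulation can be verified numerically; the following is a minimal sketch (assuming NumPy, with arbitrary illustrative matrix sizes and rank), where taking \(\mathbf {U} = \mathbf {U_s}\sqrt{\mathbf {\Sigma }}\) and \(\mathbf {V} = \mathbf {V_s}\sqrt{\mathbf {\Sigma }}\) from the SVD attains the minimum.

import numpy as np
# Check: ||Z||_* = min_{Z = U V^T} (||U||_F^2 + ||V||_F^2) / 2
rng = np.random.default_rng(0)
d, n, r = 50, 40, 5
Z = rng.standard_normal((d, r)) @ rng.standard_normal((r, n))    # rank-r matrix
U_s, s, Vt_s = np.linalg.svd(Z, full_matrices=False)
nuclear_norm = s.sum()
U = U_s[:, :r] * np.sqrt(s[:r])                                  # d x r factor
V = Vt_s[:r, :].T * np.sqrt(s[:r])                               # n x r factor
variational = 0.5 * (np.linalg.norm(U, 'fro')**2 + np.linalg.norm(V, 'fro')**2)
print(nuclear_norm, variational)   # the two values agree up to numerical error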

Different from traditional methods, which impose a weight matrix directly, our BF-RSF model seeks another way to measure the significance of each data point. We expect clean data points to receive high weights while noisy ones receive low weights. To this end, a cosine similarity metric is developed. As shown in Fig. 1, \({U_1}\) is the principal component vector, and \({x_i}\) and \({x_j}\) are two data points. It can be seen that the angle between \({x_j}\) and \({U_1}\) is much larger than that between \({x_i}\) and \({U_1}\). Evidently, \({x_i}\) is a clean data point, while \({x_j}\) is an outlier. This indicates that we can use \({d_i} = \left| {\cos {\theta _i}} \right| \) to measure the significance of the data points. The choice of \({U_1}\) is as follows. As claimed in [25], any \(\mathbf {Z}\) can be decomposed as \(\mathbf {U\Sigma {V^T}}\), where \(\mathbf{U}\) and \(\mathbf{V}\) contain the left and right singular vectors. In our model, we choose \({U_1}\) as the column vector of \(\mathbf{U}\) corresponding to the largest singular value in \(\mathbf{\Sigma }\). This principal component vector is also illustrated geometrically in Fig. 1.

Fig. 1. Illustration of our model.

In summary, \({d_i}\) can be computed through

$$\begin{aligned} \begin{array}{*{20}{c}} {{d_i} = \left| {\cos {\theta _i}} \right| + \varepsilon = \left| {\frac{{x_i^T{U_1}}}{{\left\| {{x_i}} \right\| \;\left\| {{U_1}} \right\| }}} \right| + \varepsilon }\\ \end{array} \end{aligned}$$
(3)

where \(\varepsilon \) is a small constant that prevents \({d_i}\) from being 0.
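In implementation terms, the factors of Eq. 3 and the diagonal matrix \(\mathbf {D}\) introduced below can be computed as in the following sketch (assuming NumPy; the function name and the use of the current low-rank estimate \(\mathbf {Z}\) are our own illustrative choices).

import numpy as np
def sample_scaling_factors(X, Z, eps=1e-6):
    # U1: leading left singular vector of the current low-rank estimate Z
    U1 = np.linalg.svd(Z, full_matrices=False)[0][:, 0]
    # Cosine of the angle between each column x_i of X and U1 (Eq. 3)
    cos = (X.T @ U1) / (np.linalg.norm(X, axis=0) * np.linalg.norm(U1) + 1e-12)
    d = np.abs(cos) + eps
    return np.diag(d)   # diagonal sample scaling matrix D
# Usage: D = sample_scaling_factors(X, Z); X_hat = X @ D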

With the factor d, we can impose this penalty factor onto the data to alleviate the noise effect before feeding them into the low-rank model. Let \({{\mathbf {D}}} = \mathrm{diag}\left( {d_1}, {d_2}, \ldots , {d_n} \right) \), and the low-rank matrix is obtained by solving the following optimization:

$$\begin{aligned} \begin{array}{*{20}{c}} {\mathop {\min }\limits _{Z,U,V} {\left\| {\left( \mathbf {X - Z} \right) \mathbf {D}} \right\| _1} + \frac{\lambda }{2}\left( {\left\| \mathbf {U} \right\| _F^2 + \left\| \mathbf {V} \right\| _F^2} \right) }\\ {s.t.}{{\mathbf {ZD}} = {\mathbf {U}}\mathbf {V^T},\;\mathbf {\hat{X}} = \mathbf {X}{\mathbf {D}},\mathrm{{\;}}\mathbf {\hat{Z}} = \mathbf {Z}{\mathbf {D}}} \end{array} \end{aligned}$$
(4)

where \({\mathbf {X}} \in {\mathfrak {R}^{\mathrm{{d}} \times \mathrm{{n}}}}\) is the data matrix, \(\mathbf{Z}\) is a low-rank structure matrix, and \({\mathbf {U}} \in {\mathfrak {R}^{\mathrm{{d}} \times \mathrm{{r}}}}\), \({\mathbf {V}} \in {\mathfrak {R}^{\mathrm{{n}} \times \mathrm{{r}}}}\).

The factor d thus suits our goal: as shown in Fig. 1, the closer the angle is to \(\mathrm{{\pi }}/2\), the higher the probability that the data point is noisy, and the lower the factor assigned to it. Therefore, the effect of noisy data points is suppressed, while the effect of data points that are close to the principal component vector is enhanced.

The characteristics of BF-RSF are summarized as follows:

(1) In our BF-RSF model, the scaling factor is learnt from the relationship between the principal component vector and each data point through the cosine similarity metric. The sample scaling factor matrix \(\mathbf{D}\) measures the significance of data points and reduces the impact of noisy data points in modeling, with the sample scaling factor being imposed onto the training dataset iteratively.

(2) With this model, our BF-RSF can learn a better low-rank structure of the clean data, especially in a heavily noisy scenario, as the effect of noisy data points in modeling is suppressed. When \(\mathbf{D}\) is imposed on \(\mathbf{X}\), the low-rank solution \(\mathbf{Z}\) tends to preserve more of the low-rank structure of data points with larger cosine similarity. Therefore a better low-rank structure of the clean data is achieved in our model.

(3) \(\mathbf{Z}\) in Eq. 4 represents the low-dimensional structure of the new dataset \(\mathbf {\hat{X}}\). In this way, we propose a low-rank matrix recovery framework with a sample scaling factor; as shown in Sect. 3, its performance compares favorably with the other algorithms.

2.2 Optimization

In this subsection, we introduce an optimization algorithm to solve the problem defined in Eq. 4. The augmented Lagrangian function of the problem in Eq. 4 is

$$\begin{aligned} \begin{array}{*{20}{c}} {\begin{array}{l} L\left( \mathbf {Z,U,V,Y} \right) = {\left\| {\left( \mathbf {X - Z} \right) \mathbf {D}} \right\| _1} + \frac{\lambda }{2}\left( {\left\| \mathbf {U} \right\| _F^2 + \left\| \mathbf {V} \right\| _F^2} \right) \\ \qquad + \left\langle {\mathbf {Y}, \mathbf {\hat{Z}} - \mathbf {U}\mathbf {V^T}} \right\rangle + \frac{\mu }{2}\left\| \mathbf {\hat{Z} - U{V^T}} \right\| _F^2 \end{array}}\\ \end{array} \end{aligned}$$
(5)

where \(\mathbf{Y}\) is a Lagrange multiplier, \(\mathrm{{\mu }} > 0\) is a penalty parameter, and \({\left\| \cdot \right\| _F}\) denotes the Frobenius norm of a matrix.

For \({\mathbf {U}} \in {\mathfrak {R}^{\mathrm{{d}} \times \mathrm{{r}}}}\) and \({\mathbf {V}} \in {\mathfrak {R}^{\mathrm{{n}} \times \mathrm{{r}}}}\), the solution is obtained by equating the derivatives of Eq. 5 with respect to \(\mathbf{U}\) and \(\mathbf{V}\) to 0. Then, \(\mathbf{U}\) and \(\mathbf{V}\) are obtained as follows:

$$\begin{aligned} \begin{array}{*{20}{c}} {{\mathbf {U}} = \left( {\mathbf {Y} + \mu \mathbf {\hat{Z}}} \right) \mathbf {V}{\left( {\lambda \mathbf {I_r} + \mu \left( \mathbf {{V^T}V} \right) } \right) ^{ - 1}}}\\ \end{array} \end{aligned}$$
(6)
$$\begin{aligned} \begin{array}{*{20}{c}} {{\mathbf {V}} = {\left( {\mathbf {Y} + \mu \mathbf {\hat{Z}}} \right) ^T}\mathbf {U}{\left( {\lambda \mathbf {I_r} + \mu \left( {\mathbf {U^T}{} \mathbf{U}} \right) } \right) ^{ - 1}}}\\ \end{array} \end{aligned}$$
(7)
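As a brief check (our own derivation, not spelled out in the original), Eq. 6 follows from setting the gradient of Eq. 5 with respect to \(\mathbf{U}\) to zero; the derivation of Eq. 7 for \(\mathbf{V}\) is symmetric:

$$\begin{aligned} \frac{\partial L}{\partial \mathbf {U}} = \lambda \mathbf {U} - \mathbf {Y}\mathbf {V} - \mu \left( \mathbf {\hat{Z}} - \mathbf {U}\mathbf {V^T} \right) \mathbf {V} = 0 \;\Rightarrow \; \mathbf {U}\left( \lambda \mathbf {I_r} + \mu \mathbf {V^T}\mathbf {V} \right) = \left( \mathbf {Y} + \mu \mathbf {\hat{Z}} \right) \mathbf {V} \end{aligned}$$

which gives Eq. 6 after right-multiplying by \({\left( \lambda \mathbf {I_r} + \mu \mathbf {V^T}\mathbf {V} \right) ^{-1}}\).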

For known \(\mathbf{U}\) and \(\mathbf{V}\), introducing a relaxation variable \(\mathbf{J}\) to denote \(\left( {\mathbf {\hat{X}} - \mathbf {\hat{Z}}} \right) \), \(\mathbf{J}\) can be updated as follows:

$$\begin{aligned} \begin{array}{*{20}{c}} {{\mathbf {J}} = \mathop {\mathrm{{argmin}}}\limits _J \frac{1}{\mu }{\left\| \mathbf {J} \right\| _1} + \frac{1}{2}\left\| \mathbf {J - \left( {\hat{X} - U{V^T} + \frac{Y}{\mu }} \right) } \right\| _F^2}\\ \end{array} \end{aligned}$$
(8)
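Eq. 8 is a standard \({l_1}\) proximal problem; as a reminder (a standard result, not restated in the original), its closed-form solution is element-wise soft-thresholding with threshold \(1/\mu \):

$$\begin{aligned} \mathbf {J} = \mathrm {sign}\left( \mathbf {A} \right) \odot \max \left( \left| \mathbf {A} \right| - \frac{1}{\mu }, 0 \right) , \quad \mathbf {A} = \mathbf {\hat{X}} - \mathbf {U}\mathbf {V^T} + \frac{\mathbf {Y}}{\mu } \end{aligned}$$

where the sign, absolute value and maximum are applied element-wise.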

Finally, \(\mathbf{Z}\) can be obtained via \({\mathbf {Z}} = {\mathbf {U}}\mathbf {V^T}\mathbf {D^{ - 1}}\) and the whole algorithm is presented in Algorithm 1.

Algorithm 1. Bilinear factorization via recursive sample factoring (BF-RSF).
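Since Algorithm 1 appears only as a figure, the following is a minimal, illustrative re-sketch of the iteration described above (assuming NumPy; the initialization, the penalty update \(\mu \leftarrow \rho \mu \) and the fixed iteration count are our own assumptions rather than the exact settings of Algorithm 1).

import numpy as np
def soft_threshold(A, tau):
    # Element-wise soft-thresholding: proximal operator of the l1 norm
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)
def bf_rsf(X, r, lam=0.1, mu=1e-2, rho=1.1, n_iter=100, eps=1e-6):
    # Illustrative sketch of the BF-RSF iteration (Eqs. 3-8)
    d, n = X.shape
    rng = np.random.default_rng(0)
    U = rng.standard_normal((d, r))
    V = rng.standard_normal((n, r))
    Y = np.zeros((d, n))
    Z = U @ V.T
    for _ in range(n_iter):
        # Sample scaling factors (Eq. 3) from the current low-rank estimate
        U1 = np.linalg.svd(Z, full_matrices=False)[0][:, 0]
        dvec = np.abs(X.T @ U1) / (np.linalg.norm(X, axis=0) * np.linalg.norm(U1) + 1e-12) + eps
        X_hat = X * dvec                      # X_hat = X D (column-wise scaling)
        # J-update (Eq. 8): soft-thresholding of the scaled residual
        J = soft_threshold(X_hat - U @ V.T + Y / mu, 1.0 / mu)
        Z_hat = X_hat - J                     # Z_hat = Z D
        # Closed-form U and V updates (Eqs. 6 and 7)
        U = (Y + mu * Z_hat) @ V @ np.linalg.inv(lam * np.eye(r) + mu * (V.T @ V))
        V = (Y + mu * Z_hat).T @ U @ np.linalg.inv(lam * np.eye(r) + mu * (U.T @ U))
        # Lagrange multiplier and penalty updates
        Y = Y + mu * (Z_hat - U @ V.T)
        mu = rho * mu
        # Recover Z on the original scale: Z = U V^T D^{-1}
        Z = (U @ V.T) / dvec
    return Z, U, V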

3 Experiments

In this section, we evaluate the performance of our proposed BF-RSF method in image clustering [7], in comparison with several state-of-the-art methods including Unifying [17], Practical [16], MoG [24] and ROUTE-LRMR [20].

Fig. 2. The 2D images of SalinasA. (Color figure online)

We perform the experiments on the hyperspectral image dataset SalinasA-scene. Salinas was collected by the 224-band AVIRIS sensor over Salinas Valley, California, and is characterized by high spatial resolution (3.7-m pixels). The area covered comprises 512 lines by 217 samples. We discarded the 20 water-absorption bands, namely bands [108–112], [154–167] and 224. This image was available only as at-sensor radiance data. It includes vegetables, bare soils and vineyard fields, and the Salinas ground truth contains 16 classes. In our experiment, we use SalinasA, a commonly used small sub-scene of the Salinas image. It comprises 86 * 83 pixels located within the same scene at [samples, lines] = [591–676, 158–240] and includes 6 classes.

Visualized Analysis on SalinasA. To show the advantage of our proposed method, we visualize the feature representation of SalinasA in 2D space. Figure 2 shows the two-dimensional representations obtained from the dataset; different classes are denoted by different colors, and there are 6 clusters in the SalinasA dataset.

Fig. 3. Examples of the original and corrupted images under different densities (Den.) from the SalinasA dataset.

Data Preparation. The SalinasA image is reshaped to a resolution of \(7138\,\times \,204\) for our experiment. We gradually increase the level of noise, divided into three grades: 5%, 10% and 15%. Figure 3 shows examples of the original and corrupted images under these grades of noise.
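A sketch of this preparation step is given below (assuming NumPy and a hypothetical loader load_salinasA() returning the 86 x 83 x 204 cube; the salt-and-pepper-style corruption of a given fraction of entries is our own illustrative assumption about the noise model).

import numpy as np
def prepare_salinasA(cube, density, rng=None):
    # Reshape the (86, 83, 204) cube to a 7138 x 204 matrix and corrupt
    # a fraction 'density' of its entries with random values (illustrative)
    rng = np.random.default_rng(0) if rng is None else rng
    X = cube.reshape(-1, cube.shape[-1]).astype(float)   # 7138 x 204
    mask = rng.random(X.shape) < density                 # entries to corrupt
    X_noisy = X.copy()
    X_noisy[mask] = rng.uniform(X.min(), X.max(), size=mask.sum())
    return X, X_noisy
# Usage (hypothetical loader): X, X_noisy = prepare_salinasA(load_salinasA(), density=0.05)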

Parameter Settings. The parameters of each algorithm are tuned with a bootstrapped grid search at a 5% corruption rate. For the Unifying and Practical methods, the parameter \(\lambda \) is set to 0.001. In MoG, the parameter \(\lambda \) is set to 0.1. In ROUTE-LRMR, \(\mathrm{{\alpha }}\), \(\mathrm{{\beta }}\) and \(\mathrm{{\gamma }}\) are set to 50, 1 and 0.01 respectively, as provided by [20]. The parameter \(\lambda \) in our proposed method is set to 0.1.

After the parameters are selected, spectral clustering is used to segment the data into clusters, yielding 6 clusters on SalinasA. To estimate the clustering effectiveness of each algorithm, the clustering process is repeated 5 times on the SalinasA dataset, and the clustering accuracies are recorded in Table 1. Table 1 reports the mean clustering accuracy of each algorithm with 0%, 5%, 10% and 15% of samples per pixel corrupted.
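The evaluation protocol can be sketched as follows (assuming scikit-learn's SpectralClustering on the recovered low-rank representation and accuracy computed via Hungarian matching with SciPy; these are our own illustrative choices rather than the exact implementation).

import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import SpectralClustering
def clustering_accuracy(y_true, y_pred):
    # Best accuracy over all one-to-one matchings of predicted to true labels
    k = int(max(y_true.max(), y_pred.max())) + 1
    cost = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        cost[t, p] += 1
    rows, cols = linear_sum_assignment(-cost)   # maximize matched counts
    return cost[rows, cols].sum() / len(y_true)
def evaluate(Z, y_true, n_clusters=6, n_runs=5):
    # Cluster the recovered low-rank representation Z (one column per pixel)
    accs = []
    for seed in range(n_runs):
        labels = SpectralClustering(n_clusters=n_clusters, random_state=seed,
                                    assign_labels="kmeans").fit_predict(Z.T)
        accs.append(clustering_accuracy(y_true, labels))
    return np.mean(accs)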

Table 1. The clustering accuracies obtained from our proposed method and other comparative methods on SalinasA dataset
Fig. 4. Variation of clustering accuracies under different corruption rates on the SalinasA dataset.

Results. From Table 1, the clustering accuracy of our proposed BF-RSF is higher than that of the other comparative methods when no corruption is added. As the corruption rate is gradually increased, the clustering accuracy of all algorithms decreases; for example, the accuracy decreases from 0.6857 to 0.6441 for our BF-RSF and from 0.6716 to 0.6261 for Unifying. However, BF-RSF remains robust, and its clustering accuracies are always higher than the others under the various levels of corruption. This demonstrates that the introduced sample scaling factor can suppress the effect of noisy data and improves the clustering performance in a heavily noisy scenario.

Moreover, Fig. 4 illustrates the variation of the clustering results under different corruption rates on the SalinasA dataset. The accuracies of our method and ROUTE-LRMR do not differ much at the 0% and 5% levels. As the corruption rate increases, the accuracies of our method become higher than the others. This indicates that the robustness of our proposed BF-RSF in clustering is enhanced by introducing the sample scaling factor.

From the above discussion, it can be seen that the performance of BF-RSF is better than the others, especially at high levels of corruption. In our proposed method, the sample scaling factor is imposed onto the training dataset iteratively, which suppresses the effect of noise. Therefore, our BF-RSF can learn a better low-rank structure of the clean data, especially in a heavily noisy scenario.

4 Conclusions

In this paper, we proposed bilinear factorization via recursive sample factoring for low-rank hyperspectral image recovery, termed BF-RSF. In the presence of missing data or outliers, the L1-norm measurement is more robust than the L2-norm. At the same time, different from traditional methods which impose a weight matrix directly, we introduce a sample scaling factor to measure the significance of each data point. The sample scaling factor is a cosine similarity metric learnt from the relationship between the principal component vector of the low-rank matrix and each data point. By imposing the sample scaling factor onto the dataset iteratively, noisy data are detected and their impact is suppressed. As a result, our BF-RSF model can learn a better low-rank structure of the clean data, especially in a heavily noisy scenario. Extensive experimental results on the SalinasA hyperspectral image dataset show that BF-RSF outperforms state-of-the-art low-rank matrix recovery methods in subspace clustering, especially in experiments with heavy noise injected.