1 Introduction

In recent years, deep learning has attracted considerable interest in the computer vision field, as it has achieved promising performance on various tasks, e.g., image classification [1], object detection [2], and face recognition [3]. Generally, deep structure learning extracts hierarchical feature representations directly from raw data. Recent representative works include deep convolutional neural networks [4], deep neural networks [5], deep auto-encoders [6], and deeply-supervised nets [7].

Fig. 1.

Illustration of our proposed algorithm. Corrupted data \(x _i, x _j\) are the inputs of the deep AE. After the encoding and decoding process, the reconstructed \(x _i, x _j\) are encouraged to be close to \(D z_i, D z_j\) at the top, where D is the learned clean low-rank dictionary and \(z_i, z_j\) are the corresponding coefficients. In addition, graph regularizers are added to the encoder layers to pass on the locality information.

Among different deep structures, the auto-encoder (AE) [8] has been treated as a robust feature extractor or a pre-training scheme in various tasks [9-14]. The conventional AE was proposed to encourage similar or identical input-output pairs, where the reconstruction loss is minimized after decoding [8]. Follow-up work that adds various noises to the input layer is able to progressively purify the data, which fulfills the purpose of "denoising" against unknown corruptions in the testing data [15]. These works, as well as the most recent AE variants, e.g., multi-view AE [13] and bi-shift AE [11], all assume the training data are clean but can be intentionally corrupted. In fact, real-world data subject to corruptions such as changing illumination, pose variations, or self-corruption do not meet this assumption. Therefore, learning deep features from real-world corrupted data, instead of from data intentionally corrupted with additive noises, becomes critical for building a robust feature extractor that generalizes well to corrupted testing data. To the best of our knowledge, such an AE based deep learning scheme has not been discussed before.

Recently, the low-rank matrix constraint has been proposed to learn robust features from corrupted data. Specifically, when data lie in a single subspace, robust PCA (RPCA) [16] can recover the corrupted data by seeking a low-rank basis, while low-rank representation (LRR) [17] is designed to recover corrupted data and rule out noise in the case of multiple subspaces. Due to these technical merits, low-rank modeling has already been successfully used in different scenarios, e.g., multi-view learning [18], transfer learning [19-21], and dictionary learning [22]. However, few works link low-rank modeling to a deep learning framework for robust feature learning.

Inspired by the above facts, we develop a novel algorithm named Deep Robust Encoder (DRE) with a locality preserving low-rank dictionary. The core idea is to jointly optimize a deep AE and a clean low-rank dictionary, which rules out noise and extracts robust deep features in a unified framework (Fig. 1). To sum up, our contributions are threefold:

  • A low-rank dictionary and a deep AE are jointly optimized on the corrupted data, progressively denoising the already corrupted features in the hidden layers so that a robust deep AE can be achieved for corrupted testing data.

  • The newly designed loss function, based on the clean low-rank dictionary and the preserved locality information in the output layer, penalizes corruptions or distortions while ensuring that the reconstruction is noise free.

  • Graph regularizers are developed to guide feature learning in each encoding layer and preserve more of the geometric structure within the data, in either an unsupervised or a supervised fashion.

The remaining sections of this paper are organized as follows. In Sect. 2, we briefly discuss related works. We then propose our deep robust encoder, together with its solution, in Sect. 3. Experimental evaluations are reported in Sect. 4, followed by the conclusion in Sect. 5.

2 Related Work

In this section, we mainly discuss the recent related works and highlight the differences between their approaches and ours.

The auto-encoder (AE) has attracted a lot of research interest in the computer vision field. It was proposed as an efficient scheme for deep structure pre-training and dimensionality reduction [5, 8]. The denoising auto-encoder (DAE) builds a robust feature extractor by adding artificial random noise to the input data and then minimizing the squared loss between the reconstructed output and the original clean data [15]. Most recently, appealing AE variants have been proposed to handle different learning tasks, e.g., transfer learning [11], domain generalization [12], and multi-view learning [13]. Generally, these variants aim to adapt knowledge from one domain/view to another by tuning the input or the target data. Different from them, we consider that real-world data have already been corrupted in some way, and we develop an active deep denoising framework to handle the existing corruptions in the training data, which then generalizes well to unseen corrupted testing data. To the best of our knowledge, however, little has been discussed in this regard for AEs.

Low-rank modeling has demonstrated appealing performance on robust feature extraction from noisy data. Robust PCA (RPCA) [16] was proposed to rule out noise for data lying in a single subspace. Moreover, low-rank representation (LRR) [17] was presented recently to handle real-world noisy data lying in multiple subspaces; it can identify the global subspace structure as well as the corruptions. Besides, low-rank modeling has also been adopted in different learning tasks, e.g., generic feature extraction [18], visual domain adaptation [21], robust transfer learning [19], and dictionary learning [22]. In this paper, we also impose the low-rank constraint on dictionary learning to build a clean and compact basis. Differently, we exploit the low-rank dictionary to reconstruct the outputs of the deep AE with corrupted inputs, instead of the original data [22]. In this way, we build an active deep denoising framework that generates more robust features from corrupted data. Furthermore, the locality-preserving reconstruction helps maintain the geometric structure of the data, which has not been discussed with a low-rank dictionary in deep learning before.

3 The Proposed Algorithm

In this section, we first introduce our motivation and then propose our deep robust encoder with the locality preserving low-rank dictionary. Finally, we present an efficient solution to the proposed framework.

3.1 Motivation

Intentional corruptions, e.g., random noises, are added artificially, while real-world corruptions come from the data itself, e.g., varied lighting or occlusion. Most existing AEs and their variants, e.g., DAE, add different noises to clean data to improve the robustness of deep models. During the deep encoding/decoding process, the perturbed input data are gradually recovered. In this way, the learned deep model is able to tolerate the corruptions simulated by the additive noises.

However, this raises two problems. First, the robustness of the system relies completely on the formulation of the noises: the richer the noise patterns, the better the performance, which inevitably increases the computational burden. In the worst case, the learned deep structure may not generalize well to unseen testing data. Second, real-world data usually suffer from contamination from varied sources, and building robust feature extractors that rule out existing noise is more reasonable. In addition, recent advances in low-rank matrix modeling shed light on denoising data that are already corrupted. Based on these observations, we propose to jointly learn a deep AE framework and a clean low-rank dictionary to actively mitigate the noise or corruptions within the data (Fig. 2).

Fig. 2.

The AE architecture with low-rank dictionary. A corrupted sample \(x \) is correlated to a low-rank clean version \(d \). The AE maps it to a hidden layer (via the encoder layer) and attempts to reconstruct \(x \) via the decoder layer, generating the reconstruction \(\check{x }\). Finally, the reconstruction error can be measured by different loss functions.

3.2 Locality Preserving Low-Rank Dictionary Learning

Suppose the training data \(X \in \mathbb {R}^{d\times {n}}\) contain n samples and \(x _i \in \mathbb {R}^{d}\) represents the i-th sample. An AE with a single hidden layer [5, 8] usually consists of two parts, an encoder and a decoder. The encoder, denoted \(f_1\), maps the input \(x _i\) to a hidden representation, while the decoder, denoted \(f_2\), maps the hidden representation back to the input \(x _i\). A typical cost function with squared loss for the AE can be formulated as:

$$\begin{aligned} \begin{array}{c} \min \limits _{W_1,b_1,W_2,b_2}\sum \limits _{i=1}^n\big \Vert x _i - f_2(f_1(x _i))\big \Vert _2^2, \end{array} \end{aligned}$$
(1)

where \(\{W_1 \in \mathbb {R}^{r\times {d}}, b_1 \in \mathbb {R}^r\}\) and \(\{W_2 \in \mathbb {R}^{d\times {r}}, b_2 \in \mathbb {R}^d\}\) are the parameters for encoding and decoding, respectively. Specifically, we have \(f_1(x _i) = \varphi (W_1x _i+b_1)\) and \(f_2(f_1(x _i)) = \varphi (W_2f_1(x _i)+b_2)\), where \(\varphi (\cdot )\) is an element-wise activation function, usually nonlinear, such as the sigmoid or tanh function. DAE adds artificial noise to the input training data and trains a denoising auto-encoder to remove this random noise.
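To make Eq. (1) concrete, below is a minimal NumPy sketch of the single-hidden-layer forward pass and the squared reconstruction loss; the sigmoid activation, the helper name ae_loss, and the toy dimensions are illustrative assumptions rather than part of the original formulation.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def ae_loss(X, W1, b1, W2, b2):
    """Squared reconstruction loss of Eq. (1) for a single-hidden-layer AE.

    X : d x n matrix whose columns are the samples x_i.
    (W1, b1): encoder parameters, W1 in R^{r x d}, b1 in R^r.
    (W2, b2): decoder parameters, W2 in R^{d x r}, b2 in R^d.
    """
    H = sigmoid(W1 @ X + b1[:, None])       # hidden codes f_1(x_i)
    X_rec = sigmoid(W2 @ H + b2[:, None])   # reconstructions f_2(f_1(x_i))
    return np.sum((X - X_rec) ** 2)         # sum_i ||x_i - f_2(f_1(x_i))||_2^2

# toy usage with random data in [0, 1]
rng = np.random.default_rng(0)
d, r, n = 20, 8, 50
X = rng.random((d, n))
W1, b1 = 0.1 * rng.standard_normal((r, d)), np.zeros(r)
W2, b2 = 0.1 * rng.standard_normal((d, r)), np.zeros(d)
print(ae_loss(X, W1, b1, W2, b2))
```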

In reality, however, \(x _i\) is usually already corrupted due to environmental factors or noise from the collecting devices. Intuitively, we need to build a network that detects and removes noise from the corrupted data so that it generalizes better to corrupted testing data. To this end, we propose our robust auto-encoder with low-rank dictionary learning:

$$\begin{aligned} \begin{array}{c} \min \limits _{W_1,b_1,W_2,b_2,D}\sum \limits _{i=1}^n\Vert d _i - f_2(f_1(x _i))\Vert _2^2 +\lambda \mathrm {rank}(D), \end{array} \end{aligned}$$
(2)

where \(d _i \in \mathbb {R}^d\) is the i-th column of the low-rank dictionary \(D \in \mathbb {R}^{d\times {n}}\) and \(\lambda \) is the tradeoff parameter. \(\mathrm {rank}(\cdot )\) denotes the rank of a matrix, which encourages a clean and compact basis. Generally, the convex surrogate of the rank, i.e., the nuclear norm \(\Vert \cdot \Vert _*\), is employed to solve the rank minimization problem [16].

However, similar to the conventional AE and its variants, the point-to-point reconstruction scheme in Eq. (2) only considers a one-to-one mapping, which may overfit the data and miss the structural knowledge within it. To this end, we propose a novel locality preserving low-rank dictionary learning scheme by introducing a coefficient vector \(z_i\) to maintain the locality of each sample \(x _i\) throughout the network:

$$\begin{aligned} \begin{array}{c} \min \limits _{W_1,b_1, W_2,b_2,D}\sum \limits _{i=1}^n\Vert Dz_i - f_2(f_1(x _i))\Vert _2^2 +\lambda \Vert D\Vert _*, \end{array} \end{aligned}$$
(3)

where \(z_i \in \mathbb {R}^{n}\) is the coefficient vector for sample \(x _i\) w.r.t. dictionary D. There are different strategies to obtain the coefficient vector \(z_i\), in either an unsupervised or a supervised fashion, depending on the availability of label information. Specifically, the j-th element of \(z_i\) is defined as:

$$\begin{aligned} z_{ij} = \left\{ \begin{array}{cc} 1, &{} \text {if}~i=j,\\ \exp \left( -{\frac{\Vert x _i-x _j\Vert ^2}{2\sigma ^2}}\right) , &{} \text {if}~x _i \in \mathcal {N}_{k_1}(x _j), \\ 0, &{} \text {otherwise}, \end{array} \right. \end{aligned}$$
(4)

where \(x _i \in \mathcal {N}_{k_1}(x _j)\) means \(x _i\) is among the \(k_1\) nearest neighbors of \(x _j\). The locality-preserving coefficients \(z_i\) can thus be defined in two fashions: in the unsupervised case, the \(k_1\) nearest neighbors are searched over the whole dataset, while in the supervised case, they are searched only among the data from the same class as \(x _i\). The scheme could also easily be extended to the semi-supervised scenario. Note that \(\sigma \) is the bandwidth of the Gaussian kernel (we set \(\sigma = 5\) in this paper).
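A hedged NumPy sketch of Eq. (4) is given below. The helper name locality_coefficients is ours, the matrix convention is Z[j, i] = \(z_{ij}\) (so the columns of Z are the vectors \(z_i\), matching \(Z = [z_1, \cdots , z_n]\) used in Sect. 3.4), and the supervised case is approximated by restricting the neighbor search to samples sharing the label of \(x _j\).

```python
import numpy as np

def locality_coefficients(X, k1=5, sigma=5.0, labels=None):
    """Locality-preserving coefficients of Eq. (4).

    X : d x n data matrix (columns are samples).
    Returns Z (n x n) with Z[j, i] = z_{ij}; column i is the vector z_i.
    labels=None gives the unsupervised fashion; otherwise neighbors are
    searched only within the same class (supervised fashion, an assumption).
    """
    n = X.shape[1]
    sq = np.sum(X ** 2, axis=0)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)   # ||x_i - x_j||^2
    Z = np.zeros((n, n))
    for j in range(n):
        cand = np.arange(n) if labels is None else np.flatnonzero(labels == labels[j])
        cand = cand[cand != j]
        nn = cand[np.argsort(dist2[cand, j])[:k1]]        # x_i in N_{k1}(x_j)
        Z[j, nn] = np.exp(-dist2[nn, j] / (2.0 * sigma ** 2))
    np.fill_diagonal(Z, 1.0)                              # z_{ii} = 1
    return Z
```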

To sum up, our regularized deep auto-encoder transforms the original AE's point-to-point reconstruction strategy into a point-to-set reconstruction so that more discriminative information is preserved. To further guide the locality preserving dictionary learning in the output layer, we couple discriminant graph regularizers with the hidden feature learning during optimization:

$$\begin{aligned} \begin{array}{c} \min \limits _{W_1,b_1, W_2,b_2,D}\sum \limits _{i=1}^n\Vert Dz_i - f_2(f_1(x _i))\Vert _2^2 +\lambda \Vert D\Vert _* \\ + \alpha \sum \limits _{j=1}^n\sum \limits _{k=1}^ns_{jk}(f_1(x _j)-f_1(x _k))^2, \end{array} \end{aligned}$$
(5)

where \(s_{jk}\) is the similarity between \(x _j\) and \(x _k\). \(\alpha \) is the balance parameter.

Specifically, \(s_{jk}\) can be calculated in unsupervised and supervised fashions as well:

$$\begin{aligned} s_{jk} = \left\{ \begin{array}{cc} \exp \left( -{\frac{\Vert x _j-x _k\Vert ^2}{2\sigma ^2}}\right) , &{} \text {if}~x _j \in \mathcal {N}_{k_2}(x _k), \\ 0, &{} \text {otherwise}, \end{array} \right. \end{aligned}$$
(6)

where \(x _j \in \mathcal {N}_{k_2}(x _k)\) means \(x _j\) is among the \(k_2\) nearest neighbors of \(x _k\). As with \(z_i\), the \(k_2\) nearest neighbors are selected from the whole dataset in the unsupervised case, and from the data of the same class as \(x _j\) in the supervised case.
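As a sketch, the similarities of Eq. (6) and the value of the graph term in Eq. (5) could be computed as follows; the function name graph_regularizer is ours, and the scalar term \((f_1(x _j)-f_1(x _k))^2\) is read as the squared Euclidean norm of the difference of the hidden features.

```python
import numpy as np

def graph_regularizer(X, H, k2=5, sigma=5.0, labels=None):
    """Graph term of Eq. (5): sum_{j,k} s_{jk} ||f_1(x_j) - f_1(x_k)||^2.

    X : d x n raw data used to build s_{jk} as in Eq. (6).
    H : r x n hidden features; column j is f_1(x_j).
    labels=None -> unsupervised neighbors; otherwise same-class neighbors.
    """
    n = X.shape[1]
    sq = np.sum(X ** 2, axis=0)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)
    S = np.zeros((n, n))
    for k in range(n):
        cand = np.arange(n) if labels is None else np.flatnonzero(labels == labels[k])
        cand = cand[cand != k]
        nn = cand[np.argsort(dist2[cand, k])[:k2]]        # x_j in N_{k2}(x_k)
        S[nn, k] = np.exp(-dist2[nn, k] / (2.0 * sigma ** 2))
    hsq = np.sum(H ** 2, axis=0)
    hdist2 = hsq[:, None] + hsq[None, :] - 2.0 * (H.T @ H)  # ||f_1(x_j) - f_1(x_k)||^2
    return float(np.sum(S * hdist2))
```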

3.3 Deep Architecture

Taking the learning objective in Eq. (5) as a basic building block, we can train a more discriminant deep model. Existing popular training schemes for deep auto-encoders include the Stacked Auto-Encoder (SAE) [15] and the Deep Auto-Encoder [6]. However, as our learning objective/building block is different from theirs, we use a different training scheme for the deep structure.

Assume we have L encoding layers and L decoding layers in our deep structure, which minimizes the following loss:

$$\begin{aligned} \begin{array}{c} \min \limits _{W_l,b_l,D}\sum \limits _{i=1}^n\Vert Dz_i - \bar{x }_i\Vert _2^2 +\lambda \Vert D\Vert _* \\ + \alpha \sum \limits _{l=1}^L\sum \limits _{j=1}^n\sum \limits _{k=1}^ns_{jk}(f_l(x _j)-f_l(x _k))^2, \end{array} \end{aligned}$$
(7)

where \(\bar{x }_i\) is the output obtained from the input \(x _i\) through the series of encoding and decoding layers. \(\{W_l,b_l\}\) with \(1\le l\le L\) are the encoding parameters, while \(\{W_l,b_l\}\) with \(L+1\le l\le 2L\) are the decoding parameters. The third term sums the graph regularizers over the encoding layers to guide the locality preserving low-rank dictionary learning in the output layer.
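A minimal sketch of evaluating the objective in Eq. (7) for a fixed dictionary D might look as follows; the sigmoid activations and the helper name deep_objective are assumptions, and Z and S are the coefficient and similarity matrices from Eqs. (4) and (6).

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def deep_objective(X, params, D, Z, S, lam, alpha):
    """Objective of Eq. (7) for a fixed dictionary D.

    X : d x n input data.
    params : list of (W_l, b_l) for the 2L layers, encoders first.
    D : d x n low-rank dictionary; Z, S : n x n matrices from Eqs. (4) and (6).
    """
    L = len(params) // 2
    F, reg = X, 0.0
    for l, (W, b) in enumerate(params, start=1):
        F = sigmoid(W @ F + b[:, None])                    # f_l(x_i) for all i
        if l <= L:                                         # graph term on encoder layers
            fsq = np.sum(F ** 2, axis=0)
            fdist2 = fsq[:, None] + fsq[None, :] - 2.0 * (F.T @ F)
            reg += np.sum(S * fdist2)
    recon = np.sum((D @ Z - F) ** 2)                       # sum_i ||D z_i - x̄_i||_2^2
    return recon + lam * np.linalg.norm(D, ord='nuc') + alpha * reg
```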

3.4 Optimization

Equation (7) is difficult to solve because of the non-convexity and non-linearity of the building block in Eq. (5). We therefore develop an alternating solution that iteratively updates the encoding and decoding functions \(f_l\ (1\le l \le 2L)\) and the dictionary D. We first describe the low-rank dictionary learning, then the regularized deep auto-encoder optimization.

Low-Rank Dictionary Learning. When \(f_l\ (1\le l \le 2L)\) are fixed, the objective function in Eq. (7) degenerates to a conventional low-rank recovery problem, which can be solved by the augmented Lagrange multiplier (ALM) algorithm [23]. We first introduce a relaxation variable J and write down the equivalent formulation:

$$\begin{aligned} \min _{D,J}\Vert \bar{X} - DZ\Vert _\mathrm {F}^2+\lambda \Vert J\Vert _*,~~\mathrm {s.t.}~~D = J, \end{aligned}$$

where \(\bar{X} = [\bar{x }_1,\cdots ,\bar{x }_n]\) and \(Z = [z_1, \cdots , z_n]\). \(\Vert \cdot \Vert _\mathrm {F}\) denotes the Frobenius norm of a matrix. We then derive the corresponding augmented Lagrangian function w.r.t. D and J:

$$\begin{aligned} \Vert \bar{X} - DZ\Vert _\mathrm {F}^2+\lambda \Vert J\Vert _*+\langle R, D - J\rangle +\frac{\mu }{2}\Vert D-J\Vert _\mathrm {F}^2, \end{aligned}$$

where R is the Lagrange multiplier, \(\mu > 0\) is the penalty parameter, and \(\langle \cdot , \cdot \rangle \) is the matrix inner product. At iteration t, we update J and D one variable at a time:

$$\begin{aligned} J_{t+1} = \mathop {\arg \min }_{J}\frac{\lambda }{\mu _t}\Vert J\Vert _*+\frac{1}{2}\Vert J-D_t-\frac{R_t}{\mu _t}\Vert _\mathrm {F}^2, \end{aligned}$$
(8)

which can be solved efficiently by the singular value thresholding (SVT) operator [24].
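A short sketch of the SVT step for Eq. (8) is given below (the function name svt is ours). The dictionary update then follows in the closed form of Eq. (9) below.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding [24]:
    argmin_J  tau * ||J||_*  +  0.5 * ||J - M||_F^2."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

# J update of Eq. (8), with D_t, R_t, mu_t being the current ALM iterates:
# J_next = svt(D_t + R_t / mu_t, lam / mu_t)
```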

$$\begin{aligned} \begin{array}{rl} D_{t+1} = &{} \mathop {\arg \min }\limits _{D} \Vert \bar{X} - DZ\Vert _\mathrm {F}^2+\langle R_t, D - J_{t+1}\rangle +\frac{\mu _t}{2}\Vert D-J_{t+1}\Vert _\mathrm {F}^2\\ = &{}(2\bar{X}Z^\top +\mu _t{J}_{t+1}-R_t)(2ZZ^\top +\mu _t{\mathrm {I}_n})^{-1}, \end{array} \end{aligned}$$
(9)

where \(\mathrm {I}_n \in \mathbb {R}^{n\times {n}}\) is the identity matrix.
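The closed-form D update of Eq. (9) can be sketched as follows; update_dictionary is a hypothetical helper, and the multiplier and penalty updates in the trailing comment follow the standard inexact ALM recipe of [23] as an assumption.

```python
import numpy as np

def update_dictionary(X_bar, Z, J, R, mu):
    """D update of Eq. (9):
    D = (2 X̄ Z^T + mu J - R)(2 Z Z^T + mu I_n)^{-1}."""
    n = Z.shape[0]
    rhs = 2.0 * X_bar @ Z.T + mu * J - R
    A = 2.0 * Z @ Z.T + mu * np.eye(n)        # symmetric positive definite
    return np.linalg.solve(A, rhs.T).T        # right-multiplication by A^{-1}

# one inexact-ALM pass (standard updates, an assumption):
#   J  = svt(D + R / mu, lam / mu)
#   D  = update_dictionary(X_bar, Z, J, R, mu)
#   R  = R + mu * (D - J)
#   mu = min(rho * mu, mu_max)
```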

Deep Robust Encoder Learning. When D is fixed, the objective in Eq. (7) reduces to minimizing:

$$\begin{aligned} \mathcal {L} = \sum \limits _{i=1}^n\Vert \bar{x }_i- \bar{d _i}\Vert _2^2 +\alpha \sum \limits _{l=1}^L\sum \limits _{j=1}^n\sum \limits _{k=1}^ns_{jk}(f_l(x _j)-f_l(x _k))^2, \end{aligned}$$

where \(\bar{d _i} = Dz_i\). Since this loss function is smooth and twice differentiable, we adopt the L-BFGS optimizer [25] for this unconstrained problem, whose updating rules at iteration t are:

$$\begin{aligned} \left\{ \begin{aligned} W_{l,t+1}&= W_{l,t} - \eta _t H_{l,t}\frac{\partial \mathcal {L}}{\partial W_l}|_{W_{l,t}},\\ b_{l,t+1}&= b_{l,t} - \eta _t G_{l,t}\frac{\partial \mathcal {L}}{\partial b_l}|_{b_{l,t}}, \end{aligned} \right. \end{aligned}$$
(10)

in which \(\eta _t\) denotes the learning rate, and \(H_{l,t}\) and \(G_{l,t}\) are the approximations to the inverse Hessian matrices of \(\mathcal {L}\) w.r.t. \(W_l\) and \(b_l\), respectively. The detailed formulations of \(\eta _t\), \(H_{l,t}\), and \(G_{l,t}\) follow [25]. Here we focus on the derivatives of \(\mathcal {L}\) w.r.t. \(W_l\) and \(b_l\).
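Rather than hand-coding the update rules of Eq. (10), one could hand the smooth loss to an off-the-shelf L-BFGS routine; the sketch below uses SciPy's implementation, and loss_and_grad is an assumed helper that evaluates \(\mathcal {L}(\theta )\) and \(\partial \mathcal {L}/\partial \theta \) for a flattened parameter vector containing all \(W_l\) and \(b_l\).

```python
import numpy as np
from scipy.optimize import minimize

def fit_encoder_lbfgs(loss_and_grad, theta0, max_iter=100):
    """Minimize the smooth loss L over all (W_l, b_l) with L-BFGS.

    loss_and_grad(theta) -> (L(theta), dL/dtheta) for a flat parameter
    vector theta (an assumed helper that unpacks W_l, b_l internally).
    """
    res = minimize(loss_and_grad, theta0, jac=True, method='L-BFGS-B',
                   options={'maxiter': max_iter})
    return res.x
```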

For the decoding layers \((L+1 \le l \le 2L)\), we have:

$$\begin{aligned} \begin{array}{c} \dfrac{\partial \mathcal {L}}{\partial W_{l}} = \sum \nolimits _{i=1}^{n}\mathcal {F}_{i,l}\mathbf {f}_{i,l-1}^\top ,~~~~\dfrac{\partial \mathcal {L}}{\partial b_{l}} = \sum \nolimits _{i=1}^{n}\mathcal {F}_{i,l}, \end{array} \end{aligned}$$

where \(\mathbf {f}_{i,l-1} = f_{l-1}(x _i)\) is the hidden feature of the \((l-1)\)-th layer, and the back-propagated terms \(\mathcal {F}_{i,l}\) are computed recursively as:

$$\begin{aligned} \begin{array}{l} \mathcal {F}_{i,2L} = 2(\bar{x }_i - \bar{d}_i)\odot {\varphi '(\mathbf u _{i,2L})},\\ ~~~\mathcal {F}_{i,l} = (W_{l+1}^\top \mathcal {F}_{i,l+1})\odot {\varphi '(\mathbf u _{i,l})}. \end{array} \end{aligned}$$

Here the operator \(\odot \) denotes the element-wise multiplication, and \(\mathbf u _{i,l}\) is computed by \(\mathbf u _{i,l} = W_l\mathbf f _{i,l-1}+b_l\).
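A sketch of this backward recursion for one sample might read as follows; decoder_deltas and the dict-based bookkeeping are our assumptions, and dphi is the element-wise derivative of the activation \(\varphi \).

```python
import numpy as np

def decoder_deltas(x_bar, d_bar, Ws, us, dphi):
    """Back-propagated terms F_{i,l} for one sample i.

    x_bar : network output x̄_i;  d_bar : target D z_i.
    Ws[l] : weight W_l;  us[l] : pre-activation u_{i,l}  (dicts keyed by l = 1..2L).
    dphi  : element-wise derivative of the activation φ.
    Returns {l: F_{i,l}} for l = 1, ..., 2L.
    """
    top = max(us)                                        # layer index 2L
    F = {top: 2.0 * (x_bar - d_bar) * dphi(us[top])}     # F_{i,2L}
    for l in range(top - 1, 0, -1):
        F[l] = (Ws[l + 1].T @ F[l + 1]) * dphi(us[l])    # F_{i,l}
    return F
```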

For the encoding layers \((1 \le l \le L)\), we have:

$$\begin{aligned} \begin{array}{l} \dfrac{\partial \mathcal {L}}{\partial W_{l}} = \sum \limits _{i=1}^{n}\mathcal {F}_{i,l}\mathbf {f}_{i,l-1}^\top +\\ ~~~~~~~~~~~~~~~2\alpha \sum \limits _{p=l}^L\sum \limits _{j=1}^n\sum \limits _{k=1}^ns_{jk}(\mathcal {G}_{jk,p}\mathbf {f}_{j,p-1}^\top +\mathcal {G}_{kj,p}\mathbf {f}_{k,p-1}^\top ), \\ \dfrac{\partial \mathcal {L}}{\partial b_{l}} = \sum \limits _{i=1}^{n}\mathcal {F}_{i,l}+2\alpha \sum \limits _{p=l}^L\sum \limits _{j=1}^n\sum \limits _{k=1}^ns_{jk}(\mathcal {G}_{jk,p}+\mathcal {G}_{kj,p}), \end{array} \end{aligned}$$

in which \(\mathcal {G}_{jk,l}\) and \(\mathcal {G}_{kj,l}\) are calculated as follows:

$$\begin{aligned} \begin{array}{l} \mathcal {G}_{jk,L} = (\mathbf {f}_{j,L} - \mathbf {f}_{k,L})\odot {\varphi '(\mathbf u _{j,L})},\\ \mathcal {G}_{kj,L} = (\mathbf {f}_{k,L} - \mathbf {f}_{j,L})\odot {\varphi '(\mathbf u _{k,L})},\\ \mathcal {G}_{jk,l} = (W_{l+1}^\top \mathcal {G}_{jk,l+1})\odot {\varphi '(\mathbf u _{j,l})},\\ \mathcal {G}_{kj,l} = (W_{l+1}^\top \mathcal {G}_{kj,l+1})\odot {\varphi '(\mathbf u _{k,l})}. \end{array} \end{aligned}$$

With these two sub-problems, we optimize the low-rank dictionary and the deep auto-encoder iteratively until convergence; the entire procedure is listed in Algorithm 1. Before the alternating updates, the network parameters \(f_l\ (1 \le l \le 2L)\) are initialized by training a deep auto-encoder with both input and target set to X [6], while D is initialized directly with the original data X.

Algorithm 1. The alternating optimization of the two sub-problems (low-rank dictionary learning and deep robust encoder learning).

4 Experiments

In this section, we conduct experiments to systematically evaluate our algorithm. First, we present the details of the datasets and experimental settings. Then we perform a self-evaluation of our algorithm and present comparison results against several state-of-the-art algorithms. Finally, we examine several properties of the proposed algorithm, e.g., the impact of layer size and parameter sensitivity.

4.1 Datasets and Experimental Settings

The COIL dataset includes 72 views of 100 objects under different illumination conditions (Fig. 3). Each object is captured at equally spaced views, i.e., every 5 degrees. In our experiments, we adopt the gray-scale images and resize them to 32 \(\times \) 32. We randomly select ten images per object to build the training set and use the remaining images as the testing set. We repeat the random selection 20 times and report the average performance. In addition, we perform scalability evaluations by gradually involving more categories, from 20 to 100. Furthermore, we evaluate the robustness of different approaches to noise by adding 10 % random corruption to the original images.

Fig. 3.

Samples from the two datasets: COIL-100 (left) and CMU-PIE (right). We show original images and 10 % corrupted ones for COIL-100. For CMU-PIE, the original faces of a single subject already show large variance.

The CMU-PIE face dataset contains 68 subjects under different poses with large appearance differences (Fig. 3). In addition, for each pose there are 21 illumination conditions. We use face images from 8 different poses to construct various evaluation sets, whose sizes (numbers of poses) vary from 2 to 5. We randomly select 15 images per pose per subject to build the training set, while the rest form the testing set. The face images are cropped and resized to \(64\times {64}\), and the raw features are used as the inputs.

Table 1. Recognition results (\(\%\)) of 4 approaches on different settings of three datasets.

The ALOI dataset consists of 1000 object categories captured from different viewing angles; each object has 72 equally spaced views. In these experiments, we select the first 300 objects following the setting in [26], where the images are converted to gray-scale and resized to \(36\times 48\). Furthermore, 10 % pixel corruption is added to test the robustness of different methods.

Note that previous algorithms, e.g., DAE [15], adopt data "corrupted" with random noise as the input for training while using the "original" data for testing. We instead assume the data are already corrupted and aim to detect and remove the noise. Thus, we use the same type of data, without intentional corruptions, for both training and testing. Notably, to challenge all comparisons, we introduce additional noise to datasets that are already corrupted by poor lighting or arbitrary views; such practice can be found in previous work [22, 26].

4.2 Self-evaluation

In this section, we test whether our low-rank dictionary D and locality preserving term \(Z = [z_1,\cdots ,z_n]\) facilitate robust feature learning. Specifically, we define the deep version of Eq. (2) as LAE (auto-encoder with low-rank dictionary) and the deep version of Eq. (3) as L\(^2\)AE (auto-encoder with locality preserving low-rank dictionary). For L\(^2\)AE, there are two ways to learn Z: we set \(k_1 = k_2 = 5\) for all cases in the unsupervised fashion (L\(^2\)AE-u), and set \(k_1, k_2\) to the size of each class in the supervised fashion (L\(^2\)AE-s). A four-layer scheme is applied for all comparisons for simplicity. We adopt the corrupted COIL-100 and ALOI datasets, and both the original and corrupted images of CMU-PIE, to compare these algorithms with the baseline, the conventional AE [8]. The comparison results are shown in Table 1, where COIL-100c denotes the 10 % corrupted COIL with 100 objects, PIE-1 and PIE-2 denote the two-view cases \(\{C02, C14\}\) and \(\{C02, C27\}\) with their 10 % corrupted versions PIE-1c and PIE-2c, respectively, and ALOI-c denotes the 10 % corrupted ALOI data.

From the results, we observe that LAE outperforms the conventional AE, which means that jointly learning the low-rank dictionary boosts the deep feature learning of the auto-encoder. Furthermore, our robust AEs with the locality preserving low-rank dictionary achieve better performance than LAE and AE in both the unsupervised and supervised settings. That is, the locality preserving property generates more discriminative features for classification.

4.3 Comparison Experiments

We mainly compare with (1) traditional feature extraction methods: PCA [27] and LDA [28]; and (2) low-rank based algorithms: RPCA+LDA [16], LatLRR [29], DLRD [22], LRCS [18], and SRRS [26]. PCA, LDA, RPCA+LDA, LRCS, and SRRS are dimensionality reduction algorithms, so we search for the optimal dimensionality of each when reporting its performance. Besides, to further evaluate the effectiveness of our algorithm, DAE [15] is adopted as the baseline. Our algorithm has two modes, i.e., an unsupervised mode (Ours-I) and a supervised mode (Ours-II). Specifically, we set the parameters \(\alpha = 10^2, \lambda = 10^{-2}\). For DAE and our two modes, we apply a four-layer deep structure. For Ours-I, we set \(k_1 = k_2 = 5\) for all cases, while for Ours-II, we set \(k_1, k_2\) to the size of each class. We apply the nearest neighbor classifier (NNC) for all algorithms except DLRD and show the experimental results in Tables 2 and 3 and Fig. 4(a).

Table 2. Recognition results (\(\%\)) of 9 algorithms on COIL-100 for different evaluation sizes, from 20 to 100 objects, where C1 to C5 denote 20 to 100 objects, respectively. The best and second-best recognition rates are highlighted.

From Tables 2 and 3 and Fig. 4(a), we observe that our algorithm in both modes outperforms the others in most cases, especially under corruption. In the corruption cases, our method yields a significant improvement over the others on the two datasets (about 7 % on the corrupted COIL dataset). All algorithms suffer from the additional noise; however, ours still achieves appealing performance (only 1-2 % degradation), which demonstrates the superiority of our method against noise in feature learning.

Table 3. Recognition results (\(\%\)) on the CMU-PIE face database, where P1: {C02, C14}, P2: {C02, C27}, P3: {C14, C27}, P4: {C05, C07, C29}, P5: {C05, C14, C29, C34}, P6: {C02, C05, C14, C29, C31}. The best and second-best recognition rates are highlighted.
Fig. 4.

(a) Recognition results (\(\%\)) of 9 algorithms on ALOI-300 in the original and 10 % corrupted cases. (b) Recognition rates of all comparisons on the COIL database with different levels of noise. (c) Parameter analysis on \(\alpha \) and \(\lambda \). (d) The impact of layer size on the recognition performance.

On the COIL dataset, we observe that the low-rank modeling based methods also achieve very good results compared with DAE, although the latter adds noise to train robust deep models; this demonstrates the robustness of low-rank modeling against noisy data. On the CMU-PIE dataset, DAE achieves performance very similar to the low-rank modeling based methods, in both supervised and unsupervised fashions. Similar results are found on the ALOI dataset. On CMU-PIE, our algorithm cannot significantly improve the performance. One reason is that the facial appearances under different views of CMU-PIE are very different; together with the additional illumination variations, this makes feature learning on this real-world dataset very challenging. However, our algorithm still achieves promising performance, even better than the recent multi-view learning method LRCS, which further verifies the robustness of our algorithm against real-world noise. Generally, our supervised model outperforms the unsupervised one in almost all cases, demonstrating the importance of discriminative information in classification tasks.

4.4 Property Evaluation

In this section, we further evaluate several properties of our proposed algorithm, e.g., robustness to noise, parameter influence and layer size impact, to achieve a better understanding of the proposed model.

First of all, we evaluate the impact of different corruption ratios on different algorithms. We evaluate 0 %, 10 %, 20 %, 30 %, 40 %, and 50 % corruption with 20 objects on the COIL dataset and report the results in Fig. 4(b), where our algorithm in both modes consistently outperforms the other competitors. This demonstrates that our algorithm builds a more robust feature extractor, especially for data with heavy corruption, and could therefore work effectively in real-world applications with various types of noise.

Second, we conduct a parameter analysis for our supervised model (Ours-II). Specifically, we evaluate the balance parameters \(\lambda \) and \(\alpha \) for the low-rank dictionary and the graph terms, respectively. For better illustration, we jointly evaluate the two parameters on the corrupted COIL dataset with all 100 objects; the results are shown in Fig. 4(c). We notice that larger values of \(\alpha \) perform better, especially when \(\lambda \) is small, and that a small \(\lambda \) around \(10^{-2}\) performs best. That is, the graph regularizer is more critical to our algorithm than the low-rank constraint on the dictionary. Without loss of generality, we set \(\alpha = 10^2\) and \(\lambda =10^{-2}\) throughout our experiments.

Finally, we evaluate the impact of layer size for Ours-II on corrupted COIL-100 with different corruption levels (10 %, 20 %, 30 %). From Fig. 4(d), we notice that our algorithm generally achieves better performance as the number of layers grows. That is, discriminative information is progressively recovered by our deep encoding procedure; in other words, the features are refined from coarse to fine in a multi-layer fashion. However, we also observe that a much deeper structure degrades the recognition performance. Therefore, in the experiments, we use a four-layer structure to generate the evaluation features.

5 Conclusion

In this paper, we developed a novel Deep Robust Encoder framework guided by a locality preserving low-rank dictionary learning scheme. Specifically, we designed a low-rank dictionary to constrain the output of a deep auto-encoder with corrupted input. In this way, the deep network generates more robust features by detecting noise in the corrupted data. Moreover, coefficient vectors \(z_i\) are maintained throughout the network so that each output sample is reconstructed by its most similar samples in the dictionary with different weights. Furthermore, graph regularizers were developed to couple each layer's encoding and preserve more geometric structure. In the experiments, our method yielded more effective features for classification, and the results on several benchmarks demonstrated its superiority over competing methods.