Keywords

1 Introduction

Unsupervised classification (clustering) is an indispensable technique in several modern data analysis tasks such as image segmentation, pattern recognition and data mining. Indeed, clustering plays a significant role in processing of hyperspectral remote sensing imagery, where labelled training samples are laborious to produce or often inadequate for application with supervised classification.

A hyperspectral image (HSI) consists of hundreds of contiguous spectral bands that provide detailed information to distinguish spectrally similar materials. However, the high dimensionality of data introduces complexities, generally referred to the curse of dimensionality [1, 2]. Thus, dimensionality reduction (DR) is of great importance in processing of HSI data for the unsupervised classification task.

Several DR techniques have been utilized in HSI analysis tasks. In this direction, the common strategies include linear DR approaches such as Principal Component Analysis (PCA), Multidimensional Scaling (MDS) and Independent Component Analysis (ICA). These linear techniques have been shown efficient in many scenarios, but as they rely on the linearity assumption and do not consider the potential non-linear dependencies they may be problematic in remote sensing HSI data usually coming with intrinsic non-linearities. The non-linearity in HSI data may be caused by non-linear scattering patterns, variations in local geometry of the sun-canopy-sensor triangle, nonuniform pixel composition, and the characteristics of transmission medium [3].

Non-linear dimensionality reduction (NLDR) approaches, mainly manifold learning algorithms, have been used to resolve both issues of high-dimensionality and the problems associated with non-linearity of HSI remote sensing data. The conventional non-linear manifold learning algorithms such as Isomap [4], kernel Principal Component Analysis (KPCA) [5], Locally Linear Embedding (LLE) [6], Local Tangent Space Alignment (LTSA) [7] and Laplacian Eigenmaps (LE) [8] are built on the single smooth manifold assumption. These approaches attempts to retrieve the structure of data while they assume that the data lie close to a low dimensional sub-manifold embedded in a uniformly distributed single topological manifold.

In many real world applications, data may lie on multiple manifolds. Manifold learning algorithms relying on a single global manifold will fail to achieve satisfactory results if the data is drawn from multiple manifolds with intersections. To this end, multi-manifold learning is an alternative line of research in NLDR that attempts to recover the structure of data with multi-manifold assumption. The multi-manifold assumption is about distribution of data where each data cluster is approximated by a specific sub-manifold that features distinct geometric structure being attributed to that data cluster.

There are a number of methods proposed in this context. A group of multi-manifold learning algorithms is about to complement the manifold learning algorithms based on the single manifold assumption to data with multiple manifolds. [9] proposes a multi-manifold clustering algorithm based on Multi Dimensional scaling (MDS). [10] extends ISOMAP to data of multiple manifolds. [11] proposes a multi-manifold learning algorithm that performs locality preserving projection (LPP) [12]. [13, 14] are extensions of the well-know locally linear embedding (LLE) for data with multi-manifold structure.

Another group of the methods in multi-manifold learning incorporates the manifold structure into clustering problem. [15,16,17] incorporate dimension or density information of data into the clustering algorithm. [18] performs multi-manifold clustering via energy minimization. [19] proposes a sparse-coding based spectral clustering. [20], High Order Spectral Clustering (HOSC), performs spectral clustering based via higher order graph affinities. [21,22,23,24,25,26] propose spectral clustering utilizing local tangent space affinities.

In this paper, we present a new spectral multi-manifold clustering algorithm with application to HSI data. The proposed algorithm follows the same outline as the general local tangent space based spectral methods, but it adopts a variant of autoencoders, specifically Contractive Autoencoder (CAE) [27], for local tangent space estimation.

Autoencoder (AE) is a simple feed forward neural network that basically aims to reconstruct input data. The main purpose of AE is to perform dimensionality reduction or Sparse Coding (SC) through unsupervised learning. An autoencoder with the number of hidden units less than the number of input units (the original dimension of data) is seen as a dimensionality reduction technique. An AE with the number of hidden units larger than number of input units imposed by some constraint renders Sparse Coding. Indeed, PCA in this way may be pointed out as a special case of linear AE that includes rank deficient connection weight matrix [27, 28].

Contractive Autoencoder is an improved variant of the basic autoencoder that directly attempts to produce robust latent representations that are less sensitive to infinitesimal variations in input data. CAE achieves this by balancing the reconstruction loss through an analytical regularizer term that penalizes the Frobenius norm of the Jacobian of the encoder. Through the manifold realization of data, Riafai et al. in [27,28,29] show that as the reconstruction objective in CAE attempts to produce accurate reconstructions, the contraction regularizer aims for the hidden representations that are also indifferent in all directions of input space. That is, the hidden representations in CAE prefer to change only along the directions of nearby training data points and accordingly tend to resemble a data manifold that is locally flat.

This observation leads to inspect the derivatives of the hidden representations with respect to the input data for estimation of the coordinates systems of the local tangent spaces as successfully exploited in [29]. Following this direction, we utilize CAE for local tangent space estimation required for a general spectral multi-manifold clustering applied for typical HSI unsupervised classification task.

In the proposed algorithm, we follow three sequential steps. First, we adopt and train an overcomplete single-layer CAE with unlabelled data. We compute the local tangent spaces at every point using the trained CAE. Second, we combine the obtained local tangent spaces with the proximity model of data and then we compute the affinity matrix. Third, we plug the affinity matrix into a standard spectral clustering algorithm and we classify data. We evaluate the performance of algorithm with two benchmark HSI data, namely Salinas and Pavia Center Scene.

The remainder of the paper is organized as follows. Section 2 reviews Contractive Autoencoder (CAE) and shows how it can be used for tangent space estimation. Section 3 presents the overall procedure of spectral multi-manifold clustering, presenting the main steps involved. Section 4 presents experimental results and evaluation of effectiveness of the proposed algorithm. Finally, Sect. 5 concludes the paper.

2 Tangent Space Estimation with Contractive Autoencoder

A basic autoencoder involves an encoder and a decoder. The encoder transforms data from the input space to the latent representations and the decoder transforms the latent representations back to the input space. Formally, a data vector from the original input space \(\mathbf {x} \in \mathbb {R}^D\) is mapped to the latent representation \(\mathbf {h} \in \mathbb {R}^{d_h}\) with encoder function \(f(\mathbf {x})\) and the reconstruction of that input data \(\hat{\mathbf {x}}\) is obtained from the latent representation with decoder function \(g(\mathbf {h})\).

A basic autoencoder learns the encoder and decoder parameters \(\varTheta \) through minimizing the average reconstruction loss given by the following objective function:

$$\begin{aligned} \mathcal {J}_{AE} = \sum _{_{\mathbf {x} \in X}} L(\mathbf {x},\hat{\mathbf {x}}) = \sum _{_{\mathbf {x} \in X}} L(\mathbf {x},g(f(\mathbf {x}))), \end{aligned}$$
(1)

where L() is the reconstruction loss function that can be defined by the squared error for a linear decoder:

$$\begin{aligned} L(\mathbf {x},\hat{\mathbf {x}}) = \parallel \mathbf {x} - \hat{\mathbf {x}} \parallel ^2 \end{aligned}$$
(2)

or can be defined by the Bernoulli cross entropy loss function for a logistic sigmoid decoder:

$$\begin{aligned} L(\mathbf {x},\hat{\mathbf {x}}) = -\sum _{i=1}^{D} x_{i} \log {(\hat{x}_{i})} + (1-x_i)\log {(1-\hat{x}_i)}. \end{aligned}$$
(3)

The encoder function \(f(\mathbf {x})\) is of the form:

$$\begin{aligned} \mathbf {h} = s_h(\mathbf {W}_h\mathbf {x} + \mathbf {b}_h), \end{aligned}$$
(4)

where \(\mathbf {W}_h \in \mathbb {R}^{d_h \times D}\) and \(\mathbf {b}_h \in \mathbb {R}^{d_h}\) are encoder network parameters, and \(s_h(.)\) is a nonlinear element-wise function, commonly defined by logistic sigmoid \(sigmoid(z) = (1+\exp (-z))^{-1}\).

The decoder \(g(\mathbf {h})\) has the following form:

$$\begin{aligned} \hat{\mathbf {x}} = s_r(\mathbf {W}_r\mathbf {h} + \mathbf {b}_r), \end{aligned}$$
(5)

where \(\mathbf {W}_r \in \mathbb {R}^{D \times d_h}\) and \(\mathbf {b}_r \in \mathbb {R}^{D}\) are decoder network parameters. The typical choices for \(s_r\) are sigmoid and identity functions. To reduce the number of parameter to train and assist the training of autoencoder, the decoder may be tied to the encoder where the connection weight parameter are (\( \mathbf {W}_h = \mathbf {W}\), \(\mathbf {W}_r =\mathbf {W}^T\)). In this way, the autoencoder network parameters set is defined by \(\varTheta = \lbrace \mathbf {W}, \mathbf {b}_h, \mathbf {b}_r\rbrace \).

Contractive Autoencoder is constructed by adding up an additional penalty term to the reconstruction objective cost function in Eq. 1. The role of added penalty term is to contract the reconstructions at training samples and make the reconstruction process robust to small variations in input data. CAE achieves this through the regularizer formed by the squared Frobenius norm of Jacobian of the encoder function, \(\parallel J_{f}(x)\parallel _F^2\). The objective function in CAE is as follows:

$$\begin{aligned} \mathcal {J}_{CAE} = \sum _{_{\mathbf {x} \in {X}}} L(\mathbf {x},g(f(\mathbf {x}))) + \lambda \parallel J_{f}(\mathbf {x}) \parallel _F^2, \end{aligned}$$
(6)

where \(\lambda \) is a non-negative parameter controlling the strength of contraction regularizer. For an autoencoder with sigmoid encoder activation function the, jth row of Jacobian of the encoder, \(J_{f}(\mathbf {x})_{j.}\), can be obtained as follows:

$$\begin{aligned} J_{f}(\mathbf {x})_{j.} = \frac{\partial f_j(\mathbf {x})}{\partial \mathbf {x}} = f_j(\mathbf {x}) (1-f_j(\mathbf {x})) \mathbf {W}_{j.}, \end{aligned}$$
(7)

Recall the embedded submanifold assumption, the tangent space can be given by the image of Jacobian of submanifold embedding map (here in CAE the encoder function) which also equals to the column space of the Jacobian matrix. This motivates CAE to obtain the local tangent spaces of the points through the principal vectors of the transpose of the encoder Jacobian matrix. That is, CAE estimates the local tangent spaces by extracting the principal direction captured by the Jacobian matrix. In practice, at a given point \(\mathbf {x}\), CAE estimates the coordinate axes of a local tangent space by the Singular Value Decomposition of the transpose of the Jacobian matrix:

$$\begin{aligned} J_{f}^T(\mathbf {x}) = \mathbf {U}\mathbf {S}\mathbf {V}, \end{aligned}$$
(8)

where \(\mathbf {U}\) and \(\mathbf {V}\) are left- and right-singular vectors, and \(\mathbf {S}\) is the diagonal matrix containing singular values.

Spectral analysis of the Jacobian matrix shows that the majority of singular values of the Jacobian are zero (or close to zero) and only a few of them are reasonably large. Accordingly, the set of the orthonormal bases \(\mathcal {B}_{\mathbf {x}}\) is given by the set of columns vectors of \(\mathbf {U}\) corresponding to the dominant singular values

$$\begin{aligned} \mathcal {B}_{\mathbf {x}} = \lbrace \mathbf {U}_{.k} | \mathbf {S}_{kk}>\epsilon , \rbrace \end{aligned}$$
(9)

which spans the tangent space at point \(\mathbf {x}\), \({T}_{\mathbf {x}}\):

$$\begin{aligned} {T}_{\mathbf {x}} = \lbrace \mathbf {x} + \mathbf {v} | \mathbf {v}\in \overline{Span} (\mathcal {B}_{\mathbf {x}}) \rbrace . \end{aligned}$$
(10)

3 Multi-manifold Spectral Clustering

This section presents the main steps involved with the proposed multi-manifold spectral clustering (MMSC) algorithm. We briefly cover the methodologies used for estimating of local tangent spaces, computing of the affinities and spectral clustering.

Given the set of data points of the original input D-dimensional space \(X=\lbrace \mathbf {x}_i\rbrace _{i=1}^N\), the MMSC assumes the data reside on m different embedded d-dimensional submanifolds. The proposed algorithm aims to cluster the data points to m different groups through a three-step procedure.

3.1 Estimation of Tangent Spaces

A submanifold embedded in Euclidean space or some abstract manifold can implicitly be approximated by its local tangent spaces. A MMSC based on local tangent space affinities models data manifolds via their local tangent spaces. Indeed, the quality of the estimated local tangent spaces has direct impacts on manifold sampling and the accuracy of data affinities.

The proposed MMSC algorithm exploits a single-layer contractive autoencoder for tangent space estimation, even though a multi-layer CAE can also be utilized. To this end, first a single layer contractive autoencoder is trained with unlabelled data. The network is set up with over-complete structure where \(d_h > D\).

The trained CAE is then used to compute the encoder Jacobian and its SVD for each data point as formulated in Eqs. 7 and 8. The local tangent spaces of the point is extracted and stored as the set of column vectors of the left-singular vector of the Jacobian that corresponds to the dominant singular values, Eq. 9.

3.2 Estimation of Affinity Metrics

Following the smooth multi-manifold assumption, MMSC determines the affinity between a pair of points by the similarity between their local tangent spaces, captured by their principal angles.

For a given pair of points \(\mathbf {x}_i\) and \(\mathbf {x}_j\), let \(T_{\mathbf {x}_i}\) and \(T_{\mathbf {x}_j}\) be their tangent spaces respectively, and assume \(dim(T_{\mathbf {x}_i})=d_i\), \(dim(T_{\mathbf {x}_j})=d_j\) and \( d \ge d_i\ge d_j \ge 1\). The principal angles \(\theta _l\) between \(T_{\mathbf {x}_i}\) and \(T_{\mathbf {x}_j}\),

$$\begin{aligned} 0 \le \theta _1 \le \ldots \theta _l \le \ldots \le \theta _b \le \frac{\pi }{2} \end{aligned}$$
(11)

are recursively given as follows:

$$\begin{aligned} cos (\theta _l) = \max _{\mathbf {p_k}\in T_{\mathbf {x}_i}, \mathbf {q_k}\in T_{\mathbf {x}_j}} \mathbf {p}_k^T \mathbf {q}_k \ , \ \ \ \ k=1,\ldots ,l-1 \end{aligned}$$
(12)

subject to

$$\begin{aligned} \parallel \mathbf {p}_k \parallel = \parallel \mathbf {q}_k \parallel =1, \ \mathbf {p}^T \mathbf {p}_k = \mathbf {q}^T \mathbf {q}_k=0, \end{aligned}$$
(13)

where \(\mathbf {p}_k \) and \(\mathbf {q}_k\) are the set of the principal vectors of the pair tangent spaces.

Let \(\mathcal {N}_{k}(\mathbf {x}_i)\) and \(\mathcal {N}_{k}(\mathbf {x}_j)\) be the k-nearest neighbours sets that approximate the local proximity models at points \(\mathbf {x}_i\) and \(\mathbf {x}_j\). The MMSC computes the affinity as follows:

$$\begin{aligned} (A)_{ij} = \left\{ \begin{aligned}&( \prod _{l=1}^{d_i} cos(\theta _l) )^{\alpha }&\text {if } \mathbf {x}_i \in \mathcal {N}_k(\mathbf {x}_j), \mathbf {x}_j \in \mathcal {N}_k(\mathbf {x}_i) ,\\&0&\text {otherwise} \end{aligned} \right. \end{aligned}$$
(14)

Notice that here we use a similar affinity metric to the one defined in [22], but we employ CAE to compute the local tangent spaces.

3.3 Spectral Clustering

Given the affinity metrics, the spectral clustering algorithm is applied to the affinity matrix by which the data clusters are extracted.

4 Experiments

This section presents the qualitative and the quantitative results that evaluate the performance of the proposed algorithm. The proposed algorithm was applied to real world hyperspectral data and was evaluated by comparing with the clustering approaches based on the single manifold assumption and the MMSC algorithm based on local PCA. By this means, we would like to empirically analyse if the Contractive Autoencoder based multi-manifold clustering is an appropriate approach for a typical multi-manifold clustering of remote sensing hyperspectral imagery data.

4.1 Data

We evaluated our proposed algorithm with two different test hypercubes built on two benchmark hyperspectral remote sensing datasets: Salinas and Pavia Center Scene.

The Salinas dataset was collected by AVIRIS over Salinas valley in Southern California, USA. The hypercube data is characterized by 3.7 m pixel size and 224 bands (including the noisy bands). The data set consists of 16 different classes of land covers including: broccoli 1, broccoli 2, fallow, rough ploughed fallow, smooth fallow, stubble, celery, untrained grapes, soil/vineyard, senesced corn, Romaine lettuce 4wk, Romaine lettuce 5wk, Romaine lettuce 6wk, Romaine lettuce 7wk, untrained vineyard, vineyard with vertical trellis and shadow. Due to the close similarity between spectral features of the vineyard with vertical trellis and the untrained vineyard, the vineyard with vertical trellis class was excluded from the experiments. 900 samples from each land cover class were randomly chosen and for better visualization arranged in form of a rectangular cuboid. The RGB rendering and the ground truth of the test Salinas hypercube are shown in Fig. 1(a) and (b).

The Pavia Center Scene dataset was acquired by the reflective optics system imaging spectrometer (ROSIS) sensor over the city of Pavia, Italy. The hypercube data is characterized by 1.3 m pixel size and 102 bands with no noisy bands. The data set consists of 9 different classes of land cover including: water, trees, asphalt, self-blocking bricks, bitumen, tiles, shadows, meadows and bare soil. The shadow class was excluded from the experiments. 400 samples from each land cover class were randomly chosen and for better visualization arranged in form of a rectangular cuboid. The RGB rendering and the ground truth of the test Pavia Center Scene hypercube are shown in Fig. 2(a) and (b) respectively.

4.2 Experimental Set-Up

As described in Sect. 2, we computed the local tangent spaces using an overcomplete single-layer CAE. The architectures for CAEs used for Pavia Center Scene and Salinas test hypercubes are 102-150-102 and 204-250-204 respectively. The logistic sigmoid function was set for the encoder and the linear activation was set for the decoder. Since, the contraction coefficient, \(\lambda \) in Eq. 6, significantly affects the quality of CAE for local tangent estimation, we report CAE over different values of contraction coefficient \(\lambda \).

We trained all the networks with all available samples over the squared error function with Stochastic Gradient Descent (SGD) driven by the momentum acceleration. The standard Spectral Clustering (SC) algorithm with the affinity metric defined in Sect. 3.2 is considered as the common baseline in all the experiments. Aiming for fair evaluation, all the SC experiments were performed with pre-known number of cluster and over 5 trials with random initialization. The performance of the clustering results were reported by the overall average accuracy (ACC) and the macro averaged Positive Predictive Value (\(PPV_m\)). We should notice that the architecture and the parameters utilized in the experiments are not optimal, the results may even be improved with some other architecture.

Table 1. Highest overall clustering ACC and macro averaged PPV (in percentage) obtained by the considered algorithms with k nearest neighbours

4.3 Results

The clustering results obtained by the proposed CAE based MMSC (MMSC+CAE) and the competing methods, Spectral Clustering with PCA (SC+PCA), MMSC with local PCA (MMSC+LPCA), MMSC with basic Autoencoder (MMSC+AE) are presented in Table 1. The results on the overall Average Accuracy (ACC) and the macro average Positive Predictive Value \((PPV_m)\) show that MMSC+CAE outperforms all the other methods. In particular, in case of Pavia Center Scene test cube, CAE with the three reported contraction coefficients resulted higher clustering results compared to other methods.

Fig. 1.
figure 1

Clustering performance comparison on Salinas test hypercube; RGB color rendering in (a), followed by the corresponding ground truth in (b). Clustering results obtained by (c) SC+PCA, (d) MMSC+LPCA, (e) MMSC+AE, (f) MMSC+CAE \((\lambda =0.001)\), (g) MMSC+CAE \((\lambda =0.003)\), (h) MMSC+CAE \((\lambda =0.01)\).

Fig. 2.
figure 2

Clustering performance comparison on Pavia Center Scene test hypercube; RGB color rendering in (a), followed by the corresponding ground truth in (b). Clustering results obtained by (c) SC+PCA, (d) MMSC+LPCA, (e) MMSC+AE, (f) MMSC+CAE \((\lambda =0.001)\), (g) MMSC+CAE \((\lambda =0.003)\), (h) MMSC+CAE \((\lambda =0.010)\).

Figures 1 and 2 show the clustering maps obtained by the competing methods with Salinas and Pavia center test hypercubes respectively. The proposed MMSC+CAE with three different values of contraction coefficient are shown in Figs. f, g and h. The clustering maps show MMSC+CAE with reported coefficient values led to better clustering performance compared to the other methods. In particular, we observed that MMSC+CAE achieves better separation power on data classes of not very similar spectral classes, but it cannot fully resolve the data clusters with quite similar spectral features, e.g. in Salinas the spectral data from the untrained grape and the untrained vineyard classes, the first and the seventh blocks from above.

We suppose that this issue relates to the Jacobian of the encoder function in CAE used for tangent space estimation. To obtain tangent spaces the Jacobian can not differentiate the structures of data submanifolds within intersection regions and inevitably it results inaccurate tangent spaces that impairs the clustering performance. Overall, these observations indicate that to some extent CAE can effectively be used in MMSC for estimating tangent spaces but it will suffer from data with high spectral similarity, i.e. data with potential highly intersecting submanifolds.

5 Conclusions

Motivated by the ability of autoencoders in building up the local structure of data, this paper proposed a multi-manifold clustering algorithm for hyperspectral remote sensing imagery where a robust variant of autoencoder is employed for tangent space estimation. The proposed algorithm flows similar to the general multi-manifold spectral clustering (MMSC) based on local PCA, however, it employs CAE for local tangent space estimation.

We analysed the quality of the proposed algorithm with two test hypercubes extracted from two benchmark hyperspectral datasets: Pavia Center Scene and Salinas. The experiments show the outperformance of our proposed algorithm over the standard Spectral Clustering (SC) and the MMSC with local PCA, local weighted PCA and the basic Autoencoder. This confirms the efficiency of the MMSC based on CAE for unsupervised classification of hyperspectral imaging data. However, along with this leading performance, the experiments also reveal that CAE may face difficulties in case of data with intersecting manifold.

We believe that certain measures may be required to successfully extend CAE based MMSC to data consisting of multiple manifolds with possible intersections. In future work, we will consider the study and the extension of CAE for a data with multi-manifold structure with intersections. For the case of labelled data, this may be to develop a strategy to supervise the training process at intersection areas by utilizing the data labels. The current work only covers a single layer CAE and another future work will be to extend CAE to a multi-layer architecture. Furthermore, we are interested in extending our experiments to other remote sensing hyperspectral imaging data.