
1 Introduction

Unlike conventional images, which only cover the human visual spectral range with RGB bands, a hyperspectral image (HSI) covers a much larger spectral range with hundreds of narrow spectral bands. HSIs have been widely used in urban mapping, forest monitoring, environmental management and precision agriculture [1]. For most of these applications, HSI classification, which predicts the class label of each pixel in the image, is a fundamental task.

Compared with traditional classification problems [12,13,14, 16, 17], HSI classification is more challenging due to the curse of dimensionality [7], also known as the Hughes phenomenon [9]. To alleviate this problem, dimensionality reduction methods have been proposed, which can be divided into feature selection [3] and feature extraction (FE) [2] methods. The main purpose of feature selection is to preserve the most representative and crucial bands of the original dataset and discard those that make no contribution to the classification. By designing suitable criteria, feature selection methods can eliminate redundancy among adjacent bands and improve the discriminability of different targets. FE, on the other hand, seeks an appropriate feature mapping that transforms the original high-dimensional feature space into a low-dimensional one, where different objects tend to be more separable.

Witnessing the achievements of deep learning methods in the fields of computer vision and artificial intelligence [10, 11], researchers have found extracting deep features from hyperspectral data to be a promising direction. In [5], Chen et al. introduced the concept of deep learning into HSI classification for the first time, using multilayer stacked autoencoders (SAEs) to extract deep features. After the pre-training stage, the deep network is fine-tuned with the reference data through a logistic regression classifier. In a like manner, a deep belief network (DBN) based spectral-spatial classification method for HSI is proposed in [6], where both the single-layer restricted Boltzmann machine and the multilayer DBN framework are analyzed in detail.

Although these deep learning based methods possess better generalization ability than shallow methods, they mainly focus on the spectral integrality and directly classify the image in the whole feature space. However, the pixel vectors in an HSI are also sequential data, and the contextual information among adjacent bands is discriminative for the recognition of different objects.

To overcome the aforementioned drawbacks, we propose to use the long short-term memory (LSTM) model, a refined variant of recurrent neural networks (RNNs), to extract spectral features. Considering the high dimensionality of hyperspectral data, two novel grouping strategies are proposed to better learn the contextual features among adjacent bands. The major contributions of this study are summarized as follows.

  1. As far as we know, this is the first time an end-to-end LSTM architecture has been proposed for HSI classification that takes the contextual information among adjacent bands into consideration.

  2. Two novel grouping strategies are proposed so that the LSTM can better learn the contextual features among adjacent bands. Compared with the traditional band-by-band strategy, the proposed methods avoid an excessively deep network for the HSI.

The rest of this paper is organized as follows. Section 2 describes the spectral classification with the LSTM in detail. The data sets used in this study and the experimental results are given in Sect. 3. Conclusions and further discussions are summarized in Sect. 4.

2 Spectral Classification with LSTM

In this section, we first give a brief introduction to RNNs and the LSTM. Then, strategies for processing the spectral information with the LSTM are presented.

2.1 RNNs

RNNs [15] are important models for processing sequential data, allowing cyclical connections between neural activations at different time steps.

Fig. 1. The architecture of a recurrent neural network.

The architecture of a recurrent neural network is shown in Fig. 1. Given a sequence of values \(x^{\left( 1\right) },x^{\left( 2\right) },\ldots ,x^{\left( \tau \right) }\), the following update equations are applied for each time step from \(t=1\) to \(t=\tau \).

$$\begin{aligned} h^{\left( t\right) } = g\left( b_a+Wh^{\left( t-1\right) }+Ux^{\left( t\right) }\right) \end{aligned}$$
(1)
$$\begin{aligned} o^{\left( t\right) } = b_o+Vh^{\left( t\right) } \end{aligned}$$
(2)
$$\begin{aligned} g\left( x\right) = \mathrm{tanh}\left( x\right) =\frac{e^x-e^{-x}}{e^x+e^{-x}} \end{aligned}$$
(3)

where \(b_a\) and \(b_o\) denote bias vectors. U, V and W are the weight matrices for input-to-hidden, hidden-to-output and hidden-to-hidden connections, respectively. \(x^{\left( t\right) }\), \(h^{\left( t\right) }\) and \(o^{\left( t\right) }\) are the input, hidden and output values at time t, respectively. The initial hidden state \(h^{\left( 0\right) }\) in (1) is initialized with Gaussian random values.
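To make the recurrence concrete, the following NumPy sketch implements one forward pass of (1)-(3); the shapes, the parameter packing and the Gaussian scale of 0.01 for \(h^{\left( 0\right) }\) are illustrative assumptions of ours, not taken from the paper.

```python
import numpy as np

def rnn_forward(x_seq, U, W, V, b_a, b_o):
    """Forward pass of Eqs. (1)-(3) over a sequence.

    x_seq: (tau, input_dim) array, one row per time step.
    U: (hidden, input) input-to-hidden weights.
    W: (hidden, hidden) hidden-to-hidden weights.
    V: (output, hidden) hidden-to-output weights.
    """
    h = np.random.randn(W.shape[0]) * 0.01   # h^(0): Gaussian initialization
    outputs = []
    for x_t in x_seq:                        # t = 1, ..., tau
        h = np.tanh(b_a + W @ h + U @ x_t)   # Eq. (1): hidden update
        outputs.append(b_o + V @ h)          # Eq. (2): output value
    return np.stack(outputs), h              # all o^(t) and the final h^(tau)
```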

From (1) we can see that the hidden value of an RNN is determined by both the input signal at the current time step and the hidden value at the previous time step. In this manner, both the contextual information and the underlying pattern of the sequential data can be discovered. For the classification task, a softmax function can be added at the last time step to calculate the probability that the input data belongs to the ith category.

$$\begin{aligned} P\left( y=i\,|\,\theta ,b\right) = s\left( o^{\left( \tau \right) }\right) _i = \frac{e^{\theta _io^{\left( \tau \right) }+b_i}}{\sum _{j=1}^{k}e^{\theta _jo^{\left( \tau \right) }+b_j}} \end{aligned}$$
(4)

where \(\theta \) and b are the weight matrix and bias vector, respectively. k is the number of classes. The loss function of the whole network can be defined as

$$\begin{aligned} \mathcal {L} = -\frac{1}{m}\sum _{i=1}^m\left[ y_i\mathrm{log}\left( \hat{y}_i\right) +\left( 1-y_i\right) \mathrm{log}\left( 1-\hat{y}_i\right) \right] \end{aligned}$$
(5)

where \(y_i\) and \(\hat{y}_i\) denote the label and the predicted label of the ith sample, respectively, and m is the number of training samples. The optimization of an RNN can be accomplished by mini-batch stochastic gradient descent with the back-propagation through time (BPTT) algorithm [19].
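As a sketch only, the softmax of (4) and the loss of (5) can be written as follows in NumPy; the numerical-stability shift and the clipping constant are standard additions of ours, not part of the paper.

```python
import numpy as np

def softmax_probs(o_tau, theta, b):
    """Class probabilities from the last-step output, Eq. (4).

    o_tau: output vector at the last time step; theta: (k, dim)
    weight matrix; b: (k,) bias vector.
    """
    logits = theta @ o_tau + b
    logits -= logits.max()        # shift for numerical stability
    e = np.exp(logits)
    return e / e.sum()

def loss(y, y_hat, eps=1e-12):
    """Mean cross-entropy over m samples, Eq. (5).

    y, y_hat: (m,) arrays of labels and predictions in [0, 1].
    """
    y_hat = np.clip(y_hat, eps, 1.0 - eps)   # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
```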

2.2 LSTM

Fig. 2. Illustration of an LSTM model.

The main challenge in training an RNN comes from long-term dependencies: gradients tend to either vanish or explode during the back-propagation phase. To mitigate this problem, a gated RNN called the LSTM was proposed in [8]. The core component of the LSTM is the memory cell, which replaces the hidden unit of traditional RNNs. As shown in Fig. 2, there are four main elements in the memory cell: an input gate, a forget gate, an output gate and a self-recurrent connection. The forward propagation of the LSTM for time step t is defined as follows.

Input gate:

$$\begin{aligned} i^{\left( t\right) } = \sigma \left( W_ix^{\left( t\right) }+U_ih^{\left( t-1\right) }+b_i\right) \end{aligned}$$
(6)

Forget gate:

$$\begin{aligned} f^{\left( t\right) } = \sigma \left( W_fx^{\left( t\right) }+U_fh^{\left( t-1\right) }+b_f\right) \end{aligned}$$
(7)

Output gate:

$$\begin{aligned} o^{\left( t\right) } = \sigma \left( W_ox^{\left( t\right) }+U_oh^{\left( t-1\right) }+b_o\right) \end{aligned}$$
(8)

Cell state:

$$\begin{aligned} c^{\left( t\right) } = i^{\left( t\right) }\odot g\left( W_cx^{\left( t\right) }+U_ch^{\left( t-1\right) }+b_c\right) + f^{\left( t\right) }\odot c^{\left( t-1\right) } \end{aligned}$$
(9)

LSTM output:

$$\begin{aligned} h^{\left( t\right) } = o^{\left( t\right) }\odot g\left( c^{\left( t\right) }\right) \end{aligned}$$
(10)

where \(W_i\), \(W_f\), \(W_o\), \(W_c\), \(U_i\), \(U_f\), \(U_o\) and \(U_c\) are weight matrices and \(b_i\), \(b_f\), \(b_o\) and \(b_c\) are bias vectors. \(\sigma \left( x\right) =1/\left( 1+\mathrm{exp}\left( -x\right) \right) \) is the sigmoid function and \(\odot \) denotes the element-wise (Hadamard) product.
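A single time step of (6)-(10) can be sketched in NumPy as below; the dict-based parameter packing and all shapes are our own illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, P):
    """One LSTM time step implementing Eqs. (6)-(10).

    P holds the weight matrices W_*, U_* and bias vectors b_* named
    as in the text; `*` below is the element-wise (Hadamard) product.
    """
    i = sigmoid(P['W_i'] @ x_t + P['U_i'] @ h_prev + P['b_i'])   # input gate, Eq. (6)
    f = sigmoid(P['W_f'] @ x_t + P['U_f'] @ h_prev + P['b_f'])   # forget gate, Eq. (7)
    o = sigmoid(P['W_o'] @ x_t + P['U_o'] @ h_prev + P['b_o'])   # output gate, Eq. (8)
    c = i * np.tanh(P['W_c'] @ x_t + P['U_c'] @ h_prev + P['b_c']) \
        + f * c_prev                                             # cell state, Eq. (9)
    h = o * np.tanh(c)                                           # LSTM output, Eq. (10)
    return h, c
```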

Similar to traditional RNNs, the LSTM network can be trained by mini-batch stochastic gradient descent with the BPTT algorithm; refer to [8] for a more detailed description.

2.3 The Proposed Band Grouping Based LSTM Algorithm

Previous studies have shown that deep architectures possess better generalization ability when dealing with complicated spectral structures [5, 6]. While existing methods focus on the integrality of the spectra, the LSTM network pays more attention to the contextual information among adjacent sequential data. Therefore, how to divide the hyperspectral vector into different sequences in a proper way is crucial to the performance of the network. A natural idea is to consider each band as a time step and input one band at a time. However, hyperspectral data usually have hundreds of bands, which makes the LSTM network too deep to train in such a circumstance. Thus, a suitable grouping strategy is needed.

Let n be the number of bands and \(\tau \) be the number of time steps in the LSTM. The sequence length of each time step is then defined as \(m=floor\left( n/\tau \right) \), where \(floor\left( x\right) \) denotes rounding x down. For each pixel in the hyperspectral image, let \(z=\left[ z_1,z_2,\ldots ,z_i,\ldots ,z_n\right] \) be the spectral vector, where \(z_i\) is the reflectance of the ith band. The transformed sequences are then denoted by \(x=\left[ x^{\left( 1\right) },x^{\left( 2\right) },\ldots ,x^{\left( i\right) },\ldots ,x^{\left( \tau \right) }\right] \), where \(x^{\left( i\right) }\) is the sequence at the ith time step. In what follows, we introduce the two grouping strategies proposed in this paper.

Fig. 3. (a) Grouping strategy 1: adjacent bands are divided into the same sequence according to the spectral order. (b) Grouping strategy 2: every group covers a large spectral range. The bands marked with the same color are fed into the LSTM network at each time step.

Grouping Strategy 1: Divide the spectral vector into different sequences according to the spectral order:

$$\begin{aligned} x^{\left( 1\right) }&=\left[ z_1,z_2,\ldots ,z_m\right] \\ &\;\;\vdots \\ x^{\left( i\right) }&=\left[ z_{\left( i-1\right) m+1}, z_{\left( i-1\right) m+2},\ldots ,z_{im}\right] \\ &\;\;\vdots \\ x^{\left( \tau \right) }&=\left[ z_{\left( \tau -1\right) m+1}, z_{\left( \tau -1\right) m+2},\ldots ,z_{\tau m}\right] \end{aligned}$$
(11)

where \(x^{\left( i\right) }\) is the sequence at time step i. As shown in Fig. 3(a), strategy 1 keeps the signals inside a group continuous without any intervals, and each group concentrates on a narrow spectral range. The spectral distance between different time steps is therefore relatively long.

Grouping Strategy 2: Divide the spectral vector in an interleaved manner, assigning every \(\tau \)th band to the same sequence:

$$\begin{aligned} x^{\left( 1\right) }&=\left[ z_1,z_{1+\tau },\ldots ,z_{1+\tau \left( m-1\right) }\right] \\ &\;\;\vdots \\ x^{\left( i\right) }&=\left[ z_i,z_{i+\tau },\ldots ,z_{i+\tau \left( m-1\right) }\right] \\ &\;\;\vdots \\ x^{\left( \tau \right) }&=\left[ z_{\tau },z_{2\tau },\ldots ,z_{\tau m}\right] \end{aligned}$$
(12)

Compared with strategy 1, each group in this case covers a larger spectral range and the spectral distance between different time steps is much shorter, as shown in Fig. 3(b).
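Both strategies amount to a simple reshape of the spectral vector. The following NumPy sketch illustrates them under the assumption, not stated in the text, that the trailing \(n-\tau m\) bands are discarded.

```python
import numpy as np

def group_spectral_vector(z, tau, strategy=1):
    """Split a length-n spectral vector into tau sequences of length m.

    m = floor(n / tau); the trailing n - tau*m bands are dropped
    (an assumption, since the text does not specify this case).
    """
    m = z.shape[0] // tau
    z = z[:tau * m]
    if strategy == 1:
        return z.reshape(tau, m)      # Eq. (11): blocks of adjacent bands
    return z.reshape(m, tau).T        # Eq. (12): every tau-th band per step

# Example with the 103 Pavia University bands and tau = 3, i.e. m = 34:
z = np.arange(103, dtype=float)
print(group_spectral_vector(z, 3, 1)[:, :3])  # [[0 1 2] [34 35 36] [68 69 70]]
print(group_spectral_vector(z, 3, 2)[:, :3])  # [[0 3 6] [1 4 7] [2 5 8]]
```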

After grouping the spectral vector z into the sequences \(x^{\left( 1\right) },\ldots ,x^{\left( \tau \right) }\), the LSTM network can be utilized to extract the contextual features among adjacent spectra. A fully connected (FC) layer and a softmax layer are added after the LSTM to accomplish the image classification. The complete spectral classification framework is illustrated in Fig. 4.

Fig. 4. Spectral classification with the proposed LSTM network. FC denotes the fully connected layer. The spectral vector of each pixel is divided into several groups, and at each time step one group of spectra is fed into the LSTM network.
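For illustration, the Fig. 4 pipeline could be re-implemented with tf.keras as sketched below; this is not the authors' Theano code. The 3 time steps and the 128-neuron FC layer follow Sect. 3.2, while the LSTM hidden size, the ReLU activation and the class count are our assumptions.

```python
import tensorflow as tf

# Sketch of the Fig. 4 pipeline (LSTM -> FC -> softmax) in tf.keras.
# tau = 3 and the 128-neuron FC layer follow Sect. 3.2; n_hidden, the
# ReLU activation and n_classes are assumptions for illustration.
tau, m, n_hidden, n_classes = 3, 34, 64, 9

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(n_hidden, input_shape=(tau, m)),    # contextual features
    tf.keras.layers.Dense(128, activation='relu'),           # FC layer
    tf.keras.layers.Dense(n_classes, activation='softmax'),  # softmax classifier
])
model.compile(optimizer='sgd', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# Training input: grouped spectra of shape (num_pixels, tau, m);
# labels: integers in [0, n_classes).
```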

3 Experimental Results and Analysis

3.1 Data Description

In our experiments, two benchmark hyperspectral data sets, Pavia University and Indian Pines, are utilized to evaluate the performance of the proposed method.

The first data set was acquired by the Reflective Optics System Imaging Spectrometer (ROSIS) sensor over the University of Pavia, northern Italy. The image consists of 103 spectral bands with 610 \(\times \) 340 pixels; it has a spectral coverage from 0.43 \(\upmu \)m to 0.86 \(\upmu \)m and a spatial resolution of 1.3 m. The training and test sets are listed in Table 1.

Table 1. Number of training and test samples used in the Pavia University data set.

The second data set was gathered by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over the Indian Pines test site in Northwestern Indiana. After the removal of the water absorption bands, the image consists of 200 spectral bands with 145 \(\times \) 145 pixels. It has a spectral coverage from 0.4 \(\upmu \)m to 2.5 \(\upmu \)m and a spatial resolution of 20 m. The training and test sets are listed in Table 2.

Table 2. Number of training and test samples used in the Indian Pines data set.
Fig. 5. The performance of the LSTM with different numbers of time steps.

All the experiments in this paper are repeated 30 times with randomly selected training data. The overall accuracy (OA) and the Kappa coefficient [1] are utilized to quantitatively evaluate the different methods; both the average value and the standard deviation are reported. The experiments are run on an Intel i7-5820K 3.30-GHz processor with 32 GB of RAM and an NVIDIA GTX 1080 graphics card.

3.2 Analysis About the LSTM

In this subsection, we first evaluate the two grouping strategies proposed in this paper with different numbers of time steps. As shown in Fig. 5, strategy 2 outperforms strategy 1 on both data sets. The reason behind this phenomenon lies in two aspects. First, a sequence divided by strategy 2 covers a wider spectral range than one divided by strategy 1, which means more abundant spectral information is fed into the LSTM cell at each time step. Second, the spectral distance between different time steps is much shorter in strategy 2, which makes it easier for the LSTM to learn the contextual features among adjacent spectral bands. Besides, as the number of time steps increases, the OA tends to rise first and then level off or decrease. This result shows that an overly deep architecture may not be suitable for the LSTM to extract spectral features. For all the data sets in this paper, we set the number of time steps to 3 and the number of neurons in the FC layer to 128.

Fig. 6. Classification maps for the Pavia University data set. (a) The false color image. (b) Ground-truth map. (c) Raw. (d) PCA. (e) SAE. (f) LSTM-band-by-band. (g) LSTM-strategy 1. (h) LSTM-strategy 2.

3.3 Classification Results

In this subsection, we report the classification results of the proposed methods along with those of other approaches, including raw (classification of the original spectral features with an RBF-SVM), PCA (classification of the first 20 principal components with an RBF-SVM) and SAE [5] (spectral classification with an SAE). We use LibSVM [4, 18] for the SVM classification in our experiments; the range of the regularization parameter for the five-fold cross-validation is from \(2^{8}\) to \(2^{10}\). The LSTM is implemented under Theano 0.8.2 and the other experiments in this paper are carried out under MATLAB 2012b. The classification maps of the different methods are shown in Figs. 6 and 7 and the quantitative assessment is given in Tables 3 and 4.
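The RBF-SVM baseline can be sketched with scikit-learn, whose SVC wraps libsvm; the grid over C follows the stated range \(2^{8}\) to \(2^{10}\), while the gamma setting is a placeholder of ours, since its search range is not reported.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# RBF-SVM baseline via scikit-learn (SVC wraps libsvm). The C grid
# follows the paper's 2^8..2^10; gamma='scale' is our placeholder.
param_grid = {'C': [2.0 ** 8, 2.0 ** 9, 2.0 ** 10], 'gamma': ['scale']}
svm = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
# X_train: (num_samples, num_bands) spectra; y_train: integer labels.
# svm.fit(X_train, y_train); accuracy = svm.score(X_test, y_test)
```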

Fig. 7. Classification maps for the Indian Pines data set. (a) The false color image. (b) Ground-truth map. (c) Raw. (d) PCA. (e) SAE. (f) LSTM-band-by-band. (g) LSTM-strategy 1. (h) LSTM-strategy 2.

Table 3. Classification results of the Pavia University data set.
Table 4. Classification results of the Indian Pines data set.

As shown in Tables 3 and 4, the LSTM with the traditional band-by-band input fails to achieve a high accuracy and performs even worse than the shallow methods on both data sets. By contrast, the LSTM with the proposed grouping strategies yields better results, improving the overall accuracy by about 5% to 13% on the different data sets. The main reason is that the band-by-band strategy generates an overly deep network and may cause information loss along the recurrent connections. Take the Pavia University data set as an example: the number of time steps in this case reaches 103, so after unfolding the LSTM the depth of the network is also 103, making the network very hard to train. In general, the LSTM with the proposed strategy 2 achieves the best results among the spectral FE methods.

4 Conclusions

In this paper, we have proposed a band grouping based LSTM algorithm for HSI classification. The proposed method has the following characteristics. (1) It takes the contextual information among adjacent bands into consideration, which is ignored by existing methods. (2) Two novel grouping strategies are proposed to better train the LSTM; compared with the traditional band-by-band strategy, the proposed methods avoid an excessively deep network for the HSI and yield better results.

Since the proposed method mainly focuses on deep spectral FE, our future work will take deep spatial features into consideration as well.