1 Introduction

Compared with ordinary panchromatic and RGB images, a hyperspectral image (HSI) of a natural scene effectively describes the spectral distribution and provides intrinsic, discriminative spectral information about the scene. HSIs have proven beneficial to numerous applications, including segmentation [43], classification [49], anomaly detection [50], face recognition [39], document analysis [28], food inspection [48], surveillance [37], and earth observation [6], to name a few.

Hyperspectral cameras are widely used for HSI acquisition, which requires densely sampling the spectral signature across consecutive wavelength bands for every scene point. Such devices often come at a high cost and tend to suffer from degraded spatial or temporal resolution.

Recently, some methods [3, 22, 38, 42] have been presented to directly recover the HSI from a single RGB image. Since the mapping from RGB to spectrum is three-to-many, prior knowledge has to be introduced. Examples include radial basis function network mapping [38], K-SVD based sparse coding [3], constrained sparse coding [42], and manifold-based mapping [22]. In particular, Jia et al. [22] reveal the nonlinear characteristics of natural spectra and show that a properly designed nonlinear mapping can significantly boost recovery accuracy. Inspired by this observation, we propose a spectral convolutional neural network (CNN) to better approximate the underlying nonlinear mapping. In contrast to the pixel-wise operations in [3, 22, 38], we further propose a properly designed spatial CNN to better exploit the spatial similarity in the HSI. In addition, the input RGB image is employed to guide the HSI reconstruction, and residual learning is used to further preserve the spatial structure in our network. Experimental results show that our recovery network outperforms state-of-the-art methods in terms of quantitative metrics and perceptual quality, and that both the spectral and spatial CNN modules contribute to this performance gain.

Existing methods [3, 22, 38, 42] mainly focus on HSI recovery under a given camera spectral sensitivity (CSS) function, while [4] shows that the quality of spectral recovery is sensitive to the CSS used. For example, given a CSS dataset, selecting the optimal CSS may improve accuracy by 33%, as shown in [4]. Rather than using exhaustive search, [4] adopts an evolutionary optimization methodology to choose the optimal CSS, which still needs to train the recovery method multiple times and thus incurs high time complexity.

Through experiments, we have found that the performance of our CNN-based recovery method depends on the CSS as well. This motivates us to develop a CNN-based CSS selection method requiring only a single training process, which works jointly with our HSI recovery method at low time complexity. In this work, we propose a novel CSS selection layer, which automatically determines the optimal CSS from the network weights under a nonnegative sparse constraint. As illustrated in Fig. 1, this filter selection layer is appended to the recovery network, which jointly selects the proper CSS and learns the mapping for HSI recovery from a single RGB image captured with the algorithmically selected CSS. Experimental results show that the selection layer always yields a CSS consistent with the best one determined by exhaustive search. Compared with an arbitrarily chosen CSS, this optimal CSS further boosts spectral recovery accuracy. To the best of our knowledge, this work is the first to integrate optimal CSS selection with HSI recovery in a unified CNN-based framework, which boosts HSI recovery fidelity at much lower complexity.

Our main contributions are that we

  • Design a CNN-based HSI recovery network to account for spectral nonlinear mapping and utilize spatial similarity in the image plane domain;

  • Develop a CSS selection layer that retrieves the optimal CSS by imposing a nonnegative sparse constraint on the weight factors;

  • Jointly determine the optimal CSS and learn an accurate HSI recovery mapping in a single training process.

Fig. 1. Overview of the proposed method, which combines optimal CSS selection and HSI recovery into a unified CNN-based framework. The parameters of the two components are first learned jointly. Then, RGB images captured under the selected optimal CSS serve as inputs, and the underlying HSIs are reconstructed by the HSI recovery network. Black arrows show the training process and red arrows the testing process. (Color figure online)

2 Related Work

Hyperspectral imaging can effectively provide discriminative spectral information about the scene. To obtain HSIs, whiskbroom and pushbroom scanning systems [5, 41] are widely used to capture the scene point by point or line by line. RGB or monochromatic cameras with varying filters [9, 10, 51] or specific illuminations [19, 40] have also been used to capture HSIs. However, all these methods scan along the spatial or spectral dimension and thus suffer from low temporal resolution. To capture dynamic scenes, snapshot hyperspectral cameras [8, 14–16] were developed to capture full 3D HSIs, but they sacrifice spatial resolution.

To obtain high-resolution HSIs in real time, several coding-based hyperspectral imaging approaches have been presented, relying on compressive sensing (CS) theory. CASSI [17, 44] employs a coded aperture with a disperser to uniformly encode the spectral signals into 2D space. DCCHI [45, 46] incorporates a co-located panchromatic camera to collect more information simultaneously with the CASSI measurement. SSCSI [33] jointly encodes the spatial and spectral dimensions in a single gray image. Besides, fusion-based approaches [1, 11, 12, 25, 31, 32, 35] have been presented. These approaches rely on a hybrid camera system, in which a low spatial resolution hyperspectral camera and a high spatial resolution RGB camera are mounted coaxially. The two captured images are fused into a high resolution HSI, which has the same spatial resolution as the RGB image and the same spectral resolution as the input HSI. All these coding-based and fusion-based hyperspectral imaging systems demand either high-precision optical design or an expensive hyperspectral camera.

To avoid the specialized devices mentioned above, i.e., multiple illuminations, filters, coded apertures, and hyperspectral cameras, HSI recovery from a single RGB image has attracted growing attention. Recovering the spectrum from the three values provided by an RGB camera amounts to a three-to-many mapping, which is severely underdetermined in general.

To unambiguously determine the spectrum, some prior knowledge on the mapping is introduced. Nguyen et al. [38] learned the mapping between white balanced RGB values and illumination-free spectral signals based on a radial basis function network. Arad and Ben-Shahar [3] built a large hyperspectral dataset for natural scenes, and derived the mapping between hyperspectral signatures and their RGB values under a dictionary learned by K-SVD. Robles-Kelly [42] reconstructed the illumination-free HSI based on a constrained sparse coding approach by using a set of prototypes extracted from the training set. Jia et al. [22] proposed a two-step manifold-based mapping method, which highlighted the role of nonlinear mapping in spectral recovery. In this work, we present a spectral CNN module to better account for spectral nonlinear mapping, and a spatial CNN module to further incorporate the spatial similarity.

Arad and Ben-Shahar [4] first recognized that the quality of HSI recovery from a single RGB image is sensitive to the CSS selection. To avoid the heavy computational cost of exhaustive search, they proposed an evolutionary optimization based selection strategy. However, training still has to be conducted multiple times under different CSS instances. In this work, we propose a CSS selection layer under a nonnegative sparse constraint, and jointly select the optimal CSS and learn the mapping for HSI recovery in a unified CNN-based framework. This is achieved in a single training process, in contrast to the repeated training in [4].

3 Joint Optimal CSS Selection and HSI Recovery

In this section, we present a CNN-based method for simultaneous optimal CSS selection and HSI recovery from a single RGB image. The overall framework is shown in Fig. 1. In the training stage, given a large set of CSS functions and HSIs, we first synthesize multiple RGB images for each HSI under different CSS functions, which form the input of the network. The optimal CSS selection network is used to select the best CSS and the corresponding RGB channels. In the HSI recovery network, we design a spectral CNN to approximate the complex nonlinear mapping between the RGB space and the spectral space, and a spatial CNN to exploit the spatial similarity. The CSS selection network and the spectral recovery network are combined to recover the HSI, which should be close to its corresponding ground truth in the training dataset. In the testing stage, the input RGB image is captured under the selected CSS, and the HSI is obtained by feeding this RGB image into the recovery network learned in the training stage.

In the following, we first motivate our network structure by analyzing common approaches for HSI recovery from a single RGB image. Then, we introduce our CNN-based method for both HSI recovery and optimal CSS selection. Finally, learning details are provided.

3.1 Preliminaries and Motivation

Let \(\mathbf {Y}\in \mathbb {R}^{3\times M}\) and \(\mathbf {X}\in \mathbb {R}^{B \times M}\) denote the input RGB image and the recovered HSI, where M and B are the number of pixels and bands in the HSI. The relationship between \(\mathbf {Y}\) and \(\mathbf {X}\) can be described as

$$\begin{aligned} \mathbf {Y} = \mathbf {C}\mathbf {X}, \end{aligned}$$
(1)

where \(\mathbf {C} \in \mathbb {R}^{3\times B}\) denotes the RGB CSS function.
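To make this forward model concrete, the following NumPy sketch synthesizes an RGB image from an HSI according to Eq. (1). The shapes follow the paper's notation; the data and variable names are illustrative placeholders, not the authors' code.

```python
import numpy as np

B, M = 31, 64 * 64            # number of bands and pixels (placeholder sizes)
X = np.random.rand(B, M)      # HSI: one spectrum per column (placeholder data)
C = np.random.rand(3, B)      # CSS function: a 3 x B projection matrix

Y = C @ X                     # RGB image, shape (3, M), as in Eq. (1)
assert Y.shape == (3, M)
```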

Most state-of-the-art methods assume that the CSS function is known and model HSI recovery from a single RGB image as

$$\begin{aligned} E(\mathbf {X}) = E_{d}(\mathbf {X},\mathbf {Y})+\lambda E_{s}(\mathbf {X}), \end{aligned}$$
(2)

where the first term \(E_{d}(\mathbf {X},\mathbf {Y})\) is the data term, which guarantees that the recovered \(\mathbf {X}\) projects to \(\mathbf {Y}\) under the CSS function \(\mathbf {C}\). The second term \(E_{s}(\mathbf {X})\) is the prior regularization on \(\mathbf {X}\).

The models for the first term in the previous works [3, 22, 38, 42] can be generally described as

$$\begin{aligned} E_{d}(\mathbf {X},\mathbf {Y}) = \Vert f_{d}(\mathbf {X})-\mathbf {Y}\Vert _{F}^{2}, \end{aligned}$$
(3)

where the function \(f_{d}\) is a linear mapping in [3, 42] and a spectral nonlinear mapping in [22, 38]. [22] shows that the nonlinear mapping can effectively assist HSI recovery, compared with the linear constraint in [3].

In addition, [3, 42] assume that the spectra can be sparsely described by several bases, which means

$$\begin{aligned} E_{s}(\mathbf {X}) = \Vert \mathbf {D}\varvec{\alpha }-\mathbf {X}\Vert _{F}^{2}+\Vert \varvec{\alpha }\Vert _{1}, \end{aligned}$$
(4)

where \(\mathbf {D}\) is the learned spectral dictionary and \(\varvec{\alpha }\) contains the corresponding sparse coefficients. [38] and [22] implicitly assume that the spectral information lies in a low-dimensional space.
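As a concrete illustration of how a sparse prior like Eq. (4) can be enforced, the sketch below solves the resulting sparse coding problem with ISTA. This is a generic solver under the assumption of a pre-learned dictionary (e.g., via K-SVD as in [3]); it is not the specific optimization used in the cited works.

```python
import numpy as np

def ista_sparse_code(x, D, lam=0.1, n_iter=200):
    """Solve min_a 0.5*||D a - x||^2 + lam*||a||_1 via ISTA (illustrative)."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)           # gradient of the data term
        a = a - grad / L                   # gradient descent step
        a = np.sign(a) * np.maximum(np.abs(a) - lam / L, 0.0)  # soft threshold
    return a

# Usage: reconstruct one spectrum from its sparse code over the dictionary.
B, n_atoms = 31, 100
D = np.random.randn(B, n_atoms)            # stands in for a learned dictionary
x = np.random.randn(B)                     # one spectral signature
x_hat = D @ ista_sparse_code(x, D)
```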

Furthermore, since neighboring pixels in the recovered HSI \(\mathbf {X}\) should be similar, [42] also imposes the spatial constraint

$$\begin{aligned} E_{s}(\mathbf {X}) = \Vert f_{s}(\mathbf {X})\Vert _{F}^{2}, \end{aligned}$$
(5)

where the function \(f_{s}\) denotes the local spatial operation.

Based on these analyses, we present a CNN-based method to recover the HSI from a single RGB image, which effectively learns the nonlinear spectral mapping and the spatial structure information to improve the recovered HSI.

Besides, Eq. (1) shows that the quality of the recovered HSI \(\mathbf {X}\) is influenced by both the input RGB image \(\mathbf {Y}\) and the CSS function \(\mathbf {C}\). Meanwhile, [4] shows that the choice of CSS significantly affects the quality of HSI recovery. To boost the accuracy of the recovered HSI, it is thus essential to select the optimal CSS as well. Therefore, our method models optimal CSS selection and HSI recovery in a unified CNN-based framework.

Fig. 2. The architecture of CNN-based HSI recovery from a single RGB image.

3.2 HSI Recovery

Previous works have shown that effectively exploiting the underlying characteristics of the HSI, namely spectral nonlinear mapping [22, 38] and spatial similarity [42], can yield high quality HSI reconstruction from a single RGB image. Compared with these approaches, our method uses multiple CNN layers in the spectral CNN to deeply learn the nonlinear mapping between the spectral and RGB spaces, and employs DenseNet blocks [21] and ResNet modules [20] in the spatial CNN to enlarge the receptive field and capture more spatial similarity in the spatial domain. Besides, our method uses the input RGB image to further guide the HSI recovery and residual learning to preserve the spatial structure. Figure 2 shows the HSI recovery network.

Spectral CNN. Previous works on spectral recovery from a single RGB image [3, 22, 38, 42] mainly consider the spectral mapping between the input RGB image and the recovered HSI. It is well known that CNNs can effectively learn nonlinear mappings. Thus, we design a spectral CNN, consisting of L layers, to learn the spectral nonlinear mapping between the RGB values and the corresponding spectrum. The output of the l-th layer is expressed as

$$\begin{aligned} \begin{aligned} \mathbf {F}_{l}&= \text {ReLU}(\mathbf {W}_{l}*\mathbf {F}_{l-1}+\varvec{b}_{l}), \quad \text {where} \quad \mathbf {F}_{0} = \mathbf {Y}, \end{aligned} \end{aligned}$$
(6)

where \(\text {ReLU}(x) = \max \{x,0\}\) denotes a rectified linear unit [36], and \(\mathbf {W}_{l}\) and \(\varvec{b}_{l}\) represent the filters and biases of the l-th layer, respectively. In the first layer, we compute \(a_{0}\) feature maps using an \(s_{1}\times s_{1}\) receptive field, where \(a_{0} = 64\) in our network; the filters are of size \(3 \times s_{1}\times s_{1}\times a_{0}\), since the input is an RGB image. In the 2nd to \((L-1)\)-th layers, we again compute \(a_{0}\) feature maps using an \(s_{1}\times s_{1}\) receptive field and a rectified linear unit, with filters of size \(a_{0} \times s_{1}\times s_{1}\times a_{0}\). Finally, the last layer uses the same receptive field with filters of size \(a_{0} \times s_{1}\times s_{1}\times B\). In the experiments, we set \(L=5\).

To perform a purely spectral nonlinear mapping, we set \(s_{1} = 1\), i.e., the receptive field is \(1\times 1\) in the spatial domain. This means that only the spectral nonlinear mapping is learned, without any spatial structure.
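A hedged PyTorch sketch of this spectral CNN is given below. The original implementation was built in Caffe, so the exact layer configuration may differ; here every convolution is 1×1 (\(s_1 = 1\)), so the mapping is purely spectral.

```python
import torch.nn as nn

class SpectralCNN(nn.Module):
    """Per-pixel spectral mapping with 1x1 convolutions (L = 5, a0 = 64)."""
    def __init__(self, bands=31, a0=64, L=5):
        super().__init__()
        layers = [nn.Conv2d(3, a0, kernel_size=1), nn.ReLU(inplace=True)]
        for _ in range(L - 2):             # the 2nd to (L-1)-th layers
            layers += [nn.Conv2d(a0, a0, kernel_size=1), nn.ReLU(inplace=True)]
        # The last layer maps to B bands; Eq. (6) applies ReLU at every layer.
        layers += [nn.Conv2d(a0, bands, kernel_size=1), nn.ReLU(inplace=True)]
        self.net = nn.Sequential(*layers)

    def forward(self, y):                  # y: (batch, 3, H, W) RGB image
        return self.net(y)                 # F_L: (batch, B, H, W) initial HSI
```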

RGB Guidance. Many studies on pan-sharpening [2, 52] employ the panchromatic image to preserve structural information, since the two input images of the same scene should have similar spatial structure. Inspired by this, we use the input RGB image to guide the reconstruction of spatial information, which is modeled by stacking the input RGB image and the initial HSI \(\mathbf {F}_{L}\) from the spectral CNN described above. Thus, the output of the \((L+1)\)-th layer can be expressed as

$$\begin{aligned} \begin{aligned} \mathbf {F}_{L+1} = \mathcal {C}(\mathbf {W}_{L+1}*\text {stack}(\mathbf {Y},\mathbf {F}_{L})+\varvec{b}_{L+1}), \end{aligned} \end{aligned}$$
(7)

where \(\mathcal {C}\) denotes the activation function, i.e., batch normalization (BN) [34] followed by a ReLU [36].

Spatial CNN. Due to the abundant self-repeating patterns in natural images [7, 13], spatial information is usually similar within a neighborhood. To effectively exploit this spatial similarity, we need to capture spatial correlation over a much larger area, which can be achieved with several DenseNet blocks [21]. We employ N DenseNet blocks in the designed network, and the output of the n-th DenseNet block can be expressed as

$$\begin{aligned} \begin{aligned} \mathbf {S}_{n}&= \mathcal {D}_{n}(\text {stack}(\mathbf {S}_{0},\cdots ,\mathbf {S}_{n-1})), \quad \text {where} \quad \mathbf {S}_{0} = \mathbf {F}_{L+1}, \end{aligned} \end{aligned}$$
(8)

where \(\mathcal {D}_{n}\) denotes the n-th DenseNet block function.

In each DenseNet block, there are K ResNet modules [20]. For the n-th DenseNet block, the input is \(\mathbf {S}_{n-1}\) and the k-th ResNet module can be expressed as

$$\begin{aligned} \begin{aligned} \mathbf {H}_{n}^{k}&= \mathcal {R}_{n}(\mathbf {H}_{n}^{k-1}) + \mathbf {H}_{n}^{k-1}, \quad \text {where} \quad \mathbf {H}_{n}^{0} = \mathbf {S}_{n-1}. \end{aligned} \end{aligned}$$
(9)

In our spatial CNN, we set \(N=4\) and \(K=4\).

In addition, since spatial structural information mainly resides in the high-pass components, we employ residual learning to efficiently reconstruct detail information, as in [27]. Thus, the final output can be described as

$$\begin{aligned} \begin{aligned} \hat{\mathbf {X}}&= \mathbf {S}_{N} + \mathbf {F}_{L}. \end{aligned} \end{aligned}$$
(10)
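The sketch below assembles the RGB guidance (Eq. 7), the DenseNet/ResNet structure (Eqs. 8–9), and the residual output (Eq. 10) in PyTorch. The paper does not specify feature widths or how each dense block compresses its stacked input, so the fixed width `ch` and the 1×1 compression convolution are assumptions for illustration, not the authors' exact Caffe model.

```python
import torch
import torch.nn as nn

class ResModule(nn.Module):
    """One ResNet module, Eq. (9): H_k = R(H_{k-1}) + H_{k-1}."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch),
            nn.ReLU(inplace=True), nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, h):
        return self.body(h) + h

class SpatialCNN(nn.Module):
    """RGB guidance plus N DenseNet blocks of K ResNet modules each."""
    def __init__(self, bands=31, ch=64, N=4, K=4):
        super().__init__()
        # Eq. (7): fuse the stacked RGB image and the initial HSI F_L.
        self.guide = nn.Sequential(
            nn.Conv2d(bands + 3, ch, 3, padding=1),
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True))
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Conv2d(ch * (n + 1), ch, 1),   # compress the stack
                          *[ResModule(ch) for _ in range(K)])
            for n in range(N))
        self.tail = nn.Conv2d(ch, bands, 3, padding=1)

    def forward(self, y, f_L):        # y: RGB input, f_L: spectral CNN output
        s = [self.guide(torch.cat([y, f_L], dim=1))]     # S_0 = F_{L+1}
        for block in self.blocks:
            s.append(block(torch.cat(s, dim=1)))         # Eq. (8): dense stacking
        return self.tail(s[-1]) + f_L                    # Eq. (10): residual output
```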

3.3 Optimal CSS Selection

Previous work [4] shows that the quality of HSI recovery is sensitive to the CSS used to generate the input RGB image. As will be shown in Sect. 4.3, we run our HSI recovery method on synthetic RGB images generated under different CSS functions in a brute-force way. From Fig. 5, we can see that the accuracy of HSI recovery varies by about 10%–25%. This means that properly choosing a CSS contributes to HSI recovery accuracy as well. The exhaustive search and the evolutionary optimization method in [4] are not appropriate for our CNN-based HSI recovery method, since they need to train the HSI recovery network multiple times, which is extremely slow. Thus, we design a selection convolution layer to retrieve the optimal CSS, which needs only a single training pass and has much lower time complexity.

Fig. 3. Illustration of the optimal CSS selection.

To select the optimal CSS function, RGB images for each HSI are first synthesized with all CSS functions in a candidate dataset. Let \(\mathbf {C}_{j}\) (\(j = 1,\cdots , J\)) denote the j-th CSS function. The RGB image synthesized from the t-th HSI in the training dataset with the j-th CSS function can be described as

$$\begin{aligned} \mathbf {Y}_{j,t} = \mathbf {C}_{j}\mathbf {X}_{t}. \end{aligned}$$
(11)

Thus, for each scene, the network input is obtained by stacking all RGB images under the different CSS functions, i.e.,

$$\begin{aligned} \mathcal {Y}_{t}= \text {stack}(\mathbf {Y}_{1,t},\cdots ,\mathbf {Y}_{j,t},\cdots ,\mathbf {Y}_{J,t}). \end{aligned}$$
(12)

Since our work mainly focuses on HSI recovery from a single RGB image, we select the CSS of an existing camera, without considering combinations of channels from different cameras. Note that a simulated RGB image cannot be negative, so the weights used to select the CSS should be nonnegative. To reliably localize the most promising CSS, we further add a sparsity constraint. Thus, we design a convolution layer for optimal CSS selection, in which the optimal CSS corresponds to the largest weight learned in the network under the nonnegative sparse constraint. To enforce the nonnegative constraint, the weights in this convolution layer are constrained to be nonnegative. As for the sparsity constraint, the weights are learned under the sparsity-promoting \(l_{1}\)-norm.

Selecting the optimal CSS is equivalent to selecting the RGB image in \(\mathcal {Y}_{t}\) that was synthesized with the optimal CSS. As shown in Fig. 3, we first separate the RGB bands into three branches, which share the same convolution filter \(\varvec{V}\). The size of this filter is \(J \times 1\times 1\times a_{1}\), where \(a_{1}=1\) is the number of output bands per branch. Thus, the output of the optimal CSS selection network can be expressed as

$$\begin{aligned} \hat{\mathbf {Y}}_{t} = \text {stack}(\varvec{V}*\mathcal {Y}_{t}(R),\varvec{V}*\mathcal {Y}_{t}(G),\varvec{V}*\mathcal {Y}_{t}(B)), \end{aligned}$$
(13)

where \(\mathcal {Y}_{t}(R)\), \(\mathcal {Y}_{t}(G)\), and \(\mathcal {Y}_{t}(B)\) denote all the red, green, and blue channels in \(\mathcal {Y}_{t}\), respectively.

The values in \(\varvec{V}\) are determined by minimizing the mean squared error (MSE) between the selected RGB image \( \hat{\mathbf {Y}}\) and the corresponding ground truth image under the nonnegative sparse constraint,

$$\begin{aligned} \mathcal {L}_{c}(\varvec{V})=\frac{1}{T}\sum _{t=1}^{T}\Vert \hat{\mathbf {Y}}_{t}(\varvec{V})-\mathbf {Y}_t\Vert ^2 +\Vert \varvec{V}\Vert _{1}, \quad \mathrm {s.t.} \quad \varvec{V} \ge 0, \end{aligned}$$
(14)

where \(\hat{\mathbf {Y}}_{t}\) is the t-th output, \(\mathbf {Y}_{t}\) is the t-th RGB image corresponding to the optimal CSS, and T is the number of training samples. A larger value in \(\varvec{V}\) indicates that its corresponding CSS is better suited for HSI recovery. Consequently, the CSS corresponding to the largest value in \(\varvec{V}\) is selected as the optimal CSS.
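A minimal PyTorch sketch of this selection layer is shown below, with the stacked input of Eq. (12) represented as a (batch, J, 3, H, W) tensor. The clamping and the \(l_1\) penalty mirror the nonnegative sparse constraint of Eq. (14); the exact Caffe realization may differ.

```python
import torch
import torch.nn as nn

class CSSSelection(nn.Module):
    """CSS selection layer, Eq. (13): one nonnegative weight per candidate
    CSS, shared across the R, G, and B branches."""
    def __init__(self, J):
        super().__init__()
        self.v = nn.Parameter(torch.rand(J))       # initialized positive

    def forward(self, y_stack):
        # y_stack: (batch, J, 3, H, W), i.e., the stack of Eq. (12) per scene.
        w = self.v.clamp(min=0)                    # nonnegative constraint
        # Weighted combination of the J candidate RGB images, per channel.
        return torch.einsum('j,bjchw->bchw', w, y_stack)

    def l1_penalty(self):
        return self.v.clamp(min=0).sum()           # sparsity term of Eq. (14)

    def selected_index(self):
        return int(torch.argmax(self.v))           # largest weight = optimal CSS
```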

3.4 Learning Details

The parameters of the HSI recovery network, denoted as \(\mathbf {\Theta }\), are learned by minimizing the MSE between the reconstructed HSI \( \hat{\mathbf {X}}\) and the corresponding ground truth image,

$$\begin{aligned} \mathcal {L}_{s}(\mathbf {\Theta })=\frac{1}{T}\sum _{t=1}^{T}\Vert \hat{\mathbf {X}}_{t}(\hat{\mathbf {Y}}_t,\mathbf {\Theta })-\mathbf {X}_t\Vert ^2 +\Vert \mathbf {\Theta }\Vert _{2}^{2}, \end{aligned}$$
(15)

where \(\hat{\mathbf {X}}_{t}\) is the t-th output, \(\mathbf {X}_{t}\) is the corresponding ground truth, and \(\hat{\mathbf {Y}}_t\) is the corresponding selected RGB image by the CSS selection network.

In our method, since the output of the CSS selection network in Eq. (14) is the input of the HSI recovery network, the selection depends on the HSI recovery training. Thus, we first append the optimal CSS selection network to the HSI recovery network to select the optimal CSS and learn the mapping for spectral recovery together. In this joint training phase, we train the entire network by minimizing the loss

$$\begin{aligned} \mathcal {L}= \mathcal {L}_{c}(\varvec{V})+\tau \mathcal {L}_{s}(\mathbf {\Theta }), \end{aligned}$$
(16)

where \(\tau \) is a predefined parameter. Please note that \(\mathbf {Y}_{t}\) need not be explicitly labeled in this joint training process, so \(\Vert \hat{\mathbf {Y}}_{t}(\varvec{V})-\mathbf {Y}_t\Vert ^2 \) can be dropped from Eq. (14). Then, the CSS corresponding to the largest value in \(\varvec{V}\) is selected as the optimal CSS and used to synthesize RGB images, which serve as the input of the recovery network to recover HSIs.

The loss is minimized with the adaptive moment estimation (Adam) method [29]. For all designed network modules, we set the mini-batch size to 16, the momentum parameter to 0.9, and the weight decay to \(10^{-4}\). To satisfy the nonnegative constraint for optimal CSS selection, the weights of its convolution layer are initialized as random positive numbers, and all negative weights are set to zero during forward and backward propagation. In the HSI recovery network, all convolution layers' weights are initialized by the method in [18]. The network is trained with the deep learning framework Caffe [23] on an NVIDIA Titan X GPU.
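The following PyTorch loop sketches these settings: Adam with the stated momentum and weight decay, mini-batches of 16, the joint loss of Eq. (16) with the RGB term dropped, and weight clamping to enforce nonnegativity. The learning rate, `tau`, and the toy `loader` are placeholder assumptions; the original training was done in Caffe.

```python
import torch

spec, spat, select = SpectralCNN(), SpatialCNN(), CSSSelection(J=28)
params = (list(spec.parameters()) + list(spat.parameters())
          + list(select.parameters()))
# betas[0] = 0.9 matches the stated momentum; lr is an assumed placeholder.
opt = torch.optim.Adam(params, lr=1e-4, betas=(0.9, 0.999), weight_decay=1e-4)
tau = 1.0                                    # placeholder balance parameter
# Toy batch of 16 patch pairs; a real loader would yield dataset patches.
loader = [(torch.rand(16, 28, 3, 64, 64), torch.rand(16, 31, 64, 64))]

for y_stack, x_gt in loader:
    y_hat = select(y_stack)                  # selected RGB image (Eq. 13)
    x_hat = spat(y_hat, spec(y_hat))         # recovered HSI (Fig. 2)
    loss = select.l1_penalty() + tau * torch.mean((x_hat - x_gt) ** 2)
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():
        select.v.clamp_(min=0)               # zero out negative selection weights
```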

4 Experimental Results

In the following, we first introduce the datasets used for training and testing all methods, as well as the metrics for quantitative evaluation. Then, we compare our method with several state-of-the-art HSI recovery methods under a typical CSS. Finally, the effectiveness of our optimal CSS selection method is evaluated on two CSS datasets.

4.1 Datasets and Metrics

We evaluate our joint CSS selection and CNN-based HSI recovery from a single RGB image on three public hyperspectral datasets: the ICVL dataset [3], the NUS dataset [38], and the Harvard dataset [9]. The ICVL dataset consists of 201 images and is by far the most comprehensive natural hyperspectral dataset; we randomly select 101 images from it for training and use the rest for testing. The NUS dataset contains 41 HSIs in the training set and 25 HSIs in the testing set. The Harvard dataset consists of 50 outdoor images captured under daylight illumination; we remove the 6 images with strong highlights, and randomly use 35 images for training and 9 for testing. All HSIs in these datasets have 31 bands. Two CSS datasets are used to evaluate the optimal CSS selection: the first [24] contains 28 CSS curves and the second [26] contains 12, and both cover various camera types and brands.

We uniformly extract patch pairs of size \(64\times 64\) with a stride of 61 from each HSI and its corresponding RGB images under the different CSS functions. We randomly select 90% of the pairs for training and 10% for validation. The RGB patches serve as the network input and the HSI patches as the ground truth.
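A minimal NumPy sketch of this patch extraction is given below; it is a generic sliding-window crop under the stated size and stride, not the authors' data pipeline, and the toy arrays are placeholders.

```python
import numpy as np

def extract_patches(img, size=64, stride=61):
    """Uniformly crop (size x size) patches from a (channels, H, W) array."""
    _, H, W = img.shape
    patches = [img[:, top:top + size, left:left + size]
               for top in range(0, H - size + 1, stride)
               for left in range(0, W - size + 1, stride)]
    return np.stack(patches)

# Paired samples: RGB patches as input, HSI patches as ground truth.
rgb, hsi = np.random.rand(3, 256, 256), np.random.rand(31, 256, 256)  # toy data
x_in, y_gt = extract_patches(rgb), extract_patches(hsi)
assert len(x_in) == len(y_gt)
```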

Three image quality metrics are used to evaluate all methods: root-mean-square error (RMSE), structural similarity (SSIM) [47], and spectral angle mapper (SAM) [30]. RMSE and SSIM are calculated on each 2D spatial image and measure the spatial fidelity between the recovered HSI and the ground truth. SAM is calculated on the 1D spectral vectors and reflects the spectral fidelity. Smaller values of RMSE and SAM indicate better performance, while a larger SSIM value implies better performance.
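For reference, RMSE and SAM can be computed as below. This NumPy sketch follows the standard definitions of these metrics rather than any specific evaluation code; SSIM is typically taken from an existing implementation such as scikit-image.

```python
import numpy as np

def rmse(x_hat, x):
    """Root-mean-square error between recovered and ground-truth images."""
    return np.sqrt(np.mean((x_hat - x) ** 2))

def sam(x_hat, x, eps=1e-8):
    """Mean spectral angle in radians; inputs are (B, M) band-by-pixel arrays."""
    dot = np.sum(x_hat * x, axis=0)
    norms = np.linalg.norm(x_hat, axis=0) * np.linalg.norm(x, axis=0) + eps
    return np.mean(np.arccos(np.clip(dot / norms, -1.0, 1.0)))

# SSIM is computed per 2D band, e.g., with
# skimage.metrics.structural_similarity, and averaged over bands.
```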

Table 1. RMSE, SSIM, and SAM results for different HSI recovery methods on three HSI datasets.
Table 2. Time complexity of different HSI recovery methods (unit: seconds).
Fig. 4. Visual quality comparison on six typical scenes from the HSI datasets. From top to bottom: the ground truth, the HSI recovered by our method, error maps of the RBF/SR/MM/our results, and the RMSE results along the spectrum for all methods.

Fig. 5. The RMSE results of our HSI recovery network on three HSI and two CSS datasets. The red and green bars indicate the best and worst CSS functions for HSI recovery found in a brute-force way, respectively. (Color figure online)

Fig. 6. The optimal CSS selected by our method on the three HSI datasets.

4.2 Evaluation on HSI Recovery

Here, we first compare our CNN-based HSI recovery method with three state-of-the-art methods for HSI recovery from a single RGB image under a known CSS: the radial basis function network based method (RBF) [38], the sparse representation based method (SR) [3], and manifold-based mapping (MM) [22]. The original HSIs in the datasets serve as ground truth. For a fair comparison, we use the CSS function of the Canon 5D Mark II to synthesize RGB values, the same as in [22].

Table 1 provides the average results over all HSIs in the testing sets of the three HSI datasets for a quantitative comparison of RBF, SR, MM, and our method. The best results are highlighted in bold. We observe that MM [22] outperforms the other two methods. The reason is that MM effectively approximates the spectral nonlinear mapping between RGB values and spectral signatures, compared with RBF and SR. This demonstrates that the spectral nonlinear mapping is highly relevant for HSI recovery from a single RGB image. Our method provides substantial improvements over all these methods in terms of RMSE, SSIM, and SAM. This reveals the advantage of deeply exploiting the intrinsic properties of HSIs and verifies the effectiveness of our HSI recovery network.

To visualize the experimental results for all methods, several representative recovered HSIs and the corresponding spectral recovery errors on the three datasets are shown in Fig. 4. The ground truth, our results, error images for the RBF/SR/MM/our methods, and the RMSE results along the spectrum for all methods are shown from top to bottom. The ground truth and our results show the 16-th band for all scenes. The error images are the average absolute errors between the ground truth and the recovered results across the spectrum. We can observe that the images recovered by our method are consistently more accurate for all scenes, which verifies that our method provides higher spatial accuracy. The RMSE results along the spectrum show that the results of our method are much closer to the ground truth than those of the compared methods, which demonstrates that our approach achieves higher spectral fidelity.

The average testing time over 10 independent trials for HSIs of different sizes is reported in Table 2 for all compared methods; all CPU results were obtained on an Intel Core i7-6800K. We can see that our method has a higher time complexity on the CPU, yet its running time on the GPU is much shorter. Compared with the other methods, our method on the GPU can reconstruct HSIs at more than 10 frames per second for a size of \(256\times 256 \times 31\).

4.3 Evaluation on CSS Selection

To evaluate the effect of CSS functions in our spectral recovery network, we conduct experiments on HSI recovery with both CSS datasets [24, 26]. First, we run all methods on the RGB images synthesized under different CSS functions in a brute-force way. As shown in Fig. 5, we can indeed observe that our method also depends on the CSS selection. Nevertheless, our method remains superior even with an improper CSS. To obtain an optimal CSS for improved HSI recovery with a single training pass, our method uses a convolution layer to select the optimal CSS. In Fig. 6, we can see that our method effectively selects the optimal CSS, consistent with the best one determined by exhaustive search. In addition, the selected CSS remains the same for all three HSI datasets, which suggests that this CSS properly encodes the intrinsic spectral information of the physical world.

Figure 7 shows the HSI recovery results of a typical scene under different CSS functions. The HSIs recovered under the worst/middle/best CSS functions in [24] and [26], in terms of Fig. 5, are shown in (a) and (b), respectively. We can see that the results under the selected optimal CSS are much closer to the ground truth, which further demonstrates the effectiveness of the joint optimal CSS selection and the accuracy of the learned CNN nonlinear mapping for HSI recovery.

Fig. 7. Visual quality comparison under different CSS functions on a typical scene. The HSIs recovered under the worst/middle/best CSS functions in [24] and [26] are shown in (a) and (b), respectively.

5 Conclusion

In this paper, we have presented an effective CNN-based method to jointly select the optimal camera spectral sensitivity function and learn an accurate mapping for reconstructing a hyperspectral image from a single RGB image. We first propose a spectral recovery network with properly designed modules to account for the underlying characteristics of the HSI, including the spectral nonlinear mapping and the spatial similarity. Meanwhile, a camera spectral sensitivity selection layer is developed and appended to the recovery network, which automatically retrieves the optimal sensitivity function using a nonnegative sparse constraint. Experimental results show that our method provides substantial improvements over the current state-of-the-art methods in terms of both objective metrics and subjective visual quality.

Our current network selects the optimal sensitivity from off-the-shelf cameras, so there is no need to produce a new filter array, which is known to be extremely expensive. However, filter array makers are indeed able to produce novel filter arrays, which might suit this spectral recovery task better than all existing RGB cameras. It is thus worth investigating the limitations of current filter manufacturing techniques and exploring how to incorporate them into a more relaxed filter response design (rather than selection) process.