1 Introduction

With the development of 3D technology and AR/VR glasses, stereoscopic vision is playing an increasingly important role in daily life; it is easy to imagine that scenes like those in the film “Ready Player One” will one day come true. The surge of stereoscopic visual data and applications fosters research based on such data. Many research topics that process the left and right views of stereo images simultaneously and consistently have emerged, such as stereoscopic image editing [2, 15], stereoscopic style transfer [4, 10], and stereoscopic image segmentation [17,18,19].

Fig. 1. Results obtained by Context Encoder (CE in the figure) [24] and the proposed method. Ground truth (GT in the figure) and inputs are presented in the first and second rows, respectively. (a) Results on KITTI [9] images; (b) close-up views of the results; (c) results on Driving [20] images.

Before deep learning, inpainting [29] was achieved mainly by searching for and copying appropriate local patches from the remaining parts of the image to be repaired. To deal with stereo images, consistency between the left and right views was formulated into the inpainting process [21, 22, 27]. Despite great progress, these patch-based methods can hardly complete large missing areas correctly, because they barely perceive the global context of a given image to guide structure prediction. Moreover, the source patches used for filling the hole are restricted to the remaining parts of the image.

In recent years, the CNN (Convolutional Neural Network), acknowledged for its strong feature representation ability after being well trained on large datasets, has been widely used in many tasks, including image inpainting. For example, Pathak et al. [24] presented a CNN-based encoder-decoder network, called Context Encoder, which captures image context into a compact latent feature representation. It was demonstrated to outperform traditional patch-based methods in semantic hole-filling [24]. However, to date, no work has attempted to solve the problem of stereoscopic image inpainting in the framework of deep learning.

Of course, stereo images could be inpainted view by view using networks designed for single images, e.g. Context Encoder [24]. However, as shown in Fig. 1, repairing without considering consistency between the left and right views produces inconsistent contents. This paper therefore proposes a stereoscopic image inpainting network, which takes both views into consideration simultaneously by aggregating information at the feature level. Meanwhile, pixel-level consistency is also improved by minimizing a stereo matching cost derived from the disparity map.

On the other hand, CNN-based image inpainting is a data-hungry task [24]. Previous monocular networks [24] are generally pre-trained on ImageNet [25] for better performance and generalization. Unfortunately, no stereoscopic dataset of comparable scale exists. To overcome this obstacle, our network is designed to be pre-trainable on monocular images.

The main contributions of this paper are summarized as follows:

  1. A CNN-based stereoscopic image inpainting network, called Stereo Inpainting Net, is proposed. To the best of our knowledge, we are the first to solve the stereoscopic image inpainting problem in the framework of deep learning;

  2. Our network encodes stereo consistency at both the structure and detail levels, via a specially designed feature fusion module and a local correspondence loss, respectively. We validate the effectiveness of the feature fusion module and the consistency loss via an ablation study;

  3. Given the scarcity of stereoscopic data for training the network, we find a way to pre-train it on single images, specifically the large-scale ImageNet dataset. We validate the pre-training strategy via comparison experiments. This strategy could be generalized to other stereoscopic image processing tasks.

2 Related Work

Our work is directly related to traditional patch-based stereoscopic inpainting approaches in terms of topic and to CNN-based image inpainting in terms of framework. Besides, the proposed method also involves stereoscopic consistency constraints. Therefore, we introduce related works from these three aspects.

Patch-Based Stereoscopic Image Inpainting. Given a pair of stereo images, traditional methods [21, 22, 27], i.e. those not using deep learning, find usable patches only in the remaining parts and fill holes under stereo consistency constraints. For example, Wang et al. [27] completed RGB images and disparity maps jointly via greedy patch searching; the disparity maps were then used for consistency checks to refine the RGB completion. Morse et al. [21] first completed the disparity maps while maintaining mutual consistency by using a coupled Partial Differential Equation (PDE), and then used the completed disparity maps to guide cross-image patch search to fill the RGB images. The above methods, limited by the data available in the remaining parts and the quality of the completed disparity maps, are not good at repairing large missing areas or foreground regions with abrupt depth changes. Luo et al. [16] proposed a method that can edit both foreground and background. However, it requires users to provide a target depth map manually, so it is not used for performance comparison in our experiments. The two automatic patch-based methods [21, 27] are tested and perform worse than our CNN-based method in filling foreground regions and large holes.

CNN-Based Image Inpainting. Inspired by the ability of CNNs in feature representation, global structure extraction and image generation [7], Pathak et al. [24] proposed Context Encoder for single-image inpainting. It is an encoder-decoder network with an adversarial loss. After sufficient training on a large image dataset, Context Encoder is able to extract semantic features from input images with holes and generate completed images. Compared to traditional patch-based methods, Context Encoder is good at perceiving and recovering image structures. Besides, this network is able to generate novel contents that might not appear in the input images. Several other approaches build on Context Encoder. For example, Yang et al. [30] and Wu et al. [28] improved its performance on high-resolution images, Demir et al. [5] refined the residual after an initial restoration, and Iizuka et al. [14] combined global and local consistency to make images more natural and plausible. These CNN-based methods perform well in monocular image inpainting. Based on these works, we develop a new network for stereo image inpainting that models consistency at both the structure and detail levels. Moreover, we solve the stereoscopic data scarcity problem with a pre-training strategy.

Stereoscopic Consistency Constraints. Keeping consistency among different views is essential when processing multi-view images [1, 26]. Baek et al. [1] used structure propagation for consistent space topology across images. For stereoscopic image inpainting, disparities computed from left-right correspondences are widely used to form consistency constraints in the patch-based framework. For example, Wang et al. [27] inpainted images and their disparities together and then iteratively refined the results via disparity-determined consistency. Morse et al. [21] and Luo et al. [16] filled stereo images based on disparity maps that were completed in advance. Although, according to our investigation, there are no CNN-based stereoscopic inpainting methods, CNN-based approaches to other stereo image processing tasks have appeared and naturally employ consistency constraints. For instance, Chen et al. [4] and Gong et al. [10] adopted disparity maps to guide feature aggregation at an intermediate level for stereoscopic style transfer. Our method differs substantially from the above ones in framework or task; nevertheless, these works inspired our modeling of stereo consistency in the proposed Stereo Inpainting Net.

Fig. 2. Architecture overview.

3 Stereoscopic Inpainting Network

3.1 Overview

The overall architecture of the proposed network, called Stereo Inpainting Net, is given in Fig. 2. It has two encoders sharing parameters with each other. Each encoder captures the context of the left/right view of an input image pair and expresses the context in a semantic feature representation. The feature representations of the two views are concatenated and fused via a fusion layer. Two decoders (\(Decoder_l\) and \(Decoder_r\)) take the fused features and generate a pair of completed images (\(\hat{x_l}\) and \(\hat{x_r}\)), simultaneously. Decoding based on the fused features achieves coherent structure repairing.

As illustrated in Fig. 2, the total loss, denoted as \(L_{total}\), is composed of three parts:

$$\begin{aligned} L_{total}=\sum _{d\in \{l,r\}}(\alpha L_G^d + \beta L_D^d) + \gamma L_{disp}. \end{aligned}$$
(1)

Here, \(d\in \{l,r\}\) denotes the current view (left or right). \(L_G\) and \(L_D\) are an L2 reconstruction loss and an adversarial loss, respectively, and \(L_{disp}\) is a local consistency loss. \(\alpha \), \(\beta \) and \(\gamma \) are weights balancing the three. The L2 reconstruction loss captures the overall structures of the missing regions in relation to the context, while the adversarial loss, making the predictions look real to the discriminators (\(Dis_l\) and \(Dis_r\)), drives the Stereo Inpainting Net to produce sharper predictions [24]. The local consistency loss, constraining the left and right views to look similar in details, complements the feature fusion layer in enforcing stereo consistency. We train the Stereo Inpainting Net with parts of or all the losses in multiple stages. Since insufficient stereoscopic data is available, a transfer-learning based training strategy is proposed.
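For clarity, a minimal sketch of Eq. 1 in code, assuming the per-view losses and the consistency loss have already been computed as scalar tensors (the default weights are the values later given in Sect. 4.1):

```python
def total_loss(loss_G, loss_D, loss_disp, alpha=1.0, beta=0.001, gamma=0.1):
    """Weighted sum of Eq. 1: loss_G and loss_D map the view keys 'l'/'r'
    to scalar tensors; loss_disp is a single scalar tensor."""
    return sum(alpha * loss_G[d] + beta * loss_D[d] for d in ('l', 'r')) + gamma * loss_disp
```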

The encoders, decoders and discriminators have the same architectures as those in Context Encoder [24] and are therefore not detailed in this paper. In the following, we introduce the fusion layer for structure consistency, the local consistency loss, and the transfer-learning based training strategy in detail.

3.2 Feature Fusion for Structure Consistency

For stereo-coherent inpainting, the network is supposed to take both views into consideration comprehensively. An intuitive way to achieve this is to stack the two RGB views into six channels and feed them into a single network. However, as demonstrated in [3, 23], composition at the semantic feature level is more robust than composition at the original image level. Thus, we feed the two views into separate encoders and fuse their features for coherent inpainting. Moreover, this design enables us to transfer parameters from a pre-trained Context Encoder [24], as explained in Sect. 3.4. Inspired by the Siamese network [32] used to calculate stereo matching costs, the two encoders share weights in order to reduce parameters and produce unified feature representations for fusion.

Through the encoders, the two views are turned into feature vectors of 4000 channels each. To aggregate the information of both views, we concatenate the two feature vectors into one of 8000 channels. We then apply a channel-wise convolution to the combined vector to blend features across channels and produce a fused feature vector of 4000 channels. Based on the fused feature vector, two independent decoders are employed to generate unique contents for the left and right views, respectively.
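As a rough sketch of this fusion step (a \(1\times 1\) convolution is one plausible reading of the channel-wise blending; the spatial size of the bottleneck features and the exact layer are assumptions here):

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Concatenate the two 4000-channel encodings and blend them back to
    4000 channels; the 1x1 convolution is our assumed realization of the
    channel-wise blending."""
    def __init__(self, channels: int = 4000):
        super().__init__()
        self.blend = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, feat_left: torch.Tensor, feat_right: torch.Tensor) -> torch.Tensor:
        # feat_*: (N, 4000, H, W) bottleneck features from the shared encoder
        fused = torch.cat([feat_left, feat_right], dim=1)  # (N, 8000, H, W)
        return self.blend(fused)                           # (N, 4000, H, W)
```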

Both the reconstruction loss and the adversarial loss are used for training the feature fusion layer. The former measures the L2 distance between the images generated by the Stereo Inpainting Net and the ground truth. The reconstruction loss for view d is given by:

$$\begin{aligned} L_{G}^{d}=\Vert {(G(M \odot x_d)-x_d) \odot (1 - M)}\Vert _2. \end{aligned}$$
(2)

Here, \(x_d\) is a ground-truth view and \(M \odot x_d\) is the input for training, where M is a mask in which the missing area is filled with 0 and the other pixels are set to 1, and \(\odot \) is the element-wise product. \(G(\cdot )\) is the generation function represented by Decoder\(_d\). The adversarial loss [11] is computed after the discriminators in Fig. 2, which receive generated images or ground truth as input and judge whether the input is real. We also tried sharing weights between the two discriminators; however, this results in poor convergence because the gradients provided by the left and right branches are not symmetric. More generally, when dealing with stereoscopic images with GANs [11], it is good practice to use two branches of decoders and discriminators in order to make training easier and more stable. The adversarial loss for view d is computed by binary cross entropy over the judgements D() and the ground-truth labels:

$$\begin{aligned} L_{D}^{d}=\max _D \mathbb {E}_{x_d \in X}[\log {D((1-M) \odot x_d)}+\log (1-D(G(M \odot x_d)))]. \end{aligned}$$
(3)
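A hedged sketch of how Eqs. 2 and 3 could be implemented (assuming the discriminator `disc` outputs a real/fake probability; the non-saturating generator term is a common substitute for the formal min-max objective and is an assumption here, not the paper's stated formulation):

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(pred, target, mask):
    """Masked L2 loss of Eq. 2: mask is 1 on known pixels and 0 in the hole,
    so (1 - mask) keeps only the generated region."""
    return torch.norm((pred - target) * (1.0 - mask), p=2)

def adversarial_losses(disc, pred, target, mask):
    """BCE losses in the spirit of Eq. 3; disc maps an image to a probability
    of being real (sigmoid output assumed)."""
    real_score = disc((1.0 - mask) * target)
    fake_score = disc(pred)
    d_loss = F.binary_cross_entropy(real_score, torch.ones_like(real_score)) + \
             F.binary_cross_entropy(fake_score, torch.zeros_like(fake_score))
    g_loss = F.binary_cross_entropy(fake_score, torch.ones_like(fake_score))
    return d_loss, g_loss
```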

3.3 Local Consistency Loss

The local consistency loss, \(L_{disp}\), measures the differences between corresponding patches in the generated contents of the left and right views. The loss computation is illustrated in Fig. 3. For each pixel i in the filled area of the left view (indicated by the dashed boxes), i.e. \(i \in (1-M) \odot \hat{x_l}\), we take a \(3 \times 3\) patch \(P_l(i)\) around i and warp the patch to the right image by the disparity of i. The disparity map of the left view is computed by DispNet [20]. Through warping, we find the corresponding patch of \(P_l(i)\) in the right view.

Fig. 3. Patch matching for local consistency loss.

We denote the corresponding patch of \(P_l(i)\) as \(\overleftarrow{W}(P_l(i),x_{disp}(i))\). \(L_{disp}\) is defined as:

$$\begin{aligned} L_{disp}=\frac{1}{\vert {(1-M) \odot x_l}\vert }\sum _{i \in (1-M) \odot x_l}{cost(P_l(i),\overleftarrow{W}(P_l(i),x_{disp}(i)))} \end{aligned}$$
(4)

Here, cost() measures the distance between the two patches. According to [12], many stereo matching costs, e.g. Sum of Squared Differences (SSD), Sum of Absolute Differences (SAD), mutual information [8], rank and census transforms [31], could be used. Unfortunately, some of them, such as rank transform and hierarchical mutual information, are unable to propagate derivatives backward. In this paper, we choose Normalized Cross-Correlation (NCC) [12] as the stereo matching cost.
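As an illustration only (patch extraction and disparity-based warping are omitted, and the exact NCC formulation used in the paper is not given), a differentiable NCC-style cost between two batches of corresponding patches might look like:

```python
import torch

def ncc_cost(patch_left: torch.Tensor, patch_right: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """NCC-based matching cost between corresponding patches, e.g. tensors of
    shape (N, C, 3, 3). Higher NCC means a better match, so the cost is 1 - NCC."""
    a = patch_left.flatten(1)
    b = patch_right.flatten(1)
    a = a - a.mean(dim=1, keepdim=True)   # zero-mean per patch
    b = b - b.mean(dim=1, keepdim=True)
    ncc = (a * b).sum(dim=1) / (a.norm(dim=1) * b.norm(dim=1) + eps)
    return (1.0 - ncc).mean()
```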

3.4 Transfer-Learning Based Training

Due to the stereoscopic data limit, we present a transfer-learning based training strategy, which is summarized in Algorithm 1.

Algorithm 1. Transfer-learning based training of the Stereo Inpainting Net.

Here, \(T_{train}=2500\), \(T_{fusion}=100\), \(T_{D}=300\), and \(T_{G}=1000\). \(T_{fusion}\), \(T_{D}-T_{fusion}\) and \(T_{G}-T_{D}\) denote the numbers of epochs for individually training the fusion layer, the discriminators, and the Stereo Inpainting Net together with the discriminators, respectively. The whole network in Fig. 2 is then trained for \(T_{train}-T_{G}\) epochs.
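The staged schedule could be organized as in the following sketch; the stage functions are placeholders for the actual training steps, the monocular pre-training of step 2 is assumed to have been done beforehand, and only the ordering and epoch counts come from the description above:

```python
# Epoch thresholds from Algorithm 1.
T_FUSION, T_D, T_G, T_TRAIN = 100, 300, 1000, 2500

def run_schedule(train_fusion, train_disc, train_gen_disc, train_all):
    """Each argument is a callable that runs one epoch of the named stage;
    weights are assumed to be initialized from a Context Encoder pre-trained
    on monocular images (step 2)."""
    for epoch in range(T_TRAIN):
        if epoch < T_FUSION:
            train_fusion()       # epochs [0, 100): fusion layer only
        elif epoch < T_D:
            train_disc()         # epochs [100, 300): discriminators only
        elif epoch < T_G:
            train_gen_disc()     # epochs [300, 1000): inpainting net + discriminators
        else:
            train_all()          # epochs [1000, 2500): whole network, all losses
```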

4 Experiments

4.1 Implementation Details

Experimental Settings. We use Adam for optimization with a learning rate of 0.0002 and without weight decay. The batch size is 128. The weights balancing the three losses in Eq. 1 are set to be \(\alpha =1, \beta =0.001, \gamma =0.1\).
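For reference, a sketch of these settings in code; only the learning rate, weight decay, batch size and loss weights are taken from the text, and everything else follows PyTorch defaults:

```python
import torch

def make_optimizer(network: torch.nn.Module) -> torch.optim.Adam:
    # Adam with learning rate 0.0002 and no weight decay, as stated above.
    return torch.optim.Adam(network.parameters(), lr=2e-4, weight_decay=0.0)

BATCH_SIZE = 128
ALPHA, BETA, GAMMA = 1.0, 0.001, 0.1  # loss weights in Eq. 1
```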

Dataset. In Algorithm 1, the baseline net, Context Encoder, is trained on ImageNet [25], which has 1,260,000 monocular images. The remaining training steps are performed on KITTI [9]. The original KITTI dataset contains 42,382 rectified stereo pairs from 61 scenes, captured at 10 Hz with a resolution of \(1242 \times 375\) pixels. In our experiments, we resample the dataset at 1/5 of the original frequency to avoid high correlation among image pairs, yielding 8476 stereo pairs, of which 8176 are used for training our network and the remaining 300 for the validation in Sect. 4.2 and the qualitative evaluation in Sect. 4.3. The images are resized to \(640 \times 384\) for the computational convenience of DispNet [20]. At each training iteration or testing step, we randomly crop a pair of \(128 \times 128\) patches from the resized image pairs as input. The quantitative evaluation in Sect. 4.3 is carried out on the Driving dataset [20], which has 4392 frames with accurate disparity. This dataset is resampled (at a lower frequency and with overly dark images removed) to 600 pairs: 400 for fine-tuning the network on this dataset and 200 for the quantitative test.

4.2 Evaluation on Transfer-Learning Based Training

Fig. 4. Loss curves with and without transfer learning on both the training and validation datasets.

To verify the benefit of the transfer learning strategy, we compare the network trained from scratch (i.e. skipping step 2 in Algorithm 1) with our network trained via transfer learning. Figure 4 presents the reconstruction-plus-adversarial loss curves of the two networks on the KITTI training and validation datasets, respectively. From the figure, it can be seen that the network with transfer learning achieves much lower loss on both training and validation data than the one trained from scratch. In addition, with transfer learning, the performance on the validation data is closer to that on the training data, which indicates better generalization [6].

4.3 Evaluation on the Proposed Network with Ablation Study

Quantitative Evaluation. To demonstrate the effectiveness of our method sufficiently, we present a quantitative evaluation and comparison with the other methods in terms of stereo consistency and image quality. Results are given in Table 1. Stereo consistency is quantified with the SSD and SAD of corresponding patches (\(1\times 1\) pixels in this evaluation). To avoid the influence of disparity noise, we carry out the evaluation on the Driving dataset [20], which provides ground-truth disparity maps. Note that Context Encoder and our models are fine-tuned in advance on part of the Driving dataset (see Sect. 4.1). PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity Index) [13] are employed to measure the image quality of the generated contents in terms of pixels and local structures, respectively.
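A sketch of how such a \(1\times 1\) consistency measure could be computed with ground-truth disparity (the per-pixel averaging and boundary handling are assumptions; only the SAD/SSD definitions and the use of ground-truth disparity follow the text):

```python
import numpy as np

def stereo_consistency(left, right, disparity, hole_mask):
    """SAD/SSD between corresponding pixels: a left pixel (y, x) inside the
    filled region is compared with right pixel (y, x - disparity).
    left/right: (H, W, 3) float arrays; disparity: (H, W); hole_mask: (H, W)
    boolean, True inside the filled region."""
    H, W = disparity.shape
    sad, ssd, count = 0.0, 0.0, 0
    for y, x in zip(*np.nonzero(hole_mask)):
        xr = int(round(x - disparity[y, x]))
        if 0 <= xr < W:
            diff = left[y, x] - right[y, xr]
            sad += np.abs(diff).sum()
            ssd += (diff ** 2).sum()
            count += 1
    return sad / max(count, 1), ssd / max(count, 1)
```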

Table 1. Quantitative results in stereoscopic inconsistency and image quality.

As can be seen from Table 1, Context Encoder performs poorly in keeping consistency. Note that lower SAD/SSD means higher consistency, while higher PSNR/SSIM means better quality. In contrast, the patch-based methods are better at keeping consistency but cannot produce images of high quality. Our network, even the variant with only feature fusion, generates contents with high quality and high consistency simultaneously.

Qualitative Evaluation. We compare the results of our method with Context Encoder [24] and two traditional patch-based stereoscopic inpainting methods [21, 27]. Context Encoder is used to repair the stereo images view by view, independently. Parts of the results on the KITTI and Driving validation datasets are given in Figs. 1 and 5. From Fig. 5, it can be seen that the patch-based stereoscopic methods can restore missing areas and keep these areas consistent to some extent. However, their performance is limited by the patches existing in the source images, and they are prone to destroying structural integrity. For example, in Fig. 5(a), the patch-based methods severely break the window structure because the window is partly missing in both views.

In contrast, Context Encoder is better at predicting content structures in the missing parts than the patch-based methods, as can be seen from Fig. 5. However, view-by-view repairing often leads to stereo inconsistency. For example, in the results of Context Encoder in Fig. 5(b), the yellow car appearing in the right view is absent from the left view.

Fig. 5. Results obtained by Wang et al. [27], Morse et al. [21], Context Encoder [24], and our method without and with the disparity loss. The first and second rows are ground truth (GT in the figure) and inputs, respectively. (Color figure online)

From these figures, it can be seen that our network improves structure consistency by employing feature fusion, while the disparity-determined local consistency loss helps coherent inpainting at the detail level. Overall, compared with the patch-based methods and Context Encoder, our network obtains better inpainting results in terms of stereo consistency. Meanwhile, the quality of each image generated by our network also looks better than that obtained by existing methods. Due to the page limit, only a few examples are shown; more results are provided in the supplementary material.

5 Conclusion

In this paper, a stereoscopic image inpainting network was proposed. The network was endowed with a specially designed feature fusion layer and a local correspondence loss, which played essential roles in coherent stereoscopic inpainting at the structure and detail levels, respectively. Besides, a transfer-learning based training strategy was presented, which overcame the problem of stereoscopic data scarcity. Contents predicted by the proposed network demonstrated higher stereo consistency and image quality than those of state-of-the-art methods.