Edge supervision and multi-scale cost volume for stereo matching

https://doi.org/10.1016/j.imavis.2021.104336

Highlights

  • Innovatively generate a dataset for edge detection and stereo matching.

  • Incorporate edge detection clues into the disparity estimation.

  • Enlarge receptive field and capture global information by a multi-scale cost volume.

  • Optimize the initial disparity for the sake of robustness to challenging regions.

Abstract

Recently, methods based on Convolutional Neural Networks have made great progress in stereo matching. However, it is still difficult to find accurate matching points in inherently ill-posed regions (e.g., weakly textured areas and around object edges), where the accuracy of disparity estimation can be improved by the corresponding geometric constraints. To tackle this problem, we generate a depth ground-truth boundary dataset by mining instance segmentation and semantic segmentation datasets, and propose RDNet, which incorporates edge cues into stereo matching. The network learns geometric information through a separate edge stream, which processes feature information in parallel with the stereo stream; the edge stream removes noise and focuses only on the relevant boundary information. Besides, we introduce a multi-scale cost volume in hierarchical cost aggregation to enlarge the receptive field and capture structural and global representations, which significantly improves scene understanding and disparity estimation accuracy. Moreover, a disparity refinement network with several dilated convolutions is applied to further improve the final disparity estimate. The proposed method is evaluated on the Sceneflow, KITTI 2015 and KITTI 2012 benchmark datasets, and the qualitative and quantitative results demonstrate that the proposed RDNet achieves state-of-the-art stereo matching performance.

Introduction

Stereo matching is one of the most widely studied problems in computer vision with applications in 3D reconstruction [1], [2], autonomous driving [3], robot automatic guidance [4] and augmented realities [5]. Binocular stereo vision system is based on the disparity principle and adopts imaging equipment to capture the left and right images of the measured object from different positions. Subsequently, it calculates the position deviation of the spatial point in the two-dimensional image following the triangulation principle, and finally exploits the position deviation to perform three-dimensional reconstruction and obtain 3D geometric information of the measured object.
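The triangulation step described above reduces to the standard relation Z = f · B / d, where f is the focal length in pixels, B is the stereo baseline, and d is the disparity. A minimal sketch (the camera values below are illustrative, not taken from this paper):

```python
def depth_from_disparity(disparity_px: float, focal_px: float, baseline_m: float) -> float:
    """Metric depth of a point from its disparity via triangulation: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a visible point")
    return focal_px * baseline_m / disparity_px

# Example with KITTI-like intrinsics (f ~ 721 px, B ~ 0.54 m) and a 36 px disparity.
z = depth_from_disparity(36.0, 721.0, 0.54)  # ~10.8 m
```

Note the inverse relationship: a fixed disparity error causes a larger depth error for distant (small-disparity) points, which is one reason sub-pixel disparity accuracy matters.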

In recent years, Convolutional Neural Networks (CNNs) [6], [7], [8], [9] have yielded significant gains over conventional algorithms in terms of both speed and accuracy. However, in ill-posed regions (e.g., repeated patterns, reflective surfaces, around object edges and texture-less regions), accurate corresponding points remain difficult to find. To address this, some studies [10], [11], [12] attempt to enlarge the receptive field to extract global context information and increase the accuracy of stereo matching to a certain extent. But lacking geometric constraints and fine-grained representations, deep stereo networks still struggle to predict disparity accurately around object edges.

Despite the successes of CNNs on stereo matching [13], [14], [15], [16] and edge detection [17], [18], [19], [20], these methods rarely consider the interactions between the two tasks. This study is inspired by the recent work of EdgeStereo [21], which integrates edge information into disparity estimation via joint learning. As shown in Fig. 1, edge detection captures the boundaries of different objects in images and shows geometric and spatial correlation with the corresponding disparity map. Moreover, accurate edge detection can help rectify disparity estimates along object boundaries, which are consistently error-prone in stereo matching. However, classical edge maps contain considerable noise. We innovatively generate a depth ground-truth boundary dataset by jointly mining instance and semantic segmentation datasets, and propose a novel two-stream CNN architecture called RDNet for stereo matching that explicitly exploits edge information through a separate processing branch. Our main goal is to enforce a structured representation that exploits the duality between the boundary prediction and stereo matching tasks. The proposed RDNet simultaneously learns disparity estimation and boundary detection, and exploits boundaries as an intermediate representation to aid stereo matching.

Moreover, we consider that the same pixel should yield consistent matching costs across different scales. We therefore propose a multi-scale cost volume and a hierarchical cost aggregation for disparity estimation. The low-resolution cost volumes at different scales cover multi-scale receptive fields, which are complementary and facilitate the network's observation of image regions at different scales. We fuse multiple low-resolution cost volumes to enlarge the receptive field and extract robust structural representations for initial disparity estimation. The left image, right image and initial disparity are then fed into a disparity refinement network consisting of several convolutions and basic blocks with different dilation rates to further improve the final disparity estimate.

In summary, our main contributions are listed below:

  • (1)

We innovatively generate a pixel-level depth boundary dataset for Sceneflow and propose a new two-stream stereo network, RDNet, that learns depth estimation and boundary detection simultaneously. The network exploits the geometric constraint between disparity estimation and edge detection and estimates disparity more accurately along object boundaries.

  • (2)

We propose a multi-scale cost volume for cost aggregation and integrate the cost volumes with multiple stacked hourglass architectures to enlarge the receptive field and capture global information for initial disparity estimation. Besides, we develop an efficient disparity refinement network to further improve the final disparity estimate.

  • (3)

Our method achieves state-of-the-art disparity estimation performance on the Sceneflow, KITTI 2015 and KITTI 2012 benchmark datasets.

Section snippets

Stereo matching

Numerous related works have been proposed for stereo matching. This section reviews disparity estimation with emphasis placed on CNN-based methods. Zbontar and LeCun [6] utilize a convolutional neural network to learn a similarity measure on small image patches. Nikolaus Mayer et al. propose the first end-to-end stereo matching network, DispNet [8], which exploits a 1-D correlation layer for cost calculation. To merge multiscale features, GC-Net [9] is efficiently capable of learning context

Our robust disparity network

In this paper, we propose a novel multi-task learning network RDNet, incorporating edge feature into the disparity estimation. We unite the stereo stream information and edge stream information in feature extraction and utilize them to construct the multi-scale cost volume. The network consists of five parts: feature extraction (stereo stream and edge stream), multi-scale cost volumes construction, hierarchical cost aggregation, disparity computation and disparity refinement. The overall

Datasets and evaluation metrics

We evaluate our method on three stereo datasets:

  • (1)

Sceneflow [8]: a large synthetic stereo dataset containing 35,454 training images and 4370 testing images with H = 540 and W = 960. The dataset provides elaborate and dense disparity maps as ground truth. The disparity edge maps, obtained by binarizing the object segmentation images, serve as edge ground truth. We evaluate our model with the end-point error (EPE) metric, i.e., the mean disparity error in pixels.

  • (2)

    KITTI 2015 [37] and KITTI 2012 [38]:

Conclusion

In this study, we propose a novel and effective network architecture RDNet that utilizes edge detection and multi-scale cost volume for robust stereo matching. We first incorporate edge cues into stereo stream for improving disparity estimation in feature extraction. Then we construct multi-scale cost volumes and fuse them together to extract more global context information and structural representations by multi hourglass modules. Furthermore, the disparity refinement network is used to

Funding

This study is supported in part by the Science and Technology Major Project of Guizhou Province (Qiankehe Major Projects No. ZNWLQC[2019]3012), the Science and Technology Project of Guizhou Province Department of Transportation (2021-322-021), the Natural Science Foundation of Guangdong Province Grant No. 2020A1515110501 and the Science and Technology Planning Project of Shenzhen No. JCYJ20180503182133411.

Declaration of Competing Interest

The authors declare no conflict of interest.

References (41)

  • Shuhan Shen

    Accurate multiple view 3D reconstruction using patch-based stereo for large-scale scenes

    IEEE Trans. Image Process.

    (2013)
  • Yao Yao et al.

    MVSNet: depth inference for unstructured multi-view stereo

  • Chenyi Chen et al.

    Deepdriving: learning affordance for direct perception in autonomous driving

  • Korbinian Schmid et al.

    Stereo vision based indoor/outdoor navigation for flying robots

  • Qiuyu Chen et al.

    Virtual blood vessels in complex background using stereo x-ray images

  • Jure Zbontar et al.

    Stereo matching by training a convolutional neural network to compare image patches

    J. Mach. Learn. Res.

    (2016)
  • Wenjie Luo et al.

    Efficient deep learning for stereo matching

  • Nikolaus Mayer et al.

    A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation

  • Alex Kendall et al.

    End-to-end learning of geometry and context for deep stereo regression

  • Jia-Ren Chang et al.

    Pyramid stereo matching network

  • Guang-Yu Nie et al.

    Multi-level context ultra-aggregation for stereo matching

  • Zhidong Zhu et al.

    Multi-scale cross-form pyramid network for stereo matching

  • Alessio Tonioni et al.

    Real-time self-adaptive deep stereo

  • Jiahao Pang et al.

    Cascade residual learning: a two-stage convolutional neural network for stereo matching

  • Sameh Khamis et al.

    StereoNet: guided hierarchical refinement for real-time edge-aware depth prediction

  • Zequn Jie et al.

    Left-right comparative recurrent model for stereo matching

  • Saining Xie et al.

    Holistically-nested edge detection

  • Yun Liu et al.

    Richer convolutional features for edge detection

  • D. Xu et al.

Learning deep structured multi-scale features using attention-gated CRFs for contour prediction

    Advances in Neural Information Processing Systems

    (2017)
  • Jianzhong He et al.

    Bi-directional cascade network for perceptual edge detection
