Edge supervision and multi-scale cost volume for stereo matching
Introduction
Stereo matching is one of the most widely studied problems in computer vision, with applications in 3D reconstruction [1], [2], autonomous driving [3], robot navigation [4] and augmented reality [5]. A binocular stereo vision system is based on the disparity principle: it captures left and right images of the measured object from different positions, computes the positional deviation of each spatial point between the two images following the triangulation principle, and finally exploits this deviation to reconstruct the 3D geometry of the measured object.
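The triangulation step can be sketched as follows; the focal length and baseline values below are illustrative, KITTI-like assumptions rather than parameters from this paper:

```python
def depth_from_disparity(d, focal_px=721.5, baseline_m=0.54):
    """Depth in metres from a disparity d (pixels) on a rectified rig:
    Z = f * B / d. The default f and B are hypothetical KITTI-like values."""
    if d <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / d

z_near = depth_from_disparity(100.0)  # large disparity -> close object
z_far = depth_from_disparity(10.0)    # small disparity -> distant object
```

This inverse relationship is why small disparity errors on distant, low-disparity pixels translate into large depth errors.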
In recent years, Convolutional Neural Networks (CNNs) [6], [7], [8], [9] have yielded significant gains over conventional algorithms in both speed and accuracy. However, in ill-posed regions (e.g., repeated patterns, reflective surfaces, regions around object edges and texture-less regions), accurate corresponding points remain difficult to find. To address this, some studies [10], [11], [12] enlarge the receptive field to extract global context information, which improves stereo matching accuracy to a certain extent. But because deep stereo networks lack geometric constraints and fine-grained representations, disparity remains difficult to predict accurately around object edges.
Despite the successes of CNNs on stereo matching [13], [14], [15], [16] and edge detection [17], [18], [19], [20], these methods rarely consider the interactions between the two tasks. This study is inspired by the recent work EdgeStereo [21], which integrates edge information into disparity estimation via joint learning. As shown in Fig. 1, edge detection captures the boundaries of different objects in an image, and these boundaries exhibit geometric and spatial correlation with the corresponding disparity map. Moreover, accurate edge detection can help rectify disparity estimates along object boundaries, which are consistently error-prone in stereo matching. However, classical edge maps contain considerable noise. We therefore generate a depth ground-truth boundary dataset by jointly mining instance and semantic segmentation annotations, and propose a novel two-stream CNN architecture for stereo matching, RDNet, that explicitly utilizes edge information through a separate processing branch. Our main goal is to enforce a structured representation that exploits the duality between boundary prediction and stereo matching. The proposed RDNet learns disparity estimation and boundary detection simultaneously and exploits boundaries as an intermediate representation to aid stereo matching.
Moreover, we consider that the same pixel at different scales should have identical matching costs. We therefore propose a multi-scale cost volume and a hierarchical cost aggregation scheme for disparity estimation. The low-resolution cost volumes at different scales cover multi-scale receptive fields, which complement one another and facilitate the network's observation of image regions at different scales. We fuse multiple low-resolution cost volumes to enlarge the receptive field and extract robust structural representations for initial disparity estimation. The left image, right image and initial disparity are then fed into a disparity refinement network, which comprises several convolutions and basic blocks with different dilation rates, to further improve the final disparity estimate.
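The multi-scale idea can be illustrated with a minimal numpy sketch, assuming single-channel feature maps and absolute-difference matching costs; the paper's actual features and fusion operators are learned, so this is only a schematic of the construction:

```python
import numpy as np

def cost_volume(left, right, max_disp):
    """Absolute-difference cost volume of shape (max_disp, H, W).
    left/right are single-channel feature maps of shape (H, W)."""
    H, W = left.shape
    vol = np.full((max_disp, H, W), np.inf)
    for d in range(max_disp):
        # cost of matching left pixel (y, x) against right pixel (y, x - d)
        vol[d, :, d:] = np.abs(left[:, d:] - right[:, : W - d])
    return vol

def downsample(x):
    """2x average pooling (assumes even H and W)."""
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).mean(axis=(1, 3))

def multiscale_cost(left, right, max_disp):
    """Fuse a full-resolution volume with an upsampled half-resolution one."""
    full = cost_volume(left, right, max_disp)
    half = cost_volume(downsample(left), downsample(right), max_disp // 2)
    # nearest-neighbour upsampling of the half-resolution volume
    up = half.repeat(2, axis=0).repeat(2, axis=1).repeat(2, axis=2)
    return full + up
```

The half-resolution volume effectively sees a doubled receptive field per pixel, which is the complementarity the fusion exploits.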
In summary, our main contributions are listed below:
- (1)
We innovatively generate a pixel-level depth boundary dataset for Sceneflow and propose a new two-stream stereo network, RDNet, that learns depth estimation and boundary detection simultaneously. The network exploits the geometric constraint between disparity estimation and edge detection and estimates disparity more accurately along object boundaries.
- (2)
We propose a multi-scale cost volume for cost aggregation and integrate the cost volumes with multiple stacked hourglass architectures to enlarge the receptive field and capture global information for initial disparity estimation. In addition, we develop an efficient disparity refinement network to further improve the final disparity estimate.
- (3)
Our method achieves state-of-the-art disparity estimation performance on the Sceneflow, KITTI 2015 and KITTI 2012 benchmark datasets.
Stereo matching
Numerous related works have been proposed for stereo matching. This section reviews disparity estimation with emphasis placed on CNN-based methods. Zbontar and LeCun [6] utilize a convolutional neural network to learn a similarity measure on small image patches. Mayer et al. propose the first end-to-end stereo matching network, DispNet [8], which exploits a 1-D correlation layer for cost calculation. To merge multiscale features, GC-Net [9] is efficiently capable of learning context
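DispNet's 1-D correlation layer computes, for each candidate disparity, the inner product of left and right feature vectors along the scanline. A minimal numpy sketch of that operation, with feature extraction omitted and shapes assumed to be (C, H, W):

```python
import numpy as np

def correlation_1d(left_feat, right_feat, max_disp):
    """1-D correlation over horizontal shifts, a sketch of the DispNet-style
    cost calculation. Output shape: (max_disp + 1, H, W)."""
    C, H, W = left_feat.shape
    out = np.zeros((max_disp + 1, H, W))
    for d in range(max_disp + 1):
        # mean over channels of the dot product at horizontal offset d
        out[d, :, d:] = (left_feat[:, :, d:] *
                         right_feat[:, :, : W - d]).sum(axis=0) / C
    return out
```

In the full network this correlation map replaces an explicit feature-concatenation cost volume, trading expressiveness for speed.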
Our robust disparity network
In this paper, we propose a novel multi-task learning network, RDNet, which incorporates edge features into disparity estimation. We unite stereo-stream and edge-stream information in feature extraction and use them to construct the multi-scale cost volume. The network consists of five parts: feature extraction (stereo stream and edge stream), multi-scale cost volume construction, hierarchical cost aggregation, disparity computation and disparity refinement. The overall
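The disparity computation stage in pipelines of this kind commonly follows the soft-argmin regression introduced by GC-Net [9]; the sketch below shows that operator under this assumption (the paper's exact formulation may differ):

```python
import numpy as np

def soft_argmin(cost_vol):
    """Soft-argmin disparity regression: turn costs into a probability
    distribution over disparities and take its expectation.
    cost_vol: (D, H, W) -> disparity map of shape (H, W)."""
    neg = -cost_vol
    neg -= neg.max(axis=0, keepdims=True)  # subtract max for numerical stability
    prob = np.exp(neg) / np.exp(neg).sum(axis=0, keepdims=True)
    disps = np.arange(cost_vol.shape[0]).reshape(-1, 1, 1)
    return (prob * disps).sum(axis=0)
```

Unlike a hard argmin, this regression is differentiable and yields sub-pixel disparities, which is why it is the standard choice after cost aggregation.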
Datasets and evaluation metrics
We evaluate our method on three stereo datasets:
- (1)
Sceneflow [8]: a large synthetic stereo dataset containing 35,454 training images and 4370 testing images with H = 540 and W = 960. The dataset provides elaborate, dense disparity maps as ground truth. Edge ground truth is obtained by binarizing the object segmentation maps into disparity edge maps. We evaluate our model with the end-point error (EPE) metric, i.e., the mean disparity error in pixels.
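The EPE metric itself is straightforward; a minimal sketch, assuming a validity mask marks the pixels that have ground-truth disparity:

```python
import numpy as np

def end_point_error(pred, gt, valid=None):
    """Mean absolute disparity error (EPE) in pixels over valid pixels."""
    if valid is None:
        valid = np.isfinite(gt)  # treat non-finite ground truth as invalid
    return float(np.abs(pred[valid] - gt[valid]).mean())
```

On KITTI, where ground truth is sparse, the mask matters; on dense synthetic Sceneflow it is usually all-true.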
- (2)
KITTI 2015 [37] and KITTI 2012 [38]:
Conclusion
In this study, we propose a novel and effective network architecture, RDNet, which utilizes edge detection and a multi-scale cost volume for robust stereo matching. We first incorporate edge cues into the stereo stream to improve disparity estimation during feature extraction. We then construct multi-scale cost volumes and fuse them to extract more global context information and structural representations via multiple hourglass modules. Furthermore, the disparity refinement network is used to
Funding
This study is supported in part by the Science and Technology Major Project of Guizhou Province (Qiankehe Major Projects No. ZNWLQC[2019]3012), the Science and Technology Project of Guizhou Province Department of Transportation (2021-322-021), the Natural Science Foundation of Guangdong Province Grant No. 2020A1515110501 and the Science and Technology Planning Project of Shenzhen No. JCYJ20180503182133411.
Declaration of Competing Interest
The authors declare no conflict of interest.
References (41)
- Accurate multiple view 3D reconstruction using patch-based stereo for large-scale scenes, IEEE Trans. Image Process. (2013)
- MVSNet: depth inference for unstructured multi-view stereo
- DeepDriving: learning affordance for direct perception in autonomous driving
- Stereo vision based indoor/outdoor navigation for flying robots
- Virtual blood vessels in complex background using stereo x-ray images
- Stereo matching by training a convolutional neural network to compare image patches, J. Mach. Learn. Res. (2016)
- Efficient deep learning for stereo matching
- A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation
- End-to-end learning of geometry and context for deep stereo regression
- Pyramid stereo matching network
- Multi-level context ultra-aggregation for stereo matching
- Multi-scale cross-form pyramid network for stereo matching
- Real-time self-adaptive deep stereo
- Cascade residual learning: a two-stage convolutional neural network for stereo matching
- StereoNet: guided hierarchical refinement for real-time edge-aware depth prediction
- Left-right comparative recurrent model for stereo matching
- Holistically-nested edge detection
- Richer convolutional features for edge detection
- Learning deep structured multi-scale features using attention-gated CRFs for contour prediction, Advances in Neural Information Processing Systems
- Bi-directional cascade network for perceptual edge detection