Elsevier

Neurocomputing

Volume 492, 1 July 2022, Pages 601-611
Multi-hierarchy feature extraction and multi-step cost aggregation for stereo matching

https://doi.org/10.1016/j.neucom.2021.12.052

Abstract

Compared with traditional methods based on hand-crafted features, learning-based stereo matching methods have made great progress in matching accuracy. However, current CNN-based stereo matching methods usually consume a lot of time and memory. It is therefore very difficult to achieve a good balance between disparity estimation accuracy and inference speed, which is important for applications in real scenarios. To this end, we propose an accurate and fast stereo matching network (named MMNet), which contains two key modules: multi-hierarchy feature extraction and multi-step cost aggregation. To achieve a good trade-off between disparity estimation accuracy and inference speed, a lightweight multi-hierarchy feature extractor is first proposed. This module obtains reliable feature information at different scales through three stable scale hierarchy branches, and outputs multi-step feature flows containing multi-scale fusion information at each step of the highest scale hierarchy branch. Moreover, we propose a multi-step cost aggregation scheme, which uses shallow features to guide cost aggregation and thus ensures a better aggregation effect with a small number of 3D convolutions. Experimental results on the SceneFlow, KITTI 2012 and KITTI 2015 datasets show that the proposed network achieves highly competitive disparity estimation accuracy with fast inference speed.

Introduction

As a classic computer vision task, stereo matching has made great progress with deep learning. The main task of binocular stereo matching is to recover the 3D structure of the real world from rectified image pairs. Accuracy and speed are both essential for a stereo matching algorithm to be applied in real life. However, few existing methods achieve a good trade-off between disparity estimation accuracy and inference speed.

Traditional stereo matching approaches [1], [2] are usually decomposed into four key steps: matching cost computation, cost aggregation, disparity computation and refinement. The accuracy and inference speed of these traditional methods [3], [4], [5], [6], [7] are greatly affected by the design of the feature descriptor and the constraint range of the cost aggregation function. According to the range of cost aggregation, they can be divided into global stereo approaches [4], [5] and local stereo approaches [6], [7]. Global stereo approaches achieve better accuracy than local stereo approaches, but they are slower. The semi-global method [1] proposed by Hirschmuller is a compromise between global and local methods, offering both the accuracy of global methods and the speed of local methods. However, the accuracy of traditional stereo matching methods is much worse than that of learning-based methods.
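The first three steps of the traditional pipeline can be sketched on a single rectified scanline. This is only a minimal illustration of a local approach, not the method of any cited work: the function name `match_scanline`, the absolute-difference cost, the box-window averaging and the sample intensities are all assumptions made for the example, and the refinement step is omitted.

```python
def match_scanline(left, right, max_disp, radius=1):
    """Estimate one disparity per pixel of a rectified scanline."""
    width = len(left)
    inf = float("inf")

    # Step 1, matching cost: absolute intensity difference between
    # left[x] and right[x - d]; positions outside the image are invalid.
    cost = [
        [abs(left[x] - right[x - d]) if x - d >= 0 else inf
         for x in range(width)]
        for d in range(max_disp + 1)
    ]

    # Step 2, cost aggregation: average the valid costs inside a
    # window of 2 * radius + 1 pixels, separately for each disparity.
    aggregated = []
    for d in range(max_disp + 1):
        row = []
        for x in range(width):
            lo, hi = max(0, x - radius), min(width, x + radius + 1)
            valid = [c for c in cost[d][lo:hi] if c != inf]
            row.append(sum(valid) / len(valid) if valid else inf)
        aggregated.append(row)

    # Step 3, disparity computation: winner-take-all over disparities.
    return [
        min(range(max_disp + 1), key=lambda d: aggregated[d][x])
        for x in range(width)
    ]


# Hypothetical data: the left view is the right view shifted by 2 pixels.
right = [10, 50, 20, 80, 30, 60, 90, 40]
left = [0, 0] + right[:-2]
disparity = match_scanline(left, right, max_disp=3)
```

In this toy case every pixel from index 2 onward has a zero aggregated cost at disparity 2, so winner-take-all recovers the shift there. The first two pixels have no valid match at that disparity, which is the kind of boundary case a refinement step would normally handle.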

Recently, learning-based methods have continued to explore different feature extractors and matching cost aggregation algorithms to obtain better stereo matching results. Depending on whether the model contains 3D convolution, these methods can be divided into 3D convolution-based stereo matching methods [8], [9], [10], [11], [12], [13], [14] and 2D convolution-based stereo matching methods [15], [16], [17]. DispNetC [15], FADNet [17] and AANet [16] are all based on 2D convolution, which gives them the advantages of fewer parameters and faster inference speed. DispNetC [15] constructed an end-to-end disparity estimation network based on 2D convolution and was the first to adopt the encoder-decoder framework, in which a correlation layer measures the similarity of left and right image features. FADNet [17] adopted the architecture of DispNetC [15] as its backbone and still used similarity measures based on point-wise correlation to construct the cost volume. AANet [16] used deformable convolution to implement a new sparse point-based intra-scale cost aggregation representation, and further approximated the traditional cross-scale cost aggregation algorithm [18] with neural network layers. However, these 2D convolution-based methods ignore aggregation cues along the disparity dimension. In other words, they only perform cost aggregation on a single-channel correlation map for each disparity level and ignore the relationship between the correlation maps of different disparity levels.
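The point-wise correlation described above can be illustrated on one scanline of features. The sketch below is an assumption-laden toy, not the DispNetC or FADNet implementation: `correlation_volume` is a hypothetical name, features are plain lists of per-pixel channel vectors, and out-of-range positions are filled with zero.

```python
def correlation_volume(left_feat, right_feat, max_disp):
    """Build a cost volume with one correlation value per (disparity, pixel).

    left_feat and right_feat are lists of per-pixel channel vectors.
    """
    width = len(left_feat)
    channels = len(left_feat[0])
    volume = []
    for d in range(max_disp + 1):
        row = []
        for x in range(width):
            if x - d < 0:
                # No matching pixel in the right view: fill with zero.
                row.append(0.0)
            else:
                # Channel-averaged dot product of the two feature vectors.
                dot = sum(l * r for l, r in zip(left_feat[x], right_feat[x - d]))
                row.append(dot / channels)
        volume.append(row)
    return volume


# Hypothetical two-channel features; the left view is shifted by 1 pixel.
right_feat = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
left_feat = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
corr = correlation_volume(left_feat, right_feat, max_disp=1)
```

Each disparity level collapses to a single-channel map here, which keeps the volume small enough for 2D convolutions. It also shows why the channel-wise information at each disparity is no longer available to later aggregation layers.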

In recent years, state-of-the-art performance has been achieved by methods based on stacked 3D CNNs, such as PSMNet [8], GANet [9] and AcfNet [12]. GC-Net [10] directly concatenated the left and right image features to form the cost volume and incorporated a 3D CNN to aggregate contextual features. PSMNet [8] further improved GC-Net [10] by introducing spatial pyramid pooling to enlarge the receptive fields and by using more 3D convolutions for cost aggregation. Based on PSMNet [8], GANet [9] and GwcNet [11] improved the cost aggregation part to obtain better results. Although 3D CNN-based stereo matching methods achieve state-of-the-art performance, their high computational cost and memory consumption make them difficult to apply in practice.
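The concatenation-based cost volume mentioned above can be sketched in the same scanline setting. Again this is only an illustrative toy under stated assumptions: `concat_volume` and `volume_entries` are hypothetical helpers, features are plain lists, and invalid positions are zero-filled.

```python
def concat_volume(left_feat, right_feat, max_disp):
    """Concatenate left and shifted right features for every disparity.

    Returns volume[d][x], a vector with 2 * C channels.
    """
    width = len(left_feat)
    channels = len(left_feat[0])
    volume = []
    for d in range(max_disp + 1):
        row = []
        for x in range(width):
            if x - d < 0:
                # No matching pixel in the right view: zero vector.
                row.append([0.0] * (2 * channels))
            else:
                row.append(list(left_feat[x]) + list(right_feat[x - d]))
        volume.append(row)
    return volume


def volume_entries(volume):
    """Count the scalar entries stored in the volume."""
    return sum(len(vec) for row in volume for vec in row)


# Hypothetical two-channel features on a 3-pixel scanline.
right_feat = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
left_feat = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
vol = concat_volume(left_feat, right_feat, max_disp=1)
n_entries = volume_entries(vol)
```

Compared with a single-channel correlation map, every disparity level now keeps 2C channels, so the number of entries grows by that factor. On full images the volume also spans the image height, which is why aggregation needs 3D convolutions and why memory and computation rise accordingly.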

In short, it is a big challenge to strike a good trade-off between accuracy and inference speed. Specifically, we find that current advanced methods overlook two things: the importance of high-quality unary features for accurate disparity estimation, and the overall correlation between the feature extractor and the cost aggregation module. Therefore, we first design a feature extractor with multiple hierarchy branches and multiple feature output flows. The former enables the extractor to extract more robust multi-scale features through hierarchy branches of different scales, and the latter gradually guide cost aggregation by bridging the extractor to the cost aggregation modules. These two improvements enhance the quality of unary features, thereby avoiding redundant or missing features and raising the accuracy of disparity prediction. In addition, to strengthen the overall correlation between the feature extractor and the cost aggregation module, we gradually integrate the shallow unary feature flows of the left and right views into the cost aggregation process, which also further strengthens the regularization of 3D convolution. As illustrated in Fig. 1, we evaluate existing advanced stereo matching methods in terms of inference time and number of parameters. It can be seen that our method achieves a better balance among disparity estimation accuracy, inference speed and model size.

Our main contributions are summarized as follows:

  • (1) To balance the accuracy and speed of stereo matching, we propose an accurate and fast stereo matching network, which contains two key modules: multi-hierarchy feature extraction and multi-step cost aggregation. The whole network achieves high accuracy and alleviates the high computational cost and memory consumption caused by 3D convolutions, thus achieving a good trade-off between accuracy and inference speed.

  • (2) A lightweight multi-hierarchy feature extractor is proposed, which uses fewer parameters to obtain more robust multi-scale features for stereo matching. Compared with the feature extractor of PSMNet [8], the multi-hierarchy feature extractor has 39% fewer parameters, and the accuracy under the 3-pixel error threshold improves by 0.05%.

  • (3) A multi-step cost aggregation scheme is proposed, including a novel cost volume construction strategy and a multi-step integration strategy. In this scheme, shallow unary features can guide the cost aggregation step by step, thus alleviating the problem of a large number of mismatches in complex illumination and repetitive texture areas.

The remainder of the paper is structured as follows. Section 2 introduces the existing research most relevant to this paper. Section 3 describes the overall network framework and the details of its key modules. Section 4 presents the model implementation scheme and experimental results. Finally, the conclusion and future work are given in Section 5.

Related work

The two key points in achieving a good balance between accuracy and speed in the stereo matching task are: (1) how to design a more efficient feature extractor that obtains recognizable and reliable matching features from stereo image pairs; (2) how to aggregate the cost of matching features more efficiently. At present, designing a high-efficiency matching feature extractor and making full use of the strong regularization of 3D CNNs to achieve better cost aggregation are the explorative

Methods

This section first introduces the overall architecture of MMNet. Then, we describe the details of the multi-hierarchy feature extractor (MHFE) and multi-step cost aggregation (MSCA). These two modules reduce the wrong matches caused by challenging areas such as complex illumination and repetitive textures, and accelerate inference through their lightweight design. Finally, we briefly introduce the disparity regression strategy and the loss function.
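The snippet above does not state which regression and loss MMNet uses. As a point of reference only, the sketch below shows soft-argmin disparity regression, introduced with GC-Net [10], together with a smooth L1 loss. Treating these as the relevant choices is an assumption, and the function names are hypothetical.

```python
import math


def soft_argmin(costs):
    """Expected disparity under softmax(-cost) over the disparity axis."""
    # Subtract the minimum cost for numerical stability.
    base = min(costs)
    weights = [math.exp(-(c - base)) for c in costs]
    total = sum(weights)
    return sum(d * w for d, w in enumerate(weights)) / total


def smooth_l1(pred, target):
    """Smooth L1 loss: quadratic near zero, linear for large errors."""
    diff = abs(pred - target)
    return 0.5 * diff * diff if diff < 1.0 else diff - 0.5


# Hypothetical aggregated costs for one pixel over three disparities.
sharp = soft_argmin([100.0, 0.0, 100.0])   # cost clearly lowest at d = 1
flat = soft_argmin([5.0, 5.0, 5.0])        # uniform costs give the mean
loss = smooth_l1(sharp, 1.0)
```

Unlike winner-take-all, this regression is differentiable and yields sub-pixel estimates, which is what allows an end-to-end network to be trained directly on disparity error.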

Experiments

Experiment settings and results are presented in this section. We first introduce the experimental settings and network implementation, and then verify the effectiveness of our network through related experiments.

Conclusion

In this paper, we propose a fast and accurate deep stereo matching network based on 3D CNNs. The distinctive design of the multi-hierarchy feature extractor obtains more robust multi-scale information with fewer parameters. Moreover, the proposed multi-step cost aggregation strategy enables the model to make full use of the great potential of shallow features and to improve prediction accuracy while reducing the size of the cost volume and the number of parameters.

CRediT authorship contribution statement

Aixin Chong: Methodology, Software, Writing - original draft. Hui Yin: Supervision, Conceptualization, Funding acquisition. Yanting Liu: Writing - original draft, Writing - review & editing, Validation. Jin Wan: Data curation, Validation. Zhihao Liu: Visualization, Validation. Ming Han: Investigation.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work is supported by the R&D Program of Beijing Municipal Education Commission (KJZD20191000402) and the National Natural Science Foundation of China (51827813, 61472029).

Ai-Xin Chong received the B.E. degree from Shandong Agricultural University in 2017. He is currently pursuing the Ph.D. degree with the School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China. His research interests are in machine learning, pattern recognition and computer stereo vision.

References (39)

  • Kuk-Jin Yoon, In So Kweon, Adaptive support-weight approach for correspondence search, IEEE Trans. Pattern Anal. Mach....
  • J. Chang et al., Pyramid stereo matching network
  • F. Zhang, V. Prisacariu, R. Yang, P. Torr, Ga-net: Guided aggregation net for end-to-end stereo matching, The IEEE...
  • A. Kendall et al., End-to-end learning of geometry and context for deep stereo regression (2017)
  • X. Guo et al., Group-wise correlation stereo network
  • Y. Zhang, Y. Chen, X. Bai, S. Yu, K. Yu, Z. Li, K. Yang, Adaptive unimodal cost volume filtering for deep stereo...
  • N. Mayer et al., A large dataset to train convolutional networks for disparity, optical flow and scene flow estimation
  • H. Xu, J. Zhang, Aanet: Adaptive aggregation network for efficient stereo matching, The IEEE Conference on Computer...
  • Q. Wang et al., Fadnet: A fast and accurate network for disparity estimation

    Hui Yin received the Ph.D. degree in computer application technology from Beijing Jiaotong University, Beijing, China. She is currently a Full Professor with the School of Computer and Information Technology, Beijing Jiaotong University. Her current research interests include machine vision, intelligent information processing and their application in the railway industry.

    Yanting Liu received the B.E. degree from Hebei University, Baoding, China, in 2018. She is currently pursuing the Ph.D. degree with the School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China. Her current research interests include computer vision and pattern recognition.

    Jin Wan received the B.E. degree from Changchun University of Science and Technology in 2017. He is currently pursuing the Ph.D. degree with the School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China. His research interests are in machine learning, pattern recognition, image processing and algorithms.

    Zhihao Liu received the B.E. degree from Beijing Institute of Graphic Communication, Beijing, China, in 2016. He is currently pursuing the Ph.D. degree with the School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China. His current research interests include computer vision and pattern recognition.

    Ming Han received the master's degree from the College of Photonic and Electronic Engineering, Fujian Normal University, Fuzhou, China, in 2017. He is currently pursuing the Ph.D. degree with the School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China. His current main research interests are in computer vision, including 3D point cloud reconstruction.
