Neurocomputing

Volume 514, 1 December 2022, Pages 403-413

PCNet: Paired channel feature volume network for accurate and efficient depth estimation

https://doi.org/10.1016/j.neucom.2022.09.024

Abstract

Most state-of-the-art deep learning based depth estimation methods follow the pipeline of first forming a 4D cost volume (feature dimension, maximum disparity, height, and width) and then regressing disparity from the cost volume with several 3D convolutional layers. Applying 3D operations to the 4D tensor leads to unacceptable computational complexity and memory cost. To solve this problem, we aim to replace the 4D cost volume with a 3D cost volume so that disparity can be regressed by 2D convolutions, achieving a good balance between efficiency and effectiveness. To this end, a lightweight network, called PCNet, is proposed to generate the 3D cost volume. The main novelty lies in the proposed Paired Channel Feature Volume (PCFV), which combines the features of stereo pairs with specially designed 3D filters to preliminarily encode the relationship between each pair of channels. Moreover, a densely connected aggregation is performed on the outputs of the PCFV to exploit much richer contextual information. Experimental results on the SceneFlow, KITTI 2012, and KITTI 2015 datasets demonstrate that the proposed PCNet achieves accuracy comparable to state-of-the-art methods while maintaining high efficiency.

Introduction

Depth estimation is a fundamental and important task in the field of computer vision. Accurate depth sensing plays an important role in applications such as odometry, robot navigation, and autonomous driving. Meanwhile, depth information is also helpful in high-level tasks such as object detection and semantic segmentation [12], [13], [14], [15], [16], [17].

Depth estimation methods can be divided into two categories: active and passive. In active depth estimation, depth information is generated by an active depth sensor, such as MMW (millimeter-wave) radar or LiDAR (Light Detection and Ranging), which are widely used in autonomous driving. A depth map can be extracted from the point clouds generated by the active sensor. In passive depth estimation, depth information is generated by a passive depth sensor, such as a stereo camera. Unlike active methods, stereo-based depth estimation methods have the advantages of low cost and rich contextual information, which makes them a common solution to the depth estimation problem.

With the development of deep learning, great progress has been achieved in computer vision [1], [2], [3], [18], [19], [20], [21], [7], [6], [8], [9], [10], [11]. The CNN (Convolutional Neural Network), as the most representative deep learning method, has been introduced to the stereo matching task and achieves good results in both accuracy and efficiency. Existing CNN based stereo methods can be classified into feature based methods and cost volume based methods. In feature based methods, disparity is directly predicted by a deep CNN without special design. However, feature based methods fail in complex scenes because the structural information of the stereo inputs is not taken into account. In recent years, cost volume based methods have followed the pipeline of feature extraction, cost volume construction, cost aggregation, and refinement. By taking the geometric constraints of stereo matching into account, cost volume based methods achieve promising results even in challenging situations. However, they suffer from high computational cost compared with feature based methods because of the construction of the cost volume.

Two typical cost volumes are summarized in Fig. 1. Fig. 1(a) shows the concat based cost volume [22], [23], [24]. The left and right features are aligned in the h dimension and slid in the w dimension from 0 to D (D is the maximum disparity to predict). The overlapping parts of the stereo features are then concatenated to obtain a 4D cost volume of size [D, 2C, H, W]. However, methods built on this cost volume suffer from high computational cost because 3D convolutional operations are involved in the aggregation module. The correlation based cost volume [25], [26] is shown in Fig. 1(b). The left and right features are aligned in the h dimension and slid in the w dimension from 0 to D. The overlapping parts of the left and right features are then fused by inner product to obtain D tensors. Finally, the averages of all the tensors along the channel dimension are concatenated to obtain a 3D cost volume of size [D, H, W]. The correlation based cost volume has the advantage of efficiency but lacks contextual information, which results in low accuracy.
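
To make the two constructions concrete, the following is a minimal PyTorch sketch, assuming left/right feature maps of shape [B, C, H, W]; the function names and the batch-first shape ordering are illustrative choices, not the authors' code.

    import torch

    def concat_cost_volume(left, right, max_disp):
        # Fig. 1(a): 4D volume, here shaped [B, 2C, D, H, W] for PyTorch's Conv3d.
        B, C, H, W = left.shape
        volume = left.new_zeros(B, 2 * C, max_disp, H, W)
        for d in range(max_disp):
            if d == 0:
                volume[:, :C, d] = left
                volume[:, C:, d] = right
            else:
                # Concatenate the overlapping parts of the shifted features.
                volume[:, :C, d, :, d:] = left[:, :, :, d:]
                volume[:, C:, d, :, d:] = right[:, :, :, :-d]
        return volume  # aggregated with expensive 3D convolutions

    def correlation_cost_volume(left, right, max_disp):
        # Fig. 1(b): 3D volume of shape [B, D, H, W] via a channel-averaged inner product.
        B, C, H, W = left.shape
        volume = left.new_zeros(B, max_disp, H, W)
        for d in range(max_disp):
            if d == 0:
                volume[:, d] = (left * right).mean(dim=1)
            else:
                volume[:, d, :, d:] = (left[:, :, :, d:] * right[:, :, :, :-d]).mean(dim=1)
        return volume  # cheap, but channel context collapses to one scalar per disparity

The sketch makes the trade-off visible: the 4D volume keeps all 2C channels per disparity (hence 3D convolutions downstream), while the correlation volume reduces each disparity to a single channel.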

To balance accuracy and efficiency, a Paired Channel Feature Volume Network (PCNet) is proposed. The key of the proposed method is a novel Paired Channel Feature Volume (PCFV) module (Fig. 1(c)). The left and right feature maps are interleaved in the channel dimension. Then small 3D convolutional filters, called Paired Channel 3D Convolutions, are introduced to learn the relations between each pair of channel-aligned feature maps. Finally, the averages of the feature volume along the channel dimension are computed to generate the Paired Channel Feature Volume. All the steps of the proposed PCFV are summarized in Alg. 1. Because the Paired Channel Feature Volume is a 3D tensor, we use 2D convolutions in the aggregation and refinement modules instead of 3D convolutions. The PCFV module thus avoids the high computational cost caused by 3D convolutional operations.
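
The following is a hedged sketch of how the PCFV described above could be implemented in PyTorch. The disparity shifting, the kernel size, and the grouped-convolution realization of the Paired Channel 3D Convolution are assumptions based on the description; Alg. 1 in the paper gives the authors' exact steps.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PCFV(nn.Module):
        def __init__(self, channels, max_disp, kernel=3):
            super().__init__()
            self.max_disp = max_disp
            # One group per channel pair: group i sees only the interleaved pair
            # (l_i, r_i), playing the role of the Paired Channel 3D Convolution.
            self.paired_conv = nn.Conv3d(2 * channels, channels, kernel_size=kernel,
                                         padding=kernel // 2, groups=channels)

        def forward(self, left, right):  # left, right: [B, C, H, W]
            B, C, H, W = left.shape
            volume = left.new_zeros(B, 2 * C, self.max_disp, H, W)
            for d in range(self.max_disp):
                r = right if d == 0 else F.pad(right[:, :, :, :-d], (d, 0))
                # Interleave channels as (l0, r0, l1, r1, ...).
                volume[:, :, d] = torch.stack((left, r), dim=2).reshape(B, 2 * C, H, W)
            # Encode each channel pair, then average over channels -> [B, D, H, W].
            return self.paired_conv(volume).mean(dim=1)

Because the output is a 3D tensor of shape [B, D, H, W], all downstream aggregation and refinement can run with 2D convolutions, which is the efficiency argument made above.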

The contributions of the paper can be summarized as follows:

  • A novel feature volume module, called the PCFV module, is proposed to describe the relations between stereo feature maps. Compared with a 4D cost volume, the 3D tensor generated by the PCFV module can be handled by lightweight aggregation and refinement modules.

  • A densely connected aggregation module is proposed to exploit abundant contextual information and enhance the feature representation (a sketch is given after this list).

  • Experiments are conducted on one synthetic dataset (SceneFlow Dataset [28]) and two real-world driving datasets (KITTI 2012 Dataset [29] and KITTI 2015 Dataset [30]). The proposed PCNet achieves state-of-the-art accuracy with high efficiency.
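
As a concrete illustration of the second contribution, here is a hedged sketch of a densely connected 2D aggregation over the 3D volume, treating the disparity dimension as channels; the depth, growth rate, and layer composition are assumptions, not the paper's exact design.

    import torch
    import torch.nn as nn

    class DenseAggregation(nn.Module):
        def __init__(self, max_disp, growth=32, num_layers=4):
            super().__init__()
            self.layers = nn.ModuleList()
            ch = max_disp
            for _ in range(num_layers):
                self.layers.append(nn.Sequential(
                    nn.Conv2d(ch, growth, 3, padding=1, bias=False),
                    nn.BatchNorm2d(growth),
                    nn.ReLU(inplace=True)))
                ch += growth  # dense connectivity: each layer sees all earlier outputs
            self.out = nn.Conv2d(ch, max_disp, 3, padding=1)

        def forward(self, volume):  # volume: [B, D, H, W]
            feats = [volume]
            for layer in self.layers:
                feats.append(layer(torch.cat(feats, dim=1)))
            return self.out(torch.cat(feats, dim=1))  # refined [B, D, H, W]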

Related Work

In this section, we review depth estimation, concat based cost volumes, correlation based cost volumes, multi-scale cost volume aggregation, and deformable convolution.

Our Proposed Method

It is well known that the cost volume is crucial for depth estimation in terms of both accuracy and efficiency. In this section, we propose a novel module, called the Paired Channel Feature Volume (PCFV), for efficiently generating a high-quality cost volume. The description of the PCFV and its novelty is given in Section 3.2. We begin by presenting the network architecture of the proposed method.

Experiments

In this section, the datasets and experimental details are introduced in Section 4.1. In Section 4.2, an ablation study is conducted on the SceneFlow Dataset to verify the effectiveness of the proposed PCFV and the densely connected aggregation. In Section 4.3, experiments are conducted to demonstrate generality. Visualization results and objective evaluation metrics are reported in Section 4.4, Section 4.5, and Section 4.6.

Conclusion

We have proposed a Paired Channel Feature Volume Network (PCNet) which incorporates the PCFV and a densely connected aggregation. A weight-sharing feature extractor has been used to generate multi-scale feature maps from the stereo images. A novel PCFV has then been proposed to describe the relations between the stereo feature maps. Finally, a densely connected aggregation has been used to incorporate richer contextual information. Experimental results indicate that the proposed PCNet has advantages in both accuracy and efficiency.

CRediT authorship contribution statement

Dayu Jia: Conceptualization, Methodology, Software, Writing - original draft. Yanwei Pang: Supervision, Writing - review & editing. Jiale Cao: Writing - review & editing, Project administration. Jing Pan: Visualization, Data curation, Formal analysis.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work was supported in part by National Key R&D program of China (Grant No. 2018AAA0102802) and Tianjin Research Program of Science and Technology (Grant No. 19ZXZNG00050).


References (63)

  • W. Gan et al., Light-weight network for real-time adaptive stereo depth estimation, Neurocomputing (2021).
  • S. Chen et al., PGNet: Panoptic parsing guided deep stereo matching, Neurocomputing (2021).
  • Y. Liu et al., ABNet: Adaptive balanced network for multiscale object detection in remote sensing imagery, IEEE Transactions on Geoscience and Remote Sensing (2022).
  • Y. Liu et al., Part-object relational visual saliency, IEEE Transactions on Pattern Analysis and Machine Intelligence (2021).
  • G. Wu et al., Unsupervised deep video hashing via balanced code for large-scale video retrieval, IEEE Transactions on Image Processing (2019).
  • Y. Liu et al., Integrating part-object relationship and contrast for camouflaged object detection, IEEE Transactions on Information Forensics and Security (2021).
  • Q. Wang et al., Hybrid feature aligned network for salient object detection in optical remote sensing imagery, IEEE Transactions on Geoscience and Remote Sensing (2022).
  • J. Xie et al., PSC-Net: Learning part spatial co-occurrence for occluded pedestrian detection, Science China Information Sciences (2021).
  • Z. Zhang et al., CGNet: Cross-guidance network for semantic segmentation, Science China Information Sciences (2020).
  • S. Ma et al., Preserving details in semantics-aware context for scene parsing, Science China Information Sciences (2020).
  • J. Cao et al., SipMaskv2: Enhanced fast image and video instance segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence (2022).
  • Y. Wang et al., Pseudo-LiDAR from visual depth estimation: Bridging the gap in 3D object detection for autonomous driving.
  • A.D. Pon, J. Ku, C. Li, S.L. Waslander, Object-centric stereo matching for 3D object detection, arXiv preprint...
  • P. Li et al., Stereo R-CNN based 3D object detection for autonomous driving.
  • T. Vlad et al., Performance evaluation of deep learning networks for semantic segmentation of traffic stereo-pair images.
  • J. Zhang et al., DispSegNet: Leveraging semantics for end-to-end learning of disparity estimation from stereo imagery, IEEE Robotics and Automation Letters (2019).
  • G. Yang et al., SegStereo: Exploiting semantic information for disparity estimation.
  • Q. Wang et al., Gradient matters: Designing binarized neural networks via enhanced information-flow, IEEE Transactions on Pattern Analysis and Machine Intelligence (2021).
  • Z. Xiong et al., ASK: Adaptively selecting key local features for RGB-D scene recognition, IEEE Transactions on Image Processing (2021).
  • X. Li et al., A multiview-based parameter free framework for group detection.
  • X. Li et al., Locality adaptive discriminant analysis.
  • X. Guo et al., Group-wise correlation stereo network.
  • J.-R. Chang et al., Pyramid stereo matching network.
  • F. Zhang et al., GA-Net: Guided aggregation net for end-to-end stereo matching.
  • N. Mayer et al., A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation.
  • H. Xu et al., AANet: Adaptive aggregation network for efficient stereo matching.
  • G. Xu et al., Attention concatenation volume for accurate and efficient stereo matching.
  • A. Geiger et al., Are we ready for autonomous driving? The KITTI vision benchmark suite.
  • M. Menze et al., Object scene flow for autonomous vehicles.
  • H.R. Affendi et al., Literature survey on stereo vision disparity map algorithms, Journal of Sensors (2016).

    Dayu Jia Tianjin Key Laboratory of Brain-Inspired Intelligence Technology, School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China. Dayu Jia received the B.S. degree in electronic information engineering from the Harbin Institute of Technology, Harbin, China, in 2013 and the M.S. degree in information and communication engineering from Shandong University, Shandong, China, in 2017. He is currently a Ph.D. candidate at Tianjin University. His research interests include semantic segmentation and stereo matching.

    Yanwei Pang Tianjin Key Laboratory of Brain-Inspired Intelligence Technology, School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China. Yanwei Pang (Senior Member, IEEE) received the Ph.D. degree in electronic engineering from the University of Science and Technology of China in 2004. Currently, he is a Professor at Tianjin University, China, and the Founding Director of the Tianjin Key Laboratory of Brain-Inspired Intelligence Technology (BIIT lab), China. His research interests include object detection and image recognition, in which he has published 150 scientific papers, including 40 articles in IEEE TRANSACTIONS and 30 papers in top conferences (e.g., CVPR, ICCV, and ECCV). He is an Associate Editor of both IEEE Transactions on Neural Networks and Learning Systems and Neural Networks (Elsevier) and a Guest Editor of Pattern Recognition Letters.

    Jiale Cao Tianjin Key Laboratory of Brain-Inspired Intelligence Technology, School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China. Jiale Cao received the Ph.D. degree in information and communication engineering from Tianjin University, Tianjin, China, in 2018. He is currently an associate professor at Tianjin University. His research interests include object detection and deep learning, in which he has published 19 papers in top conferences and journals, including IEEE CVPR, IEEE ICCV, ECCV, IEEE T-PAMI, IEEE TIP, IEEE TCSVT and IEEE TIFS.

    Jing Pan School of Electronic Engineering, Tianjin University of Technology and Education, Tianjin 300222, China. Jing Pan received the B.S. degree in mechanical engineering from North China Institute of Technology (now North University of China), Taiyuan, China, in 2002, and the M.S. degree in precision instrument and mechanism from the University of Science and Technology of China, Hefei, China, in 2007. She is currently an Associate Professor with the School of Electronic Engineering, Tianjin University of Technology and Education, Tianjin, China. Her research interests include computer vision and pattern recognition.
