PCNet: Paired channel feature volume network for accurate and efficient depth estimation
Introduction
Depth estimation is a fundamental and important task in the field of computer vision. Accurate depth sensing plays an important role in applications such as odometry, robot navigation, and autonomous driving. Meanwhile, depth information is also helpful in high-level tasks such as object detection and semantic segmentation [12], [13], [14], [15], [16], [17].
Depth estimation methods can be divided into two categories: active depth estimation and passive depth estimation. In active depth estimation, depth information is generated by an active depth sensor, such as MMW radar (millimeter-wave radar) or LiDAR (Light Detection and Ranging), both widely used in autonomous driving. A depth map can be extracted from the point clouds generated by the active depth sensor. In passive depth estimation, depth information is generated by a passive depth sensor, such as a stereo camera. Unlike active depth estimation methods, stereo-based depth estimation methods offer the advantages of low cost and rich contextual information, which makes them a common solution to the depth estimation problem.
With the development of deep learning, great progress has been achieved in computer vision [1], [2], [3], [18], [19], [20], [21], [7], [6], [8], [9], [10], [11]. The CNN (Convolutional Neural Network), as the most representative deep learning method, has been introduced to the stereo matching task and achieves good results in both accuracy and efficiency. Existing CNN based stereo methods can be classified into feature based methods and cost volume based methods. In feature based methods, disparity information is directly predicted by a deep CNN without special design. However, feature based methods fail in complex scenes because the structural information of the stereo inputs is not taken into account. In recent years, cost volume based methods have followed the pipeline of feature extraction, cost volume construction, cost aggregation, and refinement. Cost volume based methods achieve promising results even in challenging situations by taking the geometric constraints of stereo matching into account. However, they suffer from high computational cost compared with feature based methods because of the construction of the cost volume.
Two typical cost volumes are summarized in Fig. 1. Fig. 1(a) shows the concat based cost volume [22], [23], [24]. The left and right features are aligned in the channel dimension, and one feature map slides along the width dimension from 0 to D (D is the maximum disparity to predict). The overlapping parts of the stereo features are then concatenated to obtain a 4D cost volume of size D × H × W × 2C. However, such cost volume based methods suffer from high computational cost because 3D convolutional operations are involved in the aggregation module. The correlation based cost volume [25], [26] is shown in Fig. 1(b). The left and right features are aligned in the channel dimension and slide along the width dimension from 0 to D. The overlapping parts of the left and right features are then fused by inner product to obtain D tensors. Finally, the channel-wise averages of all D tensors are concatenated to obtain a 3D cost volume of size D × H × W. The correlation based cost volume gains efficiency but lacks contextual information, which results in low accuracy.
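The two constructions above can be sketched in a few lines. The following is a minimal NumPy illustration (not code from the paper), assuming feature maps of shape (C, H, W) and a shift of the right features along the width axis:

```python
import numpy as np

def concat_cost_volume(left, right, max_disp):
    """Concat based 4D cost volume sketch.

    left, right: feature maps of shape (C, H, W).
    Returns a volume of shape (2C, max_disp, H, W): for each candidate
    disparity d, the left features are stacked on top of the right
    features shifted d pixels along the width axis.
    """
    C, H, W = left.shape
    volume = np.zeros((2 * C, max_disp, H, W), dtype=left.dtype)
    for d in range(max_disp):
        volume[:C, d, :, d:] = left[:, :, d:]
        volume[C:, d, :, d:] = right[:, :, : W - d]
    return volume

def correlation_cost_volume(left, right, max_disp):
    """Correlation based 3D cost volume sketch: the channel-wise inner
    product (averaged over C) replaces concatenation, giving a volume
    of shape (max_disp, H, W)."""
    C, H, W = left.shape
    volume = np.zeros((max_disp, H, W), dtype=left.dtype)
    for d in range(max_disp):
        volume[d, :, d:] = (left[:, :, d:] * right[:, :, : W - d]).mean(axis=0)
    return volume
```

The sketch makes the trade-off visible: the concat volume keeps all 2C feature channels per disparity (hence 3D convolutions in aggregation), while the correlation volume collapses them to a single score per disparity.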
To balance accuracy and efficiency, a Paired Channel Feature Volume Network (PCNet) is proposed. The key of the proposed method is a novel Paired Channel Feature Volume (PCFV) module (Fig. 1(c)). The left and right feature maps are interleaved in the channel dimension. Then small 3D convolutional filters, called Paired Channel 3D Convolutions, are introduced to learn relations within each pair of channel-aligned feature maps. Finally, the channel-wise averages of the feature volume are computed to generate the Paired Channel Feature Volume. All the steps of the proposed PCFV are summarized in Alg. 1. Because the Paired Channel Feature Volume is a 3D tensor, 2D convolutions are used in the aggregation and refinement modules instead of 3D convolutions, so the PCFV module avoids the high computational cost caused by 3D convolutional operations.
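As a rough illustration of the channel interleaving step (Alg. 1 itself is not reproduced here), the sketch below interleaves the stereo features channel-wise and fuses each pair before averaging. The learned Paired Channel 3D Convolution is replaced by a fixed element-wise product, which is purely a placeholder assumption, not the paper's learned operator:

```python
import numpy as np

def paired_channel_interleave(left, right):
    """Interleave left/right feature maps channel-wise, so that channel
    2k comes from the left map and channel 2k+1 from the right map."""
    C, H, W = left.shape
    paired = np.empty((2 * C, H, W), dtype=left.dtype)
    paired[0::2] = left
    paired[1::2] = right
    return paired

def pcfv_sketch(left, right, max_disp, fuse=None):
    """Toy stand-in for the PCFV module: for each disparity d, interleave
    the overlapping left/right features, fuse each channel pair (the paper
    learns this with small paired channel 3D convolutions; a fixed product
    stands in here), and average over channels to get a 3D volume."""
    if fuse is None:
        fuse = lambda a, b: a * b  # placeholder for the learned pairing
    C, H, W = left.shape
    volume = np.zeros((max_disp, H, W), dtype=left.dtype)
    for d in range(max_disp):
        paired = paired_channel_interleave(left[:, :, d:], right[:, :, : W - d])
        volume[d, :, d:] = fuse(paired[0::2], paired[1::2]).mean(axis=0)
    return volume
```

The point of the interleaving is that adjacent channels always come from the same channel index of the two views, so a small filter spanning two channels sees exactly one left/right pair.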
The contributions of the paper can be summarized as follows:
- A novel feature volume module called the PCFV module is proposed to describe the relations between stereo feature maps. Compared with a 4D cost volume, the 3D tensor generated by the PCFV module can be handled by light-weight aggregation and refinement modules.
- A densely connected aggregation module is proposed to exploit abundant contextual information and enhance the feature representation.
- Experiments are conducted on one synthetic dataset (SceneFlow Dataset [28]) and two real-world driving datasets (KITTI 2012 Dataset [29] and KITTI 2015 Dataset [30]). The proposed PCNet achieves state-of-the-art accuracy with high efficiency.
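The densely connected aggregation of the second contribution can be illustrated schematically. The sketch below is an assumption about the connectivity pattern only, not the paper's exact architecture: every stage receives the channel-wise concatenation of the input and all earlier stage outputs (DenseNet-style feature reuse), with plain callables standing in for the 2D convolutional blocks:

```python
import numpy as np

def dense_aggregation(volume, stages):
    """Densely connected aggregation sketch over a 3D volume of shape
    (C, H, W). Each element of `stages` is a callable mapping a
    (C_in, H, W) tensor to a (C_out, H, W) tensor; in the paper these
    would be 2D convolutional blocks."""
    features = [volume]
    for stage in stages:
        # Dense connectivity: concatenate the input and every earlier output.
        out = stage(np.concatenate(features, axis=0))
        features.append(out)
    return np.concatenate(features, axis=0)
```

Because the PCFV output is a 3D tensor, such stages need only 2D convolutions, which is where the efficiency gain over 4D cost volumes comes from.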
Related Work
In this section, a review of depth estimation, concat based cost volume, correlation based cost volume, multi-scale cost volume aggregation, and deformable convolution is given.
Our Proposed Method
It is well known that the cost volume is crucial for depth estimation in the sense of both accuracy and efficiency. In this section, we propose a novel module, called Paired Channel Feature Volume (PCFV), for efficiently generating a high-quality cost volume. The description of PCFV and its novelty are given in Section 3.2. We begin by presenting the network architecture of the proposed method.
Experiments
In this section, the datasets and experimental details are introduced in Section 4.1. In Section 4.2, an ablation study is conducted on the SceneFlow Dataset to prove the validity of the proposed PCFV and densely connected aggregation. In Section 4.3, experiments are conducted to prove the generality of the method. Visualization results and objective evaluation indices are reported in Section 4.4, Section 4.5, and Section 4.6.
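For reference, the standard evaluation metrics on these benchmarks, end-point error (EPE) on SceneFlow and the D1 outlier rate (error above 3 px and above 5% of the ground truth) on KITTI, can be computed as follows. This is a generic sketch of the public metric definitions, not code from the paper:

```python
import numpy as np

def epe(pred, gt, valid=None):
    """End-point error: mean absolute disparity error over valid pixels."""
    err = np.abs(pred - gt)
    if valid is not None:
        err = err[valid]
    return float(err.mean())

def d1_error(pred, gt, valid=None, abs_thresh=3.0, rel_thresh=0.05):
    """KITTI D1 metric: fraction of pixels whose disparity error exceeds
    both 3 px and 5% of the ground-truth disparity."""
    err = np.abs(pred - gt)
    bad = (err > abs_thresh) & (err > rel_thresh * np.abs(gt))
    if valid is not None:
        bad = bad[valid]
    return float(bad.mean())
```

The `valid` mask matters in practice: KITTI ground truth is sparse, so both metrics are averaged only over pixels with a LiDAR-derived label.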
Conclusion
We have proposed a Paired Channel Feature Volume Network (PCNet) which incorporates PCFV and densely connected aggregation. A share-weighted feature extractor has been used to generate multi-scale feature maps from stereo images. Then a novel PCFV has been proposed to describe relations between stereo feature maps. Finally, a densely connected aggregation has been used to involve more contextual information. Experimental results indicate that the proposed PCNet has advantages in both accuracy and efficiency.
CRediT authorship contribution statement
Dayu Jia: Conceptualization, Methodology, Software, Writing - original draft. Yanwei Pang: Supervision, Writing - review & editing. Jiale Cao: Writing - review & editing, Project administration. Jing Pan: Visualization, Data curation, Formal analysis.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
This work was supported in part by National Key R&D program of China (Grant No. 2018AAA0102802) and Tianjin Research Program of Science and Technology (Grant No. 19ZXZNG00050).
References (63)
- et al., Light-weight network for real-time adaptive stereo depth estimation, Neurocomputing (2021).
- et al., PGNet: Panoptic parsing guided deep stereo matching, Neurocomputing (2021).
- et al., ABNet: Adaptive balanced network for multiscale object detection in remote sensing imagery, IEEE Transactions on Geoscience and Remote Sensing (2022).
- et al., Part-object relational visual saliency, IEEE Transactions on Pattern Analysis and Machine Intelligence (2021).
- et al., Unsupervised deep video hashing via balanced code for large-scale video retrieval, IEEE Transactions on Image Processing (2019).
- et al., Integrating part-object relationship and contrast for camouflaged object detection, IEEE Transactions on Information Forensics and Security (2021).
- et al., Hybrid feature aligned network for salient object detection in optical remote sensing imagery, IEEE Transactions on Geoscience and Remote Sensing (2022).
- et al., PSC-Net: learning part spatial co-occurrence for occluded pedestrian detection, Science China Information Sciences (2021).
- et al., CGNet: cross-guidance network for semantic segmentation, Science China Information Sciences (2020).
- et al., Preserving details in semantics-aware context for scene parsing, Science China Information Sciences (2020).
- SipMaskv2: enhanced fast image and video instance segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence.
- Pseudo-LiDAR from visual depth estimation: Bridging the gap in 3D object detection for autonomous driving.
- Stereo R-CNN based 3D object detection for autonomous driving.
- Performance evaluation of deep learning networks for semantic segmentation of traffic stereo-pair images.
- DispSegNet: Leveraging semantics for end-to-end learning of disparity estimation from stereo imagery, IEEE Robotics and Automation Letters.
- SegStereo: Exploiting semantic information for disparity estimation.
- Gradient matters: Designing binarized neural networks via enhanced information-flow, IEEE Transactions on Pattern Analysis and Machine Intelligence.
- ASK: Adaptively selecting key local features for RGB-D scene recognition, IEEE Transactions on Image Processing.
- A multiview-based parameter free framework for group detection.
- Locality adaptive discriminant analysis.
- Group-wise correlation stereo network.
- Pyramid stereo matching network.
- GA-Net: Guided aggregation net for end-to-end stereo matching.
- A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation.
- AANet: Adaptive aggregation network for efficient stereo matching.
- Attention concatenation volume for accurate and efficient stereo matching.
- Are we ready for autonomous driving? The KITTI vision benchmark suite.
- Object scene flow for autonomous vehicles.
- Literature survey on stereo vision disparity map algorithms, Journal of Sensors.
Dayu Jia Tianjin Key Laboratory of Brain-Inspired Intelligence Technology, School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China. Dayu Jia received the B.S. degree in electronic information engineering from the Harbin Institute of Technology, Harbin, China, in 2013 and the M.S. degree in information and communication engineering from Shandong University, Shandong, China, in 2017. He is currently a Ph.D. candidate at Tianjin University. His research interests include semantic segmentation and stereo matching.
Yanwei Pang Tianjin Key Laboratory of Brain-Inspired Intelligence Technology, School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China. Yanwei Pang (Senior Member, IEEE) received the Ph.D. degree in electronic engineering from the University of Science and Technology of China in 2004. Currently, he is a Professor at the Tianjin University, China, and the Founding Director of the Tianjin Key Laboratory of Brain-Inspired Intelligence Technology (BIIT lab), China. His research interests include object detection and image recognition, in which he has published 150 scientific papers, including 40 articles in IEEE TRANSACTIONS and 30 papers in top conferences (e.g., CVPR, ICCV, and ECCV). He is an Associate Editor of both IEEE Transactions on Neural Networks and Learning Systems and Neural Networks (Elsevier) and a Guest Editor of Pattern Recognition Letters.
Jiale Cao Tianjin Key Laboratory of Brain-Inspired Intelligence Technology, School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China. Jiale Cao received the Ph.D. in information and communication engineering from the Tianjin University, Tianjin, China, in 2018. He is currently an associate professor at the Tianjin University. His research interests include object detection and deep learning, in which he has published 19 papers in top conferences and journals, including IEEE CVPR, IEEE ICCV, ECCV, IEEE T-PAMI, IEEE TIP, IEEE TCSVT and IEEE TIFS.
Jing Pan School of Electronic Engineering, Tianjin University of Technology and Education, Tianjin 300222, China. Jing Pan received the B.S. degree in mechanical engineering from North China Institute of Technology (now North University of China), Taiyuan, China, in 2002, and the M.S. degree in precision instrument and mechanism from the University of Science and Technology of China, Hefei, China, in 2007. She is currently an Associate Professor with the School of Electronic Engineering, Tianjin University of Technology and Education, Tianjin, China. Her research interests include computer vision and pattern recognition.