Abstract:
Recent studies have shown that convolutional neural networks (CNNs) can jointly learn depth and pose estimation from unlabelled monocular frames. However, three problems remain: 1) CNNs can extract only local features because of their limited receptive field, 2) scale ambiguity is inherent in the monocular task, and 3) ill-conditioned regions violate the photometric consistency assumption and produce large errors. We propose a novel framework, ADPDepth, with effective strategies to ameliorate each of these problems. First, a PCAtt module is designed to capture the correlation between channels and to efficiently extract multiscale spatial information through a multibranch parallel strategy. Second, a depth-pose consistency loss is proposed, based on the geometric consistency between depth and pose, to constrain the scale across samples, eliminate scale ambiguity, and obtain a globally consistent scale. To further improve performance, a cover mask is derived from the depth-pose consistency to filter out dynamic objects and outliers, reducing the adverse effects of these ill-conditioned regions. Extensive experiments are conducted on the KITTI, NYU-Depth, and Make3D datasets. The results on these public benchmarks confirm that the proposed ADPDepth framework achieves state-of-the-art performance, and subsequent ablation experiments verify the effectiveness of each strategy.
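The abstract does not give the depth-pose consistency loss in closed form, but the idea can be illustrated with a short sketch. The snippet below follows the common geometry-consistency pattern in self-supervised monocular depth estimation: depth predicted for frame a is back-projected to 3D, transformed by the estimated relative pose, and compared against the depth predicted for frame b; the normalized disagreement yields both a loss and a per-pixel mask that down-weights dynamic objects and outliers. All names (depth_pose_consistency, T_ab), the normalization, and the mask construction are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a depth-pose consistency loss and a derived
# "cover"-style mask, assuming a geometry-consistency formulation.
import torch
import torch.nn.functional as F

def depth_pose_consistency(depth_a, depth_b, T_ab, K):
    """
    depth_a, depth_b: (B,1,H,W) predicted depths for frames a and b.
    T_ab: (B,4,4) relative pose mapping frame-a camera coords into frame b.
    K: (3,3) camera intrinsics shared by both frames.
    Returns a scalar consistency loss and a (B,1,H,W) validity mask.
    """
    b, _, h, w = depth_a.shape
    device, dtype = depth_a.device, depth_a.dtype

    # Pixel grid of frame a in homogeneous coordinates: (3, H*W).
    ys, xs = torch.meshgrid(
        torch.arange(h, device=device, dtype=dtype),
        torch.arange(w, device=device, dtype=dtype),
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(3, -1)

    # Back-project frame-a pixels to 3D and transform them into frame b.
    cam_a = torch.linalg.inv(K) @ pix                              # (3, H*W)
    pts_a = cam_a.unsqueeze(0) * depth_a.reshape(b, 1, -1)         # (B,3,H*W)
    ones = torch.ones(b, 1, h * w, device=device, dtype=dtype)
    pts_b = (T_ab @ torch.cat([pts_a, ones], dim=1))[:, :3]        # (B,3,H*W)

    # Project into frame b; z is the depth the transform predicts there.
    proj = K @ pts_b                                               # (B,3,H*W)
    z = proj[:, 2:3].clamp(min=1e-6)
    uv = proj[:, :2] / z

    # Sample frame-b's predicted depth at the projected pixel locations.
    u = uv[:, 0] / (w - 1) * 2 - 1                                 # to [-1,1]
    v = uv[:, 1] / (h - 1) * 2 - 1
    grid = torch.stack([u, v], dim=-1).reshape(b, h, w, 2)
    depth_b_sampled = F.grid_sample(depth_b, grid, align_corners=True)

    # Normalized disagreement between projected and predicted depths.
    z_map = z.reshape(b, 1, h, w)
    diff = (z_map - depth_b_sampled).abs() / \
           (z_map + depth_b_sampled).clamp(min=1e-6)

    # Mask down-weights inconsistent pixels (dynamic objects, occlusions,
    # out-of-view projections, outliers) when accumulating the loss.
    mask = (1.0 - diff).clamp(min=0.0).detach()
    loss = (diff * mask).mean()
    return loss, mask
```

In formulations of this kind the mask is detached from the computation graph so the network cannot trivially minimize the loss by shrinking the mask itself, and because the disagreement is normalized by the depth sum, the same loss penalizes relative scale drift between samples, which is the mechanism the abstract credits with enforcing a globally consistent scale.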
Published in: IEEE Transactions on Multimedia (Volume: 26)