Self-Supervised Learning of Depth and Ego-Motion for 3D Perception in Human Computer Interaction

Abstract
3D perception of depth and ego-motion is vital to intelligent-agent and Human-Computer Interaction (HCI) tasks such as robotics and autonomous driving. Several kinds of sensors can measure 3D depth directly, but the commonly used LiDAR sensor is expensive, and the effective range of RGB-D cameras is limited. In computer vision, researchers have studied 3D perception extensively: traditional geometric algorithms rely on many hand-crafted features for depth estimation, whereas deep learning methods have achieved great success in this field. In this work, we propose a novel self-supervised method, referred to as ViT-Depth, that combines a Vision Transformer (ViT) with a Convolutional Neural Network (CNN) architecture. Image reconstruction losses, computed from the estimated depth and the motion between adjacent frames, serve as the supervision signal and establish a self-supervised learning pipeline. This is an effective solution for tasks that need accurate, low-cost 3D perception, such as autonomous driving, robotic navigation, and 3D reconstruction. Our method leverages both the ability of CNNs to extract deep features and the ability of Transformers to capture global contextual information. In addition, we propose a cross-frame loss that constrains photometric error and scale consistency across multiple frames, which makes training more stable and improves performance. Extensive experimental results on an autonomous driving dataset demonstrate that the proposed approach is competitive with state-of-the-art depth and motion estimation methods.
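The supervision signal described above rests on classical view synthesis: a pixel in the target frame is back-projected with the estimated depth, moved by the estimated ego-motion, and re-projected into the source frame, where the photometric difference is measured. The following minimal sketch illustrates only that geometric warp for a single pixel; the function name, intrinsics, and values are illustrative assumptions, not details from the paper.

```python
import numpy as np

def warp_pixel(p, depth, K, R, t):
    """Project pixel p = (u, v) from the target frame into the source frame
    using estimated depth D(p) and relative camera motion (R, t):
        p_src ~ K (R * D(p) * K^{-1} * p_h + t)
    """
    p_h = np.array([p[0], p[1], 1.0])      # homogeneous pixel coordinates
    X = depth * (np.linalg.inv(K) @ p_h)   # back-project to a 3D point
    X_src = R @ X + t                      # apply the estimated ego-motion
    p_src = K @ X_src                      # re-project into the source view
    return p_src[:2] / p_src[2]            # perspective divide

# Illustrative pinhole intrinsics; with identity motion the pixel maps to itself.
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
same = warp_pixel((100.0, 80.0), 5.0, K, np.eye(3), np.zeros(3))
# -> approximately (100.0, 80.0)
```

In the full pipeline, this warp is applied densely to every pixel, the source image is sampled at the warped locations, and the photometric error between the synthesized and real target images provides the training loss, without any ground-truth depth.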