
Self-Supervised Learning of Depth and Ego-Motion for 3D Perception in Human Computer Interaction

Published: 25 September 2023

Abstract

3D perception of depth and ego-motion is of vital importance to intelligent agents and Human Computer Interaction (HCI) tasks such as robotics and autonomous driving. Several kinds of sensors can obtain 3D depth information directly, but the commonly used LiDAR sensor is expensive and the effective range of RGB-D cameras is limited. In computer vision, researchers have done extensive work on 3D perception: traditional geometric algorithms rely on many hand-crafted features for depth estimation, whereas deep learning methods have achieved great success in this field. In this work, we propose a novel self-supervised method based on a hybrid Vision Transformer (ViT) and Convolutional Neural Network (CNN) architecture, referred to as ViT-Depth. Image reconstruction losses, computed from the estimated depth and the motion between adjacent frames, serve as the supervision signal in a self-supervised learning pipeline. This is an effective solution for tasks that require accurate, low-cost 3D perception, such as autonomous driving, robotic navigation, and 3D reconstruction. Our method leverages the ability of CNNs to extract deep features and of Transformers to capture global contextual information. In addition, we propose a cross-frame loss that constrains photometric error and scale consistency across multiple frames, which makes training more stable and improves performance. Extensive experimental results on an autonomous driving dataset demonstrate that the proposed approach is competitive with state-of-the-art depth and motion estimation methods.
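To make the supervision signal concrete, the sketch below illustrates the kind of photometric reconstruction loss the abstract describes: a source frame is warped into the target view using the predicted depth and relative pose, and the discrepancy between the warped and real target frames trains both networks without ground-truth depth. This is a minimal PyTorch sketch under common conventions from the self-supervised depth literature (an SSIM + L1 mix, and a normalised depth-consistency term in the spirit of Bian et al., NeurIPS 2019); the function names, the weight alpha, and the exact form of the cross-frame term are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    # Simplified per-pixel SSIM using 3x3 average pooling as the local window.
    mu_x = F.avg_pool2d(x, 3, 1, 1)
    mu_y = F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return ((1 - num / den) / 2).clamp(0, 1)

def photometric_loss(target, source_warped, alpha=0.85):
    # target, source_warped: (B, 3, H, W) images. source_warped is the source
    # frame resampled into the target view (e.g., via F.grid_sample over pixel
    # coordinates back-projected with the predicted depth and transformed by
    # the predicted ego-motion). alpha is an assumed SSIM/L1 mixing weight.
    l1 = (target - source_warped).abs()
    return (alpha * ssim(target, source_warped) + (1 - alpha) * l1).mean()

def scale_consistency_loss(depth_t, depth_s_warped):
    # Hedged guess at a cross-frame scale-consistency term: the normalised
    # difference between the target depth and the source depth warped into
    # the target view. Penalising it keeps the depth scale consistent across
    # frames (cf. Bian et al.), which is what the abstract's cross-frame loss
    # constrains; the paper's exact formulation may differ.
    diff = (depth_t - depth_s_warped).abs() / (depth_t + depth_s_warped)
    return diff.mean()
```

In training, a combined objective such as `photometric_loss + w * scale_consistency_loss` would be backpropagated through both the depth and pose networks; the weight `w` and the multi-frame pairing scheme are again assumptions for illustration rather than the paper's recipe.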



• Published in

  ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 2
  February 2024, 548 pages
  ISSN: 1551-6857
  EISSN: 1551-6865
  DOI: 10.1145/3613570
  Editor: Abdulmotaleb El Saddik

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 25 September 2023
      • Online AM: 23 March 2023
      • Accepted: 15 March 2023
      • Revised: 7 December 2022
      • Received: 10 November 2021
Published in TOMM Volume 20, Issue 2

