
Self-Supervised Learning of Depth and Ego-Motion for 3D Perception in Human Computer Interaction

Published: 25 September 2023

Abstract

3D perception of depth and ego-motion is of vital importance to intelligent agents and Human Computer Interaction (HCI) tasks such as robotics and autonomous driving. Several kinds of sensors can obtain 3D depth information directly, but the commonly used LiDAR sensor is expensive and the effective range of RGB-D cameras is limited. In computer vision, researchers have done extensive work on 3D perception: traditional geometric algorithms rely on many hand-crafted features for depth estimation, whereas deep learning methods have achieved great success in this field. In this work, we propose a novel self-supervised method based on a hybrid Vision Transformer (ViT) and Convolutional Neural Network (CNN) architecture, referred to as ViT-Depth. Image reconstruction losses, computed from the estimated depth and the motion between adjacent frames, serve as the supervision signal in a self-supervised learning pipeline. This is an effective solution for tasks that require accurate, low-cost 3D perception, such as autonomous driving, robotic navigation, and 3D reconstruction. Our method leverages the ability of CNNs to extract deep features and of Transformers to capture global contextual information. In addition, we propose a cross-frame loss that constrains photometric error and scale consistency across multiple frames, which makes training more stable and improves performance. Extensive experimental results on an autonomous driving dataset demonstrate that the proposed approach is competitive with state-of-the-art depth and motion estimation methods.
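To make the supervision signal concrete, the sketch below illustrates the kind of photometric reconstruction loss the abstract describes: a source frame is warped into the target view using the predicted depth and relative pose, and the discrepancy between the warped and real target frames trains both networks without ground-truth depth. This is a minimal PyTorch sketch under common conventions from the self-supervised depth literature (an SSIM + L1 mix, and a normalised depth-consistency term in the spirit of Bian et al., NeurIPS 2019); the function names, the weight alpha, and the exact form of the cross-frame term are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    # Simplified per-pixel SSIM using 3x3 average pooling as the local window.
    mu_x = F.avg_pool2d(x, 3, 1, 1)
    mu_y = F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return ((1 - num / den) / 2).clamp(0, 1)

def photometric_loss(target, source_warped, alpha=0.85):
    # target, source_warped: (B, 3, H, W) images. source_warped is the source
    # frame resampled into the target view (e.g., via F.grid_sample over pixel
    # coordinates back-projected with the predicted depth and transformed by
    # the predicted ego-motion). alpha is an assumed SSIM/L1 mixing weight.
    l1 = (target - source_warped).abs()
    return (alpha * ssim(target, source_warped) + (1 - alpha) * l1).mean()

def scale_consistency_loss(depth_t, depth_s_warped):
    # Hedged guess at a cross-frame scale-consistency term: the normalised
    # difference between the target depth and the source depth warped into
    # the target view. Penalising it keeps the depth scale consistent across
    # frames (cf. Bian et al.), which is what the abstract's cross-frame loss
    # constrains; the paper's exact formulation may differ.
    diff = (depth_t - depth_s_warped).abs() / (depth_t + depth_s_warped)
    return diff.mean()
```

In training, a combined objective such as `photometric_loss + w * scale_consistency_loss` would be backpropagated through both the depth and pose networks; the weight `w` and the multi-frame pairing scheme are again assumptions for illustration rather than the paper's recipe.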



• Published in

  ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 2
  February 2024, 548 pages
  ISSN: 1551-6857
  EISSN: 1551-6865
  DOI: 10.1145/3613570
  Editor: Abdulmotaleb El Saddik

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 25 September 2023
      • Online AM: 23 March 2023
      • Accepted: 15 March 2023
      • Revised: 7 December 2022
      • Received: 10 November 2021
Published in TOMM Volume 20, Issue 2

