DOI: 10.1145/3591106.3592295
research-article

A Robust Deep Learning Enhanced Monocular SLAM System for Dynamic Environments

Published: 12 June 2023

ABSTRACT

Simultaneous Localization and Mapping (SLAM) has become a fundamental method for intelligent robot perception over the past decades. Most existing feature-based SLAM systems rely on traditional hand-crafted visual features and a strong static-world assumption, which makes them vulnerable in complex dynamic environments. In this paper, we propose a robust monocular SLAM system that combines geometry-based methods with two convolutional neural networks. Specifically, a lightweight deep local feature detection network is proposed as the system front-end; it efficiently generates keypoints and binary descriptors that are robust to variations in illumination and viewpoint. In addition, we propose a motion segmentation and depth estimation network that simultaneously predicts pixel-wise moving-object segmentation and a depth map, so that our system can discard dynamic features and reconstruct 3D maps without dynamic objects. Comparison against state-of-the-art methods on publicly available datasets shows the effectiveness of our system in highly dynamic environments.
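
To make the idea of discarding dynamic features concrete, the following Python sketch filters keypoints with a pixel-wise motion mask and compares binary descriptors by Hamming distance. It is an illustration under stated assumptions, not the system's implementation: the function names, the mask convention (1 for moving pixels, 0 for static pixels), and the byte-string descriptor layout are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

Keypoint = Tuple[float, float]  # (x, y) in pixel coordinates


def filter_static_keypoints(
    keypoints: Sequence[Keypoint],
    motion_mask: Sequence[Sequence[int]],
) -> List[Keypoint]:
    """Keep keypoints that fall on pixels labelled static (mask value 0).

    Assumption: motion_mask[row][col] is 1 where a segmentation output
    marks a moving object and 0 elsewhere.
    """
    height = len(motion_mask)
    width = len(motion_mask[0]) if height else 0
    static = []
    for x, y in keypoints:
        col, row = int(round(x)), int(round(y))
        # Discard keypoints outside the image or on a moving object.
        if 0 <= row < height and 0 <= col < width and motion_mask[row][col] == 0:
            static.append((x, y))
    return static


def hamming_distance(a: bytes, b: bytes) -> int:
    """Number of differing bits between two equal-length binary descriptors."""
    if len(a) != len(b):
        raise ValueError("descriptors must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))
```

In a pipeline of this kind, only the keypoints returned by `filter_static_keypoints` would be passed on to matching and pose estimation, and `hamming_distance` would score candidate matches between their binary descriptors.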


      • Published in

        ICMR '23: Proceedings of the 2023 ACM International Conference on Multimedia Retrieval
        June 2023
        694 pages
        ISBN: 9798400701788
        DOI: 10.1145/3591106

        Copyright © 2023 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 12 June 2023


        Qualifiers

        • research-article
        • Research
        • Refereed limited

        Acceptance Rates

        Overall Acceptance Rate: 254 of 830 submissions, 31%
