Keypoint Heatmap Guided Self-Supervised Monocular Visual Odometry

  • Short Paper
  • Published in: Journal of Intelligent & Robotic Systems (2022)

Abstract

Visual odometry is an important component of a visual simultaneous localization and mapping (SLAM) system. In recent years, with the development of deep learning techniques, combining visual odometry with deep learning has attracted increasing attention from researchers. Existing deep learning-based monocular visual odometry methods perform a large number of computations on redundant pixels, and they consider only the pose transformation between two adjacent frames, resulting in error accumulation. To solve these problems, this paper proposes an end-to-end self-supervised monocular visual odometry method guided by keypoint heatmaps. During network training, the keypoint heatmap guides the learning process so as to reduce the influence of redundant pixels. In addition, a photometric error based on a pose consistency constraint over the image sequence is computed to reduce the accumulated error in the pose estimation of video sequences. Extensive experimental results on the KITTI visual odometry dataset validate the effectiveness of the proposed method.
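To make the two ideas concrete, the sketch below illustrates, in PyTorch-style Python, how a keypoint heatmap can re-weight a per-pixel photometric error, and how a pose consistency term can tie together poses estimated over a short sequence. This is a minimal sketch under our own assumptions, not the paper's implementation: the L1 photometric term, the heatmap normalization, the function names, and the tensor shapes are all illustrative choices, and the authors' actual losses may differ (see their repository linked below).

```python
# Illustrative sketch only, NOT the paper's code: a heatmap-weighted
# photometric loss and a sequence pose-consistency loss.
import torch

def heatmap_weighted_photometric_loss(target, warped, heatmap, eps=1e-7):
    """L1 photometric error between the target frame and the view
    synthesized from a source frame, re-weighted by a keypoint heatmap
    so that keypoint-rich regions dominate the loss and redundant
    pixels contribute less.

    target, warped: (B, 3, H, W) images; heatmap: (B, 1, H, W), >= 0.
    """
    l1 = (target - warped).abs().mean(dim=1, keepdim=True)       # (B, 1, H, W)
    w = heatmap / (heatmap.sum(dim=(2, 3), keepdim=True) + eps)  # normalize per image
    return (w * l1).sum(dim=(2, 3)).mean()

def pose_consistency_loss(pose_01, pose_12, pose_02):
    """Penalize disagreement between the directly estimated pose from
    frame 0 to frame 2 and the composition of the two adjacent-frame
    poses, discouraging drift from accumulating over the sequence.

    Poses are (B, 4, 4) homogeneous transformation matrices.
    """
    composed = pose_12 @ pose_01  # T_{0->2} obtained via frame 1
    return (composed - pose_02).abs().mean()
```

In a full training loop these two terms would typically be combined with the usual self-supervised depth and pose objectives (for example an SSIM component and a depth smoothness prior), with the relative weights tuned on a validation split.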

Data Availability

The code and data for the proposed method are available at https://github.com/kaixjl/kphm-vo

Funding

This work was supported by the National Key R&D Program of China (2020YFB1313002), the National Natural Science Foundation of China (Grant No. 61973029), and the Scientific and Technological Innovation Foundation of Foshan (BK21BF004).

Author information

Contributions

Haixin Xiu: Conceptualization, Methodology, Software, Formal Analysis, Writing – Original Draft. Yiyou Liang: Investigation, Formal Analysis, Writing – Original Draft. Hui Zeng: Supervision, Writing – Review and Editing.

Corresponding author

Correspondence to Hui Zeng.

Ethics declarations

Ethical Approval

Not applicable.

Consent to Participate

Not applicable.

Consent to Publish

Not applicable.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Xiu, H., Liang, Y. & Zeng, H. Keypoint Heatmap Guided Self-Supervised Monocular Visual Odometry. J Intell Robot Syst 105, 78 (2022). https://doi.org/10.1007/s10846-022-01685-2

Keywords

Navigation