Abstract
Visual odometry is a key component of a visual simultaneous localization and mapping (SLAM) system. In recent years, with the development of deep learning techniques, combining visual odometry with deep learning has attracted increasing attention from researchers. Existing deep learning-based monocular visual odometry methods perform a large number of computations on redundant pixels, and they consider only the pose transformation between two adjacent frames, resulting in error accumulation. To solve these problems, an end-to-end self-supervised monocular visual odometry method based on keypoint heatmap guidance is proposed in this paper. During network training, the keypoint heatmap guides the network's learning to reduce the influence of redundant pixels. The photometric error based on a pose consistency constraint over the image sequence is computed to reduce the accumulated error in pose estimation over a video sequence. Extensive experimental results on the KITTI visual odometry dataset fully validate the effectiveness of the proposed method.
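The two ideas in the abstract can be illustrated with a minimal sketch: a photometric error that is weighted by a keypoint heatmap, so low-texture (redundant) pixels contribute little to the loss, and a consistency term that compares composed adjacent-frame poses against a directly estimated multi-frame pose. This is an assumption-laden illustration, not the authors' implementation: the function names, the L1 residual (the paper's actual loss likely also involves SSIM), and the Frobenius-norm pose residual are all illustrative choices.

```python
import numpy as np

def weighted_photometric_error(img_ref, img_warped, heatmap):
    """Per-pixel L1 photometric error weighted by a keypoint heatmap,
    so that keypoint-rich regions dominate and redundant low-texture
    pixels are down-weighted (illustrative, not the paper's exact loss)."""
    residual = np.abs(img_ref - img_warped)        # per-pixel L1 error
    weights = heatmap / (heatmap.sum() + 1e-8)     # normalize heatmap to a distribution
    return float((weights * residual).sum())       # heatmap-weighted mean error

def pose_consistency_error(T_01, T_12, T_02):
    """Consistency between the composed adjacent-frame poses and the
    directly estimated two-frames-apart pose: ideally T_02 == T_12 @ T_01,
    which constrains drift over the sequence."""
    diff = T_02 - T_12 @ T_01                      # residual of the 4x4 pose matrices
    return float(np.linalg.norm(diff))             # Frobenius norm as a scalar penalty

# Toy example: a uniform heatmap reduces to the plain mean error,
# and perfectly consistent (identity) poses give zero residual.
img_ref = np.zeros((4, 4))
img_warped = np.ones((4, 4))
heatmap = np.ones((4, 4))
photo_err = weighted_photometric_error(img_ref, img_warped, heatmap)  # 1.0
pose_err = pose_consistency_error(np.eye(4), np.eye(4), np.eye(4))    # 0.0
```

In a real self-supervised pipeline, `img_warped` would come from differentiably warping an adjacent frame with the predicted depth and pose, and both terms would be summed over all frame pairs in the training snippet.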
Data Availability
The code and data for the proposed method are available at: https://github.com/kaixjl/kphm-vo
Funding
This work was supported by the National Key R&D Program of China (2020YFB1313002), the National Natural Science Foundation of China (Grant No. 61973029), and the Scientific and Technological Innovation Foundation of Foshan (BK21BF004).
Author information
Contributions
Haixin Xiu: Conceptualization, Methodology, Software, Formal Analysis, Writing - Original Draft. Yiyou Liang: Investigation, Formal Analysis, Writing - Original Draft. Hui Zeng: Supervision, Writing - Review and Editing.
Ethics declarations
Ethical Approval
Not applicable.
Consent to Participate
Not applicable.
Consent to Publish
Not applicable.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Xiu, H., Liang, Y. & Zeng, H. Keypoint Heatmap Guided Self-Supervised Monocular Visual Odometry. J Intell Robot Syst 105, 78 (2022). https://doi.org/10.1007/s10846-022-01685-2