Abstract
Visual odometry and visual simultaneous localization and mapping track the position of a camera and map its surroundings from images, and they form an important part of robotic perception. Tracking and mapping with a monocular camera is cost-effective, requires little calibration effort, and is easy to deploy across a wide range of applications. This paper provides an extensive review of the developments in monocular tracking and mapping over the first two decades of the twenty-first century. The striking results of early filtering-based methods motivated the community to extend these algorithms with other techniques such as bundle adjustment and deep learning. The article first introduces the basic sensor systems and analyzes the evolution of monocular tracking and mapping algorithms through bibliometric data. It then gives an overview of filtering and bundle adjustment methods, followed by recent advances in deep-learning-based methods and the mathematical constraints imposed on the networks. Finally, the popular benchmarks available for developing and evaluating these algorithms are presented, along with a comparative study of the different classes of algorithms. It is anticipated that this article will serve as an up-to-date introduction and further spur the community to address current and future challenges.











Acknowledgements
The authors are grateful to the sponsors for providing the YUTP Grant (015LC0-243) that supported this project.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Gadipudi, N., Elamvazuthi, I., Izhar, L.I. et al. A review on monocular tracking and mapping: from model-based to data-driven methods. Vis Comput 39, 5897–5924 (2023). https://doi.org/10.1007/s00371-022-02702-z
DOI: https://doi.org/10.1007/s00371-022-02702-z