
DeepTAM: Deep Tracking and Mapping with Convolutional Neural Networks

Published in: International Journal of Computer Vision

Abstract

We present a system for dense keyframe-based camera tracking and depth map estimation that is entirely learned. For tracking, we estimate small pose increments between the current camera image and a synthetic viewpoint. This formulation significantly simplifies the learning problem and alleviates the dataset bias for camera motions. Further, we show that generating a large number of pose hypotheses leads to more accurate predictions. For mapping, we accumulate information in a cost volume centered at the current depth estimate. The mapping network then combines the cost volume and the keyframe image to update the depth prediction, thereby effectively making use of depth measurements and image-based priors. Our approach yields state-of-the-art results with few images and is robust with respect to noisy camera poses. We demonstrate that the performance of our 6-DOF tracking competes with RGB-D tracking algorithms, and we compare favorably against strong classic and deep-learning-based dense depth estimation methods.
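To make the mapping step concrete, the sketch below builds a narrow-band plane-sweep cost volume of the kind the abstract describes: depth hypotheses are sampled in a band around the current depth estimate, a neighbor frame is warped into the keyframe for each hypothesis, and per-pixel photoconsistency costs are accumulated. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation; the helper names (`warp_image`, `cost_volume`), the relative sampling band, the nearest-neighbor sampling, and the absolute-difference cost are choices made for the example.

```python
import numpy as np

def warp_image(image, depth, K, K_inv, R, t):
    """Warp a neighbor-view image into the keyframe via a pinhole
    camera model and a per-pixel keyframe depth map (inverse warping)."""
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs, ys, np.ones_like(xs)]).reshape(3, -1).astype(np.float64)
    # Back-project keyframe pixels to 3D, move them into the neighbor
    # view with (R, t), and project back to pixel coordinates.
    pts = (K_inv @ pix) * depth.reshape(1, -1)
    proj = K @ (R @ pts + t.reshape(3, 1))
    # Nearest-neighbor sampling, clamped at the image border for simplicity.
    u = np.clip(np.round(proj[0] / proj[2]).astype(int), 0, w - 1).reshape(h, w)
    v = np.clip(np.round(proj[1] / proj[2]).astype(int), 0, h - 1).reshape(h, w)
    return image[v, u]

def cost_volume(keyframe, neighbor, K, R, t, depth_est, n_hyp=32, band=0.5):
    """Accumulate per-pixel photoconsistency costs for depth hypotheses
    sampled in a relative band around the current estimate `depth_est`.
    `keyframe` and `neighbor` are grayscale (h, w) arrays."""
    K_inv = np.linalg.inv(K)
    scales = np.linspace(1.0 - band, 1.0 + band, n_hyp)
    volume = np.empty((n_hyp,) + keyframe.shape)
    for i, s in enumerate(scales):
        warped = warp_image(neighbor, s * depth_est, K, K_inv, R, t)
        volume[i] = np.abs(warped.astype(np.float64) - keyframe.astype(np.float64))
    return volume  # argmin over axis 0 gives a winner-take-all depth per pixel
```

In DeepTAM, a cost volume of this kind is not reduced by a hand-crafted winner-take-all or regularization step as in classic approaches such as DTAM; instead, the volume and the keyframe image are fed to the mapping network, which produces the learned depth update.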



Notes

  1. https://github.com/magican/OpenDTAM.git SHA: 1f92a54334c233f9c4ce7d8cbaf9a81dee5e69a6


Author information


Corresponding author

Correspondence to Huizhong Zhou.

Additional information

Communicated by Cristian Sminchisescu.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This project was funded in large part by the EU Horizon 2020 project Trimbot2020. We also thank the bwHPC initiative for computing resources and Facebook for their P100 server donation and gift funding.

Electronic supplementary material

Below are the links to the electronic supplementary material.

Supplementary material 1 (mp4 16091 KB)

Supplementary material 2 (pdf 6299 KB)


About this article


Cite this article

Zhou, H., Ummenhofer, B. & Brox, T. DeepTAM: Deep Tracking and Mapping with Convolutional Neural Networks. Int J Comput Vis 128, 756–769 (2020). https://doi.org/10.1007/s11263-019-01221-0

