Abstract
Monocular depth estimation methods based on deep learning have recently shown very promising results, most of which exploit deep convolutional neural networks (CNNs) with scene geometric constraints. However, the depth maps estimated by most existing methods still suffer from problems such as unclear object contours and unsmooth depth gradients. In this paper, we propose a novel encoder-decoder network, named Monocular Depth estimation with Spatio-Temporal features (MD-ST), based on recurrent convolutional neural networks, for monocular video depth estimation with spatio-temporal correlation features. Specifically, we put forward a novel encoder with a convolutional long short-term memory (Conv-LSTM) structure that not only captures the spatial features of the scene but also collects temporal features from video sequences. In the decoder, we estimate depth maps at four scales and use this multi-scale estimation to refine the outputs. Additionally, to enhance and maintain spatio-temporal consistency, we constrain our network with a flow consistency loss that penalizes errors between the estimated and ground-truth maps by learning residual flow vectors. Experiments conducted on the KITTI dataset demonstrate that the proposed MD-ST effectively estimates scene depth maps, especially in dynamic scenes, and outperforms existing monocular depth estimation methods.
The work presented in this paper is supported by the Beijing Natural Science Foundation of China (Grant No. L182033) and the Fund for Beijing University of Posts and Telecommunications (2019PTB-001).
Copyright information
© 2020 Springer Nature Switzerland AG
Cite this paper
Meng, X., Fan, C., Ming, Y., Zhang, R., Zhao, P. (2020). MD-ST: Monocular Depth Estimation Based on Spatio-Temporal Correlation Features. In: Peng, Y., et al. (eds.) Pattern Recognition and Computer Vision. PRCV 2020. Lecture Notes in Computer Science, vol. 12305. Springer, Cham. https://doi.org/10.1007/978-3-030-60633-6_29
DOI: https://doi.org/10.1007/978-3-030-60633-6_29
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60632-9
Online ISBN: 978-3-030-60633-6
eBook Packages: Computer Science, Computer Science (R0)