Abstract
Monocular depth estimation methods based on deep learning have recently shown very promising results, most of which exploit deep convolutional neural networks (CNNs) with scene geometric constraints. However, the depth maps estimated by most existing methods still suffer from problems such as unclear object contours and unsmooth depth gradients. In this paper, we propose a novel encoder-decoder network, named Monocular Depth estimation with Spatio-Temporal features (MD-ST), based on recurrent convolutional neural networks, for monocular video depth estimation with spatio-temporal correlation features. Specifically, we put forward a novel encoder with a convolutional long short-term memory (Conv-LSTM) structure that not only captures the spatial features of the scene but also collects temporal features from video sequences. In the decoder, we estimate depth maps at four scales and use this multi-scale estimation to refine the outputs. Additionally, to enhance and maintain spatio-temporal consistency, we constrain our network with a flow consistency loss that penalizes errors between the estimated and ground-truth maps by learning residual flow vectors. Experiments conducted on the KITTI dataset demonstrate that the proposed MD-ST effectively estimates scene depth maps, especially in dynamic scenes, and outperforms existing monocular depth estimation methods.
The work presented in this paper is supported by the Beijing Natural Science Foundation of China (Grant No. L182033) and the Fund for Beijing University of Posts and Telecommunications (2019PTB-001).
Copyright information
© 2020 Springer Nature Switzerland AG
Cite this paper
Meng, X., Fan, C., Ming, Y., Zhang, R., Zhao, P. (2020). MD-ST: Monocular Depth Estimation Based on Spatio-Temporal Correlation Features. In: Peng, Y., et al. (eds.) Pattern Recognition and Computer Vision. PRCV 2020. Lecture Notes in Computer Science, vol. 12305. Springer, Cham. https://doi.org/10.1007/978-3-030-60633-6_29
DOI: https://doi.org/10.1007/978-3-030-60633-6_29
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60632-9
Online ISBN: 978-3-030-60633-6
eBook Packages: Computer Science, Computer Science (R0)