Abstract
Depth estimation from 2D images is a fundamental task in many applications, such as robotics and 3D reconstruction. Because of their limited ability to model perspective transformations, existing CNN methods suffer from restricted generalization performance and a large number of parameters. To address these problems, we propose the CNNapsule network for monocular depth estimation. First, we extract CNN and Matrix Capsule features. Next, we propose a Fusion Block to combine the CNN features with the Matrix Capsule features. Skip connections are then used to transmit the extracted and fused features. Moreover, we design a loss function that accounts for the long-tailed depth distribution, gradients, and structural similarity. Finally, we compare our method with existing methods on the NYU Depth V2 dataset. Experiments show that our method achieves higher accuracy than traditional methods and similar networks trained without pre-training. Compared with the state of the art, our method uses 65% fewer trainable parameters. Tests on images collected from the Internet and real images captured with a mobile phone further verify the generalization ability of our method.
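To make the loss design concrete, the following is a minimal PyTorch sketch of a combined depth loss with point-wise, gradient, and structural-similarity terms, as named in the abstract. The exact formulation, the weights w_grad and w_ssim, and the long-tailed-distribution reweighting used in the paper are not specified here, so every detail of this sketch is an assumption rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F


def gradient_xy(img):
    """Forward differences along width (x) and height (y)."""
    dx = img[:, :, :, 1:] - img[:, :, :, :-1]
    dy = img[:, :, 1:, :] - img[:, :, :-1, :]
    return dx, dy


def ssim_map(a, b, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified SSIM using a 3x3 average-pooling window."""
    mu_a = F.avg_pool2d(a, 3, 1)
    mu_b = F.avg_pool2d(b, 3, 1)
    var_a = F.avg_pool2d(a * a, 3, 1) - mu_a ** 2
    var_b = F.avg_pool2d(b * b, 3, 1) - mu_b ** 2
    cov = F.avg_pool2d(a * b, 3, 1) - mu_a * mu_b
    num = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    den = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return num / den


def depth_loss(pred, gt, w_grad=1.0, w_ssim=1.0):
    """Combined loss on (N, 1, H, W) depth maps; weights are illustrative only."""
    # Point-wise depth error (a long-tail reweighting could be applied here).
    l_depth = torch.mean(torch.abs(pred - gt))
    # Gradient (edge) error.
    pdx, pdy = gradient_xy(pred)
    gdx, gdy = gradient_xy(gt)
    l_grad = torch.mean(torch.abs(pdx - gdx)) + torch.mean(torch.abs(pdy - gdy))
    # Structural-similarity error.
    l_ssim = torch.mean((1.0 - ssim_map(pred, gt)) / 2.0)
    return l_depth + w_grad * l_grad + w_ssim * l_ssim
```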
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under grant No. 61672084 and the Fundamental Research Funds for the Central Universities under grant No. XK1802-4.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Wang, Y., Zhu, H., Liu, M. (2021). CNNapsule: A Lightweight Network with Fusion Features for Monocular Depth Estimation. In: Farkaš, I., Masulli, P., Otte, S., Wermter, S. (eds) Artificial Neural Networks and Machine Learning – ICANN 2021. ICANN 2021. Lecture Notes in Computer Science(), vol 12891. Springer, Cham. https://doi.org/10.1007/978-3-030-86362-3_41
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86361-6
Online ISBN: 978-3-030-86362-3
eBook Packages: Computer Science, Computer Science (R0)