Abstract
Monocular depth estimation is a classical computer vision task that plays a crucial role in 3D reconstruction and scene understanding. Recent works based on deep convolutional neural networks (DCNNs) have achieved great success. However, these works do not make full use of structural information, resulting in discontinuous depth and ambiguous boundaries. This paper proposes the Dual Attention Feature Fusion Network (DAFFNet), which exploits the structural relationship between the RGB image and the predicted depth to address these problems. It contains two critical modules: the Dual Attention Fusion Module (DAFM) and the Iterative Dual Attention Fusion Module (IDAFM). Specifically, DAFM comprises two blocks, a spatial attention block and a channel attention block, which fuse global context and local information, respectively. To aggregate information across encoder and decoder levels, we design IDAFM, an iterative version of DAFM. Extensive experiments on the KITTI dataset show that our model achieves performance competitive with recent state-of-the-art models.
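The abstract describes fusing global context via channel attention and local information via spatial attention. The paper's exact module definitions are not reproduced here; the following is a minimal NumPy sketch of the general dual-attention fusion pattern (SE-style channel gating plus a spatial gate), with the `dual_attention_fuse` signature and the additive two-branch combination being illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x):
    # x: (C, H, W). Squeeze the spatial dimensions to a per-channel
    # statistic, then gate each channel (captures global context).
    w = sigmoid(x.mean(axis=(1, 2)))        # (C,)
    return x * w[:, None, None]

def spatial_attention(x):
    # Pool across channels to a single attention map, then gate each
    # spatial location (preserves local structure).
    m = sigmoid(x.mean(axis=0))             # (H, W)
    return x * m[None, :, :]

def dual_attention_fuse(enc_feat, dec_feat):
    # Hypothetical fusion of same-shaped encoder and decoder features:
    # merge them, then combine the channel- and spatial-attention
    # branches additively. Output shape matches the inputs.
    x = enc_feat + dec_feat
    return channel_attention(x) + spatial_attention(x)
```

An "iterative" variant in the spirit of IDAFM could simply reapply such a fusion block across successive encoder/decoder levels; in a real network the pooling and gating would be learned layers rather than fixed means.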
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Xu, Y., Li, M., Peng, C., Li, Y., Du, S. (2021). Dual Attention Feature Fusion Network for Monocular Depth Estimation. In: Fang, L., Chen, Y., Zhai, G., Wang, J., Wang, R., Dong, W. (eds) Artificial Intelligence. CICAI 2021. Lecture Notes in Computer Science(), vol 13069. Springer, Cham. https://doi.org/10.1007/978-3-030-93046-2_39
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-93045-5
Online ISBN: 978-3-030-93046-2
eBook Packages: Computer Science (R0)