Dual Attention Feature Fusion Network for Monocular Depth Estimation

  • Conference paper
Artificial Intelligence (CICAI 2021)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13069)

Abstract

Monocular depth estimation is a classic computer vision task that plays a crucial role in 3D reconstruction and scene understanding. Recent works based on deep convolutional neural networks (DCNNs) have achieved great success; however, they do not make full use of structural information, which leads to depth discontinuities and ambiguous boundaries. This paper proposes the Dual Attention Feature Fusion Network (DAFFNet), which exploits the structural relationship between the RGB image and the predicted depth to address these problems. It contains two critical modules: the Dual Attention Fusion Module (DAFM) and the Iterative Dual Attention Fusion Module (IDAFM). Specifically, DAFM consists of two blocks, a spatial attention block and a channel attention block, which fuse global context and local information, respectively. To aggregate information across encoder and decoder levels, we design IDAFM, an iterative version of DAFM. Extensive experiments on the KITTI dataset show that our model achieves performance competitive with recent state-of-the-art models.
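The abstract does not spell out the internals of DAFM, so the following is a minimal PyTorch sketch of the general dual-attention fusion idea only: one branch reweights channels from globally pooled statistics (SE-style) and one branch reweights spatial locations (CBAM-style), and the two attended features are combined. All names here (ChannelAttention, SpatialAttention, DualAttentionFusion, reduction, kernel_size) are illustrative assumptions, not the authors' published implementation.

```python
# A minimal sketch of dual-attention feature fusion, assuming PyTorch and
# conventional SE-style / CBAM-style attention blocks. Hypothetical names;
# not the authors' DAFM implementation.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention from pooled statistics."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # global average pool: B x C x 1 x 1
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.fc(self.pool(x))  # reweight each channel


class SpatialAttention(nn.Module):
    """CBAM-style spatial attention producing a per-pixel weight map."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = x.mean(dim=1, keepdim=True)   # channel-wise mean map
        mx, _ = x.max(dim=1, keepdim=True)  # channel-wise max map
        attn = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * attn                     # reweight spatial locations


class DualAttentionFusion(nn.Module):
    """Fuse same-resolution encoder and decoder features with both attentions."""

    def __init__(self, channels: int):
        super().__init__()
        self.reduce = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.channel_attn = ChannelAttention(channels)
        self.spatial_attn = SpatialAttention()

    def forward(self, enc: torch.Tensor, dec: torch.Tensor) -> torch.Tensor:
        fused = self.reduce(torch.cat([enc, dec], dim=1))
        # Sum the channel-attended and spatially attended branches.
        return self.channel_attn(fused) + self.spatial_attn(fused)


if __name__ == "__main__":
    daf = DualAttentionFusion(channels=64)
    enc = torch.randn(1, 64, 32, 32)
    dec = torch.randn(1, 64, 32, 32)
    print(daf(enc, dec).shape)  # torch.Size([1, 64, 32, 32])
```

An iterative variant in the spirit of IDAFM would apply such a fusion repeatedly across encoder-decoder levels; the exact iteration scheme is specific to the paper and is not reconstructed here.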

Author information

Corresponding authors

Correspondence to Chenglei Peng or Sidan Du.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Xu, Y., Li, M., Peng, C., Li, Y., Du, S. (2021). Dual Attention Feature Fusion Network for Monocular Depth Estimation. In: Fang, L., Chen, Y., Zhai, G., Wang, J., Wang, R., Dong, W. (eds) Artificial Intelligence. CICAI 2021. Lecture Notes in Computer Science (LNAI), vol. 13069. Springer, Cham. https://doi.org/10.1007/978-3-030-93046-2_39

  • DOI: https://doi.org/10.1007/978-3-030-93046-2_39

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-93045-5

  • Online ISBN: 978-3-030-93046-2

  • eBook Packages: Computer Science, Computer Science (R0)
