Self-supervised Monocular Trained Depth Estimation Using Triplet Attention and Funnel Activation

Abstract

Dense depth estimation from a single image is a fundamental problem in computer vision with exciting applications in many robotic tasks. Fully supervised methods require large, accurate ground-truth datasets, which are often complex and expensive to acquire. Self-supervised learning has emerged as a promising alternative for monocular depth estimation because it does not require ground-truth depth data. In this paper, we propose a novel self-supervised joint learning framework for depth estimation using consecutive frames from monocular and stereo videos. Our architecture leverages two new ideas for improvement: (1) triplet attention and (2) funnel activation (FReLU). Added to the depth and pose networks, the triplet attention module captures the importance of features across the dimensions of a tensor without any information bottleneck, making the learning framework more reliable. FReLU is used as the non-linear activation layer to grasp local context in images adaptively, instead of using more complex convolutions in the convolutional layers. Through the pixel-wise modelling capacity provided by its spatial condition, FReLU extracts the spatial structure of objects and enriches the details of complex images. Experimental results show that the proposed method is comparable with state-of-the-art self-supervised monocular depth estimation methods.
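
To make the two modules concrete, the following is a minimal PyTorch sketch of funnel activation (FReLU) and triplet attention, following their original formulations (FReLU: Ma, Zhang and Sun, 2020; triplet attention: Misra et al., 2021). This is an illustrative sketch, not the authors' released implementation: the 3x3 depthwise "funnel condition" and the 7x7 gate convolution follow the cited papers, while the class and parameter names are our own labels.

```python
import torch
import torch.nn as nn


class FReLU(nn.Module):
    """Funnel activation: y = max(x, T(x)), where the funnel condition T
    is a 3x3 depthwise convolution followed by batch normalization."""

    def __init__(self, channels: int):
        super().__init__()
        self.funnel = nn.Sequential(
            # Depthwise conv: one 3x3 spatial filter per channel.
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pixel-wise maximum of the input and its spatial condition.
        return torch.max(x, self.funnel(x))


class AttentionGate(nn.Module):
    """One branch of triplet attention: Z-pool (max and mean over the
    first non-batch dimension), a 7x7 convolution, and a sigmoid gate."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False),
            nn.BatchNorm2d(1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Z-pool: stack max and mean along dim 1 into a 2-channel map.
        pooled = torch.cat(
            [x.max(dim=1, keepdim=True).values, x.mean(dim=1, keepdim=True)], dim=1
        )
        return x * torch.sigmoid(self.conv(pooled))


class TripletAttention(nn.Module):
    """Captures cross-dimension interactions in an (N, C, H, W) tensor by
    rotating it so each branch gates a different pair of dimensions."""

    def __init__(self):
        super().__init__()
        self.gate_ch = AttentionGate()  # (C, H) interaction
        self.gate_cw = AttentionGate()  # (C, W) interaction
        self.gate_hw = AttentionGate()  # plain spatial attention over (H, W)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Branch 1: swap C and H, gate, swap back.
        y1 = self.gate_ch(x.permute(0, 2, 1, 3)).permute(0, 2, 1, 3)
        # Branch 2: swap C and W, gate, swap back.
        y2 = self.gate_cw(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)
        # Branch 3: standard spatial gate.
        y3 = self.gate_hw(x)
        # Average the branches; channels are never reduced, so there is
        # no information bottleneck.
        return (y1 + y2 + y3) / 3.0
```

In the spirit of the abstract, one would replace the ReLU activations in the depth and pose networks with FReLU and insert TripletAttention after selected convolutional blocks; the exact placement here is an assumption, not the paper's verified configuration.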

Author information

Corresponding author

Correspondence to Xuezhi Xiang.

Additional information

This work was supported in part by the CAAI-Huawei MindSpore Open Fund, in part by the National Natural Science Foundation of China under Grant 61401113, in part by the Natural Science Foundation of Heilongjiang Province of China under Grant LH2021F011, in part by the Fundamental Research Funds for the Central Universities of China under Grant 3072021CF0811, and in part by the Key Laboratory of Advanced Marine Communication and Information Technology Open Fund under Grant AMCIT2103-03.

Cite this article

Xiang, X., Kong, X., Qiu, Y. et al. Self-supervised Monocular Trained Depth Estimation Using Triplet Attention and Funnel Activation. Neural Process Lett 53, 4489–4506 (2021). https://doi.org/10.1007/s11063-021-10608-5

