
Depth estimation using an improved stereo network

  • Research Article
  • Published:
Frontiers of Information Technology & Electronic Engineering

An Erratum to this article was published on 04 April 2023

This article has been updated

Abstract

Self-supervised depth estimation approaches achieve results comparable to those of fully supervised approaches by employing view synthesis between the target and reference images in the training data. ResNet, which serves as the backbone network, has some structural deficiencies when applied to downstream tasks, because it was originally designed for classification. Low-texture areas also degrade performance. To address these problems, we propose a set of improvements that lead to superior predictions. First, we improve the network structure to boost information flow through the network and strengthen its ability to learn spatial structures. Second, we use a binary mask to remove the pixels in low-texture areas between the target and reference images, so that the image is reconstructed more accurately. Finally, we randomly swap the target and reference images at the input to augment the dataset, and pre-train the model on ImageNet so that it obtains a favorable general feature representation. We demonstrate state-of-the-art performance on the Eigen split of the KITTI driving dataset using stereo pairs.
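The binary-mask idea in the abstract can be sketched as follows; this is a minimal illustration, not the paper's actual formulation. The local-variance criterion, patch size, and threshold are assumptions chosen for the sketch.

```python
import numpy as np

def low_texture_mask(img, patch=3, thresh=1e-3):
    # 1 where the local patch shows enough intensity variation, 0 in
    # low-texture areas. Using patch variance as the texture criterion
    # is an illustrative assumption, not the paper's definition.
    h, w = img.shape
    pad = patch // 2
    padded = np.pad(img, pad, mode="edge")
    var = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            var[i, j] = padded[i:i + patch, j:j + patch].var()
    return (var >= thresh).astype(np.float32)

def masked_photometric_loss(target, reconstructed, mask):
    # Mean absolute photometric error over textured pixels only, so
    # low-texture pixels contribute no gradient to the reconstruction.
    kept = mask.sum()
    return float((mask * np.abs(target - reconstructed)).sum() / max(kept, 1.0))
```

In training, `reconstructed` would be the reference image warped into the target view using the predicted disparity; masking then keeps ambiguous low-texture pixels from dominating the reconstruction loss.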

Author information

Authors and Affiliations

Authors

Contributions

Wanpeng XU conceived and designed the method. Lingda WU and Yue QI guided the research. Wanpeng XU and Ling ZOU performed the experiments. Zhaoyang QIAN helped with the experiments. Wanpeng XU drafted the paper. Lingda WU and Yue QI revised the paper. Wanpeng XU finalized the paper.

Corresponding author

Correspondence to Ling Zou  (邹玲).

Additional information

Compliance with ethics guidelines

Wanpeng XU, Ling ZOU, Lingda WU, Yue QI, and Zhaoyang QIAN declare that they have no conflict of interest.

Project supported by the Key-Area Research and Development Program of Guangdong Province (No. 2019B010150001) and the National Natural Science Foundation of China (No. 61902201)

About this article

Cite this article

Xu, W., Zou, L., Wu, L. et al. Depth estimation using an improved stereo network. Front Inform Technol Electron Eng 23, 777–789 (2022). https://doi.org/10.1631/FITEE.2000676
