
Patch attention network with generative adversarial model for semi-supervised binocular disparity prediction

  • Original article
  • The Visual Computer

Abstract

In this paper, we address two challenging aspects of binocular disparity estimation: (1) unsatisfactory results in occluded regions when a warping function is used for unsupervised learning; (2) inefficiency in running time and parameter count when many 3D convolutions are adopted in the feature matching module. To overcome these drawbacks, we propose a patch attention network for semi-supervised stereo matching. First, we employ a channel-attention mechanism that aggregates the cost volume by selecting among its surfaces, reducing the large number of 3D convolutions; we call this the patch attention network (PA-Net). Second, we use PA-Net as a generator and combine it with a traditional unsupervised learning loss and an adversarial learning model, constructing a semi-supervised learning framework that improves performance in occluded areas. We have trained PA-Net in supervised, semi-supervised, and unsupervised manners. Extensive experiments show that (1) our semi-supervised framework overcomes the drawbacks of unsupervised learning and significantly improves performance in ill-posed regions using only a few, possibly inaccurate, ground truths; (2) in supervised learning, PA-Net outperforms other state-of-the-art approaches while using fewer parameters.
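The paper's implementation details are given in its Appendix A; purely as a hedged illustration of the kind of channel-attention step the abstract describes, the following PyTorch sketch reweights the feature channels of a 4D stereo cost volume in squeeze-and-excitation style [14]. The module name, tensor shapes, and reduction ratio are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn as nn


class CostVolumeChannelAttention(nn.Module):
    """Squeeze-and-excitation-style gating over the feature channels of a
    stereo cost volume of shape (B, C, D, H, W). Hypothetical sketch; the
    actual PA-Net layers are specified in Table 7 of the paper."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(1)  # squeeze: one context value per channel
        self.gate = nn.Sequential(           # excitation: per-channel weights in (0, 1)
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, cost: torch.Tensor) -> torch.Tensor:
        b, c = cost.shape[:2]
        w = self.pool(cost).view(b, c)        # (B, C) channel descriptors
        w = self.gate(w).view(b, c, 1, 1, 1)  # broadcastable gates
        return cost * w                       # reweighted cost volume


# Example: a cost volume with 32 feature channels and 48 disparity levels
att = CostVolumeChannelAttention(channels=32)
volume = torch.randn(1, 32, 48, 64, 128)
print(att(volume).shape)  # torch.Size([1, 32, 48, 64, 128])
```

Gating an existing cost volume in this way is cheap compared with stacking further 3D convolutions, which is the efficiency argument the abstract makes.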


References

  1. Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al.: Tensorflow: a system for large-scale machine learning. In: The Symposium on Operating Systems Design and Implementation, pp. 265–283 (2016)

  2. Cao, Y., Xu, J., Lin, S., Wei, F., Hu, H.: Gcnet: non-local networks meet squeeze-excitation networks and beyond. arXiv preprint (2019)

  3. Casser, V., Pirk, S., Mahjourian, R., Angelova, A.: Depth prediction without the sensors: leveraging structure for unsupervised learning from monocular videos. In: The AAAI Conference on Artificial Intelligence, pp. 8001–8008 (2019)

  4. Chang, J.R., Chen, Y.S.: Pyramid stereo matching network. In: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5410–5418 (2018)

  5. Chen, S., Zhang, J., Jin, M.: A simplified ICA-based local similarity stereo matching. Vis. Comput. (2020). https://doi.org/10.1007/s00371-020-01811-x

  6. Cheng, X., Zhong, Y., Dai, Y., Ji, P., Li, H.: Noise-aware unsupervised deep lidar-stereo fusion. In: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6339–6348 (2019)

  7. Dai, Y., Zhu, Z., Rao, Z., Li, B.: MVS2: deep unsupervised multi-view stereo with multi-view symmetry. In: IEEE International Conference on 3D Vision (3DV), pp. 1–8 (2019)

  8. Duggal, S., Wang, S., Ma, W.C., Hu, R., Urtasun, R.: Deeppruner: learning efficient stereo matching via differentiable patchmatch. In: IEEE International Conference on Computer Vision (ICCV), pp. 4384–4393 (2019)

  9. Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? The Kitti vision benchmark suite. In: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3354–3361 (2012)

  10. Guney, F., Geiger, A.: Displets: resolving stereo ambiguities using object knowledge. In: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4165–4175 (2015)

  11. Guo, X., Yang, K., Yang, W., Wang, X., Li, H.: Group-wise correlation stereo network. In: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3273–3282 (2019)

  12. Hirschmuller, H.: Accurate and efficient stereo processing by semi-global matching and mutual information. In: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 807–814 (2005)

  13. Hu, J., Ozay, M., Zhang, Y., Okatani, T.: Revisiting single image depth estimation: toward higher resolution maps with accurate object boundaries. In: IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1043–1051 (2019)

  14. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7132–7141 (2018)

  15. Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: criss-cross attention for semantic segmentation. In: IEEE International Conference on Computer Vision (ICCV), pp. 603–612 (2019)

  16. Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1125–1134 (2017)

  17. Ji, R., Li, K., Wang, Y., Sun, X., Guo, F., Guo, X., Wu, Y., Huang, F., Luo, J.: Semi-supervised adversarial monocular depth estimation. IEEE Trans. Pattern Anal. Mach. Intell. 42, 2410–2422 (2019)


  18. Kendall, A., Martirosyan, H., Dasgupta, S., Henry, P.: End-to-end learning of geometry and context for deep stereo regression. In: IEEE International Conference on Computer Vision (ICCV), pp. 66–75 (2017)

  19. Kuznietsov, Y., Stuckler, J., Leibe, B.: Semi-supervised deep learning for monocular depth map prediction. In: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6647–6655 (2017)

  20. Li, B., Dai, Y., He, M.: Monocular depth estimation with hierarchical fusion of dilated cnns and soft-weighted-sum inference. Pattern Recognit. 83, 328–339 (2018)


  21. Li, B., Shen, C., Dai, Y., Van Den Hengel, A., He, M.: Depth and surface normal estimation from monocular images using regression on deep features and hierarchical crfs. In: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1119–1127 (2015)

  22. Li, X., Huang, H., Zhao, H., Wang, Y., Hu, M.: Learning a convolutional neural network for propagation-based stereo image segmentation. Vis. Comput. 36(1), 39–52 (2020)


  23. Li, Y., Chen, X., Zhu, Z., Xie, L., Huang, G., Du, D., Wang, X.: Attention-guided unified network for panoptic segmentation. In: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7026–7035 (2019)

  24. Li, Y., Zhang, J., Zhong, Y., Wang, M.: An efficient stereo matching based on fragment matching. Vis. Comput. 35(2), 257–269 (2019)


  25. Mayer, N., Ilg, E., Häusser, P., Fischer, P., Cremers, D., Dosovitskiy, A., Brox, T.: A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4040–4048 (2016)

  26. Menze, M., Geiger, A.: Object scene flow for autonomous vehicles. In: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3061–3070 (2015)

  27. Ramirez, P.Z., Poggi, M., Tosi, F., Mattoccia, S., Di Stefano, L.: Geometry meets semantics for semi-supervised monocular depth estimation. In: Asian Conference on Computer Vision (ACCV), pp. 298–313 (2018)

  28. Rao, Z., He, M., Dai, Y., Zhu, Z., Li, B., He, R.: Msdc-net: multi-scale dense and contextual networks for stereo matching. In: Asia-Pacific Signal and Information Processing Association (APSIPA), pp. 578–583 (2019)

  29. Rao, Z., He, M., Dai, Y., Zhu, Z., Li, B., He, R.: Nlca-net: a non-local context attention network for stereo matching. APSIPA Trans. Signal Inf. Process. 9, e18 (2020)


  30. Rao, Z., He, M., Zhu, Z., Dai, Y., He, R.: Sdbf-net: semantic and disparity bidirectional fusion network for 3d semantic detection on incidental satellite images. In: Asia-Pacific Signal and Information Processing Association (APSIPA), pp. 438–444 (2019)

  31. Rasmuson, S., Sintorn, E., Assarsson, U.: A low-cost, practical acquisition and rendering pipeline for real-time free-viewpoint video communication. Vis. Comput. (2020). https://doi.org/10.1007/s00371-020-01823-7

  32. Schops, T., Schonberger, J.L., Galliani, S., Sattler, T., Schindler, K., Pollefeys, M., Geiger, A.: A multi-view stereo benchmark with high-resolution images and multi-camera videos. In: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3260–3269 (2017)

  33. Seki, A., Pollefeys, M.: Patch based confidence prediction for dense disparity map. In: British Machine Vision Conference, pp. 23.1–23.13 (2016)

  34. Seki, A., Pollefeys, M.: SGM-nets: semi-global matching with neural networks. In: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6640–6649 (2017)

  35. Shaked, A., Wolf, L.: Improved stereo matching with constant highway networks and reflective confidence learning. In: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6901–6910 (2017)

  36. Smolyanskiy, N., Kamenev, A., Birchfield, S.: On the importance of stereo for accurate depth estimation: an efficient semi-supervised deep neural network approach. In: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1007–1015 (2018)

  37. Souly, N., Spampinato, C., Shah, M.: Semi supervised semantic segmentation using generative adversarial network. In: IEEE International Conference on Computer Vision (ICCV), pp. 5688–5696 (2017)

  38. Tian, L., Liu, J., Ling, H., Guo, W.: Disparity estimation in stereo video sequence with adaptive spatiotemporally consistent constraints. Vis. Comput. 35(10), 1427–1446 (2019)


  39. Tonioni, A., Tosi, F., Poggi, M., Mattoccia, S., Stefano, L.D.: Real-time self-adaptive deep stereo. In: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 195–204 (2019)

  40. Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. In: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7794–7803 (2018)

  41. Wang, Y., Chao, W.L., Garg, D., Hariharan, B., Campbell, M., Weinberger, K.Q.: Pseudo-lidar from visual depth estimation: bridging the gap in 3d object detection for autonomous driving. In: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8445–8453 (2019)

  42. Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P., et al.: Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13(4), 600–612 (2004)


  43. Wu, Z., Wu, X., Zhang, X., Wang, S., Ju, L.: Semantic stereo matching with pyramid cost volumes. In: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7484–7493 (2019)

  44. Xie, L., Xu, Y., Zhang, X., Bao, W., Tong, C., Shi, B.: A self-calibrated photo-geometric depth camera. Vis. Comput. 35(1), 99–108 (2019)


  45. Yamaguchi, K., McAllester, D., Urtasun, R.: Efficient joint segmentation, occlusion labeling, stereo and flow estimation. In: European Conference on Computer Vision (ECCV), pp. 756–771 (2014)

  46. Yang, G., Zhao, H., Shi, J., Deng, Z., Jia, J.: Segstereo: exploiting semantic information for disparity estimation. In: European Conference on Computer Vision (ECCV), pp. 636–651 (2018)

  47. Yao, Y., Luo, Z., Li, S., Shen, T., Fang, T., Quan, L.: Recurrent mvsnet for high-resolution multi-view stereo depth inference. In: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5525–5534 (2019)

  48. Yin, Z., Darrell, T., Yu, F.: Hierarchical discrete distribution decomposition for match density estimation. In: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6044–6053 (2019)

  49. Žbontar, J., Le Cun, Y.: Computing the stereo matching cost with a convolutional neural network. In: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1592–1599 (2015)

  50. Zhang, F., Prisacariu, V., Yang, R., Torr, P.H.: Ga-net: guided aggregation net for end-to-end stereo matching. In: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 185–194 (2019)

  51. Zhong, Y., Dai, Y., Li, H.: Self-supervised learning for stereo matching with self-improving ability. arXiv preprint (2017)

  52. Zhong, Y., Ji, P., Wang, J., Dai, Y., Li, H.: Unsupervised deep epipolar flow for stationary or dynamic scenes. In: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12095–12104 (2019)

  53. Zhu, Z., He, M., Dai, Y., Rao, Z., Li, B.: Multi-scale cross-form pyramid network for stereo matching. In: IEEE Conference on Industrial Electronics and Applications (ICIEA), pp. 1789–1794 (2019)


Acknowledgements

This work was supported in part by the Natural Science Foundation of China (Grants 61671387, 61420106007, and 61871325).

Author information

Correspondence to Zhibo Rao or Mingyi He.


Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A

Detailed network structure. The core architecture of our semi-supervised learning framework comprises a disparity generator network and a disparity pair discriminator network. The detailed structure of our method is presented in Tables 7 and 8. Each 2D or 3D convolutional layer consists of three steps: convolution, batch normalization (BN), and ReLU nonlinearity (unless otherwise specified); a minimal sketch of this block convention follows the tables.

Table 7 Summary of our disparity generator network, the patch attention network (PA-Net)
Table 8 Summary of our disparity pair discriminator network
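
As a concrete reading of the layer convention above, here is a hedged PyTorch sketch of one such 3D convolutional block; the kernel size and channel counts are illustrative placeholders, not the values from Tables 7 and 8.

```python
import torch.nn as nn


def conv3d_block(in_ch: int, out_ch: int, stride: int = 1) -> nn.Sequential:
    """One 'convolutional layer' under the Appendix A convention:
    convolution -> batch normalization (BN) -> ReLU.
    Kernel size and channel counts are illustrative placeholders."""
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=3, stride=stride,
                  padding=1, bias=False),  # bias is redundant before BN
        nn.BatchNorm3d(out_ch),
        nn.ReLU(inplace=True),
    )
```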

Appendix B

For the sake of completeness, we provide qualitative and quantitative results for various scenes from the ETH3D dataset [32]. We fine-tune our models on ETH3D. Because the benchmark does not release error maps for the test data, we split the original training data into training and validation subsets. First, we show our results on the validation data, as shown in Fig. 13. Then, we present our results on the test data (for which no ground truth is available), as shown in Fig. 14.

Fig. 13 ETH3D validation set qualitative results. Correct estimates (\(<1\) px error) are shown in gray and wrong estimates in red tones

Fig. 14 ETH3D test set qualitative results
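
Figures 13 and 14 color-code per-pixel correctness at a 1-px disparity-error threshold. As a minimal sketch of how such an error map can be computed (the function name and the ground-truth validity convention are assumptions, not the paper's evaluation code):

```python
import numpy as np


def bad_pixel_map(pred: np.ndarray, gt: np.ndarray, thresh: float = 1.0):
    """Flag pixels whose absolute disparity error exceeds `thresh` pixels.
    Pixels without ground truth (gt <= 0) are excluded, a common benchmark
    convention; names here are illustrative assumptions."""
    valid = gt > 0
    bad = (np.abs(pred - gt) > thresh) & valid
    rate = bad.sum() / max(valid.sum(), 1)  # bad-pixel rate over valid pixels
    return bad, rate
```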


About this article


Cite this article

Rao, Z., He, M., Dai, Y. et al. Patch attention network with generative adversarial model for semi-supervised binocular disparity prediction. Vis Comput 38, 77–93 (2022). https://doi.org/10.1007/s00371-020-02001-5
