
CoAtRSNet: Fully Exploiting Convolution and Attention for Stereo Matching by Region Separation

Published in: International Journal of Computer Vision

Abstract

Stereo matching is a fundamental technique for many vision and robotics applications. State-of-the-art methods either employ convolutional neural networks with spatially-shared kernels or use content-dependent interactions (e.g., local or global attention) to augment convolution operations. Despite the great improvements made, existing methods either suffer from the high computational cost of global attention operations or from suboptimal performance at edge regions due to spatially-shared convolutions. In this paper, we propose the CoAtRS stereo matching method, which fully exerts the complementary advantages of convolution and attention via region separation. Our method adaptively adopts the most suitable feature extraction and aggregation patterns for smooth and edge regions at a lower computational cost. In addition, we propose D-global attention, which performs global filtering along the disparity dimension to better fuse the cost volumes of the different regions and alleviate the locality defects of convolutions. The CoAtRS method can also be conveniently embedded in various existing 3D CNN stereo networks, yielding significant improvements in both accuracy and efficiency. Furthermore, we design an accurate network (named CoAtRSNet) which achieves state-of-the-art results on five public datasets. At the time of writing, CoAtRSNet ranks 1st–3rd on all the metrics published on the ETH3D website, ranks 2nd on Scene Flow, and on the Middlebury benchmark ranks 1st for the root-mean-square metric, 2nd for the average error metric, and 3rd for the bad 0.5 metric.
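The abstract describes D-global attention as global filtering along the disparity dimension of a cost volume. The paper's exact formulation is not reproduced here, but the general idea can be sketched as follows: at every pixel, treat the D disparity hypotheses as a sequence of tokens and let each hypothesis aggregate evidence from all the others via self-attention. The tensor layout, the projection matrices `Wq`/`Wk`/`Wv`, and the single-head form below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def d_global_attention(cost, Wq, Wk, Wv):
    """Toy global self-attention along the disparity axis.

    cost: (D, C, H, W) cost volume -- at every pixel, the D disparity
          hypotheses are treated as tokens with C-dim features.
    Wq, Wk, Wv: (C, C) projection matrices (hypothetical parameters).
    Returns a filtered cost volume of the same shape.
    """
    D, C, H, W = cost.shape
    tokens = cost.transpose(2, 3, 0, 1)                  # (H, W, D, C)
    q = tokens @ Wq                                      # queries
    k = tokens @ Wk                                      # keys
    v = tokens @ Wv                                      # values
    # Attention over all D disparity hypotheses at each pixel: (H, W, D, D)
    attn = softmax(q @ k.transpose(0, 1, 3, 2) / np.sqrt(C), axis=-1)
    out = attn @ v                                       # (H, W, D, C)
    return out.transpose(2, 3, 0, 1)                     # back to (D, C, H, W)
```

Because the attention matrix spans only the D disparity entries per pixel (not all H*W spatial positions), this kind of filtering is global in disparity yet far cheaper than full spatial global attention, which is consistent with the efficiency argument in the abstract.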




Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grants 62061160490, 62122029, and U20B2064.

Author information

Corresponding author

Correspondence to Xin Yang.

Additional information

Communicated by Yasuyuki Matsushita.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Cheng, J., Xu, G., Guo, P. et al. CoAtRSNet: Fully Exploiting Convolution and Attention for Stereo Matching by Region Separation. Int J Comput Vis 132, 56–73 (2024). https://doi.org/10.1007/s11263-023-01872-0
