
CoAtRSNet: Fully Exploiting Convolution and Attention for Stereo Matching by Region Separation

Published in: International Journal of Computer Vision

Abstract

Stereo matching is a fundamental technique for many vision and robotics applications. State-of-the-art methods either employ convolutional neural networks with spatially-shared kernels or use content-dependent interactions (e.g., local or global attention) to augment convolution operations. Despite the great improvements made, existing methods either suffer from the high computational cost of global attention operations or from suboptimal performance at edge regions due to spatially-shared convolutions. In this paper, we propose the CoAtRS stereo matching method, which fully exerts the complementary advantages of convolution and attention via region separation. Our method adaptively adopts the most suitable feature extraction and aggregation patterns for smooth and edge regions at a lower computational cost. In addition, we propose D-global attention, which performs global filtering along the disparity dimension to better fuse the cost volumes of the different regions and alleviate the locality defects of convolutions. The CoAtRS method can also be conveniently embedded in various existing 3D CNN stereo networks, yielding significant improvements in both accuracy and efficiency. Furthermore, we design an accurate network (named CoAtRSNet) which achieves state-of-the-art results on five public datasets. At the time of writing, CoAtRSNet ranks 1st–3rd on all the metrics published on the ETH3D website, ranks 2nd on Scene Flow, and on the Middlebury benchmark ranks 1st for the root-mean-square metric, 2nd for the average error metric, and 3rd for the bad 0.5 metric.
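The abstract describes D-global attention as global filtering along the disparity dimension of a cost volume. The paper's exact formulation is not reproduced here, but the general idea can be sketched as follows: at every pixel, treat the D disparity hypotheses as a sequence of tokens and let each hypothesis aggregate evidence from all the others via self-attention. The tensor layout, the projection matrices `Wq`/`Wk`/`Wv`, and the single-head form below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def d_global_attention(cost, Wq, Wk, Wv):
    """Toy global self-attention along the disparity axis.

    cost: (D, C, H, W) cost volume -- at every pixel, the D disparity
          hypotheses are treated as tokens with C-dim features.
    Wq, Wk, Wv: (C, C) projection matrices (hypothetical parameters).
    Returns a filtered cost volume of the same shape.
    """
    D, C, H, W = cost.shape
    tokens = cost.transpose(2, 3, 0, 1)                  # (H, W, D, C)
    q = tokens @ Wq                                      # queries
    k = tokens @ Wk                                      # keys
    v = tokens @ Wv                                      # values
    # Attention over all D disparity hypotheses at each pixel: (H, W, D, D)
    attn = softmax(q @ k.transpose(0, 1, 3, 2) / np.sqrt(C), axis=-1)
    out = attn @ v                                       # (H, W, D, C)
    return out.transpose(2, 3, 0, 1)                     # back to (D, C, H, W)
```

Because the attention matrix spans only the D disparity entries per pixel (not all H*W spatial positions), this kind of filtering is global in disparity yet far cheaper than full spatial global attention, which is consistent with the efficiency argument in the abstract.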




Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grants 62061160490, 62122029, and U20B2064.

Author information

Corresponding author

Correspondence to Xin Yang.

Additional information

Communicated by Yasuyuki Matsushita.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Cheng, J., Xu, G., Guo, P. et al. CoAtRSNet: Fully Exploiting Convolution and Attention for Stereo Matching by Region Separation. Int J Comput Vis 132, 56–73 (2024). https://doi.org/10.1007/s11263-023-01872-0
