Disparity Refinement Based on Cross-Modal Feature Fusion and Global Hourglass Aggregation for Robust Stereo Matching

  • Conference paper
Pattern Recognition and Computer Vision (PRCV 2024)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 15036))

Abstract

Stereo matching is a critical research area in computer vision. With the advancement of deep learning, iterative optimization techniques, which generalize well, have gradually replaced cost-filtering methods. However, cost volumes constructed solely through recurrent all-pairs field transforms in iterative optimization methods lack adequate image information, which makes it difficult to resolve blurring in ill-posed regions such as those with illumination changes or similar textures. In this paper, we propose SCA-Stereo, a disparity refinement network that further optimizes the initial disparity map generated by iteration. First, we introduce a high- and low-frequency feature extractor to capture the structural and fine-detail information in the image. Second, we propose a cross-modal feature fusion module that exchanges and integrates these diverse features and expands the receptive field to enhance information flow. Finally, we design a global hourglass aggregation network to efficiently capture non-local interactions among the fused features. Extensive experiments on Scene Flow, KITTI, Middlebury, and ETH3D demonstrate that SCA-Stereo achieves state-of-the-art stereo matching performance.
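To make the idea of separating high- and low-frequency image information concrete, the following is a minimal sketch and not the SCA-Stereo extractor itself, whose design the abstract does not specify. It assumes a simple decomposition in which the low-frequency part is a block-averaged copy of the image (average pooling followed by nearest-neighbour upsampling) and the high-frequency part is the residual. The function name `split_frequencies` and the `window` parameter are hypothetical.

```python
def split_frequencies(image, window=2):
    """Split a 2D grid of numbers into low- and high-frequency parts.

    Assumption (not from the paper): low frequency = block mean over
    window x window tiles; high frequency = image minus low frequency.
    """
    height, width = len(image), len(image[0])
    low = [[0.0] * width for _ in range(height)]
    for top in range(0, height, window):
        for left in range(0, width, window):
            rows = range(top, min(top + window, height))
            cols = range(left, min(left + window, width))
            block = [image[r][c] for r in rows for c in cols]
            mean = sum(block) / len(block)
            # Write the block mean back to every pixel of the tile.
            for r in rows:
                for c in cols:
                    low[r][c] = mean
    # The residual keeps edges and fine texture that the mean removes.
    high = [[image[r][c] - low[r][c] for c in range(width)]
            for r in range(height)]
    return low, high


# Usage: the two parts always sum back to the original image.
low, high = split_frequencies([[1, 3], [5, 7]])
```

In this toy decomposition the low-frequency map carries coarse structure and the residual carries fine detail; a learned extractor would replace both steps with trainable layers.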

Acknowledgments.

This work was supported by the Natural Science Foundation of Jiangsu Province (No. BK20181340), and the Engineering Research Center of Integration and Application of Digital Learning Technology, Ministry of Education (No. 1311013).

Author information

Corresponding authors

Correspondence to Gang Wang or Jinlong Yang.

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Wang, G., Yang, J., Wang, Y. (2025). Disparity Refinement Based on Cross-Modal Feature Fusion and Global Hourglass Aggregation for Robust Stereo Matching. In: Lin, Z., et al. Pattern Recognition and Computer Vision. PRCV 2024. Lecture Notes in Computer Science, vol 15036. Springer, Singapore. https://doi.org/10.1007/978-981-97-8508-7_15

  • DOI: https://doi.org/10.1007/978-981-97-8508-7_15

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-8507-0

  • Online ISBN: 978-981-97-8508-7

  • eBook Packages: Computer Science (R0)
