Instructed fine-tuning based on semantic consistency constraint for deep multi-view stereo

Applied Intelligence

Abstract

Existing depth map-based multi-view stereo (MVS) methods typically assume that texture features remain consistent across viewpoints. However, lighting changes, occlusions, and weakly textured regions can make texture features inconsistent, complicating feature extraction; relying on texture consistency alone therefore does not always yield high-quality reconstructions. In contrast, the high-level semantic concepts corresponding to the same objects do remain consistent across viewpoints, a property we define as semantic consistency. Since designing and training new MVS networks from scratch is costly and labor-intensive, we propose fine-tuning existing depth map-based MVS networks during the testing phase by incorporating semantic consistency constraints, improving reconstruction quality in regions with poor results. Exploiting the robust open-set detection and zero-shot segmentation capabilities of Grounded-SAM, we first generate semantic segmentation masks for arbitrary objects in the multi-view images from text instructions. These masks are then used to fine-tune the pre-trained MVS network: masks from the source viewpoints are aligned to the reference viewpoint, and the depth maps are optimized with the proposed semantic consistency loss function. Our method is a test-time approach applicable to a wide range of depth map-based MVS networks, requiring adjustment of only a small number of depth-related parameters. Comprehensive experiments across different MVS networks and large-scale scenes demonstrate that our method effectively enhances reconstruction quality at low computational cost.
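To make the test-time idea concrete, below is a minimal PyTorch sketch of one plausible realization: source-view segmentation masks (e.g., from Grounded-SAM) are inverse-warped into the reference view using the predicted reference depth and known camera parameters, and the disagreement between warped and reference masks serves as a semantic consistency loss that drives fine-tuning of the depth-related parameters. All names here (warp_to_reference, semantic_consistency_loss, finetune_at_test_time, depth_params, the tensor layouts) are illustrative assumptions, not the authors' released code, and the exact loss form in the paper may differ from the L1 penalty used in this sketch.

```python
import torch
import torch.nn.functional as F


def warp_to_reference(src_mask, depth_ref, K_ref, K_src, T_ref_to_src):
    """Inverse-warp a source-view mask (B, 1, H, W) into the reference view,
    using the predicted reference depth (B, 1, H, W), intrinsics K_ref/K_src
    (B, 3, 3), and the relative pose T_ref_to_src (B, 4, 4)."""
    B, _, H, W = depth_ref.shape
    device = depth_ref.device
    # Homogeneous pixel grid of the reference view.
    ys, xs = torch.meshgrid(
        torch.arange(H, device=device, dtype=torch.float32),
        torch.arange(W, device=device, dtype=torch.float32),
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)]).reshape(3, -1)  # (3, H*W)
    # Back-project reference pixels to 3D using the predicted depth.
    pts = torch.inverse(K_ref) @ pix * depth_ref.reshape(B, 1, -1)   # (B, 3, H*W)
    pts_h = torch.cat([pts, torch.ones(B, 1, H * W, device=device)], dim=1)
    # Project the 3D points into the source camera.
    proj = K_src @ (T_ref_to_src @ pts_h)[:, :3]                     # (B, 3, H*W)
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)
    # Bilinearly sample the source mask (grid_sample expects [-1, 1] coords).
    grid = torch.stack(
        [2.0 * uv[:, 0] / (W - 1) - 1.0, 2.0 * uv[:, 1] / (H - 1) - 1.0],
        dim=-1,
    ).reshape(B, H, W, 2)
    return F.grid_sample(src_mask, grid, align_corners=True)


def semantic_consistency_loss(ref_mask, src_masks, depth_ref, cameras):
    """Average L1 disagreement between the reference-view mask and each
    source-view mask warped into the reference view: where the depth is
    correct, the warped semantic masks should agree with the reference mask."""
    loss = depth_ref.new_zeros(())
    for src_mask, (K_ref, K_src, T) in zip(src_masks, cameras):
        warped = warp_to_reference(src_mask, depth_ref, K_ref, K_src, T)
        loss = loss + F.l1_loss(warped, ref_mask)
    return loss / len(src_masks)


def finetune_at_test_time(model, depth_params, batch, num_steps=20, lr=1e-4):
    """Hypothetical test-time loop: `model` is any pre-trained depth map-based
    MVS network (signature assumed); only its depth-related parameters
    (`depth_params`) are updated, the rest stay frozen."""
    optimizer = torch.optim.Adam(depth_params, lr=lr)
    for _ in range(num_steps):
        depth_ref = model(batch["ref_image"], batch["src_images"])
        loss = semantic_consistency_loss(
            batch["ref_mask"], batch["src_masks"], depth_ref, batch["cameras"]
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return depth_ref
```

In practice, pixels that project outside the source image or are occluded would be masked out before computing the loss; such visibility handling is standard in MVS pipelines and is omitted here for brevity.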

Data Availability

The datasets analyzed during the current study are available at https://www.eth3d.net/datasets.

References

  1. Xu Q, Tao W (2019) Multi-scale geometric consistency guided multi-view stereo. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 5483–5492

  2. Xu Q, Kong W, Tao W, Pollefeys M (2022) Multi-scale geometric consistency guided and planar prior assisted multi-view stereo. IEEE Trans Pattern Anal Mach Intell

  3. Wang Y, Zeng Z, Guan T, Yang W, Chen Z, Liu W, Xu L, Luo Y (2023) Adaptive patch deformation for textureless-resilient multi-view stereo. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 1621–1630

  4. Liu H, Zhang C, Deng Y, Liu T, Zhang Z, Li Y-F (2023) Orientation cues-aware facial relationship representation for head pose estimation via transformer. IEEE Trans Image Process 32:6289–6302

  5. Liu H, Zhang C, Deng Y, Xie B, Liu T, Li Y-F (2023) Transifc: Invariant cues-aware feature concentration learning for efficient fine-grained bird image classification. IEEE Trans Multimed

  6. Liu H, Zhou Q, Zhang C, Zhu J, Liu T, Zhang Z, Li Y-F (2024) Mmatrans: Muscle movement aware representation learning for facial expression recognition via transformers. IEEE Trans Ind Inform

  7. Yao Y, Luo Z, Li S, Fang T, Quan L (2018) Mvsnet: Depth inference for unstructured multi-view stereo. In: Proceedings of the European conference on computer vision (ECCV), pp 767–783

  8. Wang F, Galliani S, Vogel C, Speciale P, Pollefeys M (2021) Patchmatchnet: Learned multi-view patchmatch stereo. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 14194–14203

  9. Wang F, Galliani S, Vogel C, Pollefeys M (2022) Itermvs: Iterative probability estimation for efficient multi-view stereo. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 8606–8615

  10. Mi Z, Di C, Xu D (2022) Generalized binary search network for highly-efficient multi-view stereo. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 12991–13000

  11. Su W, Tao W (2023) Efficient edge-preserving multi-view stereo network for depth estimation. In: Proceedings of the AAAI conference on artificial intelligence, vol 37, pp 2348–2356

  12. Wu J, Li R, Xu H, Zhao W, Zhu Y, Sun J, Zhang Y (2024) Gomvs: Geometrically consistent cost aggregation for multi-view stereo. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 20207–20216

  13. Cao C, Ren X, Fu Y (2024) Mvsformer++: Revealing the devil in transformer’s details for multi-view stereo. In: The Twelfth international conference on learning representations

  14. Chen J, Yu Z, Ma L, Zhang K (2023) Uncertainty awareness with adaptive propagation for multi-view stereo. Appl Intell 53(21):26230–26239

  15. Liu T, Ye X, Zhao W, Pan Z, Shi M, Cao Z (2023) When epipolar constraint meets non-local operators in multi-view stereo. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 18088–18097

  16. Xu S, Xu Q, Su W, Tao W (2023) Edge-aware spatial propagation network for multi-view depth estimation. Neural Process Lett 55(8):10905–10923

  17. Zhang J, Wang X, Bai X, Wang C, Huang L, Chen Y, Gu L, Zhou J, Harada T, Hancock ER (2022) Revisiting domain generalized stereo matching networks from a feature consistency perspective. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 13001–13011

  18. Xu H, Zhou Z, Qiao Y, Kang W, Wu Q (2021) Self-supervised multi-view stereo via effective co-segmentation and data-augmentation. In: Proceedings of the AAAI conference on artificial intelligence, vol 35, pp 3030–3038

  19. Wang Y-X, Zhang Y-J (2012) Nonnegative matrix factorization: A comprehensive review. IEEE Trans Knowl Data Eng 25(6):1336–1353

  20. Yuan Z, Cao J, Li Z, Jiang H, Wang Z (2024) Sd-mvs: Segmentation-driven deformation multi-view stereo with spherical refinement and em optimization. In: Proceedings of the AAAI conference on artificial intelligence, vol 38, pp 6871–6880

  21. Kirillov A, Mintun E, Ravi N, Mao H, Rolland C, Gustafson L, Xiao T, Whitehead S, Berg AC, Lo W-Y et al (2023) Segment anything. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 4015–4026

  22. Ren T, Liu S, Zeng A, Lin J, Li K, Cao H, Chen J, Huang X, Chen Y, Yan F et al (2024) Grounded sam: Assembling open-world models for diverse visual tasks. arXiv preprint arXiv:2401.14159

  23. Ulusoy AO, Black MJ, Geiger A (2017) Semantic multi-view stereo: Jointly estimating objects and voxels. In: 2017 IEEE Conference on computer vision and pattern recognition (CVPR), IEEE, pp 4531–4540

  24. Jin Y, Jiang D, Cai M (2020) 3d reconstruction using deep learning: a survey. Commun Inf Syst 20(4):389–413

  25. Schonberger JL, Frahm J-M (2016) Structure-from-motion revisited. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4104–4113

  26. Xu Q, Tao W (2019) Multi-scale geometric consistency guided multi-view stereo. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 5483–5492

  27. Romanoni A, Matteucci M (2019) Tapa-mvs: Textureless-aware patchmatch multi-view stereo. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 10413–10422

  28. Yuan Z, Cao J, Wang Z, Li Z (2024) Tsar-mvs: Textureless-aware segmentation and correlative refinement guided multi-view stereo. Pattern Recogn 154:110565

  29. Liu T, Liu H, Yang B, Zhang Z (2023) Ldcnet: limb direction cues-aware network for flexible human pose estimation in industrial behavioral biometrics systems. IEEE Trans Ind Inform

  30. Liu H, Liu T, Chen Y, Zhang Z, Li Y-F (2022) Ehpe: Skeleton cues-based gaussian coordinate encoding for efficient human pose estimation. IEEE Trans Multimed

  31. Liu H, Liu T, Zhang Z, Sangaiah AK, Yang B, Li Y (2022) Arhpe: Asymmetric relation-aware representation learning for head pose estimation in industrial human-computer interaction. IEEE Trans Ind Inform 18(10):7107–7117

  32. Li H, Guo Y, Zheng X, Xiong H (2024) Learning deformable hypothesis sampling for accurate patchmatch multi-view stereo. In: Proceedings of the AAAI conference on artificial intelligence, vol 38, pp 3082–3090

  33. Hu H, Su L, Mao S, Chen M, Pan G, Xu B, Zhu Q (2023) Adaptive region aggregation for multi-view stereo matching using deformable convolutional networks. Photogramm Rec 38(183):430–449

  34. Wang X, Zhu Z, Huang G, Qin F, Ye Y, He Y, Chi X, Wang X (2022) Mvster: Epipolar transformer for efficient multi-view stereo. In: European conference on computer vision, Springer, pp 573–591

  35. Chen W, Xu H, Zhou Z, Liu Y, Sun B, Kang W, Xie X (2023) Costformer: cost transformer for cost aggregation in multi-view stereo. In: Proceedings of the thirty-second international joint conference on artificial intelligence, pp 599–608

  36. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. Adv Neural Inform Process Syst 30

  37. Ding Y, Yuan W, Zhu Q, Zhang H, Liu X, Wang Y, Liu X (2022) Transmvsnet: Global context-aware multi-view stereo network with transformers. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 8585–8594

  38. Zhao X, Ding W, An Y, Du Y, Yu T, Li M, Tang M, Wang J (2023) Fast segment anything. arXiv preprint arXiv:2306.12156

  39. Zhang C, Han D, Qiao Y, Kim JU, Bae S-H, Lee S, Hong CS (2023) Faster segment anything: Towards lightweight sam for mobile applications. arXiv preprint arXiv:2306.14289

  40. Ke L, Ye M, Danelljan M, Tai Y-W, Tang C-K, Yu F et al (2024) Segment anything in high quality. Adv Neural Inform Process Syst 36

  41. Liu S, Zeng Z, Ren T, Li F, Zhang H, Yang J, Jiang Q, Li C, Yang J, Su H et al (2025) Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In: European conference on computer vision, Springer, pp 38–55

  42. Zhang Y, Huang K, Chen C, Chen Q, Heng P-A (2023) Satta: Semantic-aware test-time adaptation for cross-domain medical image segmentation. In: International conference on medical image computing and computer-assisted intervention, pp 148–158

  43. Enomoto S, Hasegawa N, Adachi K, Sasaki T, Yamaguchi S, Suzuki S, Eda T (2024) Test-time adaptation meets image enhancement: Improving accuracy via uncertainty-aware logit switching. In: 2024 International joint conference on neural networks (IJCNN), pp 1–8. https://doi.org/10.1109/IJCNN60899.2024.10650964

  44. Nam H, Jung DS, Oh Y, Lee KM (2023) Cyclic test-time adaptation on monocular video for 3d human mesh reconstruction. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 14829–14839

  45. Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT, Minneapolis, Minnesota, vol 1, pp 4171–4186

  46. Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B (2021) Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 10012–10022

  47. Aanæs H, Jensen R, Vogiatzis G, Tola E, Dahl A (2016) Large-scale data for multiple-view stereopsis. Int J Comput Vis 120(2):153–168

Funding

This research was supported by the National Natural Science Foundation of China under Grant 62306310.

Author information

Contributions

Conceptualization, Yan Zhang, Hongping Yan and Kun Ding; methodology, Yan Zhang, Hongping Yan and Kun Ding; software, Yan Zhang; validation, Yan Zhang; resources, Yan Zhang; data curation, Yan Zhang; supervision, Hongping Yan and Kun Ding; writing - original draft preparation, Yan Zhang; writing - review and editing, Yan Zhang, Hongping Yan, Kun Ding, Tingting Cai, and Yueyue Zhou; funding acquisition, Kun Ding. All authors have read and agreed to the published version of the manuscript.

Corresponding author

Correspondence to Hongping Yan.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest to declare that are relevant to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhang, Y., Yan, H., Ding, K. et al. Instructed fine-tuning based on semantic consistency constraint for deep multi-view stereo. Appl Intell 55, 473 (2025). https://doi.org/10.1007/s10489-025-06382-9
