Abstract
Existing depth map-based multi-view stereo (MVS) methods typically assume that texture features remain consistent across different viewpoints. However, factors such as lighting changes, occlusions, and weakly textured regions can make texture features inconsistent, posing challenges for feature extraction. As a result, relying solely on texture consistency does not always yield high-quality reconstructions. In contrast, the high-level semantic concepts corresponding to the same objects do remain consistent across viewpoints, a property we define as semantic consistency. Since designing and training new MVS networks from scratch is costly and labor-intensive, we propose fine-tuning existing depth map-based MVS networks at test time by incorporating semantic consistency constraints, improving reconstruction quality in regions where results are poor. Exploiting the robust open-set detection and zero-shot segmentation capabilities of Grounded-SAM, we first use it to generate semantic segmentation masks for arbitrary objects in multi-view images from text instructions. These masks are then used to fine-tune pre-trained MVS networks by aligning the masks from different viewpoints to the reference viewpoint and optimizing the depth maps with the proposed semantic consistency loss. Our method is a test-time approach applicable to a wide range of depth map-based MVS networks, requiring adjustment of only a small number of depth-related parameters. Comprehensive experiments across different MVS networks and large-scale scenes demonstrate that our method effectively improves reconstruction quality at low computational cost.
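The core check behind the semantic consistency loss can be illustrated with a minimal sketch: back-project reference-view pixels to 3-D using the estimated depth map, project them into a source view, sample that view's semantic mask, and count label disagreements with the reference mask. This is an assumption-laden toy (NumPy, nearest-neighbor sampling, pinhole cameras; function names like `warp_mask_to_reference` are our own), not the authors' implementation, which operates on Grounded-SAM masks and backpropagates through the network's depth-related parameters.

```python
import numpy as np

def warp_mask_to_reference(src_mask, depth_ref, K_ref, K_src, R, t):
    """Warp a source-view semantic mask into the reference view.

    depth_ref: reference-view depth map (H x W).
    (R, t): relative pose mapping reference-camera coordinates to the
    source camera. Nearest-neighbor sampling for simplicity.
    """
    h, w = depth_ref.shape
    v, u = np.mgrid[0:h, 0:w]
    pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1)  # 3 x N homogeneous pixels
    # Back-project reference pixels to 3-D points using the depth map.
    pts_ref = np.linalg.inv(K_ref) @ pix * depth_ref.reshape(1, -1)
    # Transform into the source camera frame and project.
    pts_src = R @ pts_ref + t.reshape(3, 1)
    u_src = np.round(K_src[0, 0] * pts_src[0] / pts_src[2] + K_src[0, 2]).astype(int)
    v_src = np.round(K_src[1, 1] * pts_src[1] / pts_src[2] + K_src[1, 2]).astype(int)
    valid = (u_src >= 0) & (u_src < w) & (v_src >= 0) & (v_src < h) & (pts_src[2] > 0)
    warped = np.zeros(h * w, dtype=src_mask.dtype)
    warped[valid] = src_mask[v_src[valid], u_src[valid]]
    return warped.reshape(h, w), valid.reshape(h, w)

def semantic_consistency_loss(ref_mask, warped_mask, valid):
    """Fraction of validly warped pixels whose semantic labels disagree."""
    if valid.sum() == 0:
        return 0.0
    return float((ref_mask[valid] != warped_mask[valid]).mean())
```

A correct depth map drives this disagreement toward zero for static objects, which is what makes the loss a usable test-time fine-tuning signal even where texture cues fail.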













Data Availability
The datasets analyzed during the current study are available at the link https://www.eth3d.net/datasets.
Funding
This research was supported by the National Natural Science Foundation of China under Grant 62306310.
Author information
Contributions
Conceptualization, Yan Zhang, Hongping Yan and Kun Ding; methodology, Yan Zhang, Hongping Yan and Kun Ding; software, Yan Zhang; validation, Yan Zhang; resources, Yan Zhang; data curation, Yan Zhang; supervision, Hongping Yan and Kun Ding; writing (original draft preparation), Yan Zhang; writing (review and editing), Yan Zhang, Hongping Yan, Kun Ding, Tingting Cai, and Yueyue Zhou; funding acquisition, Kun Ding. All authors have read and agreed to the published version of the manuscript.
Ethics declarations
Conflict of interest
The authors have no conflicts of interest to declare that are relevant to the content of this article.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, Y., Yan, H., Ding, K. et al. Instructed fine-tuning based on semantic consistency constraint for deep multi-view stereo. Appl Intell 55, 473 (2025). https://doi.org/10.1007/s10489-025-06382-9