Abstract
Existing depth map-based multi-view stereo (MVS) methods typically assume that texture features remain consistent across different viewpoints. However, factors such as lighting changes, occlusions, and weakly textured regions can make texture features inconsistent, posing challenges for feature extraction. As a result, relying solely on texture consistency does not always yield high-quality reconstructions. In contrast, the high-level semantic concepts corresponding to the same objects do remain consistent across viewpoints, a property we define as semantic consistency. Since designing and training new MVS networks from scratch is costly and labor-intensive, we propose fine-tuning existing depth map-based MVS networks at test time by incorporating semantic consistency constraints, improving reconstruction quality in regions where results are poor. Exploiting the robust open-set detection and zero-shot segmentation capabilities of Grounded-SAM, we first use it to generate semantic segmentation masks for arbitrary objects in multi-view images from text instructions. These masks are then used to fine-tune pre-trained MVS networks by aligning the masks from different viewpoints to the reference viewpoint and optimizing the depth maps with the proposed semantic consistency loss. Our method is a test-time approach applicable to a wide range of depth map-based MVS networks, requiring adjustment of only a small number of depth-related parameters. Comprehensive experiments across different MVS networks and large-scale scenes demonstrate that our method effectively improves reconstruction quality at low computational cost.
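The core check behind the semantic consistency loss can be illustrated with a minimal sketch: back-project reference-view pixels to 3-D using the estimated depth map, project them into a source view, sample that view's semantic mask, and count label disagreements with the reference mask. This is an assumption-laden toy (NumPy, nearest-neighbor sampling, pinhole cameras; function names like `warp_mask_to_reference` are our own), not the authors' implementation, which operates on Grounded-SAM masks and backpropagates through the network's depth-related parameters.

```python
import numpy as np

def warp_mask_to_reference(src_mask, depth_ref, K_ref, K_src, R, t):
    """Warp a source-view semantic mask into the reference view.

    depth_ref: reference-view depth map (H x W).
    (R, t): relative pose mapping reference-camera coordinates to the
    source camera. Nearest-neighbor sampling for simplicity.
    """
    h, w = depth_ref.shape
    v, u = np.mgrid[0:h, 0:w]
    pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1)  # 3 x N homogeneous pixels
    # Back-project reference pixels to 3-D points using the depth map.
    pts_ref = np.linalg.inv(K_ref) @ pix * depth_ref.reshape(1, -1)
    # Transform into the source camera frame and project.
    pts_src = R @ pts_ref + t.reshape(3, 1)
    u_src = np.round(K_src[0, 0] * pts_src[0] / pts_src[2] + K_src[0, 2]).astype(int)
    v_src = np.round(K_src[1, 1] * pts_src[1] / pts_src[2] + K_src[1, 2]).astype(int)
    valid = (u_src >= 0) & (u_src < w) & (v_src >= 0) & (v_src < h) & (pts_src[2] > 0)
    warped = np.zeros(h * w, dtype=src_mask.dtype)
    warped[valid] = src_mask[v_src[valid], u_src[valid]]
    return warped.reshape(h, w), valid.reshape(h, w)

def semantic_consistency_loss(ref_mask, warped_mask, valid):
    """Fraction of validly warped pixels whose semantic labels disagree."""
    if valid.sum() == 0:
        return 0.0
    return float((ref_mask[valid] != warped_mask[valid]).mean())
```

A correct depth map drives this disagreement toward zero for static objects, which is what makes the loss a usable test-time fine-tuning signal even where texture cues fail.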













Data Availability
The datasets analyzed during the current study are available at the link https://www.eth3d.net/datasets.
Funding
This research was supported by the National Natural Science Foundation of China under Grant 62306310.
Author information
Contributions
Conceptualization, Yan Zhang, Hongping Yan and Kun Ding; methodology, Yan Zhang, Hongping Yan and Kun Ding; software, Yan Zhang; validation, Yan Zhang; resources, Yan Zhang; data curation, Yan Zhang; supervision, Hongping Yan and Kun Ding; writing (original draft preparation), Yan Zhang; writing (review and editing), Yan Zhang, Hongping Yan, Kun Ding, Tingting Cai, and Yueyue Zhou; funding acquisition, Kun Ding. All authors have read and agreed to the published version of the manuscript.
Ethics declarations
Conflict of interest
The authors have no conflicts of interest to declare that are relevant to the content of this article.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, Y., Yan, H., Ding, K. et al. Instructed fine-tuning based on semantic consistency constraint for deep multi-view stereo. Appl Intell 55, 473 (2025). https://doi.org/10.1007/s10489-025-06382-9