Abstract
A fundamental problem in texturing 3D meshes with pre-trained text-to-image models is ensuring multi-view consistency. State-of-the-art approaches typically use diffusion models to aggregate multi-view inputs, where common issues include blurriness caused by the averaging operation in the aggregation step and inconsistencies in local features. This paper introduces an optimization framework that proceeds in four stages to achieve multi-view consistency. Specifically, the first stage generates an over-complete set of 2D textures from a predefined set of viewpoints using an MV-consistent diffusion process. The second stage selects a subset of views that are mutually consistent while covering the underlying 3D model. We show how to achieve this goal by solving semi-definite programs. The third stage performs non-rigid alignment to align the selected views across overlapping regions. The fourth stage solves an MRF problem to associate each mesh face with a selected view. In particular, the third and fourth stages are iterated, with the cuts obtained in the fourth stage encouraging non-rigid alignment in the third stage to focus on regions close to the cuts. Experimental results show that our approach significantly outperforms baseline approaches both qualitatively and quantitatively. Project page: https://aigc3d.github.io/ConsistenTex.
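To make the fourth stage concrete: associating each mesh face with one selected view can be cast as a discrete labeling problem, with a unary term scoring how well a view textures a face and a pairwise term penalizing seams between adjacent faces assigned different views. The sketch below is not the paper's actual energy or solver; it illustrates the general idea with hypothetical costs and a generic iterated-conditional-modes (ICM) relaxation.

```python
def mrf_face_to_view(unary, edges, seam_penalty=1.0, iters=20):
    """Toy MRF face-to-view assignment via iterated conditional modes.

    unary[f][v]  -- hypothetical cost of texturing face f from view v
    edges        -- list of (f, g) pairs of adjacent mesh faces
    Returns one view index (label) per face.
    """
    n_faces = len(unary)
    n_views = len(unary[0])
    # Initialize each face independently with its cheapest view.
    labels = [min(range(n_views), key=lambda v: unary[f][v])
              for f in range(n_faces)]
    neighbors = [[] for _ in range(n_faces)]
    for f, g in edges:
        neighbors[f].append(g)
        neighbors[g].append(f)
    for _ in range(iters):
        changed = False
        for f in range(n_faces):
            def cost(v):
                # Unary term plus a Potts-style seam penalty for each
                # neighbor currently assigned a different view.
                return unary[f][v] + sum(
                    seam_penalty for g in neighbors[f] if labels[g] != v)
            best = min(range(n_views), key=cost)
            if best != labels[f]:
                labels[f] = best
                changed = True
        if not changed:  # converged to a local minimum of the energy
            break
    return labels


# Usage: three faces in a strip; the middle face slightly prefers view 1,
# but the seam penalty pulls it to agree with its neighbors on view 0.
unary = [[0.0, 2.0], [0.4, 0.0], [0.0, 2.0]]
print(mrf_face_to_view(unary, [(0, 1), (1, 2)]))
```

ICM is only a local solver; the literature on structured energy minimization (e.g. tree-reweighted message passing) offers stronger alternatives for such MRFs, and the paper's iteration with the alignment stage adds a feedback loop this sketch omits.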
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Zhao, Z. et al. (2025). An Optimization Framework to Enforce Multi-view Consistency for Texturing 3D Meshes. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15094. Springer, Cham. https://doi.org/10.1007/978-3-031-72764-1_9
Print ISBN: 978-3-031-72763-4
Online ISBN: 978-3-031-72764-1