MTFusion: Reconstructing Any 3D Object from Single Image Using Multi-word Textual Inversion

Liu, Yu; Wang, Ruowei; Li, Jiaqi; Xu, Zixiang; Zhao, Qijun

doi:10.1007/978-981-97-8508-7_12

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15036)

Included in the following conference series:

Chinese Conference on Pattern Recognition and Computer Vision (PRCV)

Abstract

Reconstructing 3D models from single-view images is a long-standing problem in computer vision. The latest advances for single-image 3D reconstruction extract a textual description from the input image and further utilize it to synthesize 3D models. However, existing methods focus on capturing a single key attribute of the image (e.g., object type, artistic style) and fail to consider the multi-perspective information required for accurate 3D reconstruction, such as object shape and material properties. Besides, the reliance on Neural Radiance Fields hinders their ability to reconstruct intricate surfaces and texture details. In this work, we propose MTFusion, which leverages both image data and textual descriptions for high-fidelity 3D reconstruction. Our approach consists of two stages. First, we adopt a novel multi-word textual inversion technique to extract a detailed text description capturing the image’s characteristics. Then, we use this description and the image to generate a 3D model with FlexiCubes. Additionally, MTFusion enhances FlexiCubes by employing a special decoder network for Signed Distance Functions, leading to faster training and finer surface representation. Extensive evaluations demonstrate that our MTFusion surpasses existing image-to-3D methods on a wide range of synthetic and real-world images. Furthermore, the ablation study proves the effectiveness of our network designs.
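The abstract's second stage relies on a Signed Distance Function (SDF) that an isosurface extractor such as FlexiCubes turns into a mesh. As a rough illustration of that idea only, the sketch below samples an SDF on a regular grid, finds the cells the surface passes through, and locates a zero crossing on one edge by linear interpolation. It is not the authors' method: the analytic sphere stands in for the learned SDF decoder network, and `sphere_sdf`, `surface_cells`, and `edge_zero_crossing` are hypothetical helper names introduced here.

```python
import math


def sphere_sdf(x, y, z, radius=0.5):
    """Signed distance to a sphere centred at the origin.

    Negative inside, positive outside. This analytic function is an
    assumption standing in for a learned SDF decoder.
    """
    return math.sqrt(x * x + y * y + z * z) - radius


def edge_zero_crossing(p0, p1, d0, d1):
    """Linearly interpolate the point on segment p0-p1 where the SDF is zero.

    Assumes d0 and d1 have opposite signs.
    """
    t = d0 / (d0 - d1)
    return tuple(a + t * (b - a) for a, b in zip(p0, p1))


def surface_cells(sdf, resolution=8, lo=-1.0, hi=1.0):
    """Return indices of grid cells that have both negative and positive corners."""
    step = (hi - lo) / resolution
    cells = []
    for i in range(resolution):
        for j in range(resolution):
            for k in range(resolution):
                corners = [
                    sdf(lo + (i + di) * step,
                        lo + (j + dj) * step,
                        lo + (k + dk) * step)
                    for di in (0, 1) for dj in (0, 1) for dk in (0, 1)
                ]
                # The surface crosses this cell only if the sign changes.
                if min(corners) < 0.0 < max(corners):
                    cells.append((i, j, k))
    return cells


cells = surface_cells(sphere_sdf)

# Zero crossing on an edge along the x-axis that straddles the sphere surface.
p0, p1 = (0.25, 0.0, 0.0), (0.75, 0.0, 0.0)
crossing = edge_zero_crossing(p0, p1, sphere_sdf(*p0), sphere_sdf(*p1))
print(len(cells), crossing)
```

Grid-based extractors place mesh vertices in or on such sign-changing cells. A finer grid or a more expressive SDF gives the extractor more detail to work with, which is the general motivation for pairing the extractor with a dedicated SDF decoder.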

Acknowledgements

This work is supported by the National Natural Science Foundation of China (Nos. 62176170 and 61773270) and the Key Science and Technology Plans of Lhasa (No. LSKJ202306).

Author information

Authors and Affiliations

National Key Laboratory of Fundamental Science on Synthetic Vision, Sichuan University, Chengdu, China
Yu Liu, Jiaqi Li & Qijun Zhao
College of Computer Science, Sichuan University, Chengdu, China
Ruowei Wang, Zixiang Xu & Qijun Zhao

Corresponding author

Correspondence to Qijun Zhao.

Editor information

Editors and Affiliations

Peking University, Beijing, China
Zhouchen Lin
Nankai University, Tianjin, China
Ming-Ming Cheng
Chinese Academy of Sciences, Beijing, China
Ran He
Xinjiang University, Ürümqi, Xinjiang, China
Kurban Ubul
Xinjiang University, Ürümqi, China
Wushouer Silamu
Peking University, Beijing, China
Hongbin Zha
Tsinghua University, Beijing, China
Jie Zhou
Chinese Academy of Sciences, Beijing, China
Cheng-Lin Liu

About this paper

Cite this paper

Liu, Y., Wang, R., Li, J., Xu, Z., Zhao, Q. (2025). MTFusion: Reconstructing Any 3D Object from Single Image Using Multi-word Textual Inversion. In: Lin, Z., et al. Pattern Recognition and Computer Vision. PRCV 2024. Lecture Notes in Computer Science, vol 15036. Springer, Singapore. https://doi.org/10.1007/978-981-97-8508-7_12

DOI: https://doi.org/10.1007/978-981-97-8508-7_12
Published: 03 November 2024
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-8507-0
Online ISBN: 978-981-97-8508-7
eBook Packages: Computer Science, Computer Science (R0)
