Abstract
Recently, the transformer model has been successfully employed for the multi-view 3D reconstruction problem. However, challenges remain in designing an attention mechanism that explores multi-view features and exploits their relations to reinforce the encoding-decoding modules. This paper proposes a new model, the 3D coarse-to-fine transformer (3D-C2FT), which introduces a novel coarse-to-fine (C2F) attention mechanism for encoding multi-view features and rectifying defective voxel-based 3D objects. The C2F attention mechanism enables the model to learn multi-view information flow and synthesize 3D surface corrections in a coarse-to-fine-grained manner. The proposed model is evaluated on the ShapeNet and Multi-view Real-life voxel-based datasets. Experimental results show that 3D-C2FT achieves notable results and outperforms several competing models on these datasets.
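To give a rough sense of how attention over multi-view features can be organized at two granularities, the PyTorch sketch below pools per-view tokens into a coarse and a finer set of summary tokens, runs self-attention on the coarse set, and lets the fine tokens additionally attend to the coarse summary before fusing both scales. This is a minimal illustrative sketch under assumed dimensions and module choices, not the authors' architecture; their official implementation is linked in the Notes below.

```python
# Illustrative sketch of coarse-to-fine attention over multi-view features.
# NOT the released 3D-C2FT implementation; token counts, dimensions and the
# fusion head are assumptions made for this example.
import torch
import torch.nn as nn


class C2FAttentionSketch(nn.Module):
    """Attend over pooled view tokens at a coarse granularity, then at a
    finer granularity that also sees the coarse summary, and fuse both."""

    def __init__(self, dim=256, heads=4, coarse_tokens=4, fine_tokens=16):
        super().__init__()
        self.coarse_pool = nn.AdaptiveAvgPool1d(coarse_tokens)  # coarse summary
        self.fine_pool = nn.AdaptiveAvgPool1d(fine_tokens)      # finer summary
        self.coarse_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fine_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, view_tokens):
        # view_tokens: (batch, views * tokens_per_view, dim) multi-view features
        x = view_tokens.transpose(1, 2)               # (B, dim, N) for 1D pooling
        coarse = self.coarse_pool(x).transpose(1, 2)  # (B, coarse_tokens, dim)
        fine = self.fine_pool(x).transpose(1, 2)      # (B, fine_tokens, dim)
        coarse_out, _ = self.coarse_attn(coarse, coarse, coarse)
        # Fine tokens attend to themselves plus the coarse summary,
        # giving a simple coarse-to-fine information flow.
        kv = torch.cat([fine, coarse_out], dim=1)
        fine_out, _ = self.fine_attn(fine, kv, kv)
        fused = self.fuse(
            torch.cat([coarse_out.mean(dim=1), fine_out.mean(dim=1)], dim=-1)
        )
        return fused                                  # (B, dim) fused descriptor


if __name__ == "__main__":
    tokens = torch.randn(2, 3 * 16, 256)  # 2 objects, 3 views, 16 tokens per view
    print(C2FAttentionSketch()(tokens).shape)  # torch.Size([2, 256])
```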
L. C. O. Tiong and D. Sigmund: These authors contributed equally to this work.
Notes
1. Source Code URL: https://github.com/tiongleslie/3D-C2FT/.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Tiong, L.C.O., Sigmund, D., Teoh, A.B.J. (2023). 3D-C2FT: Coarse-to-Fine Transformer for Multi-view 3D Reconstruction. In: Wang, L., Gall, J., Chin, TJ., Sato, I., Chellappa, R. (eds) Computer Vision – ACCV 2022. ACCV 2022. Lecture Notes in Computer Science, vol 13841. Springer, Cham. https://doi.org/10.1007/978-3-031-26319-4_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-26318-7
Online ISBN: 978-3-031-26319-4