Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding

Unal, Ozan; Sakaridis, Christos; Saha, Suman; Van Gool, Luc

doi:10.1007/978-3-031-73116-7_12

Ozan Unal (ORCID: orcid.org/0000-0002-1121-3883), Christos Sakaridis, Suman Saha & Luc Van Gool

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15134)

Included in the following conference series:

European Conference on Computer Vision

Abstract

3D visual grounding is the task of localizing the object in a 3D scene that is referred to by a natural language description. With applications ranging from autonomous indoor robotics to AR/VR, the task has recently risen in popularity. A common formulation of 3D visual grounding is grounding-by-detection, where localization is done via bounding boxes. However, for real-life applications that require physical interaction, a bounding box insufficiently describes the geometry of an object. We therefore tackle the problem of dense 3D visual grounding, i.e. referral-based 3D instance segmentation. We propose a dense 3D grounding network, ConcreteNet, featuring four novel stand-alone modules that aim to improve grounding performance for challenging repetitive instances, i.e. instances with distractors of the same semantic class. First, we introduce a bottom-up attentive fusion module that aims to disambiguate inter-instance relational cues. Next, we construct a contrastive training scheme to induce separation in the latent space. We then resolve view-dependent utterances via a learned global camera token. Finally, we employ multi-view ensembling to improve referred mask quality. ConcreteNet ranks 1st on the challenging ScanRefer online benchmark and has won the ICCV 3rd Workshop on Language for 3D Scenes “3D Object Localization” challenge. Our code is available at ouenal.github.io/concretenet/.
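
The multi-view ensembling step mentioned in the abstract can be pictured with a small sketch. The abstract does not describe the procedure, so the code below assumes one plausible reading: the same points are scored from several views of the scene, the per-point foreground probabilities are averaged, and the mean is thresholded. The function name `ensemble_masks`, the threshold of 0.5, and the example probabilities are all hypothetical and are not taken from the ConcreteNet code.

```python
from typing import List, Sequence


def ensemble_masks(view_probs: Sequence[Sequence[float]],
                   threshold: float = 0.5) -> List[bool]:
    """Average per-point foreground probabilities over views, then threshold.

    view_probs[v][i] is the (assumed) probability that point i belongs to
    the referred object when the scene is processed from view v.
    """
    if not view_probs:
        raise ValueError("need at least one view")
    num_points = len(view_probs[0])
    if any(len(v) != num_points for v in view_probs):
        raise ValueError("all views must score the same points")
    num_views = len(view_probs)
    # Mean probability per point across all views.
    mean = [sum(v[i] for v in view_probs) / num_views
            for i in range(num_points)]
    # A point is kept if its mean probability exceeds the threshold.
    return [p > threshold for p in mean]


# Hypothetical probabilities for 4 points scored from 3 views.
views = [
    [0.9, 0.2, 0.6, 0.1],
    [0.8, 0.4, 0.3, 0.2],
    [0.7, 0.1, 0.7, 0.0],
]
mask = ensemble_masks(views)
```

In this toy input the third point is rejected by the second view alone (0.3) but kept after averaging, which is the kind of per-view noise that ensembling is meant to smooth out.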

Notes

1. Classifying each point yields a more robust solution than localizing 8 corner points in completely free 3D space.
2. We believe that input camera positions are a reasonable assumption in indoor robotic applications and hope that this performance potential will motivate future research.
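
Note 1 contrasts per-point classification with regressing 8 box corners. One consequence is that a box can still be recovered from a dense prediction. The sketch below is illustrative only and is not the authors' procedure: it assumes an axis-aligned box is wanted, and the helper name `mask_to_aabb` and the sample points are hypothetical.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def mask_to_aabb(points: Sequence[Point],
                 mask: Sequence[bool]) -> Tuple[Point, Point]:
    """Return the (min corner, max corner) of the points selected by mask."""
    selected = [p for p, keep in zip(points, mask) if keep]
    if not selected:
        raise ValueError("mask selects no points")
    # Per-axis minimum and maximum over the selected points.
    lo = tuple(min(p[axis] for p in selected) for axis in range(3))
    hi = tuple(max(p[axis] for p in selected) for axis in range(3))
    return lo, hi


# Hypothetical scene with 4 points, 3 of which belong to the referred object.
points = [(0.0, 0.0, 0.0), (1.0, 2.0, 3.0), (5.0, 5.0, 5.0), (2.0, 1.0, 0.5)]
mask = [True, True, False, True]
box = mask_to_aabb(points, mask)
```

The reverse direction does not hold: a box alone does not say which points inside it belong to the object, which matches the abstract's argument that a bounding box insufficiently describes object geometry.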

Acknowledgments

This work is funded by Toyota Motor Europe via the research project TRACE-Zürich.

Author information

Authors and Affiliations

ETH Zurich, Zurich, Switzerland
Ozan Unal, Christos Sakaridis, Suman Saha & Luc Van Gool
Huawei Technologies, Zurich, Switzerland
Ozan Unal
PSI, Zurich, Switzerland
Suman Saha
KU Leuven, Leuven, Belgium
Luc Van Gool
INSAIT, Sofia, Bulgaria
Luc Van Gool

Corresponding author

Correspondence to Ozan Unal.

Editor information

Editors and Affiliations

University of Birmingham, Birmingham, UK
Aleš Leonardis
University of Trento, Trento, Italy
Elisa Ricci
Technical University of Darmstadt, Darmstadt, Germany
Stefan Roth
Princeton University, Princeton, NJ, USA
Olga Russakovsky
Czech Technical University in Prague, Prague, Czech Republic
Torsten Sattler
École des Ponts ParisTech, Marne-la-Vallée, France
Gül Varol

Cite this paper

Unal, O., Sakaridis, C., Saha, S., Van Gool, L. (2025). Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15134. Springer, Cham. https://doi.org/10.1007/978-3-031-73116-7_12

DOI: https://doi.org/10.1007/978-3-031-73116-7_12
Published: 31 October 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73115-0
Online ISBN: 978-3-031-73116-7
eBook Packages: Computer Science, Computer Science (R0)
