Abstract
3D visual grounding, which links objects in a 3D scene to their corresponding language descriptions, is crucial for cross-modal scene understanding. Traditional methods typically rely on fixed attention patterns in their visual encoders, which limits the benefit of language-guided attention. To address this, we introduce a novel language-guided visual context modeling (LGVC) strategy that enriches the visual encoding with language knowledge at multiple levels: (1) a Language-Object Embedding (LOE) module directs attention toward language-relevant proposals in the 3D scene, and (2) a Language-Relation Embedding (LRE) module models the relationships among objects conditioned on the accompanying text. Extensive experiments show that LGVC effectively filters out language-irrelevant proposals and aligns multimodal entities, outperforming state-of-the-art methods.
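To make the mechanism sketched in the abstract concrete, the following is a minimal, hypothetical PyTorch sketch of language-guided attention over object proposals, assuming a standard cross-attention formulation in which sentence tokens re-weight proposal features. The class name, feature dimensions, and residual fusion are illustrative assumptions and may differ from the paper's actual LOE module.

import torch
import torch.nn as nn

class LanguageGuidedProposalAttention(nn.Module):
    # Hypothetical sketch in the spirit of the LOE module: sentence
    # tokens attend over 3D object proposals so that language-irrelevant
    # proposals can be suppressed before cross-modal matching. All names
    # and dimensions are illustrative assumptions, not the authors' code.
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, proposal_feats, word_feats, word_pad_mask=None):
        # proposal_feats: (B, N, dim) features of N detected 3D proposals
        # word_feats:     (B, L, dim) token features of the referring sentence
        # word_pad_mask:  (B, L) bool, True at padded token positions
        attended, attn_weights = self.cross_attn(
            query=proposal_feats,
            key=word_feats,
            value=word_feats,
            key_padding_mask=word_pad_mask,
        )
        # Residual fusion keeps the geometry-based features while letting
        # the sentence re-weight (and effectively filter) the proposals.
        return self.norm(proposal_feats + attended), attn_weights

# Toy usage: 2 scenes, 32 proposals each, 20-token descriptions
proposals = torch.randn(2, 32, 256)
words = torch.randn(2, 20, 256)
fused, weights = LanguageGuidedProposalAttention()(proposals, words)

A plain residual cross-attention block is only one plausible instantiation; the key design point conveyed by the abstract is that language features steer the visual encoder's attention rather than the encoder using a fixed attention pattern.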
Data availability
The ScanRefer dataset is publicly available at https://daveredrum.github.io/ScanRefer/. The Nr3D/Sr3D dataset is likewise publicly available at https://referit3d.github.io/#dataset.
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (Grant Nos. 62173045 and 61673192), in part by the Fundamental Research Funds for the Central Universities (Grant No. 2020XD-A04-3), and in part by the Natural Science Foundation of Hainan Province (Grant No. 622RC675).
Ethics declarations
Conflict of interest
The authors have no relevant financial or nonfinancial interests to disclose.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Geng, L., Yin, J. & Niu, Y. Lgvc: language-guided visual context modeling for 3D visual grounding. Neural Comput & Applic 36, 12977–12990 (2024). https://doi.org/10.1007/s00521-024-09764-1