Abstract
3D visual grounding, which links objects in a 3D scene to their corresponding language descriptions, is crucial for cross-modal scene understanding. Traditional methods typically rely on fixed attention patterns in their visual encoders, which limits the benefit of language-guided attention. To address this, we introduce a novel language-guided visual context modeling (LGVC) strategy that enriches the visual encoding with language knowledge at multiple levels: (1) a Language-Object Embedding (LOE) module directs attention toward language-relevant proposals in the 3D scene, and (2) a Language-Relation Embedding (LRE) module models the relationships among objects conditioned on the accompanying text. Extensive experiments show that LGVC effectively filters out language-irrelevant proposals and aligns multimodal entities, outperforming state-of-the-art methods.
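To make the mechanism sketched in the abstract concrete, the following is a minimal, hypothetical PyTorch sketch of language-guided attention over object proposals, assuming a standard cross-attention formulation in which sentence tokens re-weight proposal features. The class name, feature dimensions, and residual fusion are illustrative assumptions and may differ from the paper's actual LOE module.

import torch
import torch.nn as nn

class LanguageGuidedProposalAttention(nn.Module):
    # Hypothetical sketch in the spirit of the LOE module: sentence
    # tokens attend over 3D object proposals so that language-irrelevant
    # proposals can be suppressed before cross-modal matching. All names
    # and dimensions are illustrative assumptions, not the authors' code.
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, proposal_feats, word_feats, word_pad_mask=None):
        # proposal_feats: (B, N, dim) features of N detected 3D proposals
        # word_feats:     (B, L, dim) token features of the referring sentence
        # word_pad_mask:  (B, L) bool, True at padded token positions
        attended, attn_weights = self.cross_attn(
            query=proposal_feats,
            key=word_feats,
            value=word_feats,
            key_padding_mask=word_pad_mask,
        )
        # Residual fusion keeps the geometry-based features while letting
        # the sentence re-weight (and effectively filter) the proposals.
        return self.norm(proposal_feats + attended), attn_weights

# Toy usage: 2 scenes, 32 proposals each, 20-token descriptions
proposals = torch.randn(2, 32, 256)
words = torch.randn(2, 20, 256)
fused, weights = LanguageGuidedProposalAttention()(proposals, words)

A plain residual cross-attention block is only one plausible instantiation; the key design point conveyed by the abstract is that language features steer the visual encoder's attention rather than the encoder using a fixed attention pattern.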
Data availability
The ScanRefer dataset is publicly available at https://daveredrum.github.io/ScanRefer/. The Nr3D/Sr3D dataset is likewise publicly available at https://referit3d.github.io/#dataset.
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (Grant Nos. 62173045 and 61673192), in part by the Fundamental Research Funds for the Central Universities (Grant No. 2020XD-A04-3), and in part by the Natural Science Foundation of Hainan Province (Grant No. 622RC675).
Ethics declarations
Conflict of interest
The authors have no relevant financial or nonfinancial interests to disclose.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Geng, L., Yin, J. & Niu, Y. Lgvc: language-guided visual context modeling for 3D visual grounding. Neural Comput & Applic 36, 12977–12990 (2024). https://doi.org/10.1007/s00521-024-09764-1