
LGVC: language-guided visual context modeling for 3D visual grounding

  • Original Article
  • Published in: Neural Computing and Applications

Abstract

3D visual grounding, which links objects in a 3D scene to the language expressions that describe them, is crucial for cross-modal scene understanding. Traditional methods often rely on fixed attention patterns in their visual encoders, so the visual encoding cannot benefit from language guidance. To address this, we introduce a novel language-guided visual context modeling (LGVC) strategy that enriches the visual encoding at multiple levels with language knowledge: (1) a Language-Object Embedding (LOE) module directs attention toward language-relevant proposals in the 3D scene, and (2) a Language-Relation Embedding (LRE) module models the relationships among objects in the context of the accompanying text. Extensive experiments show that LGVC effectively filters out language-irrelevant proposals and aligns multimodal entities, outperforming state-of-the-art methods.
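To make the two modules concrete, the following is a minimal PyTorch sketch of the data flow the abstract describes: LOE-style cross-attention from 3D object proposals to word features, and LRE-style pairwise relation modeling conditioned on a sentence-level language feature. All class names, tensor shapes, and design choices here (residual connections, an MLP for relation scoring, mean aggregation, a pooled sentence feature) are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class LanguageObjectEmbedding(nn.Module):
    """LOE-style block (assumed design): cross-attend object proposals to
    word features so language-relevant proposals are emphasized."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, proposals: torch.Tensor, words: torch.Tensor) -> torch.Tensor:
        # proposals: (B, N, d) proposal features; words: (B, L, d) word features
        attended, _ = self.cross_attn(query=proposals, key=words, value=words)
        return self.norm(proposals + attended)  # residual keeps the visual cues


class LanguageRelationEmbedding(nn.Module):
    """LRE-style block (assumed design): score pairwise object relations
    conditioned on a pooled sentence feature via concatenation + MLP."""

    def __init__(self, d_model: int = 256):
        super().__init__()
        self.rel_mlp = nn.Sequential(
            nn.Linear(3 * d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, proposals: torch.Tensor, sentence: torch.Tensor) -> torch.Tensor:
        # proposals: (B, N, d); sentence: (B, d) pooled language feature
        B, N, d = proposals.shape
        pi = proposals.unsqueeze(2).expand(B, N, N, d)      # feature of object i
        pj = proposals.unsqueeze(1).expand(B, N, N, d)      # feature of object j
        s = sentence[:, None, None, :].expand(B, N, N, d)   # broadcast language
        rel = self.rel_mlp(torch.cat([pi, pj, s], dim=-1))  # (B, N, N, d)
        # Aggregate each object's language-conditioned relations to its peers.
        return proposals + rel.mean(dim=2)


if __name__ == "__main__":
    B, N, L, d = 2, 32, 20, 256           # batch, proposals, words, feature dim
    proposals = torch.randn(B, N, d)
    words = torch.randn(B, L, d)
    sentence = words.mean(dim=1)          # crude sentence pooling for the demo
    proposals = LanguageObjectEmbedding(d)(proposals, words)
    proposals = LanguageRelationEmbedding(d)(proposals, sentence)
    print(proposals.shape)                # torch.Size([2, 32, 256])
```

In the paper itself these embeddings enrich the visual encoder at multiple levels; the sketch only shows one plausible way language features could gate individual proposals and their pairwise relations.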




Data availability

The ScanRefer dataset is publicly available at https://daveredrum.github.io/ScanRefer/, and the Nr3D/Sr3D datasets are publicly available at https://referit3d.github.io/#dataset.


Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (Grant Nos. 62173045 and 61673192), in part by the Fundamental Research Funds for the Central Universities (Grant No. 2020XD-A04-3), and in part by the Natural Science Foundation of Hainan Province (Grant No. 622RC675).

Author information


Corresponding author

Correspondence to Jianqin Yin.

Ethics declarations

Conflict of interest

The authors have no relevant financial or non-financial interests to disclose.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Geng, L., Yin, J. & Niu, Y. LGVC: language-guided visual context modeling for 3D visual grounding. Neural Comput & Applic 36, 12977–12990 (2024). https://doi.org/10.1007/s00521-024-09764-1


  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00521-024-09764-1

Keywords