DOI: 10.1145/3664647.3680758
Research Article

Advancing 3D Object Grounding Beyond a Single 3D Scene

Published: 28 October 2024

Abstract

As a widely explored multi-modal task, 3D object grounding aims to localize a unique pre-existing object within a single 3D scene given a natural language description. However, this strict setting is unnatural, as it is not always known in advance whether the target object exists in a specific 3D scene. In real-world scenarios, a collection of 3D scenes is generally available: some may not contain the described object, while others may contain multiple target objects. To this end, we introduce a more realistic setting, named Group-wise 3D Object Grounding, which simultaneously processes a group of related 3D scenes and allows a flexible number of target objects in each scene. Rather than localizing target objects in each scene individually, we argue that ignoring the rich visual information contained in the other related 3D scenes of the same group leads to sub-optimal results. To achieve more accurate localization, we propose a baseline method named GNL3D, a Grouped Neural Listener for 3D grounding in the group-wise setting. GNL3D extends the traditional 3D object grounding pipeline with a novel language-guided consensus aggregation and distribution mechanism that explicitly exploits intra-group visual connections. Specifically, based on context-aware spatial-semantic alignment, a language-guided consensus aggregation module aggregates the visual features of target objects in each 3D scene into a visual consensus representation, which is then distributed and injected into a consensus-modulated feature refinement module to refine the visual features, benefiting the subsequent multi-modal reasoning. To validate the effectiveness of the proposed method, we reorganize and enhance the ReferIt3D dataset and propose evaluation metrics to benchmark prior work and GNL3D. Extensive experiments demonstrate that GNL3D achieves state-of-the-art results both in the group-wise setting and on the traditional 3D object grounding task.
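The aggregation-and-distribution idea in the abstract can be illustrated with a short sketch. The code below is a hypothetical toy version, not the paper's implementation: the function name `language_guided_consensus`, the group-wide softmax weighting, and the sigmoid-gated residual injection are all assumed stand-ins for GNL3D's language-guided consensus aggregation and consensus-modulated feature refinement modules.

```python
import math

def language_guided_consensus(scene_feats, lang_feat):
    """Toy sketch of group-wise consensus aggregation and distribution.

    scene_feats: list of scenes; each scene is a list of d-dim object feature vectors.
    lang_feat:   d-dim language query embedding.
    Returns (consensus, refined): a d-dim consensus vector and refined per-scene features.
    """
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))

    # 1) Score every object in the whole group against the language query
    #    (a stand-in for the paper's context-aware spatial-semantic alignment).
    all_feats = [f for scene in scene_feats for f in scene]
    scores = [dot(f, lang_feat) for f in all_feats]
    m = max(scores)
    exp_scores = [math.exp(s - m) for s in scores]
    z = sum(exp_scores)
    weights = [w / z for w in exp_scores]  # softmax over all objects in the group

    # 2) Aggregate one group-level visual consensus representation.
    d = len(lang_feat)
    consensus = [sum(w * f[i] for w, f in zip(weights, all_feats)) for i in range(d)]

    # 3) Distribute: inject the consensus back into each scene's features
    #    via a sigmoid-gated residual (one of many possible injection schemes).
    refined = []
    for scene in scene_feats:
        out = []
        for f in scene:
            gate = 1.0 / (1.0 + math.exp(-dot(f, consensus)))
            out.append([fi + gate * ci for fi, ci in zip(f, consensus)])
        refined.append(out)
    return consensus, refined
```

Objects that align with the language query across all scenes dominate the consensus, so scenes with weak or missing targets still benefit from evidence found elsewhere in the group.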

Supplemental Material

MP4 File - Presentation Video for Advancing 3D Object Grounding Beyond a Single 3D Scene


Cited By

• (2024) Cross-Task Knowledge Transfer for Semi-supervised Joint 3D Grounding and Captioning. Proceedings of the 32nd ACM International Conference on Multimedia, pp. 3818-3827. DOI: 10.1145/3664647.3680614. Online publication date: 28-Oct-2024.

    Published In

    MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
    October 2024
    11719 pages
    ISBN:9798400706868
    DOI:10.1145/3664647
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Author Tags

    1. 3d object grounding
    2. curriculum learning
    3. group-wise learning

    Qualifiers

    • Research-article

    Conference

MM '24: The 32nd ACM International Conference on Multimedia
    October 28 - November 1, 2024
    Melbourne VIC, Australia
