DOI: 10.1145/3664647.3680758
Research Article

Advancing 3D Object Grounding Beyond a Single 3D Scene

Published: 28 October 2024

Abstract

As a widely explored multi-modal task, 3D object grounding aims to localize a unique pre-existing object within a single 3D scene given a natural language description. However, this strict setting is unnatural, as it is not always known in advance whether the target object exists in a specific 3D scene. In real-world scenarios, a collection of 3D scenes is generally available: some may not contain the described object, while others may contain multiple target objects. To this end, we introduce a more realistic setting, named Group-wise 3D Object Grounding, which simultaneously processes a group of related 3D scenes and allows a flexible number of target objects in each scene. Rather than localizing target objects in each scene individually, we argue that ignoring the rich visual information contained in the other related 3D scenes of the same group leads to sub-optimal results. To achieve more accurate localization, we propose a baseline method named GNL3D, a Grouped Neural Listener for 3D grounding in the group-wise setting. GNL3D extends the traditional 3D object grounding pipeline with a novel language-guided consensus aggregation and distribution mechanism that explicitly exploits intra-group visual connections. Specifically, based on context-aware spatial-semantic alignment, a language-guided consensus aggregation module aggregates the visual features of target objects in each 3D scene into a visual consensus representation, which is then distributed and injected into a consensus-modulated feature refinement module to refine the visual features, benefiting the subsequent multi-modal reasoning. To validate the effectiveness of the proposed method, we reorganize and enhance the ReferIt3D dataset and propose evaluation metrics to benchmark prior work and GNL3D. Extensive experiments demonstrate that GNL3D achieves state-of-the-art results both in the group-wise setting and on the traditional 3D object grounding task.
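The aggregation-and-distribution idea in the abstract can be illustrated with a short sketch. The code below is a hypothetical toy version, not the paper's implementation: the function name `language_guided_consensus`, the group-wide softmax weighting, and the sigmoid-gated residual injection are all assumed stand-ins for GNL3D's language-guided consensus aggregation and consensus-modulated feature refinement modules.

```python
import math

def language_guided_consensus(scene_feats, lang_feat):
    """Toy sketch of group-wise consensus aggregation and distribution.

    scene_feats: list of scenes; each scene is a list of d-dim object feature vectors.
    lang_feat:   d-dim language query embedding.
    Returns (consensus, refined): a d-dim consensus vector and refined per-scene features.
    """
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))

    # 1) Score every object in the whole group against the language query
    #    (a stand-in for the paper's context-aware spatial-semantic alignment).
    all_feats = [f for scene in scene_feats for f in scene]
    scores = [dot(f, lang_feat) for f in all_feats]
    m = max(scores)
    exp_scores = [math.exp(s - m) for s in scores]
    z = sum(exp_scores)
    weights = [w / z for w in exp_scores]  # softmax over all objects in the group

    # 2) Aggregate one group-level visual consensus representation.
    d = len(lang_feat)
    consensus = [sum(w * f[i] for w, f in zip(weights, all_feats)) for i in range(d)]

    # 3) Distribute: inject the consensus back into each scene's features
    #    via a sigmoid-gated residual (one of many possible injection schemes).
    refined = []
    for scene in scene_feats:
        out = []
        for f in scene:
            gate = 1.0 / (1.0 + math.exp(-dot(f, consensus)))
            out.append([fi + gate * ci for fi, ci in zip(f, consensus)])
        refined.append(out)
    return consensus, refined
```

Objects that align with the language query across all scenes dominate the consensus, so scenes with weak or missing targets still benefit from evidence found elsewhere in the group.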

Supplemental Material

MP4 File - Presentation Video for Advancing 3D Object Grounding Beyond a Single 3D Scene


Cited By

• (2024) Cross-Task Knowledge Transfer for Semi-supervised Joint 3D Grounding and Captioning. Proceedings of the 32nd ACM International Conference on Multimedia, pp. 3818-3827. DOI: 10.1145/3664647.3680614. Online publication date: 28-Oct-2024.

    Published In

    MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
    October 2024
    11719 pages
    ISBN:9798400706868
    DOI:10.1145/3664647
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Author Tags

    1. 3d object grounding
    2. curriculum learning
    3. group-wise learning

    Qualifiers

    • Research-article

    Conference

MM '24: The 32nd ACM International Conference on Multimedia
    October 28 - November 1, 2024
    Melbourne VIC, Australia
