DOI: 10.1145/3664647.3680614

Cross-Task Knowledge Transfer for Semi-supervised Joint 3D Grounding and Captioning

Published: 28 October 2024

Abstract

3D visual grounding is a fundamental task in multimedia understanding that aims to locate a specific object in a complex 3D scene according to a text description. However, training for this task requires a large number of labeled text-object pairs, so the scarcity of annotated data has been a key obstacle. To this end, this paper makes the first attempt to introduce and address a new semi-supervised setting, where only a few text-object labels are provided during training. Since most scene data carries no annotation, we explore a new solution for unlabeled 3D grounding by additionally training, and transferring knowledge from, a correlated task, i.e., 3D captioning. Our main insight is that 3D grounding and captioning are complementary and can be iteratively trained on unlabeled data, providing object and text contexts for each other through pseudo-label learning. Specifically, we propose a novel 3D Cross-Task Teacher-Student Framework (3D-CTTSF) for joint 3D grounding and captioning in the semi-supervised setting, where each branch contains parallel grounding and captioning modules. We first pre-train the two modules of the teacher branch on the limited labeled data for warm-up. We then train the student branch to mimic the teacher and iteratively update both branches with the unlabeled data. In particular, we transfer the learned knowledge between the grounding and captioning modules across the two branches to generate and refine pseudo-labels for the unlabeled data, providing reliable supervision. To further improve pseudo-label quality, we design a cross-task pseudo-label generation scheme that filters low-quality pseudo-labels at the detection, captioning, and grounding levels, respectively. Experimental results on various datasets show competitive performance on both tasks compared with previous fully- and weakly-supervised methods, demonstrating that the proposed 3D-CTTSF is an effective solution to the data scarcity issue.
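To make the training recipe above concrete, the following is a minimal PyTorch-style sketch of the cross-task teacher-student loop the abstract describes. It is an illustration under stated assumptions, not the authors' 3D-CTTSF implementation: the `Branch` stub, the head shapes, the EMA momentum, and the single confidence threshold standing in for the paper's detection-, captioning-, and grounding-level filters are all hypothetical placeholders.

```python
# A minimal, self-contained sketch of the cross-task teacher-student loop
# described in the abstract. Module shapes, the EMA update, and the single
# confidence threshold are hypothetical placeholders, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Branch(nn.Module):
    """One branch holding parallel grounding and captioning heads (stubs)."""
    def __init__(self, feat_dim=256, vocab_size=1000):
        super().__init__()
        self.grounding = nn.Linear(feat_dim, 7)            # box (6) + confidence (1)
        self.captioning = nn.Linear(feat_dim, vocab_size)  # per-object token logits

teacher, student = Branch(), Branch()
teacher.load_state_dict(student.state_dict())  # teacher after warm-up on labeled pairs
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    # Teacher weights track the student as an exponential moving average,
    # the usual mean-teacher recipe (an assumption, not stated in the abstract).
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

CONF_THRESH = 0.9  # stand-in for the paper's multi-level pseudo-label filters

unlabeled_scenes = torch.randn(8, 16, 256)  # 8 fake scenes, 16 object features each
for scene in unlabeled_scenes:
    with torch.no_grad():
        # Cross-task pseudo-labels: the teacher's captioning head describes
        # objects, and its grounding head localizes them back.
        pseudo_caption = teacher.captioning(scene).argmax(dim=-1)
        pseudo_box = teacher.grounding(scene)
        keep = torch.sigmoid(pseudo_box[:, -1]) > CONF_THRESH  # drop low-quality labels
    if keep.any():
        # The student mimics the teacher on the surviving pseudo-labels.
        loss = F.mse_loss(student.grounding(scene)[keep], pseudo_box[keep])
        loss = loss + F.cross_entropy(student.captioning(scene)[keep],
                                      pseudo_caption[keep])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    ema_update(teacher, student)  # iteratively refresh the teacher branch
```

In the paper, pseudo-labels additionally flow across tasks in both directions and are filtered at three distinct levels (detection, captioning, and grounding); the single threshold above only marks where that filtering would sit in the loop.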

Cited By

  • Advancing 3D Object Grounding Beyond a Single 3D Scene. In Proceedings of the 32nd ACM International Conference on Multimedia (2024), 7995-8004. DOI: 10.1145/3664647.3680758. Online publication date: 28-Oct-2024.
  • Joint Top-Down and Bottom-Up Frameworks for 3D Visual Grounding. In Pattern Recognition (2024), 249-264. DOI: 10.1007/978-3-031-78113-1_17. Online publication date: 4-Dec-2024.


    Published In

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
October 2024, 11719 pages
ISBN: 9798400706868
DOI: 10.1145/3664647


Publisher

Association for Computing Machinery, New York, NY, United States



    Author Tags

    1. 3d visual grounding
    2. semi-supervised learning

    Qualifiers

    • Research-article

    Funding Sources

    • National R&D project of China

Conference

MM '24: The 32nd ACM International Conference on Multimedia
October 28 - November 1, 2024
Melbourne VIC, Australia

Acceptance Rates

MM '24 Paper Acceptance Rate: 1,150 of 4,385 submissions, 26%
Overall Acceptance Rate: 2,145 of 8,556 submissions, 25%


Article Metrics

• Downloads (last 12 months): 120
• Downloads (last 6 weeks): 61
Reflects downloads up to 27 Feb 2025.

