DOI: 10.1145/3664647.3680614

Cross-Task Knowledge Transfer for Semi-supervised Joint 3D Grounding and Captioning

Published: 28 October 2024

Abstract

3D visual grounding is a fundamental task in multimedia understanding that aims to locate a specific object in a complex 3D scene according to a text description. However, training for this task requires a large number of labeled text-object pairs, so the scarcity of annotated data has been a key obstacle. To this end, this paper makes the first attempt to introduce and address a new semi-supervised setting, where only a few text-object labels are provided during training. Since most scene data carries no annotation, we explore a new solution for unlabeled 3D grounding by additionally training, and transferring knowledge from, a correlated task, i.e., 3D captioning. Our main insight is that 3D grounding and captioning are complementary and can be iteratively trained on unlabeled data, providing object and text contexts for each other through pseudo-label learning. Specifically, we propose a novel 3D Cross-Task Teacher-Student Framework (3D-CTTSF) for joint 3D grounding and captioning in the semi-supervised setting, where each branch contains parallel grounding and captioning modules. We first pre-train the two modules of the teacher branch on the limited labeled data for warm-up. We then train the student branch to mimic the teacher and iteratively update both branches with the unlabeled data. In particular, we transfer the learned knowledge between the grounding and captioning modules across the two branches to generate and refine pseudo-labels for the unlabeled data, providing reliable supervision. To further improve pseudo-label quality, we design a cross-task pseudo-label generation scheme that filters low-quality pseudo-labels at the detection, captioning, and grounding levels, respectively. Experimental results on various datasets show competitive performance on both tasks compared with previous fully- and weakly-supervised methods, demonstrating that the proposed 3D-CTTSF is an effective solution to the data scarcity issue.
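To make the training recipe above concrete, the following is a minimal PyTorch-style sketch of the cross-task teacher-student loop the abstract describes. It is an illustration under stated assumptions, not the authors' 3D-CTTSF implementation: the `Branch` stub, the head shapes, the EMA momentum, and the single confidence threshold standing in for the paper's detection-, captioning-, and grounding-level filters are all hypothetical placeholders.

```python
# A minimal, self-contained sketch of the cross-task teacher-student loop
# described in the abstract. Module shapes, the EMA update, and the single
# confidence threshold are hypothetical placeholders, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Branch(nn.Module):
    """One branch holding parallel grounding and captioning heads (stubs)."""
    def __init__(self, feat_dim=256, vocab_size=1000):
        super().__init__()
        self.grounding = nn.Linear(feat_dim, 7)            # box (6) + confidence (1)
        self.captioning = nn.Linear(feat_dim, vocab_size)  # per-object token logits

teacher, student = Branch(), Branch()
teacher.load_state_dict(student.state_dict())  # teacher after warm-up on labeled pairs
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    # Teacher weights track the student as an exponential moving average,
    # the usual mean-teacher recipe (an assumption, not stated in the abstract).
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

CONF_THRESH = 0.9  # stand-in for the paper's multi-level pseudo-label filters

unlabeled_scenes = torch.randn(8, 16, 256)  # 8 fake scenes, 16 object features each
for scene in unlabeled_scenes:
    with torch.no_grad():
        # Cross-task pseudo-labels: the teacher's captioning head describes
        # objects, and its grounding head localizes them back.
        pseudo_caption = teacher.captioning(scene).argmax(dim=-1)
        pseudo_box = teacher.grounding(scene)
        keep = torch.sigmoid(pseudo_box[:, -1]) > CONF_THRESH  # drop low-quality labels
    if keep.any():
        # The student mimics the teacher on the surviving pseudo-labels.
        loss = F.mse_loss(student.grounding(scene)[keep], pseudo_box[keep])
        loss = loss + F.cross_entropy(student.captioning(scene)[keep],
                                      pseudo_caption[keep])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    ema_update(teacher, student)  # iteratively refresh the teacher branch
```

In the paper, pseudo-labels additionally flow across tasks in both directions and are filtered at three distinct levels (detection, captioning, and grounding); the single threshold above only marks where that filtering would sit in the loop.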

Cited By

  • Advancing 3D Object Grounding Beyond a Single 3D Scene. In Proceedings of the 32nd ACM International Conference on Multimedia (2024), 7995-8004. DOI: 10.1145/3664647.3680758. Online publication date: 28-Oct-2024.
  • Joint Top-Down and Bottom-Up Frameworks for 3D Visual Grounding. In Pattern Recognition (2024), 249-264. DOI: 10.1007/978-3-031-78113-1_17. Online publication date: 4-Dec-2024.


    Published In

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
October 2024, 11719 pages
ISBN: 9798400706868
DOI: 10.1145/3664647


Publisher

Association for Computing Machinery, New York, NY, United States



    Author Tags

    1. 3d visual grounding
    2. semi-supervised learning

    Qualifiers

    • Research-article

    Funding Sources

    • National R&D project of China

Conference

MM '24: The 32nd ACM International Conference on Multimedia
October 28 - November 1, 2024
Melbourne VIC, Australia

Acceptance Rates

MM '24 Paper Acceptance Rate: 1,150 of 4,385 submissions, 26%
Overall Acceptance Rate: 2,145 of 8,556 submissions, 25%


Article Metrics

• Downloads (last 12 months): 120
• Downloads (last 6 weeks): 61
Reflects downloads up to 27 Feb 2025.

