Abstract
3D referring expression comprehension (3DREC) and segmentation (3DRES) have overlapping objectives, indicating their potential for collaboration. However, existing collaborative approaches predominantly depend on the results of one task to make predictions for the other, limiting effective collaboration. We argue that employing separate branches for the 3DREC and 3DRES tasks enhances the model's capacity to learn task-specific information, enabling the branches to acquire complementary knowledge. Thus, we propose the MCLN framework, which includes independent branches for the 3DREC and 3DRES tasks. This enables dedicated exploration of each task and effective coordination between the branches. Furthermore, to facilitate mutual reinforcement between these branches, we introduce a Relative Superpoint Aggregation (RSA) module and an Adaptive Soft Alignment (ASA) module. These modules contribute to the precise alignment of the two branches' predictions, directing the model to allocate increased attention to key positions. Comprehensive experimental evaluation demonstrates that our proposed method achieves state-of-the-art performance on both the 3DREC and 3DRES tasks, with an increase of \(\mathbf{2.05\%}\) in Acc@0.5 for 3DREC and \(\mathbf{3.96\%}\) in mIoU for 3DRES. Our code is available at https://github.com/qzp2018/MCLN.
Z. Qian and Y. Ma contributed equally to this work.
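To make the multi-branch idea concrete, below is a minimal PyTorch sketch: a 3DREC head and a 3DRES head run in parallel over shared, text-conditioned query features, with a simplified stand-in for the RSA module that pools point features into superpoint features for mask prediction. All interfaces, shapes, and the score-weighted pooling are our assumptions for illustration, not the paper's implementation; the ASA module, which softly aligns the two branches' predictions, is omitted here.

```python
# Hypothetical sketch of a two-branch 3DREC/3DRES model; module names and
# shapes are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn


class RelativeSuperpointAggregation(nn.Module):
    """Simplified stand-in for RSA: score-weighted pooling of point features
    into one feature per superpoint."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, point_feats: torch.Tensor, superpoint_ids: torch.Tensor):
        # point_feats: (N, C); superpoint_ids: (N,) superpoint index per point
        w = torch.sigmoid(self.score(point_feats))            # per-point weight, (N, 1)
        num_sp = int(superpoint_ids.max()) + 1
        sp_feats = point_feats.new_zeros(num_sp, point_feats.size(1))
        sp_norm = point_feats.new_zeros(num_sp, 1)
        sp_feats.index_add_(0, superpoint_ids, point_feats * w)
        sp_norm.index_add_(0, superpoint_ids, w)
        return sp_feats / sp_norm.clamp(min=1e-6)             # (num_sp, C)


class MCLNSketch(nn.Module):
    """Independent 3DREC (box) and 3DRES (mask) branches over shared queries."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.rec_head = nn.Linear(dim, 7)    # box per query: center (3) + size (3) + score (1)
        self.res_head = nn.Linear(dim, dim)  # mask embedding per query
        self.rsa = RelativeSuperpointAggregation(dim)

    def forward(self, query_feats, point_feats, superpoint_ids):
        boxes = self.rec_head(query_feats)                    # (Q, 7)
        sp_feats = self.rsa(point_feats, superpoint_ids)      # (S, C)
        masks = self.res_head(query_feats) @ sp_feats.t()     # (Q, S) superpoint mask logits
        return boxes, masks


# Toy usage with random features in place of a real point-cloud/text backbone.
model = MCLNSketch(dim=256)
queries = torch.randn(32, 256)              # 32 text-conditioned queries
points = torch.randn(5000, 256)             # 5000 point features
sp_ids = torch.randint(0, 200, (5000,))     # superpoint id of each point
boxes, masks = model(queries, points, sp_ids)
print(boxes.shape, masks.shape)             # (32, 7) and (32, num_superpoints)
```

The key design point the abstract argues for is visible here: neither head consumes the other's output. The box branch and the mask branch each read the shared queries directly, so each can specialize, while alignment between their predictions is enforced separately (by ASA in the paper).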
Acknowledgements
This work was supported by National Science and Technology Major Project (No. 2022ZD0118201), the National Science Fund for Distinguished Young Scholars (No. 62025603), the National Natural Science Foundation of China (No. U21B2037, No. U22B2051, No. 62072389), the National Natural Science Fund for Young Scholars of China (No. 62302411), China Postdoctoral Science Foundation (No. 2023M732948), the Natural Science Foundation of Fujian Province of China (No. 2021J06003, No. 2022J06001), and partially sponsored by CCF-NetEase ThunderFire Innovation Research Funding (No. CCF-Netease 202301).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Qian, Z. et al. (2025). Multi-branch Collaborative Learning Network for 3D Visual Grounding. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15104. Springer, Cham. https://doi.org/10.1007/978-3-031-72952-2_22
DOI: https://doi.org/10.1007/978-3-031-72952-2_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72951-5
Online ISBN: 978-3-031-72952-2
eBook Packages: Computer Science, Computer Science (R0)