
Multi-branch Collaborative Learning Network for 3D Visual Grounding

  • Conference paper

Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15104)


Abstract

3D referring expression comprehension (3DREC) and segmentation (3DRES) share overlapping objectives, indicating their potential for collaboration. However, existing collaborative approaches predominantly rely on the results of one task to make predictions for the other, which limits effective collaboration. We argue that employing separate branches for the 3DREC and 3DRES tasks enhances the model's capacity to learn task-specific information, enabling the branches to acquire complementary knowledge. Thus, we propose the MCLN framework, which includes independent branches for the 3DREC and 3DRES tasks. This enables dedicated exploration of each task and effective coordination between the branches. Furthermore, to facilitate mutual reinforcement between the branches, we introduce a Relative Superpoint Aggregation (RSA) module and an Adaptive Soft Alignment (ASA) module. These modules contribute to the precise alignment of the two branches' predictions and direct the model to allocate increased attention to key positions. Comprehensive experimental evaluation demonstrates that our method achieves state-of-the-art performance on both the 3DREC and 3DRES tasks, with an increase of 2.05% in Acc@0.5 for 3DREC and 3.96% in mIoU for 3DRES. Our code is available at https://github.com/qzp2018/MCLN.
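The two metrics quoted above are standard in this literature: Acc@0.5 counts a grounding prediction as correct when its predicted 3D box overlaps the ground-truth box with IoU above 0.5, and mIoU averages per-sample mask IoU over the test set. A minimal sketch of both follows (generic definitions, not the paper's evaluation code; the axis-aligned box layout `(xmin, ymin, zmin, xmax, ymax, zmax)` is an assumption for illustration):

```python
import numpy as np

def box_iou(a, b):
    """IoU between two axis-aligned 3D boxes (xmin, ymin, zmin, xmax, ymax, zmax)."""
    lo = np.maximum(a[:3], b[:3])            # intersection lower corner
    hi = np.minimum(a[3:], b[3:])            # intersection upper corner
    inter = np.prod(np.clip(hi - lo, 0, None))  # zero if boxes are disjoint
    vol_a = np.prod(a[3:] - a[:3])
    vol_b = np.prod(b[3:] - b[:3])
    return inter / (vol_a + vol_b - inter)

def acc_at_05(pred_boxes, gt_boxes):
    """Fraction of predictions whose box IoU with ground truth exceeds 0.5 (Acc@0.5)."""
    return float(np.mean([box_iou(p, g) > 0.5 for p, g in zip(pred_boxes, gt_boxes)]))

def miou(pred_masks, gt_masks):
    """Mean per-sample IoU of binary point masks (mIoU)."""
    ious = []
    for p, g in zip(pred_masks, gt_masks):
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        ious.append(inter / union if union else 1.0)
    return float(np.mean(ious))
```

For example, two unit boxes sharing half their volume give IoU 1/3 (intersection 4, union 12 for 2×2×2 boxes offset by one unit), which would not count toward Acc@0.5.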

Z. Qian and Y. Ma contributed equally.



Acknowledgements

This work was supported by National Science and Technology Major Project (No. 2022ZD0118201), the National Science Fund for Distinguished Young Scholars (No. 62025603), the National Natural Science Foundation of China (No. U21B2037, No. U22B2051, No. 62072389), the National Natural Science Fund for Young Scholars of China (No. 62302411), China Postdoctoral Science Foundation (No. 2023M732948), the Natural Science Foundation of Fujian Province of China (No. 2021J06003, No. 2022J06001), and partially sponsored by CCF-NetEase ThunderFire Innovation Research Funding (NO. CCF-Netease 202301).

Author information

Corresponding author: Xiaoshuai Sun.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 656 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Qian, Z. et al. (2025). Multi-branch Collaborative Learning Network for 3D Visual Grounding. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15104. Springer, Cham. https://doi.org/10.1007/978-3-031-72952-2_22


  • DOI: https://doi.org/10.1007/978-3-031-72952-2_22

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72951-5

  • Online ISBN: 978-3-031-72952-2

  • eBook Packages: Computer Science, Computer Science (R0)
