Abstract
3D referring expression comprehension (3DREC) and segmentation (3DRES) have overlapping objectives, indicating their potential for collaboration. However, existing collaborative approaches predominantly depend on the results of one task to make predictions for the other, limiting effective collaboration. We argue that employing separate branches for the 3DREC and 3DRES tasks enhances the model's capacity to learn task-specific information, enabling the branches to acquire complementary knowledge. Thus, we propose the MCLN framework, which includes independent branches for the 3DREC and 3DRES tasks. This enables dedicated exploration of each task and effective coordination between the branches. Furthermore, to facilitate mutual reinforcement between these branches, we introduce a Relative Superpoint Aggregation (RSA) module and an Adaptive Soft Alignment (ASA) module. These modules contribute to the precise alignment of the two branches' predictions, directing the model to allocate increased attention to key positions. Comprehensive experimental evaluation demonstrates that our proposed method achieves state-of-the-art performance on both the 3DREC and 3DRES tasks, with an increase of \(\mathbf{2.05\%}\) in Acc@0.5 for 3DREC and \(\mathbf{3.96\%}\) in mIoU for 3DRES. Our code is available at https://github.com/qzp2018/MCLN.
Z. Qian and Y. Ma contributed equally to this work.
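To make the multi-branch idea concrete, below is a minimal PyTorch sketch: a 3DREC head and a 3DRES head run in parallel over shared, text-conditioned query features, with a simplified stand-in for the RSA module that pools point features into superpoint features for mask prediction. All interfaces, shapes, and the score-weighted pooling are our assumptions for illustration, not the paper's implementation; the ASA module, which softly aligns the two branches' predictions, is omitted here.

```python
# Hypothetical sketch of a two-branch 3DREC/3DRES model; module names and
# shapes are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn


class RelativeSuperpointAggregation(nn.Module):
    """Simplified stand-in for RSA: score-weighted pooling of point features
    into one feature per superpoint."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, point_feats: torch.Tensor, superpoint_ids: torch.Tensor):
        # point_feats: (N, C); superpoint_ids: (N,) superpoint index per point
        w = torch.sigmoid(self.score(point_feats))            # per-point weight, (N, 1)
        num_sp = int(superpoint_ids.max()) + 1
        sp_feats = point_feats.new_zeros(num_sp, point_feats.size(1))
        sp_norm = point_feats.new_zeros(num_sp, 1)
        sp_feats.index_add_(0, superpoint_ids, point_feats * w)
        sp_norm.index_add_(0, superpoint_ids, w)
        return sp_feats / sp_norm.clamp(min=1e-6)             # (num_sp, C)


class MCLNSketch(nn.Module):
    """Independent 3DREC (box) and 3DRES (mask) branches over shared queries."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.rec_head = nn.Linear(dim, 7)    # box per query: center (3) + size (3) + score (1)
        self.res_head = nn.Linear(dim, dim)  # mask embedding per query
        self.rsa = RelativeSuperpointAggregation(dim)

    def forward(self, query_feats, point_feats, superpoint_ids):
        boxes = self.rec_head(query_feats)                    # (Q, 7)
        sp_feats = self.rsa(point_feats, superpoint_ids)      # (S, C)
        masks = self.res_head(query_feats) @ sp_feats.t()     # (Q, S) superpoint mask logits
        return boxes, masks


# Toy usage with random features in place of a real point-cloud/text backbone.
model = MCLNSketch(dim=256)
queries = torch.randn(32, 256)              # 32 text-conditioned queries
points = torch.randn(5000, 256)             # 5000 point features
sp_ids = torch.randint(0, 200, (5000,))     # superpoint id of each point
boxes, masks = model(queries, points, sp_ids)
print(boxes.shape, masks.shape)             # (32, 7) and (32, num_superpoints)
```

The key design point the abstract argues for is visible here: neither head consumes the other's output. The box branch and the mask branch each read the shared queries directly, so each can specialize, while alignment between their predictions is enforced separately (by ASA in the paper).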
Acknowledgements
This work was supported by National Science and Technology Major Project (No. 2022ZD0118201), the National Science Fund for Distinguished Young Scholars (No. 62025603), the National Natural Science Foundation of China (No. U21B2037, No. U22B2051, No. 62072389), the National Natural Science Fund for Young Scholars of China (No. 62302411), China Postdoctoral Science Foundation (No. 2023M732948), the Natural Science Foundation of Fujian Province of China (No. 2021J06003, No. 2022J06001), and partially sponsored by CCF-NetEase ThunderFire Innovation Research Funding (No. CCF-Netease 202301).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Qian, Z. et al. (2025). Multi-branch Collaborative Learning Network for 3D Visual Grounding. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15104. Springer, Cham. https://doi.org/10.1007/978-3-031-72952-2_22
DOI: https://doi.org/10.1007/978-3-031-72952-2_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72951-5
Online ISBN: 978-3-031-72952-2
eBook Packages: Computer Science, Computer Science (R0)