Abstract
This paper tackles the challenging task of 3D visual grounding: locating a specific object in a 3D point cloud scene based on a text description. Existing methods fall into two categories: top-down and bottom-up. Top-down methods rely on a pre-trained 3D detector to generate candidate bounding boxes and then select the best one, which makes inference time-consuming. Bottom-up methods directly regress object bounding boxes from coarse-grained features, which yields less accurate results. To combine their strengths while addressing their limitations, we propose a joint top-down and bottom-up framework that improves both accuracy and efficiency. Specifically, in the first stage, we propose a bottom-up proposal generation module that uses lightweight neural layers to efficiently regress and cluster a small set of coarse object proposals, instead of relying on a complex 3D detector. In the second stage, we introduce a top-down proposal consolidation module that uses a graph-based design to aggregate and propagate query-related object context among the generated proposals for further refinement. By jointly training the two modules, we avoid the inherent drawbacks of both frameworks: the costly proposal generation of top-down methods and the coarse proposals of bottom-up methods. Experimental results on the ScanRefer benchmark show that our framework achieves state-of-the-art performance.
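To make the two-stage design concrete, below is a minimal PyTorch-style sketch of the pipeline. It is an illustration under stated assumptions rather than the authors' implementation: the module names (`BottomUpProposalGenerator`, `TopDownProposalConsolidator`), the feature dimensions, the random proposal sampling (standing in for the learned regression-and-clustering step), and the single attention layer (standing in for the graph-based consolidation) are all hypothetical.

```python
# Minimal sketch of the joint two-stage framework described in the
# abstract. All names, dimensions, and the sampling/clustering heuristics
# are illustrative assumptions, not the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BottomUpProposalGenerator(nn.Module):
    """Stage 1: lightweight layers vote for object centers, and points are
    grouped into a few coarse proposals (no pre-trained 3D detector)."""

    def __init__(self, feat_dim: int = 128, num_proposals: int = 32):
        super().__init__()
        self.num_proposals = num_proposals
        self.offset_head = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, 3),  # per-point offset to an object center
        )

    def forward(self, xyz: torch.Tensor, feats: torch.Tensor):
        # xyz: (B, N, 3) point coordinates; feats: (B, N, C) point features
        centers = xyz + self.offset_head(feats)              # voted centers
        # Random sampling stands in for farthest-point sampling / learned
        # clustering; each point joins its nearest sampled proposal.
        idx = torch.randperm(centers.shape[1])[: self.num_proposals]
        prop_xyz = centers[:, idx]                           # (B, K, 3)
        assign = torch.cdist(centers, prop_xyz).argmin(-1)   # (B, N)
        onehot = F.one_hot(assign, self.num_proposals).float()  # (B, N, K)
        counts = onehot.sum(dim=1).clamp(min=1.0)            # (B, K)
        prop_feats = onehot.transpose(1, 2) @ feats / counts.unsqueeze(-1)
        return prop_xyz, prop_feats                          # coarse proposals


class TopDownProposalConsolidator(nn.Module):
    """Stage 2: query-related context is aggregated and propagated among
    proposals; one attention layer stands in for the graph module."""

    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4,
                                          batch_first=True)
        self.score_head = nn.Linear(feat_dim, 1)

    def forward(self, prop_feats: torch.Tensor, query_feat: torch.Tensor):
        # Fuse the sentence-level query feature into every proposal node
        # (additive fusion is an assumption), then exchange context.
        fused = prop_feats + query_feat.unsqueeze(1)         # (B, K, C)
        refined, _ = self.attn(fused, fused, fused)
        return self.score_head(refined).squeeze(-1)          # (B, K) scores
```

Under these assumptions, feeding (B, N, 3) coordinates and (B, N, C) point features through the first module yields K coarse proposals, and the second module scores each proposal against a (B, C) sentence embedding; the highest-scoring proposal would be the grounding output, and supervising that score trains both modules jointly end-to-end.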
References
Achlioptas, P., Abdelreheem, A., Xia, F., Elhoseiny, M., Guibas, L.: ReferIt3D: neural listeners for fine-grained 3D object identification in real-world scenes. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 422–440. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_25
Chen, D.Z., Chang, A.X., Nießner, M.: ScanRefer: 3D object localization in RGB-D scans using natural language. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12365, pp. 202–221. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58565-5_13
Chen, S., Fang, J., Zhang, Q., Liu, W., Wang, X.: Hierarchical aggregation for 3D instance segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 15467–15476 (2021)
Cheng, B., Sheng, L., Shi, S., Yang, M., Xu, D.: Back-tracing representative points for voting-based 3D object detection in point clouds. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8963–8972 (2021)
Deng, J., Yang, Z., Chen, T., Zhou, W., Li, H.: TransVG: end-to-end visual grounding with transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1769–1779 (2021)
Feng, M., et al.: Free-form description guided 3D visual graph network for object grounding in point cloud. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3722–3731 (2021)
He, D., et al.: TransRefer3D: entity-and-relation aware transformer for fine-grained 3D visual grounding. In: Proceedings of the 29th ACM International Conference on Multimedia, pp. 2344–2352 (2021)
Huang, P.H., Lee, H.H., Chen, H.T., Liu, T.L.: Text-guided graph neural networks for referring 3D instance segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 1610–1618 (2021)
Jiang, L., Zhao, H., Shi, S., Liu, S., Fu, C.W., Jia, J.: PointGroup: dual-set point grouping for 3D instance segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4867–4876 (2020)
Li, M., Sigal, L.: Referring transformer: a one-step approach to multi-task visual grounding. In: Advances in Neural Information Processing Systems, vol. 34, pp. 19652–19664 (2021)
Li, Q., Han, Z., Wu, X.M.: Deeper insights into graph convolutional networks for semi-supervised learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018)
Liu, D., Liu, Y., Huang, W., Hu, W.: A survey on text-guided 3D visual grounding: elements, recent advances, and future directions. arXiv preprint arXiv:2406.05785 (2024)
Liu, H., Lin, A., Han, X., Yang, L., Yu, Y., Cui, S.: Refer-it-in-RGBD: a bottom-up approach for 3D visual grounding in RGBD images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6032–6041 (2021)
Liu, Y., Liu, D., Guo, Z., Hu, W.: Cross-task knowledge transfer for semi-supervised joint 3D grounding and captioning. In: Proceedings of the 32nd ACM International Conference on Multimedia. ACM (2024)
Liu, Z., Zhang, Z., Cao, Y., Hu, H., Tong, X.: Group-free 3D object detection via transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2949–2958 (2021)
Loshchilov, I., Hutter, F.: Fixing weight decay regularization in Adam (2018)
Luo, J., et al.: 3D-SPS: single-stage 3D visual grounding via referred point progressive selection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16454–16463 (2022)
Misra, I., Girdhar, R., Joulin, A.: An end-to-end transformer model for 3D object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2906–2917 (2021)
Qi, C.R., Litany, O., He, K., Guibas, L.J.: Deep Hough voting for 3D object detection in point clouds. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9277–9286 (2019)
Qi, C.R., Su, H., Mo, K., Guibas, L.J.: PointNet: deep learning on point sets for 3D classification and segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652–660 (2017)
Qi, C.R., Yi, L., Su, H., Guibas, L.J.: PointNet++: deep hierarchical feature learning on point sets in a metric space. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)
Radosavovic, I., Kosaraju, R.P., Girshick, R., He, K., Dollár, P.: Designing network design spaces. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10428–10436 (2020)
Sadhu, A., Chen, K., Nevatia, R.: Zero-shot grounding of objects from natural language queries. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4694–4703 (2019)
Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
Vu, T., Kim, K., Luu, T.M., Nguyen, T., Yoo, C.D.: SoftGroup for 3D instance segmentation on point clouds. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2708–2717 (2022)
Wang, L., Li, Y., Huang, J., Lazebnik, S.: Learning two-branch neural networks for image-text matching tasks. IEEE Trans. Pattern Anal. Mach. Intell. 41(2), 394–407 (2018)
Wang, P., Wu, Q., Cao, J., Shen, C., Gao, L., Hengel, A.v.d.: Neighbourhood watch: referring expression comprehension via language-guided graph attention networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1960–1968 (2019)
Wang, Y., Sun, Y., Liu, Z., Sarma, S.E., Bronstein, M.M., Solomon, J.M.: Dynamic graph CNN for learning on point clouds. ACM Trans. Graph. 38(5), 1–12 (2019)
Yang, A., Miech, A., Sivic, J., Laptev, I., Schmid, C.: TubeDETR: spatio-temporal video grounding with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16442–16453 (2022)
Yang, S., Li, G., Yu, Y.: Cross-modal relationship inference for grounding referring expressions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4145–4154 (2019)
Yang, Z., Gong, B., Wang, L., Huang, W., Yu, D., Luo, J.: A fast and accurate one-stage approach to visual grounding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4683–4693 (2019)
Yang, Z., Zhang, S., Wang, L., Luo, J.: SAT: 2D semantics assisted training for 3D visual grounding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1856–1866 (2021)
Yuan, Z., et al.: InstanceRefer: cooperative holistic understanding for visual grounding on point clouds through instance multi-level contextual referring. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1791–1800 (2021)
Zhao, L., Cai, D., Sheng, L., Xu, D.: 3DVG-Transformer: relation modeling for visual grounding on point clouds. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2928–2937 (2021)
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Liu, Y., Liu, D., Hu, W. (2025). Joint Top-Down and Bottom-Up Frameworks for 3D Visual Grounding. In: Antonacopoulos, A., Chaudhuri, S., Chellappa, R., Liu, CL., Bhattacharya, S., Pal, U. (eds) Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15330. Springer, Cham. https://doi.org/10.1007/978-3-031-78113-1_17
Print ISBN: 978-3-031-78112-4
Online ISBN: 978-3-031-78113-1