Abstract
In this work, we tackle the limitations of current LiDAR-based 3D object detection systems, which are hindered by a restricted class vocabulary and the high cost of annotating new object classes. We explore open-vocabulary (OV) learning in urban environments, aiming to capture novel instances using pre-trained vision-language models (VLMs) with multi-sensor data. We design and benchmark four potential solutions as baselines, categorizing them as either top-down or bottom-up approaches based on their input data strategies. While effective, these methods exhibit certain limitations, such as missing novel objects during 3D box estimation or applying rigid priors, which biases detection towards objects near the camera or with rectangular geometries. To overcome these limitations, we introduce a universal Find n’ Propagate approach for 3D OV tasks, which maximizes the recall of novel objects and propagates this detection capability to more distant areas, thereby progressively capturing more of them. In particular, we utilize a greedy box seeker to search for novel 3D boxes of varying orientation and depth within each generated frustum, and ensure the reliability of newly identified boxes via cross alignment and a density ranker. Additionally, the inherent bias towards camera-proximal objects is alleviated by the proposed remote simulator, which randomly diversifies pseudo-labeled novel instances during self-training, combined with the fusion of base samples in a memory bank. Extensive experiments demonstrate a 53% improvement in novel recall across diverse OV settings, VLMs, and 3D detectors. Notably, we achieve up to a 3.97-fold increase in Average Precision (AP) for novel object classes. The source code is available at github.com/djamahl99/findnpropagate.
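To make the greedy box seeker concrete, the sketch below illustrates the core idea under simplifying assumptions: a 2D VLM detection is back-projected into its camera frustum, candidate 3D boxes are enumerated over a grid of depths and yaw angles using a class-size prior, and the candidate enclosing the densest set of LiDAR points is kept. This is a minimal illustration rather than the authors' implementation; the function names (greedy_box_seeker, density_score), the size prior, and the depth/yaw grids are all hypothetical, and the paper's cross-alignment check against the image is omitted for brevity.

```python
import numpy as np

def density_score(points, box):
    """Score a candidate box by the density of LiDAR points it encloses."""
    centre, dims, yaw = box[:3], box[3:6], box[6]
    p = points - centre
    # Rotate points into the box frame (yaw about the camera's vertical y-axis).
    c, s = np.cos(-yaw), np.sin(-yaw)
    xb = c * p[:, 0] + s * p[:, 2]
    zb = -s * p[:, 0] + c * p[:, 2]
    l, w, h = dims
    inside = (np.abs(xb) <= l / 2) & (np.abs(zb) <= w / 2) & (np.abs(p[:, 1]) <= h / 2)
    return inside.sum() / (l * w * h)  # points per cubic metre inside the box

def greedy_box_seeker(points, box2d, intrinsics,
                      depths=np.arange(2.0, 60.0, 1.0),
                      yaws=np.linspace(0.0, np.pi, 12, endpoint=False),
                      prior_size=(4.6, 1.9, 1.7)):
    """Enumerate candidate 3D boxes inside the frustum of a 2D VLM detection
    and keep the one enclosing the densest set of LiDAR points.

    points:     (N, 3) LiDAR points already transformed into camera coordinates.
    box2d:      (u1, v1, u2, v2) bounding box from the vision-language model.
    intrinsics: (fx, fy, cx, cy) pinhole camera parameters.
    prior_size: hypothetical class-level (length, width, height) prior in metres.
    """
    u1, v1, u2, v2 = box2d
    uc, vc = (u1 + u2) / 2.0, (v1 + v2) / 2.0
    fx, fy, cx, cy = intrinsics

    best_box, best_score = None, -np.inf
    for d in depths:
        # Back-project the 2D box centre to a 3D candidate centre at depth d.
        centre = np.array([(uc - cx) * d / fx, (vc - cy) * d / fy, d])
        for yaw in yaws:
            box = np.concatenate([centre, prior_size, [yaw]])
            score = density_score(points, box)
            if score > best_score:
                best_box, best_score = box, score
    return best_box, best_score
```

In this reading, the density ranking plays the role of a reliability filter: candidates placed at the wrong depth or orientation enclose few points relative to their volume and are discarded in favour of tighter fits.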
Acknowledgements
This research is partially supported by the Australian Research Council (DE240100105, DP240101814, DP230101196); JST Moonshot R&D Grant Number JPMJPS2011; CREST Grant Number JPMJCR2015; and the Basic Research Grant (Super AI) of the Institute for AI and Beyond at the University of Tokyo.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Etchegaray, D., Huang, Z., Harada, T., Luo, Y. (2025). Find n’ Propagate: Open-Vocabulary 3D Object Detection in Urban Environments. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15098. Springer, Cham. https://doi.org/10.1007/978-3-031-73661-2_8
DOI: https://doi.org/10.1007/978-3-031-73661-2_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73660-5
Online ISBN: 978-3-031-73661-2