Find n’ Propagate: Open-Vocabulary 3D Object Detection in Urban Environments

  • Conference paper
  • In: Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

In this work, we tackle the limitations of current LiDAR-based 3D object detection systems, which are hindered by a restricted class vocabulary and the high cost of annotating new object classes. Our exploration of open-vocabulary (OV) learning in urban environments aims to capture novel instances by combining pre-trained vision-language models (VLMs) with multi-sensor data. We design and benchmark four potential solutions as baselines, categorizing them as either top-down or bottom-up approaches based on their input data strategies. While effective, these methods exhibit limitations, such as missing novel objects during 3D box estimation or imposing rigid priors, which bias detection towards objects near the camera or with rectangular geometries. To overcome these limitations, we introduce a universal Find n’ Propagate approach for 3D OV tasks that maximizes the recall of novel objects and propagates this detection capability to more distant areas, thereby progressively capturing more novel instances. In particular, we utilize a greedy box seeker that searches over candidate 3D boxes of varying orientations and depths within each generated frustum, and we ensure the reliability of newly identified boxes with a cross-alignment check and a density ranker. Additionally, the inherent bias towards camera-proximal objects is alleviated by the proposed remote simulator, which randomly diversifies pseudo-labeled novel instances during self-training, combined with the fusion of base samples from a memory bank. Extensive experiments demonstrate a 53% improvement in novel-class recall across diverse OV settings, VLMs, and 3D detectors, and up to a 3.97-fold increase in Average Precision (AP) for novel object classes. The source code is available at github.com/djamahl99/findnpropagate.
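To make the box-seeking step concrete, below is a minimal sketch (Python with NumPy) of the frustum search described above: candidate 3D boxes are enumerated over a grid of depths and yaw angles inside the frustum of a single 2D VLM detection, and each candidate is scored by the product of its projected 2D IoU with the VLM box (cross alignment) and the number of LiDAR points it encloses (density). All names and grids here (greedy_box_seek, the depth/yaw steps, the product scoring rule) are illustrative assumptions rather than the paper's exact formulation.

import numpy as np

def corners_3d(center, lwh, yaw):
    # 8 corners of a box with heading `yaw` about the camera's vertical (y)
    # axis; local axes are x = length, y = height, z = width.
    l, w, h = lwh
    signs = np.array([[ 1,  1,  1], [ 1,  1, -1], [-1,  1, -1], [-1,  1,  1],
                      [ 1, -1,  1], [ 1, -1, -1], [-1, -1, -1], [-1, -1,  1]], float)
    offsets = signs * np.array([l, h, w]) / 2.0
    c, s = np.cos(yaw), np.sin(yaw)
    rot = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    return offsets @ rot.T + center

def points_inside(points, center, lwh, yaw):
    # Density cue: count the points (N x 3, camera frame) the box encloses.
    l, w, h = lwh
    c, s = np.cos(yaw), np.sin(yaw)
    rot = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    local = (points - center) @ rot  # rotation inverse = transpose
    return int(np.sum(np.all(np.abs(local) <= np.array([l, h, w]) / 2.0, axis=1)))

def iou_2d(a, b):
    # Cross-alignment cue: IoU of two [x1, y1, x2, y2] pixel boxes.
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def greedy_box_seek(box_2d, prior_lwh, points, K,
                    depths=np.arange(2.0, 60.0, 1.0),
                    yaws=np.linspace(0.0, np.pi, 12, endpoint=False)):
    # Enumerate (depth, yaw) candidates along the frustum of one 2D VLM
    # detection; keep the best under an alignment-times-density score.
    u = (box_2d[0] + box_2d[2]) / 2.0
    v = (box_2d[1] + box_2d[3]) / 2.0
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])  # ray through the 2D centre
    best, best_score = None, -1.0
    for d in depths:
        center = ray * d  # candidate 3D centre at depth d (camera frame)
        for yaw in yaws:  # [0, pi) suffices: boxes are half-turn symmetric
            corners = corners_3d(center, prior_lwh, yaw)
            uv = corners @ K.T
            uv = uv[:, :2] / np.clip(uv[:, 2:3], 1e-6, None)
            proj = [uv[:, 0].min(), uv[:, 1].min(), uv[:, 0].max(), uv[:, 1].max()]
            score = iou_2d(proj, box_2d) * points_inside(points, center, prior_lwh, yaw)
            if score > best_score:
                best, best_score = (center.copy(), tuple(prior_lwh), yaw), score
    return best, best_score

Here prior_lwh stands in for a class-level size prior; the raw point count is a crude density cue that the paper's density ranker presumably refines.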
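The remote simulator admits a similarly short sketch: a pseudo-labeled novel box and its interior points are copied radially outwards in the bird's-eye view, and the copy's points are thinned to mimic the sparser returns of distant objects. The function name, shift range, and inverse-square thinning heuristic below are assumptions for illustration; the fusion of base samples from the memory bank is omitted.

import numpy as np

def simulate_remote(points_in_box, box, min_shift=10.0, max_shift=40.0, rng=None):
    # Copy a pseudo-labelled box (x, y, z, l, w, h, yaw) and its interior
    # points (N x 3, LiDAR frame) to a random, farther range along the box's
    # bird's-eye-view direction from the sensor.
    if rng is None:
        rng = np.random.default_rng()
    center = box[:3]
    bev = np.array([center[0], center[1], 0.0])
    direction = bev / (np.linalg.norm(bev) + 1e-9)
    shift = rng.uniform(min_shift, max_shift) * direction
    # Distant objects return fewer points: thin the copy with an
    # inverse-square falloff, a crude stand-in for the true sensor model.
    keep = (np.linalg.norm(bev) / (np.linalg.norm(bev + shift) + 1e-9)) ** 2
    mask = rng.random(len(points_in_box)) < min(1.0, keep)
    new_box = box.copy()
    new_box[:3] = center + shift
    return points_in_box[mask] + shift, new_box

During self-training, such remote copies would be pasted back into training frames alongside base-class samples from the memory bank, exposing the detector to novel objects across the full depth range rather than only near the camera.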


Acknowledgements

This research is partially supported by the Australian Research Council (DE240100105, DP240101814, DP230101196); JST Moonshot R&D Grant Number JPMJPS2011, CREST Grant Number JPMJCR2015, and the Basic Research Grant (Super AI) of the Institute for AI and Beyond at the University of Tokyo.

Author information


Corresponding author

Correspondence to Yadan Luo.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 8402 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Etchegaray, D., Huang, Z., Harada, T., Luo, Y. (2025). Find n’ Propagate: Open-Vocabulary 3D Object Detection in Urban Environments. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15098. Springer, Cham. https://doi.org/10.1007/978-3-031-73661-2_8

  • DOI: https://doi.org/10.1007/978-3-031-73661-2_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73660-5

  • Online ISBN: 978-3-031-73661-2

  • eBook Packages: Computer Science, Computer Science (R0)
