Abstract
In this work, we tackle the limitations of current LiDAR-based 3D object detection systems, which are hindered by a restricted class vocabulary and the high cost of annotating new object classes. We explore open-vocabulary (OV) learning in urban environments, aiming to capture novel instances using pre-trained vision-language models (VLMs) with multi-sensor data. We design and benchmark four potential solutions as baselines, categorizing them as either top-down or bottom-up approaches based on their input data strategies. While effective, these methods exhibit certain limitations, such as missing novel objects during 3D box estimation or applying rigid priors, which biases detection towards objects near the camera or with rectangular geometries. To overcome these limitations, we introduce a universal Find n’ Propagate approach for 3D OV tasks, which maximizes the recall of novel objects and propagates this detection capability to more distant areas, thereby progressively capturing more of them. In particular, we utilize a greedy box seeker to search for novel 3D boxes of varying orientation and depth within each generated frustum, and ensure the reliability of newly identified boxes via cross alignment and a density ranker. Additionally, the inherent bias towards camera-proximal objects is alleviated by the proposed remote simulator, which randomly diversifies pseudo-labeled novel instances during self-training, combined with the fusion of base samples in a memory bank. Extensive experiments demonstrate a 53% improvement in novel recall across diverse OV settings, VLMs, and 3D detectors. Notably, we achieve up to a 3.97-fold increase in Average Precision (AP) for novel object classes. The source code is available at github.com/djamahl99/findnpropagate.
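To make the greedy box seeker concrete, the sketch below illustrates the core idea under simplifying assumptions: a 2D VLM detection is back-projected into its camera frustum, candidate 3D boxes are enumerated over a grid of depths and yaw angles using a class-size prior, and the candidate enclosing the densest set of LiDAR points is kept. This is a minimal illustration rather than the authors' implementation; the function names (greedy_box_seeker, density_score), the size prior, and the depth/yaw grids are all hypothetical, and the paper's cross-alignment check against the image is omitted for brevity.

```python
import numpy as np

def density_score(points, box):
    """Score a candidate box by the density of LiDAR points it encloses."""
    centre, dims, yaw = box[:3], box[3:6], box[6]
    p = points - centre
    # Rotate points into the box frame (yaw about the camera's vertical y-axis).
    c, s = np.cos(-yaw), np.sin(-yaw)
    xb = c * p[:, 0] + s * p[:, 2]
    zb = -s * p[:, 0] + c * p[:, 2]
    l, w, h = dims
    inside = (np.abs(xb) <= l / 2) & (np.abs(zb) <= w / 2) & (np.abs(p[:, 1]) <= h / 2)
    return inside.sum() / (l * w * h)  # points per cubic metre inside the box

def greedy_box_seeker(points, box2d, intrinsics,
                      depths=np.arange(2.0, 60.0, 1.0),
                      yaws=np.linspace(0.0, np.pi, 12, endpoint=False),
                      prior_size=(4.6, 1.9, 1.7)):
    """Enumerate candidate 3D boxes inside the frustum of a 2D VLM detection
    and keep the one enclosing the densest set of LiDAR points.

    points:     (N, 3) LiDAR points already transformed into camera coordinates.
    box2d:      (u1, v1, u2, v2) bounding box from the vision-language model.
    intrinsics: (fx, fy, cx, cy) pinhole camera parameters.
    prior_size: hypothetical class-level (length, width, height) prior in metres.
    """
    u1, v1, u2, v2 = box2d
    uc, vc = (u1 + u2) / 2.0, (v1 + v2) / 2.0
    fx, fy, cx, cy = intrinsics

    best_box, best_score = None, -np.inf
    for d in depths:
        # Back-project the 2D box centre to a 3D candidate centre at depth d.
        centre = np.array([(uc - cx) * d / fx, (vc - cy) * d / fy, d])
        for yaw in yaws:
            box = np.concatenate([centre, prior_size, [yaw]])
            score = density_score(points, box)
            if score > best_score:
                best_box, best_score = box, score
    return best_box, best_score
```

In this reading, the density ranking plays the role of a reliability filter: candidates placed at the wrong depth or orientation enclose few points relative to their volume and are discarded in favour of tighter fits.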
Acknowledgements
This research is partially supported by the Australian Research Council (DE240100105, DP240101814, DP230101196); JST Moonshot R&D Grant Number JPMJPS2011; CREST Grant Number JPMJCR2015; and the Basic Research Grant (Super AI) of the Institute for AI and Beyond at the University of Tokyo.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Etchegaray, D., Huang, Z., Harada, T., Luo, Y. (2025). Find n’ Propagate: Open-Vocabulary 3D Object Detection in Urban Environments. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15098. Springer, Cham. https://doi.org/10.1007/978-3-031-73661-2_8
DOI: https://doi.org/10.1007/978-3-031-73661-2_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73660-5
Online ISBN: 978-3-031-73661-2