
OpenSight: A Simple Open-Vocabulary Framework for LiDAR-Based Object Detection

  • Conference paper
Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Traditional LiDAR-based object detection research primarily focuses on closed-set scenarios, which fall short of the demands of complex real-world applications. Directly transferring existing 2D open-vocabulary models with some known LiDAR classes to obtain open-vocabulary ability, however, tends to suffer from over-fitting: the resulting model detects only the known objects, even when presented with a novel category. In this paper, we propose OpenSight, a more advanced 2D-3D modeling framework for LiDAR-based open-vocabulary detection. OpenSight utilizes 2D-3D geometric priors for the initial discernment and localization of generic objects, followed by a more specific semantic interpretation of the detected objects. The process begins by generating 2D boxes for generic objects from the camera images that accompany the LiDAR sweeps. These 2D boxes, together with the LiDAR points, are then lifted back into LiDAR space to estimate the corresponding 3D boxes. For better generic object perception, our framework integrates both temporal- and spatial-aware constraints. Temporal awareness correlates the predicted 3D boxes across consecutive timestamps, recalibrating missed or inaccurate boxes. Spatial awareness randomly places some “precisely” estimated 3D boxes at varying distances, increasing the visibility of generic objects. To interpret the specific semantics of the detected objects, we develop a cross-modal alignment and fusion module that first aligns 3D features with 2D image embeddings and then fuses the aligned 3D-2D features for semantic decoding. Our experiments indicate that our method establishes state-of-the-art open-vocabulary performance on widely used 3D detection benchmarks and effectively identifies objects from new categories of interest.
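To make the lifting step concrete, below is a minimal sketch, not the authors' implementation, of one common way to estimate a 3D box from a 2D box and LiDAR points: project the points into the image, keep those whose projection falls inside the 2D box (a frustum), and fit a box to the surviving points. The function name, the calibration-matrix conventions, and the axis-aligned fitting are all illustrative assumptions.

```python
import numpy as np

def lift_box_to_3d(points_lidar, box_2d, lidar_to_cam, cam_intrinsic):
    """Illustrative frustum lifting (hypothetical helper): select LiDAR
    points that project inside a 2D detection box and fit a crude 3D box.

    points_lidar : (N, 3) array of x, y, z in the LiDAR frame
    box_2d       : (x1, y1, x2, y2) in pixels
    lidar_to_cam : (4, 4) homogeneous LiDAR-to-camera extrinsics (assumed)
    cam_intrinsic: (3, 3) camera intrinsics (assumed)
    """
    # Transform LiDAR points into the camera frame.
    pts_h = np.hstack([points_lidar, np.ones((len(points_lidar), 1))])
    pts_cam = (lidar_to_cam @ pts_h.T).T[:, :3]

    # Discard points behind the camera, then project to pixel coordinates.
    in_front = pts_cam[:, 2] > 0.1
    pts_cam, pts_lidar_kept = pts_cam[in_front], points_lidar[in_front]
    uv = (cam_intrinsic @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]  # perspective divide

    # Keep points whose projection lies inside the 2D box (the frustum).
    x1, y1, x2, y2 = box_2d
    in_box = ((uv[:, 0] >= x1) & (uv[:, 0] <= x2) &
              (uv[:, 1] >= y1) & (uv[:, 1] <= y2))
    frustum = pts_lidar_kept[in_box]
    if len(frustum) == 0:
        return None  # no LiDAR evidence for this 2D detection

    # Crude axis-aligned 3D box from the extremes of the frustum points.
    lo, hi = frustum.min(axis=0), frustum.max(axis=0)
    return np.concatenate([(lo + hi) / 2, hi - lo])  # (cx, cy, cz, dx, dy, dz)
```

A real pipeline would additionally filter ground and background points from the frustum and estimate the box orientation; OpenSight further refines such lifted boxes with the temporal- and spatial-aware constraints described above.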



Acknowledgement

This research was funded in part by the ARC Discovery grant (DP220100800 to XY) and the ARC DECRA grant (DE230100477 to XY). We thank all anonymous reviewers and ACs for their constructive suggestions.

Author information


Correspondence to Xin Yu.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 4205 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Zhang, H. et al. (2025). OpenSight: A Simple Open-Vocabulary Framework for LiDAR-Based Object Detection. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15142. Springer, Cham. https://doi.org/10.1007/978-3-031-72907-2_1


  • DOI: https://doi.org/10.1007/978-3-031-72907-2_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72906-5

  • Online ISBN: 978-3-031-72907-2

  • eBook Packages: Computer Science, Computer Science (R0)
