
RoScenes: A Large-Scale Multi-view 3D Dataset for Roadside Perception

  • Conference paper

Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15099)

Abstract

We introduce RoScenes, the largest multi-view roadside perception dataset, which aims to shed light on the development of vision-centric Bird’s Eye View (BEV) approaches for more challenging traffic scenes. The highlights of RoScenes include a significantly large perception area, full scene coverage, and crowded traffic. More specifically, our dataset achieves a remarkable 21.13M 3D annotations within 64,000 \({\text {m}}^2\). To relieve the expensive costs of roadside 3D labeling, we present a novel BEV-to-3D joint annotation pipeline to efficiently collect such a large volume of data. After that, we organize a comprehensive study of current BEV methods on RoScenes in terms of effectiveness and efficiency. Tested methods suffer from the vast perception area and the variation of sensor layout across scenes, resulting in performance levels falling below expectations. To this end, we propose RoBEV, which incorporates feature-guided position embedding for effective 2D-3D feature assignment. With its help, our method outperforms the state-of-the-art by a large margin without extra computational overhead on the validation set. Our dataset and devkit are available at https://roscenes.github.io.
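The abstract attributes RoBEV’s gains to a feature-guided position embedding for 2D-3D feature assignment, but the implementation lives in the paper itself, not on this page. As a rough illustration only, the PyTorch sketch below shows one common form such an embedding can take: a geometric encoding of per-pixel 3D frustum points, gated by the 2D image features (in the style popularized by PETRv2). All class names, dimensions, and the gating design here are assumptions for illustration, not the authors’ code.

```python
import torch
import torch.nn as nn

class FeatureGuidedPE(nn.Module):
    """Illustrative sketch (NOT the RoBEV implementation): a 3D position
    embedding computed from per-pixel frustum coordinates, modulated by a
    gate predicted from the 2D image features themselves."""

    def __init__(self, feat_dim: int = 256, depth_bins: int = 64):
        super().__init__()
        # Encodes D sampled 3D points (x, y, z) per pixel into an embedding.
        self.coord_mlp = nn.Sequential(
            nn.Linear(depth_bins * 3, feat_dim),
            nn.ReLU(inplace=True),
            nn.Linear(feat_dim, feat_dim),
        )
        # Predicts a per-channel gate from the image features, making the
        # position embedding "feature-guided" rather than purely geometric.
        self.feat_gate = nn.Sequential(
            nn.Linear(feat_dim, feat_dim),
            nn.Sigmoid(),
        )

    def forward(self, img_feat: torch.Tensor, frustum_pts: torch.Tensor) -> torch.Tensor:
        # img_feat:     (B, N_cams, H*W, C)   flattened multi-view features
        # frustum_pts:  (B, N_cams, H*W, D*3) per-pixel 3D points (world frame)
        pos_emb = self.coord_mlp(frustum_pts)  # geometric embedding
        gate = self.feat_gate(img_feat)        # feature-driven modulation
        return img_feat + gate * pos_emb       # position-aware image tokens

# Toy usage with random tensors:
pe = FeatureGuidedPE()
feats = torch.randn(2, 6, 32 * 88, 256)
pts = torch.randn(2, 6, 32 * 88, 64 * 3)
out = pe(feats, pts)  # -> shape (2, 6, 2816, 256)
```

A gate of this kind lets the network attenuate position encodings where geometry is unreliable, which is one plausible way to cope with the varying sensor layouts the abstract describes; the actual RoBEV design should be taken from the paper.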

X. Zhu and H. Sheng—Equal contribution. Work done while Xiaosu Zhu was an intern at Alibaba Cloud.

S. Cai—Project lead.

L. Gao—Independent Researcher.



Notes

  1. Scene samples, trajectory visualizations, and further analysis appear in the supplementary materials.

  2. https://enterprise.dji.com/dji-terra.

  3. https://www.qxwz.com/.

  4. The check covers \(30\%\) of the annotations from RoScenes and about 7k samples from 31 unpublished scenes.


Acknowledgements

We sincerely thank Sichuan Expressway Construction & Development Group Co., Ltd., Western Sichuan Expressway Co., Ltd., and Sichuan Intelligent Expressway Technology Co., Ltd. for their invaluable assistance with data acquisition.

Author information


Correspondence to Jingkuan Song or Jieping Ye.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 96920 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Zhu, X. et al. (2025). RoScenes: A Large-Scale Multi-view 3D Dataset for Roadside Perception. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15099. Springer, Cham. https://doi.org/10.1007/978-3-031-72940-9_19
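For convenience, here is an equivalent BibTeX entry assembled from the bibliographic details above. The author list is abbreviated with "and others" to match the "Zhu, X. et al." form given on this page (only the first author's full name is stated here); the entry key is arbitrary.

```bibtex
@inproceedings{zhu2025roscenes,
  author    = {Zhu, Xiaosu and others},
  title     = {RoScenes: A Large-Scale Multi-view 3D Dataset for Roadside Perception},
  booktitle = {Computer Vision -- ECCV 2024},
  editor    = {Leonardis, A. and Ricci, E. and Roth, S. and Russakovsky, O. and Sattler, T. and Varol, G.},
  series    = {Lecture Notes in Computer Science},
  volume    = {15099},
  publisher = {Springer, Cham},
  year      = {2025},
  doi       = {10.1007/978-3-031-72940-9_19}
}
```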


  • DOI: https://doi.org/10.1007/978-3-031-72940-9_19

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72939-3

  • Online ISBN: 978-3-031-72940-9

  • eBook Packages: Computer Science (R0)
