Abstract
We introduce RoScenes, the largest multi-view roadside perception dataset, which aims to shed light on the development of vision-centric Bird’s Eye View (BEV) approaches for more challenging traffic scenes. The highlights of RoScenes include a significantly large perception area, full scene coverage, and crowded traffic. More specifically, our dataset achieves a surprising 21.13M 3D annotations within 64,000 \({\text {m}}^2\). To relieve the expensive cost of roadside 3D labeling, we present a novel BEV-to-3D joint annotation pipeline to efficiently collect such a large volume of data. After that, we organize a comprehensive study of current BEV methods on RoScenes in terms of effectiveness and efficiency. Tested methods suffer from the vast perception area and the variation of sensor layout across scenes, resulting in performance levels falling below expectations. To this end, we propose RoBEV, which incorporates feature-guided position embedding for effective 2D-3D feature assignment. With its help, our method outperforms the state-of-the-art by a large margin without extra computational overhead on the validation set. Our dataset and devkit are at https://roscenes.github.io.
X. Zhu and H. Sheng—Equal contribution. Work done when Xiaosu Zhu interns at Alibaba Cloud.
S. Cai—Project lead.
L. Gao—Independent Researcher.
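The abstract does not spell out how RoBEV's feature-guided position embedding works. As a purely hypothetical sketch of the general idea, the PyTorch snippet below modulates a geometry-derived position embedding (per-token 3D rays, as in PETR-style methods) with a gate predicted from the 2D image features themselves, so that 2D-3D feature assignment can adapt to image content rather than camera calibration alone. The class name `FeatureGuidedPE`, the MLP shapes, and the gating design are our own assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class FeatureGuidedPE(nn.Module):
    """Hypothetical sketch of a feature-guided position embedding.

    Instead of computing the position embedding from camera rays alone,
    each image token's embedding is gated by the token's own 2D feature.
    This is an illustrative guess at the concept, not the RoBEV code.
    """

    def __init__(self, feat_dim: int = 256, pos_dim: int = 3):
        super().__init__()
        # Encode per-token 3D ray coordinates into the feature space.
        self.pos_mlp = nn.Sequential(
            nn.Linear(pos_dim, feat_dim),
            nn.ReLU(inplace=True),
            nn.Linear(feat_dim, feat_dim),
        )
        # Predict a content-dependent gate from the 2D feature itself.
        self.feat_gate = nn.Sequential(
            nn.Linear(feat_dim, feat_dim),
            nn.Sigmoid(),
        )

    def forward(self, feats: torch.Tensor, rays: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, C) flattened multi-view image tokens
        # rays:  (B, N, 3) per-token ray directions from camera calibration
        pe = self.pos_mlp(rays)        # geometry-based embedding
        gate = self.feat_gate(feats)   # feature-guided modulation
        return feats + gate * pe       # content-adaptive position embedding


if __name__ == "__main__":
    B, N, C = 2, 1024, 256
    module = FeatureGuidedPE(feat_dim=C)
    out = module(torch.randn(B, N, C), torch.randn(B, N, 3))
    print(out.shape)  # torch.Size([2, 1024, 256])
```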
Notes
- 1. Scene samples, trajectory visualizations, and further analysis appear in the supplementary materials.
- 4. The check covers \(30\%\) of the annotations from RoScenes and about 7k samples from 31 unpublished scenes.
Acknowledgements
We sincerely thank Sichuan Expressway Construction & Development Group Co., Ltd., Western Sichuan Expressway Co., Ltd., and Sichuan Intelligent Expressway Technology Co., Ltd. for their invaluable assistance with data acquisition.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Zhu, X. et al. (2025). RoScenes: A Large-Scale Multi-view 3D Dataset for Roadside Perception. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15099. Springer, Cham. https://doi.org/10.1007/978-3-031-72940-9_19
DOI: https://doi.org/10.1007/978-3-031-72940-9_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72939-3
Online ISBN: 978-3-031-72940-9
eBook Packages: Computer Science, Computer Science (R0)