Progress in 3D object detection hinges not only on advanced models but also on effective and efficient annotation strategies. Yet the labor-intensive nature of 3D object annotation remains a bottleneck that hinders further development in the field. This paper introduces a new approach to 3D object annotation, multi-modal interactive 3D object detection, built on two strategies: “prompt in 2D, detect in 3D” and “detect in 3D, refine in 3D”. First, by letting users interact through simpler 2D prompts (e.g., clicks or boxes on a camera image or a bird’s-eye view), we bridge the complexity gap between 2D and 3D spaces and reimagine the annotation workflow. Second, our framework supports flexible iterative refinement of the initial 3D annotations, further helping annotators reach satisfactory results. Evaluation on the nuScenes dataset demonstrates the effectiveness of our method. Thanks to its prompt-driven and interactive design, our approach also performs strongly in open-set scenarios. This work not only offers a potential solution to the 3D object annotation problem but also paves the way for further innovation in the 3D object detection community.
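To make the geometric gap between a 2D prompt and a 3D location concrete, the sketch below shows one plausible way a click could be related to 3D space. It is an illustration only, not the method of this paper, which the abstract does not specify. It assumes an ideal pinhole camera without lens distortion, and all names (`PinholeIntrinsics`, `click_to_ray`, `point_on_ray`, `bev_click_to_ground`) are hypothetical. A camera click fixes only a viewing ray, so depth stays unknown, while a bird's-eye-view click fixes a ground position but not height or box extent.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PinholeIntrinsics:
    """Hypothetical pinhole intrinsics, in pixels (assumes no lens distortion)."""
    fx: float
    fy: float
    cx: float
    cy: float


def click_to_ray(u: float, v: float, cam: PinholeIntrinsics) -> tuple:
    """Back-project a pixel click (u, v) to a unit-length ray in the camera frame."""
    x = (u - cam.cx) / cam.fx
    y = (v - cam.cy) / cam.fy
    norm = math.sqrt(x * x + y * y + 1.0)
    return (x / norm, y / norm, 1.0 / norm)


def point_on_ray(ray: tuple, depth: float) -> tuple:
    """Return the point on the ray whose z-coordinate equals `depth` (metres)."""
    scale = depth / ray[2]
    return (ray[0] * scale, ray[1] * scale, depth)


def bev_click_to_ground(col: int, row: int, x_min: float, y_min: float,
                        resolution: float) -> tuple:
    """Map a click on a BEV grid cell to the metric (x, y) of the cell centre."""
    return (x_min + (col + 0.5) * resolution,
            y_min + (row + 0.5) * resolution)


if __name__ == "__main__":
    cam = PinholeIntrinsics(fx=1000.0, fy=1000.0, cx=800.0, cy=450.0)
    ray = click_to_ray(900.0, 450.0, cam)
    # The same click is consistent with every depth along the ray.
    for depth in (10.0, 20.0, 40.0):
        print(depth, point_on_ray(ray, depth))
    print(bev_click_to_ground(0, 0, x_min=-50.0, y_min=-50.0, resolution=0.5))
```

The remaining ambiguity (depth for image clicks, height and size for bird's-eye-view clicks) is the part a detector would have to resolve from sensor data.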
R. Zhang and X. Lin—Equal contribution.
This work was supported in part by the National Natural Science Foundation of China (No. 62322608), in part by the Fundamental Research Funds for the Central Universities under Grant 22lgqb25, and in part by the Open Project Program of the Key Laboratory of Artificial Intelligence for Perception and Understanding, Liaoning Province (AIPU, No. 20230003).
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Zhang, R. et al. (2025). Interactive 3D Object Detection with Prompts. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15075. Springer, Cham. https://doi.org/10.1007/978-3-031-72643-9_9
DOI: https://doi.org/10.1007/978-3-031-72643-9_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72642-2
Online ISBN: 978-3-031-72643-9
eBook Packages: Computer Science (R0)