Abstract
Slow inference is one of the most critical obstacles to deploying multi-view 3D detectors in tasks with strict real-time requirements, such as autonomous driving. Although many sparse query-based methods have attempted to improve the efficiency of 3D detectors, they neglect the backbone, especially when Vision Transformers (ViT) are adopted for better performance. To tackle this problem, we explore efficient ViT backbones for multi-view 3D detection via token compression and propose a simple yet effective method called TokenCompression3D (ToC3D). ToC3D leverages history object queries as high-quality foreground priors, models the 3D motion information they carry, and interacts them with image tokens through the attention mechanism; in this way, it can effectively estimate the information density of image tokens and segment out the salient foreground tokens. With the introduced dynamic router design, ToC3D allocates more computing resources to the important foreground tokens while suppressing information loss, leading to a more efficient ViT-based multi-view 3D detector. Extensive results on the large-scale nuScenes dataset show that our method nearly maintains the performance of the recent SOTA with up to 30% inference speedup, and the improvements are consistent after scaling up the ViT and the input resolution. The code will be made available at https://github.com/DYZhang09/ToC3D.
D. Zhang and D. Liang contributed equally.
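To make the mechanism described in the abstract concrete, below is a minimal PyTorch sketch of query-guided token compression with a dynamic router: history object queries score image tokens via cross-attention, and only the top-scoring foreground tokens take the heavy ViT path while the rest are passed through unchanged. This is not the authors' implementation; the module names, the keep ratio, the scoring head, and the tensor shapes are all illustrative assumptions.

```python
import torch
import torch.nn as nn


class DynamicRouterBlock(nn.Module):
    """Hypothetical sketch of query-guided token compression.

    History object queries (carrying 3D motion priors) cross-attend to the
    image tokens; a linear head turns the attended features into per-token
    importance scores. High-scoring (foreground) tokens go through a full
    ViT encoder block, the rest take an identity skip path, so compute
    concentrates on salient regions. Not ToC3D's actual design.
    """

    def __init__(self, dim: int, num_heads: int = 8, keep_ratio: float = 0.7):
        super().__init__()
        self.keep_ratio = keep_ratio
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.score_head = nn.Linear(dim, 1)
        # stand-in for one heavy ViT encoder block
        self.heavy_block = nn.TransformerEncoderLayer(
            dim, num_heads, dim_feedforward=4 * dim, batch_first=True
        )

    def forward(self, tokens: torch.Tensor, queries: torch.Tensor) -> torch.Tensor:
        # tokens:  (B, N, C) image tokens; queries: (B, Q, C) history object queries
        ctx, _ = self.cross_attn(tokens, queries, queries)  # token-query interaction
        scores = self.score_head(ctx).squeeze(-1)           # (B, N) importance scores

        k = max(1, int(tokens.size(1) * self.keep_ratio))
        keep_idx = scores.topk(k, dim=1).indices            # salient-token indices

        # route: heavy path for foreground tokens, identity for the background
        gather_idx = keep_idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        fg = torch.gather(tokens, 1, gather_idx)            # (B, k, C) foreground
        out = tokens.clone()                                # background kept as-is
        out.scatter_(1, gather_idx, self.heavy_block(fg))   # write back processed
        return out


# usage: 6 camera views flattened into one token batch (shapes are illustrative)
blk = DynamicRouterBlock(dim=256)
img_tokens = torch.randn(6, 1024, 256)   # (views, tokens, channels)
hist_queries = torch.randn(6, 900, 256)  # motion-aware queries from past frames
out = blk(img_tokens, hist_queries)      # only 70% of tokens take the heavy path
print(out.shape)                         # torch.Size([6, 1024, 256])
```

With keep_ratio = 0.7, only 70% of the tokens pass through the heavy encoder block, which is the intuition behind the reported inference speedup; the remaining tokens skip it entirely, trading a small amount of information loss for compute.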
References
Bolya, D., Fu, C.Y., Dai, X., Zhang, P., Feichtenhofer, C., Hoffman, J.: Token merging: your ViT but faster. In: ICLR (2023)
Caesar, H., et al.: nuScenes: a multimodal dataset for autonomous driving. In: CVPR, pp. 11621–11631 (2020)
Dosovitskiy, A., et al.: An image is worth \(16 \times 16\) words: transformers for image recognition at scale. In: ICLR (2021)
Fang, Y., Sun, Q., Wang, X., Huang, T., Wang, X., Cao, Y.: EVA-02: a visual representation for neon genesis. arXiv preprint arXiv:2303.11331 (2023)
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: CVPR, pp. 16000–16009 (2022)
Huang, J., Huang, G.: BEVDet4D: exploit temporal cues in multi-camera 3D object detection. arXiv preprint arXiv:2203.17054 (2022)
Huang, J., Huang, G.: BEVPoolv2: a cutting-edge implementation of BEVDet toward deployment. arXiv preprint arXiv:2211.17111 (2022)
Huang, J., Huang, G., Zhu, Z., Ye, Y., Du, D.: BEVDet: high-performance multi-camera 3D object detection in bird-eye-view. arXiv preprint arXiv:2112.11790 (2021)
Jiang, Y., et al.: PolarFormer: multi-camera 3D object detection with polar transformer. In: AAAI, vol. 37, pp. 1042–1050 (2023)
Kirillov, A., et al.: Segment anything. In: ICCV (2023)
Kong, Z., et al.: SPViT: enabling faster vision transformers via latency-aware soft token pruning. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13671, pp. 620–640. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-20083-0_37
Law, H., Deng, J.: CornerNet: detecting objects as paired keypoints. In: ECCV, pp. 734–750 (2018)
Li, Y., et al.: Fast-BEV: a fast and strong bird’s-eye view perception baseline. IEEE Trans. Pattern Anal. Mach. Intell. (2024)
Li, Y., Mao, H., Girshick, R., He, K.: Exploring plain vision transformer backbones for object detection. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13669, pp. 280–296. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-20077-9_17
Li, Y., et al.: BEVDepth: acquisition of reliable depth for multi-view 3D object detection. In: AAAI, vol. 37, pp. 1477–1485 (2023)
Li, Z., Lan, S., Alvarez, J.M., Wu, Z.: BEVNeXt: reviving dense BEV frameworks for 3D object detection. In: CVPR (2024)
Li, Z., et al.: BEVFormer: learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13669, pp. 1–18. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-20077-9_1
Li, Z., Yu, Z., Wang, W., Anandkumar, A., Lu, T., Alvarez, J.M.: FB-BEV: BEV representation from forward-backward view transformations. In: ICCV, pp. 6919–6928 (2023)
Liang, Y., Ge, C., Tong, Z., Song, Y., Wang, J., Xie, P.: Not all patches are what you need: expediting vision transformers via token reorganizations. In: ICLR (2022)
Lin, X., Lin, T., Pei, Z., Huang, L., Su, Z.: Sparse4D: multi-view 3D object detection with sparse spatial-temporal fusion. arXiv preprint arXiv:2211.10581 (2022)
Lin, X., Lin, T., Pei, Z., Huang, L., Su, Z.: Sparse4D v2: recurrent temporal fusion with sparse model. arXiv preprint arXiv:2305.14018 (2023)
Lin, X., Pei, Z., Lin, T., Huang, L., Su, Z.: Sparse4D v3: advancing end-to-end 3D detection and tracking. arXiv preprint arXiv:2311.11722 (2023)
Liu, H., Teng, Y., Lu, T., Wang, H., Wang, L.: SparseBEV: high-performance sparse 3D object detection from multi-camera videos. In: ICCV, pp. 18580–18590 (2023)
Liu, Y., Wang, T., Zhang, X., Sun, J.: PETR: position embedding transformation for multi-view 3D object detection. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13687, pp. 531–548. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19812-0_31
Liu, Y., et al.: PETRv2: a unified framework for 3D perception from multi-camera images. In: ICCV, pp. 3262–3272 (2023)
Long, S., Zhao, Z., Pi, J., Wang, S., Wang, J.: Beyond attentive tokens: incorporating token importance and diversity for efficient vision transformers. In: CVPR, pp. 10334–10343 (2023)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: ICLR (2019)
Ma, X., Ouyang, W., Simonelli, A., Ricci, E.: 3D object detection from images for autonomous driving: a survey. IEEE Trans. Pattern Anal. Mach. Intell. (2023)
Meng, L., et al.: AdaViT: adaptive vision transformers for efficient image recognition. In: CVPR, pp. 12309–12318 (2022)
Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: NeRF: representing scenes as neural radiance fields for view synthesis. Commun. ACM 65(1), 99–106 (2021)
Park, J., et al.: Time will tell: new outlooks and a baseline for temporal multi-view 3D object detection. In: ICLR (2023)
Philion, J., Fidler, S.: Lift, splat, shoot: encoding images from arbitrary camera rigs by implicitly unprojecting to 3D. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12359, pp. 194–210. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58568-6_12
Rao, Y., Zhao, W., Liu, B., Lu, J., Zhou, J., Hsieh, C.J.: DynamicViT: efficient vision transformers with dynamic token sparsification. In: NeurIPS, vol. 34, pp. 13937–13949 (2021)
Roh, B., Shin, J., Shin, W., Kim, S.: Sparse DETR: efficient end-to-end object detection with learnable sparsity. In: ICLR (2022)
Shu, C., Deng, J., Yu, F., Liu, Y.: 3DPPE: 3D point positional encoding for transformer-based multi-camera 3D object detection. In: ICCV, pp. 3580–3589 (2023)
Wang, S., Liu, Y., Wang, T., Li, Y., Zhang, X.: Exploring object-centric temporal modeling for efficient multi-view 3D object detection. In: ICCV (2023)
Wang, Y., Guizilini, V.C., Zhang, T., Wang, Y., Zhao, H., Solomon, J.: DETR3D: 3D object detection from multi-view images via 3D-to-2D queries. In: Conference on Robot Learning, pp. 180–191. PMLR (2022)
Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: SegFormer: simple and efficient design for semantic segmentation with transformers. In: NeurIPS, vol. 34, pp. 12077–12090 (2021)
Xiong, K., et al.: CAPE: camera view position embedding for multi-view 3D object detection. In: CVPR, pp. 21570–21579 (2023)
Xu, Y., et al.: Evo-ViT: slow-fast token evolution for dynamic vision transformer. In: AAAI, vol. 36, pp. 2964–2972 (2022)
Yang, C., et al.: BEVFormer v2: adapting modern image backbones to bird’s-eye-view recognition via perspective supervision. In: CVPR, pp. 17830–17839 (2023)
Yin, H., Vahdat, A., Alvarez, J.M., Mallya, A., Kautz, J., Molchanov, P.: A-ViT: adaptive tokens for efficient vision transformer. In: CVPR, pp. 10809–10818 (2022)
Zhang, D., et al.: SAM3D: zero-shot 3D object detection via segment anything model. Sci. China Inf. Sci. (2024)
Zhu, B., Jiang, Z., Zhou, X., Li, Z., Yu, G.: Class-balanced grouping and sampling for point cloud 3D object detection. arXiv preprint arXiv:1908.09492 (2019)
Acknowledgements
This work is supported by the National Natural Science Foundation of China (Grant Nos. 62225603 and 623B2038) and the Hubei Key R&D Program (Grant No. 2022BAA078).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Zhang, D. et al. (2025). Make Your ViT-Based Multi-view 3D Detectors Faster via Token Compression. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15105. Springer, Cham. https://doi.org/10.1007/978-3-031-72970-6_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72969-0
Online ISBN: 978-3-031-72970-6
eBook Packages: Computer Science (R0)