DOI: 10.1145/3603781.3603857
research-article

STFormer3D: Spatio-Temporal Transformer Based 3D Object Detection for Intelligent Driving

Published: 27 July 2023

ABSTRACT

This paper proposes a method for efficiently detecting 3D objects in point clouds. It uses Convolutional Neural Networks (CNNs) together with Transformer networks, combining the strengths of both in feature extraction and in modeling long-range contextual information. To improve detection performance under occlusion, we propose a temporal fusion module that fuses the features of the current frame with those of the previous frame. We also use BiFPN to aggregate features of different scales effectively.
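The abstract does not say how the BiFPN aggregation is configured. As a rough illustration only, the sketch below shows the "fast normalized fusion" rule from the EfficientDet work that introduced BiFPN: each input gets a learnable weight, the weights are clipped to be non-negative, and the weighted sum is divided by the total weight. The function name, the flat feature vectors, and the sample values are assumptions made for this example, not details of STFormer3D.

```python
from typing import List, Sequence


def fast_normalized_fusion(
    features: Sequence[Sequence[float]],
    weights: Sequence[float],
    eps: float = 1e-4,
) -> List[float]:
    """Blend equally sized feature vectors with non-negative weights.

    Hypothetical helper: a real BiFPN layer works on resized 2D feature
    maps and follows the fusion with a convolution, which is omitted here.
    """
    if not features or len(features) != len(weights):
        raise ValueError("need one weight per input feature vector")
    length = len(features[0])
    if any(len(f) != length for f in features):
        raise ValueError("all feature vectors must have the same length")

    # ReLU keeps every weight non-negative so the normalization is stable.
    clipped = [max(0.0, w) for w in weights]
    denom = eps + sum(clipped)

    # Weighted sum per element, divided by the total weight (plus eps).
    return [
        sum(w * f[i] for w, f in zip(clipped, features)) / denom
        for i in range(length)
    ]


coarse = [1.0, 2.0]  # made-up features from a coarser scale, already resized
fine = [3.0, 4.0]    # made-up features from a finer scale
print(fast_normalized_fusion([coarse, fine], weights=[1.0, 1.0]))
```

With equal weights the result is close to the element-wise mean of the inputs. In a trained network the weights would be learned, so one scale can dominate where it is more informative.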

In experiments on the nuScenes dataset, our algorithm improves on the baseline by 2.54% in NDS and 2.44% in mAP.
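For readers unfamiliar with the two metrics: the nuScenes detection score (NDS) combines mAP with five true-positive error metrics (translation, scale, orientation, velocity, and attribute error). Each error is clipped to 1 and converted to a score, and mAP carries half of the total weight. The sketch below follows that published definition. The function name and the input values are placeholders chosen for the example and are not results from this paper.

```python
from typing import Mapping

# The five true-positive error metrics used by the nuScenes detection task.
TP_METRICS = ("mATE", "mASE", "mAOE", "mAVE", "mAAE")


def nuscenes_detection_score(m_ap: float, tp_errors: Mapping[str, float]) -> float:
    """Compute NDS = (5 * mAP + sum(1 - min(1, error))) / 10."""
    missing = [name for name in TP_METRICS if name not in tp_errors]
    if missing:
        raise ValueError(f"missing TP metrics: {missing}")

    # Errors above 1 are clipped so each metric contributes a score in [0, 1].
    tp_scores = [1.0 - min(1.0, tp_errors[name]) for name in TP_METRICS]
    return (5.0 * m_ap + sum(tp_scores)) / 10.0


# Placeholder inputs for illustration only.
example_errors = {name: 0.5 for name in TP_METRICS}
print(nuscenes_detection_score(0.5, example_errors))
```

Because mAP has weight 5 out of 10, a gain in mAP alone moves NDS by half as much. The rest of any NDS gain has to come from lower true-positive errors.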

    • Published in

      CNIOT '23: Proceedings of the 2023 4th International Conference on Computing, Networks and Internet of Things
      May 2023
      1025 pages
      ISBN: 9798400700705
      DOI: 10.1145/3603781

      Copyright © 2023 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Qualifiers

      • research-article
      • Research
      • Refereed limited

      Acceptance Rates

      Overall Acceptance Rate: 39 of 82 submissions, 48%