STFormer3D: Spatio-Temporal Transformer Based 3D Object Detection for Intelligent Driving

Authors:
Wei Liu, College of Information Science and Engineering, Northeastern University, China (ORCID: 0000-0002-0623-7178)
Yue Zhang, College of Information Science and Engineering, Northeastern University, China (ORCID: 0009-0003-2422-5589)
Haoxiang Jie, Neusoft Reach Automotive Technology Ltd, China (ORCID: 0000-0001-6434-8119)
Jun Hu, Neusoft Reach Automotive Technology Ltd, China (ORCID: 0000-0002-7094-1901)

CNIOT '23: Proceedings of the 2023 4th International Conference on Computing, Networks and Internet of Things, May 2023, Pages 415–420
https://doi.org/10.1145/3603781.3603857

Published: 27 July 2023

ABSTRACT

This paper proposes a method for efficiently detecting 3D objects in point clouds. Our method combines Convolutional Neural Networks (CNNs) and Transformer networks, drawing on their respective strengths in feature extraction and in modeling long-range contextual information. To improve detection performance under occlusion, we propose a temporal fusion module that fuses the features of the current frame with those of the previous frame. We also use BiFPN to aggregate features across different scales.

In experiments on the nuScenes dataset, our algorithm improves on the baseline by 2.54% in NDS and 2.44% in mAP.
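The abstract does not say how the BiFPN stage or the temporal fusion module is implemented. As a rough illustration only, the sketch below shows the fast normalized fusion rule that BiFPN (introduced with EfficientDet) uses to combine feature maps: each input gets a learnable scalar weight, the weights are clamped to be non-negative, and the weighted sum is divided by the total weight. The function names, the nested-list feature maps, and the nearest-neighbour upsampling are hypothetical choices made for this example, and this is not the authors' code.

```python
from typing import List, Sequence

Grid = List[List[float]]  # a single-channel 2D feature map (illustrative stand-in)


def fast_normalized_fusion(
    features: Sequence[Grid], weights: Sequence[float], eps: float = 1e-4
) -> Grid:
    """Fuse same-shaped feature maps as sum(w_i * F_i) / (eps + sum(w_i))."""
    if not features or len(features) != len(weights):
        raise ValueError("need one weight per feature map")
    # ReLU on the weights keeps every contribution non-negative.
    w = [max(0.0, x) for x in weights]
    total = sum(w) + eps
    rows, cols = len(features[0]), len(features[0][0])
    fused = [[0.0] * cols for _ in range(rows)]
    for feat, wi in zip(features, w):
        if len(feat) != rows or any(len(row) != cols for row in feat):
            raise ValueError("all feature maps must share one shape")
        for r in range(rows):
            for c in range(cols):
                fused[r][c] += wi * feat[r][c] / total
    return fused


def upsample_nearest_2x(grid: Grid) -> Grid:
    """Double both spatial dimensions by repeating each value (assumed resizing step)."""
    out: Grid = []
    for row in grid:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


if __name__ == "__main__":
    fine = [[1.0, 2.0], [3.0, 4.0]]   # higher-resolution level
    coarse = [[10.0]]                 # lower-resolution level
    # Bring the coarse level to the fine resolution, then fuse the two levels.
    fused = fast_normalized_fusion([fine, upsample_nearest_2x(coarse)], [1.0, 0.5])
    print(fused)
```

In a trained network the weights would be learned parameters and the maps would be multi-channel tensors. The same weighted sum could in principle blend a current-frame map with a previous-frame map, but whether the paper's temporal fusion module works this way is an assumption the abstract does not confirm.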

Index Terms

Computing methodologies → Artificial intelligence → Computer vision → Computer vision problems → Object detection

Published in

CNIOT '23: Proceedings of the 2023 4th International Conference on Computing, Networks and Internet of Things
May 2023
1025 pages
ISBN: 9798400700705
DOI: 10.1145/3603781

Copyright © 2023 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
3D Object detection
LiDAR
Multi-frame fusion
Transformer
autonomous driving
point cloud
Qualifiers
- research-article
- Research
- Refereed limited
Acceptance Rates
Overall Acceptance Rate: 39 of 82 submissions, 48%