ABSTRACT
This paper proposes a method for efficient 3D object detection in point clouds. By combining Convolutional Neural Networks (CNNs) with Transformer networks, our approach exploits the complementary strengths of both: local feature extraction and long-range contextual modeling. To improve detection under occlusion, we introduce a temporal fusion module that fuses the features of the current frame with those of the previous frame, and we use BiFPN to effectively aggregate features across scales.
Finally, we conducted experiments on the nuScenes dataset: compared with the baseline, our method improves NDS by 2.54% and mAP by 2.44%.
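The two fusion steps described above can be sketched as follows. This is a minimal, hypothetical NumPy illustration, not the paper's implementation: the temporal module is assumed to concatenate current- and previous-frame feature maps along the channel axis and project back with a learned 1x1 mapping (here a plain matrix multiply), and the multi-scale step uses BiFPN's fast normalized fusion (non-negative per-input weights, normalized to sum to roughly one). Shapes, the projection matrix `proj`, and the function names are illustrative assumptions.

```python
import numpy as np

def temporal_fuse(curr, prev, w):
    """Hypothetical temporal fusion: concatenate the (C, H, W) feature
    maps of the current and previous frames along channels, then apply
    a 1x1 projection (a (C, 2C) matrix) back to C channels."""
    x = np.concatenate([curr, prev], axis=0)            # (2C, H, W)
    c2, h, wd = x.shape
    return (w @ x.reshape(c2, h * wd)).reshape(-1, h, wd)

def bifpn_fuse(feats, weights, eps=1e-4):
    """BiFPN-style fast normalized fusion: clamp each learnable weight
    to be non-negative, normalize by their sum, and take the weighted
    sum of the (resized-to-common-shape) input feature maps."""
    ws = np.maximum(np.asarray(weights, dtype=np.float64), 0.0)
    ws = ws / (ws.sum() + eps)
    return sum(w * f for w, f in zip(ws, feats))

# Toy usage with made-up shapes and weights.
C, H, W = 4, 2, 2
curr = np.ones((C, H, W))
prev = np.zeros((C, H, W))
proj = np.concatenate([np.eye(C), np.eye(C)], axis=1)   # (C, 2C): adds the two frames
fused_t = temporal_fuse(curr, prev, proj)               # (C, H, W)
fused_s = bifpn_fuse([curr, prev], [1.0, 1.0])          # (C, H, W)
```

In practice both the 1x1 projection and the fusion weights would be learned parameters inside the network; the sketch only shows the data flow of the two fusion operations.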
STFormer3D: Spatio-Temporal Transformer Based 3D Object Detection for Intelligent Driving