Abstract
In recent years, with the continuous development of autonomous driving, monocular 3D object detection has attracted increasing attention as a crucial research topic. However, detection accuracy is limited by monocular camera sensors, which struggle to capture reliable depth information. To address this challenge, we introduce a novel Aggregation Transformer Network (ATNet), featuring Cross-Attention based Positional Aggregation and Dual Expansion-Squeeze based Channel Aggregation. The proposed ATNet adaptively fuses radar and camera data at both the positional and the channel level. Specifically, the Cross-Attention based Positional Aggregation leverages camera-radar information to compute a non-linear attention coefficient that reinforces salient features and suppresses irrelevant ones, while the Dual Expansion-Squeeze based Channel Aggregation integrates radar and camera data adaptively at the channel level. Furthermore, to enhance feature-level fusion, we propose a multi-scale radar-camera fusion strategy that injects radar information into multiple stages of the camera subnet's backbone, improving the detection of objects at various scales. Extensive experiments on the widely used nuScenes dataset show that the proposed Aggregation Transformer, when integrated into strong monocular 3D object detection models, delivers promising results compared with existing methods.
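The abstract describes the two fusion modules only at a high level. As a rough illustration of the kind of design it sketches, the PyTorch snippet below implements (i) a cross-attention block in which camera features serve as queries and radar features as keys and values, and (ii) an expansion-squeeze channel gate over the concatenated modalities. This is a minimal sketch under our own assumptions, not the authors' released code: the class names `PositionalAggregation` and `ChannelAggregation`, the head count, and the expansion ratio are all hypothetical placeholders.

```python
# Illustrative sketch only; all names and hyperparameters are assumptions,
# not the ATNet reference implementation.
import torch
import torch.nn as nn


class PositionalAggregation(nn.Module):
    """Cross-attention fusion: camera features query radar features,
    so the learned attention coefficients reinforce positions where the
    two modalities agree and suppress irrelevant ones."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, cam: torch.Tensor, radar: torch.Tensor) -> torch.Tensor:
        b, c, h, w = cam.shape
        q = cam.flatten(2).transpose(1, 2)     # (B, H*W, C) camera queries
        kv = radar.flatten(2).transpose(1, 2)  # (B, H*W, C) radar keys/values
        fused, _ = self.attn(q, kv, kv)        # non-linear attention weighting
        fused = self.norm(fused + q)           # residual keeps camera evidence
        return fused.transpose(1, 2).reshape(b, c, h, w)


class ChannelAggregation(nn.Module):
    """Expansion-squeeze gating over concatenated camera/radar channels,
    producing adaptive per-channel mixing weights."""

    def __init__(self, channels: int, expansion: int = 2):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                           # squeeze space
            nn.Conv2d(2 * channels, expansion * channels, 1),  # expand
            nn.ReLU(inplace=True),
            nn.Conv2d(expansion * channels, channels, 1),      # squeeze back
            nn.Sigmoid(),                                      # channel weights
        )

    def forward(self, cam: torch.Tensor, radar: torch.Tensor) -> torch.Tensor:
        w = self.gate(torch.cat([cam, radar], dim=1))  # (B, C, 1, 1)
        return w * cam + (1.0 - w) * radar             # adaptive channel blend


if __name__ == "__main__":
    cam = torch.randn(2, 64, 32, 32)    # camera backbone features
    radar = torch.randn(2, 64, 32, 32)  # radar features projected to image plane
    out = ChannelAggregation(64)(PositionalAggregation(64)(cam, radar), radar)
    print(out.shape)  # torch.Size([2, 64, 32, 32])
```

Under the multi-scale fusion strategy the abstract describes, one would presumably instantiate such a block pair at each selected backbone stage of the camera subnet, with `channels` matching that stage's feature width.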
Data availability and access
The data that support the findings of this study are openly available in nuScenes at https://www.nuscenes.org/nuscenes.
Funding
This work was supported in part by the National Natural Science Foundation of China under Grant 62106158; in part by the Research and Development Program of Beijing Municipal Education Commission under Grant KM202210028007; and in part by the R&D Program of Beijing Municipal Education Commission under Grant KZ20231002822.
Author information
Contributions
Conceptualization: Jun Li, Zizhang Wu; Methodology: Jun Li; Formal analysis and investigation: Zizhang Wu; Writing - original draft preparation: Han Zhang; Writing - review and editing: Tianhao Xu.
Ethics declarations
Ethical and informed consent for data used
Not applicable. This study was conducted without directly involving human participants, and thus no informed consent was required.
Competing interests
The authors have no competing interests to declare that are relevant to the content of this article.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Li, J., Zhang, H., Wu, Z. et al. Radar-camera fusion for 3D object detection with aggregation transformer. Appl Intell 54, 10627–10639 (2024). https://doi.org/10.1007/s10489-024-05718-1