An improved dense-to-sparse cross-modal fusion network for 3D object detection in RGB-D images

Multimedia Tools and Applications

Abstract

3D object detection has received extensive attention from researchers. RGB-D sensors are often used to provide complementary information in 3D object detection tasks because they readily capture aligned point cloud and RGB image data, are reasonably priced, and perform reliably. However, how to effectively fuse the point cloud and RGB image data in RGB-D images, and how to use this cross-modal information to improve 3D object detection performance, remains an open challenge. To address these problems, this paper proposes an improved dense-to-sparse cross-modal fusion network for 3D object detection in RGB-D images. First, a dense-to-sparse cross-modal learning module (DCLM) is designed to reduce information waste in the interaction between dense 2D information and sparse 3D information. Then, an inter-modal attention fusion module (IAFM) is designed to adaptively retain more meaningful information when fusing the 2D and 3D features. In addition, an intra-modal attention context aggregation module (IACAM) is designed to aggregate context information within both the 2D and 3D modalities and to model the relationships between objects. Finally, detailed quantitative and qualitative experiments are carried out on the SUN RGB-D dataset, and the results show that the proposed model achieves state-of-the-art 3D object detection results.
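To make the dense-to-sparse idea concrete, the following Python sketch shows one generic way that dense 2D image features can be attached to sparse 3D points and then fused. It is not the authors' DCLM or IAFM, whose details are not given in the abstract. The function names, the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), the nearest-neighbour sampling, the scalar sigmoid gate, and the requirement that 2D and 3D features share the same length are all assumptions made for illustration.

```python
import math


def project_points(points, fx, fy, cx, cy):
    """Project camera-frame 3D points (x, y, z) to pixel coordinates (u, v).

    Assumes an ideal pinhole camera and z > 0 for every point.
    """
    return [(fx * x / z + cx, fy * y / z + cy) for (x, y, z) in points]


def sample_features(feature_map, pixels):
    """Pick one feature vector from a dense H x W map for each projected point.

    Uses nearest-neighbour lookup and clamps coordinates to the map borders.
    """
    height, width = len(feature_map), len(feature_map[0])
    sampled = []
    for u, v in pixels:
        col = min(max(math.floor(u + 0.5), 0), width - 1)
        row = min(max(math.floor(v + 0.5), 0), height - 1)
        sampled.append(feature_map[row][col])
    return sampled


def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gated_fusion(feats_2d, feats_3d, w_2d, w_3d, bias):
    """Fuse per-point 2D and 3D features with a scalar gate (hypothetical).

    gate  = sigmoid(w_2d . f2d + w_3d . f3d + bias)
    fused = gate * f2d + (1 - gate) * f3d
    """
    fused = []
    for f2, f3 in zip(feats_2d, feats_3d):
        score = bias
        score += sum(w * a for w, a in zip(w_2d, f2))
        score += sum(w * b for w, b in zip(w_3d, f3))
        gate = _sigmoid(score)
        fused.append([gate * a + (1.0 - gate) * b for a, b in zip(f2, f3)])
    return fused
```

In this sketch each sparse point receives exactly one image feature, so most pixels of the dense map are never used; this is one way to read the "information waste" that the abstract says DCLM is designed to reduce. In a trained network the gate weights would be learned rather than passed in by hand.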

Data availability statement

Publicly available datasets were analyzed in this study. The data can be found at https://rgbd.cs.princeton.edu/data/.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (61873086) and the Science and Technology Support Program of Changzhou (CE20215022).

Author information

Corresponding author

Correspondence to Jianjun Ni.

Ethics declarations

Conflicts of interest

The authors declare that they have no conflicts of interest regarding this work.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Chen, Y., Ni, J., Tang, G. et al. An improved dense-to-sparse cross-modal fusion network for 3D object detection in RGB-D images. Multimed Tools Appl 83, 12159–12184 (2024). https://doi.org/10.1007/s11042-023-15845-5

  • DOI: https://doi.org/10.1007/s11042-023-15845-5
