MEDMCN: a novel multi-modal EfficientDet with multi-scale CapsNet for object detection

The Journal of Supercomputing

Abstract

Object detection with multi-modal inputs in real-world scenarios is crucial for safety-critical systems such as autonomous driving, security monitoring, and traffic management. Despite significant progress in previous work, existing methods still suffer from insufficient fusion, feature loss, and poor performance on images with complex textures and occlusions. In this paper, we propose a novel framework for multi-modal object detection: multi-modal EfficientDet with multi-scale CapsNet (MEDMCN). In MEDMCN, the depth information of depth images and the texture details of RGB images are integrated by our residual iterative bi-directional feature pyramid network (ResIBi-FPN), overcoming insufficient fusion and feature loss. In addition, a novel multi-scale CapsNet-based component, EfficientDet-Caps, serves as the detection head of MEDMCN, allowing the model to capture whole-part correlations and the spatial relationships among entities and thereby improving performance in scenes with complex textures and occlusions. Extensive experiments on the MS COCO 2017 and MAVD datasets show that MEDMCN performs strongly under the average precision (AP) metric, improving on its baseline by +2.8 AP on MS COCO 2017 and +6.9 AP on MAVD.
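The article body is paywalled, so the sketches below are reconstructions of the two ideas the abstract names, not the authors' code. First, the cross-modal fusion: a minimal PyTorch sketch of a residual, iterated, bi-directional feature pyramid in the spirit of ResIBi-FPN. The fusion node reuses BiFPN's fast normalized fusion from EfficientDet; the module names, channel counts, and placement of the residual connections are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class WeightedFuse(nn.Module):
    """BiFPN-style fast normalized fusion of two same-shape feature maps."""

    def __init__(self, channels: int):
        super().__init__()
        self.w = nn.Parameter(torch.ones(2))  # learnable non-negative fusion weights
        self.conv = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, a, b):
        w = F.relu(self.w)
        w = w / (w.sum() + 1e-4)  # normalize so the weights sum to ~1
        return F.silu(self.bn(self.conv(w[0] * a + w[1] * b)))


class ResIterBiFPNSketch(nn.Module):
    """Hypothetical ResIBi-FPN sketch: fuse RGB/depth pyramids level-wise,
    then run `iters` top-down + bottom-up passes, each wrapped in a residual
    add so fine texture and depth cues survive repeated fusion."""

    def __init__(self, channels=64, levels=3, iters=2):
        super().__init__()
        self.iters = iters
        self.cross = nn.ModuleList(WeightedFuse(channels) for _ in range(levels))
        self.td = nn.ModuleList(WeightedFuse(channels) for _ in range(levels - 1))
        self.bu = nn.ModuleList(WeightedFuse(channels) for _ in range(levels - 1))

    def forward(self, rgb_feats, depth_feats):
        # cross-modal fusion, one node per pyramid level (finest -> coarsest)
        p = [f(r, d) for f, r, d in zip(self.cross, rgb_feats, depth_feats)]
        for _ in range(self.iters):
            skip = list(p)  # residual copies of every level
            # top-down pass: upsample coarse semantics into finer levels
            for i in range(len(p) - 2, -1, -1):
                up = F.interpolate(p[i + 1], size=p[i].shape[-2:], mode="nearest")
                p[i] = self.td[i](p[i], up)
            # bottom-up pass: pool fine detail back into coarser levels
            for i in range(1, len(p)):
                down = F.adaptive_max_pool2d(p[i - 1], p[i].shape[-2:])
                p[i] = self.bu[i - 1](p[i], down)
            # residual connection around the whole bi-directional pass
            p = [x + s for x, s in zip(p, skip)]
        return p


if __name__ == "__main__":
    # toy 3-level pyramids (1/8, 1/16, 1/32 resolution), 64 channels each
    rgb = [torch.randn(1, 64, s, s) for s in (32, 16, 8)]
    depth = [torch.randn(1, 64, s, s) for s in (32, 16, 8)]
    print([tuple(f.shape) for f in ResIterBiFPNSketch()(rgb, depth)])
```

Second, the capsule side: the whole-part and pose modeling that EfficientDet-Caps relies on comes from routing-by-agreement (Sabour et al., 2017). The snippet below implements only that published routing step; how MEDMCN wires it into a multi-scale detection head is not described in the abstract and is not reproduced here.

```python
import torch
import torch.nn.functional as F


def squash(s, dim=-1, eps=1e-8):
    """Capsule non-linearity: keeps orientation, squashes length into [0, 1)."""
    n2 = (s * s).sum(dim=dim, keepdim=True)
    return (n2 / (1.0 + n2)) * s / torch.sqrt(n2 + eps)


def dynamic_routing(u_hat, iters=3):
    """Routing-by-agreement over prediction vectors u_hat of shape
    (batch, n_in, n_out, dim_out); returns output capsules (batch, n_out, dim_out)."""
    b = torch.zeros(u_hat.shape[:3], device=u_hat.device)  # routing logits
    for _ in range(iters):
        c = F.softmax(b, dim=2)                               # couplings over output capsules
        v = squash((c.unsqueeze(-1) * u_hat).sum(dim=1))      # weighted vote, then squash
        b = b + (u_hat * v.unsqueeze(1)).sum(dim=-1)          # reward agreeing votes
    return v


# toy example: 32 lower-level capsules voting for 10 capsules of dimension 16
print(dynamic_routing(torch.randn(2, 32, 10, 16)).shape)  # torch.Size([2, 10, 16])
```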


Data availability

Datasets analyzed during the current study are publicly available. The links to the datasets are https://cocodataset.org/#detection-2017 and https://doi.org/10.5281/zenodo.3338727
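For reference, a minimal sketch of loading the MS COCO 2017 detection annotations with the standard pycocotools API; it assumes the annotation file has been downloaded from the link above, and the local path is hypothetical:

```python
from pycocotools.coco import COCO

# hypothetical local path to the downloaded annotation file
coco = COCO("annotations/instances_val2017.json")
img_ids = coco.getImgIds()
print(f"{len(img_ids)} images, {len(coco.getCatIds())} categories")

# bounding boxes of the first image, in [x, y, width, height] format
for ann in coco.loadAnns(coco.getAnnIds(imgIds=img_ids[0])):
    print(coco.loadCats(ann["category_id"])[0]["name"], ann["bbox"])
```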


Funding

This work was supported in part by the National Key Research and Development Program of China (No. 2021YFC2801001) and in part by the National Social Science Foundation of China (No. 20&ZD130).

Author information

Corresponding author

Correspondence to Jin Liu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflicts of interest regarding the publication of this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Li, X., Liu, J., Tang, Z. et al. MEDMCN: a novel multi-modal EfficientDet with multi-scale CapsNet for object detection. J Supercomput 80, 12863–12890 (2024). https://doi.org/10.1007/s11227-024-05932-1

