Abstract
Object detection with multi-modal inputs in real-world scenarios is crucial for safety-critical systems such as autonomous driving, security monitoring, and traffic management. Despite significant progress in previous work, existing methods still suffer from insufficient fusion, feature loss, and poor performance on images with complex textures and occlusions. In this paper, we propose a novel framework for multi-modal object detection, multi-modal EfficientDet with multi-scale CapsNet (MEDMCN). In MEDMCN, the depth information of the depth image and the texture details of the RGB image are integrated by our residual iterative bi-directional feature pyramid network (ResIBi-FPN), which addresses the issues of insufficient fusion and feature loss. In addition, a novel multi-scale CapsNet-based component, EfficientDet-Caps, serves as the detection head of MEDMCN, allowing the model to capture whole-part correlations and the spatial relationships of entities and thereby improving performance in real-world scenes with complex textures and occlusions. Extensive experiments on the MS COCO 2017 and MAVD datasets demonstrate that MEDMCN achieves strong results under the average precision (AP) metric, with improvements of +2.8 AP and +6.9 AP over its baseline on MS COCO 2017 and MAVD, respectively.
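The abstract describes the architecture only at a high level. As an illustrative sketch (not the authors' ResIBi-FPN code), the cross-modal fusion at a single pyramid node can be approximated by the fast normalized weighted fusion used in EfficientDet-style BiFPN layers; all names, channel sizes, and layer choices below (WeightedModalFusion, SiLU activation, depthwise-separable convolution) are assumptions for illustration.

import torch
import torch.nn as nn

class WeightedModalFusion(nn.Module):
    """Illustrative fusion of same-shape RGB and depth feature maps with
    learnable, non-negative per-modality weights (BiFPN-style fast fusion)."""

    def __init__(self, channels: int, eps: float = 1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(2))  # one scalar weight per modality
        self.eps = eps
        # Depthwise-separable convolution applied after fusion, as in BiFPN nodes.
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=channels),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
            nn.SiLU(),
        )

    def forward(self, rgb_feat: torch.Tensor, depth_feat: torch.Tensor) -> torch.Tensor:
        w = torch.relu(self.weights)      # keep fusion weights non-negative
        w = w / (w.sum() + self.eps)      # fast normalized fusion
        fused = w[0] * rgb_feat + w[1] * depth_feat
        return self.conv(fused)

if __name__ == "__main__":
    fuse = WeightedModalFusion(channels=64)
    rgb = torch.randn(1, 64, 32, 32)      # hypothetical RGB feature map
    depth = torch.randn(1, 64, 32, 32)    # hypothetical depth feature map
    print(fuse(rgb, depth).shape)         # torch.Size([1, 64, 32, 32])

The paper's residual iterative variant would additionally carry residual connections and repeat such fusion nodes across scales; this sketch omits those details.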
Data availability
The datasets analyzed during the current study are publicly available at https://cocodataset.org/#detection-2017 and https://doi.org/10.5281/zenodo.3338727.
Funding
This work was supported in part by the National Key Research and Development Program of China (No. 2021YFC2801001) and in part by the National Social Science Foundation of China (No. 20&ZD130).
Ethics declarations
Conflict of interest
The authors declare that they have no conflicts of interest regarding the publication of this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Li, X., Liu, J., Tang, Z. et al. MEDMCN: a novel multi-modal EfficientDet with multi-scale CapsNet for object detection. J Supercomput 80, 12863–12890 (2024). https://doi.org/10.1007/s11227-024-05932-1