Abstract
Object detection with multi-modal inputs in real-world scenarios is crucial for safety-critical systems such as autonomous driving, security monitoring, and traffic management. Despite significant progress in previous work, existing methods still suffer from insufficient fusion, feature loss, and poor performance on images with complex textures and occlusions. In this paper, we propose a novel framework for multi-modal object detection, multi-modal EfficientDet with multi-scale CapsNet (MEDMCN). In MEDMCN, the depth information of the depth image and the texture details of the RGB image are integrated by our residual iterative bi-directional feature pyramid network (ResIBi-FPN), which addresses the issues of insufficient fusion and feature loss. In addition, a novel multi-scale CapsNet-based component, EfficientDet-Caps, serves as the detection head of MEDMCN, allowing the model to capture whole-part correlations and the spatial relationships of entities and thereby improving performance in real-world scenes with complex textures and occlusions. Extensive experiments on the MS COCO 2017 and MAVD datasets demonstrate that MEDMCN achieves strong results under the average precision (AP) metric, with improvements of +2.8 AP and +6.9 AP over its baseline on MS COCO 2017 and MAVD, respectively.
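The abstract describes the architecture only at a high level. As an illustrative sketch (not the authors' ResIBi-FPN code), the cross-modal fusion at a single pyramid node can be approximated by the fast normalized weighted fusion used in EfficientDet-style BiFPN layers; all names, channel sizes, and layer choices below (WeightedModalFusion, SiLU activation, depthwise-separable convolution) are assumptions for illustration.

import torch
import torch.nn as nn

class WeightedModalFusion(nn.Module):
    """Illustrative fusion of same-shape RGB and depth feature maps with
    learnable, non-negative per-modality weights (BiFPN-style fast fusion)."""

    def __init__(self, channels: int, eps: float = 1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(2))  # one scalar weight per modality
        self.eps = eps
        # Depthwise-separable convolution applied after fusion, as in BiFPN nodes.
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=channels),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
            nn.SiLU(),
        )

    def forward(self, rgb_feat: torch.Tensor, depth_feat: torch.Tensor) -> torch.Tensor:
        w = torch.relu(self.weights)      # keep fusion weights non-negative
        w = w / (w.sum() + self.eps)      # fast normalized fusion
        fused = w[0] * rgb_feat + w[1] * depth_feat
        return self.conv(fused)

if __name__ == "__main__":
    fuse = WeightedModalFusion(channels=64)
    rgb = torch.randn(1, 64, 32, 32)      # hypothetical RGB feature map
    depth = torch.randn(1, 64, 32, 32)    # hypothetical depth feature map
    print(fuse(rgb, depth).shape)         # torch.Size([1, 64, 32, 32])

The paper's residual iterative variant would additionally carry residual connections and repeat such fusion nodes across scales; this sketch omits those details.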
Data availability
The datasets analyzed during the current study are publicly available at https://cocodataset.org/#detection-2017 and https://doi.org/10.5281/zenodo.3338727.
Funding
This work was supported in part by the National Key Research and Development Program of China (No. 2021YFC2801001) and in part by the National Social Science Foundation of China (No. 20&ZD130).
Ethics declarations
Conflict of interest
The authors declare that they have no conflicts of interest regarding the publication of this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Li, X., Liu, J., Tang, Z. et al. MEDMCN: a novel multi-modal EfficientDet with multi-scale CapsNet for object detection. J Supercomput 80, 12863–12890 (2024). https://doi.org/10.1007/s11227-024-05932-1