
Hierarchical multiples self-attention mechanism for multi-modal analysis

  • Regular Paper
  • Published in: Multimedia Systems

Abstract

With the abundance of multimedia in daily life, people perceive the world by concurrently processing and fusing high-dimensional data from multiple modalities, such as text, vision and audio. Modern machine learning promises much better fusion of these modalities, so multi-modal analysis has become an innovative field in data processing: combining different modalities makes the data more informative. The main difficulties of multi-modal analysis, however, lie in feature extraction and feature fusion. This paper focuses on these two points and proposes the BERT-HMAG model for feature extraction and the LMF-SA model for multi-modal fusion. In experiments, both models show a measurable improvement over traditional models such as LSTM and the Transformer.
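The abstract does not describe the internals of the LMF-SA fusion stage, so the following is only a minimal, hedged sketch of the general idea it names: low-rank multimodal fusion of text, audio and vision features followed by a self-attention layer. All module names, dimensions and wiring below (e.g. `LowRankFusionSA`, `d_out`, `rank`) are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch (assumed design, not the paper's code): low-rank multimodal
# fusion of per-time-step text/audio/vision features, followed by
# self-attention over the fused sequence.
import torch
import torch.nn as nn


class LowRankFusionSA(nn.Module):
    def __init__(self, d_text=768, d_audio=74, d_vision=35,
                 d_out=128, rank=4, n_heads=4):
        super().__init__()
        # One set of low-rank factors per modality: (rank, d_modality + 1, d_out).
        # The +1 appends a constant 1 so unimodal terms survive the product.
        self.w_text = nn.Parameter(torch.randn(rank, d_text + 1, d_out) * 0.02)
        self.w_audio = nn.Parameter(torch.randn(rank, d_audio + 1, d_out) * 0.02)
        self.w_vision = nn.Parameter(torch.randn(rank, d_vision + 1, d_out) * 0.02)
        self.fusion_bias = nn.Parameter(torch.zeros(d_out))
        # Self-attention over the sequence of fused representations.
        self.attn = nn.MultiheadAttention(d_out, n_heads, batch_first=True)

    @staticmethod
    def _append_one(x):
        ones = torch.ones(*x.shape[:-1], 1, device=x.device, dtype=x.dtype)
        return torch.cat([x, ones], dim=-1)

    def forward(self, text, audio, vision):
        # Inputs: (batch, seq_len, d_modality), assumed aligned per time step.
        t = torch.einsum('bsd,rdo->brso', self._append_one(text), self.w_text)
        a = torch.einsum('bsd,rdo->brso', self._append_one(audio), self.w_audio)
        v = torch.einsum('bsd,rdo->brso', self._append_one(vision), self.w_vision)
        # Element-wise product approximates full tensor fusion at low rank;
        # summing over the rank dimension yields the fused vector per step.
        fused = (t * a * v).sum(dim=1) + self.fusion_bias  # (batch, seq, d_out)
        out, _ = self.attn(fused, fused, fused)
        return out


if __name__ == "__main__":
    model = LowRankFusionSA()
    text = torch.randn(2, 20, 768)   # e.g. BERT token features
    audio = torch.randn(2, 20, 74)   # assumed acoustic feature width
    vision = torch.randn(2, 20, 35)  # assumed visual feature width
    print(model(text, audio, vision).shape)  # torch.Size([2, 20, 128])
```

The low-rank factors keep the parameter count linear in the number of modalities, while the attention layer lets fused time steps exchange information; the authors' actual BERT-HMAG and LMF-SA designs may differ substantially from this sketch.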


Data availability

The experimental data used in the present study are published on GitHub (https://github.com/QXYDCR/HM_BERT/tree/master).


Acknowledgements

This work is supported by the National Natural Science Foundation of China (Grant Nos. 61602161 and 61772180), the Hubei Province Science and Technology Support Project (Grant No. 2020BAB012), and the Research Fund of Hubei University of Technology (HBUT: 2021046).

Author information

Corresponding author

Correspondence to Zhu Tianliang.

Ethics declarations

Conflict of interest

All authors declare that they have no conflict of interest.

Ethics approval

The authors declare that this study raises no ethical concerns.

Additional information

Communicated by M. Katsurai.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Jun, W., Tianliang, Z., Jiahui, Z. et al. Hierarchical multiples self-attention mechanism for multi-modal analysis. Multimedia Systems 29, 3599–3608 (2023). https://doi.org/10.1007/s00530-023-01133-7


  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00530-023-01133-7
