
Hierarchical multiples self-attention mechanism for multi-modal analysis

  • Regular Paper
  • Published in: Multimedia Systems

Abstract

With the abundance of multimedia in daily life, people perceive the world by concurrently processing and fusing high-dimensional data from multiple modalities, such as text, vision and audio. Modern machine learning promises much better fusion of these modalities, so multi-modal analysis has become an innovative field in data processing: combining different modalities makes the data more informative. The main difficulties of multi-modal analysis, however, lie in feature extraction and feature fusion. This paper focuses on these two points and proposes the BERT-HMAG model for feature extraction and the LMF-SA model for multi-modal fusion. In experiments, both models show a measurable improvement over traditional models such as LSTM and the Transformer.
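The abstract does not describe the internals of the LMF-SA fusion stage, so the following is only a minimal, hedged sketch of the general idea it names: low-rank multimodal fusion of text, audio and vision features followed by a self-attention layer. All module names, dimensions and wiring below (e.g. `LowRankFusionSA`, `d_out`, `rank`) are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch (assumed design, not the paper's code): low-rank multimodal
# fusion of per-time-step text/audio/vision features, followed by
# self-attention over the fused sequence.
import torch
import torch.nn as nn


class LowRankFusionSA(nn.Module):
    def __init__(self, d_text=768, d_audio=74, d_vision=35,
                 d_out=128, rank=4, n_heads=4):
        super().__init__()
        # One set of low-rank factors per modality: (rank, d_modality + 1, d_out).
        # The +1 appends a constant 1 so unimodal terms survive the product.
        self.w_text = nn.Parameter(torch.randn(rank, d_text + 1, d_out) * 0.02)
        self.w_audio = nn.Parameter(torch.randn(rank, d_audio + 1, d_out) * 0.02)
        self.w_vision = nn.Parameter(torch.randn(rank, d_vision + 1, d_out) * 0.02)
        self.fusion_bias = nn.Parameter(torch.zeros(d_out))
        # Self-attention over the sequence of fused representations.
        self.attn = nn.MultiheadAttention(d_out, n_heads, batch_first=True)

    @staticmethod
    def _append_one(x):
        ones = torch.ones(*x.shape[:-1], 1, device=x.device, dtype=x.dtype)
        return torch.cat([x, ones], dim=-1)

    def forward(self, text, audio, vision):
        # Inputs: (batch, seq_len, d_modality), assumed aligned per time step.
        t = torch.einsum('bsd,rdo->brso', self._append_one(text), self.w_text)
        a = torch.einsum('bsd,rdo->brso', self._append_one(audio), self.w_audio)
        v = torch.einsum('bsd,rdo->brso', self._append_one(vision), self.w_vision)
        # Element-wise product approximates full tensor fusion at low rank;
        # summing over the rank dimension yields the fused vector per step.
        fused = (t * a * v).sum(dim=1) + self.fusion_bias  # (batch, seq, d_out)
        out, _ = self.attn(fused, fused, fused)
        return out


if __name__ == "__main__":
    model = LowRankFusionSA()
    text = torch.randn(2, 20, 768)   # e.g. BERT token features
    audio = torch.randn(2, 20, 74)   # assumed acoustic feature width
    vision = torch.randn(2, 20, 35)  # assumed visual feature width
    print(model(text, audio, vision).shape)  # torch.Size([2, 20, 128])
```

The low-rank factors keep the parameter count linear in the number of modalities, while the attention layer lets fused time steps exchange information; the authors' actual BERT-HMAG and LMF-SA designs may differ substantially from this sketch.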


Data availability

The experimental data used in the present study are published on GitHub (https://github.com/QXYDCR/HM_BERT/tree/master).


Acknowledgements

This work is supported by the National Natural Science Foundation of China (Grant Nos. 61602161 and 61772180), the Hubei Province Science and Technology Support Project (Grant No. 2020BAB012), and the Research Fund of Hubei University of Technology (HBUT: 2021046).

Author information

Corresponding author

Correspondence to Zhu Tianliang.

Ethics declarations

Conflict of interest

All authors declare that they have no conflict of interest.

Ethics approval

The authors declare that this study raises no ethical concerns.

Additional information

Communicated by M. Katsurai.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Jun, W., Tianliang, Z., Jiahui, Z. et al. Hierarchical multiples self-attention mechanism for multi-modal analysis. Multimedia Systems 29, 3599–3608 (2023). https://doi.org/10.1007/s00530-023-01133-7


  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00530-023-01133-7
