
SmartRAN: Smart Routing Attention Network for multimodal sentiment analysis

Abstract

Multimodal sentiment analysis, which aims to predict sentiment polarity from information in different modalities, has received widespread attention from the research community in recent years. However, most existing methods use a fixed model architecture in which data can flow only along an established path, so these models generalize poorly to different types of data. Furthermore, most methods explore only intra- or intermodal interactions and do not combine the two. In this paper, we propose the Smart Routing Attention Network (SmartRAN). SmartRAN can smartly select the data flow path on the basis of its smart routing attention module, avoiding the poor adaptability and generalizability caused by a fixed model architecture. In addition, SmartRAN learns both intra- and intermodal information, which enhances the semantic consistency of the combined representation and improves the model's ability to learn complex relationships. Extensive experiments on two benchmark datasets, CMU-MOSI and CMU-MOSEI, demonstrate that the proposed SmartRAN has superior performance to state-of-the-art models.
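The abstract describes the core idea only at a high level: a routing attention module that decides, per input, how data flows between intra- and intermodal attention paths. The sketch below is a minimal illustration of that general idea, not the authors' implementation; the class name, the choice of exactly two candidate paths, and the pooled soft router are assumptions made for the example.

# Minimal illustrative sketch (assumption, not the authors' code): a block that
# computes soft routing weights over two candidate paths -- intra-modal
# self-attention and inter-modal cross-attention -- and mixes their outputs
# per sample, so different inputs can follow different data flow paths.
import torch
import torch.nn as nn


class SoftRoutingAttentionBlock(nn.Module):
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)   # intra-modal path
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # inter-modal path
        self.router = nn.Linear(dim, 2)   # one logit per candidate path
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        # x:     (batch, seq_x, dim) features of the current modality
        # other: (batch, seq_o, dim) features of another modality
        intra, _ = self.self_attn(x, x, x)            # intra-modal interaction
        inter, _ = self.cross_attn(x, other, other)   # inter-modal interaction
        # Route on a pooled summary of the input so each sample gets its own path mix.
        weights = torch.softmax(self.router(x.mean(dim=1)), dim=-1)   # (batch, 2)
        w_intra = weights[:, 0].view(-1, 1, 1)
        w_inter = weights[:, 1].view(-1, 1, 1)
        return self.norm(x + w_intra * intra + w_inter * inter)


if __name__ == "__main__":
    text = torch.randn(8, 50, 128)    # e.g., text token features
    audio = torch.randn(8, 400, 128)  # e.g., audio frame features
    block = SoftRoutingAttentionBlock()
    print(block(text, audio).shape)   # torch.Size([8, 50, 128])

A soft (differentiable) mixture is used here so the path selection can be trained end to end; the paper's actual routing scheme may differ in the number of paths and in how the routing decision is made.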

Data availability and access

All the datasets used in this research are benchmark data that are publicly available online.

Acknowledgements

We thank the anonymous reviewers for their insightful comments. This study was partially supported by the Tianshan Talent Training Program in the Autonomous Region, China (grant number 2023TSYCLJ0023); the Natural Science Foundation of Xinjiang Uygur Autonomous Region (grant number 2023D01C176); the Xinjiang Uygur Autonomous Region Universities Fundamental Research Funds Scientific Research Project (grant number XJEDU2022P018); the Key Research and Development Projects in the Autonomous Region, China (grant numbers 2023A03001 and 2021B01002); and the Key Program of the National Natural Science Foundation of China (grant number U2003208).

Author information

Contributions

  • Xueyu Guo: Conceptualization, Methodology, Validation, Investigation, Writing - Original Draft, Writing - Review & Editing, Visualization.
  • Shengwei Tian: Validation, Writing - Review & Editing, Supervision, Funding acquisition.
  • Long Yu: Validation, Writing - Review & Editing, Supervision.
  • Xiaoyu He: Conceptualization, Validation, Writing - Review & Editing.

Corresponding author

Correspondence to Shengwei Tian.

Ethics declarations

Competing Interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Ethical and informed consent for data used

Both multimodal sentiment analysis datasets used in this study are publicly available, open-source benchmark datasets, and their use does not involve any ethical issues.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Guo, X., Tian, S., Yu, L. et al. SmartRAN: Smart Routing Attention Network for multimodal sentiment analysis. Appl Intell 54, 12742–12763 (2024). https://doi.org/10.1007/s10489-024-05839-7
