Abstract
Multimodal sentiment analysis, which aims to predict sentiment polarity from information across different modalities, has received widespread attention from the research community in recent years. However, most existing methods have a fixed model architecture in which data can flow only along an established path, leading to poor generalization across different types of data. Furthermore, most methods model only intra- or intermodal interactions and do not combine the two. In this paper, we propose the Smart Routing Attention Network (SmartRAN). SmartRAN adaptively selects the data flow path on the basis of its smart routing attention module, avoiding the poor adaptability and generalizability caused by a fixed model architecture. In addition, SmartRAN learns both intra- and intermodal information, which enhances the semantic consistency of the combined information and improves the model's ability to capture complex relationships. Extensive experiments on two benchmark datasets, CMU-MOSI and CMU-MOSEI, demonstrate that SmartRAN outperforms state-of-the-art models.
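The routing idea described in the abstract can be pictured concretely. The sketch below is a minimal, hypothetical illustration and not the authors' SmartRAN implementation: a small router network produces per-sample weights over two candidate attention paths, an intra-modal self-attention path and an inter-modal cross-attention path, and mixes their outputs. The module name SoftRoutingBlock, the two-path design, and all dimensions are assumptions made for the example.

```python
import torch
import torch.nn as nn

class SoftRoutingBlock(nn.Module):
    """Illustrative sketch only: soft-routes a query modality through two
    candidate attention paths (intra-modal vs. cross-modal attention).
    This is not the SmartRAN module from the paper."""

    def __init__(self, dim: int, n_heads: int = 4):
        super().__init__()
        self.intra_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Router: pooled query features -> soft weights over the two paths.
        self.router = nn.Sequential(nn.Linear(dim, 2), nn.Softmax(dim=-1))
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_x, dim) query modality; other: (batch, seq_o, dim) context modality.
        intra, _ = self.intra_attn(x, x, x)          # intra-modal interaction
        cross, _ = self.cross_attn(x, other, other)  # inter-modal interaction
        w = self.router(x.mean(dim=1))               # (batch, 2) per-sample path weights
        mixed = w[:, 0, None, None] * intra + w[:, 1, None, None] * cross
        return self.norm(x + mixed)                  # residual connection + layer norm

# Usage with random text/audio features of matching hidden size (hypothetical shapes).
text = torch.randn(8, 50, 128)
audio = torch.randn(8, 75, 128)
block = SoftRoutingBlock(dim=128)
out = block(text, audio)  # (8, 50, 128)
```

The soft mixture here merely conveys the general flavor of routing between intra- and intermodal interactions; SmartRAN's actual smart routing attention module is specified in the paper itself.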
Data availability and access
All the datasets used in this research are benchmark data that are publicly available online.
Acknowledgements
We thank the anonymous reviewers for their insightful comments. This study was partially supported by the Tianshan Talent Training Program in the Autonomous Region, China (grant number: 2023TSYCLJ0023); the Natural Science Foundation of Xinjiang Uygur Autonomous Region (grant number: 2023D01C176); the Xinjiang Uygur Autonomous Region Universities Fundamental Research Funds Scientific Research Project (grant number: XJEDU2022P018); the Key Research and Development Projects in the Autonomous Region, China (grant numbers: 2023A03001 and 2021B01002); and the Key Program of the National Natural Science Foundation of China (grant number: U2003208).
Author information
Authors and Affiliations
Contributions
Xueyu Guo: Conceptualization, Methodology, Validation, Investigation, Writing - Original Draft, Writing - Review & Editing, Visualization.
Shengwei Tian: Validation, Writing - Review & Editing, Supervision, Funding acquisition.
Long Yu: Validation, Writing - Review & Editing, Supervision.
Xiaoyu He: Conceptualization, Validation, Writing - Review & Editing.
Corresponding author
Ethics declarations
Competing Interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Ethical and informed consent for data used
Both multimodal sentiment analysis datasets used in this study are open-source datasets and do not involve any ethical issues.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Guo, X., Tian, S., Yu, L. et al. SmartRAN: Smart Routing Attention Network for multimodal sentiment analysis. Appl Intell 54, 12742–12763 (2024). https://doi.org/10.1007/s10489-024-05839-7