Abstract
Computational analysis of human multimodal sentiment is an emerging research area. Fusing the semantic, visual, and acoustic modalities requires modeling both inter-modal and intra-modal interactions. The first challenge for inter-modal understanding is bridging the heterogeneity gap between modalities. Meanwhile, modeling the intra-modal connections of time-series data requires handling long-range dependencies across many time steps. Moreover, the time series of different modalities are usually unaligned, because each modality has its own specialized processing pipeline and sampling frequency. In this paper, we propose a method based on the transformer and the gate mechanism, the Gate-Fusion Transformer, to address these problems. We conducted detailed experiments to verify the effectiveness of the proposed method. Owing to the flexibility of the gate mechanism in controlling information flow and the power of the transformer in modeling inter- and intra-modal interactions, our method achieves performance superior to the current state-of-the-art method while remaining more extensible and flexible, since multiple gate-fusion blocks can be stacked.
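To illustrate the kind of gate mechanism the abstract refers to, here is a minimal, self-contained sketch of element-wise gated fusion of two modality feature vectors. This is not the authors' implementation: the function name `gate_fuse` and all weights are hypothetical, and a learned model would obtain `w_a`, `w_b`, and `bias` by training rather than fixing them by hand.

```python
import math


def sigmoid(x):
    """Logistic function, squashing any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gate_fuse(a, b, w_a, w_b, bias):
    """Element-wise gated fusion of two modality vectors a and b.

    For each dimension i, a gate g in (0, 1) is computed from both
    inputs; the fused value is the convex combination
    g * a[i] + (1 - g) * b[i], so the gate controls how much
    information flows from each modality.
    """
    fused = []
    for i in range(len(a)):
        g = sigmoid(w_a[i] * a[i] + w_b[i] * b[i] + bias[i])
        fused.append(g * a[i] + (1.0 - g) * b[i])
    return fused


# Toy example: two 2-dimensional "modality" features with hand-picked
# (hypothetical) gate parameters.
if __name__ == "__main__":
    a = [1.0, 0.0]   # e.g. a semantic feature vector
    b = [0.0, 1.0]   # e.g. an acoustic feature vector
    fused = gate_fuse(a, b, w_a=[2.0, 2.0], w_b=[2.0, 2.0], bias=[0.0, 0.0])
    print(fused)  # each entry lies between the corresponding a[i] and b[i]
```

Because the output is a per-dimension convex combination of the inputs, each fused value stays within the range spanned by the two modalities; stacking several such gated blocks, as the paper proposes, lets the model repeatedly re-weight the modalities.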
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Xie, L.-F., Zhang, X.-Y. (2020). Gate-Fusion Transformer for Multimodal Sentiment Analysis. In: Lu, Y., Vincent, N., Yuen, P.C., Zheng, W.-S., Cheriet, F., Suen, C.Y. (eds.) Pattern Recognition and Artificial Intelligence. ICPRAI 2020. Lecture Notes in Computer Science, vol. 12068. Springer, Cham.
DOI: https://doi.org/10.1007/978-3-030-59830-3_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-59829-7
Online ISBN: 978-3-030-59830-3
eBook Packages: Computer Science (R0)