Abstract
Computational analysis of human multimodal sentiment is an emerging research area. Fusing the semantic, visual, and acoustic modalities requires modeling both inter-modal and intra-modal interactions. The first challenge for inter-modal understanding is bridging the heterogeneity gap between modalities. Meanwhile, modeling the intra-modal connections of time-series data requires handling long-range dependencies across many time steps. Moreover, the time series of different modalities are usually unaligned, because each modality has its own specialized processing pipeline and sampling frequency. In this paper, we propose a method based on the transformer and the gate mechanism, the Gate-Fusion Transformer, to address these problems. We conducted detailed experiments to verify the effectiveness of the proposed method. Owing to the flexibility of the gate mechanism in controlling information flow and the power of the transformer in modeling inter- and intra-modal interactions, our method achieves performance superior to the current state-of-the-art method while remaining more extensible and flexible, since multiple gate-fusion blocks can be stacked.
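To illustrate the kind of gate mechanism the abstract refers to, here is a minimal, self-contained sketch of element-wise gated fusion of two modality feature vectors. This is not the authors' implementation: the function name `gate_fuse` and all weights are hypothetical, and a learned model would obtain `w_a`, `w_b`, and `bias` by training rather than fixing them by hand.

```python
import math


def sigmoid(x):
    """Logistic function, squashing any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gate_fuse(a, b, w_a, w_b, bias):
    """Element-wise gated fusion of two modality vectors a and b.

    For each dimension i, a gate g in (0, 1) is computed from both
    inputs; the fused value is the convex combination
    g * a[i] + (1 - g) * b[i], so the gate controls how much
    information flows from each modality.
    """
    fused = []
    for i in range(len(a)):
        g = sigmoid(w_a[i] * a[i] + w_b[i] * b[i] + bias[i])
        fused.append(g * a[i] + (1.0 - g) * b[i])
    return fused


# Toy example: two 2-dimensional "modality" features with hand-picked
# (hypothetical) gate parameters.
if __name__ == "__main__":
    a = [1.0, 0.0]   # e.g. a semantic feature vector
    b = [0.0, 1.0]   # e.g. an acoustic feature vector
    fused = gate_fuse(a, b, w_a=[2.0, 2.0], w_b=[2.0, 2.0], bias=[0.0, 0.0])
    print(fused)  # each entry lies between the corresponding a[i] and b[i]
```

Because the output is a per-dimension convex combination of the inputs, each fused value stays within the range spanned by the two modalities; stacking several such gated blocks, as the paper proposes, lets the model repeatedly re-weight the modalities.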
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Xie, L.-F., Zhang, X.-Y. (2020). Gate-Fusion Transformer for Multimodal Sentiment Analysis. In: Lu, Y., Vincent, N., Yuen, P.C., Zheng, W.-S., Cheriet, F., Suen, C.Y. (eds.) Pattern Recognition and Artificial Intelligence. ICPRAI 2020. Lecture Notes in Computer Science, vol. 12068. Springer, Cham.
DOI: https://doi.org/10.1007/978-3-030-59830-3_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-59829-7
Online ISBN: 978-3-030-59830-3
eBook Packages: Computer Science (R0)