
Gate-Fusion Transformer for Multimodal Sentiment Analysis

  • Conference paper
Pattern Recognition and Artificial Intelligence (ICPRAI 2020)

Abstract

Computational analysis of human multimodal sentiment is an emerging research area. Fusing the semantic, visual, and acoustic modalities requires exploring both inter-modal and intra-modal interactions. The first challenge for inter-modal understanding is bridging the heterogeneity gap between modalities. Meanwhile, modeling the intra-modal connections of time-series data requires handling long-range dependencies across many time steps. Moreover, the time series from different modalities are usually unaligned, because each modality has its own specialized processing pipeline and sampling frequency. In this paper, we propose a method based on the transformer and the gate mechanism, the Gate-Fusion Transformer, to address these problems. We conducted detailed experiments to verify the effectiveness of the proposed method. Owing to the flexibility of the gate mechanism for controlling information flow and the power of the transformer for modeling inter- and intra-modal interactions, we achieve performance superior to the current state-of-the-art method while remaining more extensible and flexible, since multiple gate-fusion blocks can be stacked.
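
To make the idea concrete, below is a minimal PyTorch sketch of what a gate-fusion block might look like: cross-modal attention (queries from a target modality, keys and values from a source modality) combined with a learned sigmoid gate that controls how much cross-modal information enters the target stream. The class name, gating layer, and wiring are illustrative assumptions, not the authors' released implementation.

    # Hypothetical sketch of a gate-fusion block; names and wiring are
    # assumptions, not the paper's exact architecture.
    import torch
    import torch.nn as nn

    class GateFusionBlock(nn.Module):
        def __init__(self, dim: int, num_heads: int = 4):
            super().__init__()
            # Cross-modal attention: queries come from the target modality,
            # keys/values from the source modality.
            self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            # The gate decides, per feature, how much attended information to admit.
            self.gate = nn.Linear(2 * dim, dim)
            self.norm = nn.LayerNorm(dim)

        def forward(self, target, source):
            # target: (batch, T_t, dim); source: (batch, T_s, dim).
            # T_t and T_s may differ, so no cross-modal alignment is required.
            attended, _ = self.cross_attn(target, source, source)
            g = torch.sigmoid(self.gate(torch.cat([target, attended], dim=-1)))
            # Gated residual update: g controls the cross-modal information flow.
            return self.norm(target + g * attended)

    # Blocks can be stacked for deeper fusion, as the abstract suggests.
    text = torch.randn(8, 50, 64)    # e.g. word-level semantic features
    audio = torch.randn(8, 120, 64)  # e.g. frame-level acoustic features
    fused = GateFusionBlock(dim=64)(text, audio)  # shape: (8, 50, 64)

Because the attention is computed across sequences of different lengths, such a block can consume unaligned modality streams directly, which matches the unalignment problem raised in the abstract.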


Notes

  1. https://pytorch.org/docs/stable/_modules/torch/nn/utils/rnn.html#pad_sequence.
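For context, pad_sequence (the utility this footnote points to) stacks variable-length sequences into a single tensor by padding the shorter ones; a minimal usage sketch, with made-up lengths, is:

    # Minimal usage of torch.nn.utils.rnn.pad_sequence; the sequence
    # lengths below are made up for illustration.
    import torch
    from torch.nn.utils.rnn import pad_sequence

    seqs = [torch.randn(5, 64), torch.randn(3, 64), torch.randn(7, 64)]
    batch = pad_sequence(seqs, batch_first=True)  # shape: (3, 7, 64)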


Author information


Corresponding author

Correspondence to Long-Fei Xie.



Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Xie, LF., Zhang, XY. (2020). Gate-Fusion Transformer for Multimodal Sentiment Analysis. In: Lu, Y., Vincent, N., Yuen, P.C., Zheng, WS., Cheriet, F., Suen, C.Y. (eds) Pattern Recognition and Artificial Intelligence. ICPRAI 2020. Lecture Notes in Computer Science, vol 12068. Springer, Cham. https://doi.org/10.1007/978-3-030-59830-3_3


  • DOI: https://doi.org/10.1007/978-3-030-59830-3_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-59829-7

  • Online ISBN: 978-3-030-59830-3

  • eBook Packages: Computer Science, Computer Science (R0)
