
Joint Visual-Textual Sentiment Analysis Based on Cross-Modality Attention Mechanism

  • Conference paper
  • In: MultiMedia Modeling (MMM 2019)
  • Part of the book series: Lecture Notes in Computer Science, volume 11295

Abstract

Recently, many researchers have focused on joint visual-textual sentiment analysis, since combining the two modalities better captures user sentiment toward events or topics. In this paper, we propose that visual and textual information should differ in their contribution to sentiment analysis. Our model learns a robust joint visual-textual representation by incorporating a cross-modality attention mechanism and semantic embedding learning based on a bidirectional recurrent neural network. Experimental results show that our model outperforms existing state-of-the-art models for sentiment analysis on real-world datasets. In addition, we investigate several variants of the proposed model and analyze the effects of semantic embedding learning and the cross-modality attention mechanism, in order to provide deeper insight into how these two techniques help the learning of a joint visual-textual sentiment classifier.
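The architecture the abstract describes (a bidirectional RNN text encoder, a CNN image feature, and attention that weighs one modality against the other) can be illustrated in a few lines of PyTorch. The sketch below is an assumption-laden illustration, not the authors' exact model: the layer sizes, the global (rather than region-level) image feature, the image-guided attention over words, and the concatenation-based fusion are all illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAttentionSentiment(nn.Module):
    """Minimal sketch of a joint visual-textual sentiment classifier with a
    cross-modality attention mechanism. Dimensions and fusion details are
    assumptions for illustration, not the paper's exact architecture."""

    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=256,
                 img_dim=2048, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)       # could be GloVe-initialised
        self.bigru = nn.GRU(embed_dim, hidden_dim,             # semantic embedding learning
                            batch_first=True, bidirectional=True)
        self.img_proj = nn.Linear(img_dim, 2 * hidden_dim)     # map CNN feature into text space
        self.attn = nn.Linear(4 * hidden_dim, 1)               # scores each word against the image
        self.classifier = nn.Linear(4 * hidden_dim, num_classes)

    def forward(self, tokens, img_feat):
        # tokens: (B, T) word ids; img_feat: (B, img_dim) global CNN image feature
        h, _ = self.bigru(self.embed(tokens))                  # (B, T, 2H) word hidden states
        v = self.img_proj(img_feat)                            # (B, 2H) projected image feature
        v_exp = v.unsqueeze(1).expand(-1, h.size(1), -1)       # (B, T, 2H)
        scores = self.attn(torch.cat([h, v_exp], dim=-1))      # (B, T, 1) cross-modal scores
        alpha = F.softmax(scores, dim=1)                       # attention weights over words
        text_ctx = (alpha * h).sum(dim=1)                      # image-guided text summary
        joint = torch.cat([text_ctx, v], dim=-1)               # joint visual-textual representation
        return self.classifier(joint)

# Toy usage: a batch of 4 captions (length 12) with Inception-style image features.
model = CrossModalAttentionSentiment()
logits = model(torch.randint(0, 10000, (4, 12)), torch.randn(4, 2048))
print(logits.shape)  # torch.Size([4, 2])
```

The key point the sketch captures is that the two modalities contribute unequally: the attention weights let the image feature decide which words matter, so the joint representation is not a fixed 50/50 fusion.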


Notes

  1. https://www.flickr.com.

  2. https://www.gettyimages.co.uk.


Acknowledgment

This work is supported by the National Natural Science Foundation of China under Grants No. 61772133, No. 61472081, No. 61402104, No. 61370207, No. 61370208, No. 61300024, and No. 61320106007; by the Key Laboratory of Computer Network Technology of Jiangsu Province; by the Jiangsu Provincial Key Laboratory of Network and Information Security under Grant No. BM2003201; and by the Key Laboratory of Computer Network and Information Integration of the Ministry of Education of China under Grant No. 93K-9.

Author information

Corresponding author: Jiuxin Cao.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Zhu, X., Cao, B., Xu, S., Liu, B., Cao, J. (2019). Joint Visual-Textual Sentiment Analysis Based on Cross-Modality Attention Mechanism. In: Kompatsiaris, I., Huet, B., Mezaris, V., Gurrin, C., Cheng, W.H., Vrochidis, S. (eds) MultiMedia Modeling. MMM 2019. Lecture Notes in Computer Science, vol 11295. Springer, Cham. https://doi.org/10.1007/978-3-030-05710-7_22

  • DOI: https://doi.org/10.1007/978-3-030-05710-7_22

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-05709-1

  • Online ISBN: 978-3-030-05710-7

  • eBook Packages: Computer Science, Computer Science (R0)
