
Joint Visual-Textual Sentiment Analysis Based on Cross-Modality Attention Mechanism

  • Conference paper
  • In: MultiMedia Modeling (MMM 2019)
  • Part of the book series: Lecture Notes in Computer Science, volume 11295

Abstract

Recently, many researchers have focused on joint visual-textual sentiment analysis, since combining the two modalities better captures user sentiment toward events or topics. In this paper, we propose that visual and textual information should differ in their contribution to sentiment analysis. Our model learns a robust joint visual-textual representation by incorporating a cross-modality attention mechanism and semantic embedding learning based on a bidirectional recurrent neural network. Experimental results show that our model outperforms existing state-of-the-art models for sentiment analysis on real-world datasets. In addition, we investigate several variants of the proposed model and analyze the effects of semantic embedding learning and the cross-modality attention mechanism, in order to provide deeper insight into how these two techniques help the learning of a joint visual-textual sentiment classifier.
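The architecture the abstract describes (a bidirectional RNN text encoder, a CNN image feature, and attention that weighs one modality against the other) can be illustrated in a few lines of PyTorch. The sketch below is an assumption-laden illustration, not the authors' exact model: the layer sizes, the global (rather than region-level) image feature, the image-guided attention over words, and the concatenation-based fusion are all illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAttentionSentiment(nn.Module):
    """Minimal sketch of a joint visual-textual sentiment classifier with a
    cross-modality attention mechanism. Dimensions and fusion details are
    assumptions for illustration, not the paper's exact architecture."""

    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=256,
                 img_dim=2048, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)       # could be GloVe-initialised
        self.bigru = nn.GRU(embed_dim, hidden_dim,             # semantic embedding learning
                            batch_first=True, bidirectional=True)
        self.img_proj = nn.Linear(img_dim, 2 * hidden_dim)     # map CNN feature into text space
        self.attn = nn.Linear(4 * hidden_dim, 1)               # scores each word against the image
        self.classifier = nn.Linear(4 * hidden_dim, num_classes)

    def forward(self, tokens, img_feat):
        # tokens: (B, T) word ids; img_feat: (B, img_dim) global CNN image feature
        h, _ = self.bigru(self.embed(tokens))                  # (B, T, 2H) word hidden states
        v = self.img_proj(img_feat)                            # (B, 2H) projected image feature
        v_exp = v.unsqueeze(1).expand(-1, h.size(1), -1)       # (B, T, 2H)
        scores = self.attn(torch.cat([h, v_exp], dim=-1))      # (B, T, 1) cross-modal scores
        alpha = F.softmax(scores, dim=1)                       # attention weights over words
        text_ctx = (alpha * h).sum(dim=1)                      # image-guided text summary
        joint = torch.cat([text_ctx, v], dim=-1)               # joint visual-textual representation
        return self.classifier(joint)

# Toy usage: a batch of 4 captions (length 12) with Inception-style image features.
model = CrossModalAttentionSentiment()
logits = model(torch.randint(0, 10000, (4, 12)), torch.randn(4, 2048))
print(logits.shape)  # torch.Size([4, 2])
```

The key point the sketch captures is that the two modalities contribute unequally: the attention weights let the image feature decide which words matter, so the joint representation is not a fixed 50/50 fusion.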


Notes

  1. https://www.flickr.com.

  2. https://www.gettyimages.co.uk.


Acknowledgment

This work is supported by the National Natural Science Foundation of China under Grants No. 61772133, No. 61472081, No. 61402104, No. 61370207, No. 61370208, No. 61300024, and No. 61320106007; by the Key Laboratory of Computer Network Technology of Jiangsu Province; by the Jiangsu Provincial Key Laboratory of Network and Information Security under Grant No. BM2003201; and by the Key Laboratory of Computer Network and Information Integration of the Ministry of Education of China under Grant No. 93K-9.

Author information

Corresponding author: Jiuxin Cao.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Zhu, X., Cao, B., Xu, S., Liu, B., Cao, J. (2019). Joint Visual-Textual Sentiment Analysis Based on Cross-Modality Attention Mechanism. In: Kompatsiaris, I., Huet, B., Mezaris, V., Gurrin, C., Cheng, W.H., Vrochidis, S. (eds) MultiMedia Modeling. MMM 2019. Lecture Notes in Computer Science, vol 11295. Springer, Cham. https://doi.org/10.1007/978-3-030-05710-7_22

  • DOI: https://doi.org/10.1007/978-3-030-05710-7_22

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-05709-1

  • Online ISBN: 978-3-030-05710-7

  • eBook Packages: Computer Science, Computer Science (R0)
