Abstract
This paper addresses Visual Sentiment Analysis, focusing on estimating the polarity of the sentiment evoked by an image. Starting from an embedding approach that exploits both visual and textual features, we attempt to boost the contribution of each input view. We propose to extract and employ an Objective Text description of images rather than the classic Subjective Text provided by users (i.e., title, tags and image description), which is extensively exploited in the state of the art to infer the sentiment associated with social images. Objective Text is obtained from the visual content of the images through recent deep learning architectures used to classify objects and scenes and to perform image captioning. Objective Text features are then combined with visual features in an embedding space obtained with Canonical Correlation Analysis, and the sentiment polarity is inferred by a supervised Support Vector Machine. During the evaluation, we compared an extensive number of text and visual feature combinations, as well as baselines obtained from state-of-the-art methods. Experiments performed on a representative dataset of 47,235 labelled samples demonstrate that exploiting Objective Text helps to outperform the state of the art in sentiment polarity estimation.

Notes
Our implementation exploits the MVSO English model provided by [23], which corresponds to the DeepSentiBank CNN fine-tuned to predict 4342 English Adjective Noun Pairs.
The code to repeat the performance evaluation is available at the URL: http://iplab.dmi.unict.it/sentimentembedding/
References
Ahmad K, Mekhalfi ML, Conci N, Melgani F, Natale FD (2018) Ensemble of deep models for event recognition. ACM Transactions on Multimedia Computing Communications, and Applications (TOMM) 14(2):51
Baecchi C, Uricchio T, Bertini M, Del Bimbo A (2016) A multimodal feature learning approach for sentiment analysis of social network multimedia. Multimed Tools Appl 75(5):2507–2525
Battiato S, Farinella GM, Milotta FL, Ortis A, Addesso L, Casella A, D’Amico V, Torrisi G (2016) The social picture. In: Proceedings of the 2016 ACM on international conference on multimedia retrieval, pp 397–400. ACM
Battiato S, Moltisanti M, Ravì F, Bruna AR, Naccari F (2013) Aesthetic scoring of digital portraits for consumer applications. In: IS&T/SPIE electronic imaging, pp 866008–866008. International Society for Optics and Photonics
Borth D, Ji R, Chen T, Breuel T, Chang SF (2013) Large-scale visual sentiment ontology and detectors using adjective noun pairs. In: Proceedings of the 21st ACM international conference on multimedia, pp 223–232. ACM
Campos V, Jou B, Giró-i-Nieto X (2017) From pixels to sentiment: fine-tuning CNNs for visual sentiment prediction. Image Vis Comput 65:15–22. https://doi.org/10.1016/j.imavis.2017.01.011
Campos V, Salvador A, Giró-i-Nieto X, Jou B (2015) Diving deep into sentiment: understanding fine-tuned CNNs for visual sentiment prediction. In: Proceedings of the 1st international workshop on affect & sentiment in multimedia (ASM ’15). ACM, New York, pp 57–62. https://doi.org/10.1145/2813524.2813530
Chen T, Borth D, Darrell T, Chang SF (2014) DeepSentiBank: visual sentiment concept classification with deep convolutional neural networks. arXiv:1410.8586
Cui P, Liu S, Zhu W (2017) General knowledge embedded image representation learning. IEEE Transactions on Multimedia
Datta R, Joshi D, Li J, Wang JZ (2006) Studying aesthetics in photographic images using a computational approach. In: European conference on computer vision, pp 288–301. Springer
Esuli A, Sebastiani F (2006) SentiWordNet: a publicly available lexical resource for opinion mining. In: Proceedings of the 5th international conference on language resources and evaluation (LREC), vol 6, pp 417–422
Fu Y, Hospedales TM, Xiang T, Fu Z, Gong S (2014) Transductive multi-view embedding for zero-shot recognition and annotation. In: Proceedings of the European conference on computer vision, pp 584–599. Springer
Gong Y, Ke Q, Isard M, Lazebnik S (2014) A multi-view embedding space for modeling internet images, tags, and their semantics. Int J Comput Vis 106(2):210–233
Gong Y, Wang L, Hodosh M, Hockenmaier J, Lazebnik S (2014) Improving image-sentence embeddings using large weakly annotated photo collections. In: Proceedings of the European conference on computer vision, pp 529–545. Springer
Guillaumin M, Verbeek J, Schmid C (2010) Multimodal semi-supervised learning for image classification. In: IEEE conference on computer vision and pattern recognition (CVPR), pp 902–909. IEEE
Hardoon DR, Szedmak S, Shawe-Taylor J (2004) Canonical correlation analysis: An overview with application to learning methods. Neural Comput 16(12):2639–2664
Huang F, Zhang X, Zhao Z, Xu J, Li Z (2019) Image–text sentiment analysis via deep multimodal attentive fusion. Knowl-Based Syst 167:26–37
Hung C, Lin HK (2013) Using objective words in SentiWordNet to improve sentiment classification for word of mouth. IEEE Intell Syst 28(2):47–54
Hwang SJ, Grauman K (2010) Accounting for the relative importance of objects in image retrieval. In: Proceedings of British machine vision conference, vol 1, 2
Hwang SJ, Grauman K (2012) Learning the relative importance of objects from tagged images for retrieval and cross-modal search. Int J Comput Vis 100(2):134–153
Itten J (1962) The art of color: the subjective experience and objective rationale of color
Johnson J, Ballan L, Fei-Fei L (2015) Love thy neighbors: Image annotation by exploiting image metadata. In: Proceedings of the IEEE international conference on computer vision, pp 4624–4632
Jou B, Chen T, Pappas N, Redi M, Topkara M, Chang SF (2015) Visual affect around the world: A large-scale multilingual visual sentiment ontology. In: Proceedings of the 23rd ACM international conference on multimedia, pp 159–168. ACM
Karpathy A, Fei-Fei L (2015) Deep visual-semantic alignments for generating image descriptions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3128–3137
Katsurai M, Satoh S (2016) Image sentiment analysis using latent correlations among visual, textual, and sentiment views. In: Proceedings of the IEEE international conference on acoustics, speech and signal processing, pp 2837–2841. IEEE
Lei X, Qian X, Zhao G (2016) Rating prediction based on social sentiment from textual reviews. IEEE Trans Multimed 18(9):1910–1921
Li X, Uricchio T, Ballan L, Bertini M, Snoek CG, Bimbo AD (2016) Socializing the semantic gap: A comparative survey on image tag assignment, refinement, and retrieval. ACM Comput Surveys (CSUR) 49(1):14
Machajdik J, Hanbury A (2010) Affective image classification using features inspired by psychology and art theory. In: Proceedings of the 18th ACM international conference on multimedia, pp 83–92. ACM
Thelwall M, Buckley K, Paltoglou G, Cai D, Kappas A (2010) Sentiment strength detection in short informal text. Journal of the Association for Information Science and Technology 61(12):2544–2558
Miller GA (1995) WordNet: a lexical database for English. Commun ACM 38(11):39–41
Ortis A, Farinella GM, Torrisi G, Battiato S (2018) Visual sentiment analysis based on objective text description of images. In: 2018 International conference on content-based multimedia indexing (CBMI), pp 1–6. IEEE
Pang L, Zhu S, Ngo CW (2015) Deep multimodal learning for affective analysis and retrieval. IEEE Trans Multimed 17(11):2008–2020
Perronnin F, Sánchez J, Liu Y (2010) Large-scale image categorization with explicit data embedding. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2297–2304
Qian S, Zhang T, Xu C, Shao J (2016) Multi-modal event topic model for social event analysis. IEEE Trans Multimed 18(2):233–246
Rahimi A, Recht B (2007) Random features for large-scale kernel machines. In: Advances in neural information processing systems, vol 20, pp 1177–1184
Rasiwasia N, Costa Pereira J, Coviello E, Doyle G, Lanckriet GR, Levy R, Vasconcelos N (2010) A new approach to cross-modal multimedia retrieval. In: Proceedings of the 18th ACM international conference on multimedia, pp 251–260. ACM
Rudinac S, Larson M, Hanjalic A (2013) Learning crowdsourced user preferences for visual summarization of image collections. IEEE Trans Multimed 15(6):1231–1243
Siersdorfer S, Minack E, Deng F, Hare J (2010) Analyzing and predicting sentiment of images on the social web. In: Proceedings of the 18th ACM international conference on multimedia, pp 715–718. ACM
Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition
Valdez P, Mehrabian A (1994) Effects of color on emotions. J Exp Psychol Gen 123(4):394–409. American Psychological Association
Wang G, Hoiem D, Forsyth D (2009) Building text features for object image classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1367–1374
Wang Y, Wang S, Tang J, Liu H, Li B (2015) Unsupervised sentiment analysis for social media images. In: Proceedings of the 24th international joint conference on artificial intelligence, Buenos Aires, Argentina, pp 2378–2379
Xu C, Cetintas S, Lee K, Li L (2014) Visual sentiment prediction with deep convolutional neural networks. arXiv:1411.5731
Yang X, Zhang T, Xu C (2015) Cross-domain feature learning in multimedia. IEEE Trans Multimed 17(1):64–78
You Q, Cao L, Cong Y, Zhang X, Luo J (2015) A multifaceted approach to social multimedia-based prediction of elections. IEEE Trans Multimed 17(12):2271–2280
You Q, Luo J, Jin H, Yang J (2015) Robust image sentiment analysis using progressively trained and domain transferred deep networks. In: 29th AAAI conference on artificial intelligence
Yu FX, Cao L, Feris RS, Smith JR, Chang SF (2013) Designing category-level attributes for discriminative visual recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 771–778
Yuan J, Mcdonough S, You Q, Luo J (2013) Sentribute: image sentiment analysis from a mid-level perspective. In: Proceedings of the 2nd international workshop on issues of sentiment discovery and opinion mining. ACM
Yuan Z, Sang J, Xu C (2013) Tag-aware image classification via nested deep belief nets. In: 2013 IEEE international conference on multimedia and expo (ICME), pp 1–6. IEEE
Yuan Z, Sang J, Xu C, Liu Y (2014) A unified framework of latent feature learning in social media. IEEE Trans Multimed 16(6):1624–1635
Zhou B, Lapedriza A, Xiao J, Torralba A, Oliva A (2014) Learning deep features for scene recognition using places database. In: Advances in neural information processing systems, pp 487–495
Zhu X, Cao B, Xu S, Liu B, Cao J (2019) Joint visual-textual sentiment analysis based on cross-modality attention mechanism. In: International conference on multimedia modeling, pp 264–276. Springer
Acknowledgments
This work has been partially supported by Telecom Italia TIM - Joint Open Lab.
Cite this article
Ortis, A., Farinella, G.M., Torrisi, G. et al. Exploiting objective text description of images for visual sentiment analysis. Multimed Tools Appl 80, 22323–22346 (2021). https://doi.org/10.1007/s11042-019-08312-7