Abstract
Visual content is a rich medium that can communicate not only facts and events but also emotions and opinions. In some cases, visual content carries a universal affective bias (e.g., natural disasters or beautiful scenes). Often, however, ensuring that the affect a piece of visual media evokes in its recipient matches what its author intended requires a deep understanding, and even a sharing, of cultural backgrounds. In this study, we propose a computational framework for clustering and analyzing multilingual visual affective concepts, which enables us to pinpoint alignable differences (via similar concepts) and nonalignable differences (via unique concepts) across cultures. To do so, we crowdsource sentiment labels for the MVSO dataset, which contains 16K multilingual visual sentiment concepts and 7.3M images tagged with these concepts. We then represent these concepts in a distribution-based word vector space via (1) pivotal translation or (2) cross-lingual semantic alignment, and evaluate these representations on three cross-lingual tasks: affective concept retrieval, concept clustering, and sentiment prediction. The proposed clustering framework enables both quantitative and qualitative analysis of the large multilingual dataset. We also show a novel use case consisting of a facial image data subset and explore cultural insights about visual sentiment concepts in such portrait-focused images.
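
As a rough illustration of cross-lingual affective concept retrieval in a shared word vector space, the following minimal sketch composes a concept vector by summing its word vectors and retrieves the closest concept in another language by cosine similarity. The three-dimensional vectors, the additive composition, and the names `WORD_VECTORS`, `concept_vector`, `cosine`, and `retrieve` are all assumptions made for illustration; they are not taken from the study and do not reproduce its models or data.

```python
import math

# Hypothetical 3-dimensional word vectors in a shared cross-lingual space.
# The values are made up for illustration; real embeddings are learned from
# large corpora and have hundreds of dimensions.
WORD_VECTORS = {
    "beautiful": [0.90, 0.10, 0.00],
    "sunset":    [0.10, 0.90, 0.00],
    "sad":       [-0.80, 0.00, 0.30],
    "face":      [0.00, 0.20, 0.90],
    "hermoso":   [0.85, 0.15, 0.05],   # Spanish, assumed aligned with "beautiful"
    "atardecer": [0.15, 0.85, 0.00],   # Spanish, assumed aligned with "sunset"
    "triste":    [-0.75, 0.05, 0.35],  # Spanish, assumed aligned with "sad"
    "cara":      [0.05, 0.15, 0.90],   # Spanish, assumed aligned with "face"
}


def concept_vector(concept):
    """Compose a concept vector as the sum of its word vectors (an assumption)."""
    dims = len(next(iter(WORD_VECTORS.values())))
    total = [0.0] * dims
    for word in concept.split():
        for i, value in enumerate(WORD_VECTORS[word]):
            total[i] += value
    return total


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query, candidates):
    """Return the candidate concept most similar to the query concept."""
    query_vec = concept_vector(query)
    return max(candidates, key=lambda c: cosine(query_vec, concept_vector(c)))


spanish_concepts = ["hermoso atardecer", "cara triste"]
best_for_sunset = retrieve("beautiful sunset", spanish_concepts)
best_for_face = retrieve("sad face", spanish_concepts)
```

With these toy vectors, the English query "beautiful sunset" is matched to "hermoso atardecer" and "sad face" to "cara triste". The same similarity function could in principle feed a clustering step, although the choice of composition and distance here is only one of several possibilities.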

Notes
We did not perform lemmatization or any other preprocessing step, in order to preserve the original visual concept properties.
Additional information
Nikolaos Pappas, Miriam Redi, Mercan Topkara, and Hongyi Liu contributed equally.
Cite this article
Pappas, N., Redi, M., Topkara, M. et al. Multilingual visual sentiment concept clustering and analysis. Int J Multimed Info Retr 6, 51–70 (2017). https://doi.org/10.1007/s13735-017-0120-4
DOI: https://doi.org/10.1007/s13735-017-0120-4