Abstract
Cross-modal retrieval has attracted increasing attention in recent years, as it enables users to efficiently retrieve desired information from large collections of multimedia data. Most cross-modal retrieval methods focus only on aligning the objects depicted in images and text, yet sentiment alignment is also essential for applications such as entertainment and advertisement. This paper studies the problem of retrieving visual sentiment concepts, with the goal of extracting sentiment-oriented information from social multimedia content, i.e., sentiment-oriented cross-modal retrieval. The problem is inherently challenging due to the subjectivity and ambiguity of adjectives such as “sad” and “awesome”. We therefore model visual sentiment concepts as adjective-noun pairs, e.g., “sad dog” and “awesome flower”, since associating adjectives with concrete objects makes the concepts more tractable. This paper proposes a deep coordinated textual and visual network with two branches to learn a joint semantic embedding space for images and texts. The visual branch is based on a convolutional neural network (CNN) pre-trained on a large dataset and optimized with a classification loss. The textual branch is attached to the fully-connected layer and provides supervision from the textual semantic space. To learn coordinated representations across modalities, a multi-task loss function is optimized during end-to-end training. Extensive experiments on a subset of the large-scale VSO dataset show that the proposed model retrieves sentiment-oriented data effectively and performs favorably against state-of-the-art methods.
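To make the two-branch design concrete, the following is a minimal PyTorch sketch of a coordinated embedding network with a multi-task loss. The VGG16 backbone, the MSE coordination term, the embedding dimension, and the loss weight alpha are all illustrative assumptions, not the paper's exact configuration; the paper's textual branch may use a different text representation and objective.

```python
# Minimal sketch of a two-branch coordinated embedding network.
# Backbone choice, dimensions, and loss weighting are illustrative
# assumptions; the paper's exact architecture may differ.
import torch
import torch.nn as nn
import torchvision.models as models


class CoordinatedNet(nn.Module):
    def __init__(self, embed_dim=300, num_classes=100):
        super().__init__()
        # Visual branch: a CNN pre-trained on a large dataset (here
        # ImageNet-pretrained VGG16), projected into the joint space.
        backbone = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
        self.features = backbone.features
        self.visual_fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512 * 7 * 7, embed_dim),
        )
        # Classification head (one class per adjective-noun pair).
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, images):
        visual_embed = self.visual_fc(self.features(images))
        logits = self.classifier(visual_embed)
        return visual_embed, logits


def multi_task_loss(visual_embed, logits, text_embed, labels, alpha=0.5):
    """Classification loss plus a coordination term that pulls each image
    embedding toward the embedding of its paired text (e.g. word vectors
    of the adjective-noun pair, projected to the same dimension)."""
    cls_loss = nn.functional.cross_entropy(logits, labels)
    coord_loss = nn.functional.mse_loss(visual_embed, text_embed)
    return cls_loss + alpha * coord_loss


# Usage with dummy data: a batch of 224x224 images and paired
# 300-d text embeddings.
model = CoordinatedNet()
images = torch.randn(4, 3, 224, 224)
text_embed = torch.randn(4, 300)
labels = torch.randint(0, 100, (4,))
visual_embed, logits = model(images)
loss = multi_task_loss(visual_embed, logits, text_embed, labels)
loss.backward()
```

At retrieval time, images and texts would be embedded into the shared space and ranked by nearest-neighbor distance across modalities.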
J. Fu and D. She contributed equally to this paper.
Acknowledgments
This work was partially supported by grants from the NSFC (No. U1533104), the Natural Science Foundation of Tianjin, China (No. 18JCYBJC15400), and the Open Project Program of the National Laboratory of Pattern Recognition (NLPR).
© 2018 Springer Nature Switzerland AG
Cite this paper
Fu, J., She, D., Yao, X., Zhang, Y., Yang, J.: Deep Coordinated Textual and Visual Network for Sentiment-Oriented Cross-Modal Retrieval. In: Geng, X., Kang, B.H. (eds.) PRICAI 2018: Trends in Artificial Intelligence. Lecture Notes in Computer Science, vol. 11012. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-97304-3_52
DOI: https://doi.org/10.1007/978-3-319-97304-3_52
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-97303-6
Online ISBN: 978-3-319-97304-3