DOI: 10.1145/2964284.2967212
Short paper

Joint Image-Text Representation by Gaussian Visual-Semantic Embedding

Published: 01 October 2016

ABSTRACT

How to jointly represent images and text is important for tasks involving both modalities. Visual-semantic embedding models have recently been proposed and shown to be effective. The key idea is that by learning a mapping from images into a semantic text space, the algorithm learns a compact and effective joint representation. However, existing approaches simply map each text concept to a single point in the semantic space. Mapping instead to a density distribution offers several advantages: it better captures the uncertainty of each text concept, and it enables geometric interpretation of relations between concepts, such as inclusion and intersection. In this work, we present a novel Gaussian Visual-Semantic Embedding (GVSE) model, which leverages visual information to model text concepts as Gaussian distributions in the semantic space. Experiments on two tasks, image classification and text-based image retrieval on the large-scale MIT Places205 dataset, demonstrate the superiority of our method over existing approaches, with higher accuracy and better robustness.
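The core idea of representing each text concept as a Gaussian rather than a point can be sketched as follows: given an image embedded in the semantic space, score it against each concept's distribution and pick the most likely one. This is a minimal illustrative sketch, not the paper's actual model; the diagonal-covariance assumption, the function names, and the toy data are all hypothetical.

```python
import numpy as np

def gaussian_log_density(x, mu, sigma2):
    """Log-density of embedding x under a diagonal-covariance Gaussian
    with mean mu and per-dimension variances sigma2."""
    return -0.5 * np.sum(np.log(2 * np.pi * sigma2) + (x - mu) ** 2 / sigma2)

def classify(image_embedding, concept_means, concept_vars):
    """Assign the image to the concept whose Gaussian assigns it
    the highest log-density (hypothetical classification rule)."""
    scores = [gaussian_log_density(image_embedding, mu, s2)
              for mu, s2 in zip(concept_means, concept_vars)]
    return int(np.argmax(scores))

# Toy example: two concepts in a 3-D semantic space.
means = [np.array([0.0, 0.0, 0.0]), np.array([2.0, 2.0, 2.0])]
vars_ = [np.ones(3) * 0.5, np.ones(3) * 0.5]
x = np.array([1.8, 2.1, 1.9])
print(classify(x, means, vars_))  # embedding lies near the second concept
```

Because each concept carries a variance as well as a mean, a broad concept (e.g. "building") can assign non-trivial density to a region that fully contains a narrower concept (e.g. "church"), which is what gives the geometric inclusion interpretation mentioned in the abstract.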


Published in:
MM '16: Proceedings of the 24th ACM International Conference on Multimedia
October 2016, 1542 pages
ISBN: 9781450336031
DOI: 10.1145/2964284
Copyright © 2016 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates
MM '16 paper acceptance rate: 52 of 237 submissions (22%). Overall acceptance rate: 995 of 4,171 submissions (24%).
