Abstract
With the rapid growth of the Internet and the explosion of data volume, it has become important to access cross-media big data, including text, images, audio, and video, both efficiently and accurately. However, content heterogeneity and the semantic gap make retrieving such cross-media archives challenging. Existing approaches either learn the connection between modalities directly from hand-crafted low-level features, or construct correlations over high-level feature representations without considering semantic information. To further exploit the intrinsic structure of multimodal data representations, it is essential to build an interpretable correlation between these heterogeneous representations. In this paper, a deep model is proposed that first learns, with a convolutional neural network (CNN), a high-level feature representation shared by different modalities such as text and images. Moreover, the learned CNN features reflect both the salient objects and the details in the images and sentences. Experimental results demonstrate that the proposed approach outperforms current state-of-the-art baseline methods on the public Flickr8K dataset.
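The bidirectional mapping described above is typically trained by projecting image and sentence features into a shared space and optimizing a ranking objective in both retrieval directions. As a minimal sketch (the projection networks, margin value, and function names here are illustrative assumptions, not the paper's exact formulation), the bidirectional hinge ranking loss over a batch of matched image-sentence pairs can be written as:

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize rows to unit length so dot products equal cosine similarity."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def bidirectional_ranking_loss(img, txt, margin=0.2):
    """Bidirectional hinge ranking loss over a batch of paired embeddings.

    img, txt: (n, d) arrays of image and sentence embeddings already
    projected into the shared space; row i of each is a matched pair.
    """
    img = l2_normalize(img)
    txt = l2_normalize(txt)
    sim = img @ txt.T                      # cosine similarity matrix
    pos = np.diag(sim)                     # similarity of matched pairs
    # image-to-sentence: penalize sentences scored within `margin` of the true one
    cost_i2t = np.maximum(0.0, margin + sim - pos[:, None])
    # sentence-to-image: penalize images scored within `margin` of the true one
    cost_t2i = np.maximum(0.0, margin + sim - pos[None, :])
    n = sim.shape[0]
    mask = ~np.eye(n, dtype=bool)          # exclude the positive pairs themselves
    return (cost_i2t[mask].sum() + cost_t2i[mask].sum()) / n
```

Minimizing this loss pushes each image closer to its own sentence than to any other sentence in the batch, and vice versa, which is what enables retrieval in both directions at test time.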
© 2017 Springer International Publishing AG
Cite this paper
Yu, T., Bai, L., Guo, J., Yang, Z., Xie, Y. (2017). Deep Convolutional Neural Network for Bidirectional Image-Sentence Mapping. In: Amsaleg, L., Guðmundsson, G., Gurrin, C., Jónsson, B., Satoh, S. (eds) MultiMedia Modeling. MMM 2017. Lecture Notes in Computer Science(), vol 10133. Springer, Cham. https://doi.org/10.1007/978-3-319-51814-5_12
DOI: https://doi.org/10.1007/978-3-319-51814-5_12
Print ISBN: 978-3-319-51813-8
Online ISBN: 978-3-319-51814-5