Abstract
In this paper, we propose a novel model for cross-media retrieval that relies on a limited text space rather than a common space or an image space. More specifically, the model consists of three parts: a visual part composed of a convolutional neural network and an image understanding network; a language model part that achieves sentence understanding with a recurrent neural network; and an embedding part that contains a fusion layer to capture both visual label information and semantic correlations between images and sentences, and that learns the final limited text space by optimizing a pairwise ranking loss. Experimental results on three benchmark datasets show that our model gains promising improvements in accuracy for cross-media retrieval, especially on sentence retrieval, compared with related state-of-the-art methods.
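The embedding part described above is trained with a pairwise ranking loss over matched and mismatched image-sentence pairs. The sketch below is a minimal illustration of such a margin-based ranking objective, not the paper's exact formulation; the cosine scoring function, the margin value, and the embedding dimension are assumptions made for the example.

```python
import numpy as np

def cosine_sim(a, b):
    """Row-wise cosine similarity between two batches of vectors."""
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-8)
    return a @ b.T

def pairwise_ranking_loss(img_emb, txt_emb, margin=0.2):
    """Margin-based pairwise ranking loss (illustrative sketch only).

    img_emb, txt_emb: (N, d) arrays of image and sentence embeddings,
    where row i of each is a matched pair. Each matched pair's similarity
    is pushed above that of all mismatched pairs in the batch by at least
    `margin`, in both retrieval directions.
    """
    sims = cosine_sim(img_emb, txt_emb)      # (N, N) similarity matrix
    pos = np.diag(sims)                      # similarities of matched pairs
    # sentence retrieval: image i vs. wrong sentences j != i
    cost_s = np.maximum(0.0, margin + sims - pos[:, None])
    # image retrieval: sentence j vs. wrong images i != j
    cost_i = np.maximum(0.0, margin + sims - pos[None, :])
    # matched pairs contribute no cost against themselves
    np.fill_diagonal(cost_s, 0.0)
    np.fill_diagonal(cost_i, 0.0)
    return cost_s.sum() + cost_i.sum()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.normal(size=(8, 256))   # hypothetical 256-d image embeddings
    txt = rng.normal(size=(8, 256))   # hypothetical 256-d sentence embeddings
    print("loss:", pairwise_ranking_loss(img, txt))
```

In practice such a loss is minimized over the parameters of all three parts, so that matched image-sentence pairs end up closer in the learned text space than mismatched ones.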
W. Wang—This project was supported by Shenzhen Peacock Plan (20130408-183003656).
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Yu, Z., Wang, W., Fan, M. (2017). Learning a Limited Text Space for Cross-Media Retrieval. In: Felsberg, M., Heyden, A., Krüger, N. (eds.) Computer Analysis of Images and Patterns. CAIP 2017. Lecture Notes in Computer Science, vol. 10424. Springer, Cham. https://doi.org/10.1007/978-3-319-64689-3_24
DOI: https://doi.org/10.1007/978-3-319-64689-3_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-64688-6
Online ISBN: 978-3-319-64689-3