Abstract
Learning an effective subspace in which to measure the correlation between items from different modalities, such as images and text, is at the core of cross-modal retrieval. However, data from different modalities are imbalanced yet complementary: images contain rich spatial information, while text carries more background and contextual detail. In this paper, we propose a model with dual parallel subspaces (a visual and a textual subspace) to better preserve modality-specific information. Triplet constraints are employed to minimize the semantic gap between items of the same concept from different modalities, while maximizing the gap between image-text pairs of different concepts in the corresponding subspace. We then combine adversarial learning with the dual subspaces in a novel way, framing training as an interplay between two agents. The first agent, the dual subspaces with similarity merging and concept prediction, aims to narrow the difference between the data distributions of the two modalities under the constraint of concept invariance, so as to fool the second agent, a modality discriminator that tries to distinguish images from text. Extensive experiments on the Wikipedia and NUS-WIDE-10k datasets verify the effectiveness of the proposed model for cross-modal retrieval, where it outperforms state-of-the-art methods.
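To make the two-agent objective concrete, the following is a minimal PyTorch sketch of the components the abstract describes: subspace projectors, a triplet constraint, and a modality discriminator. The layer sizes, feature dimensions, margin value, and module names are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubspaceProjector(nn.Module):
    """Maps one modality's features into a learned subspace (sizes assumed)."""
    def __init__(self, in_dim, out_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(),
                                 nn.Linear(512, out_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

class ModalityDiscriminator(nn.Module):
    """Second agent: predicts whether an embedding came from an image or text."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(),
                                 nn.Linear(128, 1))

    def forward(self, z):
        return self.net(z)  # raw logit: image vs. text

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Pull same-concept cross-modal pairs together; push different-concept
    pairs at least `margin` farther apart (margin value is an assumption)."""
    d_pos = 1.0 - F.cosine_similarity(anchor, positive)
    d_neg = 1.0 - F.cosine_similarity(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()

# --- toy usage with random features ---
img_proj, txt_proj = SubspaceProjector(4096), SubspaceProjector(300)
disc = ModalityDiscriminator()

img = torch.randn(8, 4096)                               # e.g. CNN image features
txt, txt_neg = torch.randn(8, 300), torch.randn(8, 300)  # text features

z_img, z_txt, z_neg = img_proj(img), txt_proj(txt), txt_proj(txt_neg)

# Triplet constraint in one subspace (image anchor shown here).
l_tri = triplet_loss(z_img, z_txt, z_neg)

# Adversarial objective: the discriminator labels image=1, text=0,
# while the projectors are trained to make the two indistinguishable.
logits = disc(torch.cat([z_img, z_txt]))
labels = torch.cat([torch.ones(8, 1), torch.zeros(8, 1)])
l_adv = F.binary_cross_entropy_with_logits(logits, labels)
```

In a full training loop the two agents would alternate GAN-style: the discriminator minimizes `l_adv`, while the projectors minimize the triplet term and maximize `l_adv` to fool the discriminator.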
Acknowledgement
This work is supported by the Shenzhen Peacock Plan (20130408-183003656), the Shenzhen Key Laboratory for Intelligent Multimedia and Virtual Reality (ZDSYS201703031405467), and the National Natural Science Foundation of China (NSFC, No. U1613209).
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Xia, Y., Wang, W., Han, L. (2018). Dual Subspaces with Adversarial Learning for Cross-Modal Retrieval. In: Hong, R., Cheng, W.H., Yamasaki, T., Wang, M., Ngo, C.W. (eds) Advances in Multimedia Information Processing – PCM 2018. Lecture Notes in Computer Science, vol 11164. Springer, Cham. https://doi.org/10.1007/978-3-030-00776-8_60
DOI: https://doi.org/10.1007/978-3-030-00776-8_60
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00775-1
Online ISBN: 978-3-030-00776-8
eBook Packages: Computer Science, Computer Science (R0)