Abstract
Learning an effective subspace in which to measure the correlation between items from different modalities, such as images and text, is at the core of cross-modal retrieval. However, data from different modalities are imbalanced yet complementary: images contain rich spatial information, while text carries more background and contextual detail. In this paper, we propose a model with dual parallel subspaces (a visual and a textual subspace) to better preserve modality-specific information. Triplet constraints are employed to minimize the semantic gap between items of the same concept from different modalities, while maximizing the gap between image-text pairs of different concepts in the corresponding subspace. We then combine adversarial learning with the dual subspaces in a novel way, framing training as an interplay between two agents. The first agent, the dual subspaces with similarity merging and concept prediction, aims to narrow the difference between the data distributions of the two modalities under the constraint of concept invariance, so as to fool the second agent, a modality discriminator that tries to distinguish images from text. Extensive experiments on the Wikipedia and NUS-WIDE-10k datasets verify the effectiveness of the proposed model for cross-modal retrieval, where it outperforms state-of-the-art methods.
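To make the two-agent objective concrete, the following is a minimal PyTorch sketch of the components the abstract describes: subspace projectors, a triplet constraint, and a modality discriminator. The layer sizes, feature dimensions, margin value, and module names are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubspaceProjector(nn.Module):
    """Maps one modality's features into a learned subspace (sizes assumed)."""
    def __init__(self, in_dim, out_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(),
                                 nn.Linear(512, out_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

class ModalityDiscriminator(nn.Module):
    """Second agent: predicts whether an embedding came from an image or text."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(),
                                 nn.Linear(128, 1))

    def forward(self, z):
        return self.net(z)  # raw logit: image vs. text

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Pull same-concept cross-modal pairs together; push different-concept
    pairs at least `margin` farther apart (margin value is an assumption)."""
    d_pos = 1.0 - F.cosine_similarity(anchor, positive)
    d_neg = 1.0 - F.cosine_similarity(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()

# --- toy usage with random features ---
img_proj, txt_proj = SubspaceProjector(4096), SubspaceProjector(300)
disc = ModalityDiscriminator()

img = torch.randn(8, 4096)                               # e.g. CNN image features
txt, txt_neg = torch.randn(8, 300), torch.randn(8, 300)  # text features

z_img, z_txt, z_neg = img_proj(img), txt_proj(txt), txt_proj(txt_neg)

# Triplet constraint in one subspace (image anchor shown here).
l_tri = triplet_loss(z_img, z_txt, z_neg)

# Adversarial objective: the discriminator labels image=1, text=0,
# while the projectors are trained to make the two indistinguishable.
logits = disc(torch.cat([z_img, z_txt]))
labels = torch.cat([torch.ones(8, 1), torch.zeros(8, 1)])
l_adv = F.binary_cross_entropy_with_logits(logits, labels)
```

In a full training loop the two agents would alternate GAN-style: the discriminator minimizes `l_adv`, while the projectors minimize the triplet term and maximize `l_adv` to fool the discriminator.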
Acknowledgement
This work is supported by the Shenzhen Peacock Plan (20130408-183003656), the Shenzhen Key Laboratory for Intelligent Multimedia and Virtual Reality (ZDSYS201703031405467), and the National Natural Science Foundation of China (NSFC, No. U1613209).
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Xia, Y., Wang, W., Han, L. (2018). Dual Subspaces with Adversarial Learning for Cross-Modal Retrieval. In: Hong, R., Cheng, W.H., Yamasaki, T., Wang, M., Ngo, C.W. (eds) Advances in Multimedia Information Processing – PCM 2018. Lecture Notes in Computer Science, vol 11164. Springer, Cham. https://doi.org/10.1007/978-3-030-00776-8_60
DOI: https://doi.org/10.1007/978-3-030-00776-8_60
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00775-1
Online ISBN: 978-3-030-00776-8
eBook Packages: Computer Science, Computer Science (R0)