Semi-supervised cross-modal learning for cross modal retrieval and image annotation

Zou, Fuhao; Bai, Xingqiang; Luan, Chaoyang; Li, Kai; Wang, Yunfei; Ling, Hefei

doi:10.1007/s11280-018-0581-2

Published: 13 July 2018

Volume 22, pages 825–841, (2019)

World Wide Web

Abstract

Multimedia data are usually associated with multiple modalities represented by heterogeneous features. Many information retrieval tasks are no longer restricted to a single modality, and content-based cross-modal retrieval has become a popular research field. The premise of cross-modal retrieval is to discover the relationships between different modalities efficiently. Although some approaches have been proposed to address this challenging problem, they either ignore the valuable labels or depend heavily on completely labeled training data. In addition, for features with relatively high dimensionality, it is important to select the most informative ones. In this paper, we propose a semi-supervised algorithm for cross-modal learning. Our algorithm makes full use of both a small amount of labeled data and abundant unlabeled data to establish connections between modalities by discovering a shared semantic space. In addition, our algorithm automatically filters out noisy and redundant features to further improve the model. Finally, we give an efficient solution to the objective function. Experiments on two publicly available datasets demonstrate that the proposed method is competitive with, or even superior to, its state-of-the-art counterparts.
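
The abstract does not state the objective function, so the following is only an illustrative sketch of the two ideas it names, not the authors' method. It assumes linear projections into the shared semantic space, cosine similarity for ranking, and row-wise sparsity of a projection matrix (measured by the l2,1-norm) as the feature-filtering signal. The matrices `W_img` and `W_txt`, the toy vectors, and all function names are hypothetical.

```python
import math

def project(W, x):
    """Map feature vector x (length d) into the shared space using W (d x k), i.e. W^T x."""
    k = len(W[0])
    return [sum(W[i][j] * x[i] for i in range(len(x))) for j in range(k)]

def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(u * v for u, v in zip(a, b))
    na = math.sqrt(sum(u * u for u in a))
    nb = math.sqrt(sum(v * v for v in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def row_norms(W):
    """Euclidean norm of each row of W (one row per input feature)."""
    return [math.sqrt(sum(v * v for v in row)) for row in W]

def l21_norm(W):
    """l2,1-norm: the sum of the row norms. Penalizing it pushes whole rows to zero."""
    return sum(row_norms(W))

def selected_features(W, tol=1e-6):
    """Indices of features whose row in W is not (numerically) zero."""
    return [i for i, n in enumerate(row_norms(W)) if n > tol]

def retrieve(query, W_query, candidates, W_cand):
    """Rank candidates of one modality against a query of another, best match first."""
    q = project(W_query, query)
    scores = [cosine(q, project(W_cand, c)) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)

# Hypothetical learned projections: the third image feature has a zero row,
# so it contributes nothing to the shared space (it has been "filtered out").
W_img = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
W_txt = [[1.0, 0.0], [0.0, 1.0]]

image_query = [0.9, 0.1, 5.0]          # large value in the ignored feature
texts = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]

print(selected_features(W_img))
print(retrieve(image_query, W_img, texts, W_txt))
```

In this toy setup the large value in the third image feature does not affect the ranking, because its row in `W_img` is zero. A real system would learn both projection matrices from labeled and unlabeled data rather than set them by hand.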

Acknowledgments

This work is supported in part by the National Natural Science Foundation of China under Grant Nos. 61672254 and 61300222, the Key Project of the National Natural Science Foundation of China under Grant No. U1536203, the Natural Science Foundation of Hubei Province under Grant No. 2015CFB687, and the Fundamental Research Funds for the Central Universities, HUST: 2016YXMS088. The authors appreciate the valuable suggestions from the anonymous reviewers and the Editors.

Author information

Authors and Affiliations

School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, 430074, China
Fuhao Zou, Xingqiang Bai, Chaoyang Luan, Kai Li, Yunfei Wang & Hefei Ling

Corresponding author

Correspondence to Fuhao Zou.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article belongs to the Topical Collection: Special Issue on Deep vs. Shallow: Learning for Emerging Web-scale Data Computing and Applications

Guest Editors: Jingkuan Song, Shuqiang Jiang, Elisa Ricci, and Zi Huang

About this article

Cite this article

Zou, F., Bai, X., Luan, C. et al. Semi-supervised cross-modal learning for cross modal retrieval and image annotation. World Wide Web 22, 825–841 (2019). https://doi.org/10.1007/s11280-018-0581-2

Received: 15 June 2017
Revised: 19 April 2018
Accepted: 26 April 2018
Published: 13 July 2018
Issue Date: 15 March 2019
DOI: https://doi.org/10.1007/s11280-018-0581-2
