Abstract
Multimedia data are usually associated with multiple modalities represented by heterogeneous features. Many information retrieval tasks are no longer restricted to a single modality, and content-based cross-modal retrieval has become a popular research field. The premise of cross-modal retrieval is discovering the relationships between different modalities efficiently. Although some approaches have been proposed to address this challenging problem, they either ignore the valuable labels or depend heavily on completely labeled training data. In addition, for features with relatively high dimensionality, it is of great importance to select the most informative ones. In this paper, we propose a semi-supervised algorithm for cross-modal learning. Our algorithm makes full use of both a small amount of labeled data and abundant unlabeled data to establish connections between modalities by discovering a shared semantic space. In addition, our algorithm automatically filters out noisy and redundant features to further improve the model. Finally, we give an efficient solution to the objective function. Experiments on two publicly available datasets demonstrate that the proposed method is competitive with or even superior to its state-of-the-art counterparts.
Acknowledgments
This work is supported in part by the National Natural Science Foundation of China under Grant Nos. 61672254 and 61300222, the Key Project of the National Natural Science Foundation of China under Grant No. U1536203, the Natural Science Foundation of Hubei Province under Grant No. 2015CFB687, and the Fundamental Research Funds for the Central Universities, HUST: 2016YXMS088. The authors appreciate the valuable suggestions from the anonymous reviewers and the editors.
Additional information
This article belongs to the Topical Collection: Special Issue on Deep vs. Shallow: Learning for Emerging Web-scale Data Computing and Applications
Guest Editors: Jingkuan Song, Shuqiang Jiang, Elisa Ricci, and Zi Huang
About this article
Cite this article
Zou, F., Bai, X., Luan, C. et al. Semi-supervised cross-modal learning for cross modal retrieval and image annotation. World Wide Web 22, 825–841 (2019). https://doi.org/10.1007/s11280-018-0581-2
DOI: https://doi.org/10.1007/s11280-018-0581-2