Abstract
Cross-modal retrieval has recently drawn much attention in multimedia analysis, yet it remains a challenging topic, mainly owing to the heterogeneous nature of multi-modal data. In this paper, we propose flexible supervised collective matrix factorization hashing (FS-CMFH) for efficient cross-modal retrieval. First, we exploit a flexible collective matrix factorization framework to jointly learn an individual latent space with similar semantics for each modality. Meanwhile, label consistency across different modalities is exploited to preserve both intra-modal and inter-modal semantics within these similar latent semantic spaces. Accordingly, these two ingredients are formulated as a joint graph regularization term in an overall objective function, through which similar hash codes for the different modalities of an instance can be discriminatively obtained to flexibly characterize that instance. As a result, the derived hash codes, which carry higher discriminative power, are able to improve cross-modal search accuracy significantly. Extensive experiments on three popular benchmark datasets show that the proposed approach performs favorably compared with state-of-the-art competing approaches.







Acknowledgments
The work was supported by the National Science Foundation of China under Grants 61673185, 61572205 and 61673186, the National Science Foundation of Fujian Province (No. 2017J01112), and the Promotion Program for Young and Middle-aged Teacher in Science and Technology Research (No. ZQN-PY309).
Appendix
In this section, we show the equivalent derivations for (12), (13) and (14), respectively. Let \(\mathbf {X}=\{\mathbf {x}_{i}\}_{i = 1}^{n}\in \mathbb {R}^{d\times n}\) and \(\mathbf {Y}=\{\mathbf {y}_{i}\}_{i = 1}^{n}\in \mathbb {R}^{d\times n}\), and let \(\mathbf {w}(i,j)\) denote the similarity measurement between \(\mathbf {x}_{i}\) and \(\mathbf {x}_{j}\). We first construct a function \(\boldsymbol {{\Phi }}\) of the following form:
Accordingly, the function \(\boldsymbol {{\Phi }}\) can also be generalized as:
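As a sketch of what such a function can look like, assume that \(\mathbf {w}_{ij}\) weights the squared distance between the \(i\)-th column of \(\mathbf {X}\) and the \(j\)-th column of \(\mathbf {Y}\) (taking \(\mathbf {Y}=\mathbf {X}\) covers the single-modality case). This form is an assumption, chosen because it is consistent with the coefficients \(\mathbf {A}\), \(\mathbf {B}\), \(\mathbf {E}\) and \(\mathbf {F}\) identified later in this appendix. The symbols \(\mathbf {D}_{r}\) and \(\mathbf {D}_{c}\) are introduced here only for illustration: they denote the diagonal matrices of row sums and column sums of \(\mathbf {W}\), which coincide when \(\mathbf {W}\) is symmetric. Expanding the squared norm gives:

\[
\begin{aligned}
\boldsymbol{\Phi} &= \sum_{i=1}^{n}\sum_{j=1}^{n} \mathbf{w}_{ij}\,\|\mathbf{x}_{i}-\mathbf{y}_{j}\|_{2}^{2} \\
&= \sum_{i=1}^{n}\Big(\sum_{j=1}^{n}\mathbf{w}_{ij}\Big)\mathbf{x}_{i}^{\mathrm{T}}\mathbf{x}_{i}
 \;-\; 2\sum_{i=1}^{n}\sum_{j=1}^{n}\mathbf{w}_{ij}\,\mathbf{x}_{i}^{\mathrm{T}}\mathbf{y}_{j}
 \;+\; \sum_{j=1}^{n}\Big(\sum_{i=1}^{n}\mathbf{w}_{ij}\Big)\mathbf{y}_{j}^{\mathrm{T}}\mathbf{y}_{j} \\
&= tr(\mathbf{X}\mathbf{D}_{r}\mathbf{X}^{\mathrm{T}}) - tr(\mathbf{X}\mathbf{W}\mathbf{Y}^{\mathrm{T}}) - tr(\mathbf{Y}\mathbf{W}^{\mathrm{T}}\mathbf{X}^{\mathrm{T}}) + tr(\mathbf{Y}\mathbf{D}_{c}\mathbf{Y}^{\mathrm{T}}).
\end{aligned}
\]

The two cross terms are equal because a matrix and its transpose have the same trace, which is why the factor of 2 splits into one \(-\mathbf {W}\) term and one \(-\mathbf {W}^{\mathrm {T}}\) term.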
Therefore, \(\boldsymbol {{\Phi }}=tr(\mathbf {XAX}^{\mathrm {T}} + \mathbf {YEX}^{\mathrm {T}} + \mathbf {XBY}^{\mathrm {T}} + \mathbf {YFY}^{\mathrm {T}})\). Comparing (25) and (26), we obtain \(\mathbf {A}_{ii} = \sum \nolimits _{j = 1}^{n} \mathbf {w}_{ij} = \mathbf {D}_{ii}\), \(\mathbf {B} = - \mathbf {W}\), \(\mathbf {F}_{jj} = \sum \nolimits _{i = 1}^{n} \mathbf {w}_{ij} = \mathbf {D}_{jj}\), and \(\mathbf {E} = - \mathbf {W}^{\mathrm {T}}\). Letting \(\mathbf {U} = [\mathbf {X}~~\mathbf {Y}]\), we obtain \(\boldsymbol {{\Phi }} = tr(\mathbf {U}\mathbf {L}\mathbf {U}^{\mathrm {T}})\),
where \(\mathbf {L} = \left[ \begin{array}{cc} \mathbf {D} & -\mathbf {W} \\ -\mathbf {W}^{\mathrm {T}} & \mathbf {D} \end{array} \right]\). According to this general formulation, the three terms in (11) can be directly converted as:
where \(\mathbf {D}_{1}\), \(\mathbf {D}_{2}\), \(\mathbf {D}_{3}\in {\mathbb {R}}^{n\times n}\) are diagonal matrices whose entries are the column sums of \(\lambda _{1} \mathbf {R}^{(1)}\), \(\lambda _{2} \mathbf {R}^{(2)}\) and \(\mathbf {C}\), respectively.
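The trace form used in this appendix can be checked numerically. The following is a minimal sketch, not the authors' code. It assumes that \(\boldsymbol {{\Phi }}\) is the weighted pairwise sum \(\sum _{i,j} \mathbf {w}_{ij}\|\mathbf {x}_{i}-\mathbf {y}_{j}\|^{2}\), it uses random data, and the function names `pairwise_objective`, `block_laplacian` and `trace_objective` are hypothetical. Row sums and column sums of the weight matrix are kept separate so that the check also holds for a non-symmetric weight matrix; for a symmetric one the two diagonal blocks are identical, as in the block matrix above.

```python
import numpy as np


def pairwise_objective(X, Y, W):
    """Directly compute sum_ij W[i, j] * ||X[:, i] - Y[:, j]||^2."""
    diff = X[:, :, None] - Y[:, None, :]   # shape (d, n, n)
    sq_dist = np.sum(diff ** 2, axis=0)    # shape (n, n)
    return float(np.sum(W * sq_dist))


def block_laplacian(W):
    """Build the block matrix [[D_row, -W], [-W^T, D_col]]."""
    D_row = np.diag(W.sum(axis=1))         # diagonal of row sums
    D_col = np.diag(W.sum(axis=0))         # diagonal of column sums
    return np.block([[D_row, -W], [-W.T, D_col]])


def trace_objective(X, Y, W):
    """Compute tr(U L U^T) with U = [X Y] stacked column-wise."""
    U = np.hstack([X, Y])                  # shape (d, 2n)
    return float(np.trace(U @ block_laplacian(W) @ U.T))


# Illustrative random inputs (sizes are arbitrary).
rng = np.random.default_rng(0)
d, n = 4, 6
X = rng.standard_normal((d, n))
Y = rng.standard_normal((d, n))
W = rng.random((n, n))                     # non-negative similarity weights

direct = pairwise_objective(X, Y, W)
traced = trace_objective(X, Y, W)
print(np.isclose(direct, traced))
```

The trace form is convenient because it replaces a double loop over instance pairs with a few matrix products, and the same construction applies to each weight matrix in turn.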
Cite this article
Liu, X., Li, A., Du, JX. et al. Efficient cross-modal retrieval via flexible supervised collective matrix factorization hashing. Multimed Tools Appl 77, 28665–28683 (2018). https://doi.org/10.1007/s11042-018-6006-5
DOI: https://doi.org/10.1007/s11042-018-6006-5