Cluster-sensitive Structured Correlation Analysis for Web cross-modal retrieval
Introduction
Millions of Web users produce diverse online content in multiple modalities every day, e.g., textual documents and visual images. Rather than being carried by a single medium, knowledge is delivered through different modalities with rich context and structure information, which is known as cross media [1], [2], [3]. In this new information carrier, Web topics and events are described by semantically related documents from different modalities, which provide complementary explanations from different aspects. For instance, the concept “tiger” can be described by a tiger head in an image and by a textual description of the life of a tiger. On the one hand, Web users need to retrieve content of heterogeneous modalities. On the other hand, a user-centric retrieval system should support more flexible query input and more versatile data retrieval. Therefore, developing effective cross-modal retrieval models for cross media has become a very interesting yet challenging problem.
As a well-established paradigm for modeling cross-modal correlation, low-dimensional subspaces that maximize the correlation between two modalities can be learned using canonical correlation analysis (CCA) [4] and partial least squares (PLS) [5]. However, although much effort has been devoted to improving these correlation models [6], [7], [8], [9], [10], they are still not capable of learning the correlation among cross-modal data from the Web. In general, the main technical challenges for developing robust correlation models can be analyzed from several aspects.
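To make the CCA baseline concrete, the following is a minimal NumPy sketch of the textbook formulation (whitening each modality, then taking an SVD of the whitened cross-covariance). It is not the authors' implementation; the function name `cca`, the ridge term `reg`, and the synthetic data are assumptions chosen only for illustration.

```python
import numpy as np

def cca(X, Y, n_components=2, reg=1e-6):
    """Textbook CCA via whitening and SVD (illustrative sketch only)."""
    # Center each modality.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    # Covariances with a small ridge term (an assumption, for stability).
    Cxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / (n - 1)
    # Cholesky factors: Cxx = Lx Lx^T and Cyy = Ly Ly^T.
    Lx = np.linalg.cholesky(Cxx)
    Ly = np.linalg.cholesky(Cyy)
    # Whitened cross-covariance T = Lx^{-1} Cxy Ly^{-T}.
    T = np.linalg.solve(Lx, Cxy)
    T = np.linalg.solve(Ly, T.T).T
    U, s, Vt = np.linalg.svd(T, full_matrices=False)
    # Map singular vectors back to the original feature spaces.
    Wx = np.linalg.solve(Lx.T, U[:, :n_components])
    Wy = np.linalg.solve(Ly.T, Vt.T[:, :n_components])
    return Wx, Wy, s[:n_components]

# Synthetic two-modality data sharing a 2-D latent signal (hypothetical).
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 2))
X = z @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(500, 5))
Y = z @ rng.normal(size=(2, 4)) + 0.1 * rng.normal(size=(500, 4))
Wx, Wy, corrs = cca(X, Y)
print(corrs)  # canonical correlations of the two leading directions
```

A single pair of projection matrices `Wx`, `Wy` is shared by all documents here, which is the "global" behaviour that the challenges discussed next call into question.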
First, the topic and content distribution of cross media is complex and divergent. The research challenge is twofold:
- Intra-modal divergence: Given a topic or concept, the related documents are divergent within one modality. For example, the concept “Apple” may be related to content from multiple domains such as food, plant, art, industry and hi-tech (see Fig. 1). The intra-modal divergence makes it difficult to represent the wide range of content genres with a unified subspace.
- Inter-modal divergence: The physical structures of features differ drastically across modalities, and also across multiple features within one modality, as shown in the bottom part of Fig. 1. Therefore, it is hard to find subspaces in which similarities among data from different modalities can be calculated directly.
In previous studies on correlation learning [6], [8], [11], [12], [13], unified subspace learning is the well-studied paradigm, which assumes a Gaussian or Laplacian prior on the projection function parameters. Such models are not flexible enough to deal with the content divergence in Web data. Intermediate shared latent topic spaces have been learned with various probabilistic graphical models [14], [15], [16] to tackle the content divergence problem, but they suffer from the high computational cost of parameter inference. As another possible solution, the localized approach [17] tends to achieve a much finer-grained correlation model with many more model parameters, but it is too sensitive to the ubiquitous noise.
The content divergence can also be observed in the high-dimensional structured representation. For example, the vocabularies and writing styles (i.e., word frequencies and orders) of different textual documents are diverse, and images can be represented by complementary visual features, such as color, texture, shape and Bag-of-Visual-Words. Intuitively, the importance of different feature dimensions should be topic dependent. For instance, the words (black, white) in the textual representation are closely related to the color histogram in the visual representation for documents related to “Apple products”. However, the words (chunk, leaf) are more related to the visual texture features of images describing “apple tree”. Unfortunately, such topic-specific relations cannot be well captured by global correlation models, even with complicated structured input and output regularization [11], [18], [19].
Another critical issue for correlation learning on Web data is correspondence missing. Specifically, there may be no explicit corresponding cross-modal documents. For example, on Wikipedia, there are many pages with only textual content but no images, and there are a certain number of textual paragraphs without a corresponding image description. However, the potential complementary cross-modal descriptions of these textual paragraphs may be found in other Web data corpora, e.g., social media photos. Correspondence missing is similar to the setting of semi-supervised learning, where a certain level of label information is assumed to be missing.
The correspondence information is usually assumed to be fully provided in existing studies [6], [10], [20], [21], and the one-to-one alignment of multiple modalities is enforced in the correlation learning objectives. This assumption is overly strict, which makes the correlation models too sensitive to noise and vacancies in the correspondence information. Introducing both intra-modal similarity and side information provides a good remedy for correspondence missing [16], [22], [23], but its potential has not been fully exploited by existing global subspace learning strategies, which may instead result in over-smoothed correlation models.
We address the challenges of content divergence and correspondence missing, and propose a new correlation learning approach based on an ensemble of multiple cross-modal sub-models, as shown in Fig. 2. First, we model the cross-modal correlation at the sub-topic level, where the sub-topic structure is reflected by the cluster distribution on each modality. The cluster-sensitive transformation is determined by both the transformation function on each cluster (i.e., sub-topic) and the membership between the documents and the clusters. Compared to existing approaches, our method achieves a smaller model bias than global projection learning [4], and a smaller model variance than localized projection learning [17]. Therefore, such a bias–variance trade-off leads to a smaller expected model error. To further deal with high-dimensional multi-modal representations of diversified content genres, we apply sparsity [6], [13] and structured sparsity constraints [11], [12] on each sub-model. Consequently, we obtain a set of interpretable cross-modal subspaces where each dimension is a combination of only part of the feature dimensions.
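The idea that a document's transformation depends on both per-cluster projections and its cluster memberships can be sketched as follows. This is only an illustration under stated assumptions: the softmax-over-distances membership, the names `soft_membership` and `cluster_sensitive_project`, and the random centroids and projection matrices are hypothetical and are not taken from the paper, whose exact membership and learning procedure are not reproduced here.

```python
import numpy as np

def soft_membership(X, centroids, temperature=1.0):
    """Soft cluster memberships from squared distances (assumed form)."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)  # (N, K)
    logits = -d2 / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    M = np.exp(logits)
    return M / M.sum(axis=1, keepdims=True)      # each row sums to 1

def cluster_sensitive_project(X, M, W):
    """Membership-weighted combination of per-cluster projections.

    X: (N, D) features, M: (N, K) memberships, W: (K, D, d) sub-models.
    Returns Z with Z[n] = sum_k M[n, k] * (X[n] @ W[k]).
    """
    return np.einsum("nk,nd,kde->ne", M, X, W)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))          # 6 documents, 4 feature dimensions
centroids = rng.normal(size=(3, 4))  # K = 3 hypothetical sub-topic centroids
W = rng.normal(size=(3, 4, 2))       # one 4 -> 2 projection per cluster
M = soft_membership(X, centroids)
Z = cluster_sensitive_project(X, M, W)
print(Z.shape)
```

With a single cluster (K = 1) the combination collapses to one global projection, while one-hot memberships give each document exactly the projection of its own cluster; intermediate settings interpolate between these two extremes.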
Second, to compensate for the correspondence missing, we take full advantage of both intra-modal affinity and inter-modal co-occurrence, which have been shown to be equally important by [8], [22], [23]. By encoding the intra-modal relation, the correspondence information can be appropriately propagated to neighboring data, making the transformed representation more semantically consistent. Moreover, our method achieves better model generality by using the intra-modal affinity to penalize the unsmooth projection brought by multiple sub-models. Therefore, the correspondence missing can be effectively alleviated, and the model robustness can be enhanced.
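One common way to penalize unsmooth projections with an intra-modal affinity is a graph-Laplacian regularizer, sketched below. This is a standard technique swapped in for illustration; the paper's exact regularization term is not shown in this section, and the k-nearest-neighbour Gaussian affinity, the function names, and all parameters are assumptions.

```python
import numpy as np

def knn_affinity(X, k=3, sigma=1.0):
    """Symmetric k-nearest-neighbour Gaussian affinity (assumed form)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    A = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(A, 0.0)                 # no self-affinity
    idx = np.argsort(-A, axis=1)[:, :k]      # k strongest neighbours per row
    mask = np.zeros(A.shape, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=1)
    return np.where(mask | mask.T, A, 0.0)   # keep the graph symmetric

def smoothness_penalty(Z, A):
    """tr(Z^T L Z) with L = D - A, i.e. 0.5 * sum_ij A_ij ||z_i - z_j||^2."""
    L = np.diag(A.sum(axis=1)) - A
    return float(np.trace(Z.T @ L @ Z))

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 5))   # original features of one modality
Z = rng.normal(size=(8, 2))   # hypothetical projected representation
A = knn_affinity(X)
penalty = smoothness_penalty(Z, A)
print(penalty)
```

The penalty is small when documents that are close in the original feature space remain close after projection, which is how neighbouring documents can inherit correspondence information from paired ones.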
In summary, we construct a set of transformation sub-models where the number of projection sub-models for each data modality is equal to the number of clusters. Data from each modality can be projected by the weighted combination of sub-models. The new representation maximizes the correlation of different modalities and measures the intra-modal relation in a more appropriate manner. The advantage over existing correlation models is that the learned transformation is topic sensitive, leading to better adaptability to topic divergence. Our approach is more robust to noise than localized correlation models. It can also be seen as a generalization of unified correlation models [4] and localized models [17]. The trade-off between model bias and variance can be well controlled by adjusting the number of clusters. The key technical contributions can be summarized as follows:
- We propose a new correlation subspace learning method for cross-modal retrieval. It better fits the content divergence by learning a set of cluster-sensitive correlation sub-models. By applying structured sparsity regularization, the learned projection is more interpretable compared to dense correlation models. Our method achieves a better trade-off between model bias and variance.
- By encoding the intra-modal information with correlation sub-models, our model is more robust to the correspondence missing than traditional approaches that only leverage the content co-occurrence.
- Extensive experiments on two large-scale cross-modal datasets demonstrate the advantages of our approach. With moderate model training complexity, our method achieves at least 20% higher performance in Mean Average Precision than state-of-the-art approaches.
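For readers unfamiliar with the Mean Average Precision metric mentioned in the last contribution, a minimal sketch of its usual definition is given below. The binary relevance lists and function names are hypothetical, and the paper's exact evaluation protocol (for example, any rank cutoff or how relevance is defined) is not reproduced here.

```python
import numpy as np

def average_precision(relevance):
    """AP for one query, given binary relevance in ranked order."""
    relevance = np.asarray(relevance, dtype=float)
    n_rel = relevance.sum()
    if n_rel == 0:
        return 0.0                       # no relevant item retrieved
    hits = np.cumsum(relevance)          # relevant items seen so far
    ranks = np.arange(1, len(relevance) + 1)
    # Average the precision values at the ranks of the relevant items.
    return float((hits / ranks * relevance).sum() / n_rel)

def mean_average_precision(ranked_relevance_lists):
    """MAP: mean of the per-query average precision values."""
    return float(np.mean([average_precision(r) for r in ranked_relevance_lists]))

# Two hypothetical queries: 1 marks a relevant result, 0 an irrelevant one.
queries = [[1, 0, 1, 0], [0, 1]]
score = mean_average_precision(queries)
print(score)
```

In a cross-modal setting, each query would be a document from one modality (e.g., a text) and the ranked list would contain documents from the other modality (e.g., images) ordered by similarity in the learned subspace.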
The rest of the paper is organized as follows. In Section 2, we briefly review related work. In Sections 3 (Approach) and 4 (Model solution), we introduce our approach and implementation details. We describe the experiments in Section 5, and conclude the paper in Section 6.
Section snippets
Related work
CCA [4] is the first study on how to seek optimal basis vectors for two sets of variables to model the multi-modal correlation. It is used in various problems, such as cross language analysis [24] and Socio-Economic Transition [9]. PLS [5] aims to find a linear regression model by projecting the predicted variables and the observable variables to a new space, which is equivalent to CCA in many situations [25]. Such models are further extended to a regularized correlation learning framework
Overview
We are given two sets of data from two modalities $X \in \mathbb{R}^{N_x \times D_x}$ and $Y \in \mathbb{R}^{N_y \times D_y}$, respectively, where $N_x$ or $N_y$ represents the number of training samples for each modality, and $D_x$ or $D_y$ represents the number of dimensions. Typically, we assume that both $X$ and $Y$ are zero-centered and norm-bounded. Without loss of generality, we denote each sample row vector as $x_i$ or $y_j$, and $X_S$ or $Y_S$ as the sub-matrices where the row entry indices are included in set $S$. Consistently, we denote the row indices subset with
Sub-problem optimization
Due to the complex cluster-sensitive projection functions, the objective function in (10) cannot be directly minimized with any existing off-the-shelf convex optimization toolbox, because a set of projection vector pairs rather than a single pair must be learned in our cluster-sensitive model. We borrow the idea from [6] and develop a special-purpose bilateral optimization for our model, which alternately optimizes and . First, given an initialized and l2-norm normalized and , we
Experiments
Datasets: We conduct experiments on two datasets: the ImageClef 2010 dataset and the dataset collected by [41] from Wikipedia. The ImageClef data consists of 223 065 image and text document pairs after noise cleansing, where images and texts with identical document IDs imply that they serve as the complementary descriptions of each other, i.e., the correspondence information. We randomly select 50K images with their corresponding text as the multi-modal test dataset and the rest as the training
Conclusions
We propose a cluster-sensitive structured correlation learning framework for cross-modal retrieval. Multiple cluster-sensitive correlation sub-models are learned instead of a unified correlation model, which better fits the content divergence in different modalities. By using structure sparsity regularization on the projection vectors, a set of interpretable structure sparse correlation sub-models are obtained. To deal with correspondence information missing, we take full advantage of both
Acknowledgment
This work was supported in part by the National Basic Research Program of China (973 Program): (2012CB316400) and (2015CB351802), the 863 Program of China: (2014AA015202), and the National Natural Science Foundation of China (NSFC): (61025011), (61303160), (61332016), (61390511), (61322212), (61473273) and (61429201). This work was also supported in part, for Prof. Qi Tian, by ARO Grant W911NF-12-1-0057 and by Faculty Research Awards from NEC Laboratories of America.
References (42)
- Canonical ridge and econometrics of joint production, J. Econom. (1976)
- et al., A unified framework for multimodal retrieval, Pattern Recognit. (2013)
- F. Wu, H. Zhang, Y. Zhuang, Learning semantic correlations for cross media retrieval, in: ICIP,...
- et al., Mining semantic correlation of heterogeneous multimedia data for cross-media retrieval, Trans. Multimed. (2008)
- et al., Manifold learning based cross-media retrieval: a solution to media object complementary nature, J. VLSI Signal Process. (2007)
- Relations between two sets of variates, Biometrika (1936)
- H. Wold, Partial least squares, in: Samuel Kotz, Norman L. Johnson (Eds.), Encyclopedia of Statistical Sciences, vol....
- et al., A penalized matrix decomposition with applications to sparse principal components and canonical correlation analysis, Biostatistics (2009)
- et al., Sparse canonical correlation analysis, Mach. Learn. (2011)
- M.B. Blaschko, C.H. Lampert, A. Gretton, Semi-supervised Laplacian regularization of kernel canonical correlation...
- Regularized generalized canonical correlation analysis, Psychometrika
- Large-margin predictive latent subspace learning for multi-view data analysis, Trans. Pattern Anal. Mach. Intell.
- Multi-view metric learning with global consistency and local smoothness, ACM Trans. Intell. Syst. Technol.
Shuhui Wang received the B.S. degree in Electronic Engineering from Tsinghua University, Beijing, China, in 2006, and the Ph.D. degree from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, in 2012. He is currently an Assistant Professor with the Institute of Computing Technology, Chinese Academy of Sciences, Beijing. He is also with the Key Laboratory of Intelligent Information Processing, Chinese Academy of Sciences. His research interests include semantic image analysis, image and video retrieval and large-scale web multimedia data mining.
Fuzhen Zhuang received the B.S. degree in Computer Science from Chongqing University, China, in 2006, and the Ph.D. degree from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, in 2011. He is currently an Associate Professor in the Institute of Computing Technology, Chinese Academy of Sciences. His research interests include transfer learning, machine learning, data mining, and parallel classification algorithms.
Shuqiang Jiang received the M.S. degree from the College of Information Science and Engineering, Shandong University of Science and Technology, Shandong, China, in 2000, and the Ph.D. degree from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, in 2005. He is currently a Full Professor with the Institute of Computing Technology, Chinese Academy of Sciences, Beijing. He is also with the Key Laboratory of Intelligent Information Processing, Chinese Academy of Sciences. His research interests include multimedia processing and semantic understanding, pattern recognition, and computer vision. He has authored or coauthored more than 90 papers on the related research topics.
Qingming Huang received the B.S. degree in Computer Science and Ph.D. degree in Computer Engineering from Harbin Institute of Technology, Harbin, China, in 1988 and 1994, respectively. He is currently a Professor with the Graduate University of the Chinese Academy of Sciences (CAS), Beijing, China, and an Adjunct Research Professor with the Institute of Computing Technology, CAS. He has authored or coauthored nearly 200 academic papers in prestigious international journals and conferences. His research areas include multimedia video analysis, video adaptation, image processing, computer vision, and pattern recognition. Dr. Huang is a reviewer for IEEE Transactions on Multimedia, IEEE Transactions on Circuits and Systems for Video Technology, and IEEE Transactions on Communications. He has served as program chair, track chair and TPC member for various conferences, including ACM Multimedia, CVPR, ICCV, ICME, and PSIVT.
Qi Tian received the B.E. degree in Electronic Engineering from Tsinghua University, China, in 1992 and the Ph.D. degree in Electrical and Computer Engineering from the University of Illinois, Urbana-Champaign in 2002. He is currently a Full Professor in the Department of Computer Science at the University of Texas at San Antonio (UTSA). His research interests include multimedia information retrieval and computer vision. He has published over 150 refereed journal and conference papers. His research projects were funded by NSF, ARO, DHS, SALSI, CIAS, and UTSA, and he also received faculty research awards from Google, NEC Laboratories of America, FXPAL, Akiira Media Systems, and HP Labs. He took a one-year faculty leave at Microsoft Research Asia (MSRA) during 2008–2009. He was the author of a Top 10% Best Paper Award in MMSP 2011, a Best Student Paper in ICASSP 2006, and a Best Paper Candidate in PCM 2007. He received the 2010 ACM Service Award. He has served as Program Chair, Organization Committee Member and TPC member for numerous IEEE and ACM Conferences including ACM Multimedia, SIGIR, ICCV, and ICME. He is a Guest Editor of IEEE Transactions on Multimedia, Journal of Computer Vision and Image Understanding, Pattern Recognition Letter, EURASIP Journal on Advances in Signal Processing, and Journal of Visual Communication and Image Representation, and is on the Editorial Board of IEEE Transactions on Circuit and Systems for Video Technology (TCSVT), Journal of Multimedia (JMM) and Journal of Machine Visions and Applications (MVA). He is a Senior Member of IEEE and a Member of ACM.