Cluster-sensitive Structured Correlation Analysis for Web cross-modal retrieval

doi:10.1016/j.neucom.2015.05.049

Neurocomputing

Volume 168, 30 November 2015, Pages 747-760

https://doi.org/10.1016/j.neucom.2015.05.049

Abstract

Modern cross-modal retrieval technology is required to find semantically relevant content from heterogeneous modalities. Because previous studies construct unified dense correlation models on small-scale cross-modal data, they are not capable of processing large-scale Web data, for three reasons: (a) the content of Web cross media is divergent; (b) the topic-sensitive structure information in the high dimensional space is neglected; and (c) data must be organized as strictly corresponding pairs, a requirement that is not satisfied in real-world scenarios. To address these challenges, we propose a cluster-sensitive cross-modal correlation learning framework. First, a set of cluster-sensitive correlation sub-models is learned instead of a unified correlation model, which better fits the content divergence in different modalities. We impose structured sparsity regularization on the projection vectors to learn a set of interpretable structured sparse correlation sub-models. Second, to compensate for missing correspondence, we take full advantage of both intra-modal affinity and inter-modal co-occurrence. The projected coordinates of adjacent data within a modality tend to be similar, and the inconsistency of cluster-sensitive projection is minimized. The learned correlation model adapts to the content divergence and thus achieves better model generality and bias–variance trade-off. Extensive experiments on two large-scale cross-modal datasets demonstrate the effectiveness of our approach.

Introduction

Millions of Web users produce diverse online content of multiple modalities every day, e.g., textual documents and visual images. Instead of single media, knowledge is delivered by different modalities with rich context and structure information, which is known as cross media [1], [2], [3]. In this new information carrier, Web topics and events are described by semantically related documents from different modalities, providing complementary explanations from different aspects. For instance, the concept “tiger” can be described by a tiger head in an image and by a textual description of the life of a tiger. On one side, Web users need to retrieve content of heterogeneous modalities. On the other side, a user-centric retrieval system should support more flexible query input and more versatile data retrieval. Therefore, developing effective cross-modal retrieval models for cross media has become a very interesting yet challenging problem.

As a well-established paradigm for modeling the cross-modal correlation, the low dimensional subspaces maximizing the correlation between two modalities can be learned by using canonical correlation analysis (CCA) [4] and partial least squares (PLS) [5]. However, although much effort has been devoted to improving the correlation models [6], [7], [8], [9], [10], they are still not capable of learning the correlation among cross-modal data from the Web. In general, the main technical challenges for developing robust correlation models can be analyzed from several aspects.

First, the topic and content distribution for cross media is complex and divergent. The research challenge is twofold:

• Intra-modal divergence: Given a topic or concept, the related documents are divergent within one modality. For example, the concept “Apple” may be related to content from multiple domains such as food, plant, art, industry and hi-tech, see Fig. 1. The intra-modal divergence poses difficulties in representing the wide range of content genres with a unified subspace.
• Inter-modal divergence: The physical structures are drastically different among features from different modalities. They are also drastically different among multiple features from one modality, as shown in the bottom part of Fig. 1. Therefore, it is hard to find the subspaces to directly calculate similarities among data from different modalities.

In previous studies on correlation learning [6], [8], [11], [12], [13], unified subspace learning is the well-studied paradigm, which assumes a Gaussian or Laplacian prior on the projection function parameters. Such models are not flexible in dealing with the content divergence in Web data. Intermediate shared latent topic spaces are learned with various probabilistic graphical models [14], [15], [16] to tackle the content divergence problem, but they suffer from the high computational cost of parameter inference. As another possible solution, the localized approach [17] tends to achieve a very fine-grained correlation model with many more model parameters, but it is too sensitive to ubiquitous noise.

The content divergence can also be observed on the high dimensional structured representation. For example, the vocabularies and writing styles (i.e., word frequencies and orders) of different textual documents are diversified, and the images can be represented by complementary visual features, such as color, texture, shape and Bag-of-Visual-Words. Intuitively, the importance of different feature dimensions should be topic dependent. For instance, the words (black, white) in the textual representation have a close relationship with the color histogram in the visual representation for documents related to “Apple products”. However, the words (chunk, leaf) are more closely related to the visual texture features of images describing “apple tree”. Unfortunately, such topic-specific relations cannot be well captured by global correlation models, even with complicated structured input and output regularization [11], [18], [19].

Another critical issue for correlation learning on Web data is correspondence missing. Specifically, there may be no explicit corresponding cross-modal documents. For example, on Wikipedia, there are many pages with only textual content but no images, and there are a certain number of textual paragraphs without a corresponding image description. However, the potential complementary cross-modal descriptions of these textual paragraphs may be found in other Web data corpora, e.g., social media photos. Correspondence missing is similar to the setting of semi-supervised learning, where a certain level of label information is assumed to be missing.

The correspondence information is usually assumed to be fully provided in existing studies [6], [10], [20], [21]. The one-to-one alignment of multiple modalities is enforced in the correlation learning objectives. This assumption is overly strict, which makes the correlation models too sensitive to noise and gaps in the correspondence information. Introducing both intra-modal similarity and side information provides a good remedy for correspondence missing [16], [22], [23], but its potential has not been fully exploited by existing global subspace learning strategies, which may instead produce over-smoothed correlation models.

We address the challenges of content divergence and correspondence missing, and propose a new correlation learning approach based on an ensemble of multiple cross-modal sub-models, as shown in Fig. 2. First, we model the cross-modal correlation at the sub-topic level, where the sub-topic structure is reflected by the cluster distribution on each modality. The cluster-sensitive transformation is determined by both the transformation function on each cluster (i.e., sub-topic) and the membership between the documents and the clusters. Compared to existing approaches, our method achieves a smaller model bias than global projection learning [4], and a smaller model variance than localized projection learning [17]. Therefore, such a bias–variance trade-off leads to a smaller expected model error. To further deal with high dimensional multi-modal representation on diversified content genres, we apply sparsity [6], [13] and structured sparsity constraints [11], [12] on each sub-model. Consequently, we obtain a set of interpretable cross-modal subspaces where each dimension is a combination of a subset of the feature dimensions.
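The cluster-sensitive transformation described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the softmax-over-distances membership rule, and the `temperature` parameter are all assumptions introduced for illustration. Only the general idea, combining per-cluster projections weighted by document-to-cluster membership, comes from the text.

```python
import numpy as np

def soft_memberships(X, centroids, temperature=1.0):
    """Soft assignment of each row of X to each centroid; rows sum to 1.

    Assumption: memberships are a softmax over negative squared distances.
    The paper's actual membership definition is not shown in this excerpt.
    """
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)  # (N, K)
    logits = -d2 / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

def cluster_sensitive_project(X, U, memberships):
    """Project X with a membership-weighted combination of sub-models.

    X: (N, D) data, U: (K, D, d) one projection matrix per cluster,
    memberships: (N, K). Returns (N, d) with
    Z[i] = sum_k memberships[i, k] * (X[i] @ U[k]).
    """
    per_cluster = np.einsum('nd,kde->nke', X, U)  # (N, K, d)
    return np.einsum('nk,nke->ne', memberships, per_cluster)
```

With a single cluster (K = 1) the projection reduces to one global linear map, and with one-hot memberships each document uses only the projection of its own cluster, which mirrors the global and localized extremes that the approach is said to generalize.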

Second, to compensate for correspondence missing, we take full advantage of both intra-modal affinity and inter-modal co-occurrence, which have been shown to be equally important by [8], [22], [23]. By encoding the intra-modal relation, the correspondence information can be appropriately propagated to the neighboring data to make the transformed representation more semantically consistent. Moreover, our method achieves better model generality by using the intra-modal affinity to penalize the non-smooth projection introduced by multiple sub-models. Therefore, correspondence missing can be substantially alleviated, and the model robustness can be enhanced.
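One common way to encode intra-modal affinity is a graph Laplacian smoothness penalty on the projected coordinates. The sketch below assumes a symmetric k-nearest-neighbour graph with binary weights; this excerpt does not specify the graph the paper uses, so the graph construction and the names `knn_affinity` and `smoothness_penalty` are hypothetical.

```python
import numpy as np

def knn_affinity(X, k=3):
    """Symmetric binary k-nearest-neighbour affinity matrix (assumed graph type)."""
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)              # a point is not its own neighbour
    nbrs = np.argsort(d2, axis=1)[:, :k]      # indices of the k closest points
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k), nbrs.ravel()] = 1.0
    return np.maximum(W, W.T)                 # symmetrize

def smoothness_penalty(Z, W):
    """tr(Z^T L Z) with L = D - W, equal to 0.5 * sum_ij W_ij ||z_i - z_j||^2."""
    L = np.diag(W.sum(axis=1)) - W
    return float(np.trace(Z.T @ L @ Z))
```

The penalty is zero when all neighbouring samples share the same projected coordinates and grows as neighbours are pulled apart, so minimizing it encourages adjacent data within a modality to stay close after projection.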

In summary, we construct a set of transformation sub-models where the number of projection sub-models for each data modality is equal to the number of clusters. Data from each modality can be projected by the weighted combination of sub-models. The new representation maximizes the correlation of different modalities and measures the intra-modal relation in a more appropriate manner. The advantage over existing correlation models is that the learned transformation is topic sensitive, leading to better adaptability to topic divergence. Our approach is more robust to noise than localized correlation models. It can also be seen as a generalization of unified correlation models [4] and localized models [17]. The trade-off between model bias and variance can be controlled by adjusting the number of clusters. The key technical contributions can be summarized as follows:

• We propose a new correlation subspace learning method for cross modality retrieval. It better fits the content divergence by learning a set of cluster-sensitive correlation sub-models. By applying structured sparsity regularization, the learned projection is more interpretable compared to dense correlation models. Our method achieves better trade-off between model bias and variance.
• By encoding the intra-modal information with correlation sub-models, our model is more robust to the correspondence missing than traditional approaches that only leverage the content co-occurrence.
• Extensive experiments on two large scale cross-modal datasets demonstrate the advantages of our approach. With moderate model training complexity, our method achieves at least 20% higher performance in Mean Average Precision than state-of-the-art approaches.
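For reference, Mean Average Precision can be computed as in the following sketch. It uses the standard definition over binary relevance flags in ranked order. The exact evaluation protocol of the paper (relevance criterion, rank cutoff) is not given in this excerpt, so the helper names and the convention of scoring a query with no relevant item as 0 are assumptions.

```python
import numpy as np

def average_precision(relevant_flags):
    """Average precision for one query.

    relevant_flags: 0/1 relevance of the retrieved items, in ranked order.
    Assumption: a query with no relevant item scores 0.
    """
    flags = np.asarray(relevant_flags, dtype=float)
    if flags.sum() == 0:
        return 0.0
    hits = np.cumsum(flags)                    # relevant items seen so far
    ranks = np.arange(1, len(flags) + 1)
    precision_at_hits = hits / ranks * flags   # precision only at relevant ranks
    return float(precision_at_hits.sum() / flags.sum())

def mean_average_precision(ranked_flags_per_query):
    """Mean of the per-query average precision values."""
    return float(np.mean([average_precision(f) for f in ranked_flags_per_query]))
```

For example, a ranking with relevance flags `[1, 0, 1, 0]` has precision 1/1 at the first hit and 2/3 at the second, giving an average precision of about 0.833.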

The rest of the paper is organized as follows. In Section 2 we briefly review related work. In Sections 3 and 4 we introduce our approach and implementation details. We describe the experiments in Section 5, and conclude the paper in Section 6.

Section snippets

Related work

CCA [4] is the first study on how to seek optimal basis vectors for two sets of variables to model the multi-modal correlation. It is used in various problems, such as cross language analysis [24] and Socio-Economic Transition [9]. PLS [5] aims to find a linear regression model by projecting the predicted variables and the observable variables to a new space, which is equivalent to CCA in many situations [25]. Such models are further extended to a regularized correlation learning framework
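For orientation, classical CCA can be sketched in a few lines of NumPy by whitening each view and taking an SVD of the whitened cross-covariance. This is a textbook formulation rather than anything specific to this paper. The function name `cca` and the small ridge term `reg`, added for numerical stability, are assumptions.

```python
import numpy as np

def cca(X, Y, n_components=2, reg=1e-6):
    """Classical CCA via Cholesky whitening and SVD.

    Returns projection matrices U (Dx, d), V (Dy, d) and the canonical
    correlations. `reg` is a small ridge term (an assumption) that keeps
    the covariance matrices positive definite.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / (n - 1)
    Lx = np.linalg.cholesky(Cxx)               # Cxx = Lx @ Lx.T
    Ly = np.linalg.cholesky(Cyy)               # Cyy = Ly @ Ly.T
    T = np.linalg.solve(Lx, Cxy)               # Lx^{-1} Cxy
    T = np.linalg.solve(Ly, T.T).T             # Lx^{-1} Cxy Ly^{-T}
    A, s, Bt = np.linalg.svd(T, full_matrices=False)
    U = np.linalg.solve(Lx.T, A[:, :n_components])     # map back from whitened space
    V = np.linalg.solve(Ly.T, Bt.T[:, :n_components])
    return U, V, s[:n_components]
```

The singular values are the canonical correlations, and each pair of columns in `U` and `V` gives one pair of projection vectors. This single global pair of subspaces is the unified model that cluster-sensitive sub-models are meant to refine.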

Overview

We are given two sets of data from two modalities, $X \in R^{N_{x} \times D_{x}}$ and $Y \in R^{N_{y} \times D_{y}}$, respectively, where $N_{x}$ or $N_{y}$ represents the number of training samples for each modality, and $D_{x}$ or $D_{y}$ represents the number of dimensions. Typically, we assume that both $X$ and $Y$ are zero-centered and norm-bounded. Without loss of generality, we denote each sample row vector as $x_{i}$ or $y_{j}$, and $X_{S}$ or $Y_{S}$ as the sub-matrices where the row entry indices are included in set $S$. Consistently, we denote the row indices subset with

Sub-problem optimization

Due to the complex cluster-sensitive projection functions, the objective function in (10) cannot be directly minimized with any existing off-the-shelf convex optimization toolbox, because a set of projection vector pairs rather than a single pair must be learned in our cluster-sensitive model. We borrow the idea from [6] and develop a special purpose bilateral optimization for our model, which alternately optimizes $u^{p}$ and $v^{q}$. First, given an initialized and $l_{2}$-norm normalized $U_{0}$ and $V_{0}$, we
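The general pattern of alternating between two sets of projection vectors can be illustrated with a generic sparse-CCA-style iteration for a single pair (u, v). This is not the paper's bilateral optimization of objective (10), which is not shown in this excerpt. The fixed soft-threshold levels `lam_u` and `lam_v`, the SVD-based initialization, and all function names are assumptions made for the sketch.

```python
import numpy as np

def soft_threshold(a, lam):
    """Elementwise soft-thresholding; shrinks small entries to exactly zero."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def unit_l2(a):
    """Scale a vector to unit l2 norm (an all-zero vector is returned unchanged)."""
    norm = np.linalg.norm(a)
    return a / norm if norm > 0 else a

def alternating_sparse_pair(X, Y, lam_u=0.05, lam_v=0.05, n_iter=50):
    """Alternately update u with v fixed, then v with u fixed.

    X, Y are assumed zero-centered with the same number of rows.
    Assumption: v is initialized with the leading right singular vector
    of the cross-covariance matrix, which already has unit l2 norm.
    """
    C = X.T @ Y / X.shape[0]                   # cross-covariance, (Dx, Dy)
    v = np.linalg.svd(C, full_matrices=False)[2][0]
    u = np.zeros(X.shape[1])
    for _ in range(n_iter):
        u = unit_l2(soft_threshold(C @ v, lam_u))    # update u, v fixed
        v = unit_l2(soft_threshold(C.T @ u, lam_v))  # update v, u fixed
    return u, v
```

Each half-step has a closed form once the other vector is held fixed, which is what makes the alternating scheme attractive. A cluster-sensitive model would have to repeat such updates over every sub-model pair, and if the thresholds are set too high both vectors collapse to zero.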

Experiments

Datasets: We conduct experiments on two datasets: the ImageClef 2010 dataset and the dataset collected by [41] from Wikipedia. The ImageClef data consists of 223 065 image and text document pairs after noise cleansing, where images and texts with identical document IDs imply that they serve as the complementary descriptions of each other, i.e., the correspondence information. We randomly select 50K images with their corresponding text as the multi-modal test dataset and the rest as the training

Conclusions

We propose a cluster-sensitive structured correlation learning framework for cross-modal retrieval. Multiple cluster-sensitive correlation sub-models are learned instead of a unified correlation model, which better fits the content divergence in different modalities. By using structured sparsity regularization on the projection vectors, a set of interpretable structured sparse correlation sub-models is obtained. To deal with correspondence information missing, we take full advantage of both

Acknowledgment

This work was supported in part by National Basic Research Program of China (973 Program): (2012CB316400) and (2015CB351802), 863 program of China: (2014AA015202), and National Natural Science Foundation of China (NSFC): (61025011), (61303160), (61332016), (61390511), (61322212), (61473273) and (61429201). This work was also supported in part by ARO Grant W911NF-12-1-0057 and Faculty Research Awards by NEC Laboratories of America to Prof. Qi Tian.

Shuhui Wang received the B.S. degree in Electronic Engineering from Tsinghua University, Beijing, China, in 2006, and the Ph.D. degree from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, in 2012. He is currently an Assistant Professor with the Institute of Computing Technology, Chinese Academy of Sciences, Beijing. He is also with the Key Laboratory of Intelligent Information Processing, Chinese Academy of Sciences. His research interests include semantic

References (42)

H.D. Vinod, Canonical ridge and econometrics of joint production, J. Econom. (1976)
D. Rafailidis et al., A unified framework for multimodal retrieval, Pattern Recognit. (2013)
F. Wu, H. Zhang, Y. Zhuang, Learning semantic correlations for cross media retrieval, in: ICIP,...
Y. Zhuang et al., Mining semantic correlation of heterogeneous multimedia data for cross-media retrieval, Trans. Multimed. (2008)
Y. Zhuang et al., Manifold learning based cross-media retrieval: a solution to media object complementary nature, J. VLSI Signal Process. (2007)
H. Hotelling, Relations between two sets of variates, Biometrika (1936)
H. Wold, Partial least squares, in: Samuel Kotz, Norman L. Johnson (Eds.), Encyclopedia of Statistical Sciences, vol....
D. Witten et al., A penalized matrix decomposition with applications to sparse principal components and canonical correlation analysis, Biostatistics (2009)
D.R. Hardoon et al., Sparse canonical correlation analysis, Mach. Learn. (2011)
M.B. Blaschko, C.H. Lampert, A. Gretton, Semi-supervised Laplacian regularization of kernel canonical correlation...
A. Tenenhaus et al., Regularized generalized canonical correlation analysis, Psychometrika (2011)
N. Rasiwasia, J.C. Pereira, E. Coviello, G. Doyle, G. Lanckriet, R. Levy, N. Vasconcelos, A new approach to cross-modal...
X. Chen, H. Liu, J.G. Carbonell, Structured sparse canonical correlation analysis, in: AISTATS,...
S. Virtanen, A. Klami, S. Kaski, Bayesian CCA via Group Sparsity, in: ICML,...
Kim-Anh Lê Cao, Debra Rossouw, Christèle Robert-Granié, Philippe Besse, A...
N. Chen et al., Large-margin predictive latent subspace learning for multi-view data analysis, Trans. Pattern Anal. Mach. Intell. (2012)
D. Blei, M. Jordan, Modeling annotated data, in: SIGIR,...
Y. Jia, M. Salzmann, T. Darrell, Learning cross-modality similarity for multinomial data, in: ICCV,...
D. Zhai et al., Multi-view metric learning with global consistency and local smoothness, ACM Trans. Intell. Syst. Technol. (2011)
D.K.H. Lim, B. McFee, G. Lanckriet, Robust structural metric learning, in: ICML,...
X. Chen, Q. Lin, S. Kim, J.G. Carbonell, E.P. Xing, Smoothing proximal gradient method for general structured sparse...


Fuzhen Zhuang received the B.S. degree in Computer Science from Chongqing University, China, in 2006, and the Ph.D. degree from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, in 2011. He is currently an Associate Professor in the Institute of Computing Technology, Chinese Academy of Sciences. His research interests include transfer learning, machine learning, data mining, and parallel classification algorithms.

Shuqiang Jiang received the M.S. degree from the College of Information Science and Engineering, Shandong University of Science and Technology, Shandong, China, in 2000, and the Ph.D. degree from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, in 2005. He is currently a Full Professor with the Institute of Computing Technology, Chinese Academy of Sciences, Beijing. He is also with the Key Laboratory of Intelligent Information Processing, Chinese Academy of Sciences. His research interests include multimedia processing and semantic understanding, pattern recognition, and computer vision. He has authored or coauthored more than 90 papers on the related research topics.

Qingming Huang received the B.S. degree in Computer Science and Ph.D. degree in Computer Engineering from Harbin Institute of Technology, Harbin, China, in 1988 and 1994, respectively. He is currently a Professor with the Graduate University of the Chinese Academy of Sciences (CAS), Beijing, China, and an Adjunct Research Professor with the Institute of Computing Technology, CAS. He has authored or coauthored nearly 200 academic papers in prestigious international journals and conferences. His research areas include multimedia video analysis, video adaptation, image processing, computer vision, and pattern recognition. Dr. Huang is a reviewer for IEEE Transactions on Multimedia, IEEE Transactions on Circuits and Systems for Video Technology, and IEEE Transactions on Communications. He has served as program chair, track chair and TPC member for various conferences, including ACM Multimedia, CVPR, ICCV, ICME, and PSIVT.

Qi Tian received the B.E. degree in Electronic Engineering from Tsinghua University, China, in 1992 and the Ph.D. degree in Electrical and Computer Engineering from the University of Illinois, Urbana-Champaign in 2002. He is currently a Full Professor in the Department of Computer Science at the University of Texas at San Antonio (UTSA). His research interests include multimedia information retrieval and computer vision. He has published over 150 refereed journal and conference papers. His research projects were funded by NSF, ARO, DHS, SALSI, CIAS, and UTSA and he also received faculty research awards from Google, NEC Laboratories of America, FXPAL, Akiira Media Systems, and HP Labs. He took a one-year faculty leave at Microsoft Research Asia (MSRA) during 2008–2009. He was the author of a Top 10% Best Paper Award in MMSP 2011, a Best Student Paper in ICASSP 2006, and a Best Paper Candidate in PCM 2007. He received the 2010 ACM Service Award. He has served as Program Chair, Organization Committee Member and TPC member for numerous IEEE and ACM Conferences including ACM Multimedia, SIGIR, ICCV, and ICME. He is a Guest Editor of IEEE Transactions on Multimedia, Journal of Computer Vision and Image Understanding, Pattern Recognition Letter, EURASIP Journal on Advances in Signal Processing, Journal of Visual Communication and Image Representation, and is on the Editorial Board of IEEE Transactions on Circuit and Systems for Video Technology (TCSVT), Journal of Multimedia (JMM) and Journal of Machine Visions and Applications (MVA). He is a Senior Member of IEEE and a Member of ACM.
