Multi-label enhancement based self-supervised deep cross-modal hashing
Introduction
Recent years have witnessed a huge surge of multimedia data such as images, text, audio and video on the web. Because different data modalities may describe the same events or topics, we can take advantage of the potential semantic correlation among these multi-modal data to realize large-scale cross-modal data retrieval. Hence, cross-modal retrieval [1], [2], which searches for semantically relevant instances in one modality by using a query from another modality, has drawn escalating attention. Data from different modalities have different feature representations and distributions. Thus, how to efficiently and effectively unify these massive yet heterogeneous data and further reduce their semantic gaps remains a big challenge.
The goal of cross-modal retrieval is to learn an isomorphic latent space, where the original heterogeneous feature representations of data from different modalities can be unified by a latent embedding. This is based on the hypothesis that different modalities having semantically related properties can be mapped and grouped into a common latent space. Existing cross-modal retrieval methods can be divided into two main categories [3]: real-valued representation learning methods and binary representation learning methods. Real-valued representations (learned by, e.g., subspace learning [4], [5], [6], [7], [8], [9], topic models [10], [11], [12], and deep models [13], [14], [15], [16], [17], [18], [19], [20], [21]) are usually compared by Euclidean distance to ensure that semantically relevant data are close to each other. However, similarity search in the real-valued representation space suffers from slow search response and high computational complexity. Thus, binary codes are used, which have both low data storage requirements and a highly efficient distance measure (XOR operation) [22], [23]. Binary representation learning methods, also referred to as cross-modal hashing (CMH) [24] methods, can effectively project high-dimensional real-valued representations of multi-modal data into an isomorphic Hamming space, endowing similar cross-modal data representations with similar hash codes.
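The efficiency argument for binary codes can be illustrated with a minimal sketch. It assumes hash codes are stored as {-1, +1} vectors and packed into bytes; the helper names `pack_codes` and `hamming_distance` are hypothetical and not taken from the paper.

```python
import numpy as np

def pack_codes(codes):
    """Pack {-1, +1} hash codes of shape (n, k) into uint8 bit arrays."""
    bits = (np.asarray(codes) > 0).astype(np.uint8)
    return np.packbits(bits, axis=1)

def hamming_distance(packed_query, packed_db):
    """Hamming distances between one packed query and all packed database codes."""
    xor = np.bitwise_xor(packed_db, packed_query)   # bits that differ become 1
    return np.unpackbits(xor, axis=1).sum(axis=1)   # count differing bits per row

# Hypothetical 8-bit codes: one query (e.g. a text) and three database items (e.g. images).
query = np.array([[1, -1, 1, 1, -1, -1, 1, -1]])
database = np.array([
    [1, -1, 1, 1, -1, -1, 1, -1],    # identical to the query
    [1, -1, 1, 1, -1, -1, -1, 1],    # differs in the last two bits
    [-1, 1, -1, -1, 1, 1, -1, 1],    # bitwise complement of the query
])
distances = hamming_distance(pack_codes(query), pack_codes(database))
print(distances)
```

Each code occupies k/8 bytes, and ranking the database only requires an XOR followed by a bit count, which is the storage and speed advantage the paragraph above refers to.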
In general, existing cross-modal hashing methods can be further categorized into unsupervised and supervised methods. Unsupervised cross-modal hashing methods, such as inter-media hashing (IMH) [25], collective matrix factorization hashing (CMFH) [26], latent semantic sparse hashing (LSSH) [27], and unsupervised generative adversarial cross-modal hashing (UGACH) [28], learn the hash projection functions by exploring the underlying distributions and structures among the similarities of multi-modal data representations without using any further supervised information. Supervised cross-modal hashing methods learn the hash functions by mapping pairwise instances into pairwise binary codes and preserve the semantic relevance of the pairwise instances with the guidance of supervised information (such as semantic labels). Representative supervised CMH methods, such as semantic correlation maximization (SCM) [29], semantics preserving hashing (SePH) [30], dictionary learning cross-modal hashing (DLCMH) [31] and semi-relaxation supervised hashing (SRSH) [32], can effectively distill semantic correlations of cross-modal data by exploiting semantic labels and achieve superior performance compared to unsupervised CMH methods. Nevertheless, these methods are based on shallow architectures and cannot describe the complicated nonlinear correlations among different modalities. Moreover, in these methods, the hand-crafted feature extraction and hash function learning are independently performed, which might not be optimally compatible with each other and may result in suboptimal performance.
Recently, deep convolutional neural networks (CNNs) [33], [34], [35], [36] have made significant progress in various computer vision applications [37], [38], [39], [40], [41], [42], [43], [44], [45], [46], and have been adopted for cross-modal hashing retrieval. Deep neural network based cross-modal hashing methods, such as deep cross-modal hashing (DCMH) [47], pairwise relationship guided deep hashing (PRDH) [48], correlation hashing network (CHN) [49], collective deep quantization (CDQ) [50], self-supervised adversarial hashing (SSAH) [51], cross-modal hamming hashing (CMHH) [52], fast discrete cross-modal hashing (FDCH) [53], deep multiscale fusion hashing (DMFH) [54], and triplet-based deep hashing (TDH) [55], integrate the learning of hash representations and hash functions into an end-to-end trainable architecture. Moreover, deep model based CMH methods can effectively capture nonlinear heterogeneous cross-modal correlations and obtain better performance than methods using shallow architectures.
However, most existing deep CMH methods simply leverage single-class labels to define the semantic affinity of the original pairwise instances, and minimize the difference between this semantic affinity and the similarities of the learned hash representations to preserve the semantic correlation of pairwise instances in the Hamming space. Nonetheless, this simple definition of semantic affinity cannot effectively preserve the semantic correlation and may lead to inferior retrieval performance. In fact, in cross-modal retrieval benchmark datasets as well as in practical applications, data from different modalities are usually labeled with multiple categories, i.e., multi-labels. This allows us to refine the definition of semantic affinity and base it on multi-label information, thereby exploiting more accurate semantic relevance for both inter-modality and intra-modality pairwise instances (as shown in Fig. 1). However, the problem of effectively minimizing the gap between the multi-label semantic affinity and the corresponding similarity of the learned hash representations still remains. A common solution is to optimize an MSE (mean squared error) based loss function [56]. However, MSE is based on Euclidean distance and is difficult to optimize. Moreover, MSE based loss functions are not robust to outlier pairs of instances [49]. To address this issue, a straightforward solution is to constrain the similarities of the learned hash representations (S) to fit the corresponding multi-label semantic affinity of the original pairwise instances (P) by using a classical Kullback–Leibler divergence [57] based loss function (KL loss) together with stochastic gradient descent optimization [58]. This leads to two further problems. Firstly, the ranges of S and P are not always the same, and it has been shown that simple linear transformations are ineffective at unifying the ranges of S and P. Secondly, if most of the initial values of S are larger than the corresponding values of P, the Kullback–Leibler divergence based loss function may generate negative loss values, which may work against the fitting goal during the optimization procedure (as shown in Fig. 2).
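The second problem can be reproduced with a small numerical sketch. It assumes the KL-style loss is the element-wise sum of P log(P / S) over pairwise entries; the exact loss used in the paper is not shown in this excerpt, and the values of `p`, `s_high` and the helper `kl_sum` are hypothetical. The Kullback–Leibler divergence is guaranteed to be non-negative only when both arguments are normalized probability distributions, which pairwise affinities and similarities generally are not.

```python
import numpy as np

def kl_sum(p, s, eps=1e-12):
    """Element-wise KL-style sum over pairwise values: sum of p * log(p / s)."""
    p = np.asarray(p, dtype=float)
    s = np.asarray(s, dtype=float)
    return float(np.sum(p * np.log((p + eps) / (s + eps))))

# Hypothetical target affinities P and learned similarities S for four pairs.
p = np.array([0.2, 0.3, 0.5, 0.1])
s_high = np.array([0.6, 0.7, 0.9, 0.4])   # every S value exceeds its P value
s_equal = p.copy()                        # perfect fit

loss_high = kl_sum(p, s_high)    # negative, although S fits P poorly
loss_equal = kl_sum(p, s_equal)  # zero for a perfect fit
print(loss_high, loss_equal)
```

In this toy case the poorly fitting S receives a lower loss than the perfect fit, so minimizing the unconstrained sum would push S away from P, which is the behavior the paragraph above describes.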
Furthermore, most existing deep CMH methods simply use all modalities of the data to learn hash functions but neglect the fact that the original instances in every modality may contain noise. This noise may reduce the performance and robustness of the learned hash representations and hash functions. At the same time, the labels assigned to instances refine the original features in each modality and carry rich semantic information, usually with little noise. Thus, a self-supervised semantic network based on multi-label annotations is usually utilized to improve the performance of deep CMH methods (as shown in Fig. 3). However, such methods define the semantic affinity matrix of instances based on single-label information, which cannot accurately capture the semantic affinity of the original pairwise instances.
Taking the above problems into consideration, in this paper, we propose a novel and efficient multi-label enhancement based self-supervised deep cross-modal hashing (MESDCH) method to improve the robustness of the learned hash representations and hash functions. As shown in Fig. 4, two novel modules are introduced in our MESDCH. The first one is the multi-label semantic affinity preserving module. This module mainly consists of three parts. The first part is a new definition of multi-label semantic affinity, which aims to accurately exploit the semantic affinity of the original pairwise instances under the supervision of multi-label information. The second part is a novel space transformation using a ReLU function to effectively unify the ranges of the similarities of the learned hash representations and the corresponding multi-label semantic affinity of the original pairwise instances in the Kullback–Leibler divergence based loss function. The third part is the proposed positive-constraint Kullback–Leibler divergence loss function, which prevents the Kullback–Leibler divergence based loss function from taking negative values. The second one is the self-supervised semantic generation module. This module uses multi-label annotations as an additional modality to supervise the hash representation and hash function learning, with the aim of alleviating the impact of noisy data in all modalities. The proposed MESDCH merges the multi-label semantic affinity preserving module and the self-supervised semantic generation module into deep cross-modal hashing based on three deep neural networks: LabelNet for the multi-label modality, ImgNet for the image modality and TxtNet for the text modality. LabelNet acts as a supervisor to guide the training of TxtNet and ImgNet, and the multi-label semantic affinity of both intra-modality and inter-modality pairs is preserved by minimizing the difference between the multi-label semantic affinity matrix and the corresponding hash representation affinity matrix. High-quality binary hash codes are then obtained by applying a sign function to the learned hash representations.
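The following sketch illustrates the three ingredients named above under explicit assumptions, since the paper's exact definitions are not given in this excerpt. It assumes the multi-label affinity is the cosine similarity of multi-hot label vectors, that the ReLU transformation is applied to the cosine similarity of real-valued hash representations, and that every instance has at least one label. All function names and values are hypothetical.

```python
import numpy as np

def multilabel_affinity(labels_a, labels_b):
    """Assumed definition: cosine similarity of multi-hot label vectors, in [0, 1]."""
    a = labels_a / np.linalg.norm(labels_a, axis=1, keepdims=True)
    b = labels_b / np.linalg.norm(labels_b, axis=1, keepdims=True)
    return a @ b.T

def single_label_affinity(labels_a, labels_b):
    """Binary affinity: 1 if two instances share at least one label, else 0."""
    return (labels_a @ labels_b.T > 0).astype(float)

def relu_similarity(h_a, h_b):
    """Cosine similarity of hash representations, mapped from [-1, 1] to [0, 1] by ReLU."""
    a = h_a / np.linalg.norm(h_a, axis=1, keepdims=True)
    b = h_b / np.linalg.norm(h_b, axis=1, keepdims=True)
    return np.maximum(a @ b.T, 0.0)

def binarize(h):
    """Sign function producing {-1, +1} codes (zero is mapped to +1)."""
    return np.where(h >= 0, 1, -1)

# Three instances with hypothetical multi-hot labels over three categories.
labels = np.array([[1, 1, 0], [1, 0, 0], [0, 0, 1]], dtype=float)
P = multilabel_affinity(labels, labels)         # graded affinity
P_binary = single_label_affinity(labels, labels)

# Hypothetical 4-dimensional real-valued hash representations.
h = np.array([[0.9, -0.8, 0.3, -0.1],
              [0.7, -0.6, -0.2, 0.4],
              [-0.9, 0.8, -0.3, 0.2]])
S = relu_similarity(h, h)                       # same range as P
codes = binarize(h)
print(P[0, 1], P_binary[0, 1], S[0, 2], codes[0])
```

Under these assumptions, the pair (0, 1) shares one of two labels and gets a graded affinity of about 0.71, whereas the binary definition assigns it 1 and cannot distinguish it from a full match. After the ReLU, S and P both lie in [0, 1], so they can be compared directly.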
The main contributions of our work can be summarized as follows.
1. A novel multi-label semantic affinity preserving module is proposed. In this module, a multi-label semantic affinity matrix is defined to accurately calculate the semantic relevance of the original pairwise instances. A ReLU transformation is put forward to bring the range of the similarities of the learned hash representations close to the range of the semantic affinity of the original pairwise instances. A positive-constraint Kullback–Leibler divergence based loss function is defined to ensure that the value of the loss function is non-negative during the hash function learning procedure.
2. To effectively lessen the influence of noisy data in the original training instances during the hash function learning procedure, the proposed multi-label enhancement based self-supervised deep cross-modal hashing method incorporates both the multi-label semantic affinity preserving module and the self-supervised semantic generation module into an end-to-end trainable architecture, which can further enhance the robustness of the learned hash representations and hash functions.
3. Extensive experiments on four cross-modal retrieval benchmark datasets demonstrate that MESDCH significantly enhances the performance compared to CMH methods without the proposed modules. Furthermore, experimental results also show that our proposed MESDCH outperforms other state-of-the-art CMH methods.
The remainder of this paper is organized as follows. We briefly review the related works on cross-modal hashing retrieval in Section 2. Section 3 elaborates our proposed multi-label enhancement based self-supervised deep cross-modal hashing method. Section 4 presents the detailed optimizations used in our framework. Section 5 provides the experimental results and the corresponding analysis. Section 6 concludes our work.
Section snippets
Related work
According to the style of feature learning, existing CMH methods can be roughly categorized into shallow architecture methods and deep architecture methods. Specifically, shallow architecture CMH methods learn hash representations by using traditional hand-crafted feature learning methods. Semantics-preserving hashing for cross-view retrieval (SePH) [30] is a two-stage method which first learns hash codes by using the similarities of randomly initialized hash codes to fit the corresponding
Proposed method
In this section, we first introduce the formal notations, problem definition, and the details of the proposed multi-label enhancement based self-supervised deep cross-modal hashing method (MESDCH). Without loss of generality, in our method, we assume that each instance has two modalities, i.e., an image modality and a text modality. Nevertheless, our proposed MESDCH can easily be extended to various other multi-modalities (such as audio, video and graphics). Moreover, in MESDCH, the multi-label
Learning Algorithm of MESDCH
In our proposed MESDCH method, an alternating learning strategy is utilized to learn and B during the training process. During each epoch, we update one parameter while keeping the others unchanged. We briefly outline the entire optimizing process of MESDCH and the whole alternating learning procedure for solving Eq. 17 in Algorithm 1.
Experiments and setup
To evaluate the effectiveness and efficiency of our proposed MESDCH, four popular cross-modal retrieval benchmark datasets are used, and the performance of MESDCH is compared to that of state-of-the-art cross-modal hashing methods.
Conclusion
In this paper, we presented a superior cross-modal hashing method named multi-label enhancement based self-supervised deep cross-modal hashing (MESDCH). A multi-label semantic affinity module is defined in MESDCH to preserve the semantic affinity of inter-modalities and intra-modalities. Compared to the single-label affinity preserving cross-modal hashing methods, our MESDCH can significantly improve the search accuracy by using the proposed multi-label semantic affinity module. Furthermore, a
CRediT authorship contribution statement
Xitao Zou: Conceptualization, Methodology, Writing - original draft. Song Wu: Conceptualization, Methodology.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (61806168), Fundamental Research Funds for the Central Universities (SWU117059), and Venture & Innovation Support Program for Chongqing Overseas Returnees (CX2018075).
References (72)
- et al. Deep discrete cross-modal hashing for cross-media retrieval. Pattern Recognition (2018)
- et al. Cyclematch: A cycle-consistent embedding network for image-text matching. Pattern Recognition (2019)
- et al. Semi-supervised cross-modal image generation with generative adversarial networks. Pattern Recognition (2020)
- et al. An overview of cross-media retrieval: Concepts, methodologies, benchmarks, and challenges. IEEE Transactions on Circuits and Systems for Video Technology (2017)
- Kaiye Wang, Qiyue Yin, Wei Wang, Shu Wu, and Liang Wang. A comprehensive survey on cross-modal retrieval. arXiv...
- Liangli Zhen, Peng Hu, Xu Wang, and Dezhong Peng. Deep supervised cross-modal retrieval. In The IEEE Conference on...
- et al. Generalized multiview analysis: A discriminative latent space
- Xiao-Yuan Jing, Rui-Min Hu, Yang-Ping Zhu, Shan-Shan Wu, Chao Liang, and Jing-Yu Yang. Intra-view and inter-view...
- et al. Parallel field alignment for cross media retrieval
- Yue Ting Zhuang, Yan Fei Wang, Fei Wu, Yin Zhang, and Wei Ming Lu. Supervised coupled dictionary learning with group...
- A multi-view embedding space for modeling internet images, tags, and their semantics. International Journal of Computer Vision
- Joint feature selection and subspace learning for cross-modal retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence
- Learning cross-modality similarity for multinomial data
- Topic modeling of multimodal data: an autoregressive approach
- Multi-modal mutual topic reinforce modeling for cross-media retrieval
- Cross-modal retrieval with cnn visual features: A new baseline. IEEE Transactions on Cybernetics
- Cycle-consistent deep generative hashing for cross-modal retrieval. IEEE Transactions on Image Processing
- Generalized semantic preserving hashing for cross-modal retrieval. IEEE Transactions on Image Processing
- Jiwen Lu, and Yap-Peng Tan. Cross-modal discrete hashing. Pattern Recognition
- Collective matrix factorization hashing for multimodal data
- Latent semantic sparse hashing for cross-modal similarity search
- Unsupervised generative adversarial cross-modal hashing
- Large-scale supervised multimodal hashing with semantic correlation maximization
- Semantics-preserving hashing for cross-view retrieval
- Chuan-Xiang Li, Meng-Yuan Liu, Liqiang Nie, and Xin-Shun Xu. Semi-relaxation supervised hashing for cross-modal retrieval
Xitao Zou received his B.S. degree in computer science from Guiyang University, Guiyang, China in 2012. He received his M.S. degree in computer science from Southwest University, Chongqing, China in 2015. Afterwards, he worked at the College of Computer Science and Engineering of Chongqing Three Gorges University, Chongqing, China. He is now a Ph.D. candidate at the College of Computer and Information Science, Southwest University. His current research interests include deep learning based computer vision and cross-modal retrieval.
Song Wu received his B.S. and M.S. degrees in computer science from Southwest University, Chongqing, China, in 2009 and 2012, respectively. He received his Ph.D. from the Leiden Institute of Advanced Computer Science (LIACS), Leiden University, Netherlands. He is a member of the Overseas High-level Talent Program in Chongqing and is currently working at the College of Computer and Information Science of Southwest University. His current research interests include large-scale image retrieval and classification, big data technology and deep learning based computer vision (he is a co-author of the most cited paper of the journal Neurocomputing: Deep learning for visual understanding: A review).
Erwin M. Bakker is co-director of the LIACS Media Lab at Leiden University. He has published widely in the fields of image retrieval, audio analysis and retrieval and bioinformatics. He was closely involved with the start of the International Conference on Image and Video Retrieval (CIVR) serving on the organizing committee in 2003 and 2005. Moreover, he regularly serves as a program committee member or organizing committee member for scientific multimedia and human-computer interaction conferences and workshops.
Xinzhi Wang is now a bachelor's student in computer science at Southwest University. He has three years of engineering experience in intelligent system development, and is highly experienced in algorithm analysis and mathematical modeling. His current research interests include deep learning based computer vision, cross-modal retrieval and hashing, person re-identification and face detection.