Neurocomputing, Volume 467, 7 January 2022, Pages 138-162

Multi-label enhancement based self-supervised deep cross-modal hashing

https://doi.org/10.1016/j.neucom.2021.09.053

Abstract

Deep cross-modal hashing, which integrates deep learning and hashing into cross-modal retrieval, achieves better performance than traditional cross-modal retrieval methods. Nevertheless, most previous deep cross-modal hashing methods only utilize single-class labels to compute the semantic affinity across modalities and overlook the existence of multiple category labels, which can capture the semantic affinity more accurately. Additionally, almost all existing cross-modal hashing methods straightforwardly employ all modalities to learn hash functions but neglect the fact that original instances in all modalities may contain noise. To avoid the above weaknesses, in this paper, a novel multi-label enhancement based self-supervised deep cross-modal hashing (MESDCH) approach is proposed. MESDCH first proposes a multi-label semantic affinity preserving module, which uses a ReLU transformation to unify the ranges of the similarities of learned hash representations and the corresponding multi-label semantic affinity of original instances, and defines a positive-constraint Kullback–Leibler loss function to preserve their similarity. This module is then integrated into a self-supervised semantic generation module to further enhance the performance of deep cross-modal hashing. Extensive evaluation experiments on four well-known datasets demonstrate that the proposed MESDCH achieves state-of-the-art performance and outperforms several excellent baseline methods in the application of cross-modal hashing retrieval. Code is available at: https://github.com/SWU-CS-MediaLab/MESDCH.

Introduction

Recent years have witnessed a huge surge of multimedia data such as images, text, audio and video on the web. Because different data modalities may describe the same events or topics, we can take advantage of the potential semantic correlations among these multi-modal data to realize large-scale cross-modal data retrieval. Hence, cross-modal retrieval [1], [2], which aims to search semantically relevant instances from one modality by using a query from another modality, has drawn escalating attention. Data from different modalities have different feature representations and distributions. Thus, how to efficiently and effectively unify these massive yet heterogeneous modality data and further reduce their semantic gaps remains a big challenge.

The goal of cross-modal retrieval is to learn an isomorphic latent space, in which the original heterogeneous feature representations of data from different modalities can be unified by a latent embedding. This is based on the hypothesis that different modalities with semantically related properties can be mapped and grouped into a common latent space. Existing cross-modal retrieval methods can be divided into two main categories [3]: real-valued representation learning methods and binary representation learning methods. Real-valued representations (learned by, e.g., subspace learning [4], [5], [6], [7], [8], [9], topic models [10], [11], [12], and deep models [13], [14], [15], [16], [17], [18], [19], [20], [21]) are usually compared with the Euclidean distance to ensure that semantically relevant data are close to each other. However, similarity measurement in the real-valued representation space suffers from slow search response and high computational complexity. Thus, binary codes are used, which have both low data storage requirements and a highly efficient distance measure (XOR operation) [22], [23]. Binary representation learning methods, also referred to as cross-modal hashing (CMH) [24] methods, effectively project high-dimensional real-valued representations of multi-modal data into an isomorphic Hamming space, endowing similar cross-modal data representations with similar hash codes.
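To make the efficiency of the XOR-based distance measure concrete, the Hamming distance between two binary hash codes packed into machine words can be computed with a single XOR followed by a bit count. The following minimal Python snippet is purely illustrative and is not taken from any of the cited methods' implementations.

    # Illustration: Hamming distance between two packed binary hash codes,
    # computed with a single XOR followed by a population count.
    def hamming_distance(code_a: int, code_b: int) -> int:
        """Number of differing bits between two packed binary codes."""
        return bin(code_a ^ code_b).count("1")

    # Two 8-bit codes that differ in exactly two bit positions.
    assert hamming_distance(0b10110010, 0b10011010) == 2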

In general, existing cross-modal hashing methods can be further categorized into unsupervised and supervised methods. Unsupervised cross-modal hashing methods, such as inter-media hashing (IMH) [25], collective matrix factorization hashing (CMFH) [26], latent semantic sparse hashing (LSSH) [27], and unsupervised generative adversarial cross-modal hashing (UGACH) [28], learn the hash projection functions by exploring the underlying distributions and structures among the similarities of multi-modal data representations without using any further supervised information. Supervised cross-modal hashing methods learn the hash functions by mapping pairwise instances into pairwise binary codes while preserving the semantic relevance of the pairwise instances under the guidance of supervised information (such as semantic labels). Representative supervised CMH methods, such as semantic correlation maximization (SCM) [29], semantics preserving hashing (SePH) [30], dictionary learning cross-modal hashing (DLCMH) [31] and semi-relaxation supervised hashing (SRSH) [32], can effectively distill semantic correlations of cross-modal data by exploiting semantic labels and achieve superior performance compared to unsupervised CMH methods. Nevertheless, these methods are based on shallow architectures and cannot describe the complicated nonlinear correlations among different modalities. Moreover, in these methods, hand-crafted feature extraction and hash function learning are performed independently, so they might not be optimally compatible with each other, which may result in suboptimal performance.

Recently, deep convolutional neural networks (CNNs) [33], [34], [35], [36] have made significant progress in various computer vision applications [37], [38], [39], [40], [41], [42], [43], [44], [45], [46], and have been adopted for cross-modal hashing retrieval. Deep neural network based cross-modal hashing methods, such as deep cross-modal hashing (DCMH) [47], pairwise relationship guided deep hashing (PRDH) [48], correlation hashing network (CHN) [49], collective deep quantization (CDQ) [50], self-supervised adversarial hashing (SSAH) [51], cross-modal hamming hashing (CMHH) [52], fast discrete cross-modal hashing (FDCH) [53], deep multiscale fusion hashing (DMFH) [54], and triplet-based deep hashing (TDH) [55], integrate the learning of hash representations and hash functions into an end-to-end trainable architecture. Such deep model based CMH methods can effectively capture nonlinear heterogeneous cross-modal correlations and obtain better performance than methods using shallow architectures.

However, most existing deep CMH methods simply leverage single-class labels to define the semantic affinity of the original pairwise instances and, meanwhile, minimize the difference between this semantic affinity and the similarities of learned hash representations to preserve the semantic correlation of pairwise instances in the Hamming space. Nonetheless, this simple definition of semantic affinity cannot effectively preserve the semantic correlation and may lead to inferior retrieval performance. Actually, in cross-modal retrieval benchmark datasets as well as practical applications, data from different modalities are usually labeled with multiple categories, i.e., multi-labels. This allows us to refine the definition of semantic affinity and base it on multi-label information, thereby exploiting more accurate semantic relevance for both inter-modality and intra-modality pairwise instances (as shown in Fig. 1). However, the problem of how to effectively minimize the gap between the multi-label semantic affinity and the corresponding similarity of learned hash representations still remains. A common solution for minimizing this gap is to optimize an MSE (mean squared error) based loss function [56]. However, the MSE measure is based on the Euclidean distance and is difficult to optimize. Moreover, MSE based loss functions are not robust to outlier pairs of instances [49]. To address this issue, a straightforward solution is to constrain the similarities of learned hash representations (S) to fit the corresponding multi-label semantic affinity of original pairwise instances (P) by using a classical Kullback–Leibler divergence [57] based loss function (KL loss) together with stochastic gradient descent optimization [58]. This leads to two other problems. Firstly, the ranges of S and P are not always the same, and it has been shown that it is ineffective to try to unify the ranges of S and P with simple linear transformations. Secondly, if most of the initial values of S are bigger than the corresponding values of P, the Kullback–Leibler divergence based loss function may generate negative loss values, which work against the fitting goal during the optimization procedure (as shown in Fig. 2).
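As a rough sketch of these two ideas, i.e., unifying the ranges of S and P with a ReLU transformation and constraining the KL-style loss to be non-negative, one could write something like the following PyTorch fragment. The function and variable names are hypothetical placeholders, the affinity and similarity definitions are illustrative assumptions, and the exact formulation used by MESDCH is given in Section 3.

    # Hedged sketch (PyTorch), not the authors' exact loss: fit the
    # hash-representation similarities S to the multi-label semantic
    # affinities P with a ReLU-unified, positive-constrained KL-style term.
    import torch
    import torch.nn.functional as F

    def positive_kl_fit(hash_a, hash_b, labels_a, labels_b, eps=1e-6):
        # Multi-label semantic affinity P in [0, 1]: cosine similarity of the
        # (non-negative) multi-hot label vectors -- an illustrative choice.
        P = F.cosine_similarity(labels_a, labels_b, dim=1).clamp(min=0.0)
        # Similarity of learned hash representations, passed through ReLU so
        # its range matches P instead of [-1, 1] (the range-unification idea).
        S = F.relu(F.cosine_similarity(hash_a, hash_b, dim=1))
        # KL-style divergence term, clamped at zero so no pair can contribute
        # a negative loss value (the positive-constraint idea).
        kl = P * torch.log((P + eps) / (S + eps))
        return torch.clamp(kl, min=0.0).mean()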

Furthermore, most existing deep CMH methods simply use all modalities of the data to learn hash functions but neglect the fact that original instances in all data modalities may contain noise. Such noise may reduce the performance and robustness of the learned hash representations and hash functions. At the same time, the labels assigned to instances refine the original features in each modality and carry rich semantic information, which usually contains little noise. Thus, a self-supervised semantic network based on multi-label annotations is usually utilized to improve the performance of deep CMH methods (as shown in Fig. 3). However, existing methods of this kind define the semantic affinity matrix of instances based on single-label information, which cannot accurately capture the semantic affinity of original pairwise instances.

Taking the above problems into consideration, in this paper we propose a novel and efficient multi-label enhancement based self-supervised deep cross-modal hashing (MESDCH) method to improve the robustness of learned hash representations and hash functions. As shown in Fig. 4, two novel modules are introduced in our MESDCH. The first one is the multi-label semantic affinity preserving module, which mainly consists of three parts. The first part is a new definition of multi-label semantic affinity, which aims to accurately exploit the semantic affinity of original pairwise instances under the supervision of multi-label information. The second part is a novel space transformation using a ReLU function to effectively unify the ranges of the similarities of learned hash representations and the corresponding multi-label semantic affinity of original pairwise instances in the Kullback–Leibler divergence based loss function. The third part is the proposed positive-constraint Kullback–Leibler divergence loss function, which prevents the Kullback–Leibler divergence based loss function from producing negative loss values. The second module is the self-supervised semantic generation module, which effectively uses the multi-label annotations as an additional modality to supervise hash representation and hash function learning, with the aim of alleviating the impact of noisy data in all modalities. The proposed MESDCH merges the multi-label semantic affinity preserving module and the self-supervised semantic generation module into deep cross-modal hashing based on three deep neural networks: LabelNet for the multi-label modality, ImgNet for the image modality and TxtNet for the text modality. LabelNet acts as a supervisor to guide the training of ImgNet and TxtNet, and the multi-label semantic affinity of both intra-modality and inter-modality pairs is preserved by minimizing the difference between the multi-label semantic affinity matrix and the corresponding hash representation affinity matrix. Binary hash codes are then obtained by applying a sign function to the learned hash representations.
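For exposition, one common way to realize such a multi-label semantic affinity matrix is the cosine similarity between multi-hot label vectors, so that instances sharing more categories receive higher affinity. The sketch below is an illustrative assumption rather than the exact definition used by MESDCH, which is given in Section 3.

    # Hedged sketch (NumPy): batch-wise multi-label semantic affinity matrix
    # built from multi-hot label vectors via cosine similarity.
    import numpy as np

    def multilabel_affinity(labels: np.ndarray, eps: float = 1e-12) -> np.ndarray:
        """labels: (n, c) multi-hot matrix; returns an (n, n) affinity matrix in [0, 1]."""
        norms = np.linalg.norm(labels, axis=1, keepdims=True) + eps
        normalized = labels / norms
        return normalized @ normalized.T

    # Example: instances 0 and 1 share one of their two categories (affinity 0.5),
    # while instances 0 and 2 share no category (affinity 0.0).
    L = np.array([[1, 1, 0], [0, 1, 1], [0, 0, 1]], dtype=float)
    A = multilabel_affinity(L)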

The main contributions of our work can be summarized as follows.

  • 1.

    A novel multi-label semantic affinity preserving module is proposed. In this module, a multi-label semantic affinity matrix is defined to accurately calculate the semantic relevance of original pairwise instances. A ReLU transformation is put forward to bring the range of the similarities of learned hash representations close to the range of the semantic affinity of original pairwise instances. A positive-constraint Kullback–Leibler divergence based loss function is defined to ensure that the value of the loss function is non-negative during the hash function learning procedure.

  • 2.

    To effectively lessen the influence of noisy data in the original training instances during the hash function learning procedure, the proposed multi-label enhancement based self-supervised deep cross-modal hashing method incorporates both the multi-label semantic affinity preserving module and the self-supervised semantic generation module into an end-to-end trainable architecture, which further enhances the robustness of the learned hash representations and hash functions.

  • 3.

    Extensive experiments on four cross-modal retrieval benchmark datasets demonstrate that MESDCH significantly enhances the performance compared to CMH methods without the proposed modules. Furthermore, experimental results also show that our proposed MESDCH outperforms other state-of-the-art CMH methods.

The remainder of this paper is organized as follows. We briefly review the related works on cross-modal hashing retrieval in Section 2. Section 3 elaborates our proposed multi-label enhancement based self-supervised deep cross-modal hashing method. Section 4 presents the detailed optimizations used in our framework. Section 5 provides the experimental results and the corresponding analysis. Section 6 concludes our work.

Section snippets

Related work

According to the style of feature learning, existing CMH methods can be roughly categorized into shallow architecture methods and deep architecture methods. Specifically, shallow architecture CMH methods learn hash representations by using traditional hand-crafted feature learning methods. Semantics preserving hashing for cross-view retrieval (SePH) [30] is a two-stage method which firstly learns hash codes by using the similarities of randomly initialized hash codes to fit the corresponding…

Proposed method

In this section, we first introduce the formal notations, problem definition, and the details of the proposed multi-label enhancement based self-supervised deep cross-modal hashing method (MESDCH). Without loss of generality, in our method, we assume that each instance has two modalities, i.e., an image modality and a text modality. Nevertheless, our proposed MESDCH can easily be extended to various other multi-modalities (such as audio, video and graphics). Moreover, in MESDCH, the multi-label…

Learning algorithm of MESDCH

In our proposed MESDCH method, an alternating learning strategy is utilized to learn Wv, Wt, Wl and B during the training process. In each epoch, we update one group of parameters while keeping the others unchanged. We briefly outline the entire optimization process of MESDCH; the whole alternating learning procedure for solving Eq. 17 is summarized in Algorithm 1.
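The schedule above can be pictured with the following hypothetical PyTorch-style skeleton; the network, loader, loss and aggregation names are placeholders, and the actual per-step objectives are those of Eq. 17 and Algorithm 1.

    # Hedged sketch of the alternating optimization: in each epoch, update one
    # group of parameters (LabelNet W_l, ImgNet W_v, TxtNet W_t) while the
    # others stay fixed, then refresh the binary codes B.
    import torch

    def train_epoch(nets, loaders, optimizers, losses, aggregate_hash):
        # Update each network's parameters in turn with the others fixed.
        for name in ("label", "img", "txt"):          # W_l, W_v, W_t
            for batch in loaders[name]:
                optimizers[name].zero_grad()
                losses[name](nets[name](batch)).backward()
                optimizers[name].step()
        # Refresh the binary codes B via the sign of the aggregated learned
        # hash representations (a common closed-form choice, assumed here).
        with torch.no_grad():
            B = torch.sign(aggregate_hash())
        return B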

Experiments and setup

To evaluate the effectiveness and efficiency of our proposed MESDCH, four popular cross-modal retrieval benchmark datasets are used, and the performance of MESDCH is compared to state-of-the-art cross-modal hashing methods.

Conclusion

In this paper, we presented a superior cross-modal hashing method named multi-label enhancement based self-supervised deep cross-modal hashing (MESDCH). A multi-label semantic affinity preserving module is defined in MESDCH to preserve the inter-modality and intra-modality semantic affinity. Compared to single-label affinity preserving cross-modal hashing methods, our MESDCH can significantly improve the search accuracy by using the proposed multi-label semantic affinity module. Furthermore, a…

CRediT authorship contribution statement

Xitao Zou: Conceptualization, Methodology, Writing - original draft. Song Wu: Conceptualization, Methodology.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (61806168), Fundamental Research Funds for the Central Universities (SWU117059), and Venture & Innovation Support Program for Chongqing Overseas Returnees (CX2018075).

References (72)

  • Fangming Zhong et al. Deep discrete cross-modal hashing for cross-media retrieval. Pattern Recognition (2018)
  • Yu Liu et al. CycleMatch: A cycle-consistent embedding network for image-text matching. Pattern Recognition (2019)
  • Dan Li et al. Semi-supervised cross-modal image generation with generative adversarial networks. Pattern Recognition (2020)
  • Yuxin Peng et al. An overview of cross-media retrieval: Concepts, methodologies, benchmarks, and challenges. IEEE Transactions on Circuits and Systems for Video Technology (2017)
  • Kaiye Wang, Qiyue Yin, Wei Wang, Shu Wu, and Liang Wang. A comprehensive survey on cross-modal retrieval. arXiv...
  • Liangli Zhen, Peng Hu, Xu Wang, and Dezhong Peng. Deep supervised cross-modal retrieval. In The IEEE Conference on...
  • Abhishek Sharma et al. Generalized multiview analysis: A discriminative latent space
  • Xiao-Yuan Jing, Rui-Min Hu, Yang-Ping Zhu, Shan-Shan Wu, Chao Liang, and Jing-Yu Yang. Intra-view and inter-view...
  • Xiangbo Mao et al. Parallel field alignment for cross media retrieval
  • Yue Ting Zhuang, Yan Fei Wang, Fei Wu, Yin Zhang, and Wei Ming Lu. Supervised coupled dictionary learning with group...
  • Yunchao Gong et al. A multi-view embedding space for modeling internet images, tags, and their semantics. International Journal of Computer Vision (2014)
  • Kaiye Wang et al. Joint feature selection and subspace learning for cross-modal retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence (2015)
  • Yangqing Jia et al. Learning cross-modality similarity for multinomial data
  • Yin Zheng et al. Topic modeling of multimodal data: an autoregressive approach
  • Yanfei Wang et al. Multi-modal mutual topic reinforce modeling for cross-media retrieval
  • Jian Wang, Yonghao He, Cuicui Kang, Shiming Xiang, and Chunhong Pan. Image-text cross-modal retrieval via...
  • Andrea Frome, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc'Aurelio Ranzato, and Tomas Mikolov. DeViSE: A...
  • Andrej Karpathy, Armand Joulin, and Li Fei-Fei. Deep fragment embeddings for bidirectional image sentence mapping. In...
  • Xinyang Jiang, Fei Wu, Xi Li, Zhou Zhao, Weiming Lu, Siliang Tang, and Yueting Zhuang. Deep compositional cross-modal...
  • Yunchao Wei et al. Cross-modal retrieval with CNN visual features: A new baseline. IEEE Transactions on Cybernetics (2016)
  • Yuxin Peng and Jinwei Qi. CM-GANs: Cross-modal generative adversarial networks for common representation learning. ACM...
  • Lin Wu et al. Cycle-consistent deep generative hashing for cross-modal retrieval. IEEE Transactions on Image Processing (2018)
  • Devraj Mandal et al. Generalized semantic preserving hashing for cross-modal retrieval. IEEE Transactions on Image Processing (2018)
  • Jingdong Wang, Ting Zhang, Nicu Sebe, Heng Tao Shen, et al. A survey on learning to hash. IEEE Transactions on Pattern...
  • Wujun Li. Learning to hash for big data: A tutorial. https://cs.nju.edu.cn/lwj/slides/L2H.pdf,...
  • Venice Erin Liong, Jiwen Lu, and Yap-Peng Tan. Cross-modal discrete hashing. Pattern Recognition (2018)
  • Jingkuan Song, Yang Yang, Yi Yang, Zi Huang, and Heng Tao Shen. Inter-media hashing for large-scale retrieval from...
  • Guiguang Ding et al. Collective matrix factorization hashing for multimodal data
  • Jile Zhou et al. Latent semantic sparse hashing for cross-modal similarity search
  • Jian Zhang et al. Unsupervised generative adversarial cross-modal hashing
  • Dongqing Zhang et al. Large-scale supervised multimodal hashing with semantic correlation maximization
  • Zijia Lin et al. Semantics-preserving hashing for cross-view retrieval
  • Xin-Shun Xu. Dictionary learning based hashing for cross-modal retrieval. In Proceedings of the 24th ACM international...
  • Peng-Fei Zhang, Chuan-Xiang Li, Meng-Yuan Liu, Liqiang Nie, and Xin-Shun Xu. Semi-relaxation supervised hashing for cross-modal retrieval
  • Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural...
  • Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint...

    Xitao Zou received his B.S. degree in computer science from Guiyang University, Guiyang, China in 2012. He received his M.S. degree in computer science from Southwest University, Chongqing, China in 2015. Since then, he has worked at the College of Computer Science and Engineering of Chongqing Three Gorges University, Chongqing, China. He is now a Ph.D. candidate at the College of Computer and Information Science, Southwest University. His current research interests include deep learning based computer vision and cross-modal retrieval.

    Song Wu received his B.S. and M.S. degrees in computer science from Southwest University, Chongqing, China, in 2009 and 2012, respectively. He received his Ph.D. from the Leiden Institute of Advanced Computer Science (LIACS), Leiden University, Netherlands. He is a member of the Overseas High-level Talent Program in Chongqing and is currently working at the College of Computer and Information Science of Southwest University. His current research interests include large-scale image retrieval and classification, big data technology and deep learning based computer vision (he is a co-author of the most cited paper of the journal Neurocomputing: Deep learning for visual understanding: A review).

    Erwin M. Bakker is co-director of the LIACS Media Lab at Leiden University. He has published widely in the fields of image retrieval, audio analysis and retrieval and bioinformatics. He was closely involved with the start of the International Conference on Image and Video Retrieval (CIVR) serving on the organizing committee in 2003 and 2005. Moreover, he regularly serves as a program committee member or organizing committee member for scientific multimedia and human-computer interaction conferences and workshops.

    Xinzhi Wang is currently a bachelor's student in computer science at Southwest University. He has three years of engineering experience in intelligent system development and is highly experienced in algorithm analysis and mathematical modeling. His current research interests include deep learning based computer vision, cross-modal retrieval and hashing, person re-identification and face detection.
