Pattern Recognition

Volume 136, April 2023, 109211

Semi-supervised cross-modal hashing via modality-specific and cross-modal graph convolutional networks

https://doi.org/10.1016/j.patcog.2022.109211

Highlights

  • MCGCN is the first to build a cross-modal graph and jointly learn modality-specific and modality-shared features for semi-supervised cross-modal hashing.

  • MCGCN provides a three-channel network architecture, comprising two modality-specific channels and a cross-modal channel that models the cross-modal graph with heterogeneous image and text features.

  • To effectively reduce the modality gap, network training is guided by an adversarial scheme.

  • MCGCN obtains state-of-the-art semi-supervised cross-modal hashing performance.

Abstract

Cross-modal hashing maps heterogeneous multimedia data into Hamming space for retrieving relevant samples across modalities, and has received great research interest due to its rapid retrieval and low storage cost. In real-world applications, owing to the high manual annotation cost of multimedia data, only a limited number of labeled data are available alongside abundant unlabeled data. In recent years, several semi-supervised cross-modal hashing (SCH) methods have been presented. However, how to fully explore and jointly utilize the modality-specific (complementarity) and modality-shared (correlation) information for retrieval has not been well studied in existing SCH works. In this paper, we propose a novel SCH approach named Modality-specific and Cross-modal Graph Convolutional Networks (MCGCN). The network architecture contains two modality-specific channels and a cross-modal channel to learn modality-specific and shared representations for each modality, respectively. Graph convolutional networks (GCNs) are leveraged in these three channels to explore intra-modal and inter-modal similarity and to propagate semantic information from labeled to unlabeled data. The modality-specific and shared representations of each modality are fused with an attention scheme. To further reduce the modality gap, a discriminative model is designed that learns to classify the modality of representations, and network training is guided by an adversarial scheme. Experiments on two widely used multi-modal datasets demonstrate that MCGCN outperforms state-of-the-art semi-supervised and supervised cross-modal hashing methods.

Introduction

With the rapid growth of multimedia data, cross-modal retrieval [1], [2], [3], [4], [5] has received continuous research attention; its goal is to search for semantically relevant instances in one modality given a query instance from another modality [6], [7]. One of the most popular pipelines is cross-modal hashing [8], [9], which learns to convert multimedia data into binary hash codes for retrieval and offers advantages in retrieval speed and storage cost for large-scale data [10], [11]. The main challenge is that different modalities usually have inconsistent distributions and representations. To deal with this modality gap, several supervised cross-modal hashing methods have been developed [12], e.g., collective matrix factorization hashing (CMFH) [13], deep cross-modal hashing (DCMH) [8], and cycle-consistent deep generative hashing (CYC-DGH) [14].

Although supervised cross-modal hashing methods have achieved significant progress, they heavily rely on semantic label information. However, labeling a large repository of instances containing multiple modalities is time-consuming, labor-intensive, and often infeasible. Some unsupervised cross-modal hashing methods have demonstrated that unlabeled multimedia data are also useful for the retrieval task [15], [16]. For example, cluster-wise unsupervised hashing (CUH) [17] adopts a multi-view clustering scheme that projects data of different modalities into a latent space and seeks cluster centroids to learn compact hash codes and linear hash functions. Focusing on the unsupervised retrieval task, aggregation-based graph convolutional hashing (AGCH) [18] uses multiple metrics to formulate an affinity matrix for hash code learning. Deep graph-neighbor coherence preserving network (DGCPN) [19] introduces graph-neighbor coherence to explore the relationships between unlabeled data and their neighbors, and adopts a comprehensive similarity-preserving loss.

In real-world applications, we can usually obtain only a small quantity of labeled multimedia data together with abundant unlabeled data of multiple modalities, so cross-modal hashing must be performed in this semi-supervised scenario. In recent years, benefiting from the development of deep learning [20], a few deep learning based semi-supervised cross-modal hashing (SCH) methods have been presented and shown to bring favorable retrieval performance, e.g., semi-supervised deep quantization (SSDQ) [21], ranking-based deep cross-modal hashing (RDCMH) [22], and the semi-supervised cross-modal hashing approach by generative adversarial network (SCH-GAN) [23]. Recently, a powerful representation learning technique, the graph convolutional network (GCN) [24], has been successfully introduced into SCH [25]. The semi-supervised graph convolutional hashing network (SGCH) [26] preserves high-order intra-modality similarity with a GCN and adopts a siamese network to map the learned node representations into Hamming space to obtain hash codes.

Although a set of SCH methods have been developed, existing SCH methods mainly focus on intra-modal feature learning and similarity preserving, and then bridge modalities through loss function design, e.g., [21], [22] and [25], or a specific network module, e.g., [23] and [26], using the learned features of each modality to reduce the modality gap and learn hash codes. How to jointly explore intra-modal and inter-modal semantic similarity and structure information in both labeled and unlabeled data, so that the modality-specific and modality-shared information is fully exploited, has not been well studied. In this paper, we propose a novel SCH approach named Modality-specific and Cross-modal Graph Convolutional Networks (MCGCN). The contributions of our work are summarized in the following three points:

  • (1)

    MCGCN provides a three-channel network architecture, including two modality-specific channels and a cross-modal channel for the image and text modalities. Besides intra-modal graphs, a cross-modal graph is also modeled with heterogeneous image and text features. Joint intra- and inter-modal semantic similarity preservation and semantic information propagation to unlabeled samples are performed with GCNs, and the modality-specific and shared representations of each modality are fused with an attention scheme (a schematic sketch follows this list). To our knowledge, this is the first work to explicitly build a cross-modal graph and jointly learn modality-specific and modality-shared features for SCH.

  • (2)

    An adversarial scheme is employed to guide the optimization of the network parameters. The generative model learns to predict the semantic labels of feature representations and makes full use of the label and semantic similarity information to generate discriminative hash codes, while the discriminative model builds a modality classifier to model inter-modal invariance with an adversarial loss.

  • (3)

    We evaluate MCGCN on the widely used benchmark datasets Wikipedia [27] and NUS-WIDE-10K [28]. The experimental results demonstrate that our approach achieves state-of-the-art SCH performance.
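To make the three-channel design and the adversarial scheme above concrete, the following PyTorch-style sketch lays out one plausible realization: two modality-specific GCN channels, a cross-modal channel over a joint graph of image and text nodes, attention fusion per modality, and a modality discriminator. The class name MCGCNSketch, layer sizes, graph construction, and loss formulation are illustrative assumptions, not the exact configuration used in the paper.

# Minimal sketch of the MCGCN idea (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCNLayer(nn.Module):
    """One graph convolution: H' = ReLU(A_hat @ H @ W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, a_hat, h):
        return F.relu(a_hat @ self.linear(h))

class MCGCNSketch(nn.Module):
    def __init__(self, d_img, d_txt, d_common=512, code_len=64):
        super().__init__()
        # Project heterogeneous features into a common space for the cross-modal graph.
        self.img_proj = nn.Linear(d_img, d_common)
        self.txt_proj = nn.Linear(d_txt, d_common)
        # Two modality-specific channels and one cross-modal (shared) channel.
        self.img_gcn = GCNLayer(d_common, code_len)
        self.txt_gcn = GCNLayer(d_common, code_len)
        self.cross_gcn = GCNLayer(d_common, code_len)
        # Attention over {specific, shared} representations, one per modality.
        self.img_att = nn.Linear(2 * code_len, 2)
        self.txt_att = nn.Linear(2 * code_len, 2)
        # Modality discriminator: 0 = image, 1 = text.
        self.discriminator = nn.Sequential(
            nn.Linear(code_len, code_len), nn.ReLU(), nn.Linear(code_len, 2))

    def fuse(self, att_layer, specific, shared):
        w = torch.softmax(att_layer(torch.cat([specific, shared], dim=1)), dim=1)
        return w[:, :1] * specific + w[:, 1:] * shared

    def forward(self, img, txt, a_img, a_txt, a_cross):
        n = img.size(0)
        img_c, txt_c = self.img_proj(img), self.txt_proj(txt)
        # Modality-specific channels on intra-modal graphs (n x n each).
        img_spec = self.img_gcn(a_img, img_c)
        txt_spec = self.txt_gcn(a_txt, txt_c)
        # Cross-modal channel: image and text nodes share one 2n x 2n graph.
        joint = self.cross_gcn(a_cross, torch.cat([img_c, txt_c], dim=0))
        img_shared, txt_shared = joint[:n], joint[n:]
        # Attention fusion, then tanh as a continuous relaxation of binary hash codes.
        img_code = torch.tanh(self.fuse(self.img_att, img_spec, img_shared))
        txt_code = torch.tanh(self.fuse(self.txt_att, txt_spec, txt_shared))
        return img_code, txt_code

    def adversarial_losses(self, img_code, txt_code):
        # The discriminator tries to tell modalities apart; the generator
        # (the three channels above) is trained to fool it, shrinking the modality gap.
        logits = self.discriminator(torch.cat([img_code, txt_code], dim=0))
        target = torch.cat([torch.zeros(len(img_code), dtype=torch.long),
                            torch.ones(len(txt_code), dtype=torch.long)])
        d_loss = F.cross_entropy(logits, target)
        g_loss = F.cross_entropy(logits, 1 - target)  # confuse the discriminator
        return d_loss, g_loss

In the full method, the supervised classification loss on labeled nodes and the similarity-preserving losses described in the abstract would be combined with g_loss; only the adversarial part is sketched here.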

The rest of this paper is organized as follows. Section 2 briefly introduces the related works on supervised and unsupervised cross-modal hashing methods, semi-supervised cross-modal hashing methods, and graph convolutional networks. In Section 3, we detail the proposed MCGCN approach. Section 4 reports the experimental results on the Wikipedia and NUS-WIDE-10K datasets, and provides a comprehensive discussion about MCGCN. Finally, the conclusions are drawn in Section 5.

Section snippets

Supervised and unsupervised cross-modal hashing methods

Nowadays, several supervised or unsupervised cross-modal hashing methods have been presented and have achieved significant progress [29], [30], [31], [32]. Using matrix factorization, collective matrix factorization hashing (CMFH) [13] learns unified hash codes in a shared latent semantic space for the different modalities of an instance. Deep cross-modal hashing (DCMH) [8] provides an end-to-end deep learning framework to perform cross-modal retrieval. Cycle-consistent deep

Notation

Given a multimodal dataset $D=\{I,T\}$, where $I=[i_1,\ldots,i_N]\in\mathbb{R}^{d_I\times N}$ and $T=[t_1,\ldots,t_N]\in\mathbb{R}^{d_T\times N}$ separately denote the feature matrices of the image and text modalities, the data can be divided into a retrieval set $D_r$ and a query set $D_q$. Here, $N$ is the total number of feature vectors of the image/text modality and $d_I\neq d_T$. The retrieval set $D_r=\{D_r^L,D_r^U\}$, where $D_r^L$ is a collection of $N_L$ instances of labeled image-text pairs and $D_r^U$ is a set of $N_U$ instances of unlabeled image-text pairs. $l_p^L\in\{0,1\}^{C\times 1}$ represents the
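Since this notation snippet is truncated, the short NumPy sketch below only illustrates the shapes implied by the notation and one common way of turning an affinity matrix into the normalized adjacency that a GCN consumes. The cosine k-NN affinity and the helper normalized_adjacency are assumed constructions for illustration, not the paper's stated procedure.

# Shapes implied by the notation (example values), plus the standard symmetric
# normalization A_hat = D^{-1/2}(A + I)D^{-1/2} commonly used by GCNs.
import numpy as np

d_I, d_T, N, C = 4096, 1000, 2173, 10   # feature dims, number of pairs, number of classes
I = np.random.randn(d_I, N)             # image feature matrix, d_I x N
T = np.random.randn(d_T, N)             # text feature matrix,  d_T x N (d_I != d_T)
N_L = 500                                # labeled pairs; the remaining N - N_L are unlabeled
L = np.zeros((C, N_L))                   # one-hot label matrix for the labeled subset
L[np.random.randint(C, size=N_L), np.arange(N_L)] = 1

def normalized_adjacency(features, k=10):
    """Cosine k-NN affinity -> symmetric normalized adjacency for a GCN (assumed construction)."""
    X = features / (np.linalg.norm(features, axis=0, keepdims=True) + 1e-12)
    sim = X.T @ X                                   # N x N cosine similarity
    A = np.zeros_like(sim)
    idx = np.argsort(-sim, axis=1)[:, :k]           # keep k nearest neighbors per node
    np.put_along_axis(A, idx, 1.0, axis=1)
    A = np.maximum(A, A.T)                          # symmetrize
    np.fill_diagonal(A, 1.0)                        # self-loops
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A @ D_inv_sqrt

A_img = normalized_adjacency(I)   # intra-modal graph for the image channel
A_txt = normalized_adjacency(T)   # intra-modal graph for the text channel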

Datasets and compared methods

In this paper, we use two benchmark datasets Wikipedia [27] and NUS-WIDE-10K [28] to evaluate our approach MCGCN.

-The Wikipedia dataset [27] is collected from Wikipedia articles. It contains 2,866 image-text pairs from 10 categories. Following [23], the dataset is divided into a training set (retrieval set) with 2,173 pairs and a test set (query set) with the remaining 693 pairs.

-The NUS-WIDE-10K dataset [28] is a subset of the NUS-WIDE dataset [44], including the pairs of 10 largest categories

Conclusion

In this paper, we have proposed a novel semi-supervised cross-modal hashing approach named MCGCN. Modality-specific and modality-shared features are effectively explored through joint intra-modal and cross-modal graph modeling and graph convolutional representation learning. The label and structure information of labeled and unlabeled samples is fully leveraged to propagate semantic information and learn discriminative hash codes.

Comprehensive experiments on two widely used datasets

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Nos. 62076139, 61702280), Open Research Project of Zhejiang Lab (No. 2021KF0AB05), Future Network Scientific Research Fund Project (No. FNSRFP-2021-YB-15), 1311 Talent Program of Nanjing University of Posts and Telecommunications, the National Postdoctoral Program for Innovative Talents (No. BX20180146), China Postdoctoral Science Foundation (No. 2019M661901), and Jiangsu Planned Projects for Postdoctoral Research Funds

References (45)

  • Q.-Y. Jiang et al., Deep cross-modal hashing, IEEE Conference on Computer Vision and Pattern Recognition (2017)

  • X. Ma et al., Multi-level correlation adversarial hashing for cross-modal retrieval, IEEE Trans. Multimedia (2020)

  • S. Jin et al., SSAH: semi-supervised adversarial deep hashing with self-paced hard sample generation, AAAI Conference on Artificial Intelligence (2020)

  • Y. Wang et al., Deep unified cross-modality hashing by pairwise data alignment, International Joint Conference on Artificial Intelligence (2021)

  • C. Sun et al., Supervised hierarchical cross-modal hashing, International ACM SIGIR Conference on Research and Development in Information Retrieval (2019)

  • G. Ding et al., Large-scale cross-modality search via collective matrix factorization hashing, IEEE Trans. Image Process. (2016)

  • L. Wu et al., Cycle-consistent deep generative hashing for cross-modal retrieval, IEEE Trans. Image Process. (2019)

  • W. Wang et al., Set and rebase: determining the semantic graph connectivity for unsupervised cross-modal hashing, International Joint Conference on Artificial Intelligence (2020)

  • H. Hu et al., Creating something from nothing: unsupervised knowledge distillation for cross-modal hashing, IEEE Conference on Computer Vision and Pattern Recognition (2020)

  • P.-F. Zhang et al., Aggregation-based graph convolutional hashing for unsupervised cross-modal retrieval, IEEE Trans. Multimedia (2021)

  • J. Yu et al., Deep graph-neighbor coherence preserving network for unsupervised cross-modal hashing, AAAI Conference on Artificial Intelligence (2021)

  • Y. Zhang et al., Deep relation embedding for cross-modal retrieval, IEEE Trans. Image Process. (2020)

    Fei Wu received the PhD degree in information and communication engineering from the Nanjing University of Posts and Telecommunications (NJUPT), Nanjing, China, in 2016. He is currently an associate professor with the College of Automation and Artificial Intelligence, NJUPT. He has authored over 40 scientific papers in venues such as TPAMI, TIP, PR, CVPR, AAAI, and IJCAI. His current research interests include pattern recognition and artificial intelligence.

    Shuaishuai Li is currently a Master candidate in control engineering with NJUPT, Nanjing, China. His current research interests include pattern recognition and data mining.

    Guangwei Gao received the PhD degree in pattern recognition and intelligence systems from Nanjing University of Science and Technology, Nanjing, China, in 2014. Now, he is an associate professor with the Institute of Advanced Technology, Nanjing University of Posts and Telecommunications, Nanjing, China. His research mainly focuses on pattern recognition and computer vision.

    Yimu Ji received the PhD degree in computer science from NJUPT, Nanjing, China, in 2006. He is a professor with NJUPT. His current research interests include intelligent driving, computer vision and big data processing.

    Xiao-Yuan Jing received the Doctoral degree in pattern recognition and intelligent system from the Nanjing University of Science and Technology, Nanjing, China, in 1998. He is currently a professor with the School of Computer, Wuhan University, Wuhan, China. He has published more than 100 papers in venues such as TPAMI, TIP, TIFS, TCSVT, TMM, PR, CVPR, AAAI, IJCAI, and ICSE. His current research interests include pattern recognition and artificial intelligence.

    Zhiguo Wan received the PhD degree from the School of Computing, National University of Singapore, Singapore, in 2007. He was an associate professor with the School of Computer Science and Technology, Shandong University. From 2008 to 2014, he was an assistant professor with the School of Software, Tsinghua University. He was a post-doctoral researcher with the Katholieke University of Leuven, Belgium, from 2006 to 2008. He is currently a principal investigator with Zhejiang Laboratory, Hangzhou, Zhejiang, China. His research interest includes intelligent computing.
