Pattern Recognition

Volume 136, April 2023, 109211

Semi-supervised cross-modal hashing via modality-specific and cross-modal graph convolutional networks

https://doi.org/10.1016/j.patcog.2022.109211

Highlights

  • MCGCN is the first to build a cross-modal graph and jointly learn modality-specific and modality-shared features for semi-supervised cross-modal hashing.

  • MCGCN provides a three-channel network architecture, comprising two modality-specific channels and a cross-modal channel that models the cross-modal graph with heterogeneous image and text features.

  • To effectively reduce the modality gap, network training is guided by an adversarial scheme.

  • MCGCN obtains state-of-the-art semi-supervised cross-modal hashing performance.

Abstract

Cross-modal hashing maps heterogeneous multimedia data into Hamming space for retrieving relevant samples across modalities, and has received great research interest due to its rapid retrieval and low storage cost. In real-world applications, owing to the high manual annotation cost of multimedia data, only a limited number of labeled data are available alongside abundant unlabeled data. In recent years, several semi-supervised cross-modal hashing (SCH) methods have been presented. However, how to fully explore and jointly utilize the modality-specific (complementarity) and modality-shared (correlation) information for retrieval has not been well studied in existing SCH works. In this paper, we propose a novel SCH approach named Modality-specific and Cross-modal Graph Convolutional Networks (MCGCN). The network architecture contains two modality-specific channels and a cross-modal channel to learn modality-specific and shared representations for each modality, respectively. Graph convolutional networks (GCNs) are leveraged in these three channels to explore intra-modal and inter-modal similarity and to propagate semantic information from labeled to unlabeled data. The modality-specific and shared representations of each modality are fused with an attention scheme. To further reduce the modality gap, a discriminative model is designed that learns to classify the modality of representations, and network training is guided by an adversarial scheme. Experiments on two widely used multi-modal datasets demonstrate that MCGCN outperforms state-of-the-art semi-supervised and supervised cross-modal hashing methods.

Introduction

With the rapid growth of multimedia data, cross-modal retrieval [1], [2], [3], [4], [5] has received continuous research attention; its goal is to search for semantically relevant instances in one modality given a query instance from another modality [6], [7]. One of the most popular pipelines is cross-modal hashing [8], [9], which learns to convert multimedia data into binary hash codes for retrieval and offers advantages in retrieval speed and storage cost for large-scale data [10], [11]. The main challenge is that different modalities usually have inconsistent distributions and representations. To deal with this modality gap, several supervised cross-modal hashing methods have been developed [12], e.g., collective matrix factorization hashing (CMFH) [13], deep cross-modal hashing (DCMH) [8], and cycle-consistent deep generative hashing (CYC-DGH) [14].

Although supervised cross-modal hashing methods have achieved significant progress, they heavily rely on semantic label information. However, labeling a large repository of instances containing multiple modalities is time-consuming, labor-intensive, and often infeasible. Some unsupervised cross-modal hashing methods have demonstrated that unlabeled multimedia data are also useful for the retrieval task [15], [16]. For example, cluster-wise unsupervised hashing (CUH) [17] adopts a multi-view clustering scheme that projects data of different modalities into a latent space and seeks cluster centroids to learn compact hash codes and linear hash functions. Focusing on the unsupervised retrieval task, aggregation-based graph convolutional hashing (AGCH) [18] uses multiple metrics to formulate an affinity matrix for hash code learning. Deep graph-neighbor coherence preserving network (DGCPN) [19] introduces graph-neighbor coherence to explore the relationships between unlabeled data and their neighbors, and adopts a comprehensive similarity-preserving loss.

In real-world applications, we can usually obtain only a small quantity of labeled multimedia data together with abundant unlabeled data of multiple modalities, so cross-modal hashing must be performed in this semi-supervised scenario. In recent years, benefiting from the development of deep learning [20], a few deep learning based semi-supervised cross-modal hashing (SCH) methods have been presented and shown to bring favorable retrieval performance, e.g., semi-supervised deep quantization (SSDQ) [21], ranking-based deep cross-modal hashing (RDCMH) [22], and the semi-supervised cross-modal hashing approach by generative adversarial network (SCH-GAN) [23]. Recently, a powerful representation learning technique, the graph convolutional network (GCN) [24], has been successfully introduced into SCH [25]. The semi-supervised graph convolutional hashing network (SGCH) [26] preserves high-order intra-modality similarity with a GCN and adopts a siamese network to map the learned node representations into Hamming space to obtain hash codes.

Although a set of SCH methods have been developed, existing SCH methods mainly focus on intra-modal feature learning and similarity preserving, and then bridge modalities through loss function design, e.g., [21], [22] and [25], or a specific network module, e.g., [23] and [26], using the learned features of each modality to reduce the modality gap and learn hash codes. How to jointly explore intra-modal and inter-modal semantic similarity and structure information in both labeled and unlabeled data, so that the modality-specific and modality-shared information is fully exploited, has not been well studied. In this paper, we propose a novel SCH approach named Modality-specific and Cross-modal Graph Convolutional Networks (MCGCN). The contributions of our work are summarized in the following three points:

  • (1)

    MCGCN provides a three-channel network architecture, including two modality-specific channels and a cross-modal channel for the image and text modalities. Besides intra-modal graphs, a cross-modal graph is also modeled with heterogeneous image and text features. Joint intra- and inter-modal semantic similarity preservation and semantic information propagation to unlabeled samples are performed with GCNs, and the modality-specific and shared representations of each modality are fused with an attention scheme (a schematic sketch follows this list). To our knowledge, this is the first work to explicitly build a cross-modal graph and jointly learn modality-specific and modality-shared features for SCH.

  • (2)

    An adversarial scheme is employed to guide the optimization of the network parameters. The generative model learns to predict the semantic labels of feature representations and makes full use of the label and semantic similarity information to generate discriminative hash codes, while the discriminative model builds a modality classifier to model inter-modal invariance with an adversarial loss.

  • (3)

    We evaluate MCGCN on the widely used benchmark datasets Wikipedia [27] and NUS-WIDE-10K [28]. The experimental results demonstrate that our approach achieves state-of-the-art SCH performance.
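To make the three-channel design and the adversarial scheme above concrete, the following PyTorch-style sketch lays out one plausible realization: two modality-specific GCN channels, a cross-modal channel over a joint graph of image and text nodes, attention fusion per modality, and a modality discriminator. The class name MCGCNSketch, layer sizes, graph construction, and loss formulation are illustrative assumptions, not the exact configuration used in the paper.

# Minimal sketch of the MCGCN idea (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCNLayer(nn.Module):
    """One graph convolution: H' = ReLU(A_hat @ H @ W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, a_hat, h):
        return F.relu(a_hat @ self.linear(h))

class MCGCNSketch(nn.Module):
    def __init__(self, d_img, d_txt, d_common=512, code_len=64):
        super().__init__()
        # Project heterogeneous features into a common space for the cross-modal graph.
        self.img_proj = nn.Linear(d_img, d_common)
        self.txt_proj = nn.Linear(d_txt, d_common)
        # Two modality-specific channels and one cross-modal (shared) channel.
        self.img_gcn = GCNLayer(d_common, code_len)
        self.txt_gcn = GCNLayer(d_common, code_len)
        self.cross_gcn = GCNLayer(d_common, code_len)
        # Attention over {specific, shared} representations, one per modality.
        self.img_att = nn.Linear(2 * code_len, 2)
        self.txt_att = nn.Linear(2 * code_len, 2)
        # Modality discriminator: 0 = image, 1 = text.
        self.discriminator = nn.Sequential(
            nn.Linear(code_len, code_len), nn.ReLU(), nn.Linear(code_len, 2))

    def fuse(self, att_layer, specific, shared):
        w = torch.softmax(att_layer(torch.cat([specific, shared], dim=1)), dim=1)
        return w[:, :1] * specific + w[:, 1:] * shared

    def forward(self, img, txt, a_img, a_txt, a_cross):
        n = img.size(0)
        img_c, txt_c = self.img_proj(img), self.txt_proj(txt)
        # Modality-specific channels on intra-modal graphs (n x n each).
        img_spec = self.img_gcn(a_img, img_c)
        txt_spec = self.txt_gcn(a_txt, txt_c)
        # Cross-modal channel: image and text nodes share one 2n x 2n graph.
        joint = self.cross_gcn(a_cross, torch.cat([img_c, txt_c], dim=0))
        img_shared, txt_shared = joint[:n], joint[n:]
        # Attention fusion, then tanh as a continuous relaxation of binary hash codes.
        img_code = torch.tanh(self.fuse(self.img_att, img_spec, img_shared))
        txt_code = torch.tanh(self.fuse(self.txt_att, txt_spec, txt_shared))
        return img_code, txt_code

    def adversarial_losses(self, img_code, txt_code):
        # The discriminator tries to tell modalities apart; the generator
        # (the three channels above) is trained to fool it, shrinking the modality gap.
        logits = self.discriminator(torch.cat([img_code, txt_code], dim=0))
        target = torch.cat([torch.zeros(len(img_code), dtype=torch.long),
                            torch.ones(len(txt_code), dtype=torch.long)])
        d_loss = F.cross_entropy(logits, target)
        g_loss = F.cross_entropy(logits, 1 - target)  # confuse the discriminator
        return d_loss, g_loss

In the full method, the supervised classification loss on labeled nodes and the similarity-preserving losses described in the abstract would be combined with g_loss; only the adversarial part is sketched here.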

The rest of this paper is organized as follows. Section 2 briefly introduces the related works on supervised and unsupervised cross-modal hashing methods, semi-supervised cross-modal hashing methods, and graph convolutional networks. In Section 3, we detail the proposed MCGCN approach. Section 4 reports the experimental results on the Wikipedia and NUS-WIDE-10K datasets, and provides a comprehensive discussion about MCGCN. Finally, the conclusions are drawn in Section 5.

Section snippets

Supervised and unsupervised cross-modal hashing methods

Nowadays, several supervised or unsupervised cross-modal hashing methods have been presented and have achieved significant progress [29], [30], [31], [32]. Using matrix factorization, collective matrix factorization hashing (CMFH) [13] learns unified hash codes in a shared latent semantic space for the different modalities of an instance. Deep cross-modal hashing (DCMH) [8] provides an end-to-end deep learning framework to perform cross-modal retrieval. Cycle-consistent deep

Notation

Given a multimodal dataset $D=\{I,T\}$, where $I=[i_1,\ldots,i_N]\in\mathbb{R}^{d_I\times N}$ and $T=[t_1,\ldots,t_N]\in\mathbb{R}^{d_T\times N}$ separately denote the feature matrices of the image and text modalities, the data can be divided into a retrieval set $D_r$ and a query set $D_q$. Here, $N$ is the total number of feature vectors of the image/text modality and $d_I\neq d_T$. The retrieval set $D_r=\{D_r^L,D_r^U\}$, where $D_r^L$ is a collection of $N_L$ instances of labeled image-text pairs and $D_r^U$ is a set of $N_U$ instances of unlabeled image-text pairs. $l_p^L\in\{0,1\}^{C\times 1}$ represents the
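Since this notation snippet is truncated, the short NumPy sketch below only illustrates the shapes implied by the notation and one common way of turning an affinity matrix into the normalized adjacency that a GCN consumes. The cosine k-NN affinity and the helper normalized_adjacency are assumed constructions for illustration, not the paper's stated procedure.

# Shapes implied by the notation (example values), plus the standard symmetric
# normalization A_hat = D^{-1/2}(A + I)D^{-1/2} commonly used by GCNs.
import numpy as np

d_I, d_T, N, C = 4096, 1000, 2173, 10   # feature dims, number of pairs, number of classes
I = np.random.randn(d_I, N)             # image feature matrix, d_I x N
T = np.random.randn(d_T, N)             # text feature matrix,  d_T x N (d_I != d_T)
N_L = 500                                # labeled pairs; the remaining N - N_L are unlabeled
L = np.zeros((C, N_L))                   # one-hot label matrix for the labeled subset
L[np.random.randint(C, size=N_L), np.arange(N_L)] = 1

def normalized_adjacency(features, k=10):
    """Cosine k-NN affinity -> symmetric normalized adjacency for a GCN (assumed construction)."""
    X = features / (np.linalg.norm(features, axis=0, keepdims=True) + 1e-12)
    sim = X.T @ X                                   # N x N cosine similarity
    A = np.zeros_like(sim)
    idx = np.argsort(-sim, axis=1)[:, :k]           # keep k nearest neighbors per node
    np.put_along_axis(A, idx, 1.0, axis=1)
    A = np.maximum(A, A.T)                          # symmetrize
    np.fill_diagonal(A, 1.0)                        # self-loops
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A @ D_inv_sqrt

A_img = normalized_adjacency(I)   # intra-modal graph for the image channel
A_txt = normalized_adjacency(T)   # intra-modal graph for the text channel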

Datasets and compared methods

In this paper, we use two benchmark datasets Wikipedia [27] and NUS-WIDE-10K [28] to evaluate our approach MCGCN.

-The Wikipedia dataset [27] is collected from Wikipedia articles. It contains 2,866 image-text pairs from 10 categories. Following [23], the dataset is divided into a training set (retrieval set) with 2,173 pairs and a test set (query set) with the remaining 693 pairs.

-The NUS-WIDE-10K dataset [28] is a subset of the NUS-WIDE dataset [44], including the pairs of 10 largest categories

Conclusion

In this paper, we have proposed a novel semi-supervised cross-modal hashing approach named MCGCN. Modality-specific and modality-shared features are effectively explored through joint intra-modal and cross-modal graph modeling and graph convolutional representation learning. The label and structure information of labeled and unlabeled samples is fully leveraged to propagate semantic information and learn discriminative hash codes.

Comprehensive experiments on two widely used datasets

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Nos. 62076139, 61702280), Open Research Project of Zhejiang Lab (No. 2021KF0AB05), Future Network Scientific Research Fund Project (No. FNSRFP-2021-YB-15), 1311 Talent Program of Nanjing University of Posts and Telecommunications, the National Postdoctoral Program for Innovative Talents (No. BX20180146), China Postdoctoral Science Foundation (No. 2019M661901), and Jiangsu Planned Projects for Postdoctoral Research Funds

References (45)

  • Q.-Y. Jiang et al., Deep cross-modal hashing, IEEE Conference on Computer Vision and Pattern Recognition (2017)

  • X. Ma et al., Multi-level correlation adversarial hashing for cross-modal retrieval, IEEE Trans. Multimedia (2020)

  • S. Jin et al., SSAH: semi-supervised adversarial deep hashing with self-paced hard sample generation, AAAI Conference on Artificial Intelligence (2020)

  • Y. Wang et al., Deep unified cross-modality hashing by pairwise data alignment, International Joint Conference on Artificial Intelligence (2021)

  • C. Sun et al., Supervised hierarchical cross-modal hashing, International ACM SIGIR Conference on Research and Development in Information Retrieval (2019)

  • G. Ding et al., Large-scale cross-modality search via collective matrix factorization hashing, IEEE Trans. Image Process. (2016)

  • L. Wu et al., Cycle-consistent deep generative hashing for cross-modal retrieval, IEEE Trans. Image Process. (2019)

  • W. Wang et al., Set and rebase: determining the semantic graph connectivity for unsupervised cross-modal hashing, International Joint Conference on Artificial Intelligence (2020)

  • H. Hu et al., Creating something from nothing: unsupervised knowledge distillation for cross-modal hashing, IEEE Conference on Computer Vision and Pattern Recognition (2020)

  • P.-F. Zhang et al., Aggregation-based graph convolutional hashing for unsupervised cross-modal retrieval, IEEE Trans. Multimedia (2021)

  • J. Yu et al., Deep graph-neighbor coherence preserving network for unsupervised cross-modal hashing, AAAI Conference on Artificial Intelligence (2021)

  • Y. Zhang et al., Deep relation embedding for cross-modal retrieval, IEEE Trans. Image Process. (2020)

    Fei Wu received the PhD degree in information and communication engineering from the Nanjing University of Posts and Telecommunications (NJUPT), Nanjing, China, in 2016. He is currently an associate professor with the College of Automation and Artificial Intelligence, NJUPT. He has authored over 40 scientific papers in venues such as TPAMI, TIP, PR, CVPR, AAAI, and IJCAI. His current research interests include pattern recognition and artificial intelligence.

    Shuaishuai Li is currently a Master candidate in control engineering with NJUPT, Nanjing, China. His current research interests include pattern recognition and data mining.

    Guangwei Gao received the PhD degree in pattern recognition and intelligence systems from Nanjing University of Science and Technology, Nanjing, China, in 2014. Now, he is an associate professor with the Institute of Advanced Technology, Nanjing University of Posts and Telecommunications, Nanjing, China. His research mainly focuses on pattern recognition and computer vision.

    Yimu Ji received the PhD degree in computer science from NJUPT, Nanjing, China, in 2006. He is a professor with NJUPT. His current research interests include intelligent driving, computer vision and big data processing.

    Xiao-Yuan Jing received the Doctoral degree in pattern recognition and intelligent system from the Nanjing University of Science and Technology, Nanjing, China, in 1998. He is currently a professor with the School of Computer, Wuhan University, Wuhan, China. He has published more than 100 papers in venues such as TPAMI, TIP, TIFS, TCSVT, TMM, PR, CVPR, AAAI, IJCAI, and ICSE. His current research interests include pattern recognition and artificial intelligence.

    Zhiguo Wan received the PhD degree from the School of Computing, National University of Singapore, Singapore, in 2007. He was an associate professor with the School of Computer Science and Technology, Shandong University. From 2008 to 2014, he was an assistant professor with the School of Software, Tsinghua University. He was a post-doctoral researcher with the Katholieke University of Leuven, Belgium, from 2006 to 2008. He is currently a principal investigator with Zhejiang Laboratory, Hangzhou, Zhejiang, China. His research interest includes intelligent computing.
