Discrete asymmetric zero-shot hashing with application to cross-modal retrieval

doi:10.1016/j.neucom.2022.09.037

Neurocomputing

Volume 511, 28 October 2022, Pages 366-379

Abstract

In recent years, cross-modal retrieval technology has attracted extensive attention with the massive growth of multimedia data. However, most cross-modal hashing methods focus on the retrieval of seen classes and ignore the retrieval of unseen classes. As a result, traditional cross-modal hashing methods cannot achieve satisfactory performance in zero-shot retrieval. To mitigate this challenge, we propose a novel zero-shot cross-modal retrieval method called discrete asymmetric zero-shot hashing (DAZSH), which fully exploits the supervised knowledge of multimodal data. Specifically, it integrates pairwise similarity, class attributes and semantic labels to guide zero-shot hashing learning. Moreover, DAZSH combines the data features with the class attributes to obtain a semantic category representation for each category. The relationships between seen and unseen classes can then be captured by learning a category representation vector for each instance, so that the supervised knowledge can be transferred from the seen classes to the unseen classes. In addition, we develop an efficient discrete optimization strategy to solve the proposed model. Extensive experiments on three benchmark datasets show that our proposed approach achieves promising results in cross-modal retrieval tasks. The source code of this paper can be obtained from https://github.com/szq0816/DAZSH.

Introduction

Over the past decade, cross-modal retrieval has become a great challenge owing to the exponential growth of multimodal data. Data from different modalities are usually dependent and share essential connections. Learning the correlation between modalities is therefore fundamental in pattern recognition and machine learning, but it is hindered by the discrepancy between the representations of different modalities, which is referred to as the heterogeneity gap. To resolve this discrepancy, traditional approaches project multimedia data into a common semantic space and then perform the retrieval task. However, real-valued projections require more storage space and incur expensive computational costs as the amount of multimedia data increases, which has become a significant obstacle in cross-modal retrieval applications. Hashing is an effective alternative for retrieval in large-scale datasets due to its low storage cost and high computational efficiency. It aims to project the original samples into compact binary codes that preserve their similarity in Hamming space. Recently, researchers have made many efforts to bridge the heterogeneity gap between multiple modalities and have achieved promising performance in many real applications [1], [2], [3], [4].
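To make the idea of retrieval in Hamming space concrete, the following is a minimal sketch rather than the DAZSH model: it binarizes features with a random projection (an assumption made purely for illustration; the function names `sign_hash` and `hamming_distances` are hypothetical) and ranks database items by Hamming distance to a query code.

```python
import numpy as np

def sign_hash(features, projection):
    """Project real-valued features and binarize them into {-1, +1} codes."""
    codes = np.sign(features @ projection)
    codes[codes == 0] = 1  # break ties so that every bit is -1 or +1
    return codes.astype(np.int8)

def hamming_distances(query_code, database_codes):
    """Hamming distance between one code and every database code.

    For codes in {-1, +1} of length k, d_H = (k - <b_q, b_d>) / 2.
    """
    k = database_codes.shape[1]
    inner = database_codes.astype(np.int32) @ query_code.astype(np.int32)
    return (k - inner) // 2

rng = np.random.default_rng(0)
projection = rng.standard_normal((16, 8))  # hypothetical projection: 16-d features to 8 bits
database = rng.standard_normal((5, 16))    # toy database of 5 items
db_codes = sign_hash(database, projection)

query_code = db_codes[2].copy()            # query with the code of item 2
distances = hamming_distances(query_code, db_codes)
ranking = np.argsort(distances, kind="stable")  # nearest items first
```

Each item is stored in k bits instead of a real-valued vector, and the distance computation reduces to an integer inner product, which is where the storage and speed advantages come from.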

To the best of our knowledge, most existing cross-modal hashing retrieval methods are studied on datasets that contain only seen classes [5], [6], [7]. However, with the explosive growth of multimedia data, new concepts, i.e., unseen classes, have emerged in the past few years. Retraining an existing cross-modal hashing model after collecting data for new concepts is costly and requires much storage space. It is therefore necessary to adopt a cross-modal hashing model that can deal with data containing new concepts. Zero-shot learning addresses this need by aiming to identify previously unseen data categories. Specifically, the trained classifier not only identifies the data categories present in the training set, but also distinguishes data from the unseen categories [8].

In the past several years, many zero-shot learning approaches have been applied to retrieval [9], [10], [11]. Yang et al. [9] realized semantic transfer by projecting the labels into the word embedding space. Shi et al. [10] proposed a zero-shot hashing method based on the asymmetric ratio similarity matrix, which can significantly improve the ability of knowledge transfer from seen classes to unseen classes. Transductive zero-shot hashing (T-MLZSH) [11] was proposed as a multilabel image retrieval model based on zero-shot learning. In this model, the labels of unseen classes are predicted by the instance-concept coherence ranking. Nevertheless, all the abovementioned works were applied to single-modality retrieval tasks, and few efforts have addressed cross-modality retrieval of unseen classes. With the continuous emergence of new concepts, existing cross-modal retrieval methods have the following limitations. (1) They only consider data from the seen categories and ignore the unseen cases. Therefore, these models are unsuitable for cross-modal retrieval with mixed unseen data. (2) Most of them neglect the class attribute information in hash code learning and thus are not conducive to knowledge transfer from seen classes to unseen classes. (3) Existing zero-shot hashing approaches fail to consider the pairwise similarity, class labels and class attributes at the same time when training models.

To address the above challenges, in this work, a novel zero-shot hashing method, called discrete asymmetric zero-shot hashing (DAZSH), is proposed for cross-modal retrieval. It integrates pairwise similarity, semantic labels and category attributes into a single framework to fully explore semantic information. Specifically, we combine the data features with category attributes to obtain the category representation vector of each instance. As a result, the relationship between seen classes and unseen classes can be better captured, and the supervision knowledge can be transferred from the seen classes to the unseen classes. Fig. 1 shows the framework of our proposed DAZSH method in zero-shot cross-modal retrieval. The experimental results on three datasets show that our proposed DAZSH method achieves better retrieval performance in dealing with unseen data.
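As a rough illustration of how class attributes can link seen and unseen classes, the sketch below maps features into an attribute space with ridge regression and compares the result against class attribute vectors. This is a generic attribute-embedding example under stated assumptions, not the DAZSH objective, which is not given in this excerpt; the function names, the regularization weight `lam` and the toy data are all hypothetical.

```python
import numpy as np

def fit_attribute_projection(X, Y, A, lam=1e-2):
    """Ridge regression from features to the attribute space.

    X: (n, d) features, Y: (n, c) one-hot labels of seen classes,
    A: (c, a) attribute vectors of the seen classes.
    """
    T = Y @ A  # per-instance attribute targets
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ T)

def nearest_class(x, W, class_attributes):
    """Index of the class whose attributes are most cosine-similar to x @ W."""
    z = x @ W
    z = z / (np.linalg.norm(z) + 1e-12)
    norms = np.linalg.norm(class_attributes, axis=1, keepdims=True) + 1e-12
    return int(np.argmax((class_attributes / norms) @ z))

rng = np.random.default_rng(0)
# Toy attributes: classes 0 and 1 are seen, class 2 is unseen but shares attributes.
A_all = np.array([[1.0, 1.0, 0.0],
                  [0.0, 1.0, 1.0],
                  [1.0, 0.0, 1.0]])
labels = np.repeat([0, 1], 20)               # 20 training instances per seen class
Y = np.eye(2)[labels]                        # one-hot labels over the seen classes
prototypes = rng.standard_normal((2, 6))     # hypothetical class prototypes in feature space
X = Y @ prototypes + 0.01 * rng.standard_normal((40, 6))

W = fit_attribute_projection(X, Y, A_all[:2])
predicted = nearest_class(X[0], W, A_all)    # candidates include the unseen class
```

Because every class, seen or unseen, is described in the same attribute space, an instance embedded there can be compared with classes that never appeared during training.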

The contributions of this work are as follows:

(1) We propose a unified discrete asymmetric zero-shot framework for learning hash codes. It combines data features with class attributes to learn an attribute space for each modality. This allows us to explore the relationship between the seen classes and the unseen classes and to transfer the supervision information from the former to the latter. In addition, our proposed approach embeds the labels into the attribute space to improve retrieval accuracy. Consequently, our proposed model can generate more discriminative hash codes than traditional hashing methods.
(2) To maintain the characteristics of each modality, we generate different hash codes for different modalities using the asymmetric similarity strategy. Furthermore, we employ the maximum likelihood estimation algorithm to explore the pairwise similarity of multimodality data. To the best of our knowledge, this study is the first to exploit the pairwise similarity of different modalities using the maximum likelihood estimation algorithm in the cross-modal zero-shot learning field. Compared with traditional cross-modal zero-shot learning methods, our proposed method can alleviate the heterogeneity gap between different modalities and, at the same time, connect the seen and unseen classes more closely.
(3) We develop a discrete optimization scheme to solve our proposed model, and then give its complexity analysis. Comprehensive experimental results on three benchmark datasets show the superiority of our proposed DAZSH method in different retrieval tasks.
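The combination of modality-specific hash codes and likelihood-based pairwise similarity in contribution (2) can be sketched as follows. The excerpt does not state the exact objective, so this example assumes the negative log-likelihood commonly used in likelihood-based hashing, where the probability that an image and a text are similar is the sigmoid of half the inner product of their codes; the function name `pairwise_nll` and the toy codes are hypothetical.

```python
import numpy as np

def pairwise_nll(B_img, B_txt, S):
    """Negative log-likelihood of a pairwise similarity matrix.

    B_img: (n, k) image codes in {-1, +1}; B_txt: (m, k) text codes in {-1, +1}.
    S: (n, m) with S[i, j] = 1 if image i and text j are similar, else 0.
    Assumes p(S[i, j] = 1) = sigmoid(theta[i, j]) with theta = 0.5 * B_img @ B_txt.T.
    """
    theta = 0.5 * (B_img @ B_txt.T)
    # log(1 + exp(theta)) is computed stably as logaddexp(0, theta)
    return float(np.sum(np.logaddexp(0.0, theta) - S * theta))

# Two instances from different classes, 4-bit codes.
B_img = np.array([[1, 1, 1, 1], [-1, -1, -1, -1]])
S = np.eye(2)  # image i is similar only to text i

aligned_txt = B_img.copy()   # text codes agree with their paired images
swapped_txt = B_img[::-1]    # text codes agree with the wrong images

loss_aligned = pairwise_nll(B_img, aligned_txt, S)
loss_swapped = pairwise_nll(B_img, swapped_txt, S)
```

The image and text codes are separate matrices, so each modality keeps its own codes while the loss still rewards similar cross-modal pairs for being close in Hamming space.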

The remainder of this paper is organized as follows: Section 2 introduces the previous work on cross-modal retrieval. Section 3 details our approach and its optimization scheme. Section 4 gives the experimental results and their analysis. Section 5 draws the conclusion of this work.

Related works

In this section, we give a preliminary introduction to works on traditional cross-modal retrieval and zero-shot cross-modal retrieval.

Discrete asymmetric zero-shot hashing (DAZSH)

This section introduces the proposed DAZSH model in detail.

Experiments

In this section, we carry out experiments to verify the effectiveness of DAZSH in cross-modal retrieval. Specifically, we set up two common query tasks: image-to-text retrieval, in which an image is used to query texts, and text-to-image retrieval, in which a text is used to query images. In addition, the experiments are conducted on query sets in two different scenarios (seen and unseen).
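An evaluation of this kind can be sketched as below. The excerpt does not name the metric or the protocol, so the example assumes mean average precision (mAP) over a Hamming ranking, a common choice in hashing retrieval, and single-label data; the function names and the toy codes are hypothetical. The same routine would be called once per task and per scenario, for example with image codes as queries and text codes as the database, restricted to seen or unseen query classes.

```python
import numpy as np

def average_precision(relevant_sorted):
    """Average precision for one query, given 0/1 relevance in ranked order."""
    relevant_sorted = np.asarray(relevant_sorted, dtype=float)
    n_rel = relevant_sorted.sum()
    if n_rel == 0:
        return 0.0
    ranks = np.arange(1, len(relevant_sorted) + 1)
    precision_at_k = np.cumsum(relevant_sorted) / ranks
    return float((precision_at_k * relevant_sorted).sum() / n_rel)

def mean_average_precision(query_codes, db_codes, query_labels, db_labels):
    """mAP of Hamming ranking; codes are in {-1, +1}, labels are class ids."""
    k = db_codes.shape[1]
    aps = []
    for code, label in zip(query_codes, query_labels):
        dist = (k - db_codes @ code) / 2         # Hamming distance to each item
        order = np.argsort(dist, kind="stable")  # nearest first
        aps.append(average_precision(db_labels[order] == label))
    return float(np.mean(aps))

# Toy image-to-text task: image codes query a database of text codes.
txt_codes = np.array([[1, 1], [1, 1], [-1, -1]])
txt_labels = np.array([0, 0, 1])
img_codes = np.array([[1, 1], [-1, -1]])
img_labels = np.array([0, 1])

map_i2t = mean_average_precision(img_codes, txt_codes, img_labels, txt_labels)
```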

Conclusions

In this paper, we propose a novel zero-shot hashing method, called DAZSH, for cross-modal retrieval. The method adopts an asymmetric discrete coding structure and pairwise similarity to guide hash code learning, which can significantly improve the discriminative ability of hash codes. Meanwhile, the DAZSH method constructs an attribute space for each modality by combining the feature matrix and the class attribute matrix, and thus achieves knowledge transfer from seen classes to unseen classes.

CRediT authorship contribution statement

Zhenqiu Shu: Conceptualization, Writing – review & editing, Supervision. Kailing Yong: Data curation, Software, Validation, Visualization, Writing – original draft. Jun Yu: Writing – review & editing. Shengxiang Gao: Writing – review & editing. Cunli Mao: Writing – review & editing. Zhengtao Yu: Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China [Grant No. 61603159, 62162033, 62020106012, U21B2027], Yunnan Provincial Major Science and Technology Special Plan Projects [Grant No. 202002AD080001, 202103AA080015], Yunnan Foundation Research Projects [Grant No. 202201AT070154, 202101BE070001-056].

Zhenqiu Shu received the Ph.D. degree in computer applications from Nanjing University of Science and Technology. In February 2021, he joined the Faculty of Information Engineering and Automation, Kunming University of Science and Technology, where he is currently an associate professor. Before joining Kunming University of Science and Technology, he was a postdoctoral researcher at Jiangnan University for four years. His research interests include image processing, computer vision and

References (45)

Donglin Zhang et al.
Moon: Multi-hash codes joint learning for cross-media retrieval
Pattern Recognition Letters
(2021)
Zhenqiu Shu et al.
Specific class center guided deep hashing for cross-modal retrieval
Information Sciences
(2022)
Feng Xue et al.
Cross-modal retrieval via label category supervised matrix factorization hashing
Pattern Recognition Letters
(2020)
Xu Yuan et al.
CHOP: An orthogonal hashing method for zero-shot cross-modal retrieval
Pattern Recognition Letters
(2021)
Chuan-Xiang Li et al.
SCRATCH: A scalable discrete matrix factorization hashing for cross-modal retrieval
Xin Liu et al.
MTFH: A matrix tri-factorization hashing framework for efficient cross-modal retrieval
IEEE Transactions on Pattern Analysis and Machine Intelligence
(2019)
Jun Yu et al.
Adaptive multi-modal fusion hashing via Hadamard matrix
Applied Intelligence
(2022)
Lu Wang et al.
Asymmetric correlation quantization hashing for cross-modal retrieval
IEEE Transactions on Multimedia
(2021)
Yongxin Wang et al.
BATCH: A scalable asymmetric discrete cross-modal hashing
IEEE Transactions on Knowledge and Data Engineering
(2020)
Christoph H Lampert et al.
Learning to detect unseen object classes by between-class attribute transfer
Yang Yang et al.
Zero-shot hashing via transferring supervised knowledge
Yang Shi et al.
Zero-shot hashing via asymmetric ratio similarity matrix
IEEE Transactions on Knowledge and Data Engineering
(2022)
Qin Zou et al.
Transductive zero-shot hashing for multilabel image retrieval
IEEE Transactions on Neural Networks and Learning Systems
(2020)
Guiguang Ding et al.
Collective matrix factorization hashing for multimodal data
Di Wang et al.
Semantic topic multimodal hashing for cross-media retrieval
Jile Zhou et al.
Latent semantic sparse hashing for cross-modal similarity search
Di Wang et al.
Joint and individual matrix factorization hashing for large-scale cross-modal retrieval
Pattern Recognition
(2020)
Tao Yao et al.
Discrete robust matrix factorization hashing for large-scale cross-media retrieval
IEEE Transactions on Knowledge and Data Engineering
(2021)
Zijia Lin et al.
Semantics-preserving hashing for cross-view retrieval
Xingbo Liu et al.
Fast discrete cross-modal hashing with regressing from semantic labels
Di Wang et al.
Label consistent matrix factorization hashing for large-scale cross-modal similarity search
IEEE Transactions on Pattern Analysis and Machine Intelligence
(2018)
Donglin Zhang et al.
Label consistent flexible matrix factorization hashing for efficient cross-modal retrieval
ACM Transactions on Multimedia Computing, Communications, and Applications
(2021)

Kailing Yong is currently pursuing a Master's degree at the Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, China. Her current research interests include multimedia information retrieval and machine learning.

Jun Yu received his Ph.D. degree from the School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China. He joined the College of Computer and Communication Engineering, Zhengzhou University of Light Industry in 2021. His research interests include multimedia information retrieval, computer vision and deep learning.

Shengxiang Gao received the Ph.D. degree in computer application technology from the Kunming University of Science and Technology, Kunming, China, in 2016. She is currently an associate professor with the School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, China. Her current research interests include natural language processing and information retrieval.

Cunli Mao received the Ph.D. degree in computer application technology from the Kunming University of Science and Technology, Kunming, China, in 2014. He is currently a Professor with the School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, China. His current research interests include natural language processing and machine learning.

Zhengtao Yu received the Ph.D. degree in computer application technology from the Beijing Institute of Technology, Beijing, China, in 2005. He is currently a Professor with the School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, China. His current research interests include natural language processing, information retrieval, and machine learning.
