Elsevier

Neurocomputing

Volume 511, 28 October 2022, Pages 366-379

Discrete asymmetric zero-shot hashing with application to cross-modal retrieval

https://doi.org/10.1016/j.neucom.2022.09.037

Abstract

In recent years, cross-modal retrieval has attracted extensive attention owing to the massive growth of multimedia data. However, most cross-modal hashing methods focus on retrieving seen classes and ignore unseen classes. As a result, traditional cross-modal hashing methods cannot achieve satisfactory performance in zero-shot retrieval. To address this problem, we propose a novel zero-shot cross-modal retrieval method called discrete asymmetric zero-shot hashing (DAZSH), which fully exploits the supervised knowledge of multimodal data. Specifically, it integrates pairwise similarity, class attributes and semantic labels to guide zero-shot hash learning. Moreover, DAZSH combines the data features with the class attributes to obtain a semantic category representation for each category. The relationships between seen and unseen classes can thus be effectively captured by learning a category representation vector for each instance, so that supervised knowledge can be transferred from the seen classes to the unseen classes. In addition, we develop an efficient discrete optimization strategy to solve the proposed model. Extensive experiments on three benchmark datasets show that our proposed approach achieves promising results in cross-modal retrieval tasks. The source code of this paper can be obtained from https://github.com/szq0816/DAZSH.

Introduction

Over the past decade, cross-modal retrieval has become a great challenge owing to the exponential growth of multimodal data. Multimodal data are usually interdependent and share essential connections. Learning the correlation between modalities is therefore fundamental in pattern recognition and machine learning, and the difficulty of relating differently represented modalities is referred to as the heterogeneity gap. To bridge this gap, traditional approaches project multimedia data into a common semantic space and then perform the retrieval task there. However, real-valued projections require more storage space and incur high computational costs as the amount of multimedia data increases, which has become a significant obstacle in cross-modal retrieval applications. Hashing is an effective alternative for retrieval in large-scale datasets owing to its low storage cost and high computational efficiency. It projects the original samples into compact binary codes that preserve their similarity in Hamming space. Recently, researchers have made many efforts to bridge the heterogeneity gap between multiple modalities and have achieved promising performance in many real applications [1], [2], [3], [4].
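To make the storage and efficiency argument concrete, the following is a minimal sketch of retrieval in Hamming space. It is not the DAZSH method: the 8-bit codes, the helper names (`pack_bits`, `hamming`, `rank_by_hamming`) and the toy database are all hypothetical, and the codes are assumed to have already been produced by some hashing model.

```python
def pack_bits(bits):
    """Pack a sequence of 0/1 values into one Python int (one bit per value)."""
    code = 0
    for b in bits:
        code = (code << 1) | (1 if b else 0)
    return code


def hamming(a, b):
    """Hamming distance between two packed codes: XOR, then count set bits."""
    return bin(a ^ b).count("1")


def rank_by_hamming(query, database):
    """Return database indices sorted by Hamming distance to the query.

    Ties are broken by index so the ranking is deterministic.
    """
    return sorted(range(len(database)),
                  key=lambda i: (hamming(query, database[i]), i))


# Hypothetical 8-bit codes; a real system would use e.g. 32 to 128 bits.
db_bits = [
    [1, 0, 1, 1, 0, 0, 1, 0],
    [1, 0, 1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 1, 0, 1],
]
database = [pack_bits(b) for b in db_bits]
query = pack_bits([1, 0, 1, 1, 0, 0, 1, 1])

ranking = rank_by_hamming(query, database)
print(ranking)
```

Each item is stored as a handful of bits rather than a real-valued vector, and comparing two items reduces to an XOR followed by a bit count, which is where the low storage and high computational efficiency come from.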

To the best of our knowledge, most existing cross-modal hashing retrieval methods are studied on datasets of seen classes [5], [6], [7]. However, with the explosive growth of multimedia data, new concepts, that is, unseen classes, have continually emerged in the past few years. Retraining an existing cross-modal hashing model after collecting data for new concepts is costly and requires much storage space. Therefore, it is necessary to adopt a cross-modal hashing model that can deal with data containing new concepts. Zero-shot learning, which aims to identify previously unseen data categories, addresses this need. Specifically, the trained classifier not only identifies the data categories present in the training set, but also distinguishes data from the unseen categories [8].

In the past several years, many zero-shot learning approaches have been applied to retrieval [9], [10], [11]. Yang et al. [9] realized semantic transfer by projecting the labels into the word embedding space. Shi et al. [10] proposed a zero-shot hashing method based on the asymmetric ratio similarity matrix, which can significantly improve the ability of knowledge transfer from seen classes to unseen classes. Transductive zero-shot hashing (T-MLZSH) [11] was proposed as a multilabel image retrieval model based on zero-shot learning. In this model, the labels of unseen classes are predicted by the instance-concept coherence ranking. Nevertheless, all the abovementioned works were applied to single-modality retrieval tasks, and few efforts have been devoted to cross-modal retrieval of unseen classes. With the continuous emergence of new concepts, existing cross-modal retrieval methods have the following limitations. (1) They only consider data from the seen categories and ignore the unseen cases. Therefore, these models are unsuitable for cross-modal retrieval with mixed unseen data. (2) Most of them neglect the class attribute information in hash code learning and thus are not conducive to knowledge transfer from seen classes to unseen classes. (3) Existing zero-shot hashing approaches fail to consider the pairwise similarity, class labels and class attributes at the same time when training models.

To address the above challenges, in this work, a novel zero-shot hashing method, called discrete asymmetric zero-shot hashing (DAZSH), is proposed for cross-modal retrieval. It integrates pairwise similarity, semantic labels and category attributes into a single framework to fully explore semantic information. Specifically, we combine the data features with category attributes to obtain the category representation vector of each instance. In this way, the relationship between seen classes and unseen classes can be better captured, and the supervision knowledge can be transferred from the seen classes to the unseen classes. Fig. 1 shows the framework of our proposed DAZSH method in zero-shot cross-modal retrieval. The experimental results on three datasets show that our proposed DAZSH method achieves better retrieval performance when dealing with unseen data.
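The role of class attributes in linking seen and unseen classes can be illustrated with a small sketch. This is only an illustration of the general idea, not the DAZSH formulation: the attribute names, the attribute values, the class names and the use of cosine similarity are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two attribute vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical attributes: [has_hooves, has_stripes, has_mane, can_fly]
class_attributes = {
    "horse": [1.0, 0.0, 1.0, 0.0],  # seen class
    "tiger": [0.0, 1.0, 0.0, 0.0],  # seen class
    "zebra": [1.0, 1.0, 1.0, 0.0],  # unseen class: no training instances
}
seen_classes = ["horse", "tiger"]


def nearest_seen_class(unseen):
    """Return the seen class whose attributes best match the unseen class."""
    target = class_attributes[unseen]
    return max(seen_classes,
               key=lambda c: cosine(class_attributes[c], target))


print(nearest_seen_class("zebra"))
```

Because the unseen class is described in the same attribute space as the seen classes, a model trained only on seen classes still has a handle on where the unseen class lies relative to them, which is what makes knowledge transfer possible.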

The contributions of this work are as follows:

  • (1) We propose a unified discrete asymmetric zero-shot framework for learning hash codes. It combines data features with class attributes to learn an attribute space for each modality. This allows us to explore the relationship between the seen classes and the unseen classes, and thus to transfer the supervision information from the seen classes to the unseen classes. In addition, our proposed approach embeds the labels into the attribute space to improve retrieval accuracy. As a result, our proposed model can generate more discriminative hash codes than traditional hashing methods.

  • (2) To maintain the characteristics of each modality, we generate different hash codes for different modalities using the asymmetric similarity strategy. Furthermore, we employ maximum likelihood estimation to explore the pairwise similarity of multimodal data. To the best of our knowledge, this study is the first to exploit the pairwise similarity of different modalities through maximum likelihood estimation in the cross-modal zero-shot learning field. Compared with traditional cross-modal zero-shot learning methods, our proposed method can simultaneously alleviate the heterogeneity gap between different modalities and more closely connect the seen and unseen classes.

  • (3) We develop a discrete optimization scheme to solve our proposed model and give its complexity analysis. Comprehensive experimental results on three benchmark datasets show the superiority of our proposed DAZSH method in different retrieval tasks.
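As a rough illustration of how maximum likelihood estimation can tie pairwise similarity to modality-specific hash codes, the sketch below uses the sigmoid likelihood that is common in the supervised hashing literature. It is an assumption that this form resembles the objective used here; the preview does not give the actual DAZSH loss. The function names, the 0.5 scaling factor and the toy codes are hypothetical.

```python
import math


def inner(u, v):
    """Inner product of two code vectors with entries in {-1, +1}."""
    return sum(a * b for a, b in zip(u, v))


def softplus(t):
    """Numerically stable log(1 + exp(t))."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))


def pairwise_nll(img_codes, txt_codes, sim):
    """Negative log-likelihood of a 0/1 similarity matrix.

    Assumes P(sim[i][j] = 1) = sigmoid(theta_ij) with
    theta_ij = 0.5 * <img_codes[i], txt_codes[j]>. Image and text codes
    are separate variables, which is the asymmetric part of the sketch.
    """
    total = 0.0
    for i, bx in enumerate(img_codes):
        for j, by in enumerate(txt_codes):
            theta = 0.5 * inner(bx, by)
            total += softplus(theta) - sim[i][j] * theta
    return total


# Two image/text pairs; pair i is similar only to text i.
sim = [[1, 0], [0, 1]]
img_codes = [[1, 1, -1, -1], [-1, -1, 1, 1]]
txt_aligned = [[1, 1, -1, -1], [-1, -1, 1, 1]]
txt_swapped = [[-1, -1, 1, 1], [1, 1, -1, -1]]

nll_aligned = pairwise_nll(img_codes, txt_aligned, sim)
nll_swapped = pairwise_nll(img_codes, txt_swapped, sim)
print(nll_aligned < nll_swapped)
```

Minimizing this quantity pushes the codes of similar image-text pairs toward a large inner product (small Hamming distance) and the codes of dissimilar pairs toward a small one, without forcing both modalities to share a single code.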

The remainder of this paper is organized as follows: Section 2 introduces the previous work on cross-modal retrieval. Section 3 details our approach and its optimization scheme. Section 4 gives the experimental results and their analysis. Section 5 draws the conclusion of this work.

Related works

In this section, we give a preliminary introduction to works on traditional cross-modal retrieval and zero-shot cross-modal retrieval.

Discrete asymmetric zero-shot hashing (DAZSH)

This section introduces the proposed DAZSH model in detail.

Experiments

In this section, we carry out experiments to verify the effectiveness of DAZSH in cross-modal retrieval. Specifically, we set up two common query tasks: image query text and text query image. In addition, the experiments are conducted on query sets in two different scenarios (seen and unseen).

Conclusions

In this paper, we propose a novel zero-shot hashing method, called DAZSH, for cross-modal retrieval. The method adopts an asymmetric discrete coding structure and pairwise similarity to guide hash code learning, which can significantly improve the discriminative ability of hash codes. Meanwhile, the DAZSH method constructs an attribute space for each modality by combining the feature matrix and the class attribute matrix, and thus achieves knowledge transfer from seen classes to unseen classes.

CRediT authorship contribution statement

Zhenqiu Shu: Conceptualization, Writing – review & editing, Supervision. Kailing Yong: Data curation, Software, Validation, Visualization, Writing – original draft. Jun Yu: Writing – review & editing. Shengxiang Gao: Writing – review & editing. Cunli Mao: Writing – review & editing. Zhengtao Yu: Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China [Grant Nos. 61603159, 62162033, 62020106012, U21B2027], Yunnan Provincial Major Science and Technology Special Plan Projects [Grant Nos. 202002AD080001, 202103AA080015], and Yunnan Foundation Research Projects [Grant Nos. 202201AT070154, 202101BE070001-056].

References (45)

  • Yang Yang et al.

    Zero-shot hashing via transferring supervised knowledge

  • Yang Shi et al.

    Zero-shot hashing via asymmetric ratio similarity matrix

    IEEE Transactions on Knowledge and Data Engineering

    (2022)
  • Qin Zou et al.

    Transductive zero-shot hashing for multilabel image retrieval

    IEEE Transactions on Neural Networks and Learning Systems

    (2020)
  • Guiguang Ding et al.

    Collective matrix factorization hashing for multimodal data

  • Di Wang et al.

    Semantic topic multimodal hashing for cross-media retrieval

  • Jile Zhou et al.

    Latent semantic sparse hashing for cross-modal similarity search

  • Di Wang et al.

    Joint and individual matrix factorization hashing for large-scale cross-modal retrieval

    Pattern Recognition

    (2020)
  • Tao Yao et al.

    Discrete robust matrix factorization hashing for large-scale cross-media retrieval

    IEEE Transactions on Knowledge and Data Engineering

    (2021)
  • Zijia Lin et al.

    Semantics-preserving hashing for cross-view retrieval

  • Xingbo Liu et al.

    Fast discrete cross-modal hashing with regressing from semantic labels

  • Di Wang et al.

    Label consistent matrix factorization hashing for large-scale cross-modal similarity search

    IEEE Transactions on Pattern Analysis and Machine Intelligence

    (2018)
  • Donglin Zhang et al.

    Label consistent flexible matrix factorization hashing for efficient cross-modal retrieval

    ACM Transactions on Multimedia Computing, Communications, and Applications

    (2021)

    Zhenqiu Shu received the Ph.D. degree in computer applications from Nanjing University of Science and Technology. In February 2021, he joined the Faculty of Information Engineering and Automation, Kunming University of Science and Technology, where he is currently an associate professor. Before joining Kunming University of Science and Technology, he had been a postdoctoral researcher at Jiangnan University for four years. His research interests include image processing, computer vision and machine learning.

    Kailing Yong is currently pursuing a Master's degree at the Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, China. Her current research interests include multimedia information retrieval and machine learning.

    Jun Yu received his Ph.D. degree from the School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China. He joined the College of Computer and Communication Engineering, Zhengzhou University of Light Industry in 2021. His research interests include multimedia information retrieval, computer vision and deep learning.

    Shengxiang Gao received the Ph.D. degree in computer application technology from the Kunming University of Science and Technology, Kunming, China, in 2016. She is currently an associate professor with the School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, China. Her current research interests include natural language processing and information retrieval.

    Cunli Mao received the Ph.D. degree in computer application technology from the Kunming University of Science and Technology, Kunming, China, in 2014. He is currently a Professor with the School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, China. His current research interests include natural language processing and machine learning.

    Zhengtao Yu received the Ph.D. degree in computer application technology from the Beijing Institute of Technology, Beijing, China, in 2005. He is currently a Professor with the School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, China. His current research interests include natural language processing, information retrieval, and machine learning.
