
Neurocomputing

Volume 483, 28 April 2022, Pages 148-159

Image-text bidirectional learning network based cross-modal retrieval

https://doi.org/10.1016/j.neucom.2022.02.007

Abstract

The problem of cross-modal retrieval has attracted significant attention in the cross-media retrieval community. One key challenge of cross-modal retrieval is to eliminate the heterogeneity gap between different modalities. Numerous existing cross-modal retrieval approaches jointly construct a common subspace, but they fail to sufficiently consider the mutual influence between modalities during the whole training process. In this paper, we propose a novel image-text Bidirectional Learning Network (BLN) based cross-modal retrieval method. The method constructs a common representation space and directly measures the similarity of heterogeneous data. More specifically, a multi-layer supervision network is proposed to learn the cross-modal relevance of the generated representations. Moreover, a bidirectional crisscross loss function is proposed to preserve modal invariance with the bidirectional learning strategy in the common representation space. The discriminant consistency loss and the bidirectional crisscross loss are integrated into an objective function which aims to minimize the intra-class distance and maximize the inter-class distance. Comprehensive experimental results on four widely-used databases show that the proposed method is effective and superior to existing cross-modal retrieval methods.

Introduction

Information usually comprises multimedia data such as text, voice and images [1], [2]. With the growth of multimedia data, cross-modal retrieval has become a widespread application requirement, especially using a text query to search for images and vice versa. Cross-modal retrieval is an emerging data retrieval technology whose purpose is to determine whether data from different modalities point to the same content [3], [4]. Due to the distribution gap and heterogeneity, it is difficult to directly measure the correlation between cross-modal data. Therefore, the matching of image and text data is a challenging task.

To address the aforementioned cross-modal retrieval problem, numerous approaches have been proposed in recent years to eliminate the cross-modal gap. A common approach is representation learning, which aims to transform the samples from different modalities into a common representation space [5]. Existing representation learning methods fall mainly into two types: traditional approaches and deep learning approaches. Traditional cross-modal retrieval methods learn the common subspace by modeling the correlation between image and text. Due to the constraints of feature embedding and label information, these methods cannot obtain satisfactory results for nonlinear features. Deep Neural Networks (DNNs) [6] achieve performance competitive with the state of the art in representation learning, so several deep learning-based methods have been proposed in recent years to address the above problems. The deep learning-based approaches provide scalable nonlinear transformations for feature representations. In particular, DNN-based cross-modal retrieval approaches exploit nonlinear correlations to learn a common subspace. For instance, Andrew et al. [7] proposed Deep Canonical Correlation Analysis (DCCA), which learned complex nonlinear projections through a deep network.

The pioneering work on convolutional neural networks (CNNs) [8] has greatly influenced research in computer vision. Compared with traditional methods, CNNs obtain better image feature representations in multiple fields, including cross-modal retrieval. Accordingly, Wang et al. [9] proposed a Multimodal Deep Neural Network (MDNN) based on a deep CNN and a Neural Language Model (NLM) to learn the multimodal mapping function. Li et al. [10] learned a deep network for each modality and projected the cross-modal features into a common semantic space. To narrow the heterogeneity gap between image and text, the method used a Deep Convolutional Activation Feature (DeCAF) to extract visual features.

Inspired by Generative Adversarial Networks (GANs) [11], [12], [13], [14], Wang et al. [15] proposed an Adversarial Cross-Modal Retrieval (ACMR) method for cross-modal retrieval tasks. Based on the adversarial learning mechanism, the method obtained an effective shared subspace between different modalities. To preserve more discriminative information in the common space, Hu et al. [16] proposed a Multimodal Adversarial Network (MAN), which learned a common representation space with an eigenvalue strategy. Also based on the GAN model, JFSE [17] proposed three distribution alignment schemes with advanced cycle-consistency constraints to preserve semantic compatibility.

In addition, several methods attempted to achieve semantic alignment of different modalities to enhance retrieval performance. Qi et al. [18] proposed a Cross-modal Bidirectional Translation (CBT) approach that adopted a translation mechanism and a reinforcement learning strategy to effectively explore image-text correlation. CBT designed two loss functions, one for inter-modality and one for intra-modality, which mutually boost cross-modal correlation learning. Ji et al. [19] proposed a Saliency-guided Attention Network (SAN) to solve the problem of asymmetry. It employed visual and textual attention modules to learn the fine-grained correlation of cross-modal data. Xu et al. [20] proposed a novel hybrid matching approach named Cross-modal Attention with Semantic Consistency (CASC) for image-text matching. Aiming to align local semantics, it exploited the global semantic consistency between image regions and sentence words as complementary information. To address the challenging issue of incomplete cross-modal retrieval, Jing et al. [21] proposed Dual-Aligned Variational Autoencoders (DAVAE) to simultaneously handle both the heterogeneity problem and the incompleteness problem in a unified deep model. Nevertheless, these methods do not fully utilize predicted cross-modal information in both the image and text feature spaces. Meanwhile, they learn the shared representation through a shallow network structure, which cannot fully capture the complex cross-media correlation.

In this paper, we present a novel image-text Bidirectional Learning Network (BLN) method for cross-modal retrieval. It aims to eliminate the cross-modal differences between samples with the same semantics while preserving the discrimination between samples with different semantics. Different from existing two-modality methods, BLN optimizes a multi-layer network that enhances the nonlinearity of the feature representation by constructing two types of network units with different activation functions. Simultaneously, the method minimizes the bidirectional crisscross loss across the mapped features. In addition, four task-specific losses are integrated into an objective function to learn the common space. Following this learning strategy, the multimodal features extracted by different encoders are mapped to a deep common space, which ensures that the proposed cross-modal retrieval framework is discriminative and modal invariant. The main contributions of this paper are as follows.

  • (1)

    This paper designs a novel cross-modal retrieval framework based on bidirectional learning to eliminate the semantic gap between multimodal data. The deep framework effectively learns a common representation space while ensuring semantic discrimination and modal invariance.

  • (2)

    This paper proposes two novel network units with different activation functions to bridge the dimension gap between cross-modal data. By employing these network units, the discrepancy between related heterogeneous data is minimized and the nonlinear capability is preserved when learning the common representation space.

  • (3)

    A bidirectional crisscross loss function is proposed to learn the cross-modal similarity through the category label information. Different from previous loss functions, the bidirectional crisscross loss not only minimizes the discriminative consistency loss between the mapped feature and the real label but also minimizes the correlation loss with the predicted label of the opposite modality.

In this paper, extensive experiments are conducted to separately evaluate the performance of individual components and the overall performance of the framework on four widely-used benchmark databases. The experimental results achieve state-of-the-art performance on these databases, which demonstrates the effectiveness of the bidirectional learning network based cross-modal retrieval method.
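To make the bidirectional crisscross objective of contribution (3) more concrete, the following is a minimal PyTorch-style sketch. It is only an illustration under stated assumptions: the function and parameter names (bidirectional_crisscross_loss, lambda_cross) and the use of cross-entropy and KL divergence are our own choices, not the exact formulation of the paper, and the full BLN objective additionally combines two more task-specific losses as described above.

import torch
import torch.nn.functional as F

def bidirectional_crisscross_loss(img_logits, txt_logits, labels, lambda_cross=0.5):
    # Discriminant consistency: each modality's prediction should match
    # the ground-truth category label.
    consistency = F.cross_entropy(img_logits, labels) + F.cross_entropy(txt_logits, labels)

    # Crisscross term: each modality's prediction should also agree with the
    # (detached) predicted label distribution of the opposite modality.
    cross = (F.kl_div(F.log_softmax(img_logits, dim=1),
                      F.softmax(txt_logits, dim=1).detach(),
                      reduction='batchmean')
             + F.kl_div(F.log_softmax(txt_logits, dim=1),
                        F.softmax(img_logits, dim=1).detach(),
                        reduction='batchmean'))

    return consistency + lambda_cross * cross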

Section snippets

Related work

Cross-modal retrieval methods are commonly divided into two categories: hash-based approaches [22], [23], [24], [25], [26] and real-value approaches [27], [28], [29]. Hash-based approaches attempt to learn hash functions for different modalities to improve computational efficiency. Specifically, the hash-based approaches map cross-modal data to a common Hamming space to measure similarity, in which related cross-modal data are kept as close as possible.
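As a toy illustration of the Hamming-space similarity used by hash-based approaches (the 8-bit codes below are made up for the example and are not from any cited method):

import numpy as np

def hamming_distance(code_a, code_b):
    # Number of bit positions where the two binary codes differ;
    # related cross-modal items should end up with a small distance.
    return int(np.sum(code_a != code_b))

text_query_code = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # hash code of a text query
image_code      = np.array([1, 0, 1, 0, 0, 0, 1, 1])   # hash code of a candidate image
print(hamming_distance(text_query_code, image_code))    # -> 2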

Bidirectional Learning Network

Due to the gap between cross-modal data, the similarity of the original image and text representations cannot be calculated directly. To measure the similarity between image and text, this paper constructs a bidirectional learning network to learn modality-specific features for each modality. The proposed cross-modal retrieval method based on BLN is shown in Fig. 1. The overall framework consists of a subnetwork for image modality and a subnetwork for text modality. In Fig. 1, the embedded module is the
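Although the snippet above is truncated, the layout it describes, one subnetwork per modality projecting into the common representation space and feeding a label predictor, can be sketched as follows. This is a hypothetical PyTorch sketch; the layer sizes, activation functions and input dimensions are assumptions rather than the authors' exact configuration.

import torch.nn as nn

class ModalitySubnetwork(nn.Module):
    def __init__(self, in_dim, common_dim=1024, num_classes=20):
        super().__init__()
        # Multi-layer projection from the modality-specific feature space
        # into the common representation space.
        self.project = nn.Sequential(
            nn.Linear(in_dim, 2048), nn.ReLU(),
            nn.Linear(2048, common_dim), nn.Tanh(),
        )
        # Label predictor used by the supervision losses.
        self.classifier = nn.Linear(common_dim, num_classes)

    def forward(self, x):
        common = self.project(x)
        return common, self.classifier(common)

image_net = ModalitySubnetwork(in_dim=4096)  # e.g. CNN image features
text_net = ModalitySubnetwork(in_dim=300)    # e.g. text embedding features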

Databases

Four widely-used cross-modal databases are utilized to evaluate the performance of the proposed method: the Pascal Sentence database [35], the Wikipedia database [36], the PKU XMedia database [37], [38] and the PKU XMediaNet database [39], [40]. The Pascal Sentence database has 20 categories and contains 1000 image-text instance pairs in total. The data partition strategy in this paper follows the setting in [14], [34], which divides the data into three subsets: 800 pairs for training, 100 pairs for

Discussions

The main purpose of this paper is to construct a bidirectional learning network for learning a common representation space. The BLN method integrates cross-modal retrieval and common space learning into the same neural network framework. The superiority of the BLN method is mainly reflected in two aspects: (1) Our work employs different activation functions to construct DL-units and DM-units in multi-layer networks. Therefore, the learned common representation space enhances the non-linear
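A minimal sketch of what two unit types with different activation functions could look like; the ReLU/Tanh pairing and the batch normalization here are assumptions, not the paper's exact DL-unit and DM-unit design.

import torch.nn as nn

def dl_unit(in_dim, out_dim):
    # Dimension-bridging unit with a piecewise-linear activation.
    return nn.Sequential(nn.Linear(in_dim, out_dim), nn.BatchNorm1d(out_dim), nn.ReLU())

def dm_unit(in_dim, out_dim):
    # Mapping unit with a saturating activation for the common space.
    return nn.Sequential(nn.Linear(in_dim, out_dim), nn.BatchNorm1d(out_dim), nn.Tanh())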

Conclusions

Cross-modal sample features cannot be directly compared against each other for cross-modal retrieval due to their different statistical properties. To eliminate the gap between heterogeneous related features, this paper proposes a cross-modal retrieval method based on a bidirectional learning network. First, a multi-layer network is proposed to effectively bridge the heterogeneity gap in both the image and text feature spaces and enhance the nonlinear expression ability of the cross-media data

CRediT authorship contribution statement

Zhuoyi Li: Investigation, Writing - original draft. Huibin Lu: Resources, Conceptualization, Supervision. Hao Fu: Validation, Visualization, Software. Guanghua Gu: Methodology, Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

The authors would like to thank the anonymous reviewers for their valuable comments. This work was partly supported by the National Natural Science Foundation of China (No. 62072394), the Natural Science Foundation of Hebei Province (F2021203019), and the Postgraduate Innovation Fund Project of Hebei Province (CXZZBS2022132, CXZZSS2022117).


References (50)

  • A. Krizhevsky, I. Sutskever, G. Hinton, Imagenet classification with deep convolutional neural networks, Advances in...
  • W. Wang et al., Effective deep learning-based multi-modal retrieval, The VLDB Journal (2016)
  • Z. Li, W. Lu, E. Bao, W. Xing, Learning a semantic space by deep network for cross-media retrieval, in: DMS, Citeseer,...
  • I.J. Goodfellow et al., Generative adversarial networks, Advances in Neural Information Processing Systems (2014)
  • Y. Peng, J. Qi, Y. Yuan, Cm-gans: Cross-modal generative adversarial networks for common representation learning, Acm...
  • H.A. Xia et al., Collaborative generative adversarial network with visual perception and memory reasoning, Neurocomputing (2020)
  • R. Zhou, C. Jiang, Q. Xu, A survey on generative adversarial network-based text-to-image synthesis, Neurocomputing 451...
  • B. Wang et al., Adversarial cross-modal retrieval
  • X. Xu et al., Joint feature synthesis and embedding: Adversarial cross-modal retrieval revisited
  • Y. Peng et al., Reinforced cross-media correlation learning by context-aware bidirectional translation, IEEE Transactions on Circuits and Systems for Video Technology (2019)
  • Z. Ji et al., Saliency-guided attention network for image-sentence matching
  • X. Xu et al., Cross-modal attention with semantic consistence for image-text matching
  • M. Jing et al., Incomplete cross-modal retrieval with dual-aligned variational autoencoders
  • G. Ding et al., Collective matrix factorization hashing for multimodal data
  • D. Wang et al., Learning compact hash codes for multimodal representations using orthogonal deep structure, IEEE Transactions on Multimedia (2015)

Zhuoyi Li was born in TangShan, Hebei Province, China, in 1995. He received the B.Sc. degree from Handan University, Handan, Hebei Province, China in 2017. He is currently pursuing the Ph.D. degree in Electronic Science and Technology with Yanshan University. His current research interests are cross-modal retrieval and image retrieval.

Huibin Lu was born in JiaoHe, Jilin Province, China in 1965. He received the B.Sc. degree from Northeast Heavy Machinery Institute, Qiqihaer, Heilongjiang Province, China, in 1986, the M.Sc. degree in circuits and systems from Yanshan University, Hebei Province, China, in 1994 and the Ph.D. degree in circuits and systems from Yanshan University, Hebei Province, China, in 2004. He is a Professor with the School of Information Science Engineering, Yanshan University, Qinhuangdao, Hebei Province, China. His research interests include signal processing and information transmission.

Hao Fu was born in HaiKou, Hainan Province, China, in 1996. He received the B.Eng. degree from Hefei University of Technology, Hefei, Anhui Province, China, in 2019. He is currently pursuing the M.Sc. degree in Information and Communication Engineering with Yanshan University. His current research interest is cross-modal retrieval.

Guanghua Gu was born in PuYang, Henan Province, China in 1979. He received the B.Sc. degree from Yanshan University, Qinhuangdao, Hebei Province, China, in 2001, the M.Sc. degree from Yanshan University, Qinhuangdao, Hebei Province, China, in 2004, and the Ph.D. degree in signal and information processing from Beijing Jiaotong University, Beijing, China in 2013. He was a visiting scholar at the University of South Carolina, Columbia, South Carolina, USA from 2015 to 2016. He is currently a professor with Yanshan University. His current research interests include image classification, image recognition and image retrieval.
