Image-text bidirectional learning network based cross-modal retrieval
Introduction
Real-world information usually comprises multimedia data such as text, audio and images [1], [2]. With the growth of multimedia data, there is a large number of application requirements for cross-modal retrieval, especially using a text query to search for images and vice versa. Cross-modal retrieval is an emerging data retrieval technology whose purpose is to determine whether data from different modalities refer to the same content [3], [4]. However, due to the distribution gap and the heterogeneity between modalities, it is difficult to directly measure the correlation between cross-modal data. Therefore, matching image and text data is a challenging task.
To address the aforementioned cross-modal retrieval problem, numerous approaches have been proposed in recent years to eliminate the cross-modal gap. A common approach is representation learning, which aims to transform samples from different modalities into a common representation space [5]. Existing representation learning methods fall into two main types: traditional approaches and deep learning approaches. Traditional cross-modal retrieval methods learn the common subspace by modeling the correlation between image and text. Due to the constraints of linear feature embeddings and limited use of label information, these methods cannot achieve satisfactory results on nonlinear features. Deep Neural Networks (DNNs) [6] achieve competitive, state-of-the-art performance in representation learning, so several deep learning-based methods have been proposed in recent years to solve the above problems. Deep learning-based approaches provide scalable nonlinear transformations for feature representations. In particular, DNN-based cross-modal retrieval approaches exploit nonlinear correlations to learn a common subspace. For instance, Andrew et al. [7] proposed Deep Canonical Correlation Analysis (DCCA), which learns complex nonlinear projections through a deep network.
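For context, the objective that DCCA generalizes is classical (linear) CCA. The sketch below is not from the paper; it is a minimal NumPy illustration of computing canonical correlations between two feature views, which DCCA extends by first passing each view through a deep network:

```python
import numpy as np

def canonical_correlations(X, Y, reg=1e-4):
    """Classical linear CCA: canonical correlations between two views.

    DCCA replaces X and Y with the outputs of two deep networks and
    maximizes this same correlation objective end to end.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    # Regularized covariance and cross-covariance matrices
    Sxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = X.T @ Y / (n - 1)

    def inv_sqrt(S):
        # Inverse matrix square root via eigendecomposition
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    # After whitening both views, the singular values of the
    # cross term are the canonical correlations.
    T = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    return np.linalg.svd(T, compute_uv=False)
```

For two views related by an exact linear map, all canonical correlations approach 1; the small `reg` term keeps the covariance inversions numerically stable.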
The pioneering work on convolutional neural networks (CNNs) [8] has had a great influence on computer vision research. Compared with traditional methods, CNNs undeniably obtain better image feature representations in multiple fields, including cross-modal retrieval. Accordingly, Wang et al. [9] proposed a Multimodal Deep Neural Network (MDNN) based on a deep CNN and a Neural Language Model (NLM) to learn the multimodal mapping function. Li et al. [10] learned a deep network for each modality and projected the cross-modal features into a common semantic space. To narrow the heterogeneity gap between image and text, the method used a Deep Convolutional Activation Feature (DeCAF) to extract visual features.
Inspired by Generative Adversarial Networks (GANs) [11], [12], [13], [14], Wang et al. [15] proposed an Adversarial Cross-Modal Retrieval (ACMR) method for cross-modal retrieval tasks. Based on an adversarial learning mechanism, the method obtains an effective shared subspace between different modalities. To preserve more discrimination in the common space, Hu et al. [16] proposed a Multimodal Adversarial Network (MAN), which learns a common representation space with an eigenvalue strategy. Also building on the GAN model, JFSE [17] introduced three advanced distribution alignment schemes with cycle-consistency constraints to preserve semantic compatibility.
In addition, several methods attempt to achieve semantic alignment between modalities to enhance retrieval performance. Qi et al. [18] proposed a Cross-modal Bidirectional Translation (CBT) approach that adopts a translation mechanism and a reinforcement learning strategy to effectively explore image-text correlation. CBT designs two loss functions, one for inter-modality and one for intra-modality, which mutually boost cross-modal correlation learning. Ji et al. [19] proposed a Saliency-guided Attention Network (SAN) to solve the asymmetry problem. It employs visual and textual attention modules to learn the fine-grained correlation of cross-modal data. Xu et al. [20] proposed a hybrid matching approach named Cross-modal Attention with Semantic Consistency (CASC) for image-text matching. To align local semantics, it exploits the global semantic consistency between image regions and sentence words as a complementary signal. To address the challenging issue of incomplete cross-modal retrieval, Jing et al. [21] proposed Dual-Aligned Variational Autoencoders (DAVAE) to simultaneously handle both the heterogeneity problem and the incompleteness problem in a unified deep model. Nevertheless, these methods do not fully utilize predicted cross-modal information in both the image and the text feature spaces. Meanwhile, they learn the shared representation through a shallow network structure, which cannot fully capture the complex cross-media correlation.
In this paper, we present a novel image-text bidirectional learning network (BLN) method for cross-modal retrieval. It aims to eliminate the cross-modal differences between samples with the same semantics while preserving the discrimination between samples with different semantics. Different from existing two-modality methods, BLN optimizes a multi-layer network that enhances the nonlinearity of the feature representation by constructing two types of network units with different activation functions. Simultaneously, the method minimizes a bidirectional crisscross loss over the mapped features. In addition, four task-specific losses are integrated into one objective function to learn the common space. Following this learning strategy, the multimodal features extracted by the different encoders are mapped into a deep common space, which ensures that the proposed cross-modal retrieval framework is discriminative and modality-invariant. The main contributions of this paper are as follows.
- (1)
This paper designs a novel cross-modal retrieval framework based on bidirectional learning to eliminate the semantic gap between multimodal data. The deep framework effectively learns the common representation space while ensuring semantic discrimination and modality invariance.
- (2)
This paper proposes two novel network units with different activation functions to bridge the dimension gap between cross-modal data. By employing these network units, the discrepancy between related heterogeneous data is minimized and the nonlinear capability needed to learn the common representation space is preserved.
- (3)
A bidirectional crisscross loss function is proposed to learn cross-modal similarity from category label information. Different from previous loss functions, the bidirectional crisscross loss not only minimizes the discriminative consistency loss between the mapped feature and the real label but also minimizes the correlation loss with the predicted label of the opposite modality.
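The excerpt does not give the exact formulation of the bidirectional crisscross loss, so the following NumPy sketch is only one plausible reading of the description above: each modality's mapped feature is scored against the real label (discriminative consistency) and against the soft predicted label of the opposite modality (crisscross correlation). All function names and the specific combination here are our own assumptions, not the paper's definition:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(p, q, eps=1e-12):
    # Mean cross-entropy of target distribution p against prediction q
    return -np.mean(np.sum(p * np.log(q + eps), axis=1))

def bidirectional_crisscross_loss(img_logits, txt_logits, y_onehot):
    """Hypothetical sketch: both modalities are pulled toward the true
    label and toward each other's predicted (soft) label."""
    p_img = softmax(img_logits)
    p_txt = softmax(txt_logits)
    # Discriminative consistency: each modality vs. the real label
    l_disc = cross_entropy(y_onehot, p_img) + cross_entropy(y_onehot, p_txt)
    # Crisscross correlation: each modality vs. the predicted label
    # of the opposite modality, treated as a soft target
    l_cross = cross_entropy(p_txt, p_img) + cross_entropy(p_img, p_txt)
    return l_disc + l_cross
```

Under this reading, confident predictions that agree with both the true labels and the opposite modality drive the loss toward zero, which matches the paper's stated goal of modality invariance plus semantic discrimination.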
In this paper, extensive experiments are conducted to separately evaluate the independent performance of each component and the overall performance of the framework on four widely-used benchmark databases. The experimental results achieve state-of-the-art performance on these databases, which demonstrates the effectiveness of the bidirectional learning network based cross-modal retrieval method.
Related work
Cross-modal retrieval methods are commonly divided into two categories: hash-based approaches [22], [23], [24], [25], [26] and real-value approaches [27], [28], [29]. Hash-based approaches attempt to learn hash functions for the different modalities to optimize computational efficiency. Specifically, they map cross-modal data into a common Hamming space in which similarity is measured, so that related cross-modal data lie as close as possible in that space.
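As a toy illustration of the Hamming-space idea (not the actual hash functions learned in [22], [23], [24], [25], [26]), the snippet below binarizes features with random sign projections, one projection matrix per modality, and measures similarity by Hamming distance:

```python
import numpy as np

def binary_code(x, W):
    """Toy 'hash function': sign of random linear projections.

    In real cross-modal hashing, W would be learned per modality so
    that related image/text pairs receive similar codes.
    """
    return (x @ W > 0).astype(np.uint8)

def hamming_distance(a, b):
    """Distance between two binary hash codes (arrays of 0/1)."""
    return int(np.count_nonzero(a != b))
```

Because the codes are binary, distance computation reduces to counting differing bits, which is what makes retrieval in a Hamming space fast.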
Bidirectional Learning Network
Due to the gap between cross-modal data, the similarity between the original representations of image and text cannot be computed directly. To measure the similarity between image and text, this paper constructs a bidirectional learning network to learn modality-specific features for each modality. The proposed cross-modal retrieval method based on BLN is shown in Fig. 1. The overall framework consists of a subnetwork for the image modality and a subnetwork for the text modality. In Fig. 1, the embedded module is the
Databases
Four widely-used cross-modal databases are utilized to evaluate the performance of the proposed method, including Pascal Sentence database [35], Wikipedia database [36], PKU XMedia database [37], [38] and PKU XMediaNet database [39], [40]. Pascal Sentence database has 20 categories, which contains 1000 image-text instance pairs in total. The data partition strategy in this paper follows the setting in [14], [34], which is divided into three subsets: 800 pairs for training, 100 pairs for
Discussions
The main purpose of this paper is to construct a bidirectional learning network for learning a common representation space. BLN method integrates cross-modal retrieval and common space learning into the same neural network framework. The superiority of the BLN method is mainly reflected in two aspects: (1) Our work employs different activation functions to construct DL-units and DM-units in multi-layer networks. Therefore, the learned common representation space enhances the non-linear
Conclusions
The cross-modal sample features cannot be directly compared against each other for cross-modal retrieval due to the different statistical properties. To eliminate the gap between heterogeneous related features, this paper proposes a cross-modal retrieval method based on a bidirectional learning network. First, a multi-layer network is proposed to effectively bridge the heterogeneity gap in both feature space of image and text and enhance the nonlinear expression ability of the cross-media data
CRediT authorship contribution statement
Zhuoyi Li: Investigation, Writing - original draft. Huibin Lu: Resources, Conceptualization, Supervision. Hao Fu: Validation, Visualization, Software. Guanghua Gu: Methodology, Supervision.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
The authors would like to thank the anonymous reviewers for valuable comments. This work was partly supported by National Natural Science Foundation of China (No.62072394), Natural Science Foundation of Hebei province (F2021203019), Postgraduate Innovation Fund Project of Hebei Province (CXZZBS2022132, CXZZSS2022117).
References (50)
- Semantic consistency hashing for cross-modal retrieval, Neurocomputing (2016)
- Multimodal adversarial network for cross-modal retrieval, Knowledge-Based Systems (2019)
- CMIR-NET: A deep learning based model for cross-modal retrieval in remote sensing, Pattern Recognition Letters (2020)
- Polygonal coordinate system: Visualizing high-dimensional data using geometric DR, and a deterministic version of t-SNE, Expert Systems with Applications (2021)
- C. Wang, H. Yang, C. Meinel, Deep semantic mapping for cross-modal retrieval, in: 2015 IEEE 27th International...
- Learning consistent feature representation for cross-modal multimedia retrieval, IEEE Transactions on Multimedia (2015)
- Learning the relative importance of objects from tagged images for retrieval and cross-modal search, International Journal of Computer Vision (2012)
- J. Gao, W. Zhang, F. Zhong, Z. Chen, UCMH: Unpaired cross-modal hashing with matrix factorization, Neurocomputing 418...
- Structured autoencoders for subspace clustering, IEEE Transactions on Image Processing (2018)
- G. Andrew, R. Arora, J. Bilmes, K. Livescu, Deep canonical correlation analysis, in: International conference on...
- Effective deep learning-based multi-modal retrieval, The VLDB Journal
- Generative adversarial networks, Advances in Neural Information Processing Systems
- Collaborative generative adversarial network with visual perception and memory reasoning, Neurocomputing
- Adversarial cross-modal retrieval
- Joint feature synthesis and embedding: Adversarial cross-modal retrieval revisited
- Reinforced cross-media correlation learning by context-aware bidirectional translation, IEEE Transactions on Circuits and Systems for Video Technology
- Saliency-guided attention network for image-sentence matching
- Cross-modal attention with semantic consistence for image-text matching
- Incomplete cross-modal retrieval with dual-aligned variational autoencoders
- Collective matrix factorization hashing for multimodal data
- Learning compact hash codes for multimodal representations using orthogonal deep structure, IEEE Transactions on Multimedia
Zhuoyi Li was born in TangShan, Hebei Province, China, in 1995. He received the B.Sc. degree from Handan University, Handan, Hebei Province, China in 2017. He is currently pursuing the Ph.D. degree in Electronic Science and Technology with Yanshan University. His current research interest is Cross-Modal Retrieval and Image Retrieval.
Huibin Lu was born in JiaoHe, Jilin Province, China in 1965. He received the B.Sc. degree from Northeast Heavy Machinery Institute, Qiqihaer, Heilongjiang Province, China, in 1986, the M.Sc. degree in circuits and systems from Yanshan University, Hebei Province, China, in 1994 and the Ph.D. degree in circuits and systems from Yanshan University, Hebei Province, China, in 2004. He is a Professor with the School of Information Science and Engineering, Yanshan University, Qinhuangdao, Hebei Province, China. His research interests include signal processing and information transmission.
Hao Fu was born in HaiKou, Hainan Province, China, in 1996. He received the B.Eng. degree from Hefei University of Technology, Hefei, Anhui Province, China, in 2019. He is currently pursuing the M.Sc. degree in Information and Communication Engineering with Yanshan University. His current research interest is Cross Modal Retrieval.
Guanghua Gu was born in PuYang, Henan Province, China in 1979. He received the B.Sc. degree from Yanshan University, Qinhuangdao, Hebei Province, China, in 2001, the M.Sc. degree from Yanshan University, Qinhuangdao, Hebei Province, China, in 2004, and the Ph.D. degree in signal and information processing from Beijing Jiaotong University, Beijing, China in 2013. He was a visiting scholar of University of South Carolina, Columbia, South Carolina, USA from 2015 to 2016. He is currently a professor with Yanshan University. His current research interests include image classification, image recognition and image retrieval.