Deep semantic hashing with dual attention for cross-modal retrieval

  • Original Article
  • Neural Computing and Applications

Abstract

With the explosive growth of multimodal data, cross-modal retrieval has drawn increasing research interest. Hashing-based methods have made great advances in cross-modal retrieval owing to their low storage cost and fast query speed. However, improving retrieval accuracy remains a crucial challenge because of the heterogeneity gap between modalities. To tackle this problem, in this paper we propose a new two-stage cross-modal retrieval method, called Deep Semantic Hashing with Dual Attention (DSHDA). In the first stage of DSHDA, a Semantic Label Network (SeLabNet) is designed to extract label semantic features and hash codes by training on the multi-label annotations, which places the learning of different modalities in a common semantic space and effectively bridges the modality gap. In the second stage of DSHDA, we propose a deep neural network that integrates feature learning and hash code learning for each modality into the same framework; its training is guided by the label semantic features and hash codes generated by SeLabNet to maximize cross-modal semantic relevance. Moreover, dual attention mechanisms are used in our neural networks: (1) Lo-attention extracts the local key information of each modality and improves the quality of modality features; (2) Co-attention strengthens the relationship between different modalities to produce more consistent and accurate hash codes. Extensive experiments on two real-world image-text datasets demonstrate the superiority of the proposed method in cross-modal retrieval tasks.

References

  1. Deng C, Chen Z, Liu X, Gao X, Tao D (2018) Triplet-based deep hashing network for cross-modal retrieval. IEEE Trans Image Process 27(8):3893

  2. Cao Y, Long M, Wang J, Liu S (2017) Collective deep quantization for efficient cross-modal retrieval. In: 31st AAAI conference on artificial intelligence, pp 3974–3980

  3. Wang B, Yang Y, Xu X, Hanjalic A, Shen H (2017) Adversarial cross-modal retrieval. In: Proceedings of the 2017 ACM multimedia conference, pp 154–162

  4. Wu Y, Wang S, Huang Q (2017) Online asymmetric similarity learning for cross-modal retrieval. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4269–4278

  5. Kang C, Xiang S, Liao S, Xu C, Pan C (2015) Learning consistent feature representation for cross-modal multimedia retrieval. IEEE Trans Multimedia 17(3):370

  6. Zhang Y, Lu H (2018) Deep cross-modal projection learning for image-text matching. In: Proceedings of the European Conference on Computer Vision, pp 686–701

  7. Gu J, Cai J, Joty SR, Niu L, Wang G (2018) Look, imagine and match: improving textual-visual cross-modal retrieval with generative models. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7181–7189

  8. Hu D, Nie F, Li X (2018) Deep binary reconstruction for cross-modal hashing. IEEE Trans Multimedia 21(4):973

  9. Zhu X, Li X, Zhang S, Xu Z, Yu L, Wang C (2017) Graph pca hashing for similarity search. IEEE Trans Multimedia 19(9):2033

  10. Shi Y, You X, Zheng F, Wang S, Peng Q (2019) Equally-guided discriminative hashing for cross-modal retrieval. In: Twenty-eighth international joint conference on artificial intelligence, pp 4767–4773

  11. Zhang J, Peng Y (2020) Multi-pathway generative adversarial hashing for unsupervised cross-modal retrieval. IEEE Trans Multimedia 22(1):174

  12. Wang D, Cui P, Ou M, Zhu W (2015) Learning compact hash codes for multimodal representations using orthogonal deep structure. IEEE Trans Multimedia 17(9):1404

  13. Liu W, Mu C, Kumar S, Chang S (2014) Discrete graph hashing. In: Advances in Neural Information Processing Systems, pp 3419–3427

  14. Ding K, Fan B, Huo C, Xiang S, Pan C (2017) Cross-modal hashing via rank-order preserving. IEEE Trans Multimedia 19(3):571

  15. Zhen L, Hu P, Wang X, Peng D (2019) Deep supervised cross-modal retrieval. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 10386–10395

  16. Cao Y, Long M, Wang J, Zhu H (2016) Correlation autoencoder hashing for supervised cross-modal search. In: Proceedings of the 2016 ACM on international conference on multimedia retrieval, pp 197–204

  17. Lin Z, Ding G, Hu M, Wang J (2015) Semantics-preserving hashing for cross-view retrieval. In: 2015 IEEE conference on computer vision and pattern recognition, pp 3864–3872

  18. Mandal D, Chaudhury K, Biswas S (2017) Generalized semantic preserving hashing for n-label cross-modal retrieval. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2633–2641

  19. Jiang Q, Li W (2017) Deep cross-modal hashing. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3232–3240

  20. Yang E, Deng C, Liu W, Liu X, Tao D, Gao X (2017) Pairwise relationship guided deep hashing for cross-modal retrieval. In: Thirty-First AAAI conference on artificial intelligence, pp 1618–1625

  21. Weng W, Wu J, Yang L, Liu L, Hu B (2019) Label-based deep semantic hashing for cross-modal retrieval. In: Neural Information Processing, pp 24–36

  22. Li C, Deng C, Li N, Liu W, Gao X, Tao D (2018) Self-supervised adversarial hashing networks for cross-modal retrieval. In: 2018 IEEE conference on computer vision and pattern recognition, pp 4242–4251

  23. Zhang X, Lai H, Feng J (2018) Attention-aware deep adversarial hashing for cross-modal retrieval. In: Proceedings of the European conference on computer vision, pp 591–606

  24. Zhou J, Ding G, Guo Y (2014) Latent semantic sparse hashing for cross-modal similarity search. In: Proceedings of the 37th international ACM SIGIR conference on research & development in information retrieval, pp 415–424

  25. Song J, Yang Y, Yang Y, Huang Z, Shen HT (2013) Inter-media hashing for large-scale retrieval from heterogeneous data sources. In: Proceedings of the 2013 ACM SIGMOD international conference on management of data, pp 785–796

  26. Ding G, Guo Y, Zhou J (2014) Collective matrix factorization hashing for multimodal data. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2075–2082

  27. Zhang D, Li W (2014) Large-scale supervised multimodal hashing with semantic correlation maximization. In: Twenty-Eighth AAAI Conference on Artificial Intelligence, pp 2177–2183

  28. Lin Z, Ding G, Hu M, Wang J (2015) Semantics-preserving hashing for cross-view retrieval. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3864–3872

  29. Xu X, Shen F, Yang Y, Shen HT, Li X (2017) Learning discriminative binary codes for large-scale cross-modal retrieval. IEEE Trans Image Process 26(5):2494

  30. Zhang J, Peng Y, Yuan M (2018) Unsupervised generative adversarial cross-modal hashing. In: 32nd AAAI conference on artificial intelligence, pp 539–546

  31. Yang M, Zhao W, Xu W, Feng Y, Zhao Z, Chen X, Lei K (2019) Multitask learning for cross-domain image captioning. IEEE Trans Multimedia 21(4):1047

  32. Wu Q, Shen C, Wang P, Dick A, Hengel A (2018) Image captioning and visual question answering based on attributes and external knowledge. IEEE Trans Pattern Anal Mach Intell 40(6):1367

  33. Chen K, Zhao T, Yang M, Liu L, Tamura A, Wang R, Utiyama M, Sumita E (2018) A neural approach to source dependence based context model for statistical machine translation. IEEE/ACM Trans Audio Speech Language Process 26(2):266

  34. Yang Z, He X, Gao J, Deng L, Smola A (2016) Stacked attention networks for image question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 21–29

  35. Peng Y, Qi J, Yuan Y (2018) Modality-specific cross-modal similarity measurement with recurrent attention network. IEEE Trans Image Process 27(11):5585

  36. Chatfield K, Simonyan K, Vedaldi A, Zisserman A (2014) Return of the devil in the details: delving deep into convolutional nets. arXiv:1405.3531

  37. He K, Sun J (2015) Convolutional neural networks at constrained time cost. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5353–5360

  38. Huiskes M, Lew M (2008) The MIR flickr retrieval evaluation. In: Proceedings of the 1st ACM international conference on multimedia information retrieval, pp 39–43

  39. Chua T, Tang J, Hong R, Li H, Luo Z, Zheng Y (2009) NUS-WIDE: a real-world web image database from National University of Singapore. In: Proceedings of the ACM international conference on image and video retrieval, p 48

  40. Liu W, Mu C, Kumar S, Chang SF (2014) Discrete graph hashing. In: Proceedings of the 27th international conference on neural information processing systems, pp 3419–3427

  41. Kumar S, Udupa R (2011) Learning hash functions for cross-view similarity search. In: Twenty-second international joint conference on artificial intelligence, pp 1360–1365

  42. Wang D, Gao X, Wang X, He L (2015) Semantic topic multimodal hashing for cross-media retrieval. In: Twenty-fourth international joint conference on artificial intelligence, pp 3890–3896

Acknowledgements

This research is supported by the National Natural Science Foundation of China under Grants Nos. 61872191 and 41571389.

Author information

Corresponding author

Correspondence to Jiagao Wu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Optimization algorithm

It is intractable to optimize Eq. (5) directly since it is non-convex in the variables \(\omega ^p\), \(\omega ^t\) and \(\mathbf{B }\). However, it is convex in any one variable when the other two are fixed. Therefore, we use an alternating learning strategy that fixes two of the parameters and updates the remaining one at each iteration until convergence. The whole alternating learning procedure is shown in Algorithm 1; a minimal code sketch of the outer loop is given below, and the detailed optimization steps are listed afterwards.
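
The sketch only shows the loop structure around Steps 1–3; the three update callables and the objective are placeholders to be filled in with the steps described below, and are not part of the authors' implementation.

```python
def alternating_optimization(update_wp, update_wt, update_B, objective,
                             max_iters=500, tol=1e-4):
    """Outer loop of the alternating strategy: each step fixes two of
    (w^p, w^t, B) and updates the remaining one, until the objective of
    Eq. (5) stops decreasing."""
    prev = float("inf")
    for _ in range(max_iters):
        update_wp()          # Step 1: SGD/BP update of the image network w^p
        update_wt()          # Step 2: SGD/BP update of the text network w^t
        update_B()           # Step 3: closed-form update of B via Eq. (18)
        cur = objective()    # value of Eq. (5) after this round
        if abs(prev - cur) < tol:
            break
        prev = cur
```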

Step 1: Optimize \(\omega ^p\) with \(\omega ^t\) and \(\mathbf{B }\) fixed. We use stochastic gradient descent (SGD) with back-propagation (BP) to optimize the CNN parameters \(\omega ^p\) of the image modality. For each sampled point \(\mathbf{p }_i\), the gradient with respect to \(\hat{\mathbf{F }}_{*i}^p\) is computed as follows:

$$\begin{aligned} \frac{\partial {J}}{\partial {\hat{\mathbf{F }}_{*i}^p}}=\frac{1}{512}{\sum _{j=1}^n{\left( {\left( \frac{1}{2\cdot 512}(\hat{\mathbf{F }}^p_{*i})^\top (\hat{\mathbf{F }}^l_{*j})+\frac{1}{2}\right) -{\hat{S}}_{ij}}\right) \cdot {\hat{\mathbf{F }}^l_{*j}}}} \end{aligned}$$
(12)

The gradient with respect to \(\hat{\mathbf{H }}_{*i}^p\) is computed as follows:

$$\begin{aligned} \begin{aligned} \frac{\partial {J}}{\partial {\hat{\mathbf{H }}_{*i}^p}}&=\frac{\gamma }{c}{\sum _{j=1}^n{\left( {\left( \frac{1}{2c}(\hat{\mathbf{H }}^p_{*i})^\top (\hat{\mathbf{H }}^l_{*j})+\frac{1}{2}\right) -{\hat{S}}_{ij}}\right) \cdot {\hat{\mathbf{H }}^l_{*j}}}} \\&\quad +2\mu {(\hat{\mathbf{H }}_{*i}^p-\mathbf{B }_{*i})}+2\tau \hat{\mathbf{H }}^p\cdot \mathbf{1 } \end{aligned} \end{aligned}$$
(13)

Then \(\frac{\partial {J}}{\partial {\omega ^p}}\) can be computed from \(\frac{\partial {J}}{\partial {\hat{\mathbf{F }}_{*i}^p}}\) and \(\frac{\partial {J}}{\partial {\hat{\mathbf{H }}_{*i}^p}}\) by the chain rule, and BP is then used to update the parameter \(\omega ^p\).
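
For concreteness, a minimal NumPy sketch of Eqs. (12) and (13) is given below. It assumes the column-wise layout used in the text (semantic features are \(512 \times n\) matrices, hash outputs and codes are \(c \times n\), and \(\hat{S}\) is the \(n \times n\) similarity matrix); the function names are ours, not the authors' implementation.

```python
import numpy as np

def grad_F_p(F_p, F_l, S_hat, i):
    """Eq. (12): gradient of J w.r.t. the i-th column of the image semantic
    features F^p. F_p, F_l: (512, n) matrices; S_hat: (n, n) similarity."""
    theta = F_p[:, i] @ F_l / (2 * 512) + 0.5        # shape (n,)
    return F_l @ (theta - S_hat[i]) / 512            # shape (512,)

def grad_H_p(H_p, H_l, S_hat, B, i, gamma, mu, tau):
    """Eq. (13): gradient of J w.r.t. the i-th column of the image hash
    output H^p. H_p, H_l, B: (c, n) matrices, with c the code length."""
    c = H_p.shape[0]
    theta = H_p[:, i] @ H_l / (2 * c) + 0.5          # shape (n,)
    pairwise = (gamma / c) * (H_l @ (theta - S_hat[i]))
    quantization = 2 * mu * (H_p[:, i] - B[:, i])
    balance = 2 * tau * H_p.sum(axis=1)              # the H^p * 1 bit-balance term
    return pairwise + quantization + balance
```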

Step 2: Optimize \(\omega ^t\) with \(\omega ^p\) and \(\mathbf{B }\) fixed. We use SGD with BP to optimize the deep neural network parameters \(\omega ^t\) of the text modality. For each sampled point \(\mathbf{t }_i\), the gradient with respect to \(\hat{\mathbf{F }}_{*i}^t\) is computed as follows:

$$\begin{aligned} \frac{\partial {J}}{\partial {\hat{\mathbf{F }}_{*i}^t}}=\frac{1}{512}{\sum _{j=1}^n{\left( {\left( \frac{1}{2\cdot 512}(\hat{\mathbf{F }}^t_{*i})^\top (\hat{\mathbf{F }}^l_{*j})+\frac{1}{2}\right) -{\hat{S}}_{ij}}\right) \cdot {\hat{\mathbf{F }}^l_{*j}}}} \end{aligned}$$
(14)

The gradient with respect to \(\hat{\mathbf{H }}_{*i}^t\) is computed as follows:

$$\begin{aligned} \begin{aligned} \frac{\partial {J}}{\partial {\hat{\mathbf{H }}_{*i}^t}}&=\frac{\gamma }{c}{\sum _{j=1}^n{\left( {\left( \frac{1}{2c}(\hat{\mathbf{H }}^t_{*i})^\top (\hat{\mathbf{H }}^l_{*j})+\frac{1}{2}\right) -{\hat{S}}_{ij}}\right) \cdot {\hat{\mathbf{H }}^l_{*j}}}} \\&\quad +2\mu {(\hat{\mathbf{H }}_{*i}^t-\mathbf{B }_{*i})}+2\tau \hat{\mathbf{H }}^t\cdot \mathbf{1 } \end{aligned} \end{aligned}$$
(15)

Then \(\frac{\partial {J}}{\partial {\omega ^t}}\) can be computed from \(\frac{\partial {J}}{\partial {\hat{\mathbf{F }}_{*i}^t}}\) and \(\frac{\partial {J}}{\partial {\hat{\mathbf{H }}_{*i}^t}}\) by the chain rule, and BP is then used to update the parameter \(\omega ^t\). These gradients mirror Eqs. (12) and (13), so the sketch after Step 1 applies with the text-modality matrices substituted.

Step 3: Optimize \(\mathbf{B }\) with \(\omega ^p\) and \(\omega ^t\) fixed. The objective function shown in Eq. (5) can be reformulated as follows:

$$\begin{aligned} \begin{aligned} \min \limits _{\mathbf{B }}J&=\mu {({\Vert \mathbf{B }-\hat{\mathbf{H }}^p\Vert }_F^2+{\Vert \mathbf{B }-\hat{\mathbf{H }}^t\Vert }_F^2)}\\ s.t. \quad&\mathbf{B }\in {\{-1,1\}}^{c \times n} \end{aligned} \end{aligned}$$
(16)

Since \(\Vert \mathbf{B }\Vert _F^2=cn\) is constant and \(\Vert \hat{\mathbf{H }}^p\Vert _F^2\), \(\Vert \hat{\mathbf{H }}^t\Vert _F^2\) do not depend on \(\mathbf{B }\), this is equivalent to the following maximization:

$$\begin{aligned} \begin{aligned} \max \limits _{\mathbf{B }}J&=Tr{(\mathbf{B }^\top (\mu (\hat{\mathbf{H }}^p+\hat{\mathbf{H }}^t)))} \\ s.t. \quad&\mathbf{B }\in {\{-1,1\}}^{c \times n} \end{aligned} \end{aligned}$$
(17)

To maximize this trace, each entry of \(\mathbf{B }\) must take the same sign as the corresponding entry of \(\mu (\hat{\mathbf{H }}^p+\hat{\mathbf{H }}^t)\). Therefore, the following closed-form solution is obtained:

$$\begin{aligned} \mathbf{B }=sign(\mu (\hat{\mathbf{H }}^p+\hat{\mathbf{H }}^t)) \end{aligned}$$
(18)
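
Step 3 thus admits the closed form of Eq. (18); a short NumPy sketch follows, where mapping the zero case to +1 is our convention, since Eq. (18) leaves it unspecified.

```python
import numpy as np

def update_binary_codes(H_p, H_t, mu):
    """Eq. (18): B = sign(mu * (H^p + H^t)), with ties at zero mapped to +1
    so that every entry of B stays in {-1, +1}."""
    return np.where(mu * (H_p + H_t) >= 0, 1, -1)
```
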
Algorithm 1 The alternating learning algorithm of DSHDA

Time complexity of DSHDA

As shown in Fig. 1, DSHDA consists of two stages, each of which is further composed of several modules. We analyze the time complexity of these modules one by one as follows.

The first stage of DSHDA is the SeLabNet, which is a network with multiple fully connected layers. The time complexity of such a network can be expressed as follows:

$$\begin{aligned} O\left( \sum _{l=1}^{d_F} {w_{l-1}\cdot w_l}\right) \end{aligned}$$
(19)

where \(w_l\) is the number of neurons in the l-th layer and \(d_F\) is the total number of fully connected layers.

Let \(w_0=k\) be the dimension of the input multi-label annotation. Then, according to Eq. (19) and Table 1, we can obtain the time complexity of the semantic feature generation module \(G^l\) and the semantic hash code generation module \(D^l\) as in Eqs. (20) and (21), respectively.

$$\begin{aligned}&O(k \cdot k + 2048 \cdot k + 2048 \cdot 512 ) \nonumber \\&\quad \sim O(k^2+2048 \cdot k + 1.05 \times 10^6) \end{aligned}$$
(20)
$$\begin{aligned}&O(512 \cdot c ) \end{aligned}$$
(21)
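
As a quick sanity check of Eqs. (19)–(21), the sketch below counts the multiply-accumulate operations of a fully connected stack. The layer widths [k, k, 2048, 512] for \(G^l\) and [512, c] for \(D^l\) are read off Eqs. (20) and (21) rather than Table 1, and the example values of k and c are arbitrary.

```python
def fc_complexity(widths):
    """Eq. (19): multiply-accumulate count of a fully connected stack,
    where widths[l] is the number of neurons in layer l (widths[0] = input dim)."""
    return sum(widths[l - 1] * widths[l] for l in range(1, len(widths)))

k, c = 100, 64                               # illustrative label dimension and code length
G_l_cost = fc_complexity([k, k, 2048, 512])  # Eq. (20): k*k + 2048*k + 2048*512
D_l_cost = fc_complexity([512, c])           # Eq. (21): 512*c
print(G_l_cost, D_l_cost)                    # 1263376 32768
```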

The second stage of DSHDA comprises the image and text networks, each of which consists of five modules: feature learning (\(E^p\) and \(E^t\)), Lo-attention (\(A^p\) and \(A^t\)), semantic feature generation (\(G^p\) and \(G^t\)), B-structure (\(V^p\) and \(V^t\)) and semantic hash code generation (\(D^p\) and \(D^t\)).

For the image network, the image feature learning module \(E^p\) is a five-layer convolutional network. The time complexity of a network with multiple convolutional layers is given by [37]:

$$\begin{aligned} O\left( \sum _{l=1}^{d_L} {n_{l-1} \cdot s_l^2 \cdot n_l \cdot m_l^2}\right), \end{aligned}$$
(22)

where \(d_L\) is the number of convolutional layers, \(s_l\) is the spatial size (length) of the filter in the l-th layer, \(n_l\) is the number of filters in the l-th layer, \(m_l\) is the spatial size of the output feature map in the l-th layer, and \(n_{l-1}\) is also known as the number of input channels of the l-th layer.

Let \(t_l\) be the stride of the filter in the l-th layer; then

$$\begin{aligned} m_l \approx \frac{m_{l-1}}{t_l} \end{aligned}$$
(23)

Besides, the length of the original image feature \(d_p\) can be expressed as:

$$\begin{aligned} d_p = m_0^2 \cdot n_0, \end{aligned}$$
(24)

where \(m_0\) and \(n_0\) can be considered as the spatial size and the number of channels of the input feature map for the first convolutional layer, respectively.

Then, combining Eqs. (23) and (24), Eq. (22) can be rewritten as:

$$\begin{aligned} O\left( \sum _{l=1}^{d_L} \frac{n_{l-1} \cdot s_l^2 \cdot n_l}{n_0 \cdot \prod _{j=1}^{l}{t_j^2}} \cdot d_p\right) \end{aligned}$$
(25)
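
To make Eqs. (22)–(25) concrete, the following sketch evaluates both forms on an illustrative two-layer configuration (3-channel 64×64 input, two stride-2 3×3 convolutions); these layer settings are placeholders, not the ones of Table 2.

```python
def conv_complexity(n, s, m):
    """Eq. (22): sum_l n_{l-1} * s_l^2 * n_l * m_l^2 over the conv layers.
    n[0] is the input channel count; n[l], s[l], m[l] describe layer l."""
    return sum(n[l - 1] * s[l] ** 2 * n[l] * m[l] ** 2 for l in range(1, len(n)))

# Illustrative configuration (not Table 2): 3-channel 64x64 input,
# conv1: 32 filters, 3x3, stride 2; conv2: 64 filters, 3x3, stride 2.
n, s, t = [3, 32, 64], [None, 3, 3], [None, 2, 2]
m = [64, 32, 16]                                  # m_l ~ m_{l-1} / t_l  (Eq. (23))
d_p = m[0] ** 2 * n[0]                            # Eq. (24): d_p = m_0^2 * n_0

direct = conv_complexity(n, s, m)

# Eq. (25): the same count expressed through d_p and the cumulative strides
via_dp, prod = 0.0, 1
for l in range(1, len(n)):
    prod *= t[l] ** 2
    via_dp += n[l - 1] * s[l] ** 2 * n[l] * d_p / (n[0] * prod)

assert direct == via_dp                           # both forms give 5_603_328 MACs
```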

Now, let \(n_0=3\) represent the three color channels of the input image. According to Eq. (25) and Table 2, we can obtain the time complexity of module \(E^p\) as follows:

$$\begin{aligned} \begin{aligned}&O\left( \frac{11\cdot 11\cdot 64}{4^2}\cdot d_p +\frac{64\cdot 5\cdot 5\cdot 265}{3\cdot 4^2}\cdot d_p \right. \\&\left. \quad +3\cdot \frac{64\cdot 3\cdot 3\cdot 265}{3\cdot 4^2}\cdot d_p \right) \\&\quad \sim O(4.88 \times 10^4 \cdot d_p) \end{aligned} \end{aligned}$$
(26)

The image feature maps output by module \(E^p\) have dimension \(d_C \cdot d_H \cdot d_W\), with \(d_C=265\) and \(d_H \cdot d_W=m_5^2=d_p/48\). These feature maps are the input of the Lo-attention module \(A^p\), which contains a convolutional layer with a \(1 \times 1\) kernel size. Then, the time complexity of module \(A^p\) is:

$$\begin{aligned} O(33 \cdot d_p ) \end{aligned}$$
(27)
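
The text specifies only that \(A^p\) is built around a \(1\times 1\) convolution over the \(E^p\) feature maps; the PyTorch sketch below is one plausible realization of such a Lo-attention block (a 1×1 convolution scoring each spatial location, followed by a softmax reweighting). The single-channel score map and the softmax normalization are our assumptions, not details stated in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoAttention(nn.Module):
    """Sketch of a local attention block built around a 1x1 convolution."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)  # one score per location

    def forward(self, feats):                     # feats: (batch, C, H, W)
        b, c, h, w = feats.shape
        attn = self.score(feats).view(b, 1, h * w)
        attn = F.softmax(attn, dim=-1).view(b, 1, h, w)  # weights over the H*W locations
        return feats * attn                       # reweight the local features

# Example with d_C = 265 channels, as stated for the E^p output maps
maps = torch.randn(2, 265, 6, 6)
weighted = LoAttention(265)(maps)
```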

With the same approach, we can obtain the time complexity of the other modules in the image and text networks of DSHDA. Due to space limitations, we omit the detailed analysis and only list the time complexity of the DSHDA modules in Table 11.

Table 11 Time complexity of DSHDA modules

Thus, the total time complexity of DSHDA is the sum over all modules in Table 11. Noting that the values of k and c are \(O(10^2)\) and \(d_p\) and \(d_t\) are \(O(10^3)\) in practice, the time complexity of DSHDA is:

$$\begin{aligned} O(\epsilon \times 10^8), \end{aligned}$$
(28)

where \(\epsilon >1\) is a scale factor of the time complexity.

About this article

Cite this article

Wu, J., Weng, W., Fu, J. et al. Deep semantic hashing with dual attention for cross-modal retrieval. Neural Comput & Applic 34, 5397–5416 (2022). https://doi.org/10.1007/s00521-021-06696-y
