Deep semantic hashing with dual attention for cross-modal retrieval

  • Original Article
  • Neural Computing and Applications

Abstract

With the explosive growth of multimodal data, cross-modal retrieval has drawn increasing research interest. Hashing-based methods have made great advances in cross-modal retrieval owing to their low storage cost and fast query speed. However, improving retrieval accuracy remains a crucial challenge because of the heterogeneity gap between modalities. To tackle this problem, in this paper we propose a new two-stage cross-modal retrieval method, called Deep Semantic Hashing with Dual Attention (DSHDA). In the first stage of DSHDA, a Semantic Label Network (SeLabNet) is designed to extract label semantic features and hash codes by training on the multi-label annotations, which places the learning of different modalities in a common semantic space and effectively bridges the modality gap. In the second stage of DSHDA, we propose a deep neural network that integrates feature learning and hash code learning for each modality into the same framework; its training is guided by the label semantic features and hash codes generated by SeLabNet to maximize cross-modal semantic relevance. Moreover, dual attention mechanisms are used in our neural networks: (1) Lo-attention extracts the local key information of each modality and improves the quality of modality features; (2) Co-attention strengthens the relationship between different modalities to produce more consistent and accurate hash codes. Extensive experiments on two real-world image-text datasets demonstrate the superiority of the proposed method in cross-modal retrieval tasks.

References

  1. Deng C, Chen Z, Liu X, Gao X, Tao D (2018) Triplet-based deep hashing network for cross-modal retrieval. IEEE Trans Image Process 27(8):3893

  2. Cao Y, Long M, Wang J, Liu S (2017) Collective deep quantization for efficient cross-modal retrieval. In: 31st AAAI conference on artificial intelligence, pp 3974–3980

  3. Wang B, Yang Y, Xu X, Hanjalic A, Shen H (2017) Adversarial cross-modal retrieval. In: Proceedings of the 2017 ACM multimedia conference, pp 154–162

  4. Wu Y, Wang S, Huang Q (2017) Online asymmetric similarity learning for cross-modal retrieval. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4269–4278

  5. Kang C, Xiang S, Liao S, Xu C, Pan C (2015) Learning consistent feature representation for cross-modal multimedia retrieval. IEEE Trans Multimedia 17(3):370

  6. Zhang Y, Lu H (2018) Deep cross-modal projection learning for image-text matching. In: Proceedings of the European Conference on Computer Vision, pp 686–701

  7. Gu J, Cai J, Joty SR, Niu L, Wang G (2018) Look, imagine and match: improving textual-visual cross-modal retrieval with generative models. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7181–7189

  8. Hu D, Nie F, Li X (2018) Deep binary reconstruction for cross-modal hashing. IEEE Trans Multimedia 21(4):973

  9. Zhu X, Li X, Zhang S, Xu Z, Yu L, Wang C (2017) Graph pca hashing for similarity search. IEEE Trans Multimedia 19(9):2033

  10. Shi Y, You X, Zheng F, Wang S, Peng Q (2019) Equally-guided discriminative hashing for cross-modal retrieval. In: Twenty-eighth international joint conference on artificial intelligence, pp 4767–4773

  11. Zhang J, Peng Y (2020) Multi-pathway generative adversarial hashing for unsupervised cross-modal retrieval. IEEE Trans Multimedia 22(1):174

  12. Wang D, Cui P, Ou M, Zhu W (2015) Learning compact hash codes for multimodal representations using orthogonal deep structure. IEEE Trans Multimedia 17(9):1404

  13. Liu W, Mu C, Kumar S, Chang S (2014) Discrete graph hashing. In: Advances in Neural Information Processing Systems, pp 3419–3427

  14. Ding K, Fan B, Huo C, Xiang S, Pan C (2017) Cross-modal hashing via rank-order preserving. IEEE Trans Multimedia 19(3):571

  15. Zhen L, Hu P, Wang X, Peng D (2019) Deep supervised cross-modal retrieval. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 10386–10395

  16. Cao Y, Long M, Wang J, Zhu H (2016) Correlation autoencoder hashing for supervised cross-modal search. In: Proceedings of the 2016 ACM on international conference on multimedia retrieval, pp 197–204

  17. Lin Z, Ding G, Hu M, Wang J (2015) Semantics-preserving hashing for cross-view retrieval. In: 2015 IEEE conference on computer vision and pattern recognition, pp 3864–3872

  18. Mandal D, Chaudhury K, Biswas S (2017) Generalized semantic preserving hashing for n-label cross-modal retrieval. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2633–2641

  19. Jiang Q, Li W (2017) Deep cross-modal hashing. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3232–3240

  20. Yang E, Deng C, Liu W, Liu X, Tao D, Gao X (2017) Pairwise relationship guided deep hashing for cross-modal retrieval. In: Thirty-First AAAI conference on artificial intelligence, pp 1618–1625

  21. Weng W, Wu J, Yang L, Liu L, Hu B (2019) Label-based deep semantic hashing for cross-modal retrieval. In: Neural Information Processing, pp 24–36

  22. Li C, Deng C, Li N, Liu W, Gao X, Tao D (2018) Self-supervised adversarial hashing networks for cross-modal retrieval. In: 2018 IEEE conference on computer vision and pattern recognition, pp 4242–4251

  23. Zhang X, Lai H, Feng J (2018) Attention-aware deep adversarial hashing for cross-modal retrieval. In: Proceedings of the European conference on computer vision, pp 591–606

  24. Zhou J, Ding G, Guo Y (2014) Latent semantic sparse hashing for cross-modal similarity search. In: Proceedings of the 37th international ACM SIGIR conference on research & development in information retrieval, pp 415–424

  25. Song J, Yang Y, Yang Y, Huang Z, Shen HT (2013) Inter-media hashing for large-scale retrieval from heterogeneous data sources. In: Proceedings of the 2013 ACM SIGMOD international conference on management of data, pp 785–796

  26. Ding G, Guo Y, Zhou J (2014) Collective matrix factorization hashing for multimodal data. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2075–2082

  27. Zhang D, Li W (2014) Large-scale supervised multimodal hashing with semantic correlation maximization. In: Twenty-Eighth AAAI Conference on Artificial Intelligence, pp 2177–2183

  28. Lin Z, Ding G, Hu M, Wang J (2015) Semantics-preserving hashing for cross-view retrieval. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3864–3872

  29. Xu X, Shen F, Yang Y, Shen HT, Li X (2017) Learning discriminative binary codes for large-scale cross-modal retrieval. IEEE Trans Image Process 26(5):2494

  30. Zhang J, Peng Y, Yuan M (2018) Unsupervised generative adversarial cross-modal hashing. In: 32nd AAAI conference on artificial intelligence, pp 539–546

  31. Yang M, Zhao W, Xu W, Feng Y, Zhao Z, Chen X, Lei K (2019) Multitask learning for cross-domain image captioning. IEEE Trans Multimedia 21(4):1047

  32. Wu Q, Shen C, Wang P, Dick A, Hengel A (2018) Image captioning and visual question answering based on attributes and external knowledge. IEEE Trans Pattern Anal Mach Intell 40(6):1367

  33. Chen K, Zhao T, Yang M, Liu L, Tamura A, Wang R, Utiyama M, Sumita E (2018) A neural approach to source dependence based context model for statistical machine translation. IEEE/ACM Trans Audio Speech Language Process 26(2):266

  34. Yang Z, He X, Gao J, Deng L, Smola A (2016) Stacked attention networks for image question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 21–29

  35. Peng Y, Qi J, Yuan Y (2018) Modality-specific cross-modal similarity measurement with recurrent attention network. IEEE Trans Image Process 27(11):5585

  36. Chatfield K, Simonyan K, Vedaldi A, Zisserman A (2014) Return of the devil in the details: delving deep into convolutional nets. arXiv:1405.3531

  37. He K, Sun J (2015) Convolutional neural networks at constrained time cost. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5353–5360

  38. Huiskes M, Lew M (2008) The MIR flickr retrieval evaluation. In: Proceedings of the 1st ACM international conference on multimedia information retrieval, pp 39–43

  39. Chua T, Tang J, Hong R, Li H, Luo Z, Zheng Y (2009) NUS-WIDE: a real-world web image database from National University of Singapore. In: Proceedings of the ACM international conference on image and video retrieval, p 48

  40. Liu W, Mu C, Kumar S, Chang SF (2014) Discrete graph hashing. In: Proceedings of the 27th international conference on neural information processing systems, pp 3419–3427

  41. Kumar S, Udupa R (2011) Learning hash functions for cross-view similarity search. In: Twenty-second international joint conference on artificial intelligence, pp 1360–1365

  42. Wang D, Gao X, Wang X, He L (2015) Semantic topic multimodal hashing for cross-media retrieval. In: Twenty-fourth international joint conference on artificial intelligence, pp 3890–3896

Acknowledgements

This research is supported by the National Natural Science Foundation of China under Grants Nos. 61872191 and 41571389.

Author information

Corresponding author

Correspondence to Jiagao Wu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Optimization algorithm

It is intractable to optimize Eq. (5) directly since it is non-convex in the variables \(\omega ^p\), \(\omega ^t\) and \(\mathbf{B }\). However, it is convex in any one variable when the other two are fixed. Therefore, we use an alternating learning strategy that fixes two of the parameters and updates the remaining one at each iteration until convergence. The whole alternating learning procedure is shown in Algorithm 1; a minimal code sketch of the outer loop is given below, and the detailed optimization steps are listed afterwards.
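
The sketch only shows the loop structure around Steps 1–3; the three update callables and the objective are placeholders to be filled in with the steps described below, and are not part of the authors' implementation.

```python
def alternating_optimization(update_wp, update_wt, update_B, objective,
                             max_iters=500, tol=1e-4):
    """Outer loop of the alternating strategy: each step fixes two of
    (w^p, w^t, B) and updates the remaining one, until the objective of
    Eq. (5) stops decreasing."""
    prev = float("inf")
    for _ in range(max_iters):
        update_wp()          # Step 1: SGD/BP update of the image network w^p
        update_wt()          # Step 2: SGD/BP update of the text network w^t
        update_B()           # Step 3: closed-form update of B via Eq. (18)
        cur = objective()    # value of Eq. (5) after this round
        if abs(prev - cur) < tol:
            break
        prev = cur
```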

Step 1: Optimize \(\omega ^p\) with \(\omega ^t\) and \(\mathbf{B }\) fixed. We use stochastic gradient descent (SGD) with back-propagation (BP) to optimize the CNN parameters \(\omega ^p\) of the image modality. For each sampled point \(\mathbf{p }_i\), the gradient with respect to \(\hat{\mathbf{F }}_{*i}^p\) is computed as follows:

$$\begin{aligned} \frac{\partial {J}}{\partial {\hat{\mathbf{F }}_{*i}^p}}=\frac{1}{512}{\sum _{j=1}^n{\left( {\left( \frac{1}{2\cdot 512}(\hat{\mathbf{F }}^p_{*i})^\top (\hat{\mathbf{F }}^l_{*j})+\frac{1}{2}\right) -{\hat{S}}_{ij}}\right) \cdot {\hat{\mathbf{F }}^l_{*j}}}} \end{aligned}$$
(12)

The gradient with respect to \(\hat{\mathbf{H }}_{*i}^p\) is computed as follows:

$$\begin{aligned} \begin{aligned} \frac{\partial {J}}{\partial {\hat{\mathbf{H }}_{*i}^p}}&=\frac{\gamma }{c}{\sum _{j=1}^n{\left( {\left( \frac{1}{2c}(\hat{\mathbf{H }}^p_{*i})^\top (\hat{\mathbf{H }}^l_{*j})+\frac{1}{2}\right) -{\hat{S}}_{ij}}\right) \cdot {\hat{\mathbf{H }}^l_{*j}}}} \\&\quad +2\mu {(\hat{\mathbf{H }}_{*i}^p-\mathbf{B }_{*i})}+2\tau \hat{\mathbf{H }}^p\cdot \mathbf{1 } \end{aligned} \end{aligned}$$
(13)

Then \(\frac{\partial {J}}{\partial {\omega ^p}}\) can be computed from \(\frac{\partial {J}}{\partial {\hat{\mathbf{F }}_{*i}^p}}\) and \(\frac{\partial {J}}{\partial {\hat{\mathbf{H }}_{*i}^p}}\) by the chain rule, and BP is then used to update the parameter \(\omega ^p\).
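
For concreteness, a minimal NumPy sketch of Eqs. (12) and (13) is given below. It assumes the column-wise layout used in the text (semantic features are \(512 \times n\) matrices, hash outputs and codes are \(c \times n\), and \(\hat{S}\) is the \(n \times n\) similarity matrix); the function names are ours, not the authors' implementation.

```python
import numpy as np

def grad_F_p(F_p, F_l, S_hat, i):
    """Eq. (12): gradient of J w.r.t. the i-th column of the image semantic
    features F^p. F_p, F_l: (512, n) matrices; S_hat: (n, n) similarity."""
    theta = F_p[:, i] @ F_l / (2 * 512) + 0.5        # shape (n,)
    return F_l @ (theta - S_hat[i]) / 512            # shape (512,)

def grad_H_p(H_p, H_l, S_hat, B, i, gamma, mu, tau):
    """Eq. (13): gradient of J w.r.t. the i-th column of the image hash
    output H^p. H_p, H_l, B: (c, n) matrices, with c the code length."""
    c = H_p.shape[0]
    theta = H_p[:, i] @ H_l / (2 * c) + 0.5          # shape (n,)
    pairwise = (gamma / c) * (H_l @ (theta - S_hat[i]))
    quantization = 2 * mu * (H_p[:, i] - B[:, i])
    balance = 2 * tau * H_p.sum(axis=1)              # the H^p * 1 bit-balance term
    return pairwise + quantization + balance
```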

Step 2: Optimize \(\omega ^t\) with \(\omega ^p\) and \(\mathbf{B }\) fixed. We use SGD with BP to optimize the deep neural network parameters \(\omega ^t\) of the text modality. For each sampled point \(\mathbf{t }_i\), the gradient with respect to \(\hat{\mathbf{F }}_{*i}^t\) is computed as follows:

$$\begin{aligned} \frac{\partial {J}}{\partial {\hat{\mathbf{F }}_{*i}^t}}=\frac{1}{512}{\sum _{j=1}^n{\left( {\left( \frac{1}{2\cdot 512}(\hat{\mathbf{F }}^t_{*i})^\top (\hat{\mathbf{F }}^l_{*j})+\frac{1}{2}\right) -{\hat{S}}_{ij}}\right) \cdot {\hat{\mathbf{F }}^l_{*j}}}} \end{aligned}$$
(14)

The gradient with respect to \(\hat{\mathbf{H }}_{*i}^t\) is computed as follows:

$$\begin{aligned} \begin{aligned} \frac{\partial {J}}{\partial {\hat{\mathbf{H }}_{*i}^t}}&=\frac{\gamma }{c}{\sum _{j=1}^n{\left( {\left( \frac{1}{2c}(\hat{\mathbf{H }}^t_{*i})^\top (\hat{\mathbf{H }}^l_{*j})+\frac{1}{2}\right) -{\hat{S}}_{ij}}\right) \cdot {\hat{\mathbf{H }}^l_{*j}}}} \\&\quad +2\mu {(\hat{\mathbf{H }}_{*i}^t-\mathbf{B }_{*i})}+2\tau \hat{\mathbf{H }}^t\cdot \mathbf{1 } \end{aligned} \end{aligned}$$
(15)

Then \(\frac{\partial {J}}{\partial {\omega ^t}}\) can be computed from \(\frac{\partial {J}}{\partial {\hat{\mathbf{F }}_{*i}^t}}\) and \(\frac{\partial {J}}{\partial {\hat{\mathbf{H }}_{*i}^t}}\) by the chain rule, and BP is then used to update the parameter \(\omega ^t\). These gradients mirror Eqs. (12) and (13), so the sketch after Step 1 applies with the text-modality matrices substituted.

Step 3: Optimize \(\mathbf{B }\) with \(\omega ^p\) and \(\omega ^t\) fixed. The objective function shown in Eq. (5) can be reformulated as follows:

$$\begin{aligned} \begin{aligned} \min \limits _{\mathbf{B }}J&=\mu {({\Vert \mathbf{B }-\hat{\mathbf{H }}^p\Vert }_F^2+{\Vert \mathbf{B }-\hat{\mathbf{H }}^t\Vert }_F^2)}\\ s.t. \quad&\mathbf{B }\in {\{-1,1\}}^{c \times n} \end{aligned} \end{aligned}$$
(16)

Since \(\Vert \mathbf{B }\Vert _F^2=cn\) is constant and \(\Vert \hat{\mathbf{H }}^p\Vert _F^2\), \(\Vert \hat{\mathbf{H }}^t\Vert _F^2\) do not depend on \(\mathbf{B }\), this is equivalent to the following maximization:

$$\begin{aligned} \begin{aligned} \max \limits _{\mathbf{B }}J&=Tr{(\mathbf{B }^\top (\mu (\hat{\mathbf{H }}^p+\hat{\mathbf{H }}^t)))} \\ s.t. \quad&\mathbf{B }\in {\{-1,1\}}^{c \times n} \end{aligned} \end{aligned}$$
(17)

To maximize this trace, each entry of \(\mathbf{B }\) must take the same sign as the corresponding entry of \(\mu (\hat{\mathbf{H }}^p+\hat{\mathbf{H }}^t)\). Therefore, the following closed-form solution is obtained:

$$\begin{aligned} \mathbf{B }=sign(\mu (\hat{\mathbf{H }}^p+\hat{\mathbf{H }}^t)) \end{aligned}$$
(18)
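
Step 3 thus admits the closed form of Eq. (18); a short NumPy sketch follows, where mapping the zero case to +1 is our convention, since Eq. (18) leaves it unspecified.

```python
import numpy as np

def update_binary_codes(H_p, H_t, mu):
    """Eq. (18): B = sign(mu * (H^p + H^t)), with ties at zero mapped to +1
    so that every entry of B stays in {-1, +1}."""
    return np.where(mu * (H_p + H_t) >= 0, 1, -1)
```
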
Algorithm 1 The alternating learning algorithm of DSHDA

Time complexity of DSHDA

As shown in Fig. 1, DSHDA consists of two stages, each of which is further composed of several modules. We analyze the time complexity of these modules one by one as follows.

The first stage of DSHDA is the SeLabNet, which is a network with multiple fully connected layers. The time complexity of such a network can be expressed as follows:

$$\begin{aligned} O\left( \sum _{l=1}^{d_F} {w_{l-1}\cdot w_l}\right) \end{aligned}$$
(19)

where \(w_l\) is the number of neurons in the l-th layer and \(d_F\) is the total number of fully connected layers.

Let \(w_0=k\) be the dimension of the input multi-label annotation. Then, according to Eq. (19) and Table 1, we can obtain the time complexity of the semantic feature generation module \(G^l\) and the semantic hash code generation module \(D^l\) as in Eqs. (20) and (21), respectively.

$$\begin{aligned}&O(k \cdot k + 2048 \cdot k + 2048 \cdot 512 ) \nonumber \\&\quad \sim O(k^2+2048 \cdot k + 1.05 \times 10^6) \end{aligned}$$
(20)
$$\begin{aligned}&O(512 \cdot c ) \end{aligned}$$
(21)
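
As a quick sanity check of Eqs. (19)–(21), the sketch below counts the multiply-accumulate operations of a fully connected stack. The layer widths [k, k, 2048, 512] for \(G^l\) and [512, c] for \(D^l\) are read off Eqs. (20) and (21) rather than Table 1, and the example values of k and c are arbitrary.

```python
def fc_complexity(widths):
    """Eq. (19): multiply-accumulate count of a fully connected stack,
    where widths[l] is the number of neurons in layer l (widths[0] = input dim)."""
    return sum(widths[l - 1] * widths[l] for l in range(1, len(widths)))

k, c = 100, 64                               # illustrative label dimension and code length
G_l_cost = fc_complexity([k, k, 2048, 512])  # Eq. (20): k*k + 2048*k + 2048*512
D_l_cost = fc_complexity([512, c])           # Eq. (21): 512*c
print(G_l_cost, D_l_cost)                    # 1263376 32768
```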

The second stage of DSHDA comprises the image and text networks, each of which consists of five modules: feature learning (\(E^p\) and \(E^t\)), Lo-attention (\(A^p\) and \(A^t\)), semantic feature generation (\(G^p\) and \(G^t\)), B-structure (\(V^p\) and \(V^t\)) and semantic hash code generation (\(D^p\) and \(D^t\)).

For the image network, the image feature learning module \(E^p\) is a five-layer convolutional network. The time complexity of a network with multiple convolutional layers is given by [37]:

$$\begin{aligned} O\left( \sum _{l=1}^{d_L} {n_{l-1} \cdot s_l^2 \cdot n_l \cdot m_l^2}\right), \end{aligned}$$
(22)

where \(d_L\) is the number of convolutional layers, \(s_l\) is the spatial size (length) of the filter in the l-th layer, \(n_l\) is the number of filters in the l-th layer, \(m_l\) is the spatial size of the output feature map in the l-th layer, and \(n_{l-1}\) is also known as the number of input channels of the l-th layer.

Let \(t_l\) be the stride of the filter in the l-th layer; then

$$\begin{aligned} m_l \approx \frac{m_{l-1}}{t_l} \end{aligned}$$
(23)

Besides, the length of the original image feature \(d_p\) can be expressed as:

$$\begin{aligned} d_p = m_0^2 \cdot n_0, \end{aligned}$$
(24)

where \(m_0\) and \(n_0\) can be considered as the spatial size and the number of channels of the input feature map for the first convolutional layer, respectively.

Then, combining Eqs. (23) and (24), Eq. (22) can be rewritten as:

$$\begin{aligned} O\left( \sum _{l=1}^{d_L} \frac{n_{l-1} \cdot s_l^2 \cdot n_l}{n_0 \cdot \prod _{j=1}^{l}{t_j^2}} \cdot d_p\right) \end{aligned}$$
(25)
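
To make Eqs. (22)–(25) concrete, the following sketch evaluates both forms on an illustrative two-layer configuration (3-channel 64×64 input, two stride-2 3×3 convolutions); these layer settings are placeholders, not the ones of Table 2.

```python
def conv_complexity(n, s, m):
    """Eq. (22): sum_l n_{l-1} * s_l^2 * n_l * m_l^2 over the conv layers.
    n[0] is the input channel count; n[l], s[l], m[l] describe layer l."""
    return sum(n[l - 1] * s[l] ** 2 * n[l] * m[l] ** 2 for l in range(1, len(n)))

# Illustrative configuration (not Table 2): 3-channel 64x64 input,
# conv1: 32 filters, 3x3, stride 2; conv2: 64 filters, 3x3, stride 2.
n, s, t = [3, 32, 64], [None, 3, 3], [None, 2, 2]
m = [64, 32, 16]                                  # m_l ~ m_{l-1} / t_l  (Eq. (23))
d_p = m[0] ** 2 * n[0]                            # Eq. (24): d_p = m_0^2 * n_0

direct = conv_complexity(n, s, m)

# Eq. (25): the same count expressed through d_p and the cumulative strides
via_dp, prod = 0.0, 1
for l in range(1, len(n)):
    prod *= t[l] ** 2
    via_dp += n[l - 1] * s[l] ** 2 * n[l] * d_p / (n[0] * prod)

assert direct == via_dp                           # both forms give 5_603_328 MACs
```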

Now, let \(n_0=3\) represent the three color channels of the input image. According to Eq. (25) and Table 2, we can obtain the time complexity of module \(E^p\) as follows:

$$\begin{aligned} \begin{aligned}&O\left( \frac{11\cdot 11\cdot 64}{4^2}\cdot d_p +\frac{64\cdot 5\cdot 5\cdot 265}{3\cdot 4^2}\cdot d_p \right. \\&\left. \quad +3\cdot \frac{64\cdot 3\cdot 3\cdot 265}{3\cdot 4^2}\cdot d_p \right) \\&\quad \sim O(4.88 \times 10^4 \cdot d_p) \end{aligned} \end{aligned}$$
(26)

The image feature maps output by module \(E^p\) have dimension \(d_C \cdot d_H \cdot d_W\), with \(d_C=265\) and \(d_H \cdot d_W=m_5^2=d_p/48\). These feature maps are the input of the Lo-attention module \(A^p\), which contains a convolutional layer with a \(1 \times 1\) kernel size. Then, the time complexity of module \(A^p\) is:

$$\begin{aligned} O(33 \cdot d_p ) \end{aligned}$$
(27)
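
The text specifies only that \(A^p\) is built around a \(1\times 1\) convolution over the \(E^p\) feature maps; the PyTorch sketch below is one plausible realization of such a Lo-attention block (a 1×1 convolution scoring each spatial location, followed by a softmax reweighting). The single-channel score map and the softmax normalization are our assumptions, not details stated in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoAttention(nn.Module):
    """Sketch of a local attention block built around a 1x1 convolution."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)  # one score per location

    def forward(self, feats):                     # feats: (batch, C, H, W)
        b, c, h, w = feats.shape
        attn = self.score(feats).view(b, 1, h * w)
        attn = F.softmax(attn, dim=-1).view(b, 1, h, w)  # weights over the H*W locations
        return feats * attn                       # reweight the local features

# Example with d_C = 265 channels, as stated for the E^p output maps
maps = torch.randn(2, 265, 6, 6)
weighted = LoAttention(265)(maps)
```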

With the same approach, we can obtain the time complexity of the other modules in the image and text networks of DSHDA. Due to space limitations, we omit the detailed analysis and only list the time complexity of the DSHDA modules in Table 11.

Table 11 Time complexity of DSHDA modules

Thus, the total time complexity of DSHDA is the sum over all modules in Table 11. Noting that the values of k and c are \(O(10^2)\) and \(d_p\) and \(d_t\) are \(O(10^3)\) in practice, the time complexity of DSHDA is:

$$\begin{aligned} O(\epsilon \times 10^8), \end{aligned}$$
(28)

where \(\epsilon >1\) is a scale factor of the time complexity.

About this article

Cite this article

Wu, J., Weng, W., Fu, J. et al. Deep semantic hashing with dual attention for cross-modal retrieval. Neural Comput & Applic 34, 5397–5416 (2022). https://doi.org/10.1007/s00521-021-06696-y
