Abstract
Cross-modal retrieval provides a flexible way to find semantically relevant information across modalities given a query from one of them. The main challenge is measuring the similarity between data of different modalities. In general, different modalities carry unequal amounts of information when describing the same semantics; for example, textual descriptions often contain background details that images cannot convey, and vice versa. Most existing works map global features from each modality into a common semantic space and measure similarity there, which ignores the imbalanced and complementary relationships between modalities. In this paper, we propose stacked co-attention networks (SCANet) to progressively learn mutually attended features of different modalities and leverage these fine-grained correlations to improve cross-modal retrieval. SCANet adopts a dual-path, end-to-end framework that jointly learns the multimodal representations, the stacked co-attention, and a similarity metric. Experiments on three widely used benchmark datasets show that SCANet outperforms state-of-the-art methods, with an average MAP improvement of 19% in the best case.
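To make the idea concrete, below is a minimal sketch of how stacked co-attention between image regions and text words could be wired up. This is not the paper's implementation: the layer names, the max-pooled affinity attention, the residual re-weighting, and the mean pooling before the cosine score are all assumptions for illustration, under the assumption that both modalities are already embedded in a shared d-dimensional space.

```python
# Hypothetical sketch of stacked co-attention for cross-modal matching.
# NOT the authors' code; all design details here are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CoAttentionLayer(nn.Module):
    """One co-attention step: each modality is re-weighted by the other."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj_v = nn.Linear(dim, dim)  # projection for image-region features
        self.proj_t = nn.Linear(dim, dim)  # projection for text-word features

    def forward(self, v: torch.Tensor, t: torch.Tensor):
        # v: (B, R, d) region features; t: (B, W, d) word features.
        # Affinity between every region/word pair: (B, R, W).
        affinity = torch.bmm(self.proj_v(v), self.proj_t(t).transpose(1, 2))
        # Attention over regions guided by the text, and over words by the image.
        attn_v = F.softmax(affinity.max(dim=2).values, dim=1)  # (B, R)
        attn_t = F.softmax(affinity.max(dim=1).values, dim=1)  # (B, W)
        # Residual re-weighting keeps the original features and adds attended ones.
        v = v + attn_v.unsqueeze(-1) * v
        t = t + attn_t.unsqueeze(-1) * t
        return v, t


class StackedCoAttention(nn.Module):
    """Stacks co-attention layers, then scores pairs with cosine similarity."""

    def __init__(self, dim: int, num_layers: int = 2):
        super().__init__()
        self.layers = nn.ModuleList(
            [CoAttentionLayer(dim) for _ in range(num_layers)]
        )

    def forward(self, v: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            v, t = layer(v, t)
        # Pool the attended features into global vectors and compare them.
        return F.cosine_similarity(v.mean(dim=1), t.mean(dim=1), dim=-1)


# Usage: similarity scores for a batch of 4 image-text pairs.
model = StackedCoAttention(dim=512, num_layers=2)
scores = model(torch.randn(4, 36, 512), torch.randn(4, 20, 512))  # shape (4,)
```

Stacking the layers is what makes the attention progressive: each step refines which regions and words matter based on the other modality's current representation, which is the fine-grained correlation the abstract refers to.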
Acknowledgement
This work was supported by the National Key Research and Development Program (Grant No. 2017YFC0820700) and the Fundamental Theory and Cutting Edge Technology Research Program of the Institute of Information Engineering, CAS (Grant No. Y7Z0351101).
© 2018 Springer Nature Switzerland AG
Cite this paper
Lu, Y., Yu, J., Liu, Y., Tan, J., Guo, L., Zhang, W. (2018). Fine-Grained Correlation Learning with Stacked Co-attention Networks for Cross-Modal Information Retrieval. In: Liu, W., Giunchiglia, F., Yang, B. (eds.) Knowledge Science, Engineering and Management. KSEM 2018. Lecture Notes in Computer Science, vol. 11061. Springer, Cham. https://doi.org/10.1007/978-3-319-99365-2_19
Print ISBN: 978-3-319-99364-5
Online ISBN: 978-3-319-99365-2