Alignment efficient image-sentence retrieval considering transferable cross-modal representation learning

Yang, Yang; Guo, Jinyi; Li, Guangyu; Li, Lanyu; Li, Wenjie; Yang, Jian

doi:10.1007/s11704-023-3186-6

Research Article
Published: 02 December 2023

Volume 18, article number 181335, (2024)

Frontiers of Computer Science

Yang Yang¹,²,³, Jinyi Guo¹, Guangyu Li¹, Lanyu Li⁴, Wenjie Li² & Jian Yang¹

Abstract

Traditional image-sentence cross-modal retrieval methods usually aim to learn consistent representations of heterogeneous modalities, so that similar instances in one modality can be retrieved given a query from another modality. The basic assumption behind these methods is that parallel multi-modal data (i.e., data in which the different modalities of the same example are aligned) can be obtained in advance. In other words, image-sentence cross-modal retrieval is a supervised task with the alignments as ground truth. However, in many real-world applications, it is difficult to align a large amount of parallel data for new scenarios because of the substantial labor costs, which leaves the multi-modal data non-parallel, so existing methods cannot be used directly. On the other hand, auxiliary parallel multi-modal data with similar semantics often exist and can help the non-parallel data learn consistent representations. Therefore, in this paper, we study “Alignment Efficient Image-Sentence Retrieval” (AEIR), which uses the auxiliary parallel image-sentence data as the source domain data and takes the non-parallel data as the target domain data. Unlike single-modal transfer learning, AEIR learns consistent image-sentence cross-modal representations of the target domain by transferring the alignments of existing parallel data. Specifically, AEIR learns consistent image-sentence representations in the source domain from parallel data, while transferring the alignment knowledge across domains by jointly optimizing a newly designed cross-domain cross-modal metric-learning-based constraint together with an intra-modal domain adversarial loss. Consequently, we can effectively learn consistent representations for the target domain, considering both structure and semantic transfer. Furthermore, extensive experiments on different transfer scenarios validate that AEIR achieves better retrieval results than the baselines.
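
The abstract does not give the loss functions, so the following is only an illustrative sketch of the two kinds of terms it names, not the authors' formulation. It assumes a standard bidirectional hinge ranking loss for the aligned source pairs and a linear domain discriminator with binary cross-entropy for the intra-modal adversarial term. All function names, the margin, the discriminator parameters `w` and `b`, and the weight `lam` are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def ranking_loss(img, txt, margin=0.2):
    """Bidirectional hinge ranking loss; row i of img is aligned with row i of txt."""
    sim = l2_normalize(img) @ l2_normalize(txt).T             # (n, n) cosine similarities
    pos = np.diag(sim)                                        # scores of the aligned pairs
    cost_i2t = np.maximum(0.0, margin + sim - pos[:, None])   # image query vs. wrong sentences
    cost_t2i = np.maximum(0.0, margin + sim - pos[None, :])   # sentence query vs. wrong images
    off_diag = 1.0 - np.eye(sim.shape[0])                     # mask out the positives themselves
    return float(((cost_i2t + cost_t2i) * off_diag).sum() / sim.shape[0])

def domain_bce(src_feat, tgt_feat, w, b):
    """Cross-entropy of a linear domain discriminator within one modality (source=1, target=0)."""
    feats = np.vstack([src_feat, tgt_feat])
    labels = np.concatenate([np.ones(len(src_feat)), np.zeros(len(tgt_feat))])
    logits = feats @ w + b
    # Numerically stable binary cross-entropy computed from logits.
    loss = np.maximum(logits, 0.0) - logits * labels + np.log1p(np.exp(-np.abs(logits)))
    return float(loss.mean())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    src_img = rng.normal(size=(8, 16))   # source-domain image embeddings (parallel)
    src_txt = rng.normal(size=(8, 16))   # source-domain sentence embeddings (parallel)
    tgt_img = rng.normal(size=(8, 16))   # target-domain image embeddings (non-parallel)
    w, b = rng.normal(size=16), 0.0      # hypothetical discriminator parameters
    lam = 0.1                            # hypothetical trade-off weight
    # Assumed adversarial setup: the encoder lowers the ranking loss while
    # raising the discriminator's loss, i.e. making the domains hard to tell apart.
    encoder_objective = ranking_loss(src_img, src_txt) - lam * domain_bce(src_img, tgt_img, w, b)
    print(encoder_objective)
```

In this sketch the ranking term only uses source-domain pairs, since only those are aligned, while the discriminator term is the only place where target-domain features enter. How AEIR actually couples the two domains in its metric-learning constraint is described in the full paper, not in the abstract.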

Acknowledgements

This research was supported by the National Key R&D Program of China (2022YFF0712100), the National Natural Science Foundation of China (Grant Nos. 62006118, 62276131, 62006119), the Natural Science Foundation of Jiangsu Province of China (BK20200460), the Jiangsu Shuangchuang (Mass Innovation and Entrepreneurship) Talent Program, the Young Elite Scientists Sponsorship Program by CAST, and the Fundamental Research Funds for the Central Universities (Nos. NJ2022028, 30922010317).

Author information

Authors and Affiliations

School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China
Yang Yang, Jinyi Guo, Guangyu Li & Jian Yang
Department of Computing, Hong Kong Polytechnic University, Hong Kong, 100872, China
Yang Yang & Wenjie Li
State Key Lab. for Novel Software Technology, Nanjing University, Nanjing, 210094, China
Yang Yang
14th Research Institute of China Electronics Technology Group Corporation, Nanjing, 210094, China
Lanyu Li

Corresponding authors

Correspondence to Guangyu Li or Lanyu Li.

Ethics declarations

Competing interests: The authors declare that they have no competing interests or financial conflicts to disclose.

Additional information

Yang Yang received the PhD degree in computer science from Nanjing University, China in 2019. He is currently a professor with the School of Computer Science and Engineering, Nanjing University of Science and Technology, China. His research interests lie primarily in machine learning and data mining, including heterogeneous learning, model reuse, and incremental mining. He has published over 30 papers in leading international journals and conferences. He serves as a PC member for leading conferences such as IJCAI, AAAI, ICML, and NIPS.

Jinyi Guo received the MSc degree from the School of Computer Science and Engineering, Nanjing University of Science and Technology, China. His research interests lie primarily in cross-modal learning.

Guangyu Li received the BS degree from China University of Mining and Technology and the MS degree from Tongji University, China in 2008 and 2011, respectively, and the PhD degree in computer science from University of Paris-Sud, France in 2015. He is currently working as an assistant professor with the Key Laboratory of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, Nanjing University of Science and Technology, China. His current research interests include machine learning, computer vision, and wireless networks.

Lanyu Li received the PhD degree in computer science from Nanjing University, China in 2019. He is currently a senior engineer at the 14th Research Institute of China Electronics Technology Group Corporation, China. His research focuses on multimodal information interpretation and analysis based on remote sensing data, and he has presided over two national allocation projects.

Wenjie Li received the PhD degree in systems engineering and engineering management from The Chinese University of Hong Kong, China in 1997. She is currently a Professor with the Department of Computing, The Hong Kong Polytechnic University, China. Her main research interests include natural language understanding and generation, machine conversation, and summarization and question answering.

Jian Yang received the PhD degree in pattern recognition and intelligence systems from the Nanjing University of Science and Technology (NUST), China in 2002. He is currently a professor with the School of Computer Science and Engineering, NUST, China. He has authored more than 200 scientific papers in pattern recognition and computer vision. His papers have been cited more than 6000 times in the Web of Science and 15,000 times in Google Scholar. His current research interests include pattern recognition, computer vision, and machine learning. Dr. Yang is a Fellow of IAPR. He is currently an Associate Editor of Pattern Recognition, Pattern Recognition Letters, the IEEE Transactions on Neural Networks and Learning Systems, and Neurocomputing.

Electronic supplementary material