Abstract
Image-text retrieval aims to capture the semantic correspondence between images and texts, serving as a foundation and crucial component of multi-modal recommendation, search systems, and online shopping. Existing mainstream methods primarily focus on modeling the association of image-text pairs while neglecting the beneficial impact of multi-task learning on image-text retrieval. To this end, we propose a multi-task visual semantic embedding network (MVSEN) for image-text retrieval. Specifically, we design two auxiliary tasks, text-text matching and multi-label classification, as semantic constraints that improve the generalization and robustness of visual semantic embedding from a training perspective. Besides, we present an intra- and inter-modality interaction scheme that learns discriminative visual and textual feature representations by facilitating information flow within and between modalities. Subsequently, we utilize multi-layer graph convolutional networks in a cascading manner to infer the correlation of image-text pairs. Experimental results show that MVSEN outperforms state-of-the-art methods on two publicly available datasets, Flickr30K and MS-COCO, with rSum improvements of 8.2% and 3.0%, respectively.
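The abstract describes training the main image-text matching objective jointly with two auxiliary tasks (text-text matching and multi-label classification). A minimal sketch of such a multi-task objective is shown below; the weighting hyperparameters `lambda_tt` and `lambda_ml` are hypothetical illustrations, not values taken from the paper.

```python
# Hedged sketch of a multi-task training objective: the main image-text
# matching loss is combined with two auxiliary losses (text-text matching
# and multi-label classification) as weighted semantic constraints.
# The weights lambda_tt and lambda_ml are assumed hyperparameters.
def multi_task_loss(l_itm, l_ttm, l_mlc, lambda_tt=0.5, lambda_ml=0.5):
    """Return the weighted sum of the main and auxiliary task losses.

    l_itm: main image-text matching loss
    l_ttm: auxiliary text-text matching loss
    l_mlc: auxiliary multi-label classification loss
    """
    return l_itm + lambda_tt * l_ttm + lambda_ml * l_mlc


# Example: 1.2 + 0.5 * 0.4 + 0.5 * 0.6 = 1.7
total = multi_task_loss(1.2, 0.4, 0.6)
```

In practice the three loss terms would each be computed from the shared embedding network, so the auxiliary gradients regularize the same parameters used for retrieval.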
Ethics declarations
Conflict of Interest The authors declare that they have no conflict of interest.
Additional information
This work was supported by the National Natural Science Foundation of China under Grant No. 62076048.
Xue-Yang Qin received his Ph.D. degree in computer software and theory from Dalian University of Technology, Dalian, in 2024, and received his M.S. degree in computer systems organization from Shaanxi Normal University, Xi’an, in 2019. His main research interests include cross-modal retrieval, information extraction, and multimodal data processing.
Li-Shuang Li received her Ph.D. degree in knowledge management from Dalian University of Technology, Dalian, in 2013. She is currently a professor with the School of Computer Science and Technology, Dalian University of Technology, Dalian. Her current research interests include data mining, natural language processing, and information extraction.
Jing-Yao Tang received her M.Sc. degree in computer software and theory from South China Normal University, Guangzhou, in 2021. She is working toward her Ph.D. degree in the School of Computer Science and Technology, Dalian University of Technology, Dalian. Her main research interests include natural language processing, information extraction, causal inference, and low-resource learning.
Fei Hao received his Ph.D. degree in computer science and engineering from Soonchunhyang University, Asan, in 2016. Since 2016, he has been with Shaanxi Normal University, Xi’an, where he is an associate professor. His research interests include social computing, soft computing, big data analytics, pervasive computing, and data mining.
Mei-Ling Ge received her Master’s degree in computer applied technology from the College of Computer Science, Shaanxi Normal University, Xi’an, in 2021. She is currently a teaching assistant with the School of Computer Engineering, Weifang University, Weifang. Her current research interests include recommendation, multimedia content analysis and retrieval, machine learning, and data mining.
Guang-Yao Pang received his Ph.D. degree in computer software and theory from Shaanxi Normal University, Xi’an, in 2021, and his M.S. degree in software engineering from the University of Electronic Science and Technology of China, Chengdu, in 2013. He is currently an associate professor with the Guangxi Colleges and Universities Key Laboratory of Intelligent Industry Software, Wuzhou University, Wuzhou. His main research interests include deep learning, recommender systems, and multimodal data processing.
Cite this article
Qin, XY., Li, LS., Tang, JY. et al. Multi-Task Visual Semantic Embedding Network for Image-Text Retrieval. J. Comput. Sci. Technol. 39, 811–826 (2024). https://doi.org/10.1007/s11390-024-4125-1