Multi-Task Visual Semantic Embedding Network for Image-Text Retrieval

  • Regular Paper
  • Special Section of CVM 2024
Journal of Computer Science and Technology

Abstract

Image-text retrieval aims to capture the semantic correspondence between images and texts, and it serves as a foundation and crucial component of multi-modal recommendation, search systems, and online shopping. Existing mainstream methods focus primarily on modeling the association of image-text pairs while neglecting the benefits that multi-task learning brings to image-text retrieval. To this end, we propose a multi-task visual semantic embedding network (MVSEN) for image-text retrieval. Specifically, we design two auxiliary tasks, text-text matching and multi-label classification, which impose semantic constraints that improve the generalization and robustness of the visual semantic embedding from a training perspective. In addition, we present an intra- and inter-modality interaction scheme that learns discriminative visual and textual feature representations by facilitating information flow within and between modalities. Subsequently, we utilize multi-layer graph convolutional networks in a cascading manner to infer the correlation of image-text pairs. Experimental results show that MVSEN outperforms state-of-the-art methods on two publicly available datasets, Flickr30K and MS-COCO, with rSum improvements of 8.2% and 3.0%, respectively.
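The paper's full formulation is not reproduced on this page, but the multi-task objective described in the abstract can be illustrated with a minimal sketch: a hinge-based triplet ranking loss for the main image-text matching task (in the spirit of the widely used hardest-negative formulation), plus two auxiliary terms for text-text matching and multi-label classification. All names here (txt_emb_pos, label_logits), the exact loss forms, and the weights lambda_tt and lambda_cls are illustrative assumptions, not MVSEN's actual design.

    import torch
    import torch.nn.functional as F

    def triplet_ranking_loss(a_emb, b_emb, margin=0.2):
        # Cosine similarity between every pair in the batch; the diagonal
        # holds the matched (positive) pairs.
        sim = F.normalize(a_emb, dim=1) @ F.normalize(b_emb, dim=1).t()
        pos = sim.diag().view(-1, 1)
        cost_ab = (margin + sim - pos).clamp(min=0)      # a -> b retrieval direction
        cost_ba = (margin + sim - pos.t()).clamp(min=0)  # b -> a retrieval direction
        # Ignore the matched pairs themselves, then penalize the hardest negatives.
        mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
        cost_ab = cost_ab.masked_fill(mask, 0)
        cost_ba = cost_ba.masked_fill(mask, 0)
        return cost_ab.max(dim=1)[0].mean() + cost_ba.max(dim=0)[0].mean()

    def multi_task_loss(img_emb, txt_emb, txt_emb_pos, label_logits, labels,
                        lambda_tt=0.1, lambda_cls=0.1):
        # Main task: cross-modal image-text matching.
        l_match = triplet_ranking_loss(img_emb, txt_emb)
        # Auxiliary task 1 (assumed form): match a caption against another
        # caption of the same image (txt_emb_pos) to constrain the text space.
        l_tt = triplet_ranking_loss(txt_emb, txt_emb_pos)
        # Auxiliary task 2 (assumed form): multi-label classification over
        # semantic concepts, as binary cross-entropy on concept logits.
        l_cls = F.binary_cross_entropy_with_logits(label_logits, labels.float())
        return l_match + lambda_tt * l_tt + lambda_cls * l_cls

In this sketch, the auxiliary losses act only through the shared embeddings during training and add no cost at retrieval time; how MVSEN actually weights and wires its two auxiliary tasks is detailed in the paper itself.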



Author information


Corresponding author

Correspondence to Li-Shuang Li (李丽双).

Ethics declarations

Conflict of Interest: The authors declare that they have no conflict of interest.

Additional information

This work was supported by the National Natural Science Foundation of China under Grant No. 62076048.

Xue-Yang Qin received his Ph.D. degree in computer software and theory from Dalian University of Technology, Dalian, in 2024, and his M.S. degree in computer systems organization from Shaanxi Normal University, Xi’an, in 2019. His main research interests include cross-modal retrieval, information extraction, and multimodal data processing.

Li-Shuang Li received her Ph.D. degree in knowledge management from Dalian University of Technology, Dalian, in 2013. She is currently a professor with the School of Computer Science and Technology, Dalian University of Technology, Dalian. Her current research interests include data mining, natural language processing, and information extraction.

Jing-Yao Tang received her M.Sc. degree in computer software and theory from South China Normal University, Guangzhou, in 2021. She is working toward her Ph.D. degree in the School of Computer Science and Technology, Dalian University of Technology, Dalian. Her main research interests include natural language processing, information extraction, causal inference, and low-resource learning.

Fei Hao received his Ph.D. degree in computer science and engineering from Soonchunhyang University, Asan, in 2016. Since 2016, he has been with Shaanxi Normal University, Xi’an, where he is an associate professor. His research interests include social computing, soft computing, big data analytics, pervasive computing, and data mining.

Mei-Ling Ge received her Master’s degree in computer applied technology from the College of Computer Science, Shaanxi Normal University, Xi’an, in 2021. She is currently a teaching assistant with the School of Computer Engineering, Weifang University, Weifang. Her current research interests include recommendation, multimedia content analysis and retrieval, machine learning, and data mining.

Guang-Yao Pang received his Ph.D. degree in computer software and theory from Shaanxi Normal University, Xi’an, in 2021, and his M.S. degree in software engineering from University of Electronic Science and Technology of China, Chengdu, in 2013. He is currently an associate professor with the Guangxi Colleges and Universities Key Laboratory of Intelligent Industry Software, Wuzhou University, Wuzhou. His main research interests include deep learning, recommendation systems, and multimodal data processing.


About this article


Cite this article

Qin, XY., Li, LS., Tang, JY. et al. Multi-Task Visual Semantic Embedding Network for Image-Text Retrieval. J. Comput. Sci. Technol. 39, 811–826 (2024). https://doi.org/10.1007/s11390-024-4125-1

