Abstract
Image-text retrieval aims to capture the semantic correspondence between images and texts, serving as a foundation and crucial component of multi-modal recommendation, search systems, and online shopping. Existing mainstream methods primarily focus on modeling the association of image-text pairs while neglecting the beneficial impact of multi-task learning on image-text retrieval. To this end, we propose a multi-task visual semantic embedding network (MVSEN) for image-text retrieval. Specifically, we design two auxiliary tasks, text-text matching and multi-label classification, as semantic constraints that improve the generalization and robustness of visual semantic embedding from a training perspective. Besides, we present an intra- and inter-modality interaction scheme that learns discriminative visual and textual feature representations by facilitating information flow within and between modalities. Subsequently, we utilize multi-layer graph convolutional networks in a cascading manner to infer the correlation of image-text pairs. Experimental results show that MVSEN outperforms state-of-the-art methods on two publicly available datasets, Flickr30K and MS-COCO, with rSum improvements of 8.2% and 3.0%, respectively.
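The abstract describes training the main image-text matching objective jointly with two auxiliary tasks (text-text matching and multi-label classification). A minimal sketch of such a multi-task objective is shown below; the weighting hyperparameters `lambda_tt` and `lambda_ml` are hypothetical illustrations, not values taken from the paper.

```python
# Hedged sketch of a multi-task training objective: the main image-text
# matching loss is combined with two auxiliary losses (text-text matching
# and multi-label classification) as weighted semantic constraints.
# The weights lambda_tt and lambda_ml are assumed hyperparameters.
def multi_task_loss(l_itm, l_ttm, l_mlc, lambda_tt=0.5, lambda_ml=0.5):
    """Return the weighted sum of the main and auxiliary task losses.

    l_itm: main image-text matching loss
    l_ttm: auxiliary text-text matching loss
    l_mlc: auxiliary multi-label classification loss
    """
    return l_itm + lambda_tt * l_ttm + lambda_ml * l_mlc


# Example: 1.2 + 0.5 * 0.4 + 0.5 * 0.6 = 1.7
total = multi_task_loss(1.2, 0.4, 0.6)
```

In practice the three loss terms would each be computed from the shared embedding network, so the auxiliary gradients regularize the same parameters used for retrieval.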
Ethics declarations
Conflict of Interest The authors declare that they have no conflict of interest.
Additional information
This work was supported by the National Natural Science Foundation of China under Grant No. 62076048.
Xue-Yang Qin received his Ph.D. degree in computer software and theory from Dalian University of Technology, Dalian, in 2024, and received his M.S. degree in computer systems organization from Shaanxi Normal University, Xi’an, in 2019. His main research interests include cross-modal retrieval, information extraction, and multimodal data processing.
Li-Shuang Li received her Ph.D. degree in knowledge management from Dalian University of Technology, Dalian, in 2013. She is currently a professor with the School of Computer Science and Technology, Dalian University of Technology, Dalian. Her current research interests include data mining, natural language processing, and information extraction.
Jing-Yao Tang received her M.Sc. degree in computer software and theory from South China Normal University, Guangzhou, in 2021. She is working toward her Ph.D. degree in the School of Computer Science and Technology, Dalian University of Technology, Dalian. Her main research interests include natural language processing, information extraction, causal inference, and low-resource learning.
Fei Hao received his Ph.D. degree in computer science and engineering from Soonchunhyang University, Asan, in 2016. Since 2016, he has been with Shaanxi Normal University, Xi’an, where he is an associate professor. His research interests include social computing, soft computing, big data analytics, pervasive computing, and data mining.
Mei-Ling Ge received her Master’s degree in computer applied technology from the College of Computer Science, Shaanxi Normal University, Xi’an, in 2021. She is currently a teaching assistant with the School of Computer Engineering, Weifang University, Weifang. Her current research interests include recommendation, multimedia content analysis and retrieval, machine learning, and data mining.
Guang-Yao Pang received his Ph.D. degree in computer software and theory from Shaanxi Normal University, Xi’an, in 2021, and his M.S. degree in software engineering from the University of Electronic Science and Technology of China, Chengdu, in 2013. He is currently an associate professor with the Guangxi Colleges and Universities Key Laboratory of Intelligent Industry Software, Wuzhou University, Wuzhou. His main research interests include deep learning, recommender systems, and multimodal data processing.
Cite this article
Qin, XY., Li, LS., Tang, JY. et al. Multi-Task Visual Semantic Embedding Network for Image-Text Retrieval. J. Comput. Sci. Technol. 39, 811–826 (2024). https://doi.org/10.1007/s11390-024-4125-1