A unified cycle-consistent neural model for text and image retrieval

Multimedia Tools and Applications

Abstract

Text-image retrieval has recently become a prominent research field, thanks to the development of deep learning architectures that can retrieve visual items given textual queries and vice versa. The key idea behind many state-of-the-art approaches is to learn a joint multi-modal embedding space into which text and images can be projected and compared. Here we take a different approach and reformulate text-image retrieval as the problem of learning a translation between the textual and visual domains. Our proposal leverages an end-to-end trainable architecture that can translate text into image features and vice versa, and regularizes this mapping with a cycle-consistency criterion. Experimental evaluations for text-to-image and image-to-text retrieval, conducted on small-, medium-, and large-scale datasets, show consistent improvements over the baselines, confirming the appropriateness of a cycle-consistency constraint for the text-image matching task.
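To make the translation idea concrete, the following is a minimal sketch (not the authors' exact architecture) of two learned mappings, txt2img and img2txt, regularized so that the composition of the two approximately reproduces the input features. All layer sizes, feature dimensions, and the choice of an L1 penalty are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Hypothetical feature sizes; the paper's encoders and dimensions may differ.
TXT_DIM, IMG_DIM, HID = 300, 2048, 1024

# Two translation networks: text features -> image feature space, and back.
txt2img = nn.Sequential(nn.Linear(TXT_DIM, HID), nn.ReLU(), nn.Linear(HID, IMG_DIM))
img2txt = nn.Sequential(nn.Linear(IMG_DIM, HID), nn.ReLU(), nn.Linear(HID, TXT_DIM))

def cycle_consistency_loss(txt_feats: torch.Tensor, img_feats: torch.Tensor) -> torch.Tensor:
    """Penalize round-trip translation error in both directions."""
    txt_cycle = img2txt(txt2img(txt_feats))  # text -> image space -> back to text space
    img_cycle = txt2img(img2txt(img_feats))  # image -> text space -> back to image space
    return (txt_cycle - txt_feats).abs().mean() + (img_cycle - img_feats).abs().mean()
```

At retrieval time, a textual query would be translated with txt2img and compared against image features (e.g., by cosine similarity), and symmetrically for image queries.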


Notes

  1. Each element of the diagonal would contain α − ξ(x, txt2img(s)) + ξ(x, txt2img(s)) = α, thus potentially invalidating the result of the maximum.
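The footnote refers to the row-wise maximum in a hard-negative triplet loss computed over a similarity matrix. Below is a minimal sketch of the masking it describes, assuming a standard VSE++-style max-of-hinges formulation; the function name and margin value are illustrative:

```python
import torch

def hardest_negative_loss(sim: torch.Tensor, alpha: float = 0.2) -> torch.Tensor:
    """Max-of-hinges triplet loss over a similarity matrix sim (queries x candidates),
    where sim[i, i] scores the matching pair.

    Each hinge term is max(0, alpha - sim[i, i] + sim[i, j]). For j == i the term
    equals exactly alpha, so the diagonal must be masked out before taking the
    row-wise maximum, as the footnote explains."""
    pos = sim.diag().unsqueeze(1)                 # positive-pair score per row, shape (N, 1)
    cost = (alpha - pos + sim).clamp(min=0)       # hinge cost for every candidate pair
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    cost = cost.masked_fill(eye, 0)               # remove the spurious alpha on the diagonal
    return cost.max(dim=1).values.mean()          # keep only the hardest negative per query
```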


Acknowledgements

We gratefully acknowledge Facebook AI Research, Panasonic Corporation, and NVIDIA Corporation for the donation of the GPUs used in this work.

Author information

Correspondence to Marcella Cornia.

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Cornia, M., Baraldi, L., Tavakoli, H.R. et al. A unified cycle-consistent neural model for text and image retrieval. Multimed Tools Appl 79, 25697–25721 (2020). https://doi.org/10.1007/s11042-020-09251-4

