Abstract
Text-image retrieval has recently become a hot research topic, thanks to the development of deep learning architectures that can retrieve visual items given textual queries and vice versa. The key idea behind many state-of-the-art approaches is to learn a joint multi-modal embedding space into which both text and images can be projected and compared. Here we take a different approach and reformulate text-image retrieval as the problem of learning a translation between the textual and visual domains. Our proposal leverages an end-to-end trainable architecture that translates text into image features and vice versa, and regularizes this mapping with a cycle-consistency criterion. Experimental evaluations for text-to-image and image-to-text retrieval, conducted on small, medium, and large-scale datasets, show consistent improvements over the baselines, confirming the appropriateness of a cycle-consistent constraint for the text-image matching task.
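As a rough illustration of this translation-based formulation (a minimal sketch, not the authors' released implementation), the code below pairs two feature translators with a cycle-consistency reconstruction term. All module names and dimensionalities (txt2img, img2txt, 300-d text features, 2048-d visual features) are assumptions made for the example.

# Minimal sketch of translation between feature spaces with a
# cycle-consistency term. Names and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class FeatureTranslator(nn.Module):
    """Maps features from one modality's space into the other's."""
    def __init__(self, in_dim: int, out_dim: int, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

txt2img = FeatureTranslator(in_dim=300, out_dim=2048)   # text -> visual space
img2txt = FeatureTranslator(in_dim=2048, out_dim=300)   # visual -> text space

def cycle_loss(s: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    """Penalize the round trip: translating to the other domain and
    back should reconstruct the original features."""
    s_cycle = img2txt(txt2img(s))   # text -> image -> text
    x_cycle = txt2img(img2txt(x))   # image -> text -> image
    return (s - s_cycle).abs().mean() + (x - x_cycle).abs().mean()

# Example usage on random features (assumed batch size and dimensions):
s = torch.randn(32, 300)    # sentence embeddings
x = torch.randn(32, 2048)   # image features
loss = cycle_loss(s, x)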
Notes
Each element of the diagonal would equal α − ξ(x, txt2img(s)) + ξ(x, txt2img(s)) = α, i.e., the hinge computed for a matched pair reduces to the margin, thus potentially invalidating the result of the maximum.
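To make the note concrete, here is a hedged sketch of a max-of-hinges ranking loss in which the diagonal of the score matrix is masked for exactly this reason; the names scores, alpha, and hard_negative_loss are illustrative assumptions, not the paper's notation.

# Why the diagonal must be masked: for matched pairs the hinge term
# collapses to the margin alpha, which would otherwise dominate the
# maximum. One retrieval direction is shown for brevity.
import torch

def hard_negative_loss(scores: torch.Tensor, alpha: float = 0.2) -> torch.Tensor:
    """scores[i, j] = xi(x_i, txt2img(s_j)); the diagonal holds matched pairs."""
    pos = scores.diag().view(-1, 1)                 # xi(x_i, txt2img(s_i))
    cost = (alpha - pos + scores).clamp(min=0)      # hinge over all pairs
    # On the diagonal, cost = alpha - pos + pos = alpha exactly (see the
    # note above), so zero it out before taking the hardest negative.
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    cost = cost.masked_fill(mask, 0)
    return cost.max(dim=1)[0].mean()                # hardest negative per anchor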
Acknowledgements
We gratefully acknowledge Facebook AI Research, Panasonic Corporation, and NVIDIA Corporation for the donation of the GPUs used in this work.
Cite this article
Cornia, M., Baraldi, L., Tavakoli, H.R. et al. A unified cycle-consistent neural model for text and image retrieval. Multimed Tools Appl 79, 25697–25721 (2020). https://doi.org/10.1007/s11042-020-09251-4