DOI: 10.1145/3581783.3612193
Research Article

Improving Cross-Modal Recipe Retrieval with Component-Aware Prompted CLIP Embedding

Published: 27 October 2023

Abstract

Cross-modal recipe retrieval is an emerging visual-textual retrieval task that aims to match food images with their corresponding recipes. Although large-scale Vision-Language Pre-training (VLP) models have achieved impressive performance on a wide range of downstream tasks, they still perform unsatisfactorily on this cross-modal retrieval task, for two reasons: (1) features from food images and recipes need to be aligned, and simply fine-tuning the pre-trained VLP model's image encoder does not explicitly serve this goal; (2) the text in a recipe is more structured than the captions in the VLP model's pre-training corpus, which prevents the VLP model from adapting to the recipe retrieval task. In this paper, we propose a Component-aware Instance-specific Prompt learning (CIP) model that fully exploits the capability of large-scale VLP models. CIP learns the structured recipe information and thereby aligns visual-textual representations without fine-tuning. Furthermore, we construct a recipe encoder, termed Adaptive Recipe Merger (ARM), based on hierarchical Transformers, encouraging the model to learn more effective recipe representations. Extensive experiments on the public Recipe1M dataset demonstrate the superiority of the proposed method, which outperforms state-of-the-art methods on the cross-modal recipe retrieval task.
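For readers who want a concrete picture of the mechanism the abstract describes, below is a minimal sketch of instance-specific, component-aware prompt learning over a frozen backbone. It is not the authors' code: `FrozenEncoder` is a toy stand-in for CLIP, and the prompt length, embedding size, conditioning scheme, and simple averaging merger are all assumptions made for illustration (the paper's ARM instead merges components with hierarchical Transformers).

```python
# Hedged sketch: component-aware, instance-specific prompts are prepended to
# recipe tokens, encoded with a frozen backbone, and aligned to image
# embeddings contrastively. NOT the paper's implementation; all names and
# sizes here are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB = 256  # joint embedding size (assumed)

class FrozenEncoder(nn.Module):
    """Stand-in for a frozen CLIP encoder; its weights are not trained."""
    def __init__(self, dim: int = EMB):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        for p in self.parameters():
            p.requires_grad = False

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq, dim) -> pooled embedding (batch, dim)
        return self.proj(tokens).mean(dim=1)

class InstancePromptGenerator(nn.Module):
    """Learns a base prompt per recipe component (title / ingredients /
    instructions) plus an instance-conditioned offset."""
    def __init__(self, n_components: int = 3, n_prompts: int = 4, dim: int = EMB):
        super().__init__()
        self.shape = (n_components, n_prompts, dim)
        self.base = nn.Parameter(torch.randn(*self.shape) * 0.02)
        self.cond = nn.Linear(dim, n_components * n_prompts * dim)

    def forward(self, instance_feat: torch.Tensor) -> torch.Tensor:
        # instance_feat: (batch, dim) -> prompts (batch, comp, n_prompts, dim)
        b = instance_feat.size(0)
        delta = self.cond(instance_feat).view(b, *self.shape)
        return self.base.unsqueeze(0) + delta

def clip_style_loss(img: torch.Tensor, txt: torch.Tensor, t: float = 0.07):
    """Symmetric InfoNCE over cosine similarities (standard for alignment)."""
    img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
    logits = img @ txt.t() / t
    labels = torch.arange(img.size(0))
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2

# Toy forward pass on random features (8 image-recipe pairs).
frozen_img, frozen_txt = FrozenEncoder(), FrozenEncoder()
prompter = InstancePromptGenerator()

img_tokens = torch.randn(8, 50, EMB)                      # e.g. ViT patch features
components = [torch.randn(8, 20, EMB) for _ in range(3)]  # title/ingr./instr. tokens

# Condition prompts on a crude summary of the recipe itself, so the image
# and recipe branches stay independent at retrieval time.
recipe_summary = torch.cat(components, dim=1).mean(dim=1)
prompts = prompter(recipe_summary)                        # (8, 3, 4, EMB)

# Prepend each component's prompts to its tokens, encode, then merge by
# averaging (the paper's ARM does this step with hierarchical Transformers).
comp_embs = [frozen_txt(torch.cat([prompts[:, i], components[i]], dim=1))
             for i in range(3)]
recipe_emb = torch.stack(comp_embs).mean(dim=0)
image_emb = frozen_img(img_tokens)

print(clip_style_loss(image_emb, recipe_emb).item())
```

Because the backbone is frozen, only the prompt generator (and any merger module) receives gradients, which is what allows visual-textual alignment without fine-tuning the VLP encoders.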


Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 October 2023

Author Tags

  1. cross-modal recipe retrieval
  2. prompt learning
  3. vision-language model

Qualifiers

  • Research-article

Funding Sources

  • the National Key Research and Development Program of China
  • the National Natural Science Foundation of China

Conference

MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa, ON, Canada

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Bibliometrics & Citations

Article Metrics

  • Downloads (last 12 months): 251
  • Downloads (last 6 weeks): 9
Reflects downloads up to 05 Mar 2025

Cited By

  • (2025) Cross modal recipe retrieval with fine grained modal interaction. Scientific Reports 15(1). DOI: 10.1038/s41598-025-89461-8. Online publication date: 9-Feb-2025.
  • (2024) Reversed in Time: A Novel Temporal-Emphasized Benchmark for Cross-Modal Video-Text Retrieval. Proceedings of the 32nd ACM International Conference on Multimedia, 5260-5269. DOI: 10.1145/3664647.3680731. Online publication date: 28-Oct-2024.
  • (2024) CREAMY: Cross-Modal Recipe Retrieval By Avoiding Matching Imperfectly. IEEE Access 12, 33283-33295. DOI: 10.1109/ACCESS.2024.3370158. Online publication date: 2024.
  • (2024) Vision and Structured-Language Pretraining for Cross-Modal Food Retrieval. Computer Vision and Image Understanding 247(C). DOI: 10.1016/j.cviu.2024.104071. Online publication date: 1-Oct-2024.
  • (2024) Fine-Grained Modalities Interaction for Cross-Modal Recipe Retrieval. PRICAI 2024: Trends in Artificial Intelligence, 79-90. DOI: 10.1007/978-981-96-0128-8_7. Online publication date: 12-Nov-2024.
  • (2024) Enhancing Recipe Retrieval with Foundation Models: A Data Augmentation Perspective. Computer Vision – ECCV 2024, 111-127. DOI: 10.1007/978-3-031-72983-6_7. Online publication date: 29-Oct-2024.
