DOI: 10.1145/3581783.3612193
Research Article

Improving Cross-Modal Recipe Retrieval with Component-Aware Prompted CLIP Embedding

Published: 27 October 2023

Abstract

Cross-modal recipe retrieval is an emerging visual-textual retrieval task that aims to match food images with their corresponding recipes. Although large-scale Vision-Language Pre-training (VLP) models have achieved impressive performance on a wide range of downstream tasks, they still perform unsatisfactorily on this cross-modal retrieval task, for two reasons: (1) features from food images and recipes need to be aligned, and simply fine-tuning the pre-trained VLP model's image encoder does not explicitly serve this goal; (2) the text in a recipe is more structured than the captions in the VLP model's pre-training corpus, which prevents the VLP model from adapting to the recipe retrieval task. In this paper, we propose a Component-aware Instance-specific Prompt learning (CIP) model that fully exploits the capability of large-scale VLP models. CIP learns the structured recipe information and thereby aligns visual-textual representations without fine-tuning. Furthermore, we construct a recipe encoder, termed Adaptive Recipe Merger (ARM), based on hierarchical Transformers, encouraging the model to learn more effective recipe representations. Extensive experiments on the public Recipe1M dataset demonstrate the superiority of the proposed method, which outperforms state-of-the-art methods on the cross-modal recipe retrieval task.
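For readers who want a concrete picture of the mechanism the abstract describes, below is a minimal sketch of instance-specific, component-aware prompt learning over a frozen backbone. It is not the authors' code: `FrozenEncoder` is a toy stand-in for CLIP, and the prompt length, embedding size, conditioning scheme, and simple averaging merger are all assumptions made for illustration (the paper's ARM instead merges components with hierarchical Transformers).

```python
# Hedged sketch: component-aware, instance-specific prompts are prepended to
# recipe tokens, encoded with a frozen backbone, and aligned to image
# embeddings contrastively. NOT the paper's implementation; all names and
# sizes here are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB = 256  # joint embedding size (assumed)

class FrozenEncoder(nn.Module):
    """Stand-in for a frozen CLIP encoder; its weights are not trained."""
    def __init__(self, dim: int = EMB):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        for p in self.parameters():
            p.requires_grad = False

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq, dim) -> pooled embedding (batch, dim)
        return self.proj(tokens).mean(dim=1)

class InstancePromptGenerator(nn.Module):
    """Learns a base prompt per recipe component (title / ingredients /
    instructions) plus an instance-conditioned offset."""
    def __init__(self, n_components: int = 3, n_prompts: int = 4, dim: int = EMB):
        super().__init__()
        self.shape = (n_components, n_prompts, dim)
        self.base = nn.Parameter(torch.randn(*self.shape) * 0.02)
        self.cond = nn.Linear(dim, n_components * n_prompts * dim)

    def forward(self, instance_feat: torch.Tensor) -> torch.Tensor:
        # instance_feat: (batch, dim) -> prompts (batch, comp, n_prompts, dim)
        b = instance_feat.size(0)
        delta = self.cond(instance_feat).view(b, *self.shape)
        return self.base.unsqueeze(0) + delta

def clip_style_loss(img: torch.Tensor, txt: torch.Tensor, t: float = 0.07):
    """Symmetric InfoNCE over cosine similarities (standard for alignment)."""
    img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
    logits = img @ txt.t() / t
    labels = torch.arange(img.size(0))
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2

# Toy forward pass on random features (8 image-recipe pairs).
frozen_img, frozen_txt = FrozenEncoder(), FrozenEncoder()
prompter = InstancePromptGenerator()

img_tokens = torch.randn(8, 50, EMB)                      # e.g. ViT patch features
components = [torch.randn(8, 20, EMB) for _ in range(3)]  # title/ingr./instr. tokens

# Condition prompts on a crude summary of the recipe itself, so the image
# and recipe branches stay independent at retrieval time.
recipe_summary = torch.cat(components, dim=1).mean(dim=1)
prompts = prompter(recipe_summary)                        # (8, 3, 4, EMB)

# Prepend each component's prompts to its tokens, encode, then merge by
# averaging (the paper's ARM does this step with hierarchical Transformers).
comp_embs = [frozen_txt(torch.cat([prompts[:, i], components[i]], dim=1))
             for i in range(3)]
recipe_emb = torch.stack(comp_embs).mean(dim=0)
image_emb = frozen_img(img_tokens)

print(clip_style_loss(image_emb, recipe_emb).item())
```

Because the backbone is frozen, only the prompt generator (and any merger module) receives gradients, which is what allows visual-textual alignment without fine-tuning the VLP encoders.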


Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 October 2023

Author Tags

  1. cross-modal recipe retrieval
  2. prompt learning
  3. vision-language model

Qualifiers

  • Research-article

Funding Sources

  • the National Key Research and Development Program of China
  • the National Natural Science Foundation of China

Conference

MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa, ON, Canada

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Bibliometrics & Citations

Article Metrics

  • Downloads (last 12 months): 251
  • Downloads (last 6 weeks): 9
Reflects downloads up to 05 Mar 2025

Cited By

  • (2025) Cross modal recipe retrieval with fine grained modal interaction. Scientific Reports 15(1). DOI: 10.1038/s41598-025-89461-8. Online publication date: 9-Feb-2025.
  • (2024) Reversed in Time: A Novel Temporal-Emphasized Benchmark for Cross-Modal Video-Text Retrieval. Proceedings of the 32nd ACM International Conference on Multimedia, 5260-5269. DOI: 10.1145/3664647.3680731. Online publication date: 28-Oct-2024.
  • (2024) CREAMY: Cross-Modal Recipe Retrieval By Avoiding Matching Imperfectly. IEEE Access 12, 33283-33295. DOI: 10.1109/ACCESS.2024.3370158. Online publication date: 2024.
  • (2024) Vision and Structured-Language Pretraining for Cross-Modal Food Retrieval. Computer Vision and Image Understanding 247(C). DOI: 10.1016/j.cviu.2024.104071. Online publication date: 1-Oct-2024.
  • (2024) Fine-Grained Modalities Interaction for Cross-Modal Recipe Retrieval. PRICAI 2024: Trends in Artificial Intelligence, 79-90. DOI: 10.1007/978-981-96-0128-8_7. Online publication date: 12-Nov-2024.
  • (2024) Enhancing Recipe Retrieval with Foundation Models: A Data Augmentation Perspective. Computer Vision – ECCV 2024, 111-127. DOI: 10.1007/978-3-031-72983-6_7. Online publication date: 29-Oct-2024.
