Research Article
DOI: 10.1145/3595916.3626389

Cross-modal Image-Recipe Retrieval via Multimodal Fusion

Published: 01 January 2024

Abstract

Cross-modal image-recipe retrieval aims to capture the correlation between food images and recipes. While existing methods have demonstrated good performance on retrieval tasks, they often overlook two crucial aspects: (1) capturing fine-grained recipe information and (2) modeling the correlations between embeddings from different modalities. We introduce the Multimodal Fusion Retrieval Framework (MFRF) to address these issues. The proposed framework uses deep learning-based encoders to process recipe and image data effectively, incorporates a fusion network to learn cross-modal semantic alignment, and ultimately performs image-recipe retrieval. MFRF comprises three integral modules. The recipe preprocessing module applies Transformers at multiple levels to extract essential features such as the title and ingredients from the recipe, and employs an LSTM over BERT representations to capture contextual relationships and dependencies among the sentences of the recipe instructions. The multimodal fusion module incorporates visual-linguistic contrastive losses to align the representations of images and recipes, and leverages cross-modal attention mechanisms to facilitate effective interaction between the two modalities. Finally, the cross-modal retrieval module employs a triplet loss to enable cross-modal retrieval of image-recipe pairs. Experimental evaluations on the widely used Recipe1M benchmark demonstrate the effectiveness of the proposed MFRF, which achieves substantial improvements on both the 1k and 10k test sets: +9.9% (64.8 R@1) and +8.4% (33.7 R@1), respectively.
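
The abstract does not give implementation details, but the two training objectives it names (a visual-linguistic contrastive loss for alignment and a triplet loss for cross-modal retrieval) are standard in this literature. The sketch below shows one common PyTorch formulation of each; the encoder architectures, in-batch negative mining, margin, and temperature are assumptions for illustration, not the authors' settings.

```python
# Minimal sketch (not the authors' code) of the two objectives the abstract describes:
# a symmetric contrastive loss aligning image and recipe embeddings, and a
# bi-directional triplet loss used for cross-modal retrieval.
import torch
import torch.nn.functional as F


def contrastive_alignment_loss(img_emb, rec_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched image/recipe embedding pairs."""
    img_emb = F.normalize(img_emb, dim=-1)
    rec_emb = F.normalize(rec_emb, dim=-1)
    logits = img_emb @ rec_emb.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +         # image -> recipe direction
                  F.cross_entropy(logits.t(), targets))      # recipe -> image direction


def bidirectional_triplet_loss(img_emb, rec_emb, margin=0.3):
    """Triplet loss with in-batch hardest negatives, applied in both retrieval directions."""
    img_emb = F.normalize(img_emb, dim=-1)
    rec_emb = F.normalize(rec_emb, dim=-1)
    sim = img_emb @ rec_emb.t()                               # (B, B) cosine similarities
    pos = sim.diag()                                          # similarity of the true pairs
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    neg_i2r = sim.masked_fill(mask, float('-inf')).max(dim=1).values  # hardest recipe negative per image
    neg_r2i = sim.masked_fill(mask, float('-inf')).max(dim=0).values  # hardest image negative per recipe
    return (F.relu(margin - pos + neg_i2r) + F.relu(margin - pos + neg_r2i)).mean()


if __name__ == "__main__":
    # Toy usage: random tensors stand in for the image and recipe encoder outputs.
    imgs, recs = torch.randn(8, 512), torch.randn(8, 512)
    total = contrastive_alignment_loss(imgs, recs) + bidirectional_triplet_loss(imgs, recs)
    print(total.item())
```

In practice the contrastive term would be applied to the unimodal encoder outputs before fusion, and the triplet term to the final joint-embedding space used at retrieval time; the relative weighting of the two losses is a tunable hyperparameter not specified in the abstract.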

Cited By

  • Disambiguity and Alignment: An Effective Multi-Modal Alignment Method for Cross-Modal Recipe Retrieval. Foods 13(11), 1628, May 2024. DOI: 10.3390/foods13111628
  • CREAMY: Cross-Modal Recipe Retrieval By Avoiding Matching Imperfectly. IEEE Access 12, 33283-33295, 2024. DOI: 10.1109/ACCESS.2024.3370158
  • Cross-modal recipe retrieval based on unified text encoder with fine-grained contrastive learning. Knowledge-Based Systems 305, 112641, December 2024. DOI: 10.1016/j.knosys.2024.112641
  • Vision and Structured-Language Pretraining for Cross-Modal Food Retrieval. Computer Vision and Image Understanding 247, October 2024. DOI: 10.1016/j.cviu.2024.104071

Published In

MMAsia '23: Proceedings of the 5th ACM International Conference on Multimedia in Asia
December 2023, 745 pages
ISBN: 9798400702051
DOI: 10.1145/3595916

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. contrastive learning
  2. image-recipe retrieval
  3. multimodal fusion
  4. neural networks

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • the National Natural Science Foundation of China
  • the National Key R&D Program of China

Conference

MMAsia '23: ACM Multimedia Asia
December 6-8, 2023
Tainan, Taiwan

Acceptance Rates

Overall Acceptance Rate 59 of 204 submissions, 29%
