
Efficient low-rank multi-component fusion with component-specific factors in image-recipe retrieval

Published in Multimedia Tools and Applications.

Abstract

Image-recipe retrieval is the task of retrieving closely related recipes from a collection given a food image, and vice versa. The modality gap between images and recipes makes it a challenging task. Recent studies usually focus on learning consistent image and recipe representations to bridge this semantic gap. Although existing methods have significantly improved image-recipe retrieval, several challenges remain: 1) Previous studies usually concatenate the textual embeddings of different recipe components directly to generate recipe representations, so only simple interactions between components, rather than complex ones, are modeled. 2) They commonly focus on textual feature extraction from recipes, while the methods used to extract image features are relatively simple; most studies rely on the ResNet-50 model. 3) Apart from the retrieval learning loss (for example, a triplet loss), several auxiliary loss functions (such as adversarial and reconstruction losses) are commonly required to match the recipe and image representations. To address these issues, we introduce a novel Low-rank Multi-component Fusion method with Component-Specific Factors (LMF-CSF) that models the interactions among the different textual components of a recipe to produce superior textual representations. Furthermore, we pay closer attention to image feature extraction, adopting a vision transformer to learn better image representations. The enhanced representations from the two modalities are then fed directly into a triplet loss function for image-recipe retrieval learning. Experimental results on the Recipe1M dataset indicate that our LMF-CSF method outperforms current state-of-the-art image-recipe retrieval baselines.
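The abstract names two key ingredients of the training setup: low-rank fusion of the component embeddings (in the spirit of low-rank multimodal fusion with per-component factors) and a triplet retrieval loss. The sketch below illustrates both ideas in plain Python on toy dimensions; the function names, the rank, the weight values, and the margin are illustrative assumptions, not values from the paper.

```python
def matvec(W, x):
    # Multiply matrix W (a list of rows) by vector x.
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def low_rank_fusion(components, factors, rank, d_out):
    """Fuse component embeddings via a rank-decomposed tensor product.

    components: one embedding vector per recipe component (e.g. title,
    ingredients, instructions). factors: one (rank*d_out) x d_m weight
    matrix per component, playing the role of the component-specific
    factors. Values here are toy assumptions for illustration.
    """
    fused = [1.0] * (rank * d_out)
    for z, W in zip(components, factors):
        proj = matvec(W, z)                            # project to rank*d_out
        fused = [f * p for f, p in zip(fused, proj)]   # elementwise product
    # Collapse the rank dimension by summation -> fused vector in R^{d_out}.
    return [sum(fused[r * d_out + k] for r in range(rank)) for k in range(d_out)]

def triplet_loss(anchor, positive, negative, margin=0.3):
    # Standard triplet margin loss on Euclidean distances.
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return max(0.0, dist(anchor, positive) - dist(anchor, negative) + margin)

# Toy example: two recipe components fused with rank-1 factors into R^2.
recipe_repr = low_rank_fusion(
    [[1.0, 2.0], [3.0]],
    [[[1.0, 0.0], [0.0, 1.0]], [[1.0], [2.0]]],
    rank=1, d_out=2)
```

Because the fused vector is formed from projections of every component multiplied elementwise, cross-component interactions are captured without materializing the full outer-product tensor, which is the efficiency argument behind low-rank fusion.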



Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (No. 61876062 & 61873316), the Guangdong Basic and Applied Basic Research Foundation of China (No. 2023A1515012718), the Scientific Research Fund of Hunan Provincial Education Department (No. 21A0319), the Hunan Provincial Natural Science Foundation of China (No. 2022JJ30020 & 2021JJ30274), and Hunan Provincial Innovation Foundation for Postgraduate (No. CX20210986).

Author information

Corresponding author

Correspondence to Dong Zhou.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhao, W., Zhou, D., Cao, B. et al. Efficient low-rank multi-component fusion with component-specific factors in image-recipe retrieval. Multimed Tools Appl 83, 3601–3619 (2024). https://doi.org/10.1007/s11042-023-15819-7

