Application of Multimodal Machine Learning for Image Recommendation Systems

  • Conference paper
Recent Trends in Analysis of Images, Social Networks and Texts (AIST 2023)

Abstract

In the era of information overload, making decisions about various aspects of our lives has become increasingly challenging. The rise of the internet and online platforms has provided access to vast amounts of information, including recommendations based on user preferences. This paper focuses on the development of a unique multimodal recommender system for images, leveraging machine learning and deep learning techniques.

The results demonstrate the effectiveness of the combined recommender system, which reaches an average accuracy of 0.814. The system outperforms algorithms based solely on CLIP embeddings or BERT vectors, showing the advantage of incorporating multiple modalities. The model's strengths lie in its robust use of CLIP vectors and its efficient processing of image features. Text features and embeddings leave room for improvement, which suggests a need for more detailed textual information and additional data sources.

The system's novelty lies in using both image data and text descriptions to provide personalized recommendations. Images are vectorized and combined with relevant metrics, while text descriptions are obtained either through object recognition or from the text supplied with the dataset. The model also incorporates additional data, such as location, creation date, and event information, to enhance the recommendation process.
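As a rough illustration of how image vectors, text vectors, and metadata might be combined into one feature row per item, the sketch below normalizes each modality and concatenates the results (early fusion). It is not the authors' implementation: the abstract does not state the fusion method, so the function names `l2_normalize` and `fuse_features`, the vector sizes (512 for image vectors, 768 for text vectors), the three metadata columns, and the random stand-in data are all assumptions made for this example.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit L2 norm; all-zero rows are left unchanged."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.where(norms == 0.0, 1.0, norms)


def fuse_features(image_emb, text_emb, metadata) -> np.ndarray:
    """Early fusion (assumed here): normalize each modality block,
    then concatenate the blocks column-wise, one row per item."""
    blocks = [
        l2_normalize(np.asarray(block, dtype=np.float64))
        for block in (image_emb, text_emb, metadata)
    ]
    return np.concatenate(blocks, axis=1)


# Random stand-ins; a real pipeline would compute these with actual models.
rng = np.random.default_rng(0)
n_items = 4
image_emb = rng.normal(size=(n_items, 512))  # hypothetical image vectors
text_emb = rng.normal(size=(n_items, 768))   # hypothetical text vectors
metadata = rng.normal(size=(n_items, 3))     # hypothetical encoded location, date, event

fused = fuse_features(image_emb, text_emb, metadata)
print(fused.shape)  # (4, 1283)
```

Normalizing each block separately keeps a high-dimensional modality from dominating the fused vector purely through scale. The fused rows could then be passed to any downstream classifier or ranking model.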

Overall, this study highlights the potential of multimodal recommender systems for enhancing recommendation quality and providing users with a personalized experience. Ongoing efforts to refine the model with diverse data and parameter optimization can further improve its performance.

The work on Section 2 was supported by RSF under grant 22-11-00323 and performed at HSE University, Moscow, Russia.

Author information

Corresponding author

Correspondence to Ilya Makarov.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Foniakov, M., Bardukov, A., Makarov, I. (2024). Application of Multimodal Machine Learning for Image Recommendation Systems. In: Ignatov, D.I., et al. Recent Trends in Analysis of Images, Social Networks and Texts. AIST 2023. Communications in Computer and Information Science, vol 1905. Springer, Cham. https://doi.org/10.1007/978-3-031-67008-4_18

  • DOI: https://doi.org/10.1007/978-3-031-67008-4_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-67007-7

  • Online ISBN: 978-3-031-67008-4

  • eBook Packages: Computer Science, Computer Science (R0)