Abstract
In the era of information overload, making decisions about many aspects of our lives has become increasingly challenging. The internet and online platforms provide access to vast amounts of information, including recommendations tailored to user preferences. This paper presents a multimodal recommender system for images that leverages machine learning and deep learning techniques.
The results demonstrate the effectiveness of the proposed multimodal recommender system, which achieves an average accuracy of 0.814 and outperforms algorithms based solely on CLIP embeddings or BERT vectors, showing the benefit of combining modalities. The model's strengths lie in its robust handling of CLIP image embeddings and its efficient processing of image features, whereas the text features and embeddings leave room for improvement, suggesting the need for more detailed textual information and additional data sources.
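To make the reported comparison concrete, the sketch below trains the same classifier on each modality separately and on the fused features, then reports held-out accuracy. It is a minimal illustration with synthetic arrays standing in for real CLIP and BERT embeddings; the classifier choice, split, and dimensions are assumptions, not the authors' exact setup.

```python
# Minimal comparison sketch: fused (image + text) features vs. single modalities.
# Random arrays stand in for real CLIP image embeddings and BERT text embeddings.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_items = 1000
clip_emb = rng.normal(size=(n_items, 512))   # stand-in for CLIP image embeddings
bert_emb = rng.normal(size=(n_items, 768))   # stand-in for BERT text embeddings
labels = rng.integers(0, 2, size=n_items)    # 1 = user liked the image

def evaluate(features: np.ndarray) -> float:
    """Train a classifier on one feature set and return held-out accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.2, random_state=0)
    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))

print("CLIP only:", evaluate(clip_emb))
print("BERT only:", evaluate(bert_emb))
print("Fused    :", evaluate(np.hstack([clip_emb, bert_emb])))
```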
The novelty of the system lies in using both image data and text descriptions to provide personalized recommendations. Images are vectorized and combined with additional relevant features, while text descriptions are obtained either through object recognition or from the dataset's captions. The model also incorporates metadata such as location, creation date, and event information to enrich the recommendation process.
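As a rough illustration of this fusion step, the sketch below builds a single feature vector for one item from a CLIP image embedding, a BERT-style embedding of its text description, and a few metadata values. The model names (openai/clip-vit-base-patch32, distilbert-base-uncased) and the metadata fields are assumptions chosen for illustration; the paper's actual pipeline may differ.

```python
# Hedged sketch: one fused item vector = CLIP image embedding
# + mean-pooled DistilBERT embedding of the description + metadata features.
import numpy as np
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer, CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
bert = AutoModel.from_pretrained("distilbert-base-uncased")
tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def item_vector(image_path: str, description: str,
                latitude: float, longitude: float, year: int) -> np.ndarray:
    image = Image.open(image_path).convert("RGB")
    with torch.no_grad():
        # 512-dimensional CLIP image embedding
        img_inputs = clip_proc(images=image, return_tensors="pt")
        img_vec = clip.get_image_features(**img_inputs)[0].numpy()
        # 768-dimensional text embedding (mean of DistilBERT token states)
        txt_inputs = tok(description, return_tensors="pt", truncation=True)
        txt_vec = bert(**txt_inputs).last_hidden_state.mean(dim=1)[0].numpy()
    # Illustrative metadata features: location and creation year
    meta = np.array([latitude, longitude, year], dtype=np.float32)
    return np.concatenate([img_vec, txt_vec, meta])
```

A vector built this way can then be fed, together with user features, to a standard classifier or ranking model to score candidate images for recommendation.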
Overall, this study highlights the potential of multimodal recommender systems to improve recommendation quality and deliver a personalized user experience. Refining the model with more diverse data and further parameter optimization can improve its performance.
The work in Section 2 was supported by the RSF under grant 22-11-00323 and performed at HSE University, Moscow, Russia.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Foniakov, M., Bardukov, A., Makarov, I. (2024). Application of Multimodal Machine Learning for Image Recommendation Systems. In: Ignatov, D.I., et al. (eds.) Recent Trends in Analysis of Images, Social Networks and Texts. AIST 2023. Communications in Computer and Information Science, vol. 1905. Springer, Cham. https://doi.org/10.1007/978-3-031-67008-4_18
DOI: https://doi.org/10.1007/978-3-031-67008-4_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-67007-7
Online ISBN: 978-3-031-67008-4