Abstract
In the era of information overload, making decisions about many aspects of our lives has become increasingly challenging. The internet and online platforms provide access to vast amounts of information, including recommendations tailored to user preferences. This paper presents a multimodal recommender system for images that leverages machine learning and deep learning techniques.
The results demonstrate the effectiveness of the proposed multimodal recommender system, which achieves an average accuracy of 0.814 and outperforms algorithms based solely on CLIP embeddings or BERT vectors, showing the benefit of combining modalities. The model's strengths lie in its robust handling of CLIP image embeddings and its efficient processing of image features, whereas the text features and embeddings leave room for improvement, suggesting the need for more detailed textual information and additional data sources.
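To make the reported comparison concrete, the sketch below trains the same classifier on each modality separately and on the fused features, then reports held-out accuracy. It is a minimal illustration with synthetic arrays standing in for real CLIP and BERT embeddings; the classifier choice, split, and dimensions are assumptions, not the authors' exact setup.

```python
# Minimal comparison sketch: fused (image + text) features vs. single modalities.
# Random arrays stand in for real CLIP image embeddings and BERT text embeddings.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_items = 1000
clip_emb = rng.normal(size=(n_items, 512))   # stand-in for CLIP image embeddings
bert_emb = rng.normal(size=(n_items, 768))   # stand-in for BERT text embeddings
labels = rng.integers(0, 2, size=n_items)    # 1 = user liked the image

def evaluate(features: np.ndarray) -> float:
    """Train a classifier on one feature set and return held-out accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.2, random_state=0)
    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))

print("CLIP only:", evaluate(clip_emb))
print("BERT only:", evaluate(bert_emb))
print("Fused    :", evaluate(np.hstack([clip_emb, bert_emb])))
```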
The novelty of the system lies in using both image data and text descriptions to provide personalized recommendations. Images are vectorized and combined with additional relevant features, while text descriptions are obtained either through object recognition or from the dataset's captions. The model also incorporates metadata such as location, creation date, and event information to enrich the recommendation process.
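As a rough illustration of this fusion step, the sketch below builds a single feature vector for one item from a CLIP image embedding, a BERT-style embedding of its text description, and a few metadata values. The model names (openai/clip-vit-base-patch32, distilbert-base-uncased) and the metadata fields are assumptions chosen for illustration; the paper's actual pipeline may differ.

```python
# Hedged sketch: one fused item vector = CLIP image embedding
# + mean-pooled DistilBERT embedding of the description + metadata features.
import numpy as np
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer, CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
bert = AutoModel.from_pretrained("distilbert-base-uncased")
tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def item_vector(image_path: str, description: str,
                latitude: float, longitude: float, year: int) -> np.ndarray:
    image = Image.open(image_path).convert("RGB")
    with torch.no_grad():
        # 512-dimensional CLIP image embedding
        img_inputs = clip_proc(images=image, return_tensors="pt")
        img_vec = clip.get_image_features(**img_inputs)[0].numpy()
        # 768-dimensional text embedding (mean of DistilBERT token states)
        txt_inputs = tok(description, return_tensors="pt", truncation=True)
        txt_vec = bert(**txt_inputs).last_hidden_state.mean(dim=1)[0].numpy()
    # Illustrative metadata features: location and creation year
    meta = np.array([latitude, longitude, year], dtype=np.float32)
    return np.concatenate([img_vec, txt_vec, meta])
```

A vector built this way can then be fed, together with user features, to a standard classifier or ranking model to score candidate images for recommendation.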
Overall, this study highlights the potential of multimodal recommender systems to improve recommendation quality and deliver a personalized user experience. Refining the model with more diverse data and further parameter optimization can improve its performance.
The work in Section 2 was supported by the RSF under grant 22-11-00323 and performed at HSE University, Moscow, Russia.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Foniakov, M., Bardukov, A., Makarov, I. (2024). Application of Multimodal Machine Learning for Image Recommendation Systems. In: Ignatov, D.I., et al. (eds.) Recent Trends in Analysis of Images, Social Networks and Texts. AIST 2023. Communications in Computer and Information Science, vol. 1905. Springer, Cham. https://doi.org/10.1007/978-3-031-67008-4_18
DOI: https://doi.org/10.1007/978-3-031-67008-4_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-67007-7
Online ISBN: 978-3-031-67008-4