Abstract
In this paper, we present a novel multimodal framework for video recommendation based on deep learning. Unlike most common solutions, we formulate video recommendation by simultaneously exploiting two data modalities: (i) the visual modality (i.e., the image sequence) and (ii) the textual modality, which, together with the audio stream, constitute the elementary data of a video document. More specifically, our framework first describes textual data using the bag-of-words and TF-IDF models and fuses those features with deep convolutional descriptors extracted from the visual data. As a result, we obtain a multimodal descriptor for each video document, from which we construct a low-dimensional sparse representation using autoencoders. To carry out the recommendation task, we extend a sparse linear method with side information (SSLIM) by taking into account the previously computed sparse representations of the video descriptors. In this way, we are able to produce a ranking of the top-N most relevant videos for the user. Note that our framework is flexible, i.e., one may use other types of modalities, autoencoders, and fusion architectures. Experimental results obtained on three real datasets (MovieLens-1M, MovieLens-10M and Vine), containing 3,320, 8,400 and 18,576 videos, respectively, show that our framework can improve recommendation results by up to 60.6% when compared to a single-modality recommendation model and by up to 31% when compared to the state-of-the-art methods used as baselines in our study, demonstrating the effectiveness of our framework and highlighting the usefulness of multimodal information in recommender systems.
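As a rough illustration of the textual description and fusion steps, the sketch below computes TF-IDF weights for a toy corpus and concatenates them with placeholder visual descriptors. The concatenation-based (early) fusion, the per-modality L2 normalization, the smoothed IDF formula, and all function names and numbers are assumptions made for illustration; the abstract does not specify them. The autoencoder and SSLIM stages are not shown.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return (vocabulary, one TF-IDF vector per tokenized document)."""
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an assumed variant) so that every weight is non-negative.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against empty documents
        vectors.append([counts[t] / total * idf[t] for t in vocab])
    return vocab, vectors


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(text_vec, visual_vec):
    """Early fusion (assumed): normalize each modality, then concatenate."""
    return l2_normalize(text_vec) + l2_normalize(visual_vec)


# Toy data: tokenized video descriptions and made-up visual descriptors.
docs = [["space", "battle", "robot"], ["romantic", "comedy"], ["robot", "comedy"]]
visual = [[0.9, 0.1, 0.0, 0.3], [0.0, 0.8, 0.5, 0.1], [0.4, 0.4, 0.2, 0.6]]

vocab, text_vectors = tf_idf(docs)
multimodal = [fuse(t, v) for t, v in zip(text_vectors, visual)]
print(len(vocab), len(multimodal[0]))
```

Normalizing each modality before concatenation keeps either one from dominating the fused descriptor purely because of its scale; other fusion architectures could be substituted at this step.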
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs10489-019-01430-7/MediaObjects/10489_2019_1430_Fig1_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs10489-019-01430-7/MediaObjects/10489_2019_1430_Fig2_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs10489-019-01430-7/MediaObjects/10489_2019_1430_Fig3_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs10489-019-01430-7/MediaObjects/10489_2019_1430_Fig4_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs10489-019-01430-7/MediaObjects/10489_2019_1430_Fig5_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs10489-019-01430-7/MediaObjects/10489_2019_1430_Fig6_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs10489-019-01430-7/MediaObjects/10489_2019_1430_Fig7_HTML.png)
Notes
Here, we used cosine similarity to assess how similar the items are. Other similarity functions may be tested in the future.
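For reference, cosine similarity between two item descriptors can be sketched as follows. The function name, the zero-vector convention, and the example vectors are hypothetical and serve only to illustrate the measure, not the authors' implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors.

    Returns 0.0 if either vector is all zeros (a convention assumed here).
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical item descriptors.
item_a = [1.0, 0.0, 2.0]
item_b = [2.0, 0.0, 4.0]  # same direction as item_a
item_c = [0.0, 3.0, 0.0]  # orthogonal to item_a

print(cosine_similarity(item_a, item_b))
print(cosine_similarity(item_a, item_c))
```

Because the measure depends only on direction, two descriptors that differ by a positive scale factor are treated as maximally similar.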
Acknowledgements
The authors gratefully acknowledge the support of CNPq under Procs. 307510/2017-4, 313163/2014-6, 431458/2016-2 and 309291/2017-8; FAPEMIG under Procs. PPM-00542-15 and APQ-03445-16; FAPEMIG-PRONEX-MASWeb (Models, Algorithms and Systems for the Web) under Proc. APQ-01400-14; CEFET-MG; and CAPES.
Cite this article
Conceição, F.L.A., Pádua, F.L.C., Lacerda, A. et al. Multimodal data fusion framework based on autoencoders for top-N recommender systems. Appl Intell 49, 3267–3282 (2019). https://doi.org/10.1007/s10489-019-01430-7
DOI: https://doi.org/10.1007/s10489-019-01430-7