Research article
DOI: 10.1145/3109859.3109876

Getting Deep Recommenders Fit: Bloom Embeddings for Sparse Binary Input/Output Networks

Published: 27 August 2017

ABSTRACT

Recommendation algorithms that incorporate techniques from deep learning are becoming increasingly popular. Due to the structure of the data coming from recommendation domains (i.e., one-hot-encoded vectors of item preferences), these algorithms tend to have large input and output dimensionalities that dominate their overall size. This makes them difficult to train, due to the limited memory of graphics processing units, and difficult to deploy on mobile devices with limited hardware. To address these difficulties, we propose Bloom embeddings, a compression technique that can be applied to the input and output of neural network models dealing with sparse, high-dimensional, binary-coded instances. Bloom embeddings are computationally efficient and do not seriously compromise model accuracy at compression ratios of up to 1/5. In some cases, they even improve over the original accuracy, with relative increases of up to 12%. We evaluate Bloom embeddings on 7 data sets and compare them against 4 alternative methods, obtaining favorable results. We also discuss a number of further advantages of Bloom embeddings, such as 'on-the-fly' constant-time operation, zero or marginal space requirements, training-time speedups, and the fact that they do not require any change to the core model architecture or training configuration.
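For intuition, here is a minimal sketch (in Python) of how a Bloom-filter-style encoding of a sparse binary input could look. The choice of hash function (salted MD5), the number of hash functions k, the compressed dimension m, and the score-recovery rule for the output side are illustrative assumptions, not the exact configuration used in the paper.

    import hashlib
    import numpy as np

    def bloom_positions(item_id, m, k):
        """Map one item id to k positions in a compressed vector of size m."""
        return [
            int(hashlib.md5(f"{j}:{item_id}".encode()).hexdigest(), 16) % m
            for j in range(k)
        ]

    def bloom_encode(active_items, m, k):
        """Compress a set of active item indices (a sparse one-hot-style vector)
        into an m-dimensional binary vector by setting the k hashed positions
        of every active item to 1."""
        v = np.zeros(m, dtype=np.float32)
        for item in active_items:
            v[bloom_positions(item, m, k)] = 1.0
        return v

    def bloom_decode_scores(output_probs, candidate_items, k):
        """Rank candidate items by combining the network's output probabilities
        at their hashed positions (sum of logs here; the exact recovery rule is
        an assumption for illustration)."""
        m = output_probs.shape[0]
        return {
            item: float(np.sum(np.log(output_probs[bloom_positions(item, m, k)] + 1e-12)))
            for item in candidate_items
        }

    # Example: compress a 1,000,000-item one-hot space down to 200,000 dimensions
    # (a 1/5 compression ratio) with k = 4 hash functions per item.
    x = bloom_encode({12, 40_500, 987_654}, m=200_000, k=4)

Under this scheme the same mapping is applied to the target vectors during training, so only the input and output dimensionalities shrink, consistent with the abstract's claim that no change to the core model architecture or training configuration is needed.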


Published in: RecSys '17: Proceedings of the Eleventh ACM Conference on Recommender Systems, August 2017, 466 pages.
ISBN: 9781450346528
DOI: 10.1145/3109859
Copyright © 2017 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance rates: RecSys '17 paper acceptance rate: 26 of 125 submissions (21%). Overall acceptance rate: 254 of 1,295 submissions (20%).

