ABSTRACT
Recommendation algorithms that incorporate techniques from deep learning are becoming increasingly popular. Due to the structure of the data coming from recommendation domains (i.e., one-hot-encoded vectors of item preferences), these algorithms tend to have large input and output dimensionalities that dominate their overall size. This makes them difficult to train, due to the limited memory of graphics processing units, and difficult to deploy on mobile devices with limited hardware. To address these difficulties, we propose Bloom embeddings, a compression technique that can be applied to the input and output of neural network models dealing with sparse, high-dimensional, binary-coded instances. Bloom embeddings are computationally efficient and do not seriously compromise model accuracy at compression ratios of up to 1/5. In some cases, they even improve on the original accuracy, with relative increases of up to 12%. We evaluate Bloom embeddings on 7 data sets and compare them against 4 alternative methods, obtaining favorable results. We also discuss a number of further advantages of Bloom embeddings, such as 'on-the-fly' constant-time operation, zero or marginal space requirements, training-time speedups, and the fact that they do not require any change to the core model architecture or training configuration.
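To make the idea concrete, below is a minimal sketch in Python of how a Bloom-filter-style encoding compresses a sparse binary vector, as the abstract describes it. The function name bloom_embed, the hash family built from Python's built-in hash, and the parameter choices (m, k) are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

# A minimal sketch of Bloom-style input compression (an assumption based
# on the abstract, not the paper's exact construction). A sparse binary
# vector of dimensionality d is compressed to m << d bits by hashing each
# active index with k hash functions, as in a Bloom filter.

def bloom_embed(active_indices, m, k, seed=0):
    """Map the set of active (one-hot) indices to an m-dimensional
    binary vector using k seeded hash functions."""
    out = np.zeros(m, dtype=np.float32)
    for idx in active_indices:
        for s in range(seed, seed + k):
            # One cheap hash family: hash the (index, seed) pair and
            # take it modulo the compressed dimensionality m.
            h = hash((idx, s)) % m
            out[h] = 1.0
    return out

# Example: compress item preferences over a 100,000-item catalog
# (d = 100000) into a 20,000-dimensional input (a 1/5 compression
# ratio) with k = 4 hash functions.
x = bloom_embed(active_indices=[7, 4321, 98765], m=20000, k=4)
print(int(x.sum()))  # at most 3 * 4 = 12 active bits; fewer on collisions
```

On the output side, a score for a candidate item would be recovered from the model's activations at that item's k hashed positions (e.g., by combining them); the sketch above covers only the encoding step.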