ABSTRACT
In this paper we present the novel aspects of our 15th place solution to the RecSys Challenge 2019, focusing on accelerating feature generation and model training. In our final solution we sped up model training by a factor of 15.6x, from 891.8s (14m52s) to 57.2s, through a combination of the RAPIDS.AI cuDF library for preprocessing, a custom batch dataloader, LAMB with extreme batch sizes, and an update to the PyTorch kernel responsible for calculating the embedding gradient. Using cuDF we also accelerated feature generation by a factor of 9.7x by performing the computations on the GPU, reducing the time taken to generate the features used in our model from 51 minutes to 5. We demonstrate these optimizations on the fastai tabular model, which we relied on extensively in our final ensemble. With training time so drastically reduced, the iteration involved in generating new features and training new models becomes much more fluid, allowing deep learning based recommender systems to be prototyped in hours rather than days.
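Because cuDF mirrors most of the pandas DataFrame API, the GPU feature-generation pattern the abstract describes can be sketched with pandas and moved to the GPU largely by swapping the import. The sketch below is illustrative only: the column names and data are hypothetical, not the actual competition schema, and the named-aggregation style shown assumes a cuDF version that supports it.

```python
import pandas as pd  # swap for `import cudf as pd` to run the same pattern on GPU

# Hypothetical session log loosely resembling session-based click data
# (columns and values are illustrative, not the challenge dataset).
df = pd.DataFrame({
    "session_id": [1, 1, 1, 2, 2],
    "item_id":    [10, 11, 10, 12, 12],
    "step":       [1, 2, 3, 1, 2],
})

# Session-level aggregate features computed entirely with groupby --
# the kind of columnar work that benefits from GPU execution under cuDF.
feats = df.groupby("session_id").agg(
    session_length=("step", "max"),
    unique_items=("item_id", "nunique"),
)
```

The resulting `feats` frame can then be joined back onto the session rows to serve as tabular model inputs.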
REFERENCES
- Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16). ACM, New York, NY, USA, 785--794.
- Jeremy Howard, Rachel Thomas, and Sylvain Gugger. 2018. Fast.ai Library. http://docs.fast.ai
- Bytedance Inc. 2019. BytePS. https://github.com/bytedance/byteps
- Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.). Curran Associates, Inc., 3146--3154. http://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree.pdf
- Peter Knees, Yashar Deldjoo, Farshad Bakhshandegan Moghaddam, Jens Adamczak, Gerard-Paul Leyson, and Philipp Monreal. 2019. RecSys Challenge 2019: Session-based Hotel Recommendations. In Proceedings of the Thirteenth ACM Conference on Recommender Systems (RecSys '19). ACM, New York, NY, USA, 2.
- Wes McKinney. 2010. Data Structures for Statistical Computing in Python. In Proceedings of the 9th Python in Science Conference, Stéfan van der Walt and Jarrod Millman (Eds.). 51--56.
- NVidia. 2017. APEX. https://github.com/NVIDIA/apex
- NVidia. 2019. CUDA Profiler Users Guide. https://docs.nvidia.com/cuda/pdf/CUDA_Profiler_Users_Guide.pdf
- Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic Differentiation in PyTorch. In NIPS Autodiff Workshop.
- RAPIDS.AI. 2019. RAPIDS.AI cuDF repository. https://github.com/rapidsai/cuDF
- Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. 2011. Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent. In Advances in Neural Information Processing Systems 24, J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 693--701. http://papers.nips.cc/paper/4390-hogwild-a-lock-free-approach-to-parallelizing-stochastic-gradient-descent.pdf
- Alexander Sergeev and Mike Del Balso. 2018. Horovod: fast and easy distributed deep learning in TensorFlow. CoRR abs/1802.05799 (2018). arXiv:1802.05799 http://arxiv.org/abs/1802.05799
- Mark J van der Laan, Eric C Polley, and Alan E Hubbard. 2007. Super Learner. Statistical Applications in Genetics and Molecular Biology 6, 1 (2007).
- Stéfan van der Walt, S. Chris Colbert, and Gaël Varoquaux. 2011. The NumPy Array: A Structure for Efficient Numerical Computation. Computing in Science & Engineering 13, 2 (2011), 22--30.
- Yang You, Jing Li, Jonathan Hseu, Xiaodan Song, James Demmel, and Cho-Jui Hsieh. 2019. Reducing BERT Pre-Training Time from 3 Days to 76 Minutes. CoRR abs/1904.00962 (2019). arXiv:1904.00962 http://arxiv.org/abs/1904.00962
Index Terms
- Accelerating recommender system training 15x with RAPIDS