ABSTRACT
Matrix factorization (MF) is used by many popular algorithms, such as collaborative filtering. GPUs, with their massive number of cores and high memory bandwidth, promise to accelerate MF substantially when their architectural characteristics are exploited appropriately.
This paper presents cuMF, a CUDA-based matrix factorization library that optimizes the alternating least squares (ALS) method to solve very large-scale MF. cuMF uses a set of techniques to maximize performance on single and multiple GPUs: efficient access to sparse data that leverages the GPU memory hierarchy, data parallelism used in conjunction with model parallelism, minimized communication overhead among GPUs, and a novel topology-aware parallel reduction scheme.
On a single machine with four NVIDIA GPU cards, cuMF is 6-10 times as fast, and 33-100 times as cost-efficient, as state-of-the-art distributed CPU solutions. Moreover, cuMF can solve the largest matrix factorization problem reported in the literature to date, while sustaining good performance.
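To make the ALS update concrete, below is a minimal CUDA sketch, not cuMF's actual kernel; the kernel name, the CSR layout of the ratings, and the fixed factor dimension F are illustrative assumptions. With the item factors Θ held fixed, each user u's factor vector x_u solves the F×F normal equations (Σ_{v∈R(u)} θ_v θ_vᵀ + λI) x_u = Σ_{v∈R(u)} r_uv θ_v, where R(u) is the set of items u rated. The kernel only assembles these systems; a batched solver would finish the update.

```cuda
// Illustrative per-user ALS assembly (not cuMF's actual kernel).
// With item factors theta fixed, each user u solves
//   (sum_{v in R(u)} theta_v theta_v^T + lambda*I) x_u
//     = sum_{v in R(u)} r_uv * theta_v,
// an F x F linear system. One thread block handles one user.

#include <cuda_runtime.h>

#define F 32  // assumed factor dimension; small enough for shared memory

__global__ void build_normal_equations(
    const int*   __restrict__ row_ptr,  // CSR offsets: u's ratings at [row_ptr[u], row_ptr[u+1])
    const int*   __restrict__ col_idx,  // item id v of each rating
    const float* __restrict__ val,      // rating value r_uv
    const float* __restrict__ theta,    // item factors, n_items x F, row-major
    float        lambda,                // regularization weight
    float*       __restrict__ A,        // out: n_users F x F system matrices
    float*       __restrict__ b)        // out: n_users F-dim right-hand sides
{
    int u = blockIdx.x;                 // one thread block per user
    __shared__ float sA[F * F];
    __shared__ float sb[F];

    // Each thread owns a fixed strided subset of entries throughout,
    // so no synchronization is needed between the phases below.
    for (int i = threadIdx.x; i < F * F; i += blockDim.x) sA[i] = 0.0f;
    for (int i = threadIdx.x; i < F;     i += blockDim.x) sb[i] = 0.0f;

    // Accumulate theta_v * theta_v^T and r_uv * theta_v over u's rated items.
    for (int nz = row_ptr[u]; nz < row_ptr[u + 1]; ++nz) {
        int v   = col_idx[nz];
        float r = val[nz];
        for (int i = threadIdx.x; i < F * F; i += blockDim.x)
            sA[i] += theta[v * F + i / F] * theta[v * F + i % F];
        for (int i = threadIdx.x; i < F; i += blockDim.x)
            sb[i] += r * theta[v * F + i];
    }

    // Add the regularizer on the diagonal and write the system back.
    for (int i = threadIdx.x; i < F * F; i += blockDim.x)
        A[u * F * F + i] = sA[i] + (i / F == i % F ? lambda : 0.0f);
    for (int i = threadIdx.x; i < F; i += blockDim.x)
        b[u * F + i] = sb[i];
}
```

Launched as `build_normal_equations<<<n_users, 128>>>(...)`, this yields one small dense system per user, which can be solved in batch on the GPU, for example with cuSOLVER's batched Cholesky routines. The "efficient access to sparse data" the abstract mentions concerns exactly the reads of θ and the sparse ratings in the accumulation loop; cuMF's own kernels stage this traffic through the GPU memory hierarchy far more carefully than this sketch does.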