
Faster and Cheaper: Parallelizing Large-Scale Matrix Factorization on GPUs

Published: 31 May 2016

ABSTRACT

Matrix factorization (MF) lies at the core of many popular algorithms, such as collaborative filtering. GPUs, with their massive cores and high memory bandwidth, offer the potential to accelerate MF substantially, provided their architectural characteristics are exploited appropriately.
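For context, MF approximates a sparse rating matrix R by the product of two low-rank factor matrices, with a vector x_u per user u and a vector θ_v per item v. As a point of reference (the abstract does not spell this out), a standard form of the objective, shown here with plain Frobenius-norm regularization as one common choice, is:

\[
\min_{X,\,\Theta} \; \sum_{(u,v)\in\Omega} \left( r_{uv} - x_u^{\top}\theta_v \right)^2
\;+\; \lambda \left( \lVert X \rVert_F^2 + \lVert \Theta \rVert_F^2 \right)
\]

where Ω is the set of observed entries of R and λ is the regularization weight.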

This paper presents cuMF, a CUDA-based matrix factorization library that optimizes the alternating least squares (ALS) method to solve very large-scale MF problems. cuMF uses a set of techniques to maximize performance on single and multiple GPUs: smart access of sparse data that leverages the GPU memory hierarchy, data parallelism used in conjunction with model parallelism, minimized communication overhead among GPUs, and a novel topology-aware parallel reduction scheme.
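ALS works by holding the item factors Θ fixed and solving, for every user u independently, the regularized normal equations (Θ_u^T Θ_u + λI) x_u = Θ_u^T r_u over that user's rated items, then swapping the roles of users and items; that per-user independence is what makes the method amenable to massive GPU parallelism. The sketch below illustrates the flavor of the first technique listed above, memory-hierarchy-aware access to sparse data. It is a minimal, hypothetical CUDA example, not cuMF's actual code: the CSR array names, the fixed latent dimension F = 32, and the one-block-per-user layout are all assumptions of this sketch. Each thread block assembles one user's F x F normal equations, staging each rated item's factor vector θ_v in shared memory so the F^2 multiply-accumulates read from fast on-chip storage instead of global memory.

// Hypothetical sketch of one ALS building block (not cuMF's actual code).
#include <cstdio>
#include <cuda_runtime.h>

constexpr int F = 32;  // latent dimension; fixed here for simplicity (assumption)

// One thread block per user u: build A_u = sum_{v in Omega_u} theta_v theta_v^T + lambda*I
// and b_u = sum_{v in Omega_u} r_uv * theta_v, staging each theta_v in shared memory.
__global__ void build_normal_equations(const int* __restrict__ row_ptr,   // CSR row pointers of R
                                       const int* __restrict__ col_idx,   // item id of each rating
                                       const float* __restrict__ vals,    // rating values
                                       const float* __restrict__ theta,   // item factors, n x F row-major
                                       float* A,                          // out: m blocks of F x F
                                       float* b,                          // out: m blocks of F
                                       float lambda) {
    const int u = blockIdx.x;
    const int i = threadIdx.x / F;     // each of the F*F threads owns one entry A_u[i][j]
    const int j = threadIdx.x % F;
    __shared__ float th[F];            // current theta_v, staged in fast shared memory

    float a = (i == j) ? lambda : 0.0f;  // A_u starts as lambda * I
    float rhs = 0.0f;
    for (int p = row_ptr[u]; p < row_ptr[u + 1]; ++p) {
        if (threadIdx.x < F)             // coalesced: F threads load theta_v from global memory once
            th[threadIdx.x] = theta[col_idx[p] * F + threadIdx.x];
        __syncthreads();
        a += th[i] * th[j];              // rank-1 update theta_v * theta_v^T, read from shared memory
        if (j == 0) rhs += vals[p] * th[i];
        __syncthreads();                 // keep th intact until all threads are done with it
    }
    A[((size_t)u * F + i) * F + j] = a;
    if (j == 0) b[(size_t)u * F + i] = rhs;
}

int main() {
    // Toy problem: m = 2 users, n = 3 items, 4 ratings in CSR form.
    const int m = 2, n = 3;
    int h_row_ptr[] = {0, 2, 4};
    int h_col_idx[] = {0, 2, 1, 2};
    float h_vals[] = {5.0f, 3.0f, 4.0f, 1.0f};
    float h_theta[n * F];
    for (int k = 0; k < n * F; ++k) h_theta[k] = 0.01f * (k % F);

    int *d_row_ptr, *d_col_idx;
    float *d_vals, *d_theta, *d_A, *d_b;
    cudaMalloc(&d_row_ptr, sizeof(h_row_ptr));
    cudaMalloc(&d_col_idx, sizeof(h_col_idx));
    cudaMalloc(&d_vals, sizeof(h_vals));
    cudaMalloc(&d_theta, sizeof(h_theta));
    cudaMalloc(&d_A, m * F * F * sizeof(float));
    cudaMalloc(&d_b, m * F * sizeof(float));
    cudaMemcpy(d_row_ptr, h_row_ptr, sizeof(h_row_ptr), cudaMemcpyHostToDevice);
    cudaMemcpy(d_col_idx, h_col_idx, sizeof(h_col_idx), cudaMemcpyHostToDevice);
    cudaMemcpy(d_vals, h_vals, sizeof(h_vals), cudaMemcpyHostToDevice);
    cudaMemcpy(d_theta, h_theta, sizeof(h_theta), cudaMemcpyHostToDevice);

    build_normal_equations<<<m, F * F>>>(d_row_ptr, d_col_idx, d_vals, d_theta, d_A, d_b, 0.05f);
    cudaDeviceSynchronize();

    float a00;  // spot-check one entry of A_0
    cudaMemcpy(&a00, d_A, sizeof(float), cudaMemcpyDeviceToHost);
    printf("A_0[0][0] = %f\n", a00);
    return 0;
}

Each θ_v is read from global memory once (coalesced) and reused F^2 times from shared memory; the resulting small per-user systems A_u x_u = b_u can then be solved in bulk with a batched dense solver (e.g., batched Cholesky).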

Using only a single machine with four NVIDIA GPU cards, cuMF is 6-10 times as fast, and 33-100 times as cost-efficient, as state-of-the-art distributed CPU solutions. Moreover, cuMF solves the largest matrix factorization problem reported in the literature to date, with very good performance.


Published in

HPDC '16: Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, May 2016, 302 pages
ISBN: 978-1-4503-4314-5
DOI: 10.1145/2907294 (this article: 10.1145/2907294.2907297)
Copyright © 2016 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

HPDC '16: 20 of 129 submissions accepted (16%). Overall for HPDC: 166 of 966 submissions accepted (17%).
