ABSTRACT
Matrix factorization (MF) is used by many popular algorithms, such as collaborative filtering. GPUs, with their massive number of cores and high memory bandwidth, promise to accelerate MF substantially when their architectural characteristics are exploited appropriately.
This paper presents cuMF, a CUDA-based matrix factorization library that optimizes the alternating least squares (ALS) method to solve very large-scale MF. cuMF uses a set of techniques to maximize performance on single and multiple GPUs: efficient access to sparse data that leverages the GPU memory hierarchy, data parallelism used in conjunction with model parallelism, minimized communication overhead among GPUs, and a novel topology-aware parallel reduction scheme.
On a single machine with four NVIDIA GPU cards, cuMF is 6-10 times as fast, and 33-100 times as cost-efficient, as state-of-the-art distributed CPU solutions. Moreover, cuMF can solve the largest matrix factorization problem reported in the literature to date, while sustaining good performance.
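To make the ALS update concrete, below is a minimal CUDA sketch, not cuMF's actual kernel; the kernel name, the CSR layout of the ratings, and the fixed factor dimension F are illustrative assumptions. With the item factors Θ held fixed, each user u's factor vector x_u solves the F×F normal equations (Σ_{v∈R(u)} θ_v θ_vᵀ + λI) x_u = Σ_{v∈R(u)} r_uv θ_v, where R(u) is the set of items u rated. The kernel only assembles these systems; a batched solver would finish the update.

```cuda
// Illustrative per-user ALS assembly (not cuMF's actual kernel).
// With item factors theta fixed, each user u solves
//   (sum_{v in R(u)} theta_v theta_v^T + lambda*I) x_u
//     = sum_{v in R(u)} r_uv * theta_v,
// an F x F linear system. One thread block handles one user.

#include <cuda_runtime.h>

#define F 32  // assumed factor dimension; small enough for shared memory

__global__ void build_normal_equations(
    const int*   __restrict__ row_ptr,  // CSR offsets: u's ratings at [row_ptr[u], row_ptr[u+1])
    const int*   __restrict__ col_idx,  // item id v of each rating
    const float* __restrict__ val,      // rating value r_uv
    const float* __restrict__ theta,    // item factors, n_items x F, row-major
    float        lambda,                // regularization weight
    float*       __restrict__ A,        // out: n_users F x F system matrices
    float*       __restrict__ b)        // out: n_users F-dim right-hand sides
{
    int u = blockIdx.x;                 // one thread block per user
    __shared__ float sA[F * F];
    __shared__ float sb[F];

    // Each thread owns a fixed strided subset of entries throughout,
    // so no synchronization is needed between the phases below.
    for (int i = threadIdx.x; i < F * F; i += blockDim.x) sA[i] = 0.0f;
    for (int i = threadIdx.x; i < F;     i += blockDim.x) sb[i] = 0.0f;

    // Accumulate theta_v * theta_v^T and r_uv * theta_v over u's rated items.
    for (int nz = row_ptr[u]; nz < row_ptr[u + 1]; ++nz) {
        int v   = col_idx[nz];
        float r = val[nz];
        for (int i = threadIdx.x; i < F * F; i += blockDim.x)
            sA[i] += theta[v * F + i / F] * theta[v * F + i % F];
        for (int i = threadIdx.x; i < F; i += blockDim.x)
            sb[i] += r * theta[v * F + i];
    }

    // Add the regularizer on the diagonal and write the system back.
    for (int i = threadIdx.x; i < F * F; i += blockDim.x)
        A[u * F * F + i] = sA[i] + (i / F == i % F ? lambda : 0.0f);
    for (int i = threadIdx.x; i < F; i += blockDim.x)
        b[u * F + i] = sb[i];
}
```

Launched as `build_normal_equations<<<n_users, 128>>>(...)`, this yields one small dense system per user, which can be solved in batch on the GPU, for example with cuSOLVER's batched Cholesky routines. The "efficient access to sparse data" the abstract mentions concerns exactly the reads of θ and the sparse ratings in the accumulation loop; cuMF's own kernels stage this traffic through the GPU memory hierarchy far more carefully than this sketch does.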