DOI: 10.1145/3038228.3038240
research-article

Parallel CCD++ on GPU for Matrix Factorization

Published: 04 February 2017

Abstract

Matrix factorization of an incomplete matrix is useful in applications such as recommender systems. Several iterative algorithms have been proposed for matrix factorization in recommender systems, including Cyclic Coordinate Descent (CCD). Recently, a variant of CCD called CCD++ was developed as an attractive algorithm for parallel implementation on multicore processors. In this paper, we address the parallelization of CCD++ for a GPU. Key considerations are reducing the volume of data transferred to and from GPU global memory and minimizing intra-warp load imbalance. Starting with a base implementation, we successively improve the GPU implementation of CCD++ through loop fusion and tiling, guided by performance insights from hardware counter data. The resulting algorithm is shown to be faster than the best reported multicore implementation of CCD++ as well as the best reported GPU implementation of matrix factorization (using ALS, Alternating Least Squares).
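
For readers unfamiliar with CCD++, the following is a minimal sequential sketch of its rank-one update scheme as it is commonly described in the literature. It is not the GPU implementation discussed in the abstract and includes none of the loop fusion or tiling optimizations. The function names `ccdpp` and `objective`, the parameter names, the plain `lam` regularization term, and the initialization of `H` are all assumptions made for illustration.

```python
def objective(R, W, H, lam):
    """Squared residual on observed entries plus L2 regularization."""
    sq = sum(r * r for r in R.values())
    reg = sum(x * x for row in W for x in row)
    reg += sum(x * x for row in H for x in row)
    return sq + lam * reg


def ccdpp(ratings, m, n, k=2, lam=0.1, outer_iters=20, inner_iters=3):
    """Factorize a sparse m x n matrix as W * H^T (illustrative sketch).

    ratings: dict mapping (row, col) -> observed value.
    Returns W (m x k), H (n x k) and the residual dict R.
    """
    # Index the observed entries by row and by column.
    by_row = [[] for _ in range(m)]
    by_col = [[] for _ in range(n)]
    for (i, j) in ratings:
        by_row[i].append(j)
        by_col[j].append(i)

    # Assumed initialization: W = 0, H = small deterministic nonzero values.
    W = [[0.0] * k for _ in range(m)]
    H = [[0.1 * ((j + t) % 3 + 1) for t in range(k)] for j in range(n)]
    # Residual on observed entries: R = A - W * H^T, which equals A when W = 0.
    R = dict(ratings)

    for _ in range(outer_iters):
        for t in range(k):  # one rank-one factor (u, v) at a time
            u = [W[i][t] for i in range(m)]
            v = [H[j][t] for j in range(n)]
            # Add the current rank-one term back into the residual.
            for (i, j) in R:
                R[(i, j)] += u[i] * v[j]
            # Alternate closed-form single-variable updates of u and v.
            for _ in range(inner_iters):
                for i in range(m):
                    num = sum(R[(i, j)] * v[j] for j in by_row[i])
                    den = lam + sum(v[j] ** 2 for j in by_row[i])
                    u[i] = num / den
                for j in range(n):
                    num = sum(R[(i, j)] * u[i] for i in by_col[j])
                    den = lam + sum(u[i] ** 2 for i in by_col[j])
                    v[j] = num / den
            # Subtract the updated rank-one term and store the factor.
            for (i, j) in R:
                R[(i, j)] -= u[i] * v[j]
            for i in range(m):
                W[i][t] = u[i]
            for j in range(n):
                H[j][t] = v[j]
    return W, H, R
```

In this sketch the residual is read and written once per observed entry for every rank-one update, and the amount of work per row or column varies with its number of observed entries. These are the kinds of memory traffic and load imbalance concerns the abstract names as key considerations on a GPU.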


Published In

GPGPU-10: Proceedings of the General Purpose GPUs
February 2017
84 pages
ISBN: 9781450349154
DOI: 10.1145/3038228
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. GPU
  2. Matrix Factorization
  3. Recommender System

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • US National Science Foundation

Conference

PPoPP '17
Acceptance Rates

GPGPU-10 Paper Acceptance Rate: 8 of 15 submissions, 53%
Overall Acceptance Rate: 57 of 129 submissions, 44%
