Parallel CCD++ on GPU for Matrix Factorization

Authors:

Aravind Sukumaran-Rajam,

Rakshith Kunchum,

P. Sadayappan

GPGPU-10: Proceedings of the General Purpose GPUs

Pages 73 - 83

https://doi.org/10.1145/3038228.3038240

Published: 04 February 2017

Abstract

Matrix factorization of an incomplete matrix is useful in applications such as recommender systems. Several iterative algorithms have been proposed for matrix factorization in recommender systems, including Cyclic Coordinate Descent (CCD). Recently, a variant of CCD called CCD++ was developed as an attractive algorithm for parallel implementation on multicore processors. In this paper, we address the parallelization of CCD++ for a GPU. Key considerations are reducing the volume of data transferred to and from GPU global memory and minimizing intra-warp load imbalance. Starting with a base implementation, we successively improve the GPU implementation of CCD++ through loop fusion and tiling, guided by performance insights from hardware counter data. The resulting algorithm is shown to be faster than the best reported multicore implementation of CCD++ as well as the best reported GPU implementation of matrix factorization (using ALS, Alternating Least Squares).
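
The abstract does not spell out the CCD++ update itself. As a rough orientation only, the following is a minimal single-threaded Python sketch of the rank-one update scheme commonly described for CCD++ (Yu et al., 2012): each latent feature is refined in turn against a residual kept on the observed entries. It is not the authors' GPU implementation and shows none of the loop fusion or tiling mentioned above. The names `ccdpp` and `rmse`, the plain L2 regularizer `lam`, and all default parameter values are assumptions made for illustration.

```python
import random


def ccdpp(ratings, m, n, k=2, lam=0.05, outer_iters=10, inner_iters=3, seed=0):
    """Illustrative CCD++ sketch (hypothetical helper, not the paper's code).

    ratings: list of (i, j, value) observed entries of an m x n matrix.
    Returns W (m x k) and H (n x k) so that value ~ sum_t W[i][t] * H[j][t].
    """
    rng = random.Random(seed)
    # Assumed initialisation: W = 0 and H random, so the residual equals the ratings.
    W = [[0.0] * k for _ in range(m)]
    H = [[rng.uniform(0.0, 1.0) for _ in range(k)] for _ in range(n)]
    res = [v for (_, _, v) in ratings]

    # Index the observed entries by row and by column.
    rows = [[] for _ in range(m)]
    cols = [[] for _ in range(n)]
    for idx, (i, j, _) in enumerate(ratings):
        rows[i].append((idx, j))
        cols[j].append((idx, i))

    for _ in range(outer_iters):
        for t in range(k):
            u = [W[i][t] for i in range(m)]
            v = [H[j][t] for j in range(n)]
            # Add the current rank-one term back into the residual.
            for idx, (i, j, _) in enumerate(ratings):
                res[idx] += u[i] * v[j]
            # Alternate closed-form single-variable updates for u and v.
            for _ in range(inner_iters):
                for i in range(m):
                    num = sum(res[idx] * v[j] for idx, j in rows[i])
                    den = lam + sum(v[j] * v[j] for _, j in rows[i])
                    u[i] = num / den
                for j in range(n):
                    num = sum(res[idx] * u[i] for idx, i in cols[j])
                    den = lam + sum(u[i] * u[i] for _, i in cols[j])
                    v[j] = num / den
            # Remove the refreshed rank-one term and store the new feature.
            for idx, (i, j, _) in enumerate(ratings):
                res[idx] -= u[i] * v[j]
            for i in range(m):
                W[i][t] = u[i]
            for j in range(n):
                H[j][t] = v[j]
    return W, H


def rmse(ratings, W, H):
    """Root-mean-square error of the factorization on the given entries."""
    sse = 0.0
    for i, j, value in ratings:
        pred = sum(wt * ht for wt, ht in zip(W[i], H[j]))
        sse += (value - pred) ** 2
    return (sse / len(ratings)) ** 0.5
```

Calling `ccdpp(ratings, m, n)` on a list of `(row, column, value)` triples returns the two factor matrices, and `rmse(ratings, W, H)` reports the fit on those entries. In this sketch the per-row and per-column sums are the natural units of parallel work, and rows with very different numbers of observed entries are where load imbalance of the kind the abstract mentions would arise.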

Published In

GPGPU-10: Proceedings of the General Purpose GPUs

February 2017

84 pages

ISBN:9781450349154

DOI:10.1145/3038228

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGPLAN: ACM Special Interest Group on Programming Languages

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

US National Science Foundation

Conference

PPoPP '17

Sponsor:

SIGPLAN

PPoPP '17: 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

February 4 - 8, 2017

Austin, TX, USA

Acceptance Rates

GPGPU-10 Paper Acceptance Rate 8 of 15 submissions, 53%;

Overall Acceptance Rate 57 of 129 submissions, 44%
