Abstract
We provide parallel algorithms for large-scale matrix completion on problems with millions of rows, millions of columns, and billions of revealed entries. We focus on in-memory algorithms that run either in a shared-memory environment on a powerful compute node or in a shared-nothing environment on a small cluster of commodity nodes; even very large problems can be handled effectively in these settings. Our ASGD, DSGD-MR, DSGD++, and CSGD algorithms are novel variants of the popular stochastic gradient descent (SGD) algorithm, with the latter three based on a new “stratified SGD” approach. All of the algorithms are cache-friendly and exploit thread-level parallelism, in-memory processing, and asynchronous communication. We investigate the performance of both new and existing algorithms via a theoretical complexity analysis and a set of large-scale experiments. The results show that CSGD is more scalable, and up to 60 % faster, than the best-performing alternative method in the shared-memory setting. DSGD++ is superior in terms of overall runtime, memory consumption, and scalability in the shared-nothing setting. For example, DSGD++ can solve a difficult matrix completion problem on a high-variance matrix with 10M rows, 1M columns, and 10B revealed entries in around 40 min on 16 compute nodes. In general, algorithms based on SGD appear to perform better than algorithms based on alternating minimization, such as the PALS and DALS alternating least-squares algorithms.
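All of the SGD-based algorithms above build on the same entry-wise update of the factor matrices. As a point of reference, the following minimal sketch (ours, not the paper's code) performs one SGD sweep over the revealed entries, assuming a squared loss with L2 regularization and a fixed step size; the concrete losses, step-size schedules, and parallelization strategies studied in the paper differ in detail.

```python
import random
import numpy as np

def sgd_epoch(entries, W, H, eps, lam):
    """One SGD sweep over the revealed entries of the data matrix.

    entries : list of (i, j, v) triples, the revealed entries
    W, H    : factor matrices of shape (m, r) and (n, r)
    eps     : step size
    lam     : L2 regularization weight (assumed loss, for illustration)
    """
    random.shuffle(entries)                     # visit entries in random order
    for i, j, v in entries:
        err = v - W[i] @ H[j]                   # residual at entry (i, j)
        w_old = W[i].copy()                     # old row needed for the H update
        W[i] += eps * (err * H[j] - lam * W[i])
        H[j] += eps * (err * w_old - lam * H[j])

# Tiny usage example with hypothetical sizes and random data
m, n, r = 100, 80, 10
W = np.random.rand(m, r) / np.sqrt(r)
H = np.random.rand(n, r) / np.sqrt(r)
entries = [(random.randrange(m), random.randrange(n), random.random())
           for _ in range(1000)]
sgd_epoch(entries, W, H, eps=0.01, lam=0.05)
```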
Notes
“Convergence” refers to running an algorithm until some convergence criterion is met; “asymptotic convergence” means that the algorithm converges to the true solution as the runtime increases to \(+\infty \).
It is sometimes desirable to choose a value of \(b\) exceeding the number of processing units \(p\). In this case, each stratum consists of \(p\) (\({<}b\)) interchangeable blocks, so that \(p\) blocks are processed per subepoch and an epoch comprises \(b^2/p\) subepochs.
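To make the block-scheduling arithmetic above concrete, the following sketch enumerates one possible schedule under the additional assumption (ours, for simplicity) that \(p\) divides \(b\): each of the \(b\) diagonal strata of \(b\) blocks is split into groups of \(p\) blocks that share no rows or columns of the blocking, giving \(b^2/p\) subepochs per epoch.

```python
def stratum_schedule(b, p):
    """Enumerate the subepochs of one stratified-SGD epoch for a b x b blocking
    on p processing units, assuming (for this sketch) that p divides b.

    Each subepoch is a list of p interchangeable blocks (no two blocks share a
    row or column of the blocking); one epoch comprises b*b/p subepochs and
    visits every block exactly once.
    """
    assert b % p == 0, "this sketch assumes p divides b"
    schedule = []
    for shift in range(b):                            # b diagonal strata of size b
        diagonal = [(i, (i + shift) % b) for i in range(b)]
        for start in range(0, b, p):                  # split each into groups of p
            schedule.append(diagonal[start:start + p])
    return schedule

# Example: b = 4, p = 2  ->  b^2/p = 8 subepochs of 2 interchangeable blocks each
print(stratum_schedule(4, 2))
```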
To facilitate comparison with other algorithms, we assume in Fig. 4b that input data and factor matrices are stored in main memory instead of in a distributed file system. Such an approach is also followed in the DSGD-MR implementation used in our experimental study. See [12] for details on a disk-based Hadoop implementation.
Note that our results w.r.t. the relative performance of PSGD and PCCD++ differ from the ones in [26], presumably due to our use of the bold driver heuristic.
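The bold driver heuristic (Battiti [2]) adapts the SGD step size once per epoch: the step size is increased slightly whenever the training loss decreased and cut back sharply whenever it increased. The sketch below illustrates the idea; the factors 1.05 and 0.5 are common illustrative choices, not necessarily the constants used in our experiments.

```python
def bold_driver(step_size, prev_loss, curr_loss, grow=1.05, shrink=0.5):
    """Adapt the SGD step size after an epoch with the bold driver heuristic:
    grow it slightly when the loss decreased, cut it sharply otherwise.
    The factors 1.05 and 0.5 are illustrative placeholders.
    """
    return step_size * (grow if curr_loss < prev_loss else shrink)
```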
References
Amatriain X, Basilico J (2012) Netflix recommendations: beyond the 5 stars (part 1). http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html
Battiti R (1989) Accelerated backpropagation learning: two optimization methods. Complex Syst 3:331–342
Bennett J, Lanning S (2007) The Netflix prize. In: Proceedings of the KDD cup workshop, pp 3–6
Byrd RH, Lu P, Nocedal J, Zhu C (1995) A limited memory algorithm for bound constrained optimization. SIAM J Sci Comput 16(5):1190–1208
Candes EJ, Recht B (2009) Exact matrix completion via convex optimization. Found Comput Math 9(6):717–772
Chen P-L, Tsai C-T, Chen Y-N, Chou K-C, Li C-L, Tsai C-H, Wu K-W, Chou Y-C, Li C-Y, Lin W-S, Yu S-H, Chiu R-B, Lin C-Y, Wang C-C, Wang P-W, Su W-L, Wu C-H, Kuo T-T, McKenzie T, Chang Y-H, Ferng C-S, Ni C-M, Lin H-T, Lin C-J, Lin S-D (2012) A linear ensemble of individual and blended models for music rating prediction. J Mach Learn Res Proc Track 18:21–60
Chu CT, Kim SK, Lin YA, Yu YY, Bradski G, Ng AY, Olukotun K (2006) Map-reduce for machine learning on multicore. In: Advances in neural information processing systems (NIPS), pp 281–288
Cichocki A, Phan AH (2009) Fast local algorithms for large scale nonnegative matrix and tensor factorizations. IEICE Trans 92–A(3):708–721
Das AS, Datar M, Garg A, Rajaram S (2007) Google news personalization: scalable online collaborative filtering. In: Proceedings of the international conference on World Wide Web (WWW), pp 271–280
Das S, Sismanis Y, Beyer KS, Gemulla R, Haas PJ, McPherson J (2010) Ricardo: integrating R and Hadoop. In: Proceedings of the ACM international conference on management of data (SIGMOD), pp 987–998
Dror G, Koenigstein N, Koren Y, Weimer M (2012) The Yahoo! Music dataset and KDD-Cup’11. J Mach Learn Res Proc Track 18:8–18
Gemulla R, Haas PJ, Nijkamp E, Sismanis Y (2011) Large-scale matrix factorization with distributed stochastic gradient descent. Technical Report RJ10481, IBM Almaden Research Center, San Jose, CA. http://researcher.watson.ibm.com/researcher/files/us-phaas/rj10482Updated.pdf
Gemulla R, Nijkamp E, Haas PJ, Sismanis Y (2011) Large-scale matrix factorization with distributed stochastic gradient descent. In: Proceedings of the ACM international conference on knowledge discovery and data mining (SIGKDD), pp 69–77
Hu Y, Koren Y, Volinsky C (2008) Collaborative filtering for implicit feedback datasets. In: Proceedings of the IEEE international conference on data mining (ICDM), pp 263–272
Koren Y, Bell R, Volinsky C (2009) Matrix factorization techniques for recommender systems. IEEE Comput 42(8):30–37
Li B, Tata S, Sismanis Y (2013) Sparkler: supporting large-scale matrix factorization. In: Proceedings of the international conference on extending database technology (EDBT), pp 625–636
Liu C, Yang H-C, Fan J, He L-W, Wang Y-M (2010) Distributed nonnegative matrix factorization for web-scale dyadic data analysis on MapReduce. In: Proceedings of the international conference on World Wide Web (WWW), pp 681–690
Mackey L, Talwalkar A, Jordan M (2011) Divide-and-conquer matrix factorization. In: Advances in neural information processing systems (NIPS), pp 1134–1142
McDonald R, Hall K, Mann G (2010) Distributed training strategies for the structured perceptron. In: Human language technologies, pp 456–464
MPI (2013) Message passing interface forum. http://www.mpi-forum.org
Niu F, Recht B, Ré C, Wright SJ (2011) Hogwild!: a lock-free approach to parallelizing stochastic gradient descent. In: Advances in neural information processing systems (NIPS), pp 693–701
Recht B, Ré C (2013) Parallel stochastic gradient algorithms for large-scale matrix completion. Math Progr Comput 5:201–226
Smola A, Narayanamurthy S (2010) An architecture for parallel topic models. Proc VLDB Endow 3(1–2):703–710
Teflioudi C, Makari F, Gemulla R (2012) Distributed matrix completion. In: Proceedings of the IEEE international conference on data mining (ICDM), pp 655–664
Tsitsiklis J, Bertsekas D, Athans M (1986) Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE Trans Autom Control 31(9):803–812
Yu H-F, Hsieh C-J, Si S, Dhillon IS (2012) Scalable coordinate descent approaches to parallel matrix factorization for recommender systems. In: Proceedings of the IEEE international conference on data mining (ICDM), pp 765–774
Zhou Y, Wilkinson D, Schreiber R, Pan R (2008) Large-scale parallel collaborative filtering for the Netflix prize. In: Proceedings of the international conference on algorithmic aspects in information and management (AAIM), vol 5034, pp 337–348
Additional information
This work was done while Yannis Sismanis was at the IBM Almaden Research Center.
Cite this article
Makari, F., Teflioudi, C., Gemulla, R. et al. Shared-memory and shared-nothing stochastic gradient descent algorithms for matrix completion. Knowl Inf Syst 42, 493–523 (2015). https://doi.org/10.1007/s10115-013-0718-7