Shared-memory and shared-nothing stochastic gradient descent algorithms for matrix completion

  • Regular Paper
  • Published in: Knowledge and Information Systems

Abstract

We provide parallel algorithms for large-scale matrix completion on problems with millions of rows, millions of columns, and billions of revealed entries. We focus on in-memory algorithms that run either in a shared-memory environment on a powerful compute node or in a shared-nothing environment on a small cluster of commodity nodes; even very large problems can be handled effectively in these settings. Our ASGD, DSGD-MR, DSGD++, and CSGD algorithms are novel variants of the popular stochastic gradient descent (SGD) algorithm, with the latter three algorithms based on a new “stratified SGD” approach. All of the algorithms are cache-friendly and exploit thread-level parallelism, in-memory processing, and asynchronous communication. We investigate the performance of both new and existing algorithms via a theoretical complexity analysis and a set of large-scale experiments. The results show that CSGD is more scalable, and up to 60 % faster, than the best-performing alternative method in the shared-memory setting. DSGD++ is superior in terms of overall runtime, memory consumption, and scalability in the shared-nothing setting. For example, DSGD++ can solve a difficult matrix completion problem on a high-variance matrix with 10M rows, 1M columns, and 10B revealed entries in around 40 min on 16 compute nodes. In general, algorithms based on SGD appear to perform better than algorithms based on alternating minimization, such as the PALS and DALS alternating least-squares algorithms.
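
For readers unfamiliar with SGD-based matrix completion, the following is a minimal, purely illustrative Python sketch of plain sequential SGD for a low-rank factorization with an L2-regularized squared loss. The function name, hyperparameters, and loss details are illustrative choices rather than the paper's exact formulation; the parallel variants studied here (ASGD, DSGD-MR, DSGD++, CSGD) add stratification, thread-level parallelism, and asynchronous communication on top of updates of this form.

```python
import numpy as np

def sgd_matrix_completion(revealed, m, n, rank=10, reg=0.05, eta=0.005,
                          epochs=20, seed=0):
    """Plain sequential SGD for matrix completion (illustrative sketch only).

    revealed: list of (i, j, v) triples, the observed entries of an m x n matrix.
    Approximately minimizes  sum_(i,j,v) (v - L[i] @ R[j])^2
                             + reg * (||L||_F^2 + ||R||_F^2).
    """
    rng = np.random.default_rng(seed)
    L = rng.normal(scale=0.1, size=(m, rank))    # row factors
    R = rng.normal(scale=0.1, size=(n, rank))    # column factors
    for _ in range(epochs):
        for idx in rng.permutation(len(revealed)):   # random training order
            i, j, v = revealed[idx]
            err = v - L[i] @ R[j]
            li = L[i].copy()                         # old row, needed for R's update
            L[i] += eta * (err * R[j] - reg * L[i])
            R[j] += eta * (err * li - reg * R[j])
    return L, R
```

In the stratified, shared-nothing variants, rows and columns are partitioned into blocks so that updates of this kind never touch the same factor rows or columns on different processing units at the same time.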

Notes

  1. Some of the material in this paper originally appeared in [13, 24].

  2. “Convergence” refers to running an algorithm until some convergence criterion is met; “asymptotic convergence” means that the algorithm converges to the true solution as the runtime increases to \(+\infty \).

  3. It is sometimes desirable to choose a value of \(b\) exceeding the number of processing units \(p\). In this case, each stratum consists of \(p\) (\({<}b\)) interchangeable blocks, so that \(p\) blocks are processed per subepoch and an epoch comprises \(b^2/p\) subepochs; a small scheduling sketch is given after these notes.

  4. To facilitate comparison with other algorithms, we assume in Fig. 4b that input data and factor matrices are stored in main memory instead of in a distributed file system. Such an approach is also followed in the DSGD-MR implementation used in our experimental study. See [12] for details on a disk-based Hadoop implementation.

  5. Note that our results w.r.t. the relative performance of PSGD and PCCD++ differ from those in [26], presumably due to our use of the bold driver heuristic; a sketch of this heuristic is also given after these notes.
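
To make the block counting in note 3 concrete, here is one possible way to build such a schedule (our own sketch, not necessarily the paper's construction), assuming \(b\) is a multiple of \(p\): take diagonal-shift strata of the \(b \times b\) blocking and split each one into groups of \(p\) row- and column-disjoint blocks.

```python
def subepoch_schedule(b, p):
    """Yield subepochs for a b x b blocking with p processing units (assumes b % p == 0).

    Each subepoch is a list of p blocks (row_block, col_block) that share no block
    rows or block columns, so they can be processed in parallel without conflicts.
    One pass over the generator covers all b*b blocks in b*b/p subepochs.
    """
    assert b % p == 0
    for shift in range(b):                        # one diagonal-shift stratum per shift
        stratum = [(i, (i + shift) % b) for i in range(b)]
        for start in range(0, b, p):              # split the stratum into groups of p
            yield stratum[start:start + p]

# Example: b = 4 blocks per dimension, p = 2 workers -> 4*4/2 = 8 subepochs.
# list(subepoch_schedule(4, 2))[0] == [(0, 0), (1, 1)]
```

Similarly, the bold driver heuristic mentioned in note 5 (due to Battiti [2]) adapts the SGD step size between epochs. One common formulation, with illustrative constants that need not match the ones used in our experiments, is:

```python
def bold_driver(eta, prev_loss, curr_loss, grow=1.05, shrink=0.5):
    """Bold driver step-size update (constants are illustrative).

    Gently increase the step size after an epoch that decreased the loss;
    cut it sharply after an epoch that increased the loss.
    """
    return eta * grow if curr_loss < prev_loss else eta * shrink
```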

References

  1. Amatriain X, Basilico J (2012) Netflix recommendations: beyond the 5 stars (part 1). http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html

  2. Battiti R (1989) Accelerated backpropagation learning: two optimization methods. Complex Syst 3:331–342

  3. Bennett J, Lanning S (2007) The Netflix prize. In: Proceedings of the KDD cup workshop, pp 3–6

  4. Byrd RH, Lu P, Nocedal J, Zhu C (1995) A limited memory algorithm for bound constrained optimization. SIAM J Sci Comput 16(5):1190–1208

  5. Candes EJ, Recht B (2009) Exact matrix completion via convex optimization. Found Comput Math 9(6):717–772

  6. Chen P-L, Tsai C-T, Chen Y-N, Chou K-C, Li C-L, Tsai C-H, Wu K-W, Chou Y-C, Li C-Y, Lin W-S, Yu S-H, Chiu R-B, Lin C-Y, Wang C-C, Wang P-W, Su W-L, Wu C-H, Kuo T-T, McKenzie T, Chang Y-H, Ferng C-S, Ni C-M, Lin H-T, Lin C-J, Lin S-D (2012) A linear ensemble of individual and blended models for music rating prediction. J Mach Learn Res Proc Track 18:21–60

  7. Chu CT, Kim SK, Lin YA, Yu YY, Bradski G, Ng AY, Olukotun K (2006) Map-reduce for machine learning on multicore. In: Advances in neural information processing systems (NIPS), pp 281–288

  8. Cichocki A, Phan AH (2009) Fast local algorithms for large scale nonnegative matrix and tensor factorizations. IEICE Trans 92-A(3):708–721

  9. Das AS, Datar M, Garg A, Rajaram S (2007) Google news personalization: scalable online collaborative filtering. In: Proceedings of the international conference on World Wide Web (WWW), pp 271–280

  10. Das S, Sismanis Y, Beyer KS, Gemulla R, Haas PJ, McPherson J (2010) Ricardo: integrating R and Hadoop. In: Proceedings of the ACM international conference on management of data (SIGMOD), pp 987–998

  11. Dror G, Koenigstein N, Koren Y, Weimer M (2012) The Yahoo! Music dataset and KDD-Cup’11. J Mach Learn Res Proc Track 18:8–18

  12. Gemulla R, Haas PJ, Nijkamp E, Sismanis Y (2011) Large-scale matrix factorization with distributed stochastic gradient descent. Technical Report RJ10481, IBM Almaden Research Center, San Jose, CA. http://researcher.watson.ibm.com/researcher/files/us-phaas/rj10482Updated.pdf

  13. Gemulla R, Nijkamp E, Haas PJ, Sismanis Y (2011) Large-scale matrix factorization with distributed stochastic gradient descent. In: Proceedings of the ACM international conference on knowledge discovery and data mining (SIGKDD), pp 69–77

  14. Hu Y, Koren Y, Volinsky C (2008) Collaborative filtering for implicit feedback datasets. In: Proceedings of the IEEE international conference on data mining (ICDM), pp 263–272

  15. Koren Y, Bell R, Volinsky C (2009) Matrix factorization techniques for recommender systems. IEEE Comput 42(8):30–37

  16. Li B, Tata S, Sismanis Y (2013) Sparkler: supporting large-scale matrix factorization. In: Proceedings of the international conference on extending database technology (EDBT), pp 625–636

  17. Liu C, Yang H-C, Fan J, He L-W, Wang Y-M (2010) Distributed nonnegative matrix factorization for web-scale dyadic data analysis on MapReduce. In: Proceedings of the international conference on World Wide Web (WWW), pp 681–690

  18. Mackey L, Talwalkar A, Jordan M (2011) Divide-and-conquer matrix factorization. In: Advances in neural information processing systems (NIPS), pp 1134–1142

  19. McDonald R, Hall K, Mann G (2010) Distributed training strategies for the structured perceptron. In: Human language technologies, pp 456–464

  20. MPI (2013) Message passing interface forum. http://www.mpi-forum.org

  21. Niu F, Recht B, Ré C, Wright SJ (2011) Hogwild!: a lock-free approach to parallelizing stochastic gradient descent. In: Advances in neural information processing systems (NIPS), pp 693–701

  22. Recht B, Ré C (2013) Parallel stochastic gradient algorithms for large-scale matrix completion. Math Progr Comput 5:201–226

  23. Smola A, Narayanamurthy S (2010) An architecture for parallel topic models. Proc. VLDB Endow 3(1–2):703–710

  24. Teflioudi C, Makari F, Gemulla R (2012) Distributed matrix completion. In: Proceedings of the IEEE international conference on data mining (ICDM), pp 655–664

  25. Tsitsiklis J, Bertsekas D, Athans M (1986) Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE Trans Autom Control 31(9):803–812

  26. Yu H-F, Hsieh C-J, Si S, Dhillon IS (2012) Scalable coordinate descent approaches to parallel matrix factorization for recommender systems. In: Proceedings of the IEEE international conference on data mining (ICDM), pp 765–774

  27. Zhou Y, Wilkinson D, Schreiber R, Pan R (2008) Large-scale parallel collaborative filtering for the Netflix prize. In: Proceedings of the international conference on algorithmic aspects in information and management (AAIM), vol 5034, pp 337–348

Author information

Corresponding author

Correspondence to Faraz Makari.

Additional information

This work was done while Yannis Sismanis was at the IBM Almaden Research Center.

About this article

Cite this article

Makari, F., Teflioudi, C., Gemulla, R. et al. Shared-memory and shared-nothing stochastic gradient descent algorithms for matrix completion. Knowl Inf Syst 42, 493–523 (2015). https://doi.org/10.1007/s10115-013-0718-7
