Abstract
We provide parallel algorithms for large-scale matrix completion on problems with millions of rows, millions of columns, and billions of revealed entries. We focus on in-memory algorithms that run either in a shared-memory environment on a powerful compute node or in a shared-nothing environment on a small cluster of commodity nodes; even very large problems can be handled effectively in these settings. Our ASGD, DSGD-MR, DSGD++, and CSGD algorithms are novel variants of the popular stochastic gradient descent (SGD) algorithm, with the latter three based on a new “stratified SGD” approach. All of the algorithms are cache-friendly and exploit thread-level parallelism, in-memory processing, and asynchronous communication. We investigate the performance of both new and existing algorithms via a theoretical complexity analysis and a set of large-scale experiments. The results show that CSGD is more scalable, and up to 60 % faster, than the best-performing alternative method in the shared-memory setting. DSGD++ is superior in terms of overall runtime, memory consumption, and scalability in the shared-nothing setting. For example, DSGD++ can solve a difficult matrix completion problem on a high-variance matrix with 10M rows, 1M columns, and 10B revealed entries in around 40 min on 16 compute nodes. In general, algorithms based on SGD appear to perform better than algorithms based on alternating minimization, such as the PALS and DALS alternating least-squares algorithms.
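All of the SGD-based algorithms above build on the same entry-wise update of the factor matrices. As a point of reference, the following minimal sketch (ours, not the paper's code) performs one SGD sweep over the revealed entries, assuming a squared loss with L2 regularization and a fixed step size; the concrete losses, step-size schedules, and parallelization strategies studied in the paper differ in detail.

```python
import random
import numpy as np

def sgd_epoch(entries, W, H, eps, lam):
    """One SGD sweep over the revealed entries of the data matrix.

    entries : list of (i, j, v) triples, the revealed entries
    W, H    : factor matrices of shape (m, r) and (n, r)
    eps     : step size
    lam     : L2 regularization weight (assumed loss, for illustration)
    """
    random.shuffle(entries)                     # visit entries in random order
    for i, j, v in entries:
        err = v - W[i] @ H[j]                   # residual at entry (i, j)
        w_old = W[i].copy()                     # old row needed for the H update
        W[i] += eps * (err * H[j] - lam * W[i])
        H[j] += eps * (err * w_old - lam * H[j])

# Tiny usage example with hypothetical sizes and random data
m, n, r = 100, 80, 10
W = np.random.rand(m, r) / np.sqrt(r)
H = np.random.rand(n, r) / np.sqrt(r)
entries = [(random.randrange(m), random.randrange(n), random.random())
           for _ in range(1000)]
sgd_epoch(entries, W, H, eps=0.01, lam=0.05)
```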
Notes
“Convergence” refers to running an algorithm until some convergence criterion is met; “asymptotic convergence” means that the algorithm converges to the true solution as the runtime increases to \(+\infty \).
It is sometimes desirable to choose a value of \(b\) exceeding the number of processing units \(p\). In this case, each stratum consists of \(p\) (\({<}b\)) interchangeable blocks, so that \(p\) blocks are processed per subepoch and an epoch comprises \(b^2/p\) subepochs.
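To make the block-scheduling arithmetic above concrete, the following sketch enumerates one possible schedule under the additional assumption (ours, for simplicity) that \(p\) divides \(b\): each of the \(b\) diagonal strata of \(b\) blocks is split into groups of \(p\) blocks that share no rows or columns of the blocking, giving \(b^2/p\) subepochs per epoch.

```python
def stratum_schedule(b, p):
    """Enumerate the subepochs of one stratified-SGD epoch for a b x b blocking
    on p processing units, assuming (for this sketch) that p divides b.

    Each subepoch is a list of p interchangeable blocks (no two blocks share a
    row or column of the blocking); one epoch comprises b*b/p subepochs and
    visits every block exactly once.
    """
    assert b % p == 0, "this sketch assumes p divides b"
    schedule = []
    for shift in range(b):                            # b diagonal strata of size b
        diagonal = [(i, (i + shift) % b) for i in range(b)]
        for start in range(0, b, p):                  # split each into groups of p
            schedule.append(diagonal[start:start + p])
    return schedule

# Example: b = 4, p = 2  ->  b^2/p = 8 subepochs of 2 interchangeable blocks each
print(stratum_schedule(4, 2))
```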
To facilitate comparison with other algorithms, we assume in Fig. 4b that input data and factor matrices are stored in main memory instead of in a distributed file system. Such an approach is also followed in the DSGD-MR implementation used in our experimental study. See [12] for details on a disk-based Hadoop implementation.
Note that our results w.r.t. the relative performance of PSGD and PCCD++ differ from the ones in [26], presumably due to our use of the bold driver heuristic.
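The bold driver heuristic (Battiti [2]) adapts the SGD step size once per epoch: the step size is increased slightly whenever the training loss decreased and cut back sharply whenever it increased. The sketch below illustrates the idea; the factors 1.05 and 0.5 are common illustrative choices, not necessarily the constants used in our experiments.

```python
def bold_driver(step_size, prev_loss, curr_loss, grow=1.05, shrink=0.5):
    """Adapt the SGD step size after an epoch with the bold driver heuristic:
    grow it slightly when the loss decreased, cut it sharply otherwise.
    The factors 1.05 and 0.5 are illustrative placeholders.
    """
    return step_size * (grow if curr_loss < prev_loss else shrink)
```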
References
Amatriain X, Basilico J (2012) Netflix recommendations: beyond the 5 stars (part 1). http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html
Battiti R (1989) Accelerated backpropagation learning: two optimization methods. Complex Syst 3:331–342
Bennett J, Lanning S (2007) The Netflix prize. In: Proceedings of the KDD cup workshop, pp 3–6
Byrd RH, Lu P, Nocedal J, Zhu C (1995) A limited memory algorithm for bound constrained optimization. SIAM J Sci Comput 16(5):1190–1208
Candes EJ, Recht B (2009) Exact matrix completion via convex optimization. Found Comput Math 9(6):717–772
Chen P-L, Tsai C-T, Chen Y-N, Chou K-C, Li C-L, Tsai C-H, Wu K-W, Chou Y-C, Li C-Y, Lin W-S, Yu S-H, Chiu R-B, Lin C-Y, Wang C-C, Wang P-W, Su W-L, Wu C-H, Kuo T-T, McKenzie T, Chang Y-H, Ferng C-S, Ni C-M, Lin H-T, Lin C-J, Lin S-D (2012) A linear ensemble of individual and blended models for music rating prediction. J Mach Learn Res Proc Track 18:21–60
Chu CT, Kim SK, Lin YA, Yu YY, Bradski G, Ng AY, Olukotun K (2006) Map-reduce for machine learning on multicore. In: Advances in neural information processing systems (NIPS), pp 281–288
Cichocki A, Phan AH (2009) Fast local algorithms for large scale nonnegative matrix and tensor factorizations. IEICE Trans 92–A(3):708–721
Das AS, Datar M, Garg A, Rajaram S (2007) Google news personalization: scalable online collaborative filtering. In: Proceedings of the international conference on World Wide Web (WWW), pp 271–280
Das S, Sismanis Y, Beyer KS, Gemulla R, Haas PJ, McPherson J (2010) Ricardo: integrating R and Hadoop. In: Proceedings of the ACM international conference on management of data (SIGMOD), pp 987–998
Dror G, Koenigstein N, Koren Y, Weimer M (2012) The Yahoo! Music dataset and KDD-Cup’11. J Mach Learn Res Proc Track 18:8–18
Gemulla R, Haas PJ, Nijkamp E, Sismanis Y (2011) Large-scale matrix factorization with distributed stochastic gradient descent. Technical Report RJ10481, IBM Almaden Research Center, San Jose, CA. http://researcher.watson.ibm.com/researcher/files/us-phaas/rj10482Updated.pdf
Gemulla R, Nijkamp E, Haas PJ, Sismanis Y (2011) Large-scale matrix factorization with distributed stochastic gradient descent. In: Proceedings of the ACM international conference on knowledge discovery and data mining (SIGKDD), pp 69–77
Hu Y, Koren Y, Volinsky C (2008) Collaborative filtering for implicit feedback datasets. In: Proceedings of the IEEE international conference on data mining (ICDM), pp 263–272
Koren Y, Bell R, Volinsky C (2009) Matrix factorization techniques for recommender systems. IEEE Comput 42(8):30–37
Li B, Tata S, Sismanis Y (2013) Sparkler: supporting large-scale matrix factorization. In: Proceedings of the international conference on extending database technology (EDBT), pp 625–636
Liu C, Yang H-C, Fan J, He L-W, Wang Y-M (2010) Distributed nonnegative matrix factorization for web-scale dyadic data analysis on MapReduce. In: Proceedings of the international conference on World Wide Web (WWW), pp 681–690
Mackey L, Talwalkar A, Jordan M (2011) Divide-and-conquer matrix factorization. In: Advances in neural information processing systems (NIPS), pp 1134–1142
McDonald R, Hall K, Mann G (2010) Distributed training strategies for the structured perceptron. In: Human language technologies, pp 456–464
MPI (2013) Message passing interface forum. http://www.mpi-forum.org
Niu F, Recht B, Ré C, Wright SJ (2011) Hogwild!: a lock-free approach to parallelizing stochastic gradient descent. In: Advances in neural information processing systems (NIPS), pp 693–701
Recht B, Ré C (2013) Parallel stochastic gradient algorithms for large-scale matrix completion. Math Progr Comput 5:201–226
Smola A, Narayanamurthy S (2010) An architecture for parallel topic models. Proc VLDB Endow 3(1–2):703–710
Teflioudi C, Makari F, Gemulla R (2012) Distributed matrix completion. In: Proceedings of the IEEE international conference on data mining (ICDM), pp 655–664
Tsitsiklis J, Bertsekas D, Athans M (1986) Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE Trans Autom Control 31(9):803–812
Yu H-F, Hsieh C-J, Si S, Dhillon IS (2012) Scalable coordinate descent approaches to parallel matrix factorization for recommender systems. In: Proceedings of the IEEE international conference on data mining (ICDM), pp 765–774
Zhou Y, Wilkinson D, Schreiber R, Pan R (2008) Large-scale parallel collaborative filtering for the Netflix prize. In: Proceedings of the international conference on algorithmic aspects in information and management (AAIM), vol 5034, pp 337–348
Additional information
This work was done while Yannis Sismanis was at the IBM Almaden Research Center.
Cite this article
Makari, F., Teflioudi, C., Gemulla, R. et al. Shared-memory and shared-nothing stochastic gradient descent algorithms for matrix completion. Knowl Inf Syst 42, 493–523 (2015). https://doi.org/10.1007/s10115-013-0718-7