
Communication-avoiding kernel ridge regression on parallel and distributed systems

  • Regular Paper
  • CCF Transactions on High Performance Computing

Abstract

Kernel ridge regression (KRR) is a fundamental method in machine learning. Given an n-by-d data matrix as input, a traditional implementation requires \(\Theta (n^2)\) memory to form an n-by-n kernel matrix and \(\Theta (n^3)\) flops to compute the final model. These time and storage costs prohibit KRR from scaling up to large datasets. For example, even on a relatively small dataset (a 520k-by-90 input requiring 357 MB), KRR needs 2 TB of memory just to store the kernel matrix, because n is usually much larger than d in real-world applications. Weak scaling is also a problem: if we keep d and n/p fixed as p (the number of machines) grows, the memory needed grows as \(\Theta (p)\) per processor and the flops as \(\Theta (p^2)\) per processor, whereas perfect weak scaling would keep both constant at \(\Theta (1)\) per processor. The traditional distributed KRR implementation (DKRR) achieves only 0.32% weak scaling efficiency when scaling from 96 to 1536 processors. We propose two new methods to address these problems: balanced KRR (BKRR) and K-means KRR (KKRR). Each method partitions the input dataset into p parts in a different way, trains p different models, and selects the best model among them. Compared to a conventional implementation, KKRR2 (the optimized version of KKRR) improves the weak scaling efficiency from 0.32% to 38% and achieves a 591\(\times \) speedup to reach the same accuracy using the same data and the same hardware (1536 processors). BKRR2 (the optimized version of BKRR) achieves higher accuracy than the current fastest method while using less training time on a variety of datasets. For applications that require only approximate solutions, BKRR2 improves the weak scaling efficiency to 92% and achieves a 3505\(\times \) speedup (theoretical speedup: 4096\(\times \)). The source code for this paper is available at https://people.eecs.berkeley.edu/~youyang/cakrr.zip.
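
To make the \(\Theta (n^2)\) memory and \(\Theta (n^3)\) flop costs above concrete, the sketch below shows a dense single-node KRR baseline together with the partition-then-select idea the abstract describes. It is a minimal illustration only, assuming NumPy, SciPy, and scikit-learn; the function names (krr_fit, krr_predict, partitioned_krr), the RBF kernel, and the hyperparameter defaults are hypothetical and do not correspond to the paper's distributed implementation.

    # Illustrative sketch only, not the paper's code.
    # Forming the full n-by-n kernel matrix needs Theta(n^2) memory, and the
    # Cholesky factorization needs Theta(n^3) flops: the costs quoted above.
    import numpy as np
    from scipy.spatial.distance import cdist
    from scipy.linalg import cho_factor, cho_solve
    from sklearn.cluster import KMeans

    def krr_fit(X, y, lam=1e-3, gamma=0.1):
        """Solve (K + lam * I) alpha = y with an RBF kernel."""
        K = np.exp(-gamma * cdist(X, X, "sqeuclidean"))   # n-by-n kernel matrix
        return cho_solve(cho_factor(K + lam * np.eye(len(X))), y)

    def krr_predict(X_train, alpha, X_test, gamma=0.1):
        """Evaluate the fitted model on test points."""
        return np.exp(-gamma * cdist(X_test, X_train, "sqeuclidean")) @ alpha

    def partitioned_krr(X, y, X_val, y_val, p=4, lam=1e-3, gamma=0.1):
        """High-level idea only: split the data into p parts (here by k-means,
        loosely mirroring KKRR), train one small KRR model per part, and keep
        the model with the lowest validation error. Each part touches only an
        (n/p)-by-(n/p) kernel matrix."""
        labels = KMeans(n_clusters=p, n_init=10).fit_predict(X)
        best_err, best_model = np.inf, None
        for c in range(p):
            Xc, yc = X[labels == c], y[labels == c]
            alpha = krr_fit(Xc, yc, lam, gamma)
            err = np.mean((krr_predict(Xc, alpha, X_val, gamma) - y_val) ** 2)
            if err < best_err:
                best_err, best_model = err, (Xc, alpha)
        return best_model

    # Back-of-the-envelope check for the 520k-sample example above:
    # 520000**2 * 8 bytes is about 2.2e12 bytes, i.e. roughly 2 TB just to store K.

Intuitively, because each of the p models sees only n/p samples, the per-part kernel storage is \(\Theta ((n/p)^2)\) rather than \(\Theta (n^2)\), which is what makes near-constant per-processor memory possible as p grows.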



References

  • Anderson, E., Bai, Z., Bischof, C., Blackford, S., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., Sorensen, D.: LAPACK Users’ Guide, vol. 9. SIAM (1999)

  • Barndorff-Nielsen, O.E., Shephard, N.: Econometric analysis of realized covariation: High frequency based covariance, regression, and correlation in financial economics. Econometrica 72(3), 885–925 (2004)

  • Bertin-Mahieux, T., Ellis, D.P., Whitman, B., Lamere, P.: The million song dataset. In: ISMIR, 2011: Proceedings of the 12th International Society for Music Information Retrieval Conference, October 24–28, 2011, Miami, Florida. University of Miami, pp. 591–596 (2011)

  • Blanchard, G., Krämer, N.: Optimal learning rates for kernel conjugate gradient regression. In: Advances in Neural Information Processing Systems, pp. 226–234 (2010)

  • Choi, J., Demmel, J., Dhillon, I., Dongarra, J., Ostrouchov, S., Petitet, A., Stanley, K., Walker, D., Whaley, R.C.: ScaLAPACK: a portable linear algebra library for distributed memory computers - design issues and performance. In: Applied Parallel Computing Computations in Physics, Chemistry and Engineering Science, pp. 95–106. Springer (1995)

  • Fine, S., Scheinberg, K.: Efficient SVM training using low-rank kernel representations. J. Mach. Learn. Res. 2, 243–264 (2002)

  • Forgy, E.W.: Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics 21, 768–769 (1965)

  • Hofmann, T., Schölkopf, B., Smola, A.J.: Kernel methods in machine learning. Ann. Stat. 36, 1171–1220 (2008)

  • Liao, W.-K.: Parallel k-means. [Online]. (2013). http://users.eecs.northwestern.edu/~wkliao/Kmeans/

  • Lin, C.-J.: LIBSVM machine learning regression repository. [Online]. (2017). https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

  • Lu, Y., Dhillon, P., Foster, D.P., Ungar, L.: Faster ridge regression via the subsampled randomized Hadamard transform. In: Advances in Neural Information Processing Systems, pp. 369–377 (2013)

  • March, W.B., Xiao, B., Biros, G.: ASKIT: approximate skeletonization kernel-independent treecode in high dimensions. SIAM J. Sci. Comput. 37(2), A1089–A1110 (2015)

  • NERSC.: NERSC Computational Systems. [Online]. (2016). https://www.nersc.gov/users/computational-systems/

  • SchedMD.: Slurm workload manager. [Online]. (2017). https://slurm.schedmd.com

  • Schmidt, M.: Least squares optimization with l1-norm regularization (2005)

  • Schölkopf, B., Smola, A., Müller, K.-R.: Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput. 10(5), 1299–1319 (1998)

  • Schreiber, R.: Solving eigenvalue and singular value problems on an undersized systolic array. SIAM J. Sci. Stat. Comput. 7, 441–451 (1986)

  • Si, S., Hsieh, C.-J., Dhillon, I.: Memory efficient kernel approximation. In: Proceedings of The 31st International Conference on Machine Learning, pp. 701–709 (2014)

  • Wehbe, L.: Kernel properties—convexity (2013)

  • Williams, C., Seeger, M.: Using the Nyström method to speed up kernel machines. In: Proceedings of the 14th Annual Conference on Neural Information Processing Systems, no. EPFL-CONF-161322, pp. 682–688 (2001)

  • Wright, N.J., Dosanjh, S.S., Andrews, A.K., Antypas, K.B., Draney, B., Canon, R.S., Cholia, S., Daley, C.S., Fagnan, K.M., Gerber, R.A., et al.: Cori: a pre-exascale supercomputer for big data and HPC applications. Big Data High Perform. Comput. 26, 82 (2015)

  • Yao, Y., Rosasco, L., Caponnetto, A.: On early stopping in gradient descent learning. Constr. Approx. 26(2), 289–315 (2007)

  • You, Y., Demmel, J., Czechowski, K., Song, L., Vuduc, R.: CA-SVM: communication-avoiding support vector machines on distributed systems. In: 2015 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 847–859. IEEE (2015)

  • You, Y., Demmel, J., Hsieh, C.-J., Vuduc, R.: Accurate, fast and scalable kernel ridge regression on parallel and distributed systems (2018). arXiv:1805.00569

  • You, Y., Demmel, J., Vuduc, R., Song, L., Czechowski, K.: Design and implementation of a communication-optimal classifier for distributed kernel support vector machines. In: IEEE Transactions on Parallel and Distributed Systems (2016)

  • You, Y., Gitman, I., Ginsburg, B.: Scaling SGD batch size to 32K for ImageNet training (2017). arXiv:1708.03888

  • You, Y., Li, J., Reddi, S., Hseu, J., Kumar, S., Bhojanapalli, S., Song, X., Demmel, J., Keutzer, K., Hsieh, C.-J.: Large batch optimization for deep learning: training BERT in 76 minutes (2019). arXiv:1904.00962

  • You, Y., Zhang, Z., Hsieh, C.-J., Demmel, J., Keutzer, K.: ImageNet training in minutes. In: Proceedings of the 47th International Conference on Parallel Processing, pp. 1–10 (2018)

  • Zhang, Y., Duchi, J., Wainwright, M.: Divide and conquer kernel ridge regression. In: Conference on Learning Theory, pp. 592–617 (2013)

Acknowledgements

In this project, Yang You was supported by the U.S. DOE Office of Science, Office of Advanced Scientific Computing Research, under Award Numbers DE-SC0008700 and AC02-05CH11231. Cho-Jui Hsieh acknowledges the support of NSF via IIS-1719097. This work was also supported by the Office of Advanced Scientific Computing Research, Applied Mathematics program, under Award Number DE-SC0010200. We would like to thank Prof. Inderjit Dhillon at UT Austin, Prof. Le Song at Georgia Tech, Prof. Martin Wainwright at UC Berkeley, and Dr. Yuchen Zhang at Stanford University for their discussions with us. The conference version of this paper (You et al. 2018b), titled Accurate, Fast and Scalable Kernel Ridge Regression on Parallel and Distributed Systems, was published at the 32nd ACM International Conference on Supercomputing (ICS 2018), June 12–15, 2018. To extend that work into a journal paper, we made the following major additions:

  • We added Sects. 5.4 and 5.5. Section 5.4 compares the computation and communication time of the different methods to support our communication-avoiding approach, and Sect. 5.5 shows the poor accuracy of the random partition method.
  • We added Sect. 6. Because many machine learning applications use GPU-based systems to accelerate computation, we also built our system on GPU clusters; our experiments show that our algorithms achieve good performance there.
  • We added Sect. 7, which gives a theoretical analysis of the parallel efficiency of our proposed algorithms to help readers better understand them.
  • We updated Sect. 8 to highlight the fundamental difference between our proposed algorithms and existing approaches.

Author information

Correspondence to Yang You.

About this article

Cite this article

You, Y., Huang, J., Hsieh, CJ. et al. Communication-avoiding kernel ridge regression on parallel and distributed systems. CCF Trans. HPC 3, 252–270 (2021). https://doi.org/10.1007/s42514-021-00078-5
