DOI: 10.1145/2783258.2783412

Scaling Up Stochastic Dual Coordinate Ascent

Published: 10 August 2015

ABSTRACT

Stochastic Dual Coordinate Ascent (SDCA) has recently emerged as a state-of-the-art method for solving large-scale supervised learning problems formulated as minimization of convex loss functions. It performs iterative, random-coordinate updates to maximize the dual objective. Due to the sequential nature of the iterations, it is typically implemented as a single-threaded algorithm limited to in-memory datasets. In this paper, we introduce an asynchronous parallel version of the algorithm, analyze its convergence properties, and propose a solution for primal-dual synchronization required to achieve convergence in practice. In addition, we describe a method for scaling the algorithm to out-of-memory datasets via multi-threaded deserialization of block-compressed data. This approach yields sufficient pseudo-randomness to provide the same convergence rate as random-order in-memory access. Empirical evaluation demonstrates the efficiency of the proposed methods and their ability to fully utilize computational resources and scale to out-of-memory datasets.
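
To make the dual coordinate update behind SDCA concrete, the following is a minimal, single-threaded sketch for the L2-regularized hinge loss, using the standard closed-form coordinate-wise update of Shalev-Shwartz and Zhang. It is an illustration only: the function name, parameters, and synthetic data are assumptions of this sketch, and it deliberately omits the paper's contributions (asynchronous parallel updates, primal-dual synchronization, and streaming of block-compressed data).

import numpy as np

def sdca_hinge(X, y, lam=1e-3, epochs=10, seed=0):
    """Minimal single-threaded SDCA for the L2-regularized hinge loss.

    Maintains dual variables alpha and the primal vector
    w = (1 / (lam * n)) * sum_i alpha_i * x_i, updating one randomly
    chosen dual coordinate per step with its closed-form maximizer.
    """
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)
    sq_norms = np.einsum("ij,ij->i", X, X)   # per-example squared norms
    rng = np.random.default_rng(seed)

    for _ in range(epochs):
        for i in rng.permutation(n):          # random-order pass over the data
            if sq_norms[i] == 0.0:
                continue
            margin = y[i] * X[i].dot(w)
            # Closed-form maximizer of the dual increase for coordinate i.
            new_alpha_y = np.clip(
                alpha[i] * y[i] + (1.0 - margin) / (sq_norms[i] / (lam * n)),
                0.0, 1.0)
            delta = new_alpha_y * y[i] - alpha[i]
            alpha[i] += delta
            w += (delta / (lam * n)) * X[i]   # keep primal w consistent with alpha
    return w, alpha

# Illustrative usage on synthetic data (not from the paper).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 20))
    w_true = rng.normal(size=20)
    y = np.sign(X.dot(w_true))
    w, _ = sdca_hinge(X, y)
    print("training accuracy:", np.mean(np.sign(X.dot(w)) == y))

Each step touches one randomly chosen example, updates its dual variable in closed form, and keeps the primal vector w consistent with the duals; running many such updates concurrently across threads, as the paper's asynchronous variant does, is what creates the primal-dual synchronization issue the abstract refers to.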


Supplemental Material

p1185.mp4 (MP4, 167.9 MB)

    • Published in

KDD '15: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
      August 2015
      2378 pages
      ISBN: 9781450336642
      DOI: 10.1145/2783258

      Copyright © 2015 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Qualifiers

      • research-article

      Acceptance Rates

      KDD '15 Paper Acceptance Rate: 160 of 819 submissions, 20%
      Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%
