ABSTRACT
Stochastic Dual Coordinate Ascent (SDCA) has recently emerged as a state-of-the-art method for solving large-scale supervised learning problems formulated as minimization of convex loss functions. It performs iterative, random-coordinate updates to maximize the dual objective. Due to the sequential nature of the iterations, it is typically implemented as a single-threaded algorithm limited to in-memory datasets. In this paper, we introduce an asynchronous parallel version of the algorithm, analyze its convergence properties, and propose a solution for primal-dual synchronization required to achieve convergence in practice. In addition, we describe a method for scaling the algorithm to out-of-memory datasets via multi-threaded deserialization of block-compressed data. This approach yields sufficient pseudo-randomness to provide the same convergence rate as random-order in-memory access. Empirical evaluation demonstrates the efficiency of the proposed methods and their ability to fully utilize computational resources and scale to out-of-memory datasets.
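For readers unfamiliar with the basic SDCA iteration that the paper parallelizes, the following is a minimal, sequential sketch for the squared loss with L2 regularization. It is an illustrative assumption-laden example, not the paper's asynchronous multi-threaded implementation: the function name, the toy data, and the single-threaded loop are ours, and the primal-dual synchronization scheme described in the paper is not reproduced here.

```python
import numpy as np

def sdca_squared_loss(X, y, lam=1e-3, epochs=10, seed=0):
    """Minimal sequential SDCA sketch for L2-regularized least squares.

    Primal:  min_w  (1/n) * sum_i 0.5*(x_i @ w - y_i)^2 + (lam/2)*||w||^2
    Maintains the primal-dual relation  w = (1/(lam*n)) * sum_i alpha_i * x_i.
    """
    n, d = X.shape
    alpha = np.zeros(n)           # dual variables, one per training example
    w = np.zeros(d)               # primal vector kept in sync with alpha
    sq_norms = (X ** 2).sum(axis=1)
    rng = np.random.default_rng(seed)

    for _ in range(epochs):
        for i in rng.permutation(n):          # random-order coordinate updates
            # Closed-form maximization of the dual objective in coordinate alpha_i.
            delta = (y[i] - X[i] @ w - alpha[i]) / (1.0 + sq_norms[i] / (lam * n))
            alpha[i] += delta
            w += delta * X[i] / (lam * n)     # keep w consistent with alpha
    return w, alpha

# Toy usage: recover the weights of a small synthetic regression problem.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((200, 5))
    w_true = rng.standard_normal(5)
    y = X @ w_true + 0.01 * rng.standard_normal(200)
    w, _ = sdca_squared_loss(X, y, lam=1e-3, epochs=20)
    print(np.round(w - w_true, 3))            # should be close to zero
```

The asynchronous variant studied in the paper lets multiple threads run the inner update concurrently on shared `alpha` and `w`, which is what makes explicit primal-dual synchronization necessary in practice.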