DOI:10.1145/2799562.2799563
research-article

Speculative Approximations for Terascale Distributed Gradient Descent Optimization

Published: 31 May 2015

Abstract

Model calibration is a major challenge faced by the plethora of statistical analytics packages that are increasingly used in Big Data applications. Identifying the optimal model parameters is a time-consuming process that, even for experienced data scientists, has to be executed from scratch for every dataset/model combination. We argue that the principal causes are the inability to evaluate multiple parameter configurations simultaneously and the lack of support for quickly identifying sub-optimal configurations.
In this paper, we develop two database-inspired techniques for efficient model calibration. Speculative parameter testing applies advanced parallel multi-query processing methods to evaluate several configurations concurrently. Online aggregation identifies sub-optimal configurations early in the processing by incrementally sampling the training dataset and estimating the objective function corresponding to each configuration. We design concurrent online aggregation estimators and define halting conditions that stop the execution accurately and in a timely manner.
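As a rough illustration of how the two techniques combine, the sketch below scans a random permutation of the training data once, updates a running estimate of the mean logistic loss for every candidate configuration from the same samples, and drops a configuration as soon as its confidence interval lies entirely above that of the best one. This is not the paper's estimator: the normal-approximation intervals, the pruning rule, the batch size, and the names `prune_by_sampling` and `make_data` are all assumptions made for the example.

```python
import math
import random


def logistic_loss(w, x, y):
    """Logistic loss of model w on one example; y is -1 or +1."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    # Numerically stable log(1 + exp(-margin)).
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def prune_by_sampling(data, configs, batch=200, z=1.96, seed=0):
    """Scan a random permutation of data once, shared by all configurations.

    Returns (sorted surviving configuration indices, examples scanned).
    """
    order = list(range(len(data)))
    random.Random(seed).shuffle(order)
    mean = [0.0] * len(configs)  # running mean loss per configuration
    m2 = [0.0] * len(configs)    # running sum of squared deviations (Welford)
    alive = set(range(len(configs)))
    n = 0
    for start in range(0, len(order), batch):
        for idx in order[start:start + batch]:
            x, y = data[idx]
            n += 1
            for c in alive:  # every surviving configuration reuses the sample
                loss = logistic_loss(configs[c], x, y)
                delta = loss - mean[c]
                mean[c] += delta / n
                m2[c] += delta * (loss - mean[c])
        if n < 2:
            continue
        # Assumed estimator: normal-approximation interval on each mean loss.
        half = {c: z * math.sqrt(m2[c] / (n - 1) / n) for c in alive}
        best_upper = min(mean[c] + half[c] for c in alive)
        # Drop configurations whose lower bound exceeds the best upper bound.
        alive = {c for c in alive if mean[c] - half[c] <= best_upper}
        if len(alive) == 1:  # assumed halting condition: one candidate left
            break
    return sorted(alive), n


def make_data(n, w_true, seed=1):
    """Hypothetical synthetic data: labels are the sign of x . w_true."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in w_true]
        y = 1 if sum(a * b for a, b in zip(w_true, x)) >= 0 else -1
        data.append((x, y))
    return data
```

Calling `prune_by_sampling(make_data(5000, [2.0, -1.0]), configs)` with a handful of candidate weight vectors returns the indices that survive and the number of examples read before halting. The intervals here are computed independently per configuration, which is a simplification relative to the concurrent estimators the abstract describes.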
We apply the proposed techniques to distributed gradient descent optimization, both batch and incremental, for support vector machines and logistic regression models. We implement the resulting solutions in GLADE PF-OLA, a state-of-the-art Big Data analytics system, and evaluate their performance over terascale-size synthetic and real datasets. The results confirm that as many as 32 configurations can be evaluated concurrently almost as fast as one, while sub-optimal configurations are detected accurately in as little as 1/20th of the time.
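One plausible reading of speculative testing for batch gradient descent is sketched below for logistic regression: at every iteration, several step sizes produce candidate models, and a single shared scan of the data computes the loss and gradient of all candidates at once, so the extra configurations cost computation but no additional data passes. The single-machine setting, the choice of step size as the speculated parameter, and the names `shared_scan` and `speculative_bgd` are assumptions for illustration, not the GLADE PF-OLA implementation.

```python
import math


def loss_and_grad(w, x, y):
    """Logistic loss and its gradient for one example; y is -1 or +1."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    if margin > 0:
        e = math.exp(-margin)
        loss, s = math.log1p(e), e / (1.0 + e)
    else:
        e = math.exp(margin)
        loss, s = -margin + math.log1p(e), 1.0 / (1.0 + e)
    # s equals sigmoid(-margin); the gradient of the loss is -y * s * x.
    return loss, [-y * s * xi for xi in x]


def shared_scan(data, models):
    """One pass over data serving every model: (mean loss, mean gradient) each."""
    dim = len(models[0])
    losses = [0.0] * len(models)
    grads = [[0.0] * dim for _ in models]
    for x, y in data:                    # each example is read once ...
        for k, w in enumerate(models):   # ... and reused by all candidates
            loss, g = loss_and_grad(w, x, y)
            losses[k] += loss
            for j in range(dim):
                grads[k][j] += g[j]
    n = len(data)
    return [(losses[k] / n, [gj / n for gj in grads[k]])
            for k in range(len(models))]


def speculative_bgd(data, dim, step_sizes, iterations=20):
    """Batch gradient descent that tests all step sizes in each data pass."""
    w = [0.0] * dim
    loss, grad = shared_scan(data, [w])[0]
    history = [loss]
    for _ in range(iterations):
        # One candidate model per speculated step size.
        candidates = [[wi - a * gi for wi, gi in zip(w, grad)]
                      for a in step_sizes]
        results = shared_scan(data, candidates)
        best = min(range(len(candidates)), key=lambda k: results[k][0])
        if results[best][0] >= loss:  # no candidate improves: stop
            break
        w = candidates[best]
        loss, grad = results[best]
        history.append(loss)
    return w, history
```

For example, `speculative_bgd(data, 2, [0.01, 0.1, 1.0])` on a list of `(features, label)` pairs returns the final weights and the sequence of mean losses, which decreases at every accepted step because a candidate is only adopted when it lowers the loss.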

Published In

DanaC'15: Proceedings of the Fourth Workshop on Data analytics in the Cloud
May 2015
29 pages
ISBN:9781450337243
DOI:10.1145/2799562

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SIGMOD/PODS'15: International Conference on Management of Data
May 31 - June 4, 2015
Melbourne, VIC, Australia

Acceptance Rates

DanaC'15 Paper Acceptance Rate 4 of 6 submissions, 67%;
Overall Acceptance Rate 19 of 34 submissions, 56%
