DOI: 10.1145/2882903.2882939
research-article

Learning Linear Regression Models over Factorized Joins

Published: 14 June 2016

Abstract

We investigate the problem of building least squares regression models over training datasets defined by arbitrary join queries on database tables. Our key observation is that joins entail a high degree of redundancy in both computation and data representation, which is not required for the end-to-end solution to learning over joins.
We propose a new paradigm for computing batch gradient descent that exploits the factorized computation and representation of the training datasets, a rewriting of the regression objective function that decouples the computation of cofactors of model parameters from their convergence, and the commutativity of cofactor computation with relational union and projection. We introduce three flavors of this approach: F/FDB computes the cofactors in one pass over the materialized factorized join; F avoids this materialization and intermixes cofactor and join computation; F/SQL expresses this mixture as one SQL query.
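The decoupling of cofactor computation from convergence can be illustrated with a small sketch. It is not the paper's implementation: the function names (`cofactors`, `add_cofactors`, `gradient_descent`), the step size, and the iteration count are assumptions made for illustration, and the data is a flat list of rows rather than a factorized join. The sketch shows only that the least-squares gradient depends on the data through pairwise sums of column products, so these sums can be computed once and the descent loop never touches the rows again.

```python
# Illustrative sketch only; all names and constants here are hypothetical.

def cofactors(rows):
    """One pass over the data: accumulate sum(x_i * x_j) for every column pair.

    Each row is (feature_1, ..., feature_d, label); the label is the last column.
    """
    n = len(rows[0])
    c = [[0.0] * n for _ in range(n)]
    for r in rows:
        for i in range(n):
            for j in range(n):
                c[i][j] += r[i] * r[j]
    return c


def add_cofactors(a, b):
    """Cofactors of a union of disjoint row sets are the entrywise sum."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def gradient_descent(c, num_rows, step=0.1, iters=5000):
    """Batch gradient descent that reads only the cofactor matrix, never the rows."""
    d = len(c) - 1  # number of features; index d is the label column
    theta = [0.0] * d
    for _ in range(iters):
        # Gradient of (1 / 2m) * sum((theta . x - y)^2), written via cofactors.
        grad = [
            (sum(theta[k] * c[j][k] for k in range(d)) - c[j][d]) / num_rows
            for j in range(d)
        ]
        theta = [t - step * g for t, g in zip(theta, grad)]
    return theta


# Toy data generated from y = 2 + 3x, with a constant feature for the intercept.
rows = [(1.0, 0.0, 2.0), (1.0, 1.0, 5.0), (1.0, 2.0, 8.0), (1.0, 3.0, 11.0)]
c_all = cofactors(rows)
c_union = add_cofactors(cofactors(rows[:2]), cofactors(rows[2:]))
theta = gradient_descent(c_all, len(rows))
```

Here `c_union` equals `c_all`, which is the sense in which cofactor computation commutes with union in this toy setting, and `theta` approaches the generating parameters (2, 3).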
Our approach has the complexity of join factorization, which can be exponentially lower than that of standard joins. Experiments with commercial, public, and synthetic datasets show that it outperforms MADlib, Python StatsModels, and R by up to three orders of magnitude.
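To give an intuition for why factorized computation can be cheaper than working on the flat join, the following sketch compares two ways of computing one cofactor-style aggregate, SUM(B * C), over the join of two relations R(A, B) and S(A, C). The relations, function names, and the choice of aggregate are assumptions made for illustration and are far simpler than the queries the paper handles. For each join key the result is a Cartesian product of B-values and C-values, so the sum of products can be rewritten as a product of sums.

```python
from collections import defaultdict
from itertools import product

# Illustrative sketch only; relation layout and names are hypothetical.


def group_by_key(rel):
    """Group the second column of a binary relation by its first column."""
    groups = defaultdict(list)
    for key, value in rel:
        groups[key].append(value)
    return groups


def sum_bc_flat(r, s):
    """Enumerate every joined tuple, then aggregate: cost grows with the join size."""
    rg, sg = group_by_key(r), group_by_key(s)
    total = 0
    for a in rg.keys() & sg.keys():
        for b, c in product(rg[a], sg[a]):
            total += b * c
    return total


def sum_bc_factorized(r, s):
    """Use sum(b * c) = sum(b) * sum(c) per join key: cost grows with the input size."""
    rg, sg = group_by_key(r), group_by_key(s)
    return sum(sum(rg[a]) * sum(sg[a]) for a in rg.keys() & sg.keys())


r = [(1, 2), (1, 3), (2, 5)]
s = [(1, 10), (1, 20), (2, 7), (3, 1)]
flat = sum_bc_flat(r, s)
fact = sum_bc_factorized(r, s)
```

Both functions return the same value, but the factorized version never enumerates the joined tuples, which is where the savings come from when a key matches many rows on both sides.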

Published In

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data
June 2016
2300 pages
ISBN:9781450335317
DOI:10.1145/2882903
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. factorized databases
  2. join processing
  3. linear regression

Qualifiers

  • Research-article

Funding Sources

  • ERC

Conference

SIGMOD/PODS'16: International Conference on Management of Data
June 26 - July 1, 2016
San Francisco, California, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Cited By

  • (2024) Improved Approximation Algorithms for Relational Clustering. Proceedings of the ACM on Management of Data 2:5 (1-27). DOI: 10.1145/3695831. Online publication date: 7-Nov-2024
  • (2024) In-Database Data Imputation. Proceedings of the ACM on Management of Data 2:1 (1-27). DOI: 10.1145/3639326. Online publication date: 26-Mar-2024
  • (2024) Amalur: The Convergence of Data Integration and Machine Learning. IEEE Transactions on Knowledge and Data Engineering 36:12 (7353-7367). DOI: 10.1109/TKDE.2024.3357389. Online publication date: Dec-2024
  • (2024) Stochastic gradient descent without full data shuffle: with applications to in-database machine learning and deep learning systems. The VLDB Journal 33:5 (1231-1255). DOI: 10.1007/s00778-024-00845-0. Online publication date: 12-Apr-2024
  • (2023) JoinBoost: Grow Trees over Normalized Data Using Only SQL. Proceedings of the VLDB Endowment 16:11 (3071-3084). DOI: 10.14778/3611479.3611509. Online publication date: 24-Aug-2023
  • (2023) Saibot: A Differentially Private Data Search Platform. Proceedings of the VLDB Endowment 16:11 (3057-3070). DOI: 10.14778/3611479.3611508. Online publication date: 24-Aug-2023
  • (2023) ADOPT: Adaptively Optimizing Attribute Orders for Worst-Case Optimal Join Algorithms via Reinforcement Learning. Proceedings of the VLDB Endowment 16:11 (2805-2817). DOI: 10.14778/3611479.3611489. Online publication date: 24-Aug-2023
  • (2023) Query Evaluation under Differential Privacy. ACM SIGMOD Record 52:3 (6-17). DOI: 10.1145/3631504.3631506. Online publication date: 2-Nov-2023
  • (2023) Lightweight Materialization for Fast Dashboards Over Joins. Proceedings of the ACM on Management of Data 1:4 (1-27). DOI: 10.1145/3626735. Online publication date: 12-Dec-2023
  • (2023) Aggregation Consistency Errors in Semantic Layers and How to Avoid Them. Proceedings of the Workshop on Human-In-the-Loop Data Analytics (1-7). DOI: 10.1145/3597465.3605224. Online publication date: 18-Jun-2023
