Learning Linear Regression Models over Factorized Joins

Authors:

Maximilian Schleich,

Radu Ciucanu

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data

Pages 3 - 18

https://doi.org/10.1145/2882903.2882939

Published: 14 June 2016

Abstract

We investigate the problem of building least squares regression models over training datasets defined by arbitrary join queries on database tables. Our key observation is that joins entail a high degree of redundancy in both computation and data representation, which is not required for the end-to-end solution to learning over joins.

We propose a new paradigm for computing batch gradient descent that exploits the factorized computation and representation of the training datasets, a rewriting of the regression objective function that decouples the computation of cofactors of model parameters from their convergence, and the commutativity of cofactor computation with relational union and projection. We introduce three flavors of this approach: F/FDB computes the cofactors in one pass over the materialized factorized join; F avoids this materialization and intermixes cofactor and join computation; F/SQL expresses this mixture as one SQL query.
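The decoupling of cofactors from convergence can be illustrated with a minimal Python sketch. This is not the authors' implementation: it assumes the standard least squares objective, whose gradient depends on the data only through the sums of x_j * x_k and x_j * y (the cofactors). The function names `cofactors`, `merge_cofactors`, and `bgd_from_cofactors`, the learning rate, the step count, and the toy data are all hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple

Row = Tuple[Sequence[float], float]  # (features incl. intercept, label)
Cofactors = Tuple[List[List[float]], List[float], int]


def cofactors(rows: Sequence[Row]) -> Cofactors:
    """One pass over the data: C[j][k] = sum x_j*x_k, b[j] = sum x_j*y."""
    n = len(rows[0][0])
    C = [[0.0] * n for _ in range(n)]
    b = [0.0] * n
    for x, y in rows:
        for j in range(n):
            b[j] += x[j] * y
            for k in range(n):
                C[j][k] += x[j] * x[k]
    return C, b, len(rows)


def merge_cofactors(p: Cofactors, q: Cofactors) -> Cofactors:
    """Cofactors of a bag union of two partitions are element-wise sums."""
    (C1, b1, m1), (C2, b2, m2) = p, q
    C = [[u + v for u, v in zip(r1, r2)] for r1, r2 in zip(C1, C2)]
    b = [u + v for u, v in zip(b1, b2)]
    return C, b, m1 + m2


def bgd_from_cofactors(cof: Cofactors, alpha: float = 0.1,
                       steps: int = 5000) -> List[float]:
    """Batch gradient descent that never touches the data again."""
    C, b, m = cof
    n = len(b)
    theta = [0.0] * n
    for _ in range(steps):
        # Gradient of (1/2m) * sum (theta.x - y)^2, written via cofactors.
        grad = [(sum(C[j][k] * theta[k] for k in range(n)) - b[j]) / m
                for j in range(n)]
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta


# Toy data (assumed): y = 1 + 2x, with a constant 1.0 intercept feature.
rows = [((1.0, float(x)), 1.0 + 2.0 * x) for x in range(5)]
theta = bgd_from_cofactors(cofactors(rows))
```

In this sketch each descent step costs time quadratic in the number of features and is independent of the number of training rows, because the data is scanned only once to build the cofactors. `merge_cofactors` shows, for this simplified setting, why cofactors can be computed per partition and then added.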

Our approach has the complexity of join factorization, which can be exponentially lower than that of standard joins. Experiments with commercial, public, and synthetic datasets show that it outperforms MADlib, Python StatsModels, and R by up to three orders of magnitude.
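The gap between factorized and flat join computation can be seen on a single aggregate. The following Python sketch is a simplified illustration, not the paper's algorithm: it assumes two hypothetical relations R(a, x) and S(a, y) joined on a, and computes SUM(x * y) over the join, which is the kind of sum a cofactor over the join would need. The relation contents and function names are invented for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Relation = List[Tuple[int, float]]  # (join key a, numeric value)


def flat_sum_xy(R: Relation, S: Relation) -> float:
    """Enumerate every tuple of the flat join, then aggregate."""
    return sum(x * y for a, x in R for b, y in S if a == b)


def factorized_sum_xy(R: Relation, S: Relation) -> float:
    """Use the factorized form: for each key a, the join is R_a x S_a,
    so SUM(x*y) over it equals SUM(x over R_a) * SUM(y over S_a)."""
    sum_x: Dict[int, float] = defaultdict(float)
    sum_y: Dict[int, float] = defaultdict(float)
    for a, x in R:
        sum_x[a] += x
    for a, y in S:
        sum_y[a] += y
    # Only keys present in both relations contribute to the join.
    return sum(sum_x[a] * sum_y[a] for a in sum_x if a in sum_y)


# Assumed toy relations; key 3 in S has no join partner in R.
R = [(1, 2.0), (1, 3.0), (2, 5.0)]
S = [(1, 10.0), (1, 20.0), (2, 1.0), (3, 7.0)]

flat = flat_sum_xy(R, S)
fact = factorized_sum_xy(R, S)
```

The flat version does work proportional to the number of joined tuples, while the factorized version makes one pass over each input because the sum distributes over the per-key product. This two-relation case only shows a quadratic versus linear gap; the exponential gap mentioned in the abstract concerns joins over more relations.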

Information

Published In

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data

June 2016

2300 pages

ISBN:9781450335317

DOI:10.1145/2882903

General Chairs: Fatma Özcan (IBM Research, USA), Georgia Koutrika (HP Labs, USA)

Program Chair: Sam Madden (Massachusetts Institute of Technology, USA)

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 14 June 2016

Qualifiers

Research-article

Funding Sources

ERC

Conference

SIGMOD/PODS'16

Sponsor:

SIGMOD

SIGMOD/PODS'16: International Conference on Management of Data

June 26 - July 1, 2016

San Francisco, California, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%
