To Join or Not to Join?: Thinking Twice about Joins before Feature Selection

Authors:

Jeffrey Naughton,

Jignesh M. Patel,

Xiaojin Zhu

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data

Pages 19 - 34

https://doi.org/10.1145/2882903.2882952

Published: 14 June 2016

Abstract

Closer integration of machine learning (ML) with data processing is a booming area in both the data management industry and academia. Almost all ML toolkits assume that the input is a single table, but many datasets are not stored as single tables due to normalization. Thus, analysts often perform key-foreign key joins to obtain features from all base tables and apply a feature selection method, either explicitly or implicitly, with the aim of improving accuracy. In this work, we show that the features brought in by such joins can often be ignored without affecting ML accuracy significantly, i.e., we can "avoid joins safely." We identify the core technical issue that could cause accuracy to decrease in some cases and analyze this issue theoretically. Using simulations, we validate our analysis and measure the effects of various properties of normalized data on accuracy. We apply our analysis to design easy-to-understand decision rules to predict when it is safe to avoid joins in order to help analysts exploit this runtime-accuracy trade-off. Experiments with multiple real normalized datasets show that our rules are able to accurately predict when joins can be avoided safely, and in some cases, this led to significant reductions in the runtime of some popular feature selection methods.



Published In

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data

June 2016

2300 pages

ISBN:9781450335317

DOI:10.1145/2882903

General Chairs: Fatma Özcan (IBM Research, USA), Georgia Koutrika (HP Labs, USA)

Program Chair: Sam Madden (Massachusetts Institute of Technology, USA)

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

Research-article

Funding Sources

Microsoft Jim Gray Systems Lab

Conference

SIGMOD/PODS'16: International Conference on Management of Data

Sponsor: SIGMOD

June 26 - July 1, 2016

San Francisco, California, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

