DOI: 10.1145/2882903.2882952
research-article

To Join or Not to Join?: Thinking Twice about Joins before Feature Selection

Published: 14 June 2016

Abstract

Closer integration of machine learning (ML) with data processing is a booming area in both the data management industry and academia. Almost all ML toolkits assume that the input is a single table, but many datasets are not stored as single tables due to normalization. Thus, analysts often perform key-foreign key joins to obtain features from all base tables and apply a feature selection method, either explicitly or implicitly, with the aim of improving accuracy. In this work, we show that the features brought in by such joins can often be ignored without affecting ML accuracy significantly, i.e., we can "avoid joins safely." We identify the core technical issue that could cause accuracy to decrease in some cases and analyze this issue theoretically. Using simulations, we validate our analysis and measure the effects of various properties of normalized data on accuracy. We apply our analysis to design easy-to-understand decision rules to predict when it is safe to avoid joins in order to help analysts exploit this runtime-accuracy trade-off. Experiments with multiple real normalized datasets show that our rules are able to accurately predict when joins can be avoided safely, and in some cases, this led to significant reductions in the runtime of some popular feature selection methods.
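To make the idea of "avoiding a join" concrete, here is a minimal sketch on a toy schema. The tables, column names, and helper functions are all hypothetical and are not taken from the paper. The sketch only illustrates one property of key-foreign key joins: every feature the join brings in is functionally determined by the foreign key, so the foreign key already identifies those values when it is kept as a feature. It does not implement the paper's decision rules, which the abstract does not spell out.

```python
# Hypothetical toy data: an entity table with a foreign key (employer_id)
# and an attribute table keyed by that same id.
customers = [
    {"customer_id": 1, "age_band": "30s", "employer_id": "e1", "churned": 0},
    {"customer_id": 2, "age_band": "40s", "employer_id": "e1", "churned": 1},
    {"customer_id": 3, "age_band": "30s", "employer_id": "e2", "churned": 1},
    {"customer_id": 4, "age_band": "50s", "employer_id": "e2", "churned": 0},
]
employers = {
    "e1": {"state": "WI", "sector": "tech"},
    "e2": {"state": "CA", "sector": "retail"},
}


def kfk_join(entity_rows, attribute_table, fk):
    """Key-foreign key join: append the referenced row's columns to each entity row."""
    return [{**row, **attribute_table[row[fk]]} for row in entity_rows]


def is_determined_by(rows, determinant, dependent):
    """Return True if the functional dependency determinant -> dependent holds on rows."""
    seen = {}
    for row in rows:
        # Remember the first dependent value seen for each determinant value.
        value = seen.setdefault(row[determinant], row[dependent])
        if value != row[dependent]:
            return False
    return True


joined = kfk_join(customers, employers, "employer_id")
foreign_features = ["state", "sector"]

# Every feature brought in by the join is a function of the foreign key.
fd_holds = all(
    is_determined_by(joined, "employer_id", feature) for feature in foreign_features
)

# Two candidate inputs for feature selection: with and without the join.
features_with_join = ["age_band", "employer_id"] + foreign_features
features_without_join = ["age_band", "employer_id"]
```

Whether dropping the joined features is actually safe for accuracy is the question the paper studies; this sketch only shows why the foreign key can stand in for them as a representation.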




Published In

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data
June 2016
2300 pages
ISBN:9781450335317
DOI:10.1145/2882903
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. VC dimension
  2. advanced analytics
  3. feature engineering
  4. feature selection
  5. functional dependencies
  6. key-foreign key joins
  7. machine learning

Qualifiers

  • Research-article

Funding Sources

  • Microsoft Jim Gray Systems Lab

Conference

SIGMOD/PODS'16: International Conference on Management of Data
June 26 - July 1, 2016
San Francisco, California, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%


Article Metrics

  • Downloads (Last 12 months): 82
  • Downloads (Last 6 weeks): 9
Reflects downloads up to 01 Mar 2025


Cited By

  • (2025) Data-centric Artificial Intelligence: A Survey. ACM Computing Surveys, 57:5 (1-42). DOI: 10.1145/3711118. Online publication date: 24-Jan-2025
  • (2025) Data augmentation with automated machine learning: approaches and performance comparison with classical data augmentation methods. Knowledge and Information Systems. DOI: 10.1007/s10115-025-02349-x. Online publication date: 22-Feb-2025
  • (2024) Enriching Relations with Additional Attributes for ER. Proceedings of the VLDB Endowment, 17:11 (3109-3123). DOI: 10.14778/3681954.3681987. Online publication date: 30-Aug-2024
  • (2024) LakeBench: A Benchmark for Discovering Joinable and Unionable Tables in Data Lakes. Proceedings of the VLDB Endowment, 17:8 (1925-1938). DOI: 10.14778/3659437.3659448. Online publication date: 1-Apr-2024
  • (2024) Key Insights from a Feature Discovery User Study. Proceedings of the 2024 Workshop on Human-In-the-Loop Data Analytics (1-5). DOI: 10.1145/3665939.3665961. Online publication date: 18-Jun-2024
  • (2024) Graph Machine Learning Meets Multi-Table Relational Data. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (6502-6512). DOI: 10.1145/3637528.3671471. Online publication date: 25-Aug-2024
  • (2024) Human-in-the-Loop Feature Discovery for Tabular Data. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (5215-5219). DOI: 10.1145/3627673.3679211. Online publication date: 21-Oct-2024
  • (2024) A Large Scale Test Corpus for Semantic Table Search. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (1142-1151). DOI: 10.1145/3626772.3657877. Online publication date: 10-Jul-2024
  • (2024) Robust Data-centric Graph Structure Learning for Text Classification. Companion Proceedings of the ACM Web Conference 2024 (1486-1495). DOI: 10.1145/3589335.3651915. Online publication date: 13-May-2024
  • (2024) Mitigating Data Scarcity in Supervised Machine Learning Through Reinforcement Learning Guided Data Generation. 2024 IEEE 40th International Conference on Data Engineering (ICDE) (3613-3626). DOI: 10.1109/ICDE60146.2024.00278. Online publication date: 13-May-2024
