Materialization Optimizations for Feature Selection Workloads

Authors:

Christopher Ré

ACM Transactions on Database Systems (TODS), Volume 41, Issue 1

Article No.: 2, Pages 1 - 32

https://doi.org/10.1145/2877204

Published: 24 February 2016

Abstract

There is an arms race in the data management industry to support statistical analytics. Feature selection, the process of selecting a feature set that will be used to build a statistical model, is widely regarded as the most critical step of statistical analytics. Thus, we argue that managing the feature selection process is a pressing data management challenge. We study this challenge by describing a feature selection language and a supporting prototype system that builds on top of current industrial R-integration layers. From our interactions with analysts, we learned that feature selection is an interactive human-in-the-loop process, which means that feature selection workloads are rife with reuse opportunities. Thus, we study how to materialize portions of this computation using not only classical database materialization optimizations but also methods that have not previously been used in database optimization, including structural decomposition methods (like QR factorization) and warmstart. These new methods have no analogue in traditional SQL systems, but they may be interesting for array and scientific database applications. On a diverse set of datasets and programs, we find that traditional database-style approaches that ignore these new opportunities are more than two orders of magnitude slower than an optimal plan in this new trade-off space across multiple R backends. Furthermore, we show that it is possible to build a simple cost-based optimizer to automatically select a near-optimal execution plan for feature selection.
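To make the QR-based reuse opportunity concrete, the following is a minimal sketch and not the paper's implementation. It assumes a least-squares model, a full-column-rank data matrix `A`, and a hypothetical feature subset `subset`; the helper names (`qr`, `least_squares`, `select_columns`) are illustrative only. The idea shown: factor `A = QR` once, then solve each feature-subset problem on the small matrix `R[:, S]` with target `Q^T b` instead of on the full `A[:, S]`.

```python
import random

def qr(A):
    """Thin QR via modified Gram-Schmidt. A is a list of rows with full column rank.
    Returns Q as a list of orthonormal columns and R as d x d upper-triangular rows."""
    n, d = len(A), len(A[0])
    cols = [[A[i][j] for i in range(n)] for j in range(d)]
    Q_cols, R = [], [[0.0] * d for _ in range(d)]
    for j in range(d):
        v = cols[j][:]
        for k, q in enumerate(Q_cols):
            R[k][j] = sum(qi * vi for qi, vi in zip(q, v))
            v = [vi - R[k][j] * qi for qi, vi in zip(q, v)]
        R[j][j] = sum(vi * vi for vi in v) ** 0.5
        Q_cols.append([vi / R[j][j] for vi in v])
    return Q_cols, R

def least_squares(A, b):
    """Solve min ||Ax - b||_2 by QR followed by back substitution."""
    Q_cols, R = qr(A)
    d = len(R)
    y = [sum(qi * bi for qi, bi in zip(q, b)) for q in Q_cols]  # Q^T b
    x = [0.0] * d
    for j in reversed(range(d)):
        s = y[j] - sum(R[j][k] * x[k] for k in range(j + 1, d))
        x[j] = s / R[j][j]
    return x

def select_columns(M, subset):
    """Keep only the columns of M listed in subset."""
    return [[row[j] for j in subset] for row in M]

# Synthetic data (an assumption for illustration): n examples, d features.
random.seed(0)
n, d = 200, 5
A = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
b = [random.gauss(0, 1) for _ in range(n)]

# Materialize once: the QR factors of the full matrix and the projected target.
Q_cols, R = qr(A)
Qtb = [sum(qi * bi for qi, bi in zip(q, b)) for q in Q_cols]

# Each feature-selection step then touches only d-row objects.
subset = [0, 2, 4]
direct = least_squares(select_columns(A, subset), b)    # n x |S| problem
reused = least_squares(select_columns(R, subset), Qtb)  # d x |S| problem

print(direct)
print(reused)
```

The two solutions agree up to floating-point error because `A[:, S] = Q R[:, S]` and `Q` has orthonormal columns, so the residual norms differ only by a constant that does not depend on the model. When the number of rows is much larger than the number of features, every subsequent subset solve works on a far smaller matrix, which is the kind of reuse the abstract refers to.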

Index Terms

Materialization Optimizations for Feature Selection Workloads
1. Information systems
  1. Data management systems
    1. Database management system engines

Published In

ACM Transactions on Database Systems Volume 41, Issue 1

Invited Paper from ICDT 2015, SIGMOD 2014, EDBT 2014 and Regular Papers

April 2016

287 pages

ISSN:0362-5915

EISSN:1557-4644

DOI:10.1145/2897141

Editor:
Christian S. Jensen
Aalborg University, Denmark

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Accepted: 01 November 2015

Revised: 01 September 2015

Received: 01 February 2015

Qualifiers

Research-article
Research
Refereed

Funding Sources

Defense Advanced Research Projects Agency (DARPA) XDATA
Google
Office of Naval Research (ONR)
DEFT
Toshiba
DARPA's MEMEX program and SIMPLEX program
National Science Foundation (NSF) CAREER
National Institute of Biomedical Imaging and Bioengineering (NIBIB)
trans-NIH Big Data to Knowledge
Sloan Research Fellowship
Moore Foundation
American Family Insurance
National Institutes of Health (NIH)
Microsoft Jim Gray Systems Lab
