
Materialization Optimizations for Feature Selection Workloads

Published: 24 February 2016

Abstract

There is an arms race in the data management industry to support statistical analytics. Feature selection, the process of selecting a feature set that will be used to build a statistical model, is widely regarded as the most critical step of statistical analytics. Thus, we argue that managing the feature selection process is a pressing data management challenge. We study this challenge by describing a feature selection language and a supporting prototype system that builds on top of current industrial R-integration layers. From our interactions with analysts, we learned that feature selection is an interactive human-in-the-loop process, which means that feature selection workloads are rife with reuse opportunities. Thus, we study how to materialize portions of this computation using not only classical database materialization optimizations but also methods that have not previously been used in database optimization, including structural decomposition methods (like QR factorization) and warmstart. These new methods have no analogue in traditional SQL systems, but they may be interesting for array and scientific database applications. On a diverse set of datasets and programs, we find that traditional database-style approaches that ignore these new opportunities are more than two orders of magnitude slower than an optimal plan in this new trade-off space across multiple R backends. Furthermore, we show that it is possible to build a simple cost-based optimizer to automatically select a near-optimal execution plan for feature selection.
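
The QR-based reuse mentioned in the abstract can be illustrated for least-squares regression. The block below is a minimal sketch, not the authors' implementation. It assumes a dense, full-column-rank data matrix A with a fixed target b, and it uses a textbook modified Gram-Schmidt QR in pure Python. The helper names (`qr`, `least_squares`, `select`, `solve_from_scratch`, `solve_with_reuse`) and the toy data are hypothetical. The idea: factor A = QR once; for any feature subset F, the columns satisfy A_F = Q R_F, so minimizing ||A_F x - b|| has the same solution as minimizing ||R_F x - Q^T b||, which involves only d rows instead of n.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def qr(rows):
    """Thin QR by modified Gram-Schmidt.

    Returns (list of Q columns, upper-triangular R as a list of rows).
    Assumes the input matrix has full column rank.
    """
    cols = [list(c) for c in zip(*rows)]
    k = len(cols)
    q_cols = []
    r = [[0.0] * k for _ in range(k)]
    for j in range(k):
        v = cols[j]
        for i in range(j):
            # Project out the already-computed orthonormal directions.
            r[i][j] = dot(q_cols[i], v)
            v = [vi - r[i][j] * qi for vi, qi in zip(v, q_cols[i])]
        r[j][j] = dot(v, v) ** 0.5
        q_cols.append([vi / r[j][j] for vi in v])
    return q_cols, r

def least_squares(rows, b):
    """Solve min ||rows @ x - b|| via QR followed by back substitution."""
    q_cols, r = qr(rows)
    y = [dot(q, b) for q in q_cols]
    k = len(r)
    x = [0.0] * k
    for i in reversed(range(k)):
        tail = sum(r[i][j] * x[j] for j in range(i + 1, k))
        x[i] = (y[i] - tail) / r[i][i]
    return x

def select(rows, features):
    """Keep only the columns listed in `features`."""
    return [[row[j] for j in features] for row in rows]

def solve_from_scratch(a, b, features):
    # Baseline: refactor the n-row matrix A_F for every feature set.
    return least_squares(select(a, features), b)

def solve_with_reuse(r, qtb, features):
    # Reuse: work on the d-row matrix R_F and the cached vector Q^T b.
    return least_squares(select(r, features), qtb)

# Toy data (hypothetical): 6 examples, 4 candidate features.
A = [
    [1, 2, 0, 1],
    [1, 0, 1, 3],
    [1, 1, 2, 0],
    [1, 3, 1, 2],
    [1, 2, 3, 1],
    [1, 0, 2, 2],
]
b = [3, 1, 4, 1, 5, 9]

# Materialize once; every later feature set reuses r and qtb.
q_cols, r = qr(A)
qtb = [dot(q, b) for q in q_cols]

for features in ([0, 1], [0, 2, 3], [1, 3]):
    print(features,
          solve_from_scratch(A, b, features),
          solve_with_reuse(r, qtb, features))
```

In this sketch the two solvers agree up to floating-point error, while the reuse path never touches the n-row matrix again. The saving matters when n is much larger than d and many feature sets are explored; whether it pays off in a given system would depend on the cost model, which this toy example does not attempt to capture.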




    Published In

    ACM Transactions on Database Systems  Volume 41, Issue 1
    Invited Paper from ICDT 2015, SIGMOD 2014, EDBT 2014 and Regular Papers
    April 2016
    287 pages
    ISSN:0362-5915
    EISSN:1557-4644
    DOI:10.1145/2897141
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 24 February 2016
    Accepted: 01 November 2015
    Revised: 01 September 2015
    Received: 01 February 2015
    Published in TODS Volume 41, Issue 1


    Author Tags

    1. Feature selection
    2. R
    3. declarative language
    4. machine learning
    5. materialization
    6. optimization
    7. statistical analytics

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • Defense Advanced Research Projects Agency (DARPA) XDATA
    • Google
    • Office of Naval Research (ONR)
    • DEFT
    • Toshiba
    • DARPA's MEMEX program and SIMPLEX program
    • National Science Foundation (NSF) CAREER
    • National Institute of Biomedical Imaging and Bioengineering (NIBIB)
    • trans-NIH Big Data to Knowledge
    • Sloan Research Fellowship
    • Moore Foundation
    • American Family Insurance
    • National Institutes of Health (NIH)
    • Microsoft Jim Gray Systems Lab


    Cited By

    • (2024) A Caching-based Framework for Scalable Temporal Graph Neural Network Training. ACM Transactions on Database Systems 50:1 (1-46). DOI: 10.1145/3705894. Online publication date: 25-Nov-2024
    • (2024) Optimizing Data Analytics Workflows through User-driven Experimentation. Proceedings of the IEEE/ACM 3rd International Conference on AI Engineering - Software Engineering for AI (253-255). DOI: 10.1145/3644815.3644971. Online publication date: 14-Apr-2024
    • (2024) HYPPO: Using Equivalences to Optimize Pipelines in Exploratory Machine Learning. 2024 IEEE 40th International Conference on Data Engineering (ICDE) (221-234). DOI: 10.1109/ICDE60146.2024.00024. Online publication date: 13-May-2024
    • (2023) ElasticNotebook: Enabling Live Migration for Computational Notebooks. Proceedings of the VLDB Endowment 17:2 (119-133). DOI: 10.14778/3626292.3626296. Online publication date: 1-Oct-2023
    • (2023) Optimizing Data Pipelines for Machine Learning in Feature Stores. Proceedings of the VLDB Endowment 16:13 (4230-4239). DOI: 10.14778/3625054.3625060. Online publication date: 1-Sep-2023
    • (2023) Orca: Scalable Temporal Graph Neural Network Training with Theoretical Guarantees. Proceedings of the ACM on Management of Data 1:1 (1-27). DOI: 10.1145/3588737. Online publication date: 30-May-2023
    • (2023) Incorporating experts’ judgment into machine learning models. Expert Systems with Applications: An International Journal 228:C. DOI: 10.1016/j.eswa.2023.120118. Online publication date: 15-Oct-2023
    • (2022) Materialization and Reuse Optimizations for Production Data Science Pipelines. Proceedings of the 2022 International Conference on Management of Data (1962-1976). DOI: 10.1145/3514221.3526186. Online publication date: 10-Jun-2022
    • (2022) Causal Feature Selection for Algorithmic Fairness. Proceedings of the 2022 International Conference on Management of Data (276-285). DOI: 10.1145/3514221.3517909. Online publication date: 10-Jun-2022
    • (2022) Nautilus: An Optimized System for Deep Transfer Learning over Evolving Training Datasets. Proceedings of the 2022 International Conference on Management of Data (506-520). DOI: 10.1145/3514221.3517846. Online publication date: 10-Jun-2022
