DOI: 10.1145/2588555.2593678
research-article

Materialization optimizations for feature selection workloads

Published: 18 June 2014

Abstract

There is an arms race in the data management industry to support analytics, in which one critical step is feature selection, the process of selecting a feature set that will be used to build a statistical model. Analytics is one of the biggest topics in data management, and feature selection is widely regarded as the most critical step of analytics; thus, we argue that managing the feature selection process is a pressing data management challenge. We study this challenge by describing a feature-selection language and a supporting prototype system that builds on top of current industrial R-integration layers. From our interactions with analysts, we learned that feature selection is an interactive, human-in-the-loop process, which means that feature selection workloads are rife with reuse opportunities. Thus, we study how to materialize portions of this computation using not only classical database materialization optimizations but also methods that have not previously been used in database optimization, including structural decomposition methods (such as QR factorization) and warmstart. These new methods have no analog in traditional SQL systems, but they may be interesting for array and scientific database applications. On a diverse collection of data sets and programs, we find that traditional database-style approaches that ignore these new opportunities are more than two orders of magnitude slower than an optimal plan in this new tradeoff space across multiple R backends. Furthermore, we show that it is possible to build a simple cost-based optimizer that automatically selects a near-optimal execution plan for feature selection.
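
To make the reuse opportunity concrete, the following is a minimal sketch of how a QR factorization could be materialized once and then reused across many feature subsets in a least-squares workload. It is an illustration under stated assumptions, not the system described in the abstract: the function names (`materialize_qr`, `solve_subset`, `solve_subset_naive`), the synthetic data, and the choice of NumPy are all hypothetical. The idea it relies on is standard linear algebra: if A = QR with Q having orthonormal columns, then A[:, S] = Q R[:, S], so minimizing ||A[:, S] x - b|| has the same solution as minimizing ||R[:, S] x - Qᵀb||, which has only d rows instead of n.

```python
import numpy as np


def materialize_qr(A, b):
    """Factor the full n x d matrix once; keep only R (d x d) and Q^T b (d,)."""
    Q, R = np.linalg.qr(A, mode="reduced")
    return R, Q.T @ b


def solve_subset(R, Qtb, features):
    """Least-squares fit restricted to `features`, using the materialized factors.

    Works on a d-row problem, so the n-row matrix A is never touched again.
    """
    x, *_ = np.linalg.lstsq(R[:, features], Qtb, rcond=None)
    return x


def solve_subset_naive(A, b, features):
    """Baseline: solve each subset from scratch on the full n-row data."""
    x, *_ = np.linalg.lstsq(A[:, features], b, rcond=None)
    return x


# Synthetic data (an assumption for illustration): n = 2000 rows, d = 8 features.
rng = np.random.default_rng(0)
A = rng.standard_normal((2000, 8))
b = A @ rng.standard_normal(8) + 0.1 * rng.standard_normal(2000)

# Materialize once, then explore several candidate feature sets.
R, Qtb = materialize_qr(A, b)
subsets = [[0, 1, 2], [0, 3, 5, 7], [2, 4]]
for S in subsets:
    x_fast = solve_subset(R, Qtb, S)
    x_ref = solve_subset_naive(A, b, S)
    print(S, "matches baseline:", np.allclose(x_fast, x_ref))
```

In an interactive session where an analyst tries many feature sets over the same data, the one-time factorization cost is amortized over every later fit. Whether that pays off depends on n, d, and the number of subsets explored, which is the kind of tradeoff a cost-based optimizer would have to weigh.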

    Published In

    SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data
    June 2014
    1645 pages
    ISBN:9781450323765
    DOI:10.1145/2588555

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. feature selection
    2. materialization
    3. statistical analytics

    Qualifiers

    • Research-article

    Conference

    SIGMOD/PODS'14

    Acceptance Rates

    SIGMOD '14 Paper Acceptance Rate 107 of 421 submissions, 25%;
    Overall Acceptance Rate 785 of 4,003 submissions, 20%
