research-article

Materialization optimizations for feature selection workloads

Authors:

Christopher Ré

SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data

Pages 265 - 276

https://doi.org/10.1145/2588555.2593678

Published: 18 June 2014

Abstract

There is an arms race in the data management industry to support analytics, in which one critical step is feature selection, the process of selecting a feature set that will be used to build a statistical model. Analytics is one of the biggest topics in data management, and feature selection is widely regarded as the most critical step of analytics; thus, we argue that managing the feature selection process is a pressing data management challenge. We study this challenge by describing a feature-selection language and a supporting prototype system that builds on top of current industrial R-integration layers. From our interactions with analysts, we learned that feature selection is an interactive, human-in-the-loop process, which means that feature selection workloads are rife with reuse opportunities. Thus, we study how to materialize portions of this computation using not only classical database materialization optimizations but also methods that have not previously been used in database optimization, including structural decomposition methods (like QR factorization) and warmstart. These new methods have no analog in traditional SQL systems, but they may be interesting for array and scientific database applications. On a diverse set of data sets and programs, we find that traditional database-style approaches that ignore these new opportunities are more than two orders of magnitude slower than an optimal plan in this new tradeoff space across multiple R-backends. Furthermore, we show that it is possible to build a simple cost-based optimizer to automatically select a near-optimal execution plan for feature selection.
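To make the reuse opportunity concrete, the following is a minimal sketch of how a materialized QR factorization could serve many least-squares fits over different feature subsets. It is an illustration under stated assumptions, not the paper's system: the function names `materialize_qr` and `solve_subset` are hypothetical, the data is synthetic, and NumPy stands in for the R backends the abstract mentions. The sketch relies on the identity that if A = QR with orthonormal Q, then for any column subset F, minimizing ||A[:, F] x - b|| is equivalent to minimizing ||R[:, F] x - Q^T b||, so each new feature set needs only a small d-row problem instead of a pass over all n rows.

```python
import numpy as np


def materialize_qr(A, b):
    """One-time cost: factor the full n x d data matrix as A = Q R.

    Only R (d x d) and Q^T b (length d) are kept; their sizes do not
    depend on the number of rows n. (Hypothetical helper name.)
    """
    Q, R = np.linalg.qr(A, mode="reduced")
    return R, Q.T @ b


def solve_subset(R, Qtb, features):
    """Least squares restricted to the given column indices.

    Since A[:, F] = Q R[:, F] and Q has orthonormal columns, the
    minimizer of ||A[:, F] x - b|| equals the minimizer of
    ||R[:, F] x - Q^T b||. (Hypothetical helper name.)
    """
    coef, *_ = np.linalg.lstsq(R[:, features], Qtb, rcond=None)
    return coef


# Synthetic data, assumed only for illustration.
rng = np.random.default_rng(0)
n, d = 5000, 8
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

R, Qtb = materialize_qr(A, b)

# An analyst tries several candidate feature sets; every solve reuses
# the same materialized factors rather than rescanning all n rows.
candidates = [[0, 1, 2], [0, 1, 2, 5], [3, 4, 6, 7]]
for F in candidates:
    reused = solve_subset(R, Qtb, F)
    direct, *_ = np.linalg.lstsq(A[:, F], b, rcond=None)
    assert np.allclose(reused, direct)
```

The design choice in this sketch is to pay the factorization cost once and amortize it over the interactive session. Whether that pays off depends on how many feature sets are explored and on the shape of the data, which is the kind of tradeoff the abstract says a cost-based optimizer can weigh.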

Index Terms

Materialization optimizations for feature selection workloads
1. Information systems
  1. Data management systems
    1. Database management system engines

Published In

SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data

June 2014

1645 pages

ISBN:9781450323765

DOI:10.1145/2588555

General Chairs: Curtis Dyreson (Utah State University, USA), Feifei Li (University of Utah, USA)

Program Chair: M. Tamer Özsu (University of Waterloo, Canada)

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SIGMOD/PODS'14

Sponsor:

SIGMOD

SIGMOD/PODS'14: International Conference on Management of Data

June 22 - 27, 2014

Snowbird, Utah, USA

Acceptance Rates

SIGMOD '14 Paper Acceptance Rate 107 of 421 submissions, 25%;

Overall Acceptance Rate 785 of 4,003 submissions, 20%
