research-article

A Relational Framework for Classifier Engineering

Published: 10 September 2018

Abstract

In the design of analytical procedures and machine-learning solutions, a critical and time-consuming task is feature engineering, for which various recipes and tooling approaches have been developed. We embark on the establishment of database foundations for feature engineering. Specifically, we propose a formal framework for classification in the context of a relational database. The goal of this framework is to open the way to research and techniques that assist developers with feature engineering by utilizing the database's modeling and understanding of data and queries, and by deploying the well-studied principles of database management. We demonstrate the usefulness of the framework by formally defining key algorithmic challenges and presenting preliminary complexity results.

Cited By

  • (2024) Fitting Algorithms for Conjunctive Queries. ACM SIGMOD Record 52:4 (6-18). DOI: 10.1145/3641832.3641834. Online publication date: 19-Jan-2024
  • (2023) Extremal Fitting Problems for Conjunctive Queries. Proceedings of the 42nd ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (89-98). DOI: 10.1145/3584372.3588655. Online publication date: 18-Jun-2023
  • (2022) Answering (Unions of) Conjunctive Queries using Random Access and Random-Order Enumeration. ACM Transactions on Database Systems 47:3 (1-49). DOI: 10.1145/3531055. Online publication date: 25-Jun-2022

    Published In

    ACM SIGMOD Record  Volume 47, Issue 1
    March 2018
    45 pages
    ISSN:0163-5808
    DOI:10.1145/3277006

    Publisher

    Association for Computing Machinery

    New York, NY, United States
