DOI: 10.1145/3543507.3583527
Research Article

Catch: Collaborative Feature Set Search for Automated Feature Engineering

Published: 30 April 2023

Abstract

Feature engineering often plays a crucial role in building mining systems for tabular data, and it traditionally requires experienced human experts to perform. Rapid advances in reinforcement learning have offered an automated alternative: automated feature engineering (AutoFE). In this work, through scrutiny of prior AutoFE methods, we characterize several research challenges that remain in this regime, concerning system-wide efficiency, efficacy, and practicality toward production. We then propose Catch, a full-fledged new AutoFE framework that comprehensively addresses these challenges. The core of Catch is a hierarchical-policy reinforcement learning scheme that carries out collaborative feature engineering exploration and exploitation at the granularity of the whole feature set. At the higher level of the hierarchy, a decision-making module controls the post-processing of the attained feature engineering transformations. We extensively experiment with Catch on 26 standardized academic tabular datasets and 9 industrial real-world datasets. Measured by numerous metrics and analyses, Catch establishes a new state of the art in performance, latency, and practicality toward production. Source code can be found at https://github.com/1171000709/Catch.
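To make the task concrete, the loop below is a minimal, self-contained sketch of what AutoFE automates: generating candidate feature transformations and scoring each resulting feature set by a downstream model's cross-validated accuracy. It is *not* Catch's hierarchical-policy RL algorithm; the transformation set, the greedy whole-feature-set search, and the scikit-learn logistic-regression evaluator are all illustrative assumptions.

```python
# Illustrative AutoFE sketch (hypothetical, simplified): try candidate
# transformations at the granularity of the whole feature set and keep the
# one that most improves a downstream model's cross-validated score.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

def score(features):
    """Cross-validated quality of a candidate feature set."""
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
    return cross_val_score(model, features, y, cv=5).mean()

# Candidate unary transformations, applied to the whole feature set at once.
transforms = {
    "identity": lambda f: f,
    "log1p":    lambda f: np.log1p(np.abs(f)),
    "square":   lambda f: f ** 2,
    "sqrt":     lambda f: np.sqrt(np.abs(f)),
}

baseline = score(X)
results = {}
for name, fn in transforms.items():
    # Append transformed copies of all columns to the original feature set.
    candidate = X if name == "identity" else np.hstack([X, fn(X)])
    results[name] = score(candidate)

best = max(results, key=results.get)
print(f"baseline={baseline:.4f}, best transform={best!r}, score={results[best]:.4f}")
```

A real AutoFE system replaces this exhaustive single-step enumeration with a learned search policy (in Catch's case, hierarchical reinforcement learning over feature-set-level actions), since the space of composed transformations grows combinatorially.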


Cited By

  • (2025) "Empowering Machine Learning With Scalable Feature Engineering and Interpretable AutoML". IEEE Transactions on Artificial Intelligence 6(2), 432–447. DOI: 10.1109/TAI.2024.3400752
  • (2024) "Towards Cross-Table Masked Pretraining for Web Data Mining". Proceedings of the ACM Web Conference 2024, 4449–4459. DOI: 10.1145/3589334.3645707
  • (2024) "Visible-hidden hybrid automatic feature engineering via multi-agent reinforcement learning". Knowledge-Based Systems 299:C. DOI: 10.1016/j.knosys.2024.111941

Published In

WWW '23: Proceedings of the ACM Web Conference 2023
April 2023
4293 pages
ISBN:9781450394161
DOI:10.1145/3543507
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Automated Feature Engineering
  2. Data Mining
  3. Tabular Data

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • NSFC

Conference

WWW '23: The ACM Web Conference 2023
April 30 – May 4, 2023
Austin, TX, USA

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Article Metrics

  • Downloads (last 12 months): 107
  • Downloads (last 6 weeks): 7
Reflects downloads up to 02 Mar 2025
