DOI: 10.1145/3543507.3583527
Research Article

Catch: Collaborative Feature Set Search for Automated Feature Engineering

Published: 30 April 2023

Abstract

Feature engineering often plays a crucial role in building mining systems for tabular data, and it traditionally requires experienced human experts to perform. Rapid advances in reinforcement learning have offered an automated alternative: automated feature engineering (AutoFE). In this work, through scrutiny of prior AutoFE methods, we characterize several research challenges that remain in this regime, concerning system-wide efficiency, efficacy, and practicality toward production. We then propose Catch, a full-fledged new AutoFE framework that comprehensively addresses these challenges. The core of Catch is a hierarchical-policy reinforcement learning scheme that carries out collaborative feature engineering exploration and exploitation at the granularity of the whole feature set. At the higher level of the hierarchy, a decision-making module controls the post-processing of the attained feature engineering transformations. We extensively experiment with Catch on 26 standardized academic tabular datasets and 9 industrial real-world datasets. Measured by numerous metrics and analyses, Catch establishes a new state of the art in performance, latency, and practicality toward production. Source code can be found at https://github.com/1171000709/Catch.
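To make the task concrete, the loop below is a minimal, self-contained sketch of what AutoFE automates: generating candidate feature transformations and scoring each resulting feature set by a downstream model's cross-validated accuracy. It is *not* Catch's hierarchical-policy RL algorithm; the transformation set, the greedy whole-feature-set search, and the scikit-learn logistic-regression evaluator are all illustrative assumptions.

```python
# Illustrative AutoFE sketch (hypothetical, simplified): try candidate
# transformations at the granularity of the whole feature set and keep the
# one that most improves a downstream model's cross-validated score.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

def score(features):
    """Cross-validated quality of a candidate feature set."""
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
    return cross_val_score(model, features, y, cv=5).mean()

# Candidate unary transformations, applied to the whole feature set at once.
transforms = {
    "identity": lambda f: f,
    "log1p":    lambda f: np.log1p(np.abs(f)),
    "square":   lambda f: f ** 2,
    "sqrt":     lambda f: np.sqrt(np.abs(f)),
}

baseline = score(X)
results = {}
for name, fn in transforms.items():
    # Append transformed copies of all columns to the original feature set.
    candidate = X if name == "identity" else np.hstack([X, fn(X)])
    results[name] = score(candidate)

best = max(results, key=results.get)
print(f"baseline={baseline:.4f}, best transform={best!r}, score={results[best]:.4f}")
```

A real AutoFE system replaces this exhaustive single-step enumeration with a learned search policy (in Catch's case, hierarchical reinforcement learning over feature-set-level actions), since the space of composed transformations grows combinatorially.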


Cited By

  • (2025) "Empowering Machine Learning With Scalable Feature Engineering and Interpretable AutoML". IEEE Transactions on Artificial Intelligence 6(2), 432–447. DOI: 10.1109/TAI.2024.3400752
  • (2024) "Towards Cross-Table Masked Pretraining for Web Data Mining". Proceedings of the ACM Web Conference 2024, 4449–4459. DOI: 10.1145/3589334.3645707
  • (2024) "Visible-hidden hybrid automatic feature engineering via multi-agent reinforcement learning". Knowledge-Based Systems 299:C. DOI: 10.1016/j.knosys.2024.111941

Published In

WWW '23: Proceedings of the ACM Web Conference 2023
April 2023
4293 pages
ISBN:9781450394161
DOI:10.1145/3543507
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Automated Feature Engineering
  2. Data Mining
  3. Tabular Data

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • NSFC

Conference

WWW '23: The ACM Web Conference 2023
April 30 – May 4, 2023
Austin, TX, USA

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Article Metrics

  • Downloads (last 12 months): 107
  • Downloads (last 6 weeks): 7
Reflects downloads up to 02 Mar 2025
