DOI: 10.1145/3583133.3596367
research-article

MLStar: A System for Synthesis of Machine-Learning Programs

Published: 24 July 2023

Abstract

This paper introduces MLStar, our auto-ML system, which uses genetic programming to create scikit-learn- and Keras-based Python programs that perform supervised learning. MLStar builds on our own genetic programming system (GPStar4) and offers a larger search space than traditional genetic programming frameworks.
Key elements that enable MLStar's performance include representing individuals as Directed Acyclic Graphs (DAGs), a rich type system to shape the kinds of graphs generated, novel genetic operators that operate on the DAG structure, and advanced hyperparameter tuning via the Optuna hyperparameter optimization framework. MLStar also offers multiobjective fitnesses and a variety of complex population types.
We show that MLStar compares favorably with several other auto-ML frameworks on benchmark tests. We also demonstrate that MLStar can produce competitive solutions even when running with computationally expensive features disabled.

Cited By

  • (2023) GPStar4: A flexible framework for experimenting with genetic programming. Proceedings of the Companion Conference on Genetic and Evolutionary Computation, pp. 1910-1915. DOI: 10.1145/3583133.3596369. Online publication date: 15-Jul-2023.

    Information & Contributors

    Information

    Published In

    GECCO '23 Companion: Proceedings of the Companion Conference on Genetic and Evolutionary Computation
    July 2023
    2519 pages
    ISBN:9798400701207
    DOI:10.1145/3583133

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. genetic programming
    2. directed acyclic graphs
    3. ScikitLearn
    4. auto-ML

    Conference

    GECCO '23 Companion

    Acceptance Rates

    Overall Acceptance Rate 1,669 of 4,410 submissions, 38%
