Statistical Evaluation of CESAMO Encoder for Pattern Preservation in Categorical Data

Valdez-Valenzuela, Eric; Kuri-Morales, Angel; Gomez-Adorno, Helena

doi:10.1007/978-3-031-62836-8_5

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14755)

Included in the following conference series:

Mexican Conference on Pattern Recognition

Abstract

Datasets used for training Machine Learning (ML) algorithms commonly contain categorical attributes. However, most ML models are designed to handle numerical inputs exclusively, so categorical attributes must be converted into numerical values before they can be used. This conversion should preserve the patterns and information inherent in the categorical attributes, because any loss of information or pattern may degrade the performance of ML algorithms. Several encoding techniques have been proposed to handle this conversion. This paper examines the CESAMO encoding technique. The CESAMO encoder captures the relationships between categorical attributes and other variables using what are referred to as Pattern Preserving Codes. A statistical evaluation of this encoding technique was conducted on synthetic data, comparing its performance with that of other encoding methods. In these experiments, CESAMO outperformed all the other categorical encoding techniques with which it was compared.
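As a small illustration of the conversion step the abstract describes, the sketch below encodes one categorical attribute in two generic ways: an ordinal code and a one-hot code. It is not the CESAMO encoder, whose procedure is not given in this abstract. The toy data and the function names `ordinal_encode` and `one_hot_encode` are hypothetical and chosen only for this example.

```python
# Hypothetical toy attribute; not taken from the paper.
colors = ["red", "green", "blue", "green", "red"]


def ordinal_encode(values):
    """Map each category to an integer in order of first appearance."""
    codes = {}
    for v in values:
        # Assign the next unused integer the first time a category is seen.
        codes.setdefault(v, len(codes))
    return [codes[v] for v in values], codes


def one_hot_encode(values):
    """Map each value to a 0/1 indicator vector, one column per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories


ordinal, mapping = ordinal_encode(colors)
one_hot, columns = one_hot_encode(colors)

print(ordinal)     # integer code per row
print(columns)     # column order of the one-hot vectors
print(one_hot[0])  # indicator vector for the first row
```

The ordinal code imposes an arbitrary order on the categories, while the one-hot code avoids that at the cost of one column per category. Neither uses the other variables in the dataset, which is the kind of relationship the abstract says CESAMO is designed to capture.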

Author information

Authors and Affiliations

Posgrado en Ciencia e Ingeniería de la Computación, Universidad Nacional Autónoma de México, Mexico City, Mexico
Eric Valdez-Valenzuela
Instituto Tecnológico Autónomo de México, Mexico City, Mexico
Angel Kuri-Morales
Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas, Universidad Nacional Autónoma de México, Mexico City, Mexico
Helena Gomez-Adorno

Corresponding author

Correspondence to Eric Valdez-Valenzuela.

Editor information

Editors and Affiliations

Universidad Veracruzana, Veracruz, Mexico
Efrén Mezura-Montes
Universidad Veracruzana, Veracruz, Mexico
Héctor Gabriel Acosta-Mesa
Instituto Nacional de Astrofísica, Óptica y Electrónica (INAOE), Puebla, Mexico
Jesús Ariel Carrasco-Ochoa
Instituto Nacional de Astrofísica, Óptica y Electrónica (INAOE), Puebla, Mexico
José Francisco Martínez-Trinidad
Autonomous University of Puebla (BUAP), Puebla, Mexico
José Arturo Olvera-López

About this paper

Cite this paper

Valdez-Valenzuela, E., Kuri-Morales, A., Gomez-Adorno, H. (2024). Statistical Evaluation of CESAMO Encoder for Pattern Preservation in Categorical Data. In: Mezura-Montes, E., Acosta-Mesa, H.G., Carrasco-Ochoa, J.A., Martínez-Trinidad, J.F., Olvera-López, J.A. (eds) Pattern Recognition. MCPR 2024. Lecture Notes in Computer Science, vol 14755. Springer, Cham. https://doi.org/10.1007/978-3-031-62836-8_5

DOI: https://doi.org/10.1007/978-3-031-62836-8_5
Published: 17 June 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-62835-1
Online ISBN: 978-3-031-62836-8
eBook Packages: Computer Science, Computer Science (R0)
