Abstract
Datasets used to train Machine Learning (ML) algorithms commonly contain categorical attributes. However, most ML models are designed to handle only numerical inputs, so categorical attributes must be converted into numerical values before they can be used. This conversion must preserve the patterns and information carried by the categorical attributes, since any loss may degrade the performance of ML algorithms. Several encoding techniques have been proposed for this conversion. This paper examines the CESAMO encoding technique. The CESAMO encoder captures the relationships between categorical attributes and the other variables using what are referred to as Pattern Preserving Codes. A statistical evaluation of this encoding technique was conducted on synthetic data, comparing its performance with that of other encoding methods. The experimental results show that CESAMO outperforms all of the other categorical encoding techniques included in the comparison.
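To make the conversion problem concrete, the following minimal sketch shows two generic ways of turning a categorical column into numbers: ordinal codes and one-hot indicators. It is an illustration only and does not implement CESAMO; the function names and the `colors` column are hypothetical and do not come from the paper.

```python
# Illustrative only: two generic categorical encoders, NOT the CESAMO method.
# All names below (ordinal_encode, one_hot_encode, colors) are hypothetical.

def ordinal_encode(values):
    """Map each distinct category to an integer code, in sorted order."""
    codes = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [codes[v] for v in values], codes


def one_hot_encode(values):
    """Map each value to a 0/1 indicator vector, one position per category."""
    categories = sorted(set(values))
    rows = [[1 if v == cat else 0 for cat in categories] for v in values]
    return rows, categories


colors = ["red", "green", "blue", "green"]  # hypothetical categorical column

ordinal, mapping = ordinal_encode(colors)
one_hot, columns = one_hot_encode(colors)

print(mapping)  # {'blue': 0, 'green': 1, 'red': 2}
print(ordinal)  # [2, 1, 0, 1]
print(columns)  # ['blue', 'green', 'red']
print(one_hot)  # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

The ordinal codes imply an ordering (blue < green < red) that is not present in the data, while the one-hot vectors avoid that ordering at the cost of one column per category. Distortions of this kind are what a pattern-preserving encoding aims to avoid.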
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Valdez-Valenzuela, E., Kuri-Morales, A., Gomez-Adorno, H. (2024). Statistical Evaluation of CESAMO Encoder for Pattern Preservation in Categorical Data. In: Mezura-Montes, E., Acosta-Mesa, H.G., Carrasco-Ochoa, J.A., Martínez-Trinidad, J.F., Olvera-López, J.A. (eds) Pattern Recognition. MCPR 2024. Lecture Notes in Computer Science, vol 14755. Springer, Cham. https://doi.org/10.1007/978-3-031-62836-8_5
DOI: https://doi.org/10.1007/978-3-031-62836-8_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-62835-1
Online ISBN: 978-3-031-62836-8
eBook Packages: Computer Science, Computer Science (R0)