Statistical Evaluation of CESAMO Encoder for Pattern Preservation in Categorical Data

  • Conference paper
Pattern Recognition (MCPR 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14755)

Abstract

Datasets used to train Machine Learning (ML) algorithms commonly contain categorical attributes. However, most ML models are designed to handle numerical inputs only, so categorical attributes must be converted into numerical values before they can be used. This conversion should preserve the patterns and information associated with the categorical attributes, because any loss of information or pattern may degrade the performance of ML algorithms. Several encoding techniques have been proposed to handle this conversion. This paper examines the CESAMO encoding technique. The CESAMO encoder captures the relationships between categorical attributes and other variables using what are referred to as Pattern Preserving Codes. A statistical evaluation of this encoding technique was conducted on synthetic data, comparing its performance with that of other encoding methods. In these experiments, CESAMO outperformed every other categorical encoding technique included in the comparison.
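
As a rough illustration of the conversion step the abstract describes, the sketch below encodes a toy categorical column with two common baseline schemes, ordinal and one-hot. It is not the CESAMO procedure, which the abstract does not detail; the column values and the helper names `ordinal_encode` and `one_hot_encode` are hypothetical and chosen only for this example.

```python
def ordinal_encode(values):
    """Map each distinct category to an integer, in order of first appearance."""
    codes = {}
    for v in values:
        codes.setdefault(v, len(codes))
    return [codes[v] for v in values], codes


def one_hot_encode(values):
    """Map each value to a 0/1 indicator vector with one position per category."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = [
        [1 if index[v] == j else 0 for j in range(len(categories))]
        for v in values
    ]
    return rows, categories


# Hypothetical toy column, not data from the paper.
colors = ["red", "green", "blue", "green"]

ordinal, mapping = ordinal_encode(colors)
one_hot, cats = one_hot_encode(colors)

print(ordinal)   # integer code per row
print(cats)      # column order of the one-hot vectors
print(one_hot)   # indicator vector per row
```

Ordinal codes impose an arbitrary order on the categories, and one-hot vectors add one column per category. Both are standard baselines shown here only to make the notion of categorical encoding concrete.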

Author information

Correspondence to Eric Valdez-Valenzuela.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Valdez-Valenzuela, E., Kuri-Morales, A., Gomez-Adorno, H. (2024). Statistical Evaluation of CESAMO Encoder for Pattern Preservation in Categorical Data. In: Mezura-Montes, E., Acosta-Mesa, H.G., Carrasco-Ochoa, J.A., Martínez-Trinidad, J.F., Olvera-López, J.A. (eds) Pattern Recognition. MCPR 2024. Lecture Notes in Computer Science, vol 14755. Springer, Cham. https://doi.org/10.1007/978-3-031-62836-8_5

  • DOI: https://doi.org/10.1007/978-3-031-62836-8_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-62835-1

  • Online ISBN: 978-3-031-62836-8

  • eBook Packages: Computer Science, Computer Science (R0)
