Transforming Mixed Data Bases for Machine Learning: A Case Study

Kuri-Morales, Angel

doi:10.1007/978-3-030-04491-6_12

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11288)

Included in the following conference series: Mexican International Conference on Artificial Intelligence

Abstract

Structured databases that include both numerical and categorical attributes (mixed databases, or MD) must be adequately pre-processed before machine learning algorithms can be applied to their analysis and further processing. It is of primary importance that the instances of all the categorical attributes be encoded so that the patterns embedded in the MD are preserved. We discuss CESAMO, an algorithm that achieves this by statistically sampling the space of possible codes. Implementing CESAMO requires determining the moment at which the codes become normally distributed. It also requires approximating an encoded attribute as a function of other attributes, so that the best code assignment can be identified. The MD's categorical attributes are thus mapped into purely numerical ones, and the resulting numerical database (ND) is then accessible to supervised and unsupervised learning algorithms. We discuss CESAMO, normality assessment and functional approximation, and describe a case study of the US census database. The data are made strictly numerical using CESAMO, after which Neural Networks and Self-Organizing Maps are applied. Our results are compared with those of classical analysis, and we show that applying CESAMO yields better results.
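The abstract names three ingredients: random sampling of candidate codes, a functional approximation that scores each code, and a normality check that decides when to stop sampling. The following is a minimal sketch of how these could fit together, not the author's implementation. Every name in it (`unexplained_fraction`, `jarque_bera`, `encode_attribute`, `batch`, `min_batches`, `max_batches`) is hypothetical. It also makes several assumptions that the abstract does not confirm: a single numerical attribute serves as the predictor, a degree-1 polynomial stands in for the approximant, the quantity tested for normality is the mean approximation error per batch of sampled codes, and the Jarque-Bera statistic replaces whatever normality assessment the paper actually uses.

```python
import random
import statistics


def unexplained_fraction(xs, ys):
    """Fraction of the variance of ys left unexplained by the least-squares
    line ys ~ a + b*xs (that is, 1 - r^2). Assumption: a degree-1 polynomial
    stands in for the paper's approximant."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 1.0  # a constant column cannot explain or be explained
    return max(0.0, 1.0 - (sxy * sxy) / (sxx * syy))


def jarque_bera(sample):
    """Jarque-Bera statistic, asymptotically chi-square with 2 degrees of
    freedom under normality. Assumption: swapped in as the normality check."""
    n = len(sample)
    m = statistics.fmean(sample)
    m2 = sum((v - m) ** 2 for v in sample) / n
    if m2 == 0:
        return float("inf")
    m3 = sum((v - m) ** 3 for v in sample) / n
    m4 = sum((v - m) ** 4 for v in sample) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    return n / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)


def encode_attribute(categories, numeric, batch=30, min_batches=20,
                     max_batches=500, seed=0):
    """Sample random codes in [0, 1) for each category level, score each code
    by how well the encoded column is approximated from `numeric`, and stop
    once the batch means of the scores look normal (hypothetical criterion)."""
    rng = random.Random(seed)
    levels = sorted(set(categories))
    best_err, best_code = float("inf"), None
    batch_means = []
    while len(batch_means) < max_batches:
        errs = []
        for _ in range(batch):
            code = {level: rng.random() for level in levels}
            encoded = [code[c] for c in categories]
            err = unexplained_fraction(numeric, encoded)
            errs.append(err)
            if err < best_err:
                best_err, best_code = err, code
        batch_means.append(statistics.fmean(errs))
        # 5.991 is the 5% critical value of chi-square with 2 degrees of freedom.
        if len(batch_means) >= min_batches and jarque_bera(batch_means) < 5.991:
            break
    return best_code, best_err


if __name__ == "__main__":
    cats = ["low", "low", "mid", "mid", "high", "high"]
    nums = [1.0, 1.2, 5.0, 5.1, 9.0, 9.3]
    code, err = encode_attribute(cats, nums)
    print(code, err)
```

The score is normalised by the variance of the encoded column so that a code assigning nearly equal values to every level is not rewarded for being trivially easy to approximate. This normalisation is a design choice of the sketch, not something stated in the abstract.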

Notes

1. Artificial Neural Networks.
2. Genetic Algorithms.
3. For which see Sect. 4.
4. Several forms are possible for the approximant. We selected a polynomial form because of the well-known Weierstrass approximation theorem, which states that every continuous function defined on a closed interval [a, b] can be uniformly approximated as closely as desired by a polynomial function.
5. One of the requirements for the generality of the algorithm is that the characteristics of the data are not known in advance. Assuming them to be experimental underlines this fact.
6. (N) denotes a numerical attribute; (C) denotes a categorical attribute.
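
The Weierstrass approximation theorem invoked in note 4 can be stated formally as follows. The symbols f, p and ε are introduced here for illustration and do not appear in the notes.

```latex
\forall f \in C[a,b],\ \forall \varepsilon > 0,\ \exists \text{ a polynomial } p \text{ such that }
\sup_{x \in [a,b]} \lvert f(x) - p(x) \rvert < \varepsilon .
```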


Author information

Authors and Affiliations

Instituto Tecnológico Autónomo de México, Río Hondo no. 1, 01000, D.F. Mexico, Mexico
Angel Kuri-Morales

Corresponding author

Correspondence to Angel Kuri-Morales.

Editor information

Editors and Affiliations

Instituto Politécnico Nacional, Mexico City, Mexico
Ildar Batyrshin
Universidad Panamericana, Mexico City, Mexico
María de Lourdes Martínez-Villaseñor
Faculty of Engineering, Universidad Panamericana, Mexico City, Mexico
Hiram Eredín Ponce Espinosa

About this paper

Cite this paper

Kuri-Morales, A. (2018). Transforming Mixed Data Bases for Machine Learning: A Case Study. In: Batyrshin, I., Martínez-Villaseñor, M., Ponce Espinosa, H. (eds) Advances in Soft Computing. MICAI 2018. Lecture Notes in Computer Science (LNAI), vol 11288. Springer, Cham. https://doi.org/10.1007/978-3-030-04491-6_12

DOI: https://doi.org/10.1007/978-3-030-04491-6_12
Published: 03 January 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-04490-9
Online ISBN: 978-3-030-04491-6
eBook Packages: Computer Science, Computer Science (R0)
