Transforming Mixed Data Bases for Machine Learning: A Case Study

  • Conference paper
Advances in Soft Computing (MICAI 2018)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11288)

Abstract

Structured databases that include both numerical and categorical attributes (mixed databases, or MD) must be adequately pre-processed before machine learning algorithms can be applied to them. It is of primary importance that the instances of all categorical attributes be encoded so that the patterns embedded in the MD are preserved. We discuss CESAMO, an algorithm that achieves this by statistically sampling the space of possible codes. CESAMO’s implementation requires determining the moment at which the codes distribute normally. It also requires approximating an encoded attribute as a function of other attributes so that the best code assignment may be identified. The MD’s categorical attributes are thus mapped into purely numerical ones, and the resulting numerical database (ND) is then accessible to supervised and unsupervised learning algorithms. We describe CESAMO, normality assessment, and functional approximation, and present a case study of the US census database. The data are made strictly numerical using CESAMO; neural networks and self-organizing maps are then applied. Our results are compared with those of classical analysis and show that CESAMO’s application yields better results.
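The sampling idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation, and it rests on several assumptions: codes are drawn uniformly from [0, 1), the approximant is a straight line in a single numerical attribute, the score is the fraction of variance left unexplained (chosen here so that the score does not depend on the spread of the codes), and a fixed sample budget stands in for the normality-based stopping criterion. The names `unexplained_variance` and `sample_codes` and the toy data are hypothetical.

```python
import random
import statistics


def unexplained_variance(x, y):
    """Fraction of the variance of y not explained by a least-squares line in x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        return 1.0  # degenerate case: nothing to fit, treat as the worst score
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return 1.0 - sxy * sxy / (sxx * syy)  # equals 1 - r^2 for a linear fit


def sample_codes(categories, numeric, n_samples=500, seed=0):
    """Randomly sample code assignments and keep the best-scoring one.

    Assumption: a fixed budget of n_samples replaces the normality-based
    stopping rule mentioned in the abstract.
    """
    rng = random.Random(seed)
    labels = sorted(set(categories))
    best_codes, best_err = None, float("inf")
    for _ in range(n_samples):
        # Propose one random numerical code per category label.
        codes = {label: rng.random() for label in labels}
        encoded = [codes[c] for c in categories]
        # Score the proposal: how well is the encoded attribute
        # approximated as a function of the numerical attribute?
        err = unexplained_variance(numeric, encoded)
        if err < best_err:
            best_codes, best_err = codes, err
    return best_codes, best_err


# Toy mixed data: one categorical attribute that tracks one numerical attribute.
categories = ["low", "mid", "high"] * 20
numeric = [{"low": 1.0, "mid": 2.0, "high": 3.0}[c] for c in categories]
best_codes, best_err = sample_codes(categories, numeric)
```

After the call, `best_codes` maps each category label to a number, so the categorical column can be replaced by `[best_codes[c] for c in categories]` and handed to a purely numerical learning algorithm.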

Notes

  1. Artificial Neural Networks

  2. Genetic Algorithms

  3. For which see Sect. 4.

  4. Several forms are possible for the approximant. We selected a polynomial form because of the well-known Weierstrass approximation theorem, which states that every continuous function defined on a closed interval [a, b] can be uniformly approximated as closely as desired by a polynomial function.

  5. One of the requirements for the generality of the algorithm is that the characteristics of the data are not known in advance. Assuming them to be experimental underlines this fact.

  6. (N) denotes a numerical attribute; (C) denotes a categorical attribute.
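The Weierstrass theorem invoked in Note 4 can be illustrated numerically. The sketch below uses Bernstein polynomials, the classical constructive proof of the theorem on [0, 1]; it is not the approximation method used in the paper, and the test function and the names `bernstein` and `max_error` are chosen here purely for illustration. A general interval [a, b] reduces to [0, 1] through the affine change of variable t = (x - a) / (b - a).

```python
import math


def bernstein(f, n):
    """Return the degree-n Bernstein polynomial of f on [0, 1] as a callable."""
    samples = [f(k / n) for k in range(n + 1)]

    def poly(x):
        return sum(
            s * math.comb(n, k) * x**k * (1 - x) ** (n - k)
            for k, s in enumerate(samples)
        )

    return poly


def max_error(f, p, grid=201):
    """Largest absolute difference between f and p on an evenly spaced grid."""
    points = [i / (grid - 1) for i in range(grid)]
    return max(abs(f(x) - p(x)) for x in points)


def f(x):
    # Continuous on [0, 1] but not differentiable at x = 0.5.
    return abs(x - 0.5)


# The uniform error shrinks as the polynomial degree grows.
errors = {n: max_error(f, bernstein(f, n)) for n in (4, 16, 64)}
```

Bernstein polynomials converge slowly, so they are a poor practical choice; they are used here only because they make the uniform convergence promised by the theorem easy to observe.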

Author information

Corresponding author

Correspondence to Angel Kuri-Morales.

Copyright information

© 2018 Springer Nature Switzerland AG

Cite this paper

Kuri-Morales, A. (2018). Transforming Mixed Data Bases for Machine Learning: A Case Study. In: Batyrshin, I., Martínez-Villaseñor, M., Ponce Espinosa, H. (eds) Advances in Soft Computing. MICAI 2018. Lecture Notes in Computer Science, vol 11288. Springer, Cham. https://doi.org/10.1007/978-3-030-04491-6_12

  • DOI: https://doi.org/10.1007/978-3-030-04491-6_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-04490-9

  • Online ISBN: 978-3-030-04491-6

  • eBook Packages: Computer Science (R0)
