Machine Learning Methods Based Preprocessing to Improve Categorical Data Classification

Ruiz-Chavez, Zoila; Salvador-Meneses, Jaime; Garcia-Rodriguez, Jose

doi:10.1007/978-3-030-03493-1_32

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11314)

Included in the following conference series:

International Conference on Intelligent Data Engineering and Automated Learning

Abstract

This study addresses large volumes of data characterized by a high number of variables, most of which are categorical. In Knowledge Discovery in Databases (KDD), the knowledge extraction process, data pre-processing and dimensionality reduction are common stages, and high-quality data is key to extracting information. This paper proposes using the Pairwise and Listwise methods as part of the dimensionality reduction process when one or more variables contain a high level of missing values. As part of the pre-processing, we generate n clusters using the Kohonen Self-Organizing Maps (SOM) algorithm with H2O on R. We then compare the performance and accuracy of classification algorithms applied to the complete data subset and to each cluster. As a case study, we analyze the characteristics that influence the level of schooling of women of childbearing age.
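
The difference between the two missing-value strategies named in the abstract can be illustrated with a minimal sketch. It assumes that "Listwise" and "Pairwise" refer to the standard deletion strategies for incomplete records; the toy records, variable names, and helper functions below are hypothetical and are not taken from the paper, which reports working with H2O on R rather than Python.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy records (not the paper's data); None marks a missing value.
rows = [
    {"area": "urban", "marital": "single", "schooling": "secondary"},
    {"area": "rural", "marital": None, "schooling": "primary"},
    {"area": "urban", "marital": "married", "schooling": None},
    {"area": None, "marital": "married", "schooling": "secondary"},
    {"area": "rural", "marital": "single", "schooling": "primary"},
]


def listwise(records):
    """Listwise deletion: keep only records with no missing value at all."""
    return [r for r in records if all(v is not None for v in r.values())]


def pairwise_counts(records, a, b):
    """Pairwise deletion: count (a, b) category pairs over every record
    in which both variables are present, ignoring the other variables."""
    return Counter(
        (r[a], r[b]) for r in records if r[a] is not None and r[b] is not None
    )


# Records that survive listwise deletion.
complete = listwise(rows)

# Number of usable records for each pair of variables under pairwise deletion.
pair_sizes = {
    (a, b): sum(pairwise_counts(rows, a, b).values())
    for a, b in combinations(rows[0].keys(), 2)
}

print(len(complete), pair_sizes)
```

In this toy example listwise deletion discards every record with any gap, while pairwise deletion retains more records per analysis at the cost of each pair of variables being computed on a different subset.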

Author information

Authors and Affiliations

Universidad Central del Ecuador, Ciudadela Universitaria, Quito, Ecuador
Zoila Ruiz-Chavez & Jaime Salvador-Meneses
Universidad de Alicante, Ap. 99., 03080, Alicante, Spain
Jose Garcia-Rodriguez

Corresponding author

Correspondence to Zoila Ruiz-Chavez.

Editor information

Editors and Affiliations

University of Manchester, Manchester, UK
Hujun Yin
Autonomous University of Madrid, Madrid, Spain
David Camacho
Campus of Gualtar, University of Minho, Braga, Portugal
Paulo Novais
University of Seville, Seville, Spain
Antonio J. Tallón-Ballesteros

About this paper

Cite this paper

Ruiz-Chavez, Z., Salvador-Meneses, J., Garcia-Rodriguez, J. (2018). Machine Learning Methods Based Preprocessing to Improve Categorical Data Classification. In: Yin, H., Camacho, D., Novais, P., Tallón-Ballesteros, A. (eds) Intelligent Data Engineering and Automated Learning – IDEAL 2018. Lecture Notes in Computer Science, vol 11314. Springer, Cham. https://doi.org/10.1007/978-3-030-03493-1_32

DOI: https://doi.org/10.1007/978-3-030-03493-1_32
Published: 09 November 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-03492-4
Online ISBN: 978-3-030-03493-1
eBook Packages: Computer Science, Computer Science (R0)