Abstract
This study addresses large data sets characterized by a high number of variables, most of them categorical. In the knowledge extraction process known as Knowledge Discovery in Databases (KDD), a stage of data pre-processing and dimensionality reduction is very common, because extracting information depends on high-quality data. This paper proposes using the Pairwise and Listwise methods as part of the dimensionality reduction process when one or more variables contain a high proportion of missing values. As part of the pre-processing, we generate n clusters using the Kohonen Self-Organizing Maps (SOM) algorithm with H2O on R. We compare the performance and accuracy of classification algorithms applied to the complete data subset and to each cluster. As a case study, we analyze the characteristics that influence the level of schooling of women of childbearing age.
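To make the distinction between the two missing-value strategies concrete, the following is a minimal sketch in plain Python, not the authors' H2O/R pipeline. It assumes the usual reading of the terms: listwise handling keeps only records that are complete across all variables, while pairwise handling keeps, for each pair of variables, every record in which both are observed. The records, variable names, and helper functions are hypothetical and chosen only for illustration.

```python
from itertools import combinations

# Hypothetical toy records; None marks a missing value.
records = [
    {"age_group": "20-24", "area": "urban", "schooling": "secondary"},
    {"age_group": "25-29", "area": None,    "schooling": "primary"},
    {"age_group": None,    "area": "rural", "schooling": "primary"},
    {"age_group": "20-24", "area": "rural", "schooling": None},
    {"age_group": "30-34", "area": "urban", "schooling": "tertiary"},
]

def listwise(rows, variables):
    """Keep only rows with no missing value in any of the variables."""
    return [r for r in rows if all(r[v] is not None for v in variables)]

def pairwise_counts(rows, variables):
    """For each pair of variables, count rows where both are observed."""
    return {
        (a, b): sum(1 for r in rows if r[a] is not None and r[b] is not None)
        for a, b in combinations(variables, 2)
    }

variables = ["age_group", "area", "schooling"]
complete = listwise(records, variables)
pair_n = pairwise_counts(records, variables)

print(len(complete))  # rows usable under the listwise strategy
print(pair_n)         # rows usable per variable pair under the pairwise strategy
```

In this toy data the listwise strategy retains two of the five records, whereas each pair of variables still has three usable records, which illustrates why the choice matters when missing values are concentrated in a few variables.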
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Ruiz-Chavez, Z., Salvador-Meneses, J., Garcia-Rodriguez, J. (2018). Machine Learning Methods Based Preprocessing to Improve Categorical Data Classification. In: Yin, H., Camacho, D., Novais, P., Tallón-Ballesteros, A. (eds) Intelligent Data Engineering and Automated Learning – IDEAL 2018. IDEAL 2018. Lecture Notes in Computer Science(), vol 11314. Springer, Cham. https://doi.org/10.1007/978-3-030-03493-1_32
DOI: https://doi.org/10.1007/978-3-030-03493-1_32
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-03492-4
Online ISBN: 978-3-030-03493-1
eBook Packages: Computer Science, Computer Science (R0)