Skip to main content

Data Pre-processing to Apply Multiple Imputation Techniques: A Case Study on Real-World Census Data

  • Conference paper
  • First Online:
Intelligent Data Engineering and Automated Learning – IDEAL 2018 (IDEAL 2018)

Abstract

Improving accuracy or reducing computational cost are the main approaches of machine learning techniques, but it depends heavily on the test data used. Even more so when it comes to from real-world data such as censuses, surveys or tokens that contain a high level of missing values. The data absence or presence of outliers are problems that must be treated carefully prior to any process related to data analysis. The following work presents an overview of data pre-processing and aims at presenting the steps to follow prior to process large volumes of high-dimensionality data with categorical variables. As part of the dimensionality reduction process, when there is a high level of missing values present in one or more variables, we use the Pairwise and Listwise Deletion methods. Thus, the generation of m-clusters using the Kohonen Self-Organizing Maps (SOM) algorithm with H2O over R is also considered as a division of data into similar groups, which are used as cluster to apply Multiple Imputation algorithms, creating different m-values to impute a missing value.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

  1. Bar, H.: Missing data–mechanisms and possible solutions. Cultura y Educación 29(3), 492–525 (2017)

    Article  Google Scholar 

  2. Chackiel, J.: Métodos de estimaciones demográficas de pueblos indígenas a partir de censos de población: La Fecundidad y la Mortalidad. Pueblos indigenas y afrodescendientes de América Latina y el Caribe: relevancia y pertinencia de la informacion sociodemografica para politicas y programas, p. 30 (2005)

    Google Scholar 

  3. Cheema, J.R.: A review of missing data handling methods in education research. Rev. Educ. Res. 84(4), 487–508 (2014)

    Article  Google Scholar 

  4. Famili, A., Shen, W.-M., Weber, R., Simoudis, E.: Data preprocessing and intelligent data analysis. Intell. Data Anal. 1(1), 3–23 (1997)

    Article  Google Scholar 

  5. Fessant, F., Midenet, S.: Self-organising map for data imputation and correction in surveys. Neural Comput. Appl. 10(4), 300–310 (2002)

    Article  Google Scholar 

  6. Kamiran, F., Calders, T.: Data preprocessing techniques for classification without discrimination. Knowl. Inf. Syst. 33(1), 1–33 (2012)

    Article  Google Scholar 

  7. Li, D., Deogun, J., Spaulding, W., Shuart, B.: Towards missing data imputation: a study of fuzzy k-means clustering method. In: Tsumoto, S., Słowiński, R., Komorowski, J., Grzymała-Busse, J.W. (eds.) RSCTC 2004. LNCS (LNAI), vol. 3066, pp. 573–579. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-25929-9_70

    Chapter  Google Scholar 

  8. Myers, T.A.: Goodbye, listwise deletion: presenting hot deck imputation as an easy and effective tool for handling missing data. Commun. Methods Meas. 5(4), 297–310 (2011)

    Article  Google Scholar 

  9. Newman, D.A.: Longitudinal modeling with randomly and systematically missing data: a simulation of ad hoc, maximum likelihood, and multiple imputation techniques. Organ. Res. Methods 6(3), 328–362 (2003)

    Article  Google Scholar 

  10. Nishanth, K.J., Ravi, V.: Probabilistic neural network based categorical data imputation. Neurocomputing 218, 17–25 (2016)

    Article  Google Scholar 

  11. Rubin, D.B.: Multiple Imputation for Nonresponse in Surveys, vol. 81. Wiley, New York (2004)

    MATH  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Zoila Ruiz-Chavez .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Ruiz-Chavez, Z., Salvador-Meneses, J., Garcia-Rodriguez, J., Tallón-Ballesteros, A.J. (2018). Data Pre-processing to Apply Multiple Imputation Techniques: A Case Study on Real-World Census Data. In: Yin, H., Camacho, D., Novais, P., Tallón-Ballesteros, A. (eds) Intelligent Data Engineering and Automated Learning – IDEAL 2018. IDEAL 2018. Lecture Notes in Computer Science(), vol 11315. Springer, Cham. https://doi.org/10.1007/978-3-030-03496-2_32

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-03496-2_32

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-03495-5

  • Online ISBN: 978-3-030-03496-2

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics