Benchmarking k-nearest neighbour imputation with homogeneous Likert data

Empirical Software Engineering

Abstract

Missing data are common in surveys regardless of research field, undermining statistical analyses and biasing results. One solution is to use an imputation method, which recovers missing data by estimating replacement values. Previously, we have evaluated the hot-deck k-Nearest Neighbour (k-NN) method with Likert data in a software engineering context. In this paper, we extend the evaluation by benchmarking the method against four other imputation methods: Random Draw Substitution, Random Imputation, Median Imputation and Mode Imputation. By simulating both non-response and imputation, we obtain comparable performance measures for all methods. We discuss the performance of k-NN in light of the other methods, but also for different values of k, different proportions of missing data, different neighbour selection strategies and different numbers of data attributes. Our results show that the k-NN method performs well, even when a large proportion of the data are missing, but has strong competition from both Median Imputation and Mode Imputation for our particular data. However, unlike these methods, k-NN performs better as the number of data attributes increases. We suggest that a suitable value of k is approximately the square root of the number of complete cases, and that letting certain incomplete cases qualify as neighbours boosts the imputation ability of the method.
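As a rough illustration of the methods named in the abstract, the following Python sketch imputes missing Likert answers with k-NN and with the two column-wise baselines (Median Imputation and Mode Imputation). It is only a sketch under stated assumptions, not the authors' implementation. The abstract does not specify the distance measure or how the neighbours' answers are combined, so the code assumes Euclidean distance over the attributes the incomplete case has answered and the low median of the neighbours' answers. Only complete cases act as neighbours here, so the variant in which certain incomplete cases qualify is not covered. The names `MISSING`, `knn_impute` and `column_impute` are hypothetical.

```python
import math
from statistics import median_low, mode

MISSING = None  # hypothetical marker for an unanswered item


def is_complete(case):
    return all(v is not MISSING for v in case)


def distance(a, b, attrs):
    """Euclidean distance over the attribute indices in attrs (an assumption)."""
    return math.sqrt(sum((a[i] - b[i]) ** 2 for i in attrs))


def knn_impute(cases, k=None):
    """Fill missing answers from the k nearest complete cases."""
    donors = [c for c in cases if is_complete(c)]
    if not donors:
        raise ValueError("k-NN imputation needs at least one complete case")
    if k is None:
        # Rule of thumb from the abstract: k is roughly sqrt(#complete cases).
        k = max(1, round(math.sqrt(len(donors))))
    k = min(k, len(donors))
    result = []
    for case in cases:
        observed = [i for i, v in enumerate(case) if v is not MISSING]
        # sorted() is stable, so distance ties keep the donors' original order.
        nearest = sorted(donors, key=lambda d: distance(case, d, observed))[:k]
        # median_low always returns a value that occurs in the data, so the
        # imputed answer stays on the Likert scale (assumed combination rule).
        result.append([
            v if v is not MISSING else median_low([d[i] for d in nearest])
            for i, v in enumerate(case)
        ])
    return result


def column_impute(cases, summary):
    """Baseline: fill each gap with a summary of the column's observed answers."""
    n_attrs = len(cases[0])
    fill = [
        summary([c[i] for c in cases if c[i] is not MISSING])
        for i in range(n_attrs)
    ]
    return [
        [v if v is not MISSING else fill[i] for i, v in enumerate(c)]
        for c in cases
    ]


# Made-up answers on a 1-5 Likert scale; four complete cases give k = 2.
survey = [
    [1, 2, 1, 2],
    [2, 1, 2, 1],
    [4, 5, 4, 5],
    [5, 4, 5, 4],
    [5, 5, 4, MISSING],
    [1, MISSING, 2, 2],
]

knn_filled = knn_impute(survey)
median_filled = column_impute(survey, median_low)
# statistics.mode returns the first mode on ties (Python 3.8 or later).
mode_filled = column_impute(survey, mode)
```

In this toy data the k-NN estimate for each gap comes from the two most similar respondents, whereas the baselines ignore similarity and use the whole column. That difference is one plausible reading of why k-NN can benefit from additional attributes while the column-wise baselines cannot.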

Acknowledgments

We would like to thank the anonymous reviewers for their helpful comments that have allowed us to improve the paper significantly. This work was partly funded by The Knowledge Foundation in Sweden under a research grant for the project “Blekinge—Engineering Software Qualities (BESQ)” http://www.bth.se/besq.

Author information

Corresponding author

Correspondence to Per Jönsson.

About this article

Cite this article

Jönsson, P., Wohlin, C. Benchmarking k-nearest neighbour imputation with homogeneous Likert data. Empir Software Eng 11, 463–489 (2006). https://doi.org/10.1007/s10664-006-9001-9

  • DOI: https://doi.org/10.1007/s10664-006-9001-9
