Benchmarking k-nearest neighbour imputation with homogeneous Likert data

Jönsson, Per; Wohlin, Claes

doi:10.1007/s10664-006-9001-9

Published: 31 May 2006

Volume 11, pages 463–489 (2006)

Empirical Software Engineering


Abstract

Missing data are common in surveys regardless of research field, undermining statistical analyses and biasing results. One solution is to use an imputation method, which recovers missing data by estimating replacement values. Previously, we have evaluated the hot-deck k-Nearest Neighbour (k-NN) method with Likert data in a software engineering context. In this paper, we extend the evaluation by benchmarking the method against four other imputation methods: Random Draw Substitution, Random Imputation, Median Imputation and Mode Imputation. By simulating both non-response and imputation, we obtain comparable performance measures for all methods. We discuss the performance of k-NN in the light of the other methods, but also for different values of k, different proportions of missing data, different neighbour selection strategies and different numbers of data attributes. Our results show that the k-NN method performs well, even when much data are missing, but has strong competition from both Median Imputation and Mode Imputation for our particular data. However, unlike these methods, k-NN has better performance with more data attributes. We suggest that a suitable value of k is approximately the square root of the number of complete cases, and that letting certain incomplete cases qualify as neighbours boosts the imputation ability of the method.
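
To make the method concrete, the following is a minimal sketch of hot-deck k-NN imputation on Likert answers, not the authors' implementation. It assumes that only complete cases act as donors, that distance is Euclidean over the attributes the incomplete case did answer, and that the replacement value is the lower median of the donors' answers; the function name `knn_impute` and the sample data are hypothetical. The default for k follows the heuristic stated in the abstract, roughly the square root of the number of complete cases.

```python
import math
from statistics import median_low


def knn_impute(cases, k=None):
    """Fill None entries in `cases` (lists of Likert answers) by k-NN hot-deck.

    Assumptions of this sketch (not taken from the paper): donors are
    complete cases only, distance is Euclidean over the answered
    attributes, and the imputed value is the lower median of the donors.
    """
    complete = [c for c in cases if None not in c]
    if not complete:
        raise ValueError("need at least one complete case to act as donor")
    if k is None:
        # Heuristic from the abstract: k is about sqrt(number of complete cases).
        k = max(1, round(math.sqrt(len(complete))))
    k = min(k, len(complete))

    imputed = []
    for case in cases:
        missing = [i for i, v in enumerate(case) if v is None]
        if not missing:
            imputed.append(list(case))
            continue
        observed = [i for i, v in enumerate(case) if v is not None]
        # Rank donors by distance on the attributes this case answered.
        donors = sorted(
            complete,
            key=lambda d: math.sqrt(sum((d[i] - case[i]) ** 2 for i in observed)),
        )[:k]
        filled = list(case)
        for i in missing:
            # median_low always returns a value that a donor actually gave,
            # so the result stays on the Likert scale.
            filled[i] = median_low([d[i] for d in donors])
        imputed.append(filled)
    return imputed


# Hypothetical 5-point Likert survey; the last respondent skipped item 3.
survey = [
    [1, 2, 1, 2],
    [2, 1, 2, 1],
    [5, 4, 5, 4],
    [4, 5, 4, 5],
    [5, 5, None, 4],
]
print(knn_impute(survey)[-1])  # [5, 5, 4, 4]
```

With four complete cases the default is k = 2, so the two closest respondents donate the answers 5 and 4, and the lower median 4 is imputed. The abstract also reports that letting certain incomplete cases qualify as neighbours improves the method; this sketch leaves that strategy out for brevity.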



Acknowledgments

We would like to thank the anonymous reviewers for their helpful comments that have allowed us to improve the paper significantly. This work was partly funded by The Knowledge Foundation in Sweden under a research grant for the project “Blekinge—Engineering Software Qualities (BESQ)” http://www.bth.se/besq.

Author information

Authors and Affiliations

School of Engineering, Blekinge Institute of Technology, PO-Box 520, SE-372 25, Ronneby, Sweden
Per Jönsson & Claes Wohlin


Corresponding author

Correspondence to Per Jönsson.

About this article

Cite this article

Jönsson, P., Wohlin, C. Benchmarking k-nearest neighbour imputation with homogeneous Likert data. Empir Software Eng 11, 463–489 (2006). https://doi.org/10.1007/s10664-006-9001-9

Published: 31 May 2006
Issue Date: September 2006
DOI: https://doi.org/10.1007/s10664-006-9001-9
