A Short Note on Safest Default Missingness Mechanism Assumptions

  • Original Article
Empirical Software Engineering

Abstract

A very common problem when building software engineering models is dealing with missing data. A range of imputation techniques exists to address it. However, selecting an appropriate imputation technique can itself be difficult. One reason is that these techniques make assumptions about the underlying missingness mechanism, that is, how the missing values are distributed within the data set. The difficulty is compounded by the fact that, for small data sets, it may be very hard to determine what the missingness mechanism is. This means there is a danger of using an inappropriate imputation technique. It is therefore necessary to determine the safest default assumption about the missingness mechanism for imputation techniques when dealing with small data sets. We experimentally examine two simple and commonly used techniques, Class Mean Imputation (CMI) and k Nearest Neighbors (k-NN), coupled with two missingness mechanisms: missing completely at random (MCAR) and missing at random (MAR). We draw two conclusions. First, for our analysis CMI is the preferred technique since it is more accurate. Second, and more importantly, the impact of the missingness mechanism on imputation accuracy is not statistically significant. This is a useful finding since it suggests that even for small data sets we can reasonably make the weaker assumption that the missingness mechanism is MAR. Thus both imputation techniques have practical application for small software engineering data sets with missing values.
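To make the terms in the abstract concrete, the following is a minimal Python sketch of the two missingness mechanisms and the two imputation techniques. It is not the authors' experimental code. The toy project records, the column names (`type`, `size`, `effort`), the rule used to induce MAR (dropping values in the rows with the largest `size`), and the mean-absolute-error measure are all assumptions chosen for illustration.

```python
import random
from statistics import mean

# Hypothetical toy data: project type (class), size, and effort.
rows = [
    {"type": "A", "size": 10, "effort": 100},
    {"type": "A", "size": 12, "effort": 120},
    {"type": "A", "size": 14, "effort": 140},
    {"type": "B", "size": 40, "effort": 800},
    {"type": "B", "size": 44, "effort": 880},
    {"type": "B", "size": 48, "effort": 960},
]

def make_mcar(data, col, rate, rng):
    """MCAR: every row has the same chance `rate` of losing its value in `col`."""
    out = [dict(r) for r in data]
    for r in out:
        if rng.random() < rate:
            r[col] = None
    return out

def make_mar(data, col, driver, rate):
    """MAR (one assumed variant): missingness in `col` depends only on the
    fully observed `driver` column; rows with the largest driver values lose `col`."""
    out = [dict(r) for r in data]
    n_missing = round(rate * len(out))
    ranked = sorted(range(len(out)), key=lambda i: out[i][driver], reverse=True)
    for i in ranked[:n_missing]:
        out[i][col] = None
    return out

def class_mean_impute(data, col, class_col):
    """CMI: fill a missing value with the mean of observed values in the same class,
    falling back to the overall observed mean if the class has none."""
    observed = [r[col] for r in data if r[col] is not None]
    out = [dict(r) for r in data]
    for r in out:
        if r[col] is None:
            peers = [p[col] for p in data
                     if p[class_col] == r[class_col] and p[col] is not None]
            r[col] = mean(peers) if peers else mean(observed)
    return out

def knn_impute(data, col, features, k):
    """k-NN: fill a missing value with the mean of `col` over the k complete rows
    closest in squared Euclidean distance on `features`."""
    donors = [p for p in data if p[col] is not None]
    out = [dict(r) for r in data]
    for r in out:
        if r[col] is None:
            nearest = sorted(
                donors, key=lambda p: sum((p[f] - r[f]) ** 2 for f in features)
            )[:k]
            r[col] = mean(p[col] for p in nearest)
    return out

def mean_abs_error(truth, holed, imputed, col):
    """Average absolute error over the cells that were made missing."""
    idx = [i for i, r in enumerate(holed) if r[col] is None]
    return mean(abs(truth[i][col] - imputed[i][col]) for i in idx)

mcar = make_mcar(rows, "effort", 0.3, random.Random(0))
mar = make_mar(rows, "effort", "size", 1 / 3)
cmi = class_mean_impute(mar, "effort", "type")
knn = knn_impute(mar, "effort", ["size"], k=2)
```

Calling `mean_abs_error(rows, mar, cmi, "effort")` and the same with `knn` compares the two techniques on this toy data. Six hand-made rows say nothing about the statistical findings reported in the abstract; the sketch only shows what each term means operationally.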

Author information

Corresponding author

Correspondence to Qinbao Song.

About this article

Cite this article

Song, Q., Shepperd, M. & Cartwright, M. A Short Note on Safest Default Missingness Mechanism Assumptions. Empir Software Eng 10, 235–243 (2005). https://doi.org/10.1007/s10664-004-6193-8

  • DOI: https://doi.org/10.1007/s10664-004-6193-8
