A Short Note on Safest Default Missingness Mechanism Assumptions

Song, Qinbao; Shepperd, Martin; Cartwright, Michelle

doi:10.1007/s10664-004-6193-8

Original Article
Published: April 2005

Volume 10, pages 235–243, (2005)

Empirical Software Engineering

Qinbao Song¹,
Martin Shepperd¹ &
Michelle Cartwright¹

Abstract

Missing data is a very common problem when building software engineering models. A range of imputation techniques exists to address it, but selecting an appropriate technique can itself be difficult. One reason is that these techniques make assumptions about the underlying missingness mechanism, that is, how the missing values are distributed within the data set. The difficulty is compounded for small data sets, where it may be very hard to determine what the missingness mechanism is. There is therefore a danger of using an inappropriate imputation technique, and it becomes necessary to determine the safest default assumption about the missingness mechanism when imputing small data sets. We examine experimentally two simple and commonly used techniques, Class Mean Imputation (CMI) and k Nearest Neighbors (k-NN), coupled with two missingness mechanisms: missing completely at random (MCAR) and missing at random (MAR). We draw two conclusions. First, for our analysis CMI is the preferred technique, since it is more accurate. Second, and more importantly, the impact of the missingness mechanism on imputation accuracy is not statistically significant. This is a useful finding, since it suggests that even for small data sets we can reasonably make the weaker assumption that the missingness mechanism is MAR. Thus both imputation techniques have practical application for small software engineering data sets with missing values.
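To make the terms in the abstract concrete, the following is a minimal Python sketch of the two imputation techniques and the two missingness mechanisms it names. It is not the authors' implementation. All function names (`class_mean_impute`, `knn_impute`, `mcar_mask`, `mar_mask`) are hypothetical, the k-NN distance (root mean squared difference over the columns two rows both observe) is an assumption, and the rank-based MAR rule is only one of many ways to make missingness depend on an observed variable.

```python
import math
import random


def class_mean_impute(rows, classes):
    """CMI sketch: fill each None with the mean of the observed
    values of that column among rows of the same class."""
    result = [list(r) for r in rows]
    for col in range(len(rows[0])):
        observed = {}  # class label -> observed values in this column
        for row, cls in zip(rows, classes):
            if row[col] is not None:
                observed.setdefault(cls, []).append(row[col])
        for i, cls in enumerate(classes):
            if result[i][col] is None and cls in observed:
                vals = observed[cls]
                result[i][col] = sum(vals) / len(vals)
    return result


def knn_impute(rows, k=2):
    """k-NN sketch: fill each None with the mean of that column over
    the k nearest rows that observe it (distance choice is an assumption)."""
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b)
                  if x is not None and y is not None]
        if not shared:
            return math.inf  # no common observed columns
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    result = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for col, value in enumerate(row):
            if value is not None:
                continue
            donors = [(distance(row, other), other[col])
                      for j, other in enumerate(rows)
                      if j != i and other[col] is not None]
            donors = [d for d in donors if d[0] != math.inf]
            donors.sort(key=lambda d: d[0])
            nearest = [v for _, v in donors[:k]]
            if nearest:
                result[i][col] = sum(nearest) / len(nearest)
    return result


def mcar_mask(n, rate, rng):
    """MCAR: every value is missing with the same probability,
    independent of any data."""
    return [rng.random() < rate for _ in range(n)]


def mar_mask(driver, rate, rng):
    """MAR (illustrative): the chance that the target value is missing
    grows with the rank of a fully observed 'driver' column.
    Requires at least two rows."""
    n = len(driver)
    order = sorted(range(n), key=lambda i: driver[i])
    rank = {idx: pos for pos, idx in enumerate(order)}
    return [rng.random() < min(1.0, 2 * rate * rank[i] / (n - 1))
            for i in range(n)]


# Toy data: column 0 is fully observed, column 1 has two gaps.
rows = [[1.0, 10.0], [2.0, None], [3.0, 30.0],
        [10.0, 100.0], [11.0, None], [12.0, 120.0]]
classes = ["small", "small", "small", "large", "large", "large"]

cmi = class_mean_impute(rows, classes)
knn = knn_impute(rows, k=2)
```

An experiment of the kind the abstract describes would delete known values under each mask, impute them with each technique, and compare the imputed values against the originals; the sketch stops short of that comparison.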

Author information

Authors and Affiliations

Empirical Software Engineering Research Group, School of Design, Engineering and Computing, Bournemouth University, UK
Qinbao Song, Martin Shepperd & Michelle Cartwright

Corresponding author

Correspondence to Qinbao Song.

About this article

Cite this article

Song, Q., Shepperd, M. & Cartwright, M. A Short Note on Safest Default Missingness Mechanism Assumptions. Empir Software Eng 10, 235–243 (2005). https://doi.org/10.1007/s10664-004-6193-8

Issue Date: April 2005
DOI: https://doi.org/10.1007/s10664-004-6193-8
