Abstract
The problem of missing values in software measurement data used in empirical analysis has led to the proposal of numerous potential solutions. Imputation procedures, for example, have been proposed to ‘fill in’ the missing values with plausible alternatives. We present a comprehensive study of imputation techniques using real-world software measurement datasets. Two datasets with dramatically different properties were utilized in this study, with missing values injected according to three different missingness mechanisms (MCAR, MAR, and NI). We consider the occurrence of missing values in multiple attributes, and compare three procedures: Bayesian multiple imputation, k Nearest Neighbor imputation, and Mean imputation. We also examine the relationship between noise in the dataset and the performance of the imputation techniques, which has not been addressed previously. Our comprehensive experiments demonstrate conclusively that Bayesian multiple imputation is an extremely effective imputation technique.
References
Allison, P. D. (2000). Missing Data 07-136. Sage University Papers Series on Quantitative Applications in the Social Sciences. Thousand Oaks, CA.
Bremaud, P. (1999). Markov chains: Gibbs fields, Monte Carlo simulation, and queues. Springer.
Cartwright, M. H., Shepperd, M. J., & Song, Q. (2003). Dealing with missing software project data. 9th IEEE Intl. Software Metrics Symposium, pp. 154–165.
Conover, W. J. (1971). Practical nonparametric statistics, 2nd edn. Wiley.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1), 1–38.
Emam, K. E., & Birk, A. (2000). Validating the ISO/IEC 15504 measure of software requirements analysis process capability. IEEE Transactions on Software Engineering, 26(6), 541–566.
Fenton, N. E., & Pfleeger, S. L. (1997). Software metrics: A rigorous and practical approach, 2nd edn. ITP, Boston, MA: PWS Publishing Company.
Jönsson, P., & Wohlin, C. (2004). An evaluation of k-nearest neighbour imputation using likert data. 10th IEEE Intl. Symposium on Software Metrics (METRICS’04), pp. 108–118.
Khoshgoftaar, T. M., & Seliya, N. (2004). Comparative assessment of software quality classification techniques: An empirical case study. Empirical Software Engineering Journal, 9(2), 229–257.
Khoshgoftaar, T. M., & Van Hulse, J. (2005a). Identifying noisy features with the pairwise attribute noise detection algorithm. Intelligent Data Analysis: An International Journal, 9(6), 589–602.
Khoshgoftaar, T. M., & Van Hulse, J. (2005b, August). Empirical case studies in attribute noise detection. In Proceedings of the IEEE International Conference Information Reuse and Integration (pp. 211–216). Las Vegas, NV.
Khoshgoftaar, T. M., & Van Hulse, J. (2006, July). Multiple imputation of software measurement data: A case study. In International Conference on Software Engineering and Knowledge Engineering (SEKE’2006), pp. 220–226.
Khoshgoftaar, T. M., Zhong, S., & Joshi, V. (2005). Enhancing software quality estimation using ensemble-classifier based noise filtering. Intelligent Data Analysis: An International Journal, 9(1), 3–27.
Little, R. J. A., & Rubin, D. B. (2002). Statistical analysis with missing data, 2nd edn. Hoboken, NJ: Wiley.
Myrtveit, I., Stensrud, E., & Olsson, U. (2001). Analyzing data sets with missing data: An empirical evaluation of imputation methods and likelihood-based methods. IEEE Transactions on Software Engineering, 27(11), 999–1013.
Orr, K. (1998). Data quality and systems theory. Communications of the ACM, 41(2), 66–71.
Rahm, E., & Do, H. (2000). Data cleaning: Problems and current approaches. Bulletin of the Technical Committee on Data Engineering 23(4), 3–13.
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. Wiley.
SAS Institute. (2004). SAS/STAT user’s guide. SAS Institute Inc.
Schafer, J. L. (2000). Analysis of incomplete multivariate data. Chapman and Hall/CRC.
Song, Q., Shepperd, M. J., & Cartwright, M. H. (2005). A short note on safest default missingness mechanism assumptions. Empirical Software Engineering, 10(2), 235–243.
Strike, K., Emam, K. E., & Madhavji, N. (2001). Software cost estimation with incomplete data. IEEE Transactions on Software Engineering, 27(10), 890–908.
Tanner, M. A., & Wong, W. H. (1987). The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82, 528–550.
Twala, B., & Cartwright, M. H. (2005). Ensemble imputation methods for missing software engineering data. In Proceedings of 11th IEEE Intl. Software Metrics Symposium, pp. 30–40.
Wohlin, C., Runeson, P., Host, M., Ohlsson, M. C., Regnell, B., & Wesslen, A. (2000). Experimentation in software engineering: An introduction. Boston, MA: Kluwer Academic Publishers.
Yuan, Y. C. (2000). Multiple imputation for missing data: Concepts and new development. In Proceedings of the 25th Annual SAS Users Group International Conference, SAS Institute Paper No 267.
Zhong, S., Khoshgoftaar, T. M., & Seliya, N. (2004, March). Analyzing software measurement data with clustering techniques. IEEE Intelligent Systems, pp. 22–29.
Zhu, X., & Wu, X. (2004). Class noise vs attribute noise: A quantitative study of their impacts. Artificial Intelligence Review, 22(3–4), 177–210.
Acknowledgements
We thank the anonymous reviewers for their constructive comments and suggestions which helped improve this paper. We are grateful to the current and former members of the Empirical Software Engineering and Data Mining and Machine Learning Laboratories at Florida Atlantic University for their reviews and comments.
Appendix: Attribute noise ranking procedure
In related work, we have proposed a procedure for ranking the attributes in a dataset from most to least noisy. Our experiments use this information about the relative quality of each attribute in the datasets studied here. We present a brief overview of the procedure for completeness; additional details are given in Khoshgoftaar and Van Hulse (2005b).
Our attribute noise ranking technique utilizes a procedure called PANDA (Fig. 2), which provides a ranking of instances from most to least noisy based on the Noise Factor S_i. For each observation in the dataset, PANDA examines each pair of attributes and computes the deviation of the second attribute from its mean value given the partitioned value of the first attribute. For a given instance, if these deviations occur often and severely enough when compared to the remainder of the dataset, that instance will appear more noisy.
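The pairwise idea described above can be illustrated with a short Python sketch. This is not the published PANDA algorithm: the equal-frequency partitioning, the standardization by the within-partition standard deviation, and the summation of deviations into a single score are all assumptions made here for illustration, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def equal_frequency_bins(values, n_bins):
    """Assign each value a bin index so that bins hold roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for position, i in enumerate(order):
        bins[i] = position * n_bins // len(values)
    return bins


def pairwise_noise_scores(data, n_bins=2):
    """Return one score per instance; a higher score means more suspicious.

    data is a list of rows, each a list of numeric attribute values.
    Summing standardized deviations is an assumed aggregation, not
    necessarily the Noise Factor defined by the authors.
    """
    n, m = len(data), len(data[0])
    columns = [[row[j] for row in data] for j in range(m)]
    scores = [0.0] * n
    for j in range(m):  # first attribute of the pair: partitioned
        bins = equal_frequency_bins(columns[j], n_bins)
        for k in range(m):  # second attribute of the pair: measured
            if k == j:
                continue
            for b in range(n_bins):
                members = [i for i in range(n) if bins[i] == b]
                if len(members) < 2:
                    continue
                values = [columns[k][i] for i in members]
                mu, sigma = mean(values), pstdev(values)
                if sigma == 0:
                    continue  # no spread in this partition, nothing to compare
                for i in members:
                    scores[i] += abs(columns[k][i] - mu) / sigma
    return scores


# Toy data: the second attribute of row 4 (value 50) breaks the trend.
toy = [[1, 10], [2, 11], [3, 12], [4, 13], [5, 50], [6, 15], [7, 16], [8, 17]]
toy_scores = pairwise_noise_scores(toy, n_bins=2)
most_suspicious = toy_scores.index(max(toy_scores))
```

Sorting the instances by descending score gives an instance ordering of the kind that the attribute ranking procedure takes as input.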
The procedure to rank the attributes from most to least noisy (Khoshgoftaar and Van Hulse 2005b) is presented in Fig. 3. Suppose there are a total of n instances and m attributes. PANDA is executed with all m attributes and the instance noise ranking is created. Denote this ordering by rank and the rank of instance i as rank(i), where 1 ≤ i ≤ n (Line 1 of Fig. 3). Each of the m attributes is removed in turn and PANDA is executed using only the remaining m − 1 attributes (Line 3). The output is another instance ordering from most likely to least likely noise. If the jth attribute is removed, denote the output of PANDA by rank_j and the rank of instance i as rank_j(i). For example, if the second attribute is removed and instance 100 is ranked by PANDA as most noisy, then rank_2(100) = 1.
Kendall’s Tau rank correlation (Conover 1971), a non-parametric measure of the association between two attributes, is calculated between rank_j and rank for each attribute j (Line 4). Attributes are ordered from most noisy to least noisy based on their correlation with the ranking rank when the attribute is removed from the dataset. The attribute that creates the ranking with the lowest correlation to rank after it is removed has the most noise, while the attribute with the highest correlation to rank is considered to be the cleanest attribute.
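The leave-one-attribute-out procedure can be sketched in Python as follows. PANDA itself is not reproduced here: the scorer `zscore_noise_scores` is a hypothetical stand-in (a sum of absolute z-scores) used only so the example runs, and any function that returns one noise score per instance could be passed in its place. The Kendall's Tau routine assumes rankings without ties, which holds because each ranking is a permutation of 1..n.

```python
from statistics import mean, pstdev


def zscore_noise_scores(data):
    """Hypothetical stand-in for PANDA: sum of absolute z-scores per row."""
    m = len(data[0])
    columns = [[row[j] for row in data] for j in range(m)]
    stats = [(mean(c), pstdev(c)) for c in columns]
    return [
        sum(abs(row[j] - mu) / sd for j, (mu, sd) in enumerate(stats) if sd > 0)
        for row in data
    ]


def instance_noise_ranks(data, score_fn):
    """Rank instances from most noisy (rank 1) to least noisy (rank n)."""
    scores = score_fn(data)
    order = sorted(range(len(data)), key=lambda i: -scores[i])
    ranks = [0] * len(data)
    for position, i in enumerate(order, start=1):
        ranks[i] = position
    return ranks


def kendall_tau(a, b):
    """Kendall's Tau for two tie-free rankings of the same instances."""
    n = len(a)
    balance = 0  # concordant pairs minus discordant pairs
    for i in range(n):
        for j in range(i + 1, n):
            s = (a[i] - a[j]) * (b[i] - b[j])
            balance += (s > 0) - (s < 0)
    return balance / (n * (n - 1) / 2)


def rank_attributes_by_noise(data, score_fn=zscore_noise_scores):
    """Return (attribute index, tau) pairs ordered from most to least noisy."""
    m = len(data[0])
    rank = instance_noise_ranks(data, score_fn)  # ranking with all m attributes
    results = []
    for j in range(m):
        reduced = [row[:j] + row[j + 1:] for row in data]  # drop attribute j
        rank_j = instance_noise_ranks(reduced, score_fn)
        results.append((j, kendall_tau(rank_j, rank)))
    results.sort(key=lambda pair: pair[1])  # lowest correlation first
    return results


# Toy data with three attributes (values are illustrative only).
toy = [[1, 10, 3], [2, 11, 40], [3, 12, 5], [4, 13, 6], [5, 14, 7], [6, 15, 1]]
attribute_order = rank_attributes_by_noise(toy)
```

The first pair in `attribute_order` is the attribute whose removal changes the instance ranking the most, which the procedure treats as the noisiest; the last pair is the attribute considered cleanest.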
Khoshgoftaar, T.M., Van Hulse, J. Imputation techniques for multivariate missingness in software measurement data. Software Qual J 16, 563–600 (2008). https://doi.org/10.1007/s11219-008-9054-7
DOI: https://doi.org/10.1007/s11219-008-9054-7