Imputation techniques for multivariate missingness in software measurement data

Khoshgoftaar, Taghi M.; Van Hulse, Jason

doi:10.1007/s11219-008-9054-7

Published: 11 June 2008

Volume 16, pages 563–600, (2008)

Software Quality Journal

Abstract

The problem of missing values in software measurement data used in empirical analysis has led to the proposal of numerous potential solutions. Imputation procedures, for example, have been proposed to ‘fill in’ the missing values with plausible alternatives. We present a comprehensive study of imputation techniques using real-world software measurement datasets. Two datasets with dramatically different properties were utilized in this study, with missing values injected according to three different missingness mechanisms (MCAR, MAR, and NI). We consider the occurrence of missing values in multiple attributes, and compare three procedures: Bayesian multiple imputation, k Nearest Neighbor imputation, and Mean imputation. We also examine the relationship between noise in the dataset and the performance of the imputation techniques, which has not been addressed previously. Our comprehensive experiments demonstrate conclusively that Bayesian multiple imputation is an extremely effective imputation technique.
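To make two of the compared procedures concrete, the following is a minimal sketch of MCAR injection together with mean and k Nearest Neighbor imputation. It is an illustration under stated assumptions, not the experimental code used in the study: the function names, the use of None as the missing-value marker, and the overlap-scaled Euclidean distance are all hypothetical choices. Bayesian multiple imputation and the MAR and NI mechanisms are not sketched here.

```python
import math
import random


def inject_mcar(rows, rate, rng):
    """Return a copy of rows in which each cell is None with probability `rate`.

    Under MCAR the chance of a value being missing does not depend on any data value.
    """
    return [[None if rng.random() < rate else v for v in row] for row in rows]


def mean_impute(rows):
    """Replace each None with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        # Assumed fallback: a fully missing column is filled with 0.0.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def _distance(a, b):
    """Euclidean distance over attributes observed in both rows, scaled by overlap."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))


def knn_impute(rows, k=3):
    """Replace each None with the mean of that attribute over the k nearest donors."""
    fallback = mean_impute(rows)
    result = []
    for i, row in enumerate(rows):
        new_row = list(row)
        for j, v in enumerate(row):
            if v is not None:
                continue
            # Candidate donors are other rows with attribute j observed.
            donors = sorted(
                (_distance(row, other), other[j])
                for t, other in enumerate(rows)
                if t != i and other[j] is not None
            )
            nearest = [val for d, val in donors[:k] if d < math.inf]
            # Fall back to the column mean when no comparable donor exists.
            new_row[j] = sum(nearest) / len(nearest) if nearest else fallback[i][j]
        result.append(new_row)
    return result
```

When the attributes are strongly related, the neighbor-based estimate can track the missing value far more closely than the column mean, which ignores the other attributes of the instance entirely.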

Notes

The reasons for not using 40 and 50% missingness with the MAR and NI mechanisms, which relate to constraints on dataset size, are discussed in Sect. 3.4.
There are some works (Song et al. 2005), however, that use a similar evaluation methodology to the one we present.

References

Allison, P. D. (2000). Missing Data 07-136. Sage University Papers Series on Quantitative Applications in the Social Sciences. Thousand Oaks, CA.
Bremaud, P. (1999). Markov chains: Gibbs fields, Monte Carlo simulation, and queues. Springer.
Cartwright, M. H., Shepperd, M. J., & Song, Q. (2003). Dealing with missing software project data. 9th IEEE Intl. Software Metrics Symposium, pp. 154–165.
Conover, W. J. (1971). Practical nonparametric statistics, 2nd edn. Wiley.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1), 1–38.
Emam, K. E., & Birk, A. (2000). Validating the ISO/IEC 15504 measure of software requirements analysis process capability. IEEE Transactions on Software Engineering, 26(6), 541–566.
Fenton, N. E., & Pfleeger, S. L. (1997). Software metrics: A rigorous and practical approach, 2nd edn. ITP, Boston, MA: PWS Publishing Company.
Jönsson, P., & Wohlin, C. (2004). An evaluation of k-nearest neighbour imputation using Likert data. 10th IEEE Intl. Symposium on Software Metrics (METRICS’04), pp. 108–118.
Khoshgoftaar, T. M., & Seliya, N. (2004). Comparative assessment of software quality classification techniques: An empirical case study. Empirical Software Engineering Journal, 9(2), 229–257.
Khoshgoftaar, T. M., & Van Hulse, J. (2005a). Identifying noisy features with the pairwise attribute noise detection algorithm. Intelligent Data Analysis: An International Journal, 9(6), 589–602.
Khoshgoftaar, T. M., & Van Hulse, J. (2005b, August). Empirical case studies in attribute noise detection. In Proceedings of the IEEE International Conference Information Reuse and Integration (pp. 211–216). Las Vegas, NV.
Khoshgoftaar, T. M., & Van Hulse, J. (2006, July). Multiple imputation of software measurement data: A case study. In International Conference on Software Engineering and Knowledge Engineering (SEKE’2006), pp. 220–226.
Khoshgoftaar, T. M., Zhong, S., & Joshi, V. (2005). Enhancing software quality estimation using ensemble-classifier based noise filtering. Intelligent Data Analysis: An International Journal, 9(1), 3–27.
Little, R. J. A., & Rubin, D. B. (2002). Statistical analysis with missing data, 2nd edn. Hoboken, NJ: Wiley.
Myrtveit, I., Stensrud, E., & Olsson, U. (2001). Analyzing data sets with missing data: An empirical evaluation of imputation methods and likelihood-based methods. IEEE Transactions on Software Engineering, 27(11), 999–1013.
Orr, K. (1998). Data quality and systems theory. Communications of the ACM, 41(2), 66–71.
Rahm, E., & Do, H. (2000). Data cleaning: Problems and current approaches. Bulletin of the Technical Committee on Data Engineering, 23(4), 3–13.
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. Wiley.
SAS Institute. (2004). SAS/STAT user’s guide. SAS Institute Inc.
Schafer, J. L. (2000). Analysis of incomplete multivariate data. Chapman and Hall/CRC.
Song, Q., Shepperd, M. J., & Cartwright, M. H. (2005). A short note on safest default missingness mechanism assumptions. Empirical Software Engineering, 10(2), 235–243.
Strike, K., Emam, K. E., & Madhavji, N. (2001). Software cost estimation with incomplete data. IEEE Transactions on Software Engineering, 27(10), 890–908.
Tanner, M. A., & Wong, W. H. (1987). The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82, 528–550.
Twala, B., & Cartwright, M. H. (2005). Ensemble imputation methods for missing software engineering data. In Proceedings of 11th IEEE Intl. Software Metrics Symposium, pp. 30–40.
Wohlin, C., Runeson, P., Host, M., Ohlsson, M. C., Regnell, B., & Wesslen, A. (2000). Experimentation in software engineering: An introduction. Boston, MA: Kluwer Academic Publishers.
Yuan, Y. C. (2000). Multiple imputation for missing data: Concepts and new development. In Proceedings of the 25th Annual SAS Users Group International Conference, SAS Institute Paper No 267.
Zhong, S., Khoshgoftaar, T. M., & Seliya, N. (2004, March). Analyzing software measurement data with clustering techniques. IEEE Intelligent Systems, pp. 22–29.
Zhu, X., & Wu, X. (2004). Class noise vs attribute noise: A quantitative study of their impacts. Artificial Intelligence Review, 22(3–4), 177–210.

Acknowledgements

We thank the anonymous reviewers for their constructive comments and suggestions which helped improve this paper. We are grateful to the current and former members of the Empirical Software Engineering and Data Mining and Machine Learning Laboratories at Florida Atlantic University for their reviews and comments.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Florida Atlantic University, Boca Raton, FL, 33431, USA
Taghi M. Khoshgoftaar & Jason Van Hulse

Corresponding author

Correspondence to Taghi M. Khoshgoftaar.

Appendix: Attribute noise ranking procedure

In related work, we have proposed a procedure for ranking attributes in a dataset from most to least noisy. Information about the relative quality of each of the attributes in the datasets used in this work will be utilized in our experiments. We present a brief overview of our procedure for completeness; additional information can be obtained in the references (Khoshgoftaar and Van Hulse 2005b).

Our attribute noise ranking technique utilizes a procedure called PANDA (Fig. 2), which provides a ranking of instances from most to least noisy based on the Noise Factor S_i. For each observation in the dataset, PANDA examines each pair of attributes and computes the deviation of the second attribute from its mean value given the partitioned value of the first attribute. For a given instance, if these deviations occur often and severely enough when compared to the remainder of the dataset, that instance will appear more noisy.
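The pairwise idea described above can be illustrated with a rough sketch. This is not the PANDA algorithm of Fig. 2: the equal-frequency partitioning, the number of partitions, the standardization by the within-partition standard deviation, and the summation into a single score are all assumptions made for illustration, and the function names are hypothetical.

```python
import statistics


def noise_factors(rows, n_bins=2):
    """Return one illustrative noise score per instance (larger = more noisy).

    For every ordered attribute pair (first, second), instances are partitioned
    on `first`, and each instance accumulates the standardized deviation of its
    `second` value from the partition mean.
    """
    n, m = len(rows), len(rows[0])
    scores = [0.0] * n
    for first in range(m):
        # Assumed partitioning scheme: equal-frequency bins on the first attribute.
        order = sorted(range(n), key=lambda i: rows[i][first])
        bins = [order[q * n // n_bins:(q + 1) * n // n_bins] for q in range(n_bins)]
        for second in range(m):
            if second == first:
                continue
            for members in bins:
                if len(members) < 2:
                    continue
                values = [rows[i][second] for i in members]
                mean = statistics.mean(values)
                spread = statistics.pstdev(values)
                if spread == 0:
                    continue  # a constant partition carries no deviation
                for i in members:
                    scores[i] += abs(rows[i][second] - mean) / spread
    return scores


def rank_instances(rows, n_bins=2):
    """Return instance indices ordered from most to least noisy."""
    scores = noise_factors(rows, n_bins)
    return sorted(range(len(rows)), key=lambda i: -scores[i])
```

On a small dataset in which two attributes move together except for one corrupted value, the corrupted instance accumulates large deviations in both directions of the pair and is ranked first.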

The procedure to rank the attributes from most to least noisy (Khoshgoftaar and Van Hulse 2005b) is presented in Fig. 3. Suppose there are a total of n instances and m attributes. PANDA is executed with all m attributes and the instance noise ranking is created. Denote this ordering by rank and the rank of instance i as rank(i), where 1 ≤ i ≤ n (Line 1 of Fig. 3). Each of the m attributes is removed and PANDA is executed using only the remaining m − 1 attributes (Line 3). The output is another instance ordering from most likely to least likely noise. If the jth attribute is removed, denote the output of PANDA by rank_j and the rank of instance i as rank_j(i). For example, if the second attribute is removed and instance 100 is ranked by PANDA as most noisy, then rank_2(100) = 1.

Kendall’s Tau rank correlation (Conover 1971), a non-parametric measure of the association between two attributes, is calculated between rank_j and rank for each attribute j (Line 4). Attributes are ordered from most noisy to least noisy based on their correlation with the ranking rank when the attribute is removed from the dataset. The attribute that creates the ranking with the lowest correlation to rank after it is removed has the most noise, while the attribute with the highest correlation to rank is considered to be the cleanest attribute.
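The leave-one-attribute-out procedure can be sketched as follows. The instance ranker is passed in as a parameter; the default used here, zscore_instance_rank, is a deliberately simple stand-in and is not PANDA. Kendall's Tau is computed in its tau-a form, which suffices because the two rankings being compared contain no ties. All function names are hypothetical.

```python
import statistics
from itertools import combinations


def zscore_instance_rank(rows):
    """Stand-in for PANDA (an assumption): rank 1 = largest summed |z-score|."""
    n, m = len(rows), len(rows[0])
    scores = [0.0] * n
    for j in range(m):
        column = [r[j] for r in rows]
        mu, sd = statistics.mean(column), statistics.pstdev(column)
        if sd == 0:
            continue
        for i in range(n):
            scores[i] += abs(column[i] - mu) / sd
    order = sorted(range(n), key=lambda i: -scores[i])
    ranks = [0] * n
    for position, i in enumerate(order, start=1):
        ranks[i] = position
    return ranks


def kendall_tau(x, y):
    """Kendall's tau-a between two rankings of the same instances (no ties)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / pairs


def rank_attributes(rows, instance_rank=zscore_instance_rank):
    """Return attribute indices ordered from most to least noisy."""
    m = len(rows[0])
    full = instance_rank(rows)  # rank: ranking using all m attributes
    taus = []
    for j in range(m):
        # rank_j: ranking obtained after removing attribute j.
        reduced = [[v for c, v in enumerate(r) if c != j] for r in rows]
        taus.append((kendall_tau(instance_rank(reduced), full), j))
    # Lowest correlation with the full ranking means the most noisy attribute.
    return [j for _, j in sorted(taus)]
```

Passing the ranker as an argument keeps the attribute-ranking logic independent of how instance noise is scored, so any procedure that returns rank(i) for each instance can be substituted.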

About this article

Cite this article

Khoshgoftaar, T.M., Van Hulse, J. Imputation techniques for multivariate missingness in software measurement data. Software Qual J 16, 563–600 (2008). https://doi.org/10.1007/s11219-008-9054-7

Published: 11 June 2008
Issue Date: December 2008
DOI: https://doi.org/10.1007/s11219-008-9054-7
