Imputation techniques for multivariate missingness in software measurement data

Software Quality Journal

Abstract

The problem of missing values in software measurement data used in empirical analysis has led to the proposal of numerous potential solutions. Imputation procedures, for example, have been proposed to ‘fill in’ the missing values with plausible alternatives. We present a comprehensive study of imputation techniques using real-world software measurement datasets. Two datasets with dramatically different properties were utilized in this study, with missing values injected according to three different missingness mechanisms: missing completely at random (MCAR), missing at random (MAR), and non-ignorable (NI). We consider the occurrence of missing values in multiple attributes and compare three procedures: Bayesian multiple imputation, k nearest neighbor imputation, and mean imputation. We also examine the relationship between noise in the dataset and the performance of the imputation techniques, which has not been addressed previously. Our comprehensive experiments demonstrate conclusively that Bayesian multiple imputation is an extremely effective imputation technique.

Notes

  1. The reasons for not using 40 and 50% missingness with the MAR and NI mechanisms, which relate to constraints on the dataset size, are discussed in Sect. 3.4.

  2. There are some works (Song et al. 2005), however, that use a similar evaluation methodology to the one we present.

References

  • Allison, P. D. (2000). Missing Data 07-136. Sage University Papers Series on Quantitative Applications in the Social Sciences. Thousand Oaks, CA.

  • Bremaud, P. (1999). Markov chains: Gibbs fields, Monte Carlo simulation, and queues. Springer.

  • Cartwright, M. H., Shepperd, M. J., & Song, Q. (2003). Dealing with missing software project data. 9th IEEE Intl. Software Metrics Symposium, pp. 154–165.

  • Conover, W. J. (1971). Practical nonparametric statistics, 2nd edn. Wiley.

  • Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1), 1–38.

  • Emam, K. E., & Birk, A. (2000). Validating the ISO/IEC 15504 measure of software requirements analysis process capability. IEEE Transactions on Software Engineering, 26(6), 541–566.

  • Fenton, N. E., & Pfleeger, S. L. (1997). Software metrics: A rigorous and practical approach, 2nd edn. ITP, Boston, MA: PWS Publishing Company.

  • Jönsson, P., & Wohlin, C. (2004). An evaluation of k-nearest neighbour imputation using Likert data. 10th IEEE Intl. Symposium on Software Metrics (METRICS’04), pp. 108–118.

  • Khoshgoftaar, T. M., & Seliya, N. (2004). Comparative assessment of software quality classification techniques: An empirical case study. Empirical Software Engineering Journal, 9(2), 229–257.

  • Khoshgoftaar, T. M., & Van Hulse, J. (2005a). Identifying noisy features with the pairwise attribute noise detection algorithm. Intelligent Data Analysis: An International Journal, 9(6), 589–602.

  • Khoshgoftaar, T. M., & Van Hulse, J. (2005b, August). Empirical case studies in attribute noise detection. In Proceedings of the IEEE International Conference Information Reuse and Integration (pp. 211–216). Las Vegas, NV.

  • Khoshgoftaar, T. M., & Van Hulse, J. (2006, July). Multiple imputation of software measurement data: A case study. In International Conference on Software Engineering and Knowledge Engineering (SEKE’2006), pp. 220–226.

  • Khoshgoftaar, T. M., Zhong, S., & Joshi, V. (2005). Enhancing software quality estimation using ensemble-classifier based noise filtering. Intelligent Data Analysis: An International Journal, 9(1), 3–27.

  • Little, R. J. A., & Rubin, D. B. (2002). Statistical analysis with missing data, 2nd edn. Hoboken, NJ: Wiley.

  • Myrtveit, I., Stensrud, E., & Olsson, U. (2001). Analyzing data sets with missing data: An empirical evaluation of imputation methods and likelihood-based methods. IEEE Transactions on Software Engineering, 27(11), 999–1013.

  • Orr, K. (1998). Data quality and systems theory. Communications of the ACM, 41(2), 66–71.

  • Rahm, E., & Do, H. (2000). Data cleaning: Problems and current approaches. Bulletin of the Technical Committee on Data Engineering, 23(4), 3–13.

  • Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. Wiley.

  • SAS Institute. (2004). SAS/STAT user’s guide. SAS Institute Inc.

  • Schafer, J. L. (2000). Analysis of incomplete multivariate data. Chapman and Hall/CRC.

  • Song, Q., Shepperd, M. J., & Cartwright, M. H. (2005). A short note on safest default missingness mechanism assumptions. Empirical Software Engineering, 10(2), 235–243.

  • Strike, K., Emam, K. E., & Madhavji, N. (2001). Software cost estimation with incomplete data. IEEE Transactions on Software Engineering, 27(10), 890–908.

  • Tanner, M. A., & Wong, W. H. (1987). The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82, 528–550.

  • Twala, B., & Cartwright, M. H. (2005). Ensemble imputation methods for missing software engineering data. In Proceedings of 11th IEEE Intl. Software Metrics Symposium, pp. 30–40.

  • Wohlin, C., Runeson, P., Höst, M., Ohlsson, M. C., Regnell, B., & Wesslén, A. (2000). Experimentation in software engineering: An introduction. Boston, MA: Kluwer Academic Publishers.

  • Yuan, Y. C. (2000). Multiple imputation for missing data: Concepts and new development. In Proceedings of the 25th Annual SAS Users Group International Conference, SAS Institute Paper No 267.

  • Zhong, S., Khoshgoftaar, T. M., & Seliya, N. (2004, March). Analyzing software measurement data with clustering techniques. IEEE Intelligent Systems, pp. 22–29.

  • Zhu, X., & Wu, X. (2004). Class noise vs attribute noise: A quantitative study of their impacts. Artificial Intelligence Review, 22(3–4), 177–210.

Acknowledgements

We thank the anonymous reviewers for their constructive comments and suggestions, which helped improve this paper. We are grateful to the current and former members of the Empirical Software Engineering and Data Mining and Machine Learning Laboratories at Florida Atlantic University for their reviews and comments.

Author information

Corresponding author

Correspondence to Taghi M. Khoshgoftaar.

Appendix: Attribute noise ranking procedure

In related work, we have proposed a procedure for ranking attributes in a dataset from most to least noisy. Information about the relative quality of each of the attributes in the datasets used in this work will be utilized in our experiments. We present a brief overview of our procedure for completeness; additional information can be obtained in the references (Khoshgoftaar and Van Hulse 2005b).

Our attribute noise ranking technique utilizes a procedure called PANDA (Fig. 2), which provides a ranking of instances from most to least noisy based on the Noise Factor S_i. For each observation in the dataset, PANDA examines each pair of attributes and computes the deviation of the second attribute from its mean value given the partitioned value of the first attribute. For a given instance, if these deviations occur often and severely enough when compared to the remainder of the dataset, that instance will appear more noisy.

Fig. 2 Pairwise attribute noise detection algorithm (PANDA)
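
The pairwise idea described above can be illustrated with a rough Python sketch. It is not the published PANDA algorithm, whose figure is not reproduced here. It assumes equal-frequency partitioning of the first attribute, the population standard deviation within each partition, and a noise score that simply sums absolute standardized deviations over all ordered attribute pairs. The names equal_frequency_bins, noise_factors and rank_instances, and the parameter n_bins, are illustrative only.

```python
from statistics import mean, pstdev


def equal_frequency_bins(values, n_bins):
    """Assign each value a bin index so that bins hold roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for position, i in enumerate(order):
        bins[i] = min(position * n_bins // len(values), n_bins - 1)
    return bins


def noise_factors(data, n_bins=2):
    """Return one noise score per instance (higher means more suspicious).

    data: list of rows, each a list of numeric attribute values.
    ASSUMPTION: the score sums absolute standardized deviations over all
    ordered attribute pairs; the published Noise Factor may be aggregated
    differently.
    """
    n, m = len(data), len(data[0])
    columns = [[row[j] for row in data] for j in range(m)]
    scores = [0.0] * n
    for j in range(m):                      # attribute used for partitioning
        bins = equal_frequency_bins(columns[j], n_bins)
        for k in range(m):                  # attribute whose deviation is measured
            if k == j:
                continue
            for b in set(bins):
                members = [i for i in range(n) if bins[i] == b]
                values = [columns[k][i] for i in members]
                mu, sigma = mean(values), pstdev(values)
                if sigma == 0:
                    continue                # no spread in this partition
                for i in members:
                    scores[i] += abs(columns[k][i] - mu) / sigma
    return scores


def rank_instances(data, n_bins=2):
    """Indices of instances ordered from most to least noisy."""
    scores = noise_factors(data, n_bins)
    return sorted(range(len(data)), key=lambda i: -scores[i])


# Toy data: the second attribute tracks the first except in the last row.
data = [[1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6], [7, 7], [8, 8], [2, 100]]
print(rank_instances(data))
```

With so crude a score, instances at the extremes of a partition also receive fairly high values, so the sketch is useful for seeing the mechanics rather than for judging the method.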

The procedure to rank the attributes from most to least noisy (Khoshgoftaar and Van Hulse 2005b) is presented in Fig. 3. Suppose there are a total of n instances and m attributes. PANDA is executed with all m attributes and the instance noise ranking is created. Denote this ordering by rank and the rank of instance i as rank(i), where 1 ≤ i ≤ n (Line 1 of Fig. 3). Each of the m attributes is then removed in turn, and PANDA is executed using only the remaining m − 1 attributes (Line 3). The output is another instance ordering, from most likely to least likely to be noisy. If the jth attribute is removed, denote the output of PANDA by rank_j and the rank of instance i as rank_j(i). For example, if the second attribute is removed and instance 100 is ranked by PANDA as most noisy, then rank_2(100) = 1.

Fig. 3 Noisy attribute ranking methodology

Kendall’s Tau rank correlation (Conover 1971), a non-parametric measure of the association between two attributes, is calculated between rank_j and rank for each attribute j (Line 4). Attributes are ordered from most noisy to least noisy based on the correlation between rank_j and rank. The attribute whose removal produces the ranking with the lowest correlation to rank has the most noise, while the attribute whose removal produces the highest correlation to rank is considered the cleanest attribute.
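
The leave-one-attribute-out ranking described in this appendix can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the instance scorer is passed in as a function, and zscore_noise is a hypothetical stand-in for PANDA (it just sums absolute z-scores per row). Kendall's Tau is computed in its simple tie-free form, which suffices because both inputs are strict rankings. All function names are illustrative.

```python
from statistics import mean, pstdev


def kendall_tau(rank_a, rank_b):
    """Kendall's Tau between two tie-free rankings of the same n >= 2 instances.

    rank_a[i] and rank_b[i] are the rank positions of instance i.
    """
    n = len(rank_a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def to_ranks(scores):
    """Rank 1 = highest score (most noisy); ties are broken by instance index."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for position, i in enumerate(order, start=1):
        ranks[i] = position
    return ranks


def zscore_noise(data):
    """HYPOTHETICAL stand-in for PANDA: sum of absolute z-scores per row."""
    stats = [(mean(col), pstdev(col)) for col in zip(*data)]
    return [sum(abs(v - mu) / sd for v, (mu, sd) in zip(row, stats) if sd > 0)
            for row in data]


def rank_attributes_by_noise(data, score_instances=zscore_noise):
    """Return (attribute indices from most to least noisy, tau per attribute)."""
    m = len(data[0])
    rank = to_ranks(score_instances(data))                 # all m attributes
    taus = {}
    for j in range(m):
        reduced = [row[:j] + row[j + 1:] for row in data]  # drop attribute j
        rank_j = to_ranks(score_instances(reduced))
        taus[j] = kendall_tau(rank, rank_j)
    # Lowest correlation with the full ranking marks the noisiest attribute.
    return sorted(range(m), key=lambda j: taus[j]), taus


data = [[1, 0, 10], [2, 0, 0], [3, 0, 5], [4, 0, 1]]
order, taus = rank_attributes_by_noise(data)
print(order, taus)
```

Passing the scorer as an argument keeps the ranking logic independent of how instance noise is measured, so any instance-level noise detector could be substituted for the stand-in.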

Cite this article

Khoshgoftaar, T.M., Van Hulse, J. Imputation techniques for multivariate missingness in software measurement data. Software Qual J 16, 563–600 (2008). https://doi.org/10.1007/s11219-008-9054-7

  • DOI: https://doi.org/10.1007/s11219-008-9054-7
