A comprehensive empirical evaluation of missing value imputation in noisy software measurement data

https://doi.org/10.1016/j.jss.2007.07.043

Abstract

The handling of missing values is a topic of growing interest in the software quality modeling domain. Data values may be absent from a dataset for numerous reasons, for example, the inability to measure certain attributes. Because software engineering datasets are often small, discarding observations (program modules) with incomplete data is usually undesirable: deleting data can result in a significant loss of potentially valuable information. This is especially true when the missing data is located in an attribute that measures the quality of the program module, such as the number of faults observed during testing and after release. We present a comprehensive experimental analysis of five commonly used imputation techniques. This work also considers three different mechanisms governing the distribution of missing values in a dataset, and examines the impact of noise on the imputation process. To our knowledge, this is the first study to thoroughly evaluate the relationship between data quality and imputation. Further, our work is unique in that it employs a software engineering expert to oversee the evaluation of all of the procedures and to ensure that the results are not inadvertently influenced by poor quality data. Based on a comprehensive set of carefully controlled experiments, we conclude that Bayesian multiple imputation and regression imputation are the most effective techniques, while mean imputation performs extremely poorly. Although a preliminary evaluation of Bayesian multiple imputation has been conducted in the empirical software engineering domain, this is the first work to provide a thorough and detailed analysis of the technique. Our studies also demonstrate conclusively that the presence of noisy data has a dramatic impact on the effectiveness of imputation techniques.

Introduction

In the domain of software quality, estimation models are widely recognized as a useful means of optimizing scarce resources. By constructing a model from past-project data that captures the relationship between software metrics (independent variables) and the number of faults (dependent variable) observed in the program modules (instances), the quality of current software development projects can be estimated. This allows additional resources to be allocated optimally for extra inspection of at-risk program modules. Various estimation procedures, such as instance-based learning (Khoshgoftaar and Seliya, 2003a) and linear regression, have been applied to perform such tasks (Khoshgoftaar and Seliya, 2003b).

One of the most important issues that must be addressed before data analysis can proceed is missing data. Linear regression techniques, for example, cannot directly handle missing data. Many software implementations of linear regression delete any instances with missing values (listwise deletion) before the model is constructed. Such an approach leads to significant information loss, especially as the amount of missing data increases, and may produce severely biased models if the missing data is not distributed completely at random (Allison, 2001, Little and Rubin, 2002). In response to these issues, significant effort has been devoted to developing and evaluating imputation methodologies, which ‘fill in’ (impute) the missing values with one or more plausible values.
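The contrast can be made concrete with a minimal Python sketch (the measurement data below is invented for illustration and is not from the paper's experiments):

```python
import numpy as np
import pandas as pd

# Hypothetical software measurement data: one code metric (loc) and the
# dependent variable (nfaults), with two modules missing nfaults.
df = pd.DataFrame({
    "loc":     [120, 45, 300, 80, 210, 150],
    "nfaults": [4.0, 1.0, np.nan, 2.0, np.nan, 5.0],
})

# Listwise deletion: discard every module with an incomplete record.
# Here a third of the (already small) dataset is lost.
listwise = df.dropna()

# A simple imputation instead fills each hole, e.g. with the observed mean.
imputed = df.fillna({"nfaults": df["nfaults"].mean()})

print(listwise)
print(imputed)
```

On this toy dataset, listwise deletion silently discards a third of the modules, which is exactly the information loss described above.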

At present, very little emphasis in the software quality estimation domain has been given to imputation techniques. We therefore present a comprehensive evaluation of five imputation procedures using a real-world software measurement dataset in numerous carefully controlled experiments. As the experimental dataset does not contain missing values, missingness is injected at various levels according to three common missingness mechanisms (explained in Section 3.2.2). We consider missingness only in the dependent variable, which is often the most valuable attribute in the dataset; information loss there is often intolerable, so the successful treatment of missing data in the dependent variable of software measurement data is especially critical. Because software measurement datasets are often small, it may not be practical to eliminate instances with missing data from the analysis. To our knowledge, no research in the software engineering (SE) domain has focused specifically on the important occurrence of missing data in the dependent variable. We believe it is critical to understand and evaluate imputation techniques with data missing from a single variable before multi-attribute missingness is explored. The scenario of missing values in the dependent variable (which relates to the fault-proneness of the software modules) is realistic and commonly encountered in real-world software quality modeling applications: code metrics are often recorded by software tools, while process metrics, which include information about the fault-proneness of the modules, may be self-reported and hence are more likely to be missing. As we discuss in more detail in Section 3.2.2, all three types of missing data mechanisms (MCAR, MAR, and NI) can reasonably occur in measurement data, and hence our simulations are consistent with what is encountered in real-world SE applications. Considering missingness in only the dependent variable also has precedent in the missing-values literature. Little and Rubin (2002) discuss missing values in controlled experiments and contend that ‘since the levels of the design factor in an experiment are fixed by the experimenter, missing values, if they do occur, do so far more frequently in the outcome variable…’. Even though the domain is different, our consideration of missing values only in the dependent variable is well-founded.
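As an illustration of the injection step, the following sketch marks a chosen percentage of dependent-variable values as missing under each mechanism. The synthetic metric, fault counts, and simple selection probabilities are our own assumptions standing in for the paper's exact procedure in Section 3.2.2:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 282                                    # CCCS contains 282 modules
loc = rng.lognormal(4.5, 0.8, n)           # hypothetical code metric
nfaults = rng.poisson(loc / 100.0)         # hypothetical dependent variable

def inject(mechanism, pct):
    """Mark pct (a fraction) of the nfaults values as missing."""
    k = int(round(pct * n))
    if mechanism == "MCAR":                # independent of all data
        idx = rng.choice(n, k, replace=False)
    elif mechanism == "MAR":               # driven by an observed metric
        idx = rng.choice(n, k, replace=False, p=loc / loc.sum())
    elif mechanism == "NI":                # driven by the value that is lost
        w = nfaults + 1.0
        idx = rng.choice(n, k, replace=False, p=w / w.sum())
    else:
        raise ValueError(mechanism)
    mask = np.zeros(n, dtype=bool)
    mask[idx] = True
    return mask

mask = inject("MAR", 0.10)                 # e.g., 10% missing under MAR
y_observed = np.where(mask, np.nan, nfaults.astype(float))
```

Under MCAR every module is equally likely to lose its nfaults value; under MAR the loss probability depends on an observed metric; under NI it depends on the very value that goes missing.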

Our study is the first to evaluate the impact of noise in software measurement data on the imputation process. To the best of our knowledge, our research group was the first to address the important subject of data quality in SE measurement data (Khoshgoftaar and Seliya, 2004, Khoshgoftaar et al., 2005). Data quality has been demonstrated to be a key component of the analytical and knowledge discovery process (Khoshgoftaar and Seliya, 2004), and we have proposed procedures to cope with this critical issue (Khoshgoftaar and Van Hulse, 2006a, Van Hulse et al., 2007). Finally, some preliminary work on Bayesian multiple imputation (MI) in SE has recently been presented (Twala and Cartwright, 2005); however, our work is the first to provide an in-depth analysis of MI for handling missing data in empirical SE applications. Our experimental results strongly support the use of MI and further demonstrate that mean imputation is a very poor imputation technique. In particular, we expand upon our initial results demonstrating the strong imputation performance of MI (Khoshgoftaar and Van Hulse, 2006b).

The following topics are considered in this empirical study, and represent the main contributions of this work:

1. What is the impact of poor data quality (i.e., the presence of noise in a dataset) on the analysis of the effectiveness of imputation techniques?

    (a) The accuracy of an imputation technique should be measured by comparing the imputed value to what the value should be (i.e., the clean value), as opposed to what the value is (i.e., the noisy value). If a particular attribute is noisy, these two values will not be the same. Given a real-world dataset, the difficulty is that it is not typically known which attribute values are noisy or what the ‘correct’ value should be. As noise is a common occurrence in real-world datasets, any analysis of the effectiveness of imputation techniques that does not also consider data quality is fundamentally flawed. To assess the quality of the underlying data, we employ an expert in the SE domain who has worked specifically with the dataset used in our experiments for many years. To our knowledge, our study is the first to consider the impact of poor data quality on the imputation process. The precise method of our evaluation is explained more fully in Section 4.1. From this study, we conclude that the quality of the underlying data cannot be ignored when examining the efficacy of imputation procedures.

    (b) Independent of the above, does noise in a dataset affect the imputation results? In other words, given a dataset with a subset of relatively clean instances and a subset of relatively noisy instances, does the presence of noise dramatically affect the imputed values of the relatively clean instances? Our experiments demonstrate conclusively that data quality plays an important role in the effectiveness of imputation techniques (Section 4.2). We conclude that the imputation results obtained using a noisy dataset will be worse than those obtained using a clean dataset, and that noise adversely impacts the imputation process itself.

2. Which techniques provide the best imputation performance?

    (a) Mean imputation (MEI) performs very poorly. In relation to (1a) above, when the noisy value is used to measure the quality of the imputation, MEI appears to show good results (Section 4.1). When the imputed value is measured against what the value should be (i.e., the clean value), however, MEI has very poor imputation accuracy. MEI is, therefore, not a reliable imputation technique.

    (b) Both Bayesian multiple imputation and regression imputation are very effective for imputing missing values in the dependent variable (a small sketch contrasting regression imputation with MEI follows this list).

3. Is the missingness mechanism (MCAR, MAR, or NI) or the percentage of missing data (5%, 10%, 15%, or 20%) an important factor in the imputation process? Based on our experiments, all imputation techniques perform significantly better when the data is missing completely at random (MCAR). At the four levels considered in this study, the percentage of missing data was not a significant factor.
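To make the comparison in point 2 concrete, the sketch below contrasts MEI with regression imputation, scoring both by the average absolute error against the clean values. The synthetic data and single predictor are our own assumptions; the paper's experiments use the CCCS dataset and five techniques:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 282
x = rng.lognormal(4.0, 0.7, n)                  # one hypothetical metric
y_clean = 0.02 * x + rng.normal(0.0, 1.0, n)    # clean dependent variable

mask = rng.random(n) < 0.10                     # 10% missing, MCAR

# Mean imputation (MEI): every hole receives the observed mean.
mei_hat = np.full(mask.sum(), y_clean[~mask].mean())

# Regression imputation: fit least squares on the complete cases,
# then predict the missing dependent-variable values.
A = np.column_stack([np.ones((~mask).sum()), x[~mask]])
beta, *_ = np.linalg.lstsq(A, y_clean[~mask], rcond=None)
reg_hat = np.column_stack([np.ones(mask.sum()), x[mask]]) @ beta

# Average absolute error measured against the clean values.
for name, hat in (("MEI", mei_hat), ("regression", reg_hat)):
    print(name, np.mean(np.abs(hat - y_clean[mask])))
```

Because MEI ignores the predictors entirely, its imputations collapse to a single value, and its error is typically far worse whenever the dependent variable actually varies with the metrics.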

Listwise deletion (LD), which consists of removing incomplete cases from the dataset, is not considered in this work for several reasons. As previously mentioned, numerous studies (Allison, 2001, Little and Rubin, 2002) have remarked that LD results in a loss of precision and introduces bias when the missing data mechanism is not MCAR. As the objective of this work is to understand the relationship between imputation and data quality, performing LD provides no insight into this relationship. Finally, it may not be possible to use LD in certain circumstances. For example, suppose that a software quality regression model has been constructed using historical fault data, and the number of faults for each program module in a test dataset is to be estimated using the model. Suppose further that some of the instances in the test dataset contain missing values in one of the attributes used in the regression model. Clearly it is infeasible to discard the modules with missing values from the test dataset, so the common practice is to apply MEI before using the regression model. For these reasons, we consider only imputation procedures in this work.
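A small illustration of that scoring scenario follows (the model coefficients and metric values are invented; the point is only that the incomplete test module must be filled rather than dropped):

```python
import numpy as np

# Hypothetical fitted model: nfaults ~ b0 + b1*loc + b2*cc (invented values)
beta = np.array([0.5, 0.015, 0.12])

# Test modules (columns: loc, cc); one cc value is missing.
test = np.array([
    [150.0,  9.0],
    [310.0, np.nan],   # cannot be scored as-is, yet must not be discarded
    [ 75.0,  4.0],
])

# Common practice: mean-impute the missing predictor from the observed
# values in its column, then score every module with the existing model.
filled = np.where(np.isnan(test), np.nanmean(test, axis=0), test)

predictions = np.column_stack([np.ones(len(filled)), filled]) @ beta
print(predictions)
```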

The remainder of this work is organized as follows. Some remarks on related work for imputation in the software engineering domain are presented in Section 2. The design of our experiments is provided in Section 3, with the results presented in Section 4. Conclusions and future work are provided in Section 5.

Section snippets

Related work

Imputation techniques for software cost estimation have been examined in several works (Cartwright et al., 2003, Moses and Farrow, 2005, Myrtveit et al., 2001, Song et al., 2005, Strike et al., 2001, Twala and Cartwright, 2005), though relatively little published work is available in the domain of software defect estimation. Note that none of these works have considered the impact of data quality on imputation, which is one of the main contributions of this work. Further, very little work in SE

CCCS dataset

The CCCS dataset used in our experiments is a large military command, control and communications system written in Ada (Khoshgoftaar and Allen, 1998). CCCS contains 282 instances (program modules), where each instance is an Ada package consisting of one or more procedures. CCCS contains eight software metrics which are used as independent variables or attributes. An additional attribute nfaults (the dependent variable) indicates the number of faults attributed to a module during the system

Empirical results

This section includes the empirical results for our experiments. First, we present the research questions:

  • Q1: Does comparing the results of imputation techniques on a set of noisy data, relative to a noisy or incorrect value, produce unreliable and misleading results? What impact might this have on a comparative study of imputation techniques?

  • Q2: Which imputation techniques produce the best and worst results, as measured by the aae?

  • Q3: Are the missingness mechanism and percentage of missing data

Conclusions

The handling of missing software metrics data is an important research topic in the SE community. Techniques that can be used to effectively and efficiently handle missing data can provide significant benefit to a practitioner in the software quality domain. One extremely important application of imputation techniques is for data missing from the dependent variable of a software measurement dataset, where the dependent variable records the number of faults in the program modules. Recording the

Acknowledgements

We thank the anonymous reviewers for their constructive comments and suggestions. We are grateful to the current and former members of the Empirical Software Engineering and Data Mining and Machine Learning Laboratories at Florida Atlantic University for their reviews and comments.


References (29)

  • Allison, P.D., 2001. Missing Data. 07-136 Sage University Papers Series on Quantitative Applications in the Social...
  • Berenson, M.L., et al., 1983. Intermediate Statistical Methods and Applications: A Computer Package Approach.
  • Bremaud, P., 1999. Markov Chains: Gibbs Fields, Monte Carlo Simulation, and Queues.
  • Cartwright, M.H., Shepperd, M.J., Song, Q., 2003. Dealing with missing software project data. In: Ninth IEEE...
  • Fenton, N.E., et al., 1997. Software Metrics: A Rigorous and Practical Approach.
  • Jonsson, P., Wohlin, C., 2004. An evaluation of k-nearest neighbour imputation using likert data. In: Tenth IEEE...
  • Khoshgoftaar, T.M., et al., 1998. Classification of fault-prone software modules: prior probabilities, costs and model evaluation. Empirical Software Engineering.
  • Khoshgoftaar, T.M., et al., 2003. Analogy-based practical classification rules for software quality estimation. Empirical Software Engineering Journal.
  • Khoshgoftaar, T.M., et al., 2003. Fault-prediction modeling for software quality estimation: comparing commonly used techniques. Empirical Software Engineering Journal.
  • Khoshgoftaar, T.M., et al. The necessity of assuring quality in software measurement data.
  • Khoshgoftaar, T.M., et al., 2006. Determining noisy instances relative to attributes of interest. Intelligent Data Analysis: An International Journal.
  • Khoshgoftaar, T.M., Van Hulse, J., 2006b. Multiple imputation of software measurement data: a case study. In:...
  • Khoshgoftaar, T.M., et al., 2005. Enhancing software quality estimation using ensemble-classifier based noise filtering. Intelligent Data Analysis: An International Journal.
  • Khoshgoftaar, T.M., Van Hulse, J., Seiffert, C., 2006. A hybrid approach to cleansing software measurement data. In:...

    Jason Van Hulse received the Ph.D. degree in Computer Engineering from the Department of Computer Science and Engineering at Florida Atlantic University in 2007, the M.A. degree in Mathematics from Stony Brook University in 2000, and the B.S. degree in Mathematics from the University at Albany in 1997. His research interests include data mining and knowledge discovery, machine learning, computational intelligence, and statistics. He has published numerous peer-reviewed research papers in various conferences and journals, and is a member of the IEEE, IEEE Computer Society, and ACM. He has worked in the data mining and predictive modeling field at First Data Corp. since 2000, and is currently Vice President, Decision Science.

Taghi M. Khoshgoftaar is a professor in the Department of Computer Science and Engineering, Florida Atlantic University, and the Director of the Empirical Software Engineering Laboratory and the Data Mining and Machine Learning Laboratory. His research interests are in software engineering, software metrics, software reliability and quality engineering, computational intelligence, computer performance evaluation, data mining, machine learning, and statistical modeling. He has published more than 350 refereed papers in these areas. He is a member of the IEEE, IEEE Computer Society, and IEEE Reliability Society. He was the Program Chair and General Chair of the IEEE International Conference on Tools with Artificial Intelligence in 2004 and 2005, respectively. He has served on technical program committees of various international conferences, symposia, and workshops. He is the program chair of the 20th International Conference on Software Engineering and Knowledge Engineering (SEKE 2008). He has also served as North American Editor of the Software Quality Journal, and is on the editorial boards of the journals Software Quality and Fuzzy Systems.
