A comprehensive empirical evaluation of missing value imputation in noisy software measurement data

https://doi.org/10.1016/j.jss.2007.07.043

Abstract

The handling of missing values is a topic of growing interest in the software quality modeling domain. Data values may be absent from a dataset for numerous reasons, for example, the inability to measure certain attributes. Because software engineering datasets are often small, discarding observations (program modules) with incomplete data is usually undesirable: deleting data can result in a significant loss of potentially valuable information. This is especially true when the missing data is located in an attribute that measures the quality of the program module, such as the number of faults observed during testing and after release. We present a comprehensive experimental analysis of five commonly used imputation techniques. This work also considers three different mechanisms governing the distribution of missing values in a dataset, and examines the impact of noise on the imputation process. To our knowledge, this is the first study to thoroughly evaluate the relationship between data quality and imputation. Further, our work is unique in that it employs a software engineering expert to oversee the evaluation of all of the procedures and to ensure that the results are not inadvertently influenced by poor quality data. Based on a comprehensive set of carefully controlled experiments, we conclude that Bayesian multiple imputation and regression imputation are the most effective techniques, while mean imputation performs extremely poorly. Although a preliminary evaluation of Bayesian multiple imputation has been conducted in the empirical software engineering domain, this is the first work to provide a thorough and detailed analysis of the technique. Our studies also demonstrate conclusively that the presence of noisy data has a dramatic impact on the effectiveness of imputation techniques.

Introduction

In the domain of software quality, estimation models are widely recognized as a useful means of optimizing scarce resources. By constructing a model from past-project data that captures the relationship between software metrics (independent variables) and the number of faults (dependent variable) observed in the program modules (instances), the quality of current software development projects can be estimated. This allows additional resources to be allocated optimally for extra inspection of at-risk program modules. Various estimation procedures, such as instance-based learning (Khoshgoftaar and Seliya, 2003a) and linear regression, have been applied to perform such tasks (Khoshgoftaar and Seliya, 2003b).

One of the most important issues that must be addressed before data analysis can proceed is missing data. Linear regression techniques, for example, cannot directly handle missing data. Many software implementations of linear regression delete any instances with missing values (listwise deletion) before the model is constructed. Such an approach leads to significant information loss, especially as the amount of missing data increases, and may produce severely biased models if the missing data is not distributed completely at random (Allison, 2001, Little and Rubin, 2002). In response to these issues, significant effort has been devoted to developing and evaluating imputation methodologies, which ‘fill in’ (impute) the missing values with one or more plausible values.
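The contrast can be made concrete with a minimal Python sketch (the measurement data below is invented for illustration and is not from the paper's experiments):

```python
import numpy as np
import pandas as pd

# Hypothetical software measurement data: one code metric (loc) and the
# dependent variable (nfaults), with two modules missing nfaults.
df = pd.DataFrame({
    "loc":     [120, 45, 300, 80, 210, 150],
    "nfaults": [4.0, 1.0, np.nan, 2.0, np.nan, 5.0],
})

# Listwise deletion: discard every module with an incomplete record.
# Here a third of the (already small) dataset is lost.
listwise = df.dropna()

# A simple imputation instead fills each hole, e.g. with the observed mean.
imputed = df.fillna({"nfaults": df["nfaults"].mean()})

print(listwise)
print(imputed)
```

On this toy dataset, listwise deletion silently discards a third of the modules, which is exactly the information loss described above.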

At present, very little emphasis in the software quality estimation domain has been given to imputation techniques. We therefore present a comprehensive evaluation of five imputation procedures using a real-world software measurement dataset in numerous carefully controlled experiments. As the experimental dataset does not contain missing values, missingness is injected at various levels according to three common missingness mechanisms (explained in Section 3.2.2). We consider missingness only in the dependent variable, which is often the most valuable attribute in the dataset; information loss there is often intolerable, so the successful treatment of missing data in the dependent variable of software measurement data is especially critical. Because software measurement datasets are often small, it may not be practical to eliminate instances with missing data from the analysis. To our knowledge, no research in the software engineering (SE) domain has focused specifically on the important occurrence of missing data in the dependent variable. We believe it is critical to understand and evaluate imputation techniques with data missing from a single variable before multi-attribute missingness is explored. The scenario of missing values in the dependent variable (which relates to the fault-proneness of the software modules) is realistic and commonly encountered in real-world software quality modeling applications: code metrics are often recorded by software tools, while process metrics, which include information about the fault-proneness of the modules, may be self-reported and hence are more likely to be missing. As we discuss in more detail in Section 3.2.2, all three types of missing data mechanisms (MCAR, MAR, and NI) can reasonably occur in measurement data, and hence our simulations are consistent with what is encountered in real-world SE applications. Considering missingness in only the dependent variable also has precedent in the missing-values literature. Little and Rubin (2002) discuss missing values in controlled experiments and contend that ‘since the levels of the design factor in an experiment are fixed by the experimenter, missing values, if they do occur, do so far more frequently in the outcome variable…’. Even though the domain is different, our consideration of missing values only in the dependent variable is well-founded.
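As an illustration of the injection step, the following sketch marks a chosen percentage of dependent-variable values as missing under each mechanism. The synthetic metric, fault counts, and simple selection probabilities are our own assumptions standing in for the paper's exact procedure in Section 3.2.2:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 282                                    # CCCS contains 282 modules
loc = rng.lognormal(4.5, 0.8, n)           # hypothetical code metric
nfaults = rng.poisson(loc / 100.0)         # hypothetical dependent variable

def inject(mechanism, pct):
    """Mark pct (a fraction) of the nfaults values as missing."""
    k = int(round(pct * n))
    if mechanism == "MCAR":                # independent of all data
        idx = rng.choice(n, k, replace=False)
    elif mechanism == "MAR":               # driven by an observed metric
        idx = rng.choice(n, k, replace=False, p=loc / loc.sum())
    elif mechanism == "NI":                # driven by the value that is lost
        w = nfaults + 1.0
        idx = rng.choice(n, k, replace=False, p=w / w.sum())
    else:
        raise ValueError(mechanism)
    mask = np.zeros(n, dtype=bool)
    mask[idx] = True
    return mask

mask = inject("MAR", 0.10)                 # e.g., 10% missing under MAR
y_observed = np.where(mask, np.nan, nfaults.astype(float))
```

Under MCAR every module is equally likely to lose its nfaults value; under MAR the loss probability depends on an observed metric; under NI it depends on the very value that goes missing.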

Our study is the first to evaluate the impact of noise in software measurement data on the imputation process. To the best of our knowledge, our research group was the first to address the important subject of data quality in SE measurement data (Khoshgoftaar and Seliya, 2004, Khoshgoftaar et al., 2005). Data quality has been demonstrated to be a key component of the analytical and knowledge discovery process (Khoshgoftaar and Seliya, 2004), and we have proposed procedures to cope with this critical issue (Khoshgoftaar and Van Hulse, 2006a, Van Hulse et al., 2007). Finally, some preliminary work on Bayesian multiple imputation (MI) in SE has recently been presented (Twala and Cartwright, 2005); however, our work is the first to provide an in-depth analysis of MI for handling missing data in empirical SE applications. Our experimental results strongly support the use of MI and further demonstrate that mean imputation is a very poor imputation technique. In particular, we expand upon our initial results demonstrating the strong imputation performance of MI (Khoshgoftaar and Van Hulse, 2006b).

The following topics are considered in this empirical study, and represent the main contributions of this work:

1. What is the impact of poor data quality (i.e., the presence of noise in a dataset) on the analysis of the effectiveness of imputation techniques?

    (a) The accuracy of an imputation technique should be measured by comparing the imputed value to what the value should be (i.e., the clean value), as opposed to what the value is (i.e., the noisy value). If a particular attribute is noisy, these two values will not be the same. Given a real-world dataset, the difficulty is that it is not typically known which attribute values are noisy or what the ‘correct’ value should be. As noise is a common occurrence in real-world datasets, any analysis of the effectiveness of imputation techniques that does not also consider data quality is fundamentally flawed. To assess the quality of the underlying data, we employ an expert in the SE domain who has worked specifically with the dataset used in our experiments for many years. To our knowledge, our study is the first to consider the impact of poor data quality on the imputation process. The precise method of our evaluation is explained more fully in Section 4.1. From this study, we conclude that the quality of the underlying data cannot be ignored when examining the efficacy of imputation procedures.

    (b) Independent of the above, does noise in a dataset affect the imputation results? In other words, given a dataset with a subset of relatively clean instances and a subset of relatively noisy instances, does the presence of noise dramatically affect the imputed values of the relatively clean instances? Our experiments demonstrate conclusively that data quality plays an important role in the effectiveness of imputation techniques (Section 4.2). We conclude that the imputation results obtained using a noisy dataset will be worse than those obtained using a clean dataset, and that noise adversely impacts the imputation process itself.

2. Which techniques provide the best imputation performance?

    (a) Mean imputation (MEI) performs very poorly. In relation to (1a) above, when the noisy value is used to measure the quality of the imputation, MEI appears to show good results (Section 4.1). When the imputed value is measured against what the value should be (i.e., the clean value), however, MEI has very poor imputation accuracy. MEI is, therefore, not a reliable imputation technique.

    (b) Both Bayesian multiple imputation and regression imputation are very effective for imputing missing values in the dependent variable (a small sketch contrasting regression imputation with MEI follows this list).

3. Is the missingness mechanism (MCAR, MAR, or NI) or the percentage of missing data (5%, 10%, 15%, or 20%) an important factor in the imputation process? Based on our experiments, all imputation techniques perform significantly better when the data is missing completely at random (MCAR). At the four levels considered in this study, the percentage of missing data was not a significant factor.
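To make the comparison in point 2 concrete, the sketch below contrasts MEI with regression imputation, scoring both by the average absolute error against the clean values. The synthetic data and single predictor are our own assumptions; the paper's experiments use the CCCS dataset and five techniques:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 282
x = rng.lognormal(4.0, 0.7, n)                  # one hypothetical metric
y_clean = 0.02 * x + rng.normal(0.0, 1.0, n)    # clean dependent variable

mask = rng.random(n) < 0.10                     # 10% missing, MCAR

# Mean imputation (MEI): every hole receives the observed mean.
mei_hat = np.full(mask.sum(), y_clean[~mask].mean())

# Regression imputation: fit least squares on the complete cases,
# then predict the missing dependent-variable values.
A = np.column_stack([np.ones((~mask).sum()), x[~mask]])
beta, *_ = np.linalg.lstsq(A, y_clean[~mask], rcond=None)
reg_hat = np.column_stack([np.ones(mask.sum()), x[mask]]) @ beta

# Average absolute error measured against the clean values.
for name, hat in (("MEI", mei_hat), ("regression", reg_hat)):
    print(name, np.mean(np.abs(hat - y_clean[mask])))
```

Because MEI ignores the predictors entirely, its imputations collapse to a single value, and its error is typically far worse whenever the dependent variable actually varies with the metrics.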

Listwise deletion (LD), which consists of removing incomplete cases from the dataset, is not considered in this work for several reasons. As previously mentioned, numerous studies (Allison, 2001, Little and Rubin, 2002) have remarked that LD results in a loss of precision and introduces bias when the missing data mechanism is not MCAR. As the objective of this work is to understand the relationship between imputation and data quality, performing LD provides no insight into this relationship. Finally, it may not be possible to use LD in certain circumstances. For example, suppose that a software quality regression model has been constructed using historical fault data, and the number of faults for each program module in a test dataset is to be estimated using the model. Suppose further that some of the instances in the test dataset contain missing values in one of the attributes used in the regression model. Clearly it is infeasible to discard the modules with missing values from the test dataset, so the common practice is to apply MEI before using the regression model. For these reasons, we consider only imputation procedures in this work.
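A small illustration of that scoring scenario follows (the model coefficients and metric values are invented; the point is only that the incomplete test module must be filled rather than dropped):

```python
import numpy as np

# Hypothetical fitted model: nfaults ~ b0 + b1*loc + b2*cc (invented values)
beta = np.array([0.5, 0.015, 0.12])

# Test modules (columns: loc, cc); one cc value is missing.
test = np.array([
    [150.0,  9.0],
    [310.0, np.nan],   # cannot be scored as-is, yet must not be discarded
    [ 75.0,  4.0],
])

# Common practice: mean-impute the missing predictor from the observed
# values in its column, then score every module with the existing model.
filled = np.where(np.isnan(test), np.nanmean(test, axis=0), test)

predictions = np.column_stack([np.ones(len(filled)), filled]) @ beta
print(predictions)
```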

The remainder of this work is organized as follows. Some remarks on related work for imputation in the software engineering domain are presented in Section 2. The design of our experiments is provided in Section 3, with the results presented in Section 4. Conclusions and future work are provided in Section 5.

Section snippets

Related work

Imputation techniques for software cost estimation have been examined in several works (Cartwright et al., 2003, Moses and Farrow, 2005, Myrtveit et al., 2001, Song et al., 2005, Strike et al., 2001, Twala and Cartwright, 2005), though relatively little published work is available in the domain of software defect estimation. Note that none of these works have considered the impact of data quality on imputation, which is one of the main contributions of this work. Further, very little work in SE

CCCS dataset

The CCCS dataset used in our experiments is a large military command, control and communications system written in Ada (Khoshgoftaar and Allen, 1998). CCCS contains 282 instances (program modules), where each instance is an Ada package consisting of one or more procedures. CCCS contains eight software metrics which are used as independent variables or attributes. An additional attribute nfaults (the dependent variable) indicates the number of faults attributed to a module during the system

Empirical results

This section includes the empirical results for our experiments. First, we present the research questions:

  • Q1: Does comparing the results of imputation techniques on a set of noisy data, relative to a noisy or incorrect value, produce unreliable and misleading results? What impact might this have on a comparative study of imputation techniques?

  • Q2: Which imputation techniques produce the best and worst results, as measured by the aae?

  • Q3: Are the missingness mechanism and percentage of missing data

Conclusions

The handling of missing software metrics data is an important research topic in the SE community. Techniques that can be used to effectively and efficiently handle missing data can provide significant benefit to a practitioner in the software quality domain. One extremely important application of imputation techniques is for data missing from the dependent variable of a software measurement dataset, where the dependent variable records the number of faults in the program modules. Recording the

Acknowledgements

We thank the anonymous reviewers for their constructive comments and suggestions. We are grateful to the current and former members of the Empirical Software Engineering and Data Mining and Machine Learning Laboratories at Florida Atlantic University for their reviews and comments.


References (29)

  • Allison, P.D., 2001. Missing Data. 07-136 Sage University Papers Series on Quantitative Applications in the Social...
  • Berenson, M.L., et al., 1983. Intermediate Statistical Methods and Applications: A Computer Package Approach.
  • Bremaud, P., 1999. Markov Chains: Gibbs Fields, Monte Carlo Simulation, and Queues.
  • Cartwright, M.H., Shepperd, M.J., Song, Q., 2003. Dealing with missing software project data. In: Ninth IEEE...
  • Fenton, N.E., et al., 1997. Software Metrics: A Rigorous and Practical Approach.
  • Jonsson, P., Wohlin, C., 2004. An evaluation of k-nearest neighbour imputation using likert data. In: Tenth IEEE...
  • Khoshgoftaar, T.M., et al., 1998. Classification of fault-prone software modules: prior probabilities, costs and model evaluation. Empirical Software Engineering.
  • Khoshgoftaar, T.M., et al., 2003. Analogy-based practical classification rules for software quality estimation. Empirical Software Engineering Journal.
  • Khoshgoftaar, T.M., et al., 2003. Fault-prediction modeling for software quality estimation: comparing commonly used techniques. Empirical Software Engineering Journal.
  • Khoshgoftaar, T.M., et al. The necessity of assuring quality in software measurement data.
  • Khoshgoftaar, T.M., et al., 2006. Determining noisy instances relative to attributes of interest. Intelligent Data Analysis: An International Journal.
  • Khoshgoftaar, T.M., Van Hulse, J., 2006b. Multiple imputation of software measurement data: a case study. In:...
  • Khoshgoftaar, T.M., et al., 2005. Enhancing software quality estimation using ensemble-classifier based noise filtering. Intelligent Data Analysis: An International Journal.
  • Khoshgoftaar, T.M., Van Hulse, J., Seiffert, C., 2006. A hybrid approach to cleansing software measurement data. In:...

    Jason Van Hulse received the Ph.D. degree in Computer Engineering from the Department of Computer Science and Engineering at Florida Atlantic University in 2007, the M.A. degree in Mathematics from Stony Brook University in 2000, and the B.S. degree in Mathematics from the University at Albany in 1997. His research interests include data mining and knowledge discovery, machine learning, computational intelligence, and statistics. He has published numerous peer-reviewed research papers in various conferences and journals, and is a member of the IEEE, IEEE Computer Society, and ACM. He has worked in the data mining and predictive modeling field at First Data Corp. since 2000, and is currently Vice President, Decision Science.

Taghi M. Khoshgoftaar is a professor in the Department of Computer Science and Engineering, Florida Atlantic University, and the Director of the Empirical Software Engineering Laboratory and the Data Mining and Machine Learning Laboratory. His research interests are in software engineering, software metrics, software reliability and quality engineering, computational intelligence, computer performance evaluation, data mining, machine learning, and statistical modeling. He has published more than 350 refereed papers in these areas. He is a member of the IEEE, IEEE Computer Society, and IEEE Reliability Society. He was the Program Chair and General Chair of the IEEE International Conference on Tools with Artificial Intelligence in 2004 and 2005, respectively. He has served on technical program committees of various international conferences, symposia, and workshops. He is the program chair of the 20th International Conference on Software Engineering and Knowledge Engineering (SEKE 2008). He has also served as North American Editor of the Software Quality Journal, and is on the editorial boards of the journals Software Quality and Fuzzy Systems.
