Transfer learning for cross-company software defect prediction
Introduction
Predicting the quality of software modules is critical for high-assurance and mission-critical systems. Within-company defect prediction [1], [2], [3], [4], [5], [6], [7] has been well studied over the last three decades. In practice, however, local training data are rarely available, either because past local defective modules are expensive to label or because the modules under development belong to domains unfamiliar to the company. Fortunately, many public data repositories collected from different companies exist. To the best of our knowledge, however, very few studies have focused on prediction models trained with such cross-company data.
Cross-company defect prediction is not a traditional machine learning problem, because the training data and test data follow different distributions. To address this, Turhan et al. [8] use a Nearest Neighbor Filter (NN-filter) to select similar instances from the source data as training data. They discard dissimilar data, which may nevertheless contain information useful for training. Later, Zimmermann et al. [9] used decision trees to help managers estimate precision, recall, and accuracy before attempting a prediction across projects. However, their method does not yield good results when used across different projects. We consider this a typical transfer learning problem, as defined by Pan and Yang [10].
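To make the NN-filter idea concrete, the following is a minimal sketch of the selection step described above: for each target instance, the k nearest source instances (by Euclidean distance) are selected, and the union of all selected instances becomes the training set. This is an illustration of the technique, not Turhan et al.'s implementation; the function name and the choice of k are ours.

```python
import math

def nn_filter(source, target, k=10):
    """NN-filter-style instance selection: for each target row, keep the
    indices of its k nearest source rows (Euclidean distance); the union
    of all kept indices forms the cross-company training set."""
    selected = set()
    for t in target:
        # source indices sorted by distance to this target instance
        by_dist = sorted(range(len(source)),
                         key=lambda i: math.dist(source[i], t))
        selected.update(by_dist[:k])
    return sorted(selected)

# toy data: 6 source instances, 2 target instances, 2 neighbors each
src = [[0.0, 0.0], [1.0, 1.0], [10.0, 10.0],
       [11.0, 11.0], [5.0, 5.0], [0.5, 0.5]]
tgt = [[0.2, 0.2], [10.5, 10.5]]
idx = nn_filter(src, tgt, k=2)
```

Note that instances far from every target neighborhood (such as `[5.0, 5.0]` above) are simply discarded, which is exactly the potential information loss the paper argues against.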
Unlike these papers, we develop a novel transfer learning algorithm called Transfer Naive Bayes (TNB) for cross-company defect prediction. Instead of discarding some training samples, we exploit the information in all the cross-company data during the training step. By weighting the instances of the training data based on target-set information, we build a weighted Naive Bayes classifier. Finally, we perform analysis on publicly available project data sets from NASA and Turkish local software data sets [11]. Our experimental results show that TNB outperforms the state-of-the-art methods on all the data sets.
The rest of this paper is organized as follows. Section 2 briefly reviews the background of transfer learning techniques and software defect prediction algorithms. Section 3 presents our transfer algorithm and analyzes its theoretical runtime cost. Section 4 describes the software defect data sets and performance metrics used in this study, and presents the experimental results with discussion. Section 5 concludes the paper and outlines future work.
Transfer learning techniques
Transfer learning techniques allow the domains, tasks, and distributions of the training data and test data to be different, and have been applied successfully in many real-world applications recently. According to [10], transfer learning is defined as follows: Given a source domain DS and learning task TS, a target domain DT and learning task TT, transfer learning aims to help improve the learning of the target predictive function in DT using the knowledge in DS and TS, where DS ≠ DT or TS ≠ TT.
Transfer learning for software defect prediction
In this section, we present our Transfer Naive Bayes (TNB) algorithm, which is based on Naive Bayes, and give a theoretical runtime cost analysis for it. The main idea of TNB is to weight the training samples according to the similarities between the source and target data at the feature level; a Naive Bayes classifier is then built on these weighted training samples.
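The feature-level weighting described above can be sketched as follows. In this illustration, an instance's similarity s counts how many of its k attribute values fall inside the target data's per-attribute [min, max] range, and the weight uses a gravitation-style formula w = s / (k − s + 1)². Treat the exact formula and the function name as assumptions of this sketch rather than the paper's definitive implementation.

```python
def tnb_weights(source, target):
    """Weight each source instance by feature-level similarity to the
    target data: s = number of attribute values inside the target's
    per-attribute [min, max] range; w = s / (k - s + 1)**2 for k
    attributes. Instances resembling the target get large weights;
    dissimilar ones get small (but not necessarily zero) influence."""
    k = len(target[0])
    lo = [min(row[j] for row in target) for j in range(k)]
    hi = [max(row[j] for row in target) for j in range(k)]
    weights = []
    for x in source:
        s = sum(1 for j in range(k) if lo[j] <= x[j] <= hi[j])
        weights.append(s / (k - s + 1) ** 2)
    return weights

# toy data: target ranges are [1, 4] for attribute 0 and [3, 5] for attribute 1
src = [[2.0, 4.0], [3.0, 10.0], [9.0, 9.0]]
tgt = [[1.0, 3.0], [4.0, 5.0]]
weights = tnb_weights(src, tgt)
```

A weighted Naive Bayes classifier would then use these weights in place of unit counts when estimating class priors and conditional probabilities, so no source instance has to be discarded outright.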
Experiments
In this section we evaluate the TNB algorithm empirically. We use the Naive Bayes classifier in Weka [35] to run the CC (cross-company) method, and we implement the NN-filter and TNB methods in the Weka environment. The experiment focuses on predicting defect-prone software modules across companies. As we will show, TNB significantly improves prediction performance over the sample-selection method, at lower runtime cost, when applied to defect data sets from different companies.
Threats to validity
As with every empirical study, our results are subject to some threats to validity.
Conclusion and future work
In this paper, we addressed the issue of how to predict software defects using cross-company data. In our setting, the labeled training data are available but have a different distribution from the unlabeled test data. We have developed a sample weighting algorithm based on Naive Bayes, called Transfer Naive Bayes.
The TNB algorithm applies the weighted Naive Bayes model by transferring information from the target data to the source data. First, it calculates each attribute's information of
Acknowledgements
This research was partially supported by the National High Technology Research and Development Program of China (No. 2007AA01Z443), the Research Fund for the Doctoral Program of Higher Education (No. 20070614008), and the Fundamental Research Funds for the Central Universities (No. ZYGX2009J066). We thank the anonymous reviewers for their helpful comments.
References (38)
- et al., Predicting software defects in varying development lifecycles using Bayesian nets, Information and Software Technology (2007)
- et al., Object-oriented software fault prediction using neural networks, Information and Software Technology (2007)
- et al., Investigating the effect of dataset size, metrics sets, and feature selection techniques on software fault prediction problem, Information Sciences (2009)
- et al., A comparison of techniques for developing predictive models of software metrics, Information and Software Technology (1997)
- et al., Object-oriented product metrics: a generic framework, Information Sciences (2007)
- et al., Data gravitation based classification, Information Sciences (2009)
- An introduction to ROC analysis, Pattern Recognition Letters (2006)
- et al., Developing interpretable models with optimized set reduction for identifying high risk software components, IEEE Transactions on Software Engineering (1993)
- et al., Application of neural networks to software quality modeling of a very large telecommunications system, IEEE Transactions on Neural Networks (1997)
- et al., Data mining static code attributes to learn defect predictors, IEEE Transactions on Software Engineering (2007)
- On the relative value of cross-company and within-company data for defect prediction, Empirical Software Engineering
Cited by (421)
- A survey on machine learning techniques applied to source code, Journal of Systems and Software (2024)
- ARRAY: Adaptive triple feature-weighted transfer Naive Bayes for cross-project defect prediction, Journal of Systems and Software (2023)
- A soft computing approach for software defect density prediction, Journal of Software: Evolution and Process (2024)
- Joint Instance and Feature Adaptation for Heterogeneous Defect Prediction, IEEE Transactions on Reliability (2024)