Transfer learning for cross-company software defect prediction

https://doi.org/10.1016/j.infsof.2011.09.007

Abstract

Context

Software defect prediction studies have usually built models using within-company data, but very few have focused on prediction models trained with cross-company data. Models built on within-company data are difficult to employ in practice because such local data repositories are often unavailable. Recently, transfer learning has attracted increasing attention for building classifiers in a target domain using data from a related source domain. It is very useful when the distributions of the training and test instances differ, but is it appropriate for cross-company software defect prediction?

Objective

In this paper, we consider the cross-company defect prediction scenario, where the source and target data are drawn from different companies. In order to harness cross-company data, we exploit transfer learning to build a fast and highly effective prediction model.

Method

Unlike prior work that selects training data similar to the test data, we propose a novel algorithm called Transfer Naive Bayes (TNB), which uses the information of all the proper features in the training data. Our solution estimates the distribution of the test data and transfers cross-company data information into the weights of the training data. The defect prediction model is then built on these weighted data.

Results

This article presents a theoretical analysis of the compared methods and reports experimental results on data sets from different organizations. The results indicate that TNB is more accurate in terms of AUC (the area under the receiver operating characteristic curve) and requires less runtime than the state-of-the-art methods.

Conclusion

We conclude that when there are too few local training data to train good classifiers, useful knowledge from differently-distributed training data at the feature level may help. We are optimistic that our transfer learning method can guide optimal resource allocation strategies, which may reduce software testing cost and increase the effectiveness of the software testing process.

Introduction

Predicting the quality of software modules is critical for high-assurance and mission-critical systems. Within-company defect prediction [1], [2], [3], [4], [5], [6], [7] has been well studied over the last three decades. In practice, however, local training data are rarely available, either because past local defective modules are expensive to label, or because the modules being developed belong to domains unfamiliar to the company. Fortunately, many public data repositories from different companies exist. Yet, to the best of our knowledge, very few studies have focused on prediction models trained with such cross-company data.

Cross-company defect prediction is not a traditional machine learning problem, because the training data and test data follow different distributions. To address this, Turhan et al. [8] use a Nearest Neighbor Filter (NN-filter) to select data similar to the test data from the source data as training data. They discard dissimilar data, which may still contain useful information for training. Later, Zimmermann et al. [9] use decision trees to help managers estimate precision, recall, and accuracy before attempting a prediction across projects. However, their method does not yield good results when applied across different projects. We consider this a critical transfer learning problem, as defined by Pan and Yang [10].
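The NN-filter idea above can be sketched briefly: for each test (target) instance, keep its k nearest cross-company instances by Euclidean distance over the metric features, and pool the union as the training set. The function name and the default k = 10 here are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def nn_filter(source_X, source_y, target_X, k=10):
    """Sketch of an NN-filter: for each target instance, select its k
    nearest source instances (Euclidean distance) and return the union
    of all selected instances as the filtered training set."""
    selected = set()
    for t in target_X:
        d = np.linalg.norm(source_X - t, axis=1)   # distances to every source row
        selected.update(np.argsort(d)[:k].tolist())  # indices of k nearest
    idx = sorted(selected)
    return source_X[idx], source_y[idx]
```

Note that dissimilar source instances are simply dropped, which is exactly the information loss the paper's TNB approach tries to avoid.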

Unlike these papers, we develop a novel transfer learning algorithm called Transfer Naive Bayes (TNB) for cross-company defect prediction. Instead of discarding some training samples, we exploit the information of all the cross-company data in the training step. By weighting the instances of the training data based on information from the target set, we build a weighted Naive Bayes classifier. Finally, we perform an analysis on publicly available project data sets from NASA and Turkish local software data sets [11]. Our experimental results show that TNB gives better performance on all the data sets when compared with the state-of-the-art methods.

The rest of this paper is organized as follows. Section 2 briefly reviews the background on transfer learning techniques and software defect prediction algorithms. Section 3 presents our transfer algorithm and analyzes its theoretical runtime cost. Section 4 describes the software defect data sets and the performance metrics used in this study, and reports the experimental results with discussion. Section 5 concludes the paper and outlines future work.

Section snippets

Transfer learning techniques

Transfer learning techniques allow the domains, tasks, and distributions of the training data and test data to differ, and have recently been applied successfully in many real-world applications. According to [10], transfer learning is defined as follows: Given a source domain DS with learning task TS, and a target domain DT with learning task TT, transfer learning aims to help improve the learning of the target predictive function in DT using the knowledge in DS and TS, where DS ≠ DT or TS ≠ TT.

Transfer learning for software defect prediction

In this section, we present our Transfer Naive Bayes (TNB) algorithm, based on Naive Bayes, and give a theoretical runtime cost analysis for it. The main idea of TNB is to assign weights to the training samples according to the feature-level similarities between the source and target data; a Naive Bayes classifier is then built on these weighted training samples.
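As a rough sketch of this weighting idea: each source instance can receive a similarity score s counting how many of its attribute values fall inside the [min, max] range of the corresponding test-set attribute, with instances scoring higher receiving larger weights. The particular weighting formula below, w = s / (k − s + 1)², is one plausible instantiation written for illustration, not a verbatim reproduction of the paper's scheme.

```python
import numpy as np

def tnb_weights(source_X, target_X):
    """Feature-level similarity weighting (illustrative): for each source
    instance, s counts how many attribute values lie within the target
    data's per-attribute [min, max] range; the weight grows with s as
    w = s / (k - s + 1)**2, where k is the number of attributes."""
    lo, hi = target_X.min(axis=0), target_X.max(axis=0)
    k = source_X.shape[1]                        # number of attributes
    inside = (source_X >= lo) & (source_X <= hi)  # per-attribute range test
    s = inside.sum(axis=1)                       # per-instance similarity score
    return s / (k - s + 1) ** 2

# These weights could then drive a weighted Naive Bayes, e.g. via scikit-learn:
# GaussianNB().fit(source_X, source_y, sample_weight=tnb_weights(source_X, target_X))
```

The key contrast with the NN-filter is that no source instance is discarded: dissimilar instances are merely down-weighted.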

Experiments

In this section, we evaluate the TNB algorithm empirically. We use the Naive Bayes classifier in Weka [35] for the CC method, and implement the NN-filter and TNB methods in the Weka environment. We focus on cross-company prediction of defect-prone software modules in this experiment. As we show later, TNB significantly improves prediction performance, in less time, over the sample-selection method when applied to defect data sets from different companies.
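AUC, the headline metric of the evaluation, can be computed directly from predicted scores via the rank-sum (Mann-Whitney) identity; the minimal implementation below is an illustration, not the paper's Weka setup.

```python
def auc(scores, labels):
    """AUC as the probability that a randomly chosen defective module
    (label 1) receives a higher score than a randomly chosen
    non-defective one (label 0), with ties counted as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect ranking, which is why AUC is a natural choice for the imbalanced defect data sets used here.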

Threats to validity

As with every empirical study, our results are subject to some threats to validity.

Conclusion and future work

In this paper, we addressed the issue of how to predict software defects using cross-company data. In our setting, labeled training data are available but have a different distribution from the unlabeled test data. We have developed a sample weighting algorithm based on Naive Bayes, called Transfer Naive Bayes (TNB).

The TNB algorithm applies the weighted Naive Bayes model by transferring information from the target data to the source data. First, it calculates each attribute's information of

Acknowledgements

This research was partially supported by the National High Technology Research and Development Program of China (No. 2007AA01Z443), the Research Fund for the Doctoral Program of Higher Education (No. 20070614008), and the Fundamental Research Funds for the Central Universities (No. ZYGX2009J066). We thank the anonymous reviewers for their helpful comments.

References (38)

  • T. Menzies, B. Turhan, A. Bener, G. Gay, B. Cukic, Y. Jiang, Implications of ceiling effects in defect predictors, in:...
  • B. Turhan et al.

    On the relative value of cross-company and within-company data for defect prediction

    Empirical Software Engineering

    (2009)
  • T. Zimmermann, N. Nagappan, H. Gall, E. Giger, Cross-project defect prediction: a large scale experiment on data vs....
  • S.J. Pan, Q. Yang, A survey on transfer learning, Technical Report HKUST-CS 08-08, Department of Computer Science and...
  • G. Boetticher, T. Menzies, T. Ostrand, The PROMISE Repository of Empirical Software Engineering Data, 2007...
  • Y. Shi, Z. Lan, W. Liu, W. Bi, Extending semi-supervised learning methods for inductive transfer learning, In: Ninth...
  • J. Huang, A. Smola, A. Gretton, K.M. Borgwardt, B. Scholkopf, Correcting sample selection bias by unlabeled data, in:...
  • S. Bickel, M. Bruckner, T. Scheffer, Discriminative learning for differing training and test distributions, in:...
  • M. Sugiyama, S. Nakajima, H. Kashima, P.V. Buenau, M. Kawanabe, Direct importance estimation with model selection and...