Cross-company defect prediction via semi-supervised clustering-based data filtering and MSTrA-based transfer learning

Yu, Xiao; Wu, Man; Jian, Yiheng; Bennin, Kwabena Ebo; Fu, Mandi; Ma, Chuanxiang

doi:10.1007/s00500-018-3093-1

Cross-company defect prediction via semi-supervised clustering-based data filtering and MSTrA-based transfer learning

Methodologies and Application
Published: 08 March 2018

Volume 22, pages 3461–3472, (2018)
Cite this article

Soft Computing Aims and scope Submit manuscript

Xiao Yu^1,2,
Man Wu³,
Yiheng Jian⁴,
Kwabena Ebo Bennin⁵,
Mandi Fu³ &
…
Chuanxiang Ma^2,6

778 Accesses
35 Citations
Explore all metrics

Abstract

Cross-company defect prediction (CCDP) is a practical way that trains a prediction model by exploiting one or multiple projects of a source company and then applies the model to a target company. Unfortunately, larger irrelevant cross-company (CC) data usually make it difficult to build a prediction model with high performance. On the other hand, brute force leveraging of CC data poorly related to within-company data may decrease the prediction model performance. To address such issues, we aim to provide an effective solution for CCDP. First, we propose a novel semi-supervised clustering-based data filtering method (i.e., SSDBSCAN filter) to filter out irrelevant CC data. Second, based on the filtered CC data, we for the first time introduce multi-source TrAdaBoost algorithm, an effective transfer learning method, into CCDP to import knowledge not from one but from multiple sources to avoid negative transfer. Experiments on 15 public datasets indicate that: (1) our proposed SSDBSCAN filter achieves better overall performance than compared data filtering methods; (2) our proposed CCDP approach achieves the best overall performance among all tested CCDP approaches; and (3) our proposed CCDP approach performs significantly better than with-company defect prediction models.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

A Three-Level Training Data Filter for Cross-project Defect Prediction

MHCPDP: multi-source heterogeneous cross-project defect prediction via multi-source transfer learning and autoencoder

Article 27 April 2021

Global vs. local models for cross-project defect prediction

Article 24 October 2016

References

Arar Ömer Faruk, Ayan Kürşat (2015) Software defect prediction using cost-sensitive neural network. Appl Soft Comput 33:263–277
Article Google Scholar
Bennin KE, Keung J, Phannachitta P, et al (2017) MAHAKIL: diversity based oversampling approach to alleviate the class imbalance issue in software defect prediction. IEEE Trans Softw Eng. https://doi.org/10.1109/TSE.2017.2731766
Bennin K, Keung J, Monden A, et al (2017) The significant effects of data sampling approaches on software defect prioritization and classification. In:11th International symposium on empirical software engineering and measurement, ESEM 2017
Boetticher G, Menzies T, Ostrand T (2007) PROMISE Repository of empirical software engineering data, West Virginia University, Department of Computer Science. http://promisedata.org/repository
Breiman L (2001) Random forests. Mach Learn 45(1):5–32
Article MATH Google Scholar
Briand LC, Melo WL, Wust J (2002) Assessing the applicability of fault-proneness models across object-oriented software projects. IEEE Trans Softw Eng 28(7):706–720
Article Google Scholar
Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357
MATH Google Scholar
Chen L, Fang B, Shang Z et al (2015) Negative samples reduction in cross-company software defects prediction. Inf Softw Technol 62:67–77
Article Google Scholar
Dai W et al (2007) Boosting for transfer learning. In: 24th International conference on Machine learning, pp 193–200
Dhanajayan RCG, Pillai SA (2016) SLMBC: spiral life cycle model-based Bayesian classification technique for efficient software fault prediction and classification, Soft Computing, 1-13
Elish KO, Elish MO (2008) Predicting defect-prone software modules using support vector machines. Softw J Syst Softw 81(5):649–660
Article Google Scholar
Erturk Ezgi, Sezer Ebru Akcapinar (2016) Iterative software fault prediction with a hybrid approach. Appl Soft Comput 49:1020–1033
Article Google Scholar
Field AP (2001) Discovering statistics using SPSS for windows: advanced techniques for beginners, pp 551–552
Fukunaga K, Narendra PM (1975) A branch and bound algorithm for computing k-nearest neighbors. IEEE Trans Comput 100(7):750–753
Article MATH Google Scholar
Gray D, Bowes D, Davey N, et al (2009) Using the support vector machine as a classification method for software defect prediction with static code metrics. In: International conference on engineering applications of neural networks. Springer, Berlin, pp 223–234
Hall T, Beecham S, Bowes D, Gray D, Counsell S (2012) A systematic literature review on fault prediction performance in software engineering. IEEE Trans Softw Eng 38(6):1276–1304
Article Google Scholar
Hosmer DW, Lemeshow S (2000) Introduction to the logistic regression model. Appl Logist Regres 1–30
Jain K (2010) Data clustering: 50 years beyond K-means. Pattern Recognit Lett 31(8):651–666
Article Google Scholar
Jing X et al (2015) Heterogeneous cross-company defect prediction by unified metric representation and CCA-based transfer learning. In Proceedings of the 10th joint meeting on foundations of software engineering, pp 496–507
Jing XY, Ying S, Zhang ZW, Wu SS, Liu J (2014) Dictionary learning based software defect prediction. In: Proceedings of the 36th International Conference on Software Engineering, pp 414–423
Kampenes V By et al (2007) A systematic review of effect size in software engineering experiments. Inf Softw Technol 49(11):1073–1086
Article Google Scholar
Kawata K, Amasaki S, Yokogawa T (2016) Improving relevancy filter methods for cross-project defect prediction, applied computing & information technology, pp 1–12
Laradji IH, Alshayeb M, Ghouti L (2015) Software defect prediction using ensemble learning on selected features. Inf. Softw. Technol. 58:388–402
Article Google Scholar
Lelis L, Sander J (2009) Semi-supervised density-based clustering. In: 9th IEEE international conference on data mining, pp 842–847
Lewis DD (1998) Naive (Bayes) at forty the independence assumption in information retrieval. In: European conference on machine learning, pp 4–15
Ma Y, Luo G, Zeng X, Chen A (2012) Transfer learning for cross-company software defect prediction. Inf Softw Technol 54(3):248–256
Article Google Scholar
Malhotra Ruchika (2015) A systematic review of machine learning techniques for software fault prediction. Appl Soft Comput 27:504–518
Article Google Scholar
Mesquita PP Diego et al (2016) Classification with reject option for software defect prediction. Appl Soft Comput 49:1085–1093
Article Google Scholar
Nam J, Pan SJ, Kim S (2013) Transfer defect learning. In: Proceedings of the 2013 international conference on software engineering. IEEE Press, pp 382–391
Peng L, Yang B, Chen Y, Abraham A (2009) Data gravitation based classification. Inf Sci 179(6):809–819
Article MATH Google Scholar
Peters F, Menzies T, Marcus A (2013) Better cross company defect prediction. In: Proceedings of the 10th international workshop on mining software repositories, pp 409–418
Ryu D, Jang JI, Baik J (2015) A hybrid instance selection using nearest-neighbor for cross-project defect prediction. J Comput Sci Technol 30(5):969–980
Article Google Scholar
Seliya N, Khoshgoftaar TM (2011) The use of decision trees for cost- sensitive classification: an empirical study in software quality prediction. Wiley Interdiscip Rev Data Min Knowl Discov 1(5):448–459
Article Google Scholar
Shepperd M, Bowes D, Hall T (2014) Researcher bias The use of machine learning in software defect prediction. IEEE Trans Softw Eng 40(6):603–616
Article Google Scholar
Shukla S, Radhakrishnan T, Muthukumaran K, et al (2016) Multi-objective cross-version defect prediction, Soft Computing 1-22
Siers MJ, Islam MZ (2015) Software defect prediction using a cost sensitive decision forest and voting, and a potential solution to the class imbalance problem. Inf. Syst. 51:62–71
Article Google Scholar
Song Q, Jia Z, Shepperd M et al (2011) A general software defect-proneness prediction framework. IEEE Trans Softw Eng 37(3):356–370
Article Google Scholar
Sun Z, Song Q, Zhu X (2012) Using coding-based ensemble learning to improve software defect prediction. IEEE Trans Syst Man Cybern Part C (Appl Rev) 42(6):1806–1817
Article Google Scholar
Turhan B, Menzies T, Bener AB, Di Stefano J (2009) On the relative value of cross-company and within-company data for defect prediction. Empir Softw Eng 14(5):540–578
Article Google Scholar
Turhan B, Tosun Mısırlı A, Bener A (2013) Empirical evaluation of the effects of mixed project data on learning defect predictors. Inf Softw Technol 55(6):1101–1118
Article Google Scholar
Vashisht V, Lal M, Sureshchandar GS et al (2015) A framework for software defect prediction using neural networks. J Softw Eng Appl 8(8):384
Article Google Scholar
Wang J, Shen B, Chen Y (2012) Compressed C4. 5 models for software defect prediction. In: 12th international conference on quality software, pp 13–16
Wilcoxon Frank (1945) Individual comparisons by ranking methods. Biom Bull 1(6):80–83
Article Google Scholar
Yan Z, Chen X, Guo P (2010) Software defect prediction using fuzzy support vector regression. In: International Symposium on Neural Networks. Springer, Berlin Heidelberg, pp 17–24
Yao Y, Doretto G (2010) Boosting for transfer learning with multiple sources. In: IEEE conference on computer vision and pattern recognition, pp 1855–1862
Zimmermann T, Nagappan N, Gall H, Giger E, Murphy B (2009) Cross-project defect prediction: a large scale experiment on data vs. domain vs. process. In: Proceedings of the the 7th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on The foundations of software engineering, pp 91–100

Download references

Acknowledgements

This work is partly supported by the grants of National Natural Science Foundation of China (61070013, 61300042, U1135005, 71401128), the Fundamental Research Funds for the Central Universities (Nos. 2042014kf0272, 2014211020201) and Natural Science Foundation of HuBei (2011CDB072).

Author information

Authors and Affiliations

School of Computer, Wuhan University, Wuhan, China
Xiao Yu
School of Computer Science and Information Engineering, Hubei University, Wuhan, China
Xiao Yu & Chuanxiang Ma
Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Wuhan, China
Man Wu & Mandi Fu
School of Information and Electronics, Beijing Institute of Technology, Beijing, China
Yiheng Jian
Department of Computer Science, City University of Hong Kong, Hong Kong, China
Kwabena Ebo Bennin
Educational Informationalization Engineering Research Center of Hubei Province, Wuhan, China
Chuanxiang Ma

Authors

Xiao Yu
View author publications
You can also search for this author in PubMed Google Scholar
Man Wu
View author publications
You can also search for this author in PubMed Google Scholar
Yiheng Jian
View author publications
You can also search for this author in PubMed Google Scholar
Kwabena Ebo Bennin
View author publications
You can also search for this author in PubMed Google Scholar
Mandi Fu
View author publications
You can also search for this author in PubMed Google Scholar
Chuanxiang Ma
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Chuanxiang Ma.

Ethics declarations

Conflicts of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Informed consent

This article does not contain any studies with human participants.

Additional information

Communicated by V. Loia.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Yu, X., Wu, M., Jian, Y. et al. Cross-company defect prediction via semi-supervised clustering-based data filtering and MSTrA-based transfer learning. Soft Comput 22, 3461–3472 (2018). https://doi.org/10.1007/s00500-018-3093-1

Download citation

Published: 08 March 2018
Issue Date: May 2018
DOI: https://doi.org/10.1007/s00500-018-3093-1

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Cross-company defect prediction via semi-supervised clustering-based data filtering and MSTrA-based transfer learning

Abstract

Access this article

Similar content being viewed by others

A Three-Level Training Data Filter for Cross-project Defect Prediction

MHCPDP: multi-source heterogeneous cross-project defect prediction via multi-source transfer learning and autoencoder

Global vs. local models for cross-project defect prediction

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflicts of interest

Ethical approval

Informed consent

Additional information

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Cross-company defect prediction via semi-supervised clustering-based data filtering and MSTrA-based transfer learning

Abstract

Access this article

Similar content being viewed by others

A Three-Level Training Data Filter for Cross-project Defect Prediction

MHCPDP: multi-source heterogeneous cross-project defect prediction via multi-source transfer learning and autoencoder

Global vs. local models for cross-project defect prediction

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflicts of interest

Ethical approval

Informed consent

Additional information

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation