Imbalanced Data Processing Model for Software Defect Prediction

Zhou, Lijuan; Li, Ran; Zhang, Shudong; Wang, Hua

doi:10.1007/s11277-017-5117-z

Published: 14 December 2017

Wireless Personal Communications, Volume 102, pages 937–950 (2018)

Abstract

Software defect prediction is an active research topic in software engineering because it helps assure quality during software development. Class-imbalanced datasets, however, reduce the overall classification accuracy of defect prediction, and this remains a pressing problem. To address it, this paper proposes a model named ASRA that combines attribute selection, sampling techniques and an ensemble algorithm. The model first applies Chi-square attribute selection to remove redundant attributes, and then applies a combined sampling technique, SMOTE over-sampling together with under-sampling, to balance the datasets. The ASRA model is then built with the AdaBoost ensemble algorithm, using the J48 decision tree as the base classifier. The data used in the experiments come from UCI datasets. Comparing the precision P, F-measure and AUC values obtained in the experiments shows that the model improves software defect prediction classification.
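The combined sampling step can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy data, the neighbour count k, the target class size and the use of plain random under-sampling are all assumptions made for illustration, since the abstract does not specify them. The over-sampling part follows the general SMOTE idea of interpolating between a minority sample and one of its nearest minority neighbours.

```python
import random
from math import dist


def smote_oversample(minority, n_new, k=3, rng=None):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours (SMOTE-style)."""
    if rng is None:
        rng = random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding `base` itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


def random_undersample(majority, n_keep, rng=None):
    """Keep a random subset of n_keep majority samples (assumed variant)."""
    if rng is None:
        rng = random.Random(0)
    return rng.sample(majority, n_keep)


def combined_sampling(majority, minority, target, k=3, seed=0):
    """Bring both classes to at most/at least `target` samples each."""
    rng = random.Random(seed)
    kept = random_undersample(majority, min(target, len(majority)), rng)
    extra = smote_oversample(minority, max(0, target - len(minority)), k, rng)
    return kept, minority + extra


# Hypothetical toy data: two numeric metrics per software module.
majority = [(float(i), float(i % 5)) for i in range(20)]      # non-defective
minority = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0), (2.5, 2.5)]   # defective

kept, balanced_minority = combined_sampling(majority, minority, target=10)
print(len(kept), len(balanced_minority))
```

In this sketch both classes end up with the same number of samples. In the pipeline the abstract describes, the balanced data would then be passed to the boosted decision-tree classifier.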

Acknowledgements

The authors would like to thank the editors and all of the reviewers for their help in developing this work. Special thanks are due to the authors and experts of the UCI Machine Learning Repository, who provided the numerous datasets used in the experimental chapters of this paper. This work was supported by “The computer application technology” Beijing municipal key construction of the discipline. The authors acknowledge the support of Capital Normal University during the completion of the thesis and of the National Natural Science Foundation (Grant: 61601310).

Author information

Authors and Affiliations

College of Information Engineering, Capital Normal University, Beijing, China
Lijuan Zhou, Ran Li, Shudong Zhang & Hua Wang

Corresponding author

Correspondence to Ran Li.

About this article

Cite this article

Zhou, L., Li, R., Zhang, S. et al. Imbalanced Data Processing Model for Software Defect Prediction. Wireless Pers Commun 102, 937–950 (2018). https://doi.org/10.1007/s11277-017-5117-z

Issue Date: September 2018