
Imbalanced Data Processing Model for Software Defect Prediction


Abstract

In software engineering, software defect prediction is an active research topic that can effectively help guarantee quality during software development. However, class-imbalanced datasets degrade the overall classification accuracy of defect prediction, and addressing this imbalance remains a key open problem. To tackle it, this paper proposes a model named ASRA that combines attribute selection, sampling techniques, and an ensemble algorithm. The model first applies the Chi-square test for attribute selection to remove redundant attributes, then uses a combined sampling technique, SMOTE over-sampling together with under-sampling, to balance the datasets. Finally, ASRA is built with the AdaBoost ensemble algorithm using the J48 decision tree as its base classifier. The experimental data come from UCI datasets. Comparing the precision (P), F-measure, and AUC values obtained in the experiments shows that defect prediction classification with this model improves on the compared baselines.
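
To make the three-stage pipeline concrete, the following Python sketch wires the same stages together with scikit-learn and imbalanced-learn. It is an approximation under stated assumptions rather than the paper's Weka-based implementation: a CART decision tree stands in for J48, and the number of selected attributes, the sampling ratios, and the AdaBoost settings are illustrative placeholders, not values from the paper.

```python
# A minimal sketch of an ASRA-style pipeline, assuming scikit-learn >= 1.2
# (for the AdaBoost `estimator` parameter) and imbalanced-learn.
# DecisionTreeClassifier (CART) stands in for Weka's J48/C4.5; k_features,
# sampling ratios, and AdaBoost settings are illustrative assumptions.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.tree import DecisionTreeClassifier

def build_asra_pipeline(k_features=10):
    """Chi-square attribute selection -> combined SMOTE over-sampling and
    random under-sampling -> AdaBoost over a decision-tree base learner."""
    return Pipeline([
        # chi2 requires non-negative features, which holds for typical
        # software metrics (LOC, complexity counts, operator counts, ...).
        ("select", SelectKBest(chi2, k=k_features)),
        # Partially over-sample the defective (minority) class ...
        ("smote", SMOTE(sampling_strategy=0.5, random_state=42)),
        # ... then trim the non-defective (majority) class, so neither
        # technique has to do all of the rebalancing on its own.
        ("under", RandomUnderSampler(sampling_strategy=0.8, random_state=42)),
        # Boosted shallow trees, with CART approximating J48.
        ("boost", AdaBoostClassifier(
            estimator=DecisionTreeClassifier(max_depth=3),
            n_estimators=100,
            random_state=42,
        )),
    ])

# Illustrative usage, given a metrics matrix X and 0/1 defect labels y:
# from sklearn.model_selection import train_test_split
# from sklearn.metrics import f1_score, precision_score, roc_auc_score
# X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)
# model = build_asra_pipeline().fit(X_tr, y_tr)
# y_pred = model.predict(X_te)
# print("P  :", precision_score(y_te, y_pred))
# print("F  :", f1_score(y_te, y_pred))
# print("AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
```

A useful property of this layout is that an imbalanced-learn Pipeline applies the two samplers only during fitting, so the held-out test set keeps its original class distribution, which is what makes the reported P, F-measure, and AUC values meaningful.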




Acknowledgements

The authors would like to thank the editors and all of the reviewers for their help throughout the development of this work. Special thanks are due to the authors and experts of the UCI Machine Learning Repository, who provided the numerous datasets used in the experimental chapters of this paper. This work was supported by the Beijing municipal key discipline construction project "Computer Application Technology", by Capital Normal University during the completion of this work, and by the National Natural Science Foundation (Grant 61601310).

Corresponding author

Correspondence to Ran Li.


Cite this article

Zhou, L., Li, R., Zhang, S. et al. Imbalanced Data Processing Model for Software Defect Prediction. Wireless Pers Commun 102, 937–950 (2018). https://doi.org/10.1007/s11277-017-5117-z

