Abstract
Software defect prediction is an active research topic in software engineering because it helps to assure quality during software development. However, class-imbalanced datasets degrade the overall classification accuracy of defect prediction, and this remains a key problem to be solved. To address it, this paper proposes a model named ASRA that combines attribute selection, sampling techniques, and an ensemble algorithm. The model first applies attribute selection based on the Chi-square test to remove redundant attributes, and then uses a combined sampling technique, SMOTE over-sampling together with under-sampling, to balance the datasets. ASRA is then built with the AdaBoost ensemble algorithm, using the J48 decision tree as the base classifier. The data used in the experiments come from the UCI datasets. A comparison of the precision (P), F-measure, and AUC values obtained in the experiments shows that the model improves the classification performance of software defect prediction.
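The combined sampling step can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the target class ratio, the number of neighbours, or the toolkit used, so the function names `smote` and `combined_sampling`, the "meet in the middle" target size, `k=3`, and the toy data below are all assumptions made for illustration. Attribute selection and the AdaBoost/J48 stage are left out.

```python
import random
from collections import Counter
from math import dist


def smote(minority, n_new, k, rng):
    """Return n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority-class neighbours."""
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = minority[:i] + minority[i + 1:]
        neighbours = sorted(others, key=lambda p: dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # random position on the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


def combined_sampling(X, y, minority_label=1, k=3, seed=0):
    """Balance a binary dataset: SMOTE grows the minority class and random
    under-sampling shrinks the majority class to the same target size."""
    rng = random.Random(seed)
    minority = [x for x, label in zip(X, y) if label == minority_label]
    majority = [(x, label) for x, label in zip(X, y) if label != minority_label]
    # Assumption: both classes are brought to the midpoint of their sizes.
    target = (len(minority) + len(majority)) // 2
    minority = minority + smote(minority, target - len(minority), k, rng)
    majority = rng.sample(majority, target)
    X_bal = minority + [x for x, _ in majority]
    y_bal = [minority_label] * len(minority) + [label for _, label in majority]
    return X_bal, y_bal


# Toy data (made up): 20 non-defective modules (label 0) and 4 defective
# modules (label 1), each described by two hypothetical metric values.
data_rng = random.Random(42)
X = [(data_rng.uniform(0, 5), data_rng.uniform(0, 5)) for _ in range(20)]
X += [(data_rng.uniform(6, 9), data_rng.uniform(6, 9)) for _ in range(4)]
y = [0] * 20 + [1] * 4

X_bal, y_bal = combined_sampling(X, y)
print(Counter(y), "->", Counter(y_bal))
```

Because each synthetic point lies on a segment between two real minority samples, over-sampling adds variety rather than exact duplicates, while under-sampling keeps the balanced set from growing much larger than the original.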
Acknowledgements
The authors would like to thank the editors and all of the reviewers for their help in the development of this work. Special thanks go to the authors and experts of the UCI Machine Learning Repository, who provided the numerous datasets used in the experiments in this paper. This work was supported by the Beijing municipal key discipline construction project “Computer Application Technology”. The authors also acknowledge the support of Capital Normal University while this work was completed and of the National Natural Science Foundation (Grant: 61601310).
Cite this article
Zhou, L., Li, R., Zhang, S. et al. Imbalanced Data Processing Model for Software Defect Prediction. Wireless Pers Commun 102, 937–950 (2018). https://doi.org/10.1007/s11277-017-5117-z
DOI: https://doi.org/10.1007/s11277-017-5117-z