Abstract
Highly imbalanced data typically makes accurate prediction difficult. Unfortunately, software defect datasets tend to contain far fewer defective modules than non-defective ones. Synthetic oversampling approaches such as SMOTE address this concern by creating new minority (defective) instances to balance the class distribution before a model is trained. Despite their success, these approaches suffer from two shortcomings: 1) over-generalization, producing near-duplicate (less diverse) instances because noisy samples are oversampled, and 2) increased overlap between classes around the class boundaries. This paper introduces INF-SMOTE (Informative Synthetic Minority Oversampling Technique), a novel and efficient synthetic oversampling approach for software defect datasets that targets both shortcomings simultaneously. INF-SMOTE identifies the informative minority samples that are appropriate for oversampling. The process has two steps: 1) noisy and overlapping samples are identified and removed from the borderline minority instances based on the sampling seeds, and 2) synthetic samples are generated from the remaining informative minority samples. Experiments were conducted on 12 releases of software defect prediction (SDP) datasets from the NASA repository. Compared with state-of-the-art techniques, INF-SMOTE improves defect prediction performance.
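The two-step process described above (filter uninformative minority samples, then interpolate among the survivors) can be sketched as follows. This is a minimal illustration, not the authors' exact algorithm: the noise criterion (all k nearest neighbours belong to the majority class), the function name `inf_smote`, and all thresholds are assumptions for the sake of the example; the interpolation itself is standard SMOTE.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def inf_smote(X, y, minority_label=1, k=5, n_new=100, rng=None):
    """Sketch of an INF-SMOTE-style oversampler (illustrative, not the paper's exact method)."""
    rng = np.random.default_rng(rng)
    X_min = X[y == minority_label]

    # Step 1: inspect each minority sample's k nearest neighbours in the full set
    # (the +1 accounts for the sample itself being its own nearest neighbour).
    nn_all = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn_all.kneighbors(X_min)
    maj_frac = (y[idx[:, 1:]] != minority_label).mean(axis=1)
    # Treat samples whose neighbours are ALL majority-class as noise/overlap
    # and drop them; the rest are kept as "informative" sampling seeds.
    informative = X_min[maj_frac < 1.0]

    # Step 2: generate synthetics by SMOTE-style interpolation between each
    # informative seed and one of its minority-class neighbours.
    nn_min = NearestNeighbors(n_neighbors=min(k + 1, len(informative))).fit(informative)
    _, min_idx = nn_min.kneighbors(informative)
    synth = []
    for _ in range(n_new):
        i = rng.integers(len(informative))
        j = min_idx[i, rng.integers(1, min_idx.shape[1])]  # skip self at column 0
        gap = rng.random()
        synth.append(informative[i] + gap * (informative[j] - informative[i]))
    return np.vstack([X, synth]), np.concatenate([y, np.full(n_new, minority_label)])
```

Because interpolation only ever runs between retained seeds, no synthetic point is anchored on a sample that the noise filter discarded, which is what distinguishes this family of methods from plain SMOTE.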
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Rekha, G., Shailaja, K., Jatoth, C. (2022). Informative Software Defect Data Generation and Prediction: INF-SMOTE. In: Singh, M., Tyagi, V., Gupta, P.K., Flusser, J., Ören, T. (eds) Advances in Computing and Data Sciences. ICACDS 2022. Communications in Computer and Information Science, vol 1613. Springer, Cham. https://doi.org/10.1007/978-3-031-12638-3_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-12637-6
Online ISBN: 978-3-031-12638-3
eBook Packages: Computer Science; Computer Science (R0)