Abstract
Highly imbalanced data typically makes accurate prediction difficult. Unfortunately, software defect datasets tend to contain far fewer defective modules than non-defective ones. Synthetic oversampling approaches such as SMOTE address this concern by creating new minority (defective) instances to balance the class distribution before a model is trained. Despite their success, these approaches suffer from two shortcomings: 1) over-generalization, producing near-duplicate (less diverse) instances because noisy samples are oversampled, and 2) increased overlap between classes around the class boundaries. This paper introduces INF-SMOTE (Informative Synthetic Minority Oversampling Technique), a novel and efficient synthetic oversampling approach for software defect datasets that targets both shortcomings simultaneously. INF-SMOTE identifies the informative minority samples that are appropriate for oversampling. The process has two steps: 1) noisy and overlapping samples are identified and removed from the borderline minority instances based on the sampling seeds, and 2) synthetic samples are generated from the remaining informative minority samples. Experiments were conducted on 12 releases of software defect prediction (SDP) datasets from the NASA repository. Compared with state-of-the-art techniques, INF-SMOTE improves defect prediction performance.
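The two-step process described above (filter uninformative minority samples, then interpolate among the survivors) can be sketched as follows. This is a minimal illustration, not the authors' exact algorithm: the noise criterion (all k nearest neighbours belong to the majority class), the function name `inf_smote`, and all thresholds are assumptions for the sake of the example; the interpolation itself is standard SMOTE.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def inf_smote(X, y, minority_label=1, k=5, n_new=100, rng=None):
    """Sketch of an INF-SMOTE-style oversampler (illustrative, not the paper's exact method)."""
    rng = np.random.default_rng(rng)
    X_min = X[y == minority_label]

    # Step 1: inspect each minority sample's k nearest neighbours in the full set
    # (the +1 accounts for the sample itself being its own nearest neighbour).
    nn_all = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn_all.kneighbors(X_min)
    maj_frac = (y[idx[:, 1:]] != minority_label).mean(axis=1)
    # Treat samples whose neighbours are ALL majority-class as noise/overlap
    # and drop them; the rest are kept as "informative" sampling seeds.
    informative = X_min[maj_frac < 1.0]

    # Step 2: generate synthetics by SMOTE-style interpolation between each
    # informative seed and one of its minority-class neighbours.
    nn_min = NearestNeighbors(n_neighbors=min(k + 1, len(informative))).fit(informative)
    _, min_idx = nn_min.kneighbors(informative)
    synth = []
    for _ in range(n_new):
        i = rng.integers(len(informative))
        j = min_idx[i, rng.integers(1, min_idx.shape[1])]  # skip self at column 0
        gap = rng.random()
        synth.append(informative[i] + gap * (informative[j] - informative[i]))
    return np.vstack([X, synth]), np.concatenate([y, np.full(n_new, minority_label)])
```

Because interpolation only ever runs between retained seeds, no synthetic point is anchored on a sample that the noise filter discarded, which is what distinguishes this family of methods from plain SMOTE.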
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Rekha, G., Shailaja, K., Jatoth, C. (2022). Informative Software Defect Data Generation and Prediction: INF-SMOTE. In: Singh, M., Tyagi, V., Gupta, P.K., Flusser, J., Ören, T. (eds) Advances in Computing and Data Sciences. ICACDS 2022. Communications in Computer and Information Science, vol 1613. Springer, Cham. https://doi.org/10.1007/978-3-031-12638-3_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-12637-6
Online ISBN: 978-3-031-12638-3
eBook Packages: Computer Science; Computer Science (R0)