ABSTRACT
Making software defect-free requires a considerable budget for the testing phase. This cost keeps rising as software grows in size and complexity, which is a problem for companies that cannot allocate sufficient resources to testing. To tackle this, many researchers use machine learning methods to build software fault prediction models that detect defect-prone modules, so that testing resources can be allocated more efficiently. Although this approach is feasible, the effectiveness of these models depends on several factors, one of which is class imbalance in the data. Many known techniques from class imbalance research can potentially improve the performance of prediction models by processing the dataset before it is provided as input; however, not all of these methods are compatible with one another. In the proposed approach, the dataset undergoes preprocessing, under-sampling, and feature selection before a prediction model is built. The under-sampling step employs the Instance Hardness Threshold (IHT), which reduces the number of instances in the majority class. The approach is evaluated with eight machine learning algorithms on eight moderately and highly imbalanced NASA datasets. On some datasets, the results show improvements of 33% in AUC and 26% in F1-score compared with other research work.
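The idea behind hardness-based under-sampling can be illustrated with a minimal sketch. It is not the authors' implementation: IHT is usually described in terms of cross-validated class probabilities from a classifier, whereas this sketch swaps in a simpler hardness estimate, the fraction of an instance's k nearest neighbors that carry a different label. The function names, the threshold of 0.5, the value k = 3, and the toy data are all assumptions made for illustration.

```python
import math


def k_disagreeing_neighbors(X, y, k=3):
    """Hardness of each instance: fraction of its k nearest neighbors
    whose label differs from the instance's own label."""
    hardness = []
    for i, xi in enumerate(X):
        # Distances from instance i to every other instance, nearest first.
        dists = sorted((math.dist(xi, xj), j) for j, xj in enumerate(X) if j != i)
        neighbors = [j for _, j in dists[:k]]
        disagree = sum(1 for j in neighbors if y[j] != y[i])
        hardness.append(disagree / len(neighbors))
    return hardness


def hardness_undersample(X, y, majority_label, threshold=0.5, k=3):
    """Drop majority-class instances whose hardness exceeds the threshold.
    Minority-class instances are always kept."""
    hardness = k_disagreeing_neighbors(X, y, k)
    keep = [
        i for i in range(len(X))
        if y[i] != majority_label or hardness[i] <= threshold
    ]
    return [X[i] for i in keep], [y[i] for i in keep]


# Hypothetical toy data: label 0 = non-defective (majority), 1 = defective (minority).
majority_clean = [(a, b) for a in range(3) for b in range(3)]      # 9 easy instances
majority_overlap = [(10.5, 10.5), (10.5, 11.5), (11.5, 10.5)]      # 3 hard instances
minority = [(10, 10), (10, 11), (11, 10), (11, 11), (10, 12), (12, 10)]

X = majority_clean + majority_overlap + minority
y = [0] * (len(majority_clean) + len(majority_overlap)) + [1] * len(minority)

X_res, y_res = hardness_undersample(X, y, majority_label=0)
print("majority before:", y.count(0), "after:", y_res.count(0))
print("minority before:", y.count(1), "after:", y_res.count(1))
```

In this toy set, the three majority instances that sit inside the minority cluster are the ones removed, which shrinks the majority class while leaving every minority instance in place. For real experiments, a library implementation such as the `InstanceHardnessThreshold` class in imbalanced-learn would be the more usual choice; the source does not state which implementation was used.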