research-article

Improving software fault prediction in imbalanced datasets using the under-sampling approach

Authors:
Golnoush Abaei, School of Information Technology, Monash University Malaysia, Malaysia
Wen Zhong Tah, School of Information Technology, Monash University Malaysia, Malaysia
Jason Zhern Wee Toh, School of Information Technology, Monash University Malaysia, Malaysia
Ethan Sheng Jian Hor, School of Information Technology, Monash University Malaysia, Malaysia

ICSCA '22: Proceedings of the 2022 11th International Conference on Software and Computer Applications, February 2022, Pages 41–47
https://doi.org/10.1145/3524304.3524310

Published: 06 June 2022


ABSTRACT

Making most software defect-free requires a considerable budget for the software testing phase. This budget keeps rising as software grows in size and complexity, which is a problem for companies that cannot allocate sufficient resources to testing. To tackle this, many researchers use machine learning methods to build software fault prediction models that help detect defect-prone modules, so that resources can be allocated more efficiently during testing. Although this approach is feasible, the effectiveness of these models depends on several factors, one of which is data imbalance. Class imbalance research offers many techniques that can potentially improve the performance of prediction models by processing the dataset before it is provided as input. However, not all of these methods are compatible with one another. In this study, before a prediction model is built, the dataset undergoes preprocessing, under-sampling, and feature selection. The under-sampling step employs the Instance Hardness Threshold (IHT), which reduces the number of instances in the majority class. The proposed approach is evaluated with eight machine learning algorithms on eight moderately and highly imbalanced NASA datasets. On some datasets, the results show improvements in AUC and F1-score of 33% and 26%, respectively, compared to other research work.
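The idea behind hardness-based under-sampling can be illustrated with a minimal sketch. The abstract does not state which hardness estimator or threshold the authors used, so everything below is an assumption for illustration only: the function names `k_disagreeing_neighbors` and `iht_undersample`, the parameters `k` and `threshold`, and the toy data are all hypothetical. Library implementations such as imbalanced-learn's `InstanceHardnessThreshold` estimate hardness from cross-validated classifier probabilities; this sketch swaps in a simpler k-Disagreeing Neighbours estimate so that it runs with the Python standard library alone.

```python
import math
from collections import Counter


def k_disagreeing_neighbors(X, y, k=3):
    """Hardness estimate (an assumption of this sketch): the fraction of an
    instance's k nearest neighbours that carry a different class label."""
    hardness = []
    for i, xi in enumerate(X):
        # Distances from instance i to every other instance, nearest first.
        dists = sorted((math.dist(xi, xj), j) for j, xj in enumerate(X) if j != i)
        neighbours = [j for _, j in dists[:k]]
        disagree = sum(1 for j in neighbours if y[j] != y[i])
        hardness.append(disagree / len(neighbours))
    return hardness


def iht_undersample(X, y, majority_label, threshold=0.5, k=3):
    """Drop majority-class instances whose hardness exceeds `threshold`.
    Minority-class instances are always kept."""
    hardness = k_disagreeing_neighbors(X, y, k)
    keep = [
        i for i in range(len(X))
        if y[i] != majority_label or hardness[i] <= threshold
    ]
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data (hypothetical): label 0 = non-faulty (majority), 1 = faulty (minority).
X = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5), (0.0, 0.5),
    (5.0, 5.1), (5.1, 5.0),                          # majority points inside the minority cluster
    (5.0, 5.0), (5.2, 5.2), (4.9, 5.1), (5.1, 4.9),  # minority cluster
]
y = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]

X_res, y_res = iht_undersample(X, y, majority_label=0)
print("before:", Counter(y), "after:", Counter(y_res))
```

In this toy set, the two majority-class points that sit inside the minority cluster have two of their three nearest neighbours in the other class, so their hardness of 2/3 exceeds the 0.5 threshold and they are removed. The six majority points near the origin and all four minority points are kept.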



Published in

ICSCA '22: Proceedings of the 2022 11th International Conference on Software and Computer Applications
February 2022
224 pages
ISBN:9781450385770
DOI:10.1145/3524304

Copyright © 2022 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Imbalanced Dataset
Software Fault Prediction
Testing
Under-sampling
Qualifiers
- research-article
- Research
- Refereed limited