DOI: 10.1145/3524304.3524310
research-article

Improving software fault prediction in imbalanced datasets using the under-sampling approach

Published: 6 June 2022

ABSTRACT

Making most software defect-free requires allocating a considerable budget to the software testing phase. This budget keeps rising as software grows in size and complexity, which is a problem for companies that cannot allocate sufficient resources to testing. To tackle this, many researchers use machine learning methods to build software fault prediction models that help detect defect-prone modules, so that testing resources can be allocated more efficiently. Although this is a feasible plan, the effectiveness of these models depends on several factors, including data imbalance. Many known techniques from class imbalance research can potentially improve the performance of prediction models by processing the dataset before it is provided as input. However, not all methods are compatible with one another. In this study, before a prediction model is built, the dataset undergoes preprocessing, under-sampling, and feature selection. The under-sampling step employs the Instance Hardness Threshold (IHT), which reduces the number of instances in the majority class. The proposed approach is evaluated with eight machine learning algorithms on eight moderately and highly imbalanced NASA datasets. On some datasets, the results show improvements in AUC and F1-score of 33% and 26%, respectively, compared with other research work.
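The idea behind instance-hardness under-sampling can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say which estimator, threshold, or library was used. The sketch assumes a k-nearest-neighbour vote as a stand-in hardness estimate, a fixed cut-off of 0.5, and a tiny made-up dataset; the function names `knn_hardness` and `undersample_majority` are hypothetical. Libraries such as imbalanced-learn provide an `InstanceHardnessThreshold` sampler that estimates hardness from a classifier's cross-validated probabilities instead.

```python
import math

def knn_hardness(X, y, k=3):
    """Hardness of each instance: 1 - fraction of its k nearest
    neighbours (excluding itself) that share its label.
    Assumption: k-NN agreement stands in for a classifier's
    estimated probability of the true class."""
    hardness = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        # Distances from instance i to every other instance.
        dists = sorted((math.dist(xi, xj), j) for j, xj in enumerate(X) if j != i)
        neighbours = [j for _, j in dists[:k]]
        agree = sum(1 for j in neighbours if y[j] == yi)
        hardness.append(1.0 - agree / len(neighbours))
    return hardness

def undersample_majority(X, y, majority_label, threshold=0.5, k=3):
    """Drop majority-class instances whose hardness exceeds the threshold.
    Minority-class instances are always kept."""
    hardness = knn_hardness(X, y, k)
    keep = [i for i in range(len(X))
            if y[i] != majority_label or hardness[i] <= threshold]
    return [X[i] for i in keep], [y[i] for i in keep]

# Made-up data: label 0 = non-faulty (majority), label 1 = faulty (minority).
majority = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (-0.2, 0.1),
            (0.3, -0.2), (-0.1, -0.3), (0.4, 0.4), (-0.3, -0.2),
            (5.0, 5.1)]  # last point sits inside the minority cluster
minority = [(5.0, 5.0), (5.3, 5.3), (4.7, 4.7), (5.3, 4.7)]
X = majority + minority
y = [0] * len(majority) + [1] * len(minority)

X_res, y_res = undersample_majority(X, y, majority_label=0)
print(len(X), "->", len(X_res), "instances after under-sampling")
```

In this toy data, the one majority instance surrounded by minority instances is the hard case and is removed, while the well-separated majority cluster and every minority instance are retained. A fixed threshold is a simplification; a real pipeline would more likely tune it or choose it so the classes reach a target ratio.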


  • Published in

    ICSCA '22: Proceedings of the 2022 11th International Conference on Software and Computer Applications
    February 2022
    224 pages
    ISBN:9781450385770
    DOI:10.1145/3524304

    Copyright © 2022 ACM


    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • research-article
    • Research
    • Refereed limited
