Abstract
Accurately predicting defects in software modules helps developers and testers find defective modules quickly and frees effort for other software development activities. Most previous studies have used models based on a single machine learning technique to detect software defects. These models have produced limited results because they perform well on only some parts of the data and fail to capture all defect-causing patterns. The mixture of experts (MoE) is a combination method that employs experts specialized in different subspaces of the data. The outputs of the specialized experts are combined according to their expertise, with a gating network governing the final prediction. This paper explores the MoE method and presents implicit and explicit MoE-based models for software defect prediction. The presented models are evaluated in an experimental study on twenty-two software defect datasets collected from the AEEEM, PROMISE, and JIRA repositories. Prediction performance is measured using accuracy, F1-score, area under the ROC curve (AUC), and the Matthews correlation coefficient (MCC). The experimental results show that the presented MoE-based models outperformed various machine learning and ensemble techniques, such as Bagging and AdaBoost, and achieved state-of-the-art defect prediction performance. Additionally, the MoE models performed better than or on par with a DNN-based model in most cases. The results are consistent across all datasets, and the Wilcoxon test confirms that the presented models performed significantly better than the other techniques.
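The four metrics named above can be made concrete with small from-scratch implementations. The sketch below uses tiny synthetic labels and scores (not the paper's data) and is meant only to illustrate the definitions; in practice, library implementations such as scikit-learn's would be used.

```python
# Illustrative implementations of the four metrics used in the paper
# (accuracy, F1-score, AUC, and Matthews correlation coefficient),
# computed from scratch on a small synthetic example.
import math

def confusion(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(y_true, y_pred):
    tp, tn, fp, fn = confusion(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)

def f1_score(y_true, y_pred):
    tp, _, fp, fn = confusion(y_true, y_pred)
    return 2 * tp / (2 * tp + fp + fn)

def mcc(y_true, y_pred):
    # Matthews correlation coefficient: +1 perfect, 0 random, -1 inverse.
    tp, tn, fp, fn = confusion(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def auc(y_true, scores):
    # Probability that a random defective module is ranked above a
    # random non-defective one (ties count as 0.5).
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 0, 1, 1, 0, 0, 1, 0]            # 1 = defective module
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]            # hard class predictions
scores = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]  # predicted probabilities
print(accuracy(y_true, y_pred), f1_score(y_true, y_pred),
      mcc(y_true, y_pred), auc(y_true, scores))
```

AUC is computed from the ranking scores rather than the hard predictions, which is why it can diverge from accuracy and F1 on imbalanced defect data.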
Data availability
The datasets used during and/or analyzed during the current study are available in the online repository, https://zenodo.org/record/3362613#.YrhPAnZByM8.
References
Alsawalqah, J., Faris, H., Aljarah, I., Alnemer, L., & Alhindawi, N. (2017). In Computer Science On-line Conference (Springer, 2017) pp. 355–366.
Arora, I., Tetarwal, V., & Saha, A. (2015). Open issues in software defect prediction. Procedia Computer Science, 46, 906–912.
Assim, M., Obeidat, Q., & Hammad, M. (2020). In 2020 International Conference on Data Analytics for Business and Industry: Way Towards a Sustainable Economy (ICDABI) (2020), pp. 1–6.
Bock, A. S., & Fine, I. (2014). Anatomical and functional plasticity in early blind individuals and the mixture of experts architecture. Frontiers in human neuroscience, 8, 971.
Bowes, D., Hall, T., & Petrić, J. (2018). Software defect prediction: do different classifiers find the same defects? Software Quality Journal, 26(2), 525–552.
Canfora, G., De Lucia, A., Di Penta, M., Oliveto, R., Panichella, A., & Panichella, S. (2013). In 2013 IEEE Sixth International Conference on Software Testing, Verification and Validation (IEEE, 2013), pp. 252–261.
Cervantes, J., Garcia-Lamont, F., Rodríguez-Mazahua, L., & Lopez, A. (2020). A comprehensive survey on support vector machine classification: Applications, challenges and trends. Neurocomputing, 408, 189–215.
Chandrashekar, G., & Sahin, F. (2014). A survey on feature selection methods. Computers & Electrical Engineering, 40(1), 16–28.
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). Smote: synthetic minority over-sampling technique. Journal of artificial intelligence research, 16, 321–357.
D’Ambros, M., Lanza, M., & Robbes, R. (2010). In 2010 7th IEEE working conference on mining software repositories (MSR 2010) (IEEE, 2010), pp. 31–41.
D’Ambros, M., Lanza, M., & Robbes, R. (2012). Evaluating defect prediction approaches: a benchmark and an extensive comparison. Empirical Software Engineering, 17, 531–577.
Deep Singh, P., & Chug, A. (2017). In 2017 7th International Conference on Cloud Computing, Data Science & Engineering - Confluence pp. 775–781.
Di Nucci, D., Palomba, F., & De Lucia, A. (2018). In 2018 IEEE/ACM 6th International Workshop on Realizing Artificial Intelligence Synergies in Software Engineering (RAISE) (IEEE, 2018), pp. 48–54.
Dreiseitl, S., & Ohno-Machado, L. (2002). Logistic regression and artificial neural network classification models: a methodology review. Journal of biomedical informatics, 35(5–6), 352–359.
Elmishali, A., & Kalech, M. (2023). Issues-driven features for software fault prediction. Information and Software Technology 155, 107102.
Feng, S., Keung, J., Yu, X., Xiao, Y., Bennin, K. E., Kabir, M. A., & Zhang, M. (2021). Coste: Complexity-based oversampling technique to alleviate the class imbalance problem in software defect prediction. Information and Software Technology 129, 106432.
Feng, S., Keung, J., Yu, X., Xiao, Y., & Zhang, M. (2021). Investigation on the stability of SMOTE-based oversampling techniques in software defect prediction. Information and Software Technology, 139, 106662.
Ferrari, D., & Milioni, A. (2011). Choices and pitfalls concerning mixture-of-experts modeling. Pesquisa Operacional, 31, 95–111.
Ghosh S., Rana A., & Kansal V. (2018). A nonlinear manifold detection based model for software defect prediction. International Conference on Computational Intelligence and Data Science, Procedia Computer Science 132, 581–594.
Gormley, I. C., & Frühwirth-Schnatter, S. (2019). In Handbook of mixture analysis (Chapman and Hall/CRC, 2019) pp. 271–307.
Jović, A., Brkić, K., & Bogunović, N. (2015). In 2015 38th International convention on information and communication technology, electronics and microelectronics (MIPRO) (IEEE, 2015), pp. 1200–1205.
Jureczko, M., & Madeyski, L. (2010). In Proceedings of the 6th international conference on predictive models in software engineering pp. 1–10.
Kondratyuk, D., Tan, M., Brown, M., & Gong, B. (2020). When ensembling smaller models is more efficient than single large models. arXiv preprint arXiv:2005.00570
Komaroff, E. (2020). Relationships between p-values and pearson correlation coefficients, type 1 errors and effect size errors, under a true null hypothesis. Journal of Statistical Theory and Practice, 14(3), 1–13.
Kwak, S. K., & Kim, J. H. (2017). Statistical data preparation: Management of missing values and outliers. Korean Journal of Anesthesiology, 70(4), 407–411.
Li, L., Lessmann, S., & Baesens, B. (2019). Evaluating software defect prediction performance: an updated benchmarking study. arXiv preprint. http://arxiv.org/abs/1901.01726
Li, N., Shepperd, M., & Guo, Y. (2020). A systematic review of unsupervised learning techniques for software defect prediction. Information and Software Technology 122, 106287.
Liaw, A., Wiener, M., et al., (2002). Classification and regression by randomForest. R News, 2(3), 18–22.
Majd, A., Vahidi-Asl, M., Khalilian, A., Poorsarvi-Tehrani, P., & Haghighi, H. (2020). Sldeep: Statement-level software defect prediction using deep-learning model on static code features. Expert Systems with Applications 147, 113156.
Masoudnia, S., & Ebrahimpour, R. (2014). Mixture of experts: a literature survey. Artificial Intelligence Review, 42(2), 275–293.
Moustafa, S., ElNainay, M. Y., El Makky, N., & Abougabal, M. S. (2018). Software bug prediction using weighted majority voting techniques. Alexandria engineering journal, 57(4), 2763–2774.
Nam, J. (2014). Survey on software defect prediction. Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Tech. Rep.
Natekin, A., & Knoll, A. (2013). Gradient boosting machines, a tutorial. Frontiers in neurorobotics, 7, 21.
Niu, J., Li, Z., Chen, H., Dong, X., & Jing, X. Y. (2022) Data sampling and kernel manifold discriminant alignment for mixed-project heterogeneous defect prediction. Software Quality Journal pp. 1–35.
O’Doherty, J. P., Lee, S. W., Tadayonnejad, R., Cockburn, J., Iigaya, K., & Charpentier, C. J. (2021). Why and how the brain weights contributions from a mixture of experts. Neuroscience & Biobehavioral Reviews, 123, 14–23.
Pandey, S. K., Mishra, R. B., & Tripathi, A. K. (2020) BPDET: An effective software bug prediction model using deep representation and ensemble learning techniques. Expert Systems with Applications 144, 113085.
Parsons, V. L. (2014). Stratified sampling. Wiley Stats Ref: Statistics Reference Online pp. 1–11.
Pelleg, D., Moore, A. W. et al. (2000). In ICML, vol. 1 pp. 727–734.
Priyanka, & Kumar, D. (2020). Decision tree classifier: A detailed survey. International Journal of Information and Decision Sciences, 12(3), 246–269.
Qiao, L., Li, X., Umer, Q., & Guo, P. (2020). Deep learning based software defect prediction. Neurocomputing, 385, 100–110.
Rathore, S. S., & Kumar, S. (2021). An empirical study of ensemble techniques for software fault prediction. Applied Intelligence, 51(6), 3615–3644.
Radwan, A., Kamarudin, N., Solihin, M. I., Leong, H., Rizon, M., Hazry, D., & Bin Azizan, M. A. (2020). X-means clustering for wireless sensor networks. Journal of Robotics Networking and Artificial Life 7(2), 111–115.
Rey, D., & Neuhäuser, M. (2011) In International encyclopedia of statistical science (Springer, 2011), pp. 1658–1659.
Rice, M. E., & Harris, G. T. (2005). Comparing effect sizes in follow-up studies: Roc area, cohen’s d, and r. Law and human behavior, 29(5), 615–620.
Sagi, O., & Rokach, L. (2018). Ensemble learning: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(4), e1249.
Shao, Y., Liu, B., Wang, S., & Li, G. (2020). Software defect prediction based on correlation weighted class association rule mining. Knowledge-Based Systems 196, 105742.
Singh, P. K., Panda, R., & Sangwan, O. P. (2015). A critical analysis on software fault prediction techniques. World applied sciences journal, 33(3), 371–379.
Sotto-Mayor, B., Elmishali, A., Kalech, M., & Abreu, R. (2022). Exploring design smells for smell-based defect prediction. Engineering Applications of Artificial Intelligence 115, 105240.
Sotto-Mayor, B., & Kalech, M. (2021). Cross-project smell-based defect prediction. Soft Computing, 25(22), 14171–14181.
Tanaka, K., Monden, A., & Yücel, Z. (2019, July). Prediction of software defects using automated machine learning. In 2019 20th IEEE/ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD) (pp. 490-494). IEEE.
Tantithamthavorn, C. K. (2022). Large defect prediction benchmark. Zenodo. Retrieved from https://zenodo.org/record/6342328
Thota, M. K., Shajin, F. H., Rajesh, P., et al., (2020). Survey on software defect prediction techniques. International Journal of Applied Science and Engineering, 17(4), 331–344.
Wahono, R. S. (2015). A systematic literature review of software defect prediction. Journal of Software Engineering, 1(1), 1–16.
Wang, H., Zhuang, W., & Zhang, X. (2021). Software defect prediction based on gated hierarchical lstms. IEEE Transactions on Reliability, 70(2), 711–727.
Waterhouse S. R. (1998). Classification and regression using mixtures of experts. Ph.D. thesis, CiteSeer.
Woolson, R. F. (2007). Wilcoxon signed-rank test. Wiley encyclopedia of clinical trials pp. 1–3.
Wu, J., Wu, Y., Niu, N., & Zhou, M. (2021). Mhcpdp: multi-source heterogeneous cross-project defect prediction via multi-source transfer learning and autoencoder. Software Quality Journal, 29(2), 405–430.
Xu, Z., Liu, J., Luo, X., Yang, Z., Zhang, Y., Yuan, P., Tang, Y., & Zhang, T. (2019). Software defect prediction based on kernel pca and weighted extreme learning machine. Information and Software Technology, 106, 182–200.
Yang, X., Lo, D., Xia, X., & Sun, J. (2017). Tlel: A two-layer ensemble learning approach for just-in-time defect prediction. Information and Software Technology, 87, 206–220.
Yatish, S., Jiarpakdee, J., Thongtanunam, P., & Tantithamthavorn, C. (2019). In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE) (IEEE, 2019), pp. 654–665.
Yigit, H. (2013). In 2013 international conference on electronics, computer and computation (ICECCO) (IEEE, 2013), pp. 228–231.
Yuksel, S. E., Wilson, J. N., & Gader, P. D. (2012). Twenty years of mixture of experts. IEEE transactions on neural networks and learning systems, 23(8), 1177–1193.
Zeng, Z., Zhang, Y., Zhang, H., & Zhang, L. (2021). In Proceedings of the 30th ACM SIGSOFT. International Symposium on Software Testing and Analysis pp. 427–438.
Zimmermann, T., Premraj, R., & Zeller, A. (2007). In Third International Workshop on Predictor Models in Software Engineering (PROMISE’07: ICSE Workshops 2007) (IEEE, 2007), pp. 9–9.
Author information
Contributions
The authors confirm their contribution to the paper as follows: Aditya Shankar Mishra: study conception and design, data collection, experimental analysis. Santosh Singh Rathore: concept design, analysis and interpretation of results, draft manuscript. Both authors reviewed the results and approved the final version of the manuscript.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A. Comparison with other state-of-the-art works
A comparative analysis of the presented MoE-based prediction models (MIoE and MEoE) was performed against previously published SDP works. We did not reproduce their results but took them directly from the cited studies. The works listed in Table 13 use the same datasets, the same performance measures, and an experimental environment similar to that of our presented models. The table shows that, except for the work of Pandey et al. (2020), the MIoE and MEoE models produced better performance on the different performance measures than all other works. The presented work used twenty-two defect datasets from different domains. On some of these datasets, the presented MoE models performed relatively poorly, which lowered their average performance, whereas previous works considered fewer fault datasets when evaluating their methods. Even so, both the average and the highest values of the presented models exceeded those of their counterparts. Most of the works reported in Table 13 used static ensemble methods, in which the weights of the base learners are fixed at training time. Moreover, only a few works analyzed the competitiveness of the base learners for the ensemble, and their experimental analyses were limited to a few datasets.
Appendix B. Parameter details
In this study, we used the Python implementations of the machine learning techniques and ensemble methods. The parameter values set for these techniques/methods are given in Table 14.
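As an illustration only, such a setup might keep the per-technique parameters in one place and instantiate classifiers from a registry. The technique names and values below are hypothetical placeholders in scikit-learn style, not the tuned settings from Table 14.

```python
# Hypothetical parameter registry; the names and values are illustrative
# placeholders, not the settings actually listed in Table 14.
PARAMS = {
    "RandomForest": {"n_estimators": 100, "max_depth": None},
    "AdaBoost": {"n_estimators": 50, "learning_rate": 1.0},
    "Bagging": {"n_estimators": 10},
    "SVM": {"kernel": "rbf", "C": 1.0},
    "LogisticRegression": {"C": 1.0, "max_iter": 100},
}

def build(name, registry, params=PARAMS):
    """Instantiate a classifier class from `registry` with its stored
    parameters, e.g. build('SVM', {'SVM': sklearn.svm.SVC})."""
    return registry[name](**params[name])
```

Centralizing the parameters this way keeps the experimental configuration reproducible and easy to report alongside the results.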
Appendix C. Description of the used software metrics
Table 15 provides descriptions of the software metrics available in the different software defect datasets used for experimentation in this work. Detailed descriptions of these metrics can be found in Jureczko and Madeyski (2010), D’Ambros et al. (2010), Zimmermann et al. (2007), and Yatish et al. (2019).
Appendix D. Analysis of MIoE and MEoE models
The presented MIoE and MEoE models (Algorithms 1 and 2) are built on standard algorithms and techniques; their steps are analyzed as follows. The initial steps of both methods involve data cleaning, data balancing, and dimensionality reduction, which take some non-constant time. In Algorithm 1 (MIoE model), the training data is first randomly partitioned into as many subspaces as there are experts; this step takes constant time and is applied only once. Each expert is then trained on its corresponding subspace. The training time depends on the learning technique, but it is always less than that of training a single large model. Finally, the best experts obtained are applied to the testing dataset, and a gating function combines their predictions, which takes linear time. The remaining computations take constant time and are applied only once.

Similarly, in Algorithm 2 (MEoE model), the X-means clustering algorithm partitions the training data into subspaces; it is applied only once and takes linear time. The experts are then trained on their corresponding subspaces, with the same training-time considerations as above, and the gating function combines their predictions on the testing dataset in linear time. On a machine with 16 GB of RAM and an Intel i7 processor, building the MIoE model and making predictions took only 160 s, and the MEoE model took 180 s. Kondratyuk et al. (2020) studied the performance of ensemble models and found that ensembling can often be more efficient than training larger models: because the member models can execute in parallel before their outputs are combined for the final prediction, the overall time compares favorably with that of a single large model.
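The MIoE flow described above (random partition, per-subspace experts, gated combination) can be sketched in dependency-free Python. The nearest-centroid expert and the unweighted majority-vote gate below are illustrative stand-ins for the learners and the learned gating network the paper actually uses.

```python
# Minimal sketch of the implicit-MoE (MIoE) flow: randomly partition
# the training data into k subspaces, train one expert per subspace,
# then combine the experts' outputs through a simple gating function.
# The "expert" is a toy nearest-centroid classifier used only to make
# the flow concrete.
import random

class CentroidExpert:
    def fit(self, X, y):
        # One centroid per class seen in this expert's subspace.
        self.centroids = {}
        for label in set(y):
            pts = [x for x, t in zip(X, y) if t == label]
            self.centroids[label] = [sum(c) / len(pts) for c in zip(*pts)]
        return self

    def predict_one(self, x):
        dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
        return min(self.centroids, key=lambda lb: dist(x, self.centroids[lb]))

def train_mioe(X, y, k=3, seed=0):
    # Step 1: random partition of the training set into k subspaces.
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    parts = [idx[i::k] for i in range(k)]
    # Step 2: train one expert per subspace.
    return [CentroidExpert().fit([X[i] for i in p], [y[i] for i in p])
            for p in parts]

def gate_predict(experts, x):
    # Step 3: gating by (unweighted) majority vote over expert outputs;
    # the paper's gating network learns these weights instead.
    votes = [e.predict_one(x) for e in experts]
    return max(set(votes), key=votes.count)
```

Replacing the random split in `train_mioe` with cluster assignments from X-means would give the explicit (MEoE) variant, in which each expert specializes in one discovered cluster of the data.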
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Shankar Mishra, A., Singh Rathore, S. Implicit and explicit mixture of experts models for software defect prediction. Software Qual J 31, 1331–1368 (2023). https://doi.org/10.1007/s11219-023-09640-6