A new weighted naive Bayes method based on information diffusion for software defect prediction

Ji, Haijin; Huang, Song; Wu, Yaning; Hui, Zhanwei; Zheng, Changyou

doi:10.1007/s11219-018-9436-4

Published: 02 January 2019

Volume 27, pages 923–968, (2019)

Software Quality Journal

Haijin Ji¹,², Song Huang² (ORCID: orcid.org/0000-0002-6894-3916), Yaning Wu², Zhanwei Hui² & Changyou Zheng²

Abstract

Software defect prediction (SDP) plays a significant part in identifying the most defect-prone modules before software testing and in allocating limited testing resources. One of the most commonly used classifiers in SDP is naive Bayes (NB). Despite its simplicity, the NB classifier can often perform better than more complicated classification models. NB assumes that all features are equally important and that numeric features follow a normal distribution. However, features often do not contribute equally to the classification, and a Kolmogorov-Smirnov test usually shows that they are not normally distributed; both violations may harm the performance of the NB classifier. Therefore, this paper proposes a new weighted naive Bayes method based on information diffusion (WNB-ID) for SDP. To address the equal importance assumption, we investigate six weight assignment methods for setting the feature weights and choose the most suitable one based on the F-measure. To address the normal distribution assumption, we apply the information diffusion model (IDM) to compute the probability density of each feature, replacing the default probability density function of the normal distribution. We carry out experiments on 10 software defect data sets from the PROMISE repository, covering three types of projects written in three different programming languages. Several well-known classifiers and ensemble methods are included for comparison. The experimental results demonstrate the effectiveness and practicability of the proposed method.
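The two modifications named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the formulas, so the sketch assumes the commonly used normal diffusion function for the density estimate and the usual weighted NB form, log P(c) + Σ_j w_j · log p(x_j | c). The function names, the toy data, the feature weights, and the diffusion coefficients `h` are all hypothetical, and the rule for choosing `h` is left open.

```python
import math


def diffusion_density(x, samples, h):
    """Estimate the density at x by diffusing each sample with a normal kernel.

    h is the diffusion coefficient. How it should be chosen is not stated in
    the abstract, so it is a plain (assumed) parameter here.
    """
    n = len(samples)
    total = sum(math.exp(-((x - s) ** 2) / (2.0 * h * h)) for s in samples)
    return total / (n * h * math.sqrt(2.0 * math.pi))


def weighted_nb_log_score(x, class_rows, prior, weights, h):
    """Return log P(c) + sum_j w_j * log p(x_j | c) for one class."""
    score = math.log(prior)
    for j, w in enumerate(weights):
        column = [row[j] for row in class_rows]
        density = diffusion_density(x[j], column, h[j])
        # Floor the density so log() never receives zero.
        score += w * math.log(max(density, 1e-300))
    return score


def predict(x, train, weights, h):
    """Pick the class with the highest weighted score.

    train maps each class label to its list of feature rows.
    """
    total = sum(len(rows) for rows in train.values())
    scores = {
        label: weighted_nb_log_score(x, rows, len(rows) / total, weights, h)
        for label, rows in train.items()
    }
    return max(scores, key=scores.get)


def f_measure(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical toy data: label 1 = defective module, 0 = clean module.
    train = {
        0: [[1.0, 10.0], [2.0, 12.0], [1.5, 11.0]],
        1: [[8.0, 40.0], [9.0, 45.0], [7.5, 50.0]],
    }
    weights = [1.0, 0.5]  # assumed feature weights
    h = [1.0, 5.0]        # assumed diffusion coefficients, one per feature
    print(predict([1.8, 11.0], train, weights, h))
    print(predict([8.5, 44.0], train, weights, h))
```

Under these assumptions, comparing weight assignment methods amounts to calling `predict` with each candidate `weights` vector on held-out modules and keeping the vector with the highest `f_measure`.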

Acknowledgements

The authors would like to thank the anonymous reviewers for their constructive comments.

Funding

This work is supported by the National Natural Science Foundation of China (Grant No. 61702544) and the Natural Science Foundation of Jiangsu Province of China (Grant No. BK20160769).

Author information

Authors and Affiliations

School of Computer Science and Technology, Huaiyin Normal University, Huaian, 223300, China
Haijin Ji
Command & Control Engineering College, Army Engineering University of PLA, Nanjing, 210007, China
Haijin Ji, Song Huang, Yaning Wu, Zhanwei Hui & Changyou Zheng

Corresponding author

Correspondence to Song Huang.

Ethics declarations

Conflict of interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Ji, H., Huang, S., Wu, Y. et al. A new weighted naive Bayes method based on information diffusion for software defect prediction. Software Qual J 27, 923–968 (2019). https://doi.org/10.1007/s11219-018-9436-4

Published: 02 January 2019
Issue Date: September 2019
DOI: https://doi.org/10.1007/s11219-018-9436-4
