Incorporating receiver operating characteristics into naive Bayes for unbalanced data classification

Computing

Abstract

Naive Bayesian classification is widely used in data mining because of its simplicity and its robustness to missing values and irrelevant attributes. However, naive Bayes classifiers sometimes perform poorly because they rest on the unrealistic assumptions that all attributes are equally important and conditionally independent of each other. In this research, we relax the former assumption by proposing a new attribute weighting method. The proposed method treats each attribute as a single classifier and measures its discriminating ability by the area under its ROC curve (AUC). Each AUC value is then used to weight the corresponding attribute. In addition, we try to reduce the complexity of classification models by selecting only attributes with high AUC. Using 20 real datasets from the machine learning repository at UC Irvine (UCI), we conduct a numerical experiment to show that the proposed method improves on standard naive Bayes classification and existing weighting methods.
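The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact weighting formula, so the code below assumes that each AUC value is used directly as an exponent on the attribute's conditional probability, that attributes are categorical, that there are two classes labelled 0 and 1, and that probabilities are Laplace-smoothed. All function names (`auc`, `attribute_auc`, `fit_weighted_nb`, `predict`) and the `min_auc` threshold are hypothetical.

```python
import math
from collections import Counter


def auc(scores, labels):
    """AUC as the Mann-Whitney statistic: the fraction of (positive, negative)
    pairs ranked correctly, with ties counted as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def attribute_auc(column, labels):
    """Treat one categorical attribute as a single classifier: score every
    instance by the smoothed P(y = 1 | attribute value), then compute the AUC."""
    pos_count = Counter(v for v, y in zip(column, labels) if y == 1)
    all_count = Counter(column)
    scores = [(pos_count[v] + 1) / (all_count[v] + 2) for v in column]
    return auc(scores, labels)


def fit_weighted_nb(X, y, min_auc=0.0):
    """Fit a weighted naive Bayes model. Attributes whose AUC falls below
    min_auc are dropped (assumed selection rule); the rest get weight = AUC."""
    class_count = Counter(y)
    weights = {}
    for j in range(len(X[0])):
        a = attribute_auc([row[j] for row in X], y)
        if a >= min_auc:
            weights[j] = a
    n_values = {j: len({row[j] for row in X}) for j in weights}
    cond = {j: {c: Counter() for c in class_count} for j in weights}
    for row, c in zip(X, y):
        for j in weights:
            cond[j][c][row[j]] += 1
    return {"class_count": class_count, "weights": weights,
            "n_values": n_values, "cond": cond, "n": len(y)}


def predict(model, row):
    """Return the class maximising log P(c) + sum_j w_j * log P(a_j | c)."""
    best_class, best_score = None, -math.inf
    for c, n_c in model["class_count"].items():
        score = math.log(n_c / model["n"])
        for j, w in model["weights"].items():
            p = (model["cond"][j][c][row[j]] + 1) / (n_c + model["n_values"][j])
            score += w * math.log(p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class


# Toy data: attribute 0 separates the classes, attribute 1 is pure noise.
X = [("a", "x"), ("a", "y"), ("a", "x"), ("a", "y"),
     ("b", "x"), ("b", "y"), ("b", "x"), ("b", "y")]
y = [1, 1, 1, 1, 0, 0, 0, 0]

model = fit_weighted_nb(X, y, min_auc=0.6)
print(model["weights"])            # AUC weight of each retained attribute
print(predict(model, ("a", "y")))  # predicted class for a new instance
```

On the toy data, the noise attribute scores an AUC of 0.5 and is filtered out by the threshold, while the informative attribute keeps a weight equal to its AUC. Working in log space turns the exponent weights into simple multipliers, which keeps the sketch close to a standard naive Bayes implementation.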


Acknowledgments

This research was supported by the MSIP, Korea, under the G-ITRC support program (IITP-2015-R6812-15-0001) supervised by the IITP.

Author information

Corresponding author

Correspondence to Jong-Seok Lee.

About this article

Cite this article

Kim, T., Chung, B.D. & Lee, JS. Incorporating receiver operating characteristics into naive Bayes for unbalanced data classification. Computing 99, 203–218 (2017). https://doi.org/10.1007/s00607-016-0483-z

  • DOI: https://doi.org/10.1007/s00607-016-0483-z
