Incorporating receiver operating characteristics into naive Bayes for unbalanced data classification

Kim, Taeheung; Chung, Byung Do; Lee, Jong-Seok

doi:10.1007/s00607-016-0483-z

Published: 02 February 2016

Computing, Volume 99, pages 203–218 (2017)

Taeheung Kim¹,
Byung Do Chung² &
Jong-Seok Lee¹

Abstract

Naive Bayesian classification has been widely used in data mining because of its simplicity and its robustness to missing values and irrelevant attributes. However, naive Bayes classifiers sometimes perform poorly because of their unrealistic assumptions that all attributes are equally important and conditionally independent of each other. In this research, we dispense with the former assumption by proposing a new attribute weighting method. The proposed method treats each attribute as a single classifier and measures its discriminating ability using the area under the ROC curve (AUC). Each AUC value is then used to weight the corresponding attribute. In addition, we try to reduce the complexity of classification models by selecting high-AUC attributes. Using 20 real datasets from the machine learning repository at UC Irvine (UCI), we conduct a numerical experiment to show that the proposed method improves on standard naive Bayes classification and existing weighting methods.
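
As a rough illustration of the weighting idea described in the abstract, the following Python sketch scores each attribute by its AUC and then uses that value as an exponent on the attribute's likelihood in a Gaussian naive Bayes model. The abstract does not state the weighting formula, so the exponent form, the folding of AUC values below 0.5, the Gaussian likelihoods, the selection threshold of 0.6, and the toy data are all assumptions made for illustration, not the authors' implementation.

```python
import math

def attribute_auc(values, labels):
    """AUC of one attribute used directly as a score for the positive class (label 1).

    Computed as the Mann-Whitney statistic: the fraction of (positive, negative)
    pairs in which the positive has the larger value, with ties counting one half.
    """
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    auc = wins / (len(pos) * len(neg))
    # Assumption: an attribute that ranks in the reverse direction is treated
    # as equally informative, so the AUC is folded into [0.5, 1].
    return max(auc, 1.0 - auc)

def fit_gaussian_nb(X, y):
    """Estimate the class prior and per-attribute (mean, variance) for each class."""
    model = {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        stats = []
        for j in range(len(X[0])):
            col = [r[j] for r in rows]
            mean = sum(col) / len(col)
            # Small constant keeps the variance strictly positive.
            var = sum((v - mean) ** 2 for v in col) / len(col) + 1e-9
            stats.append((mean, var))
        model[c] = (len(rows) / len(X), stats)
    return model

def predict(model, weights, x):
    """Return the class with the highest weighted log-posterior.

    Assumption: each attribute's log-likelihood is multiplied by its weight,
    i.e. the likelihood is raised to the power of the weight.
    """
    best_class, best_score = None, -math.inf
    for c, (prior, stats) in model.items():
        score = math.log(prior)
        for w, v, (mean, var) in zip(weights, x, stats):
            log_lik = -0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)
            score += w * log_lik
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Hypothetical unbalanced toy data: six negatives, two positives.
# Attribute 0 separates the classes; attribute 1 is mostly noise.
X = [[1.0, 5.0], [1.2, 3.0], [0.8, 4.0], [1.1, 6.0],
     [0.9, 2.0], [1.3, 5.5], [3.0, 4.5], [3.2, 3.5]]
y = [0, 0, 0, 0, 0, 0, 1, 1]

weights = [attribute_auc([row[j] for row in X], y) for j in range(len(X[0]))]
model = fit_gaussian_nb(X, y)

# Hypothetical selection step: keep only attributes whose AUC reaches a threshold.
selected = [j for j, w in enumerate(weights) if w >= 0.6]

print(weights, selected)
print(predict(model, weights, [3.1, 4.0]), predict(model, weights, [1.0, 4.0]))
```

In this sketch a weight of 1 leaves an attribute's contribution unchanged, while a weight near 0.5 roughly halves its influence on the log-posterior, so weakly discriminating attributes count for less without being discarded outright.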

Acknowledgments

This research was supported by the MSIP, Korea, under the G-ITRC support program (IITP-2015-R6812-15-0001) supervised by the IITP.

Author information

Authors and Affiliations

Department of Industrial Engineering, Sungkyunkwan University, Suwon, 16419, Republic of Korea
Taeheung Kim & Jong-Seok Lee
Department of Information & Industrial Engineering, Yonsei University, 50 Yonsei-Ro, Seodaemun-gu, Seoul, 03722, Republic of Korea
Byung Do Chung

Corresponding author

Correspondence to Jong-Seok Lee.

About this article

Cite this article

Kim, T., Chung, B.D. & Lee, JS. Incorporating receiver operating characteristics into naive Bayes for unbalanced data classification. Computing 99, 203–218 (2017). https://doi.org/10.1007/s00607-016-0483-z

Received: 06 February 2014
Accepted: 18 January 2016
Published: 02 February 2016
Issue Date: March 2017
DOI: https://doi.org/10.1007/s00607-016-0483-z

Keywords

Mathematics Subject Classification

68Q32
