Abstract
This paper proposes a systematic benchmarking method called BenchMetrics to analyze and compare the robustness of binary classification performance metrics based on the confusion matrix of a crisp classifier. BenchMetrics, which introduces new concepts such as meta-metrics (metrics about metrics) and metric space, was tested on fifteen well-known metrics, including balanced accuracy, normalized mutual information, Cohen’s Kappa, and the Matthews correlation coefficient (MCC), along with two recently proposed metrics in the literature, optimized precision and the index of balanced accuracy. The method formally defines a pseudo-universal metric space in which all permutations of confusion matrix elements yielding the same sample size are calculated. It evaluates metrics and their metric spaces in a two-stage benchmark based on eighteen newly proposed criteria and finally ranks the metrics by aggregating the criteria results. The first, mathematical-evaluation stage analyzes each metric’s equation, specific confusion matrix variations, and the corresponding metric space. The second stage, which includes seven novel meta-metrics, evaluates the robustness aspects of the metric spaces. We interpreted each benchmarking result and comparatively assessed the effectiveness of BenchMetrics against the limited number of comparison studies in the literature. The results demonstrate that widely used metrics have significant robustness issues and that MCC is the most robust metric and the one recommended for binary classification performance evaluation.
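As a concrete illustration of the metric-space concept described in the abstract, the following Python sketch (our own illustration, not the authors’ released BenchMetrics code) enumerates every confusion matrix for a fixed sample size and evaluates MCC over that space; the function names and the zero-denominator convention are assumptions made for this sketch.

```python
from itertools import product
import math

def metric_space(n):
    """Yield every confusion matrix (TP, FN, FP, TN) with TP + FN + FP + TN = n."""
    for tp, fn, fp in product(range(n + 1), repeat=3):
        tn = n - tp - fn - fp
        if tn >= 0:
            yield tp, fn, fp, tn

def mcc(tp, fn, fp, tn):
    """Matthews correlation coefficient; returns 0 when a marginal is zero (a common convention)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

space = list(metric_space(25))
print(len(space))                        # 3276 confusion matrices for Sn = 25
mcc_values = [mcc(*cm) for cm in space]
print(min(mcc_values), max(mcc_values))  # MCC spans [-1, 1] over the whole space
```

The meta-metrics of the second benchmarking stage are then statistics computed over such metric spaces; this sketch only sets up the space itself and one metric evaluated on it.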






Availability of data and material
The authors confirm that all data and materials support their published claims and comply with field standards (see Appendix A).
Code availability
The authors confirm that all software applications and custom code support their published claims and comply with field standards (see Appendix A).
Notes
Note that ‘performance metrics’ bounded in [0, 1] or [− 1, 1] directly represent the success of a classifier (e.g., Accuracy or True Positive Rate). These metrics are the instruments used in the literature to report, evaluate, and compare classifiers. In contrast, ‘performance measures’, which are usually not published, represent other aspects such as dataset or classifier-output characteristics (e.g., PREV is the ratio of positive examples in a dataset and BIAS is the ratio of positive outcomes of a classifier; both are restated as formulas after these notes). Instruments that indicate performance on an unbounded interval [0, ∞) or (− ∞, ∞), such as Odds Ratio or Discriminant Power, are also ‘measures’: their limited interpretability makes them unsuitable for publishing and comparing classification performances in the literature.
Sample sizes and the corresponding metric space sizes (numbers of confusion matrix permutations): Sn = 25 (3,276); Sn = 50 (23,426); Sn = 75 (76,076); Sn = 100 (176,851); Sn = 125 (341,376); Sn = 150 (585,276); Sn = 175 (924,176); Sn = 200 (1,373,701); Sn = 250 (2,667,126). A short closed-form check of these counts is given after the notes.
For the ten negative samples (i = 1, …, 10): c_i = 0 and, for example, p_i = 0.49, so |c_i − p_i| = 0.49. For the remaining ten positive samples (i = 11, …, 20): c_i = 1 and, for example, p_i = 0.51, so |c_i − p_i| = 0.49. Hence, MAE = 0.49. A minimal code check of this example is given after the notes.
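For concreteness, the two measures mentioned in the first note above have the following standard confusion-matrix definitions (restated here from the usual conventions, with TP, FN, FP, TN the confusion matrix elements and S_n the sample size):

$$
\mathrm{PREV} = \frac{TP + FN}{S_n}, \qquad
\mathrm{BIAS} = \frac{TP + FP}{S_n}, \qquad
S_n = TP + FN + FP + TN .
$$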
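The metric space sizes listed in the second note above are simply the number of non-negative integer solutions of TP + FN + FP + TN = Sn, which by the stars-and-bars argument equals C(Sn + 3, 3). A minimal check (assuming Python 3.8+ for math.comb):

```python
from math import comb

def metric_space_size(n):
    """Count confusion matrices (TP, FN, FP, TN) with TP + FN + FP + TN = n:
    non-negative integer solutions, i.e. C(n + 3, 3) by stars and bars."""
    return comb(n + 3, 3)

for n in (25, 50, 75, 100, 125, 150, 175, 200, 250):
    print(n, metric_space_size(n))
# 25 -> 3276, 50 -> 23426, ..., 250 -> 2667126, matching the sizes listed in the note
```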
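A minimal numeric check of the MAE example in the last note above (the arrays are chosen to match the values given there):

```python
# Ten negative samples scored 0.49 and ten positive samples scored 0.51
actual    = [0] * 10 + [1] * 10
predicted = [0.49] * 10 + [0.51] * 10

mae = sum(abs(c - p) for c, p in zip(actual, predicted)) / len(actual)
print(round(mae, 2))  # 0.49
```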
References
Luque A, Carrasco A, Martín A, de las Heras A (2019) The impact of class imbalance in classification performance metrics based on the binary confusion matrix. Pattern Recognit 91:216–231. https://doi.org/10.1016/j.patcog.2019.02.023
Staartjes VE, Schröder ML (2018) Letter to the Editor. Class imbalance in machine learning for neurosurgical outcome prediction: are our models valid? J Neurosurg Spine 29:611–612. https://doi.org/10.3171/2018.5.SPINE18543
Brown JB (2018) Classifiers and their metrics quantified. Mol Inform 37:1–11. https://doi.org/10.1002/minf.201700127
Sokolova M (2006) Assessing invariance properties of evaluation measures. In: Proceedings of the Workshop on Testing of Deployable Learning and Decision Systems, 19th Neural Information Processing Systems Conference (NIPS 2006), pp 1–6
Ranawana R, Palade V (2006) Optimized precision—a new measure for classifier performance evaluation. In: 2006 IEEE International Conference on Evolutionary Computation. IEEE, Vancouver, BC, Canada, pp 2254–2261
Garcia V, Mollineda RA, Sanchez JS (2010) Theoretical analysis of a performance measure for imbalanced data. In: 2010 20th International Conference on Pattern Recognition (ICPR). IEEE, pp 617–620. https://doi.org/10.1109/ICPR.2010.156
Boughorbel S, Jarray F, El-Anbari M (2017) Optimal classifier for imbalanced data using Matthews correlation coefficient metric. PLoS ONE 12:1–17. https://doi.org/10.1371/journal.pone.0177678
Ferri C, Hernández-Orallo J, Modroiu R (2009) An experimental comparison of performance measures for classification. Pattern Recognit Lett 30:27–38. https://doi.org/10.1016/j.patrec.2008.08.010
Seliya N, Khoshgoftaar TM, Van Hulse J (2009) Aggregating performance metrics for classifier evaluation. In: IEEE International Conference on Information Reuse and Integration, IRI. pp 35–40
Liu Y, Zhou Y, Wen S, Tang C (2016) A strategy on selecting performance metrics for classifier evaluation. Int J Mob Comput Multimed Commun 6:20–35. https://doi.org/10.4018/ijmcmc.2014100102
Brzezinski D, Stefanowski J, Susmaga R, Szczȩch I (2018) Visual-based analysis of classification measures and their properties for class imbalanced problems. Inf Sci (Ny) 462:242–261. https://doi.org/10.1016/j.ins.2018.06.020
Hu B-G, Dong W-M (2014) A study on cost behaviors of binary classification measures in class-imbalanced problems. Comput Res Repos abs/1403.7
Sokolova M, Lapalme G (2009) A systematic analysis of performance measures for classification tasks. Inf Process Manag 45:427–437. https://doi.org/10.1016/j.ipm.2009.03.002
Huang J, Ling CX (2005) Using AUC and accuracy in evaluating learning algorithms. IEEE Trans Knowl Data Eng 17:299–310. https://doi.org/10.1109/TKDE.2005.50
Forbes A (1995) Classification-algorithm evaluation: five performance measures based on confusion matrices. J Clin Monit Comput 11:189–206. https://doi.org/10.1007/BF01617722
Pereira RB, Plastino A, Zadrozny B, Merschmann LHC (2018) Correlation analysis of performance measures for multi-label classification. Inf Process Manag 54:359–369. https://doi.org/10.1016/j.ipm.2018.01.002
Straube S, Krell MM (2014) How to evaluate an agent’s behavior to infrequent events? Reliable performance estimation insensitive to class distribution. Front Comput Neurosci 8:1–6. https://doi.org/10.3389/fncom.2014.00043
Hossin M, Sulaiman MN (2015) A review on evaluation metrics for data classification evaluations. Int J Data Min Knowl Manag Process 5:1–11. https://doi.org/10.5121/ijdkp.2015.5201
Tharwat A (2020) Classification assessment methods. Appl Comput Inform, ahead of print, pp 1–13. https://doi.org/10.1016/j.aci.2018.08.003
Chicco D, Jurman G (2020) The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics. https://doi.org/10.1186/s12864-019-6413-7
Brzezinski D, Stefanowski J, Susmaga R, Szczech I (2020) On the dynamics of classification measures for imbalanced and streaming data. IEEE Trans Neural Networks Learn Syst 31:1–11. https://doi.org/10.1109/TNNLS.2019.2899061
Baldi P, Brunak S, Chauvin Y et al (2000) Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics 16:412–424. https://doi.org/10.1093/bioinformatics/16.5.412
Hu B-G, He R, Yuan X-T (2012) Information-theoretic measures for objective evaluation of classifications. Acta Autom Sin 38:1169–1182. https://doi.org/10.1016/S1874-1029(11)60289-9
Fawcett T (2006) An introduction to ROC analysis. Pattern Recognit Lett 27:861–874. https://doi.org/10.1016/j.patrec.2005.10.010
Valverde-Albacete FJ, Peláez-Moreno C (2014) 100% classification accuracy considered harmful: the normalized information transfer factor explains the accuracy paradox. PLoS ONE 9:1–10. https://doi.org/10.1371/journal.pone.0084217
Shepperd M (2013) Assessing the predictive performance of machine learners in software defect prediction function. In: The 24th CREST Open Workshop (COW), on Machine Learning and Search Based Software Engineering (ML&SBSE). Centre for Research on Evolution, Search and Testing (CREST), London, pp 1–16
Schröder G, Thiele M, Lehner W (2011) Setting goals and choosing metrics for recommender system evaluations. In: UCERSTI 2 Workshop at the 5th ACM Conference on Recommender Systems. Chicago, Illinois, pp 1–8
Delgado R, Tibau XA (2019) Why Cohen’s kappa should be avoided as performance measure in classification. PLoS ONE 14:1–26. https://doi.org/10.1371/journal.pone.0222916
Ma J, Zhou S (2020) Metric learning-guided k nearest neighbor multilabel classifier. Neural Comput Appl. https://doi.org/10.1007/s00521-020-05134-9
Fatourechi M, Ward RK, Mason SG et al (2008) Comparison of evaluation metrics in classification applications with imbalanced datasets. In: 7th International Conference on Machine Learning and Applications (ICMLA), pp 777–782
Seliya N, Khoshgoftaar TM, Van Hulse J (2009) A study on the relationships of classifier performance metrics. In: 21st IEEE International Conference on Tools with Artificial Intelligence, ICTAI, pp 59–66
Joshi MV (2002) On evaluating performance of classifiers for rare classes. In: Proceedings IEEE International Conference on Data Mining. IEEE, pp 641–644
Caruana R, Niculescu-Mizil A (2004) Data mining in metric space: an empirical analysis of supervised learning performance criteria. In: Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp 69–78
Huang J, Ling CX (2007) Constructing new and better evaluation measures for machine learning. In: International Joint Conference on Artificial Intelligence (IJCAI 2007), pp 859–864
Japkowicz N, Shah M (2011) Evaluating learning algorithms: a classification perspective. Cambridge University Press, Cambridge
Contreras-Reyes JE (2020) An asymptotic test for bimodality using the Kullback-Leibler divergence. Symmetry (Basel) 12:1–13. https://doi.org/10.3390/SYM12061013
Shi L, Campbell G, Jones WD et al (2010) The Microarray Quality Control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models. Nat Biotechnol 28:827–838. https://doi.org/10.1038/nbt.1665
Rohani A, Mamarabadi M (2019) Free alignment classification of dikarya fungi using some machine learning methods. Neural Comput Appl 31:6995–7016. https://doi.org/10.1007/s00521-018-3539-5
Azar AT, El-Said SA (2014) Performance analysis of support vector machines classifiers in breast cancer mammography recognition. Neural Comput Appl 24:1163–1177. https://doi.org/10.1007/s00521-012-1324-4
Canbek G, Sagiroglu S, Taskaya Temizel T, Baykal N (2017) Binary classification performance measures/metrics: A comprehensive visualized roadmap to gain new insights. In: 2017 International Conference on Computer Science and Engineering (UBMK). IEEE, Antalya, Turkey, pp 821–826
Funding
The authors did not receive support from any organization for the submitted work.
Author information
Authors and Affiliations
Contributions
GC: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Data curation, Writing—original draft, Writing—review and editing, Visualization. TTT: Validation, Writing—review and editing, Supervision. SS: Conceptualization, Validation, Writing—review and editing, Supervision.
Corresponding author
Ethics declarations
Conflicts of interest
The authors have no relevant financial or non-financial interests to disclose.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A Developed online research tools and data
- An online interactive BenchMetrics experimentation platform (Platform: Code Ocean)
- BenchMetrics open-source performance metrics benchmarking software library (API) (Repository: GitHub)
- Binary classification performance metric spaces data (Repository: Mendeley Data)
- Binary classification performance metrics benchmarking data (Repository: Mendeley Data)
Appendix B Binary classification performance instrument list
Table 16 lists the instruments and their abbreviations and equations. Note that more information can be found in [40]. See Table 12 for the recently proposed metrics’ equations.
About this article
Cite this article
Canbek, G., Taskaya Temizel, T. & Sagiroglu, S. BenchMetrics: a systematic benchmarking method for binary classification performance metrics. Neural Comput & Applic 33, 14623–14650 (2021). https://doi.org/10.1007/s00521-021-06103-6