
PToPI: A Comprehensive Review, Analysis, and Knowledge Representation of Binary Classification Performance Measures/Metrics

  • Original Research
  • Published in SN Computer Science

Abstract

Although only a few performance evaluation instruments have been used conventionally in different machine learning-based classification problem domains, there are numerous ones defined in the literature. This study reviews and describes performance instruments via formally defined novel concepts and clarifies the terminology. The study first highlights the issues in performance evaluation via a survey of 78 mobile-malware classification studies and reviews the terminology. Based on three research questions, it proposes novel concepts to identify the characteristics, similarities, and differences of instruments, which are categorized into ‘performance measures’ and ‘performance metrics’ in the classification context for the first time. The concepts, which reflect the intrinsic properties of instruments such as canonical form, geometry, duality, complementation, dependency, and leveling, aim to reveal the similarities and differences of numerous instruments, such as redundancy and ground-truth versus prediction focuses. As an application of knowledge representation, we introduce a new exploratory table called PToPI (Periodic Table of Performance Instruments) for 29 measures and 28 metrics (69 instruments including variant and parametric ones). Visualizing the proposed concepts, PToPI provides a new relational structure for the instruments, including graphical, probabilistic, and entropic ones, to see their properties and dependencies all in one place. Applications of the exploratory table to six examples from different domains in the literature have shown that PToPI aids overall instrument analysis and the selection of proper performance metrics according to the specific requirements of a classification problem. We expect that the proposed concepts and PToPI will help researchers comprehend and use the instruments and follow a systematic approach to classification performance evaluation and publication.


Notes

  1. Surveyed studies: S#17 (in Table 9); S#32 (in Tables 7, 8, and 9); S#39 (in Table 8); S#40 (in Table 2); S#57 (in Table 8); and S#18 (in Table 8, where the TPR and recall equations are given at the same time). The tables cited here are numbered as they appear in the studies listed in online Table E.1 described in Appendix 5.

  2. https://www.merriam-webster.com/dictionary/measure.

  3. https://www.merriam-webster.com/dictionary/metric.

  4. \(DET({BM}_{X})\stackrel{?}{=}DET({BM}_{X1})+DET({BM}_{X2})\) ⇒ Examples (Sn = 20 = 10 + 10): subadditive: \(\left|\begin{array}{cc}5& 6\\ 5& 4\end{array}\right|\le \left|\begin{array}{cc}3& 3\\ 1& 3\end{array}\right|+\left|\begin{array}{cc}2& 3\\ 4& 1\end{array}\right|\) (−10 ≤ 6 + (−10)), superadditive: \(\left|\begin{array}{cc}5& 7\\ 3& 5\end{array}\right|\ge \left|\begin{array}{cc}2& 4\\ 2& 2\end{array}\right|+\left|\begin{array}{cc}3& 3\\ 1& 3\end{array}\right|\) (4 ≥ −4 + 6), countably additive: \(\left|\begin{array}{cc}4& 4\\ 6& 6\end{array}\right|=\left|\begin{array}{cc}3& 2\\ 2& 3\end{array}\right|+\left|\begin{array}{cc}1& 2\\ 4& 3\end{array}\right|\) (0 = 5 + (−5)).
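The footnote's arithmetic can be reproduced with a short NumPy check (a sketch added here for convenience; the matrices are copied verbatim from the examples above):

```python
import numpy as np

def det(m):
    """2x2 determinant (the DET measure applied to a confusion-matrix block)."""
    return round(np.linalg.det(np.array(m, dtype=float)))

# Confusion matrices from the three examples above (Sn = 20 = 10 + 10)
cases = {
    "subadditive":        ([[5, 6], [5, 4]], [[3, 3], [1, 3]], [[2, 3], [4, 1]]),
    "superadditive":      ([[5, 7], [3, 5]], [[2, 4], [2, 2]], [[3, 3], [1, 3]]),
    "countably additive": ([[4, 4], [6, 6]], [[3, 2], [2, 3]], [[1, 2], [4, 3]]),
}

for name, (whole, part1, part2) in cases.items():
    print(f"{name}: DET(whole) = {det(whole)}, sum of parts = {det(part1) + det(part2)}")
# subadditive: -10 vs -4, superadditive: 4 vs 2, countably additive: 0 vs 0
```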

  5. Performance metrics that are represented in [-1, 1] (e.g., CK and MCC) can be transformed into [0, 1].


Author information

Contributions

GC: conceptualization, methodology, software, validation, formal analysis, investigation, data curation, writing—original draft, writing—review and editing, and visualization. TTT: validation, writing—review and editing, and supervision. SS: validation, writing—review and editing, and supervision.

Corresponding author

Correspondence to Gürol Canbek.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1: Instrument Abbreviation and Name List

The performance instrument abbreviations (symbols) are listed below in alphabetical order per level and per instrument category, together with their names and alternative names. We suggest using the first full name (not the one in square brackets) to standardize the terminology in the classification context.

PERFORMANCE MEASURES (29 measures)

(Canonicals: 11 measures: base measures and 1st level measures).

Base Measures (BM) (4 measures):

FN: False Negatives, FP: False Positives, TN: True Negatives, TP: True Positives.

1st Level Measures (7 measures):

N: Negatives, P: Positives, ON: Outcome Negatives, OP: Outcome Positives,

FC: False Classification, TC: True Classification, Sn: Sample Size.

2nd Level Measures (11 measures):

BIAS: Bias, CKc: Cohen's Kappa Chance, DET: Determinant,

DPR: D Prime, IMB: (Class) Imbalance, LRN: Negative Likelihood Ratio, LRP: Positive Likelihood Ratio, NER: Null Error Rate, NIR: No Information Rate (non-information rate), PREV: Prevalence, SKEW: (Class) Skew.

Probabilistic error/loss measures (2 measures):

LogLoss (binary cross-entropy), MRAE (MdRAE / GMRAE): Mean (Median/Geometric Mean) Relative Absolute Error.

3rd Level Measures (5 measures):

DP: Discriminant Power, HC: Class Entropy, HO: Outcome Entropy, LIFT: Lift, OR: Odds Ratio.

PERFORMANCE METRICS (28 metrics)

Base Metrics (14 metrics):

ACC: Accuracy (efficiency, rand index), CRR: (Correct) Rejection Rate, DR: Detection Rate, FDR: False Discovery Rate, FNR: False Negative Rate (miss rate), FOR: False Omission Rate (imprecision), FPR: False Positive Rate (fall-out), HOC: Joint Entropy, MCR: Misclassification Rate, Zero–One Loss (normalized), MI: Mutual Information, NPV: Negative Predictive Value, PPV: Positive Predictive Value (precision, confidence), TNR: True Negative Rate (inverse recall, specificity), TPR: True Positive Rate (recall, sensitivity, hit rate, recognition rate).

1st Level Metrics (13 metrics):

Confusion-matrix derived metrics (8 metrics): BACC: Balanced Accuracy (strength), CK: Cohen's Kappa (Heidke skill score, quality index), F1: F metric (F-score, F-measure, positive specific agreement), (Fm: F-metrics for all weights, F2, F0.5, and Fβ: F metric with weight 2, 0.5 and β), G: G metric (G-mean, Fowlkes-Mallows index), INFORM: Informedness (Youden’s index, delta P', Peirce skill score), MARK: Markedness (delta P, Clayton skill score, predictive summary index), nMI: Normalized Mutual Information, wACC: Weighted Accuracy.

Graphical metrics (2 metrics): AUCROC: Area-Under-ROC-Curve (ROC: Receiver Operating Characteristic curve) (GINI), AUCPR: Area-Under-Precision–Recall Curve.

Probabilistic error/loss metrics (3 metrics): MSE: Mean Squared Error (Brier score), MAE/MdAE/MxAE: Mean/Median/Maximum Absolute Error, RMSE: Root Mean Square Error, nsMAPE: Normalized Symmetric Mean Absolute Percentage Error.

2nd Level Metric (1 metric):

MCC: Matthews Correlation Coefficient (Phi correlation coefficient, Cohen’s index, Yule phi).

Appendix 2: Performance Instrument Equations

The equations of performance instruments are listed below as a complete reference. Equations (with complements and duals if any) are provided in high-level and/or canonical forms. Equivalent forms are also provided for some instruments.

Measures’ Equations (numbered as in PToPI, where measure numbers are underlined)

 

\(TP\)

(B.1)

\(FP\)

(B.2)

\(FN\)

(B.3)

\(TN\)

(B.4)

\(P=TP+FN\)

(B.5)

\(N=TN+FP\)

(B.6)

\(OP=TP+FP\)

(B.7)

\(ON=TN+FN\)

(B.8)

\(TC=TP+TN\)

(B.9)

\(FC=FP+FN\)

(B.10)

\(Sn=TP+FP+FN+TN=P+N=OP+ON=TC+FC\)

(B.11)

\(PREV=\frac{P}{Sn}={BIAS}^{*}\)

(B.12)

\(NER=\frac{N}{Sn}=\overline{PREV }\)

(B.13)

\(SKEW =N:P\)

(B.14)

\(IMB =\frac{\mathrm{max}\left(P, N\right)}{\mathrm{min}\left(P, N\right)}\)

(B.15)

\(NIR=\frac{\mathrm{max}\left(P, N\right)}{Sn}\)

(B.16)

\(BIAS=\frac{OP}{Sn}={PREV}^{*}\)

(B.17)

\(DPR={\rm Z}\left(TPR\right)-{\rm Z}\left(FPR\right)\), where Z is the inverse of the standard normal cumulative distribution function (probit)

(B.18)

\(LRP=\frac{TPR}{FPR}=\frac{TP\cdot N}{FP\cdot P}\)

(B.19)

\(LRN=\frac{FNR}{TNR}=\frac{FN\cdot N}{TN\cdot P}\)

(B.20)

\(DET=TP\cdot TN-FP\cdot FN\)

(B.21)

\(CKc=\frac{P\cdot OP+N\cdot ON}{{Sn}^{2}}\)

(B.22)

\(HC =-\sum_{m=PREV,1-PREV}m{\mathrm{log}}_{2}m\)

(B.23)

\(HO =-\sum_{m=BIAS,1-BIAS}m{\mathrm{log}}_{2}m\)

(B.24)

\(LIFT=\frac{TPR}{BIAS}=\frac{TP\cdot Sn}{P\cdot OP}\)

(B.25)

\(OR=\frac{LRP}{LRN}=\frac{TPR\cdot TNR}{FPR\cdot FNR}=\frac{TP\cdot TN}{FP\cdot FN}\)

(B.26)

\(DP=\frac{\sqrt{3}}{\pi }\mathrm{log}\frac{LRP}{LRN}=\frac{\sqrt{3}}{\pi }\mathrm{log}\frac{TPR\cdot TNR}{FPR\cdot FNR}=\frac{\sqrt{3}}{\pi }\mathrm{log}\frac{TP\cdot TN}{FP\cdot FN}\)

(B.27)
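For illustration only, the following Python sketch computes the measures of Eqs. B.5–B.27 from the four base measures. It is not the authors' code; the helper name measures_from_counts is ours, NumPy/SciPy availability is assumed, and the logarithm base in B.27 is taken as natural since the appendix does not state one.

```python
import math
from scipy.stats import norm  # norm.ppf: inverse standard-normal CDF, the Z(.) of Eq. B.18

def measures_from_counts(TP, FP, FN, TN):
    """First- to third-level measures (Eqs. B.5-B.27) from the four base measures."""
    P, N = TP + FN, TN + FP                       # B.5, B.6
    OP, ON = TP + FP, TN + FN                     # B.7, B.8
    TC, FC = TP + TN, FP + FN                     # B.9, B.10
    Sn = P + N                                    # B.11
    PREV, NER = P / Sn, N / Sn                    # B.12, B.13
    SKEW, IMB = N / P, max(P, N) / min(P, N)      # B.14, B.15
    NIR, BIAS = max(P, N) / Sn, OP / Sn           # B.16, B.17
    DPR = norm.ppf(TP / P) - norm.ppf(FP / N)     # B.18 (finite only for 0 < TPR, FPR < 1)
    LRP = (TP * N) / (FP * P)                     # B.19
    LRN = (FN * N) / (TN * P)                     # B.20
    DET = TP * TN - FP * FN                       # B.21
    CKc = (P * OP + N * ON) / Sn ** 2             # B.22
    HC = -sum(m * math.log2(m) for m in (PREV, 1 - PREV) if m > 0)   # B.23
    HO = -sum(m * math.log2(m) for m in (BIAS, 1 - BIAS) if m > 0)   # B.24
    LIFT = (TP * Sn) / (P * OP)                   # B.25
    OR = LRP / LRN                                # B.26
    DP = (math.sqrt(3) / math.pi) * math.log(OR)  # B.27 (log base unstated; natural log assumed)
    return locals()  # all computed quantities, keyed by name

print(measures_from_counts(TP=45, FP=5, FN=10, TN=40))
```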

Metrics’ Equations (numbered as in PToPI)

 

\(TPR=\frac{TP}{P}={PPV}^{*}\)

(B.1)

\(FNR=\frac{FN}{P}=\overline{TPR }\)

(B.2)

\(FPR=\frac{FP}{N}=\overline{TNR }\)

(B.3)

\(TNR=\frac{TN}{N}={NPV}^{*}\)

(B.4)

\(PPV=\frac{TP}{OP}={TPR}^{*}\)

(B.5)

\(FDR=\frac{FP}{OP}=\overline{PPV }\)

(B.6)

\(FOR=\frac{FN}{ON}=\overline{NPV }\)

(B.7)

\(NPV=\frac{TN}{ON}={TNR}^{*}\)

(B.8)

\(ACC =\frac{TC}{Sn}\)

(B.9)

\(MCR =\frac{FC}{Sn}=\overline{ACC }\)

(B.10)

\(DR=\frac{TP}{Sn}\)

(B.11)

\(CRR=\frac{TN}{Sn}\)

(B.12)

\(HOC =-\sum_{m=TP,FP,FN,TN}\frac{m}{Sn}{\mathrm{log}}_{2}\frac{m}{Sn}\)

(B.13)

\(\begin{aligned} MI &=\frac{TP}{Sn}{\mathrm{log}}_{2}\frac{TP/Sn}{PREV\cdot BIAS}+\frac{FP}{Sn}{\mathrm{log}}_{2}\frac{FP/Sn}{\left(1-PREV\right)\cdot BIAS}\\ &\quad +\frac{FN}{Sn}{\mathrm{log}}_{2}\frac{FN/Sn}{PREV\cdot (1-BIAS)}+\frac{TN}{Sn}{\mathrm{log}}_{2}\frac{TN/Sn}{(1-PREV)\cdot (1-BIAS)} \end{aligned}\)

(B.14)
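A matching sketch for the fourteen base metrics (Eqs. B.1–B.14 of the metric series), again an illustrative helper written for this appendix rather than the paper's implementation:

```python
import math

def base_metrics(TP, FP, FN, TN):
    """The fourteen base metrics (B.1-B.14) computed directly from the confusion matrix."""
    P, N, OP, ON = TP + FN, TN + FP, TP + FP, TN + FN
    Sn = P + N
    TPR, FNR, FPR, TNR = TP / P, FN / P, FP / N, TN / N       # B.1-B.4
    PPV, FDR, FOR, NPV = TP / OP, FP / OP, FN / ON, TN / ON   # B.5-B.8
    ACC, MCR = (TP + TN) / Sn, (FP + FN) / Sn                 # B.9, B.10
    DR, CRR = TP / Sn, TN / Sn                                # B.11, B.12
    PREV, BIAS = P / Sn, OP / Sn
    cells = (TP, FP, FN, TN)
    expected = (PREV * BIAS, (1 - PREV) * BIAS, PREV * (1 - BIAS), (1 - PREV) * (1 - BIAS))
    HOC = -sum(m / Sn * math.log2(m / Sn) for m in cells if m > 0)                        # B.13
    MI = sum(m / Sn * math.log2((m / Sn) / e) for m, e in zip(cells, expected) if m > 0)  # B.14
    return locals()

print(base_metrics(TP=45, FP=5, FN=10, TN=40))
```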

\(INFORM=TPR+TNR-1=\frac{TP\cdot N+TN\cdot P-P\cdot N}{P\cdot N}=\frac{TP\cdot N+TN\cdot P}{P\cdot N}-1={MARK}^{*}\)

(B.15)

\(BACC =\frac{TPR+TNR}{2}=\frac{TP\cdot N+TN\cdot P}{2\cdot P\cdot N}\)

(B.16)

\(G=\sqrt[2]{TPR\cdot TNR}=\sqrt{\frac{TP\cdot TN}{P\cdot N}}\)

(B.17)

\(wACC=w\cdot TPR+\left(1-w\right)\cdot TNR \text{ where } \textit{w} \text{ is in (0, 1)}\)

(B.18)

\(MARK=PPV+NPV-1=\frac{TP\cdot ON+TN\cdot OP-OP\cdot ON}{OP\cdot ON}=\frac{TP\cdot ON+TN\cdot OP}{OP\cdot ON}-1={INFORM}^{*}\)

(B.19)

\(CK=\frac{ACC-CKc}{1-CKc}=\frac{2(TP\cdot TN-FP\cdot FN)}{P\cdot ON+N\cdot OP}=\frac{DET}{(P\cdot ON+N\cdot OP)/2}\)

Correction: CK is undefined (NaN) due to a zero-divided-by-zero (0/0) case when P = 0 and OP = 0 (TN = Sn) or when N = 0 and ON = 0 (TP = Sn). Therefore, CK should be taken as 1 (one) in these cases

(B.20)
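The degenerate-case correction above can be implemented directly; a minimal sketch (the helper name cohens_kappa is ours):

```python
def cohens_kappa(TP, FP, FN, TN):
    """CK (Eq. B.20) with the degenerate-case correction noted above."""
    P, N, OP, ON = TP + FN, TN + FP, TP + FP, TN + FN
    denom = P * ON + N * OP
    if denom == 0:        # TP = Sn or TN = Sn: CK is 0/0, defined as 1 per the correction
        return 1.0
    return 2 * (TP * TN - FP * FN) / denom

print(cohens_kappa(40, 5, 10, 45))   # ordinary case
print(cohens_kappa(100, 0, 0, 0))    # degenerate case handled as 1.0
```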

\({F}_{1}=\frac{2PPV\cdot TPR}{PPV+TPR}=\frac{2TP}{P+OP}=\frac{2TP}{2TP+FC}\)

(B.21)

\({F}_{\upbeta }=\frac{\left(1+{\upbeta }^{2}\right)PPV\cdot \mathrm{TPR}}{{\upbeta }^{2}PPV+TPR}=\frac{\left(1+{\upbeta }^{2}\right)TP}{\left(1+{\upbeta }^{2}\right)TP+{\upbeta }^{2}FN+FP}\)

(B.21.1)

\({F}_{0.5}=\frac{1.25PPV\cdot TPR}{0.25PPV+TPR}=\frac{1.25TP}{1.25TP+0.25FN+FP}\)

(B.21.2)

\({F}_{2}=\frac{5PPV\cdot TPR}{4PPV+TPR}=\frac{5TP}{5TP+4FN+FP}\)

(B.21.3)
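The F-metric family (Eqs. B.21–B.21.3) reduces to a single parametric helper; a brief illustrative sketch (ours, not from the paper):

```python
def f_beta(TP, FP, FN, beta=1.0):
    """F-metric family (Eqs. B.21-B.21.3) in the canonical TP/FN/FP form."""
    b2 = beta ** 2
    return (1 + b2) * TP / ((1 + b2) * TP + b2 * FN + FP)

TP, FP, FN = 40, 10, 20
print(f_beta(TP, FP, FN))            # F1
print(f_beta(TP, FP, FN, beta=0.5))  # F0.5 weights precision (PPV) more heavily
print(f_beta(TP, FP, FN, beta=2))    # F2 weights recall (TPR) more heavily
```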

nMI variants:

 

\(nMI=\frac{MI}{f(HO,HC,HOC)}\)

(B.22)

\(nMI={nMI}_{ari}=\frac{MI}{(HO+HC)/2}\)

(B.22.1)

\({nMI}_{geo}=\frac{MI}{\sqrt[2]{HO\cdot HC}}\)

(B.22.2)

\({nMI}_{joi}=\frac{MI}{HOC}\)

(B.22.3)

\({nMI}_{min}=\frac{MI}{{\text{min}}(HO,HC)}\)

(B.22.4)

\({nMI}_{max}=\frac{MI}{{\text{max}}(HO,HC)}\)

(B.22.5)
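The nMI variants differ only in the normalizing denominator; a short sketch (our own helper, reusing the MI and entropy definitions above):

```python
import math

def nmi_variants(TP, FP, FN, TN):
    """MI (B.14) normalized by the denominators of Eqs. B.22.1-B.22.5."""
    Sn = TP + FP + FN + TN
    PREV, BIAS = (TP + FN) / Sn, (TP + FP) / Sn
    HC = -sum(m * math.log2(m) for m in (PREV, 1 - PREV) if m > 0)
    HO = -sum(m * math.log2(m) for m in (BIAS, 1 - BIAS) if m > 0)
    cells = (TP, FP, FN, TN)
    HOC = -sum(m / Sn * math.log2(m / Sn) for m in cells if m > 0)
    expected = (PREV * BIAS, (1 - PREV) * BIAS, PREV * (1 - BIAS), (1 - PREV) * (1 - BIAS))
    MI = sum(m / Sn * math.log2((m / Sn) / e) for m, e in zip(cells, expected) if m > 0)
    return {
        "nMI_ari": MI / ((HO + HC) / 2),     # B.22.1 (the default nMI)
        "nMI_geo": MI / math.sqrt(HO * HC),  # B.22.2
        "nMI_joi": MI / HOC,                 # B.22.3
        "nMI_min": MI / min(HO, HC),         # B.22.4
        "nMI_max": MI / max(HO, HC),         # B.22.5
    }

print(nmi_variants(45, 5, 10, 40))
```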

\(\begin{aligned} MCC & = \sqrt {INFORM \cdot MARK} = \sqrt {TPR \cdot TNR \cdot PPV \cdot NPV} - \sqrt {FPR \cdot FNR \cdot FDR \cdot FOR} \\ & \quad = \frac{{TP/Sn - PREV \cdot BIAS}}{{\sqrt {PREV \cdot BIAS \cdot \left( {1 - PREV} \right) \cdot \left( {1 - BIAS} \right)} }} = \frac{{TP \cdot TN - FP \cdot FN}}{{\sqrt {P \cdot OP \cdot N \cdot ON} }} = \frac{{DET}}{{\sqrt {P \cdot OP \cdot N \cdot ON} }} \\ \end{aligned}\)

(B.23)
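The equivalent forms of Eq. B.23 can be checked numerically; a sketch (ours) that computes MCC three ways and should print three identical values:

```python
import math

def mcc_forms(TP, FP, FN, TN):
    """Equivalent MCC expressions from Eq. B.23; all return the same value."""
    P, N, OP, ON = TP + FN, TN + FP, TP + FP, TN + FN
    Sn = P + N
    PREV, BIAS = P / Sn, OP / Sn
    TPR, TNR, PPV, NPV = TP / P, TN / N, TP / OP, TN / ON
    FPR, FNR, FDR, FOR = FP / N, FN / P, FP / OP, FN / ON
    via_rates = math.sqrt(TPR * TNR * PPV * NPV) - math.sqrt(FPR * FNR * FDR * FOR)
    via_prev  = (TP / Sn - PREV * BIAS) / math.sqrt(PREV * BIAS * (1 - PREV) * (1 - BIAS))
    via_det   = (TP * TN - FP * FN) / math.sqrt(P * OP * N * ON)
    return via_rates, via_prev, via_det

print(mcc_forms(45, 5, 10, 40))   # three numerically identical values
```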

Graphical Performance Metrics (numbered with ‘g’ prefix)

 

AUCROC: area-under-ROC-curve (TPR versus FPR)

(B.g1)

\(GINI=2AUCROC-1\)

(B.g1.1)

AUCPR: area-under-PR-curve (PPV versus TPR)

(B.g2)
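The graphical metrics require score outputs rather than a single confusion matrix. A minimal sketch (ours; it uses the rank-probability form of AUCROC and the average-precision form of AUCPR, which approximate the curve areas rather than reproducing any particular interpolation scheme):

```python
import numpy as np

def auc_roc(scores, labels):
    """AUCROC (B.g1): probability that a random positive outranks a random negative
    (ties count 1/2); equal to the area under the TPR-versus-FPR curve."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (pos.size * neg.size)

def auc_pr(scores, labels):
    """AUCPR (B.g2) in its average-precision form: mean PPV over the ranks of the
    true positives when examples are sorted by decreasing score."""
    order = np.argsort(-scores)
    y = labels[order]
    ppv = np.cumsum(y) / np.arange(1, y.size + 1)
    return float((ppv * y).sum() / y.sum())

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
scores = 0.3 * labels + 0.7 * rng.random(200)   # noisy but informative scores
print(auc_roc(scores, labels), 2 * auc_roc(scores, labels) - 1, auc_pr(scores, labels))  # incl. GINI (B.g1.1)
```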

Probabilistic Error/Loss Base Equations (numbered with ‘p’ prefix)

 

(Summary Functions)

 

\({e}_{i}={c}_{i}-{p}_{i}\)

(B.pi)

\({\%e}_{i}=\frac{{e}_{i}}{{c}_{i}}\)

(B.pii)

\({\Delta c}_{i}={c}_{i}-\overline{c }\)

(B.piii)

\({sym\_e}_{i}=\frac{{e}_{i}}{\left|{c}_{i}\right|+\left|{p}_{i}\right|}\)

(B.pvi)

\({rel\_e}_{i}=\frac{{e}_{i}}{{\Delta c}_{i}}\)

(B.piv)

\({sca\_e}_{i}=\frac{{e}_{i}}{\underset{i=2, \ldots ,{ Sn}}{\mathrm{mean}}\left|{c}_{i}-{c}_{i-1}\right|}\)

(B.pv)

\({c}_{i}\in \{0, 1\}\): ground-truth class label for the ith example (0 for the negative class, 1 for the positive class).

In LogLoss, \({p}_{i}\in [0, 1]\) is the score produced by a model C for each of the Sn examples. In the others, \({p}_{i}=P\left({c}_{i}=1 \mid {x}_{i}\right)=C({x}_{i})\) is the predicted class-membership score, and the outcome is positive if \({p}_{i}\ge \theta\) and negative otherwise (decision threshold \(\theta \in [0, 1]\)). \(\overline{c}\) is the arithmetic mean of the class labels.

 

Probabilistic Error/Loss Measures (an ‘x’ in an equation number means “not a proper binary-classification instrument; excluded from PToPI”)

 

\(LogLoss=-\frac{1}{Sn}\sum_{i=1}^{Sn}\left[{c}_{i}{\mathrm{log}}_{2}{p}_{i}+\left(1-{c}_{i}\right){\mathrm{log}}_{2}\left(1-{p}_{i}\right)\right]\)

(B.p1)
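A sketch of Eq. B.p1 together with the basic error summaries (ours; the appendix writes LogLoss with base-2 logarithms, and the score clipping is our addition to avoid log2(0)):

```python
import numpy as np

def log_loss(c, p, eps=1e-15):
    """LogLoss (B.p1), base-2 as written in the appendix; clipping p avoids log2(0)."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(c * np.log2(p) + (1 - c) * np.log2(1 - p))

c = np.array([1, 0, 1, 1, 0], dtype=float)   # ground-truth labels c_i
p = np.array([0.9, 0.2, 0.6, 0.8, 0.1])      # predicted scores p_i
e = c - p                                    # e_i (B.pi)
print(log_loss(c, p))
print(np.mean(e ** 2), np.sqrt(np.mean(e ** 2)), np.mean(np.abs(e)))   # MSE, RMSE, MAE
```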

(Relative absolute/squared error measures)

 

\(MRAE=\underset{i=1, \ldots ,{ Sn}}{\mathrm{mean}}\left|{rel\_e}_{i}\right|\)

(B.p2.1)

\(MdRAE=\underset{i=1, \ldots ,{ Sn}}{\mathrm{median}}\left|{rel\_e}_{i}\right|\)

(B.p2.2)

\(GMRAE=\underset{i=1, \ldots ,{ Sn}}{\mathrm{geomean}}\left|{rel\_e}_{i}\right|\)

(B.p2.3)

\(RAE=\underset{i=1, \ldots ,{ Sn}}{\mathrm{sum}}\left|{rel\_e}_{i}\right|\)

(B.p2.3x)

\(RSE=\underset{i=1, \ldots ,{ Sn}}{\mathrm{sum}}{{rel\_e}_{i}}^{2}\)

(B.p2.4x)

(Squared error measures,

continued from MSE, RMSE, and MdSE squared error metrics below)

 

\(SSE=\underset{i=1, \ldots ,{ Sn}}{\mathrm{sum}}{{e}_{i}}^{2}\)

(B.p1.4x)

\(nMSE \;\text{v1}=\underset{i=1, \ldots ,{ Sn}}{\mathrm{mean}}\frac{{{e}_{i}}^{2}}{\overline{c }\cdot \overline{p} }\)

(B.p1.5x.1)

\(nMSE \;\text{v2}=\underset{i=1, \ldots ,{ Sn}}{\mathrm{mean}}\frac{{{e}_{i}}^{2}}{\mathrm{var}(c)}\)

(B.p1.5x.2)

\(nMSE \;\text{v3}=\underset{i=1, \ldots ,{ Sn}}{\mathrm{mean}}\frac{{{e}_{i}}^{2}}{\underset{j=1, \ldots ,{ Sn}}{\mathrm{mean}}{{\Delta c}_{j}}^{2}}\)

(B.p1.5x.3)

\(nMSE \;\text{v4}=\underset{i=1, \ldots ,{ Sn}}{\mathrm{mean}}\frac{{{e}_{i}}^{2}}{\underset{j=1, \ldots ,{ Sn}}{\mathrm{mean}}{{c}_{j}}^{2}}\)

(B.p1.5x.4)

\(nMSE \;\text{v5}=\underset{i=1, \ldots ,{ Sn}}{\mathrm{mean}}\frac{{{e}_{i}}^{2}}{{c}_{i}\cdot {p}_{i}}\)

(B.p1.5x.5)

Probabilistic Error Metrics

 

\(ME=\underset{i=1, \ldots ,{ Sn}}{\mathrm{mean}}{e}_{i}\)

(B.p0x)

(Squared error metrics)

 

\(MSE=\underset{i=1, \ldots ,{ Sn}}{\mathrm{mean}}{{e}_{i}}^{2}\)

(B.p1)

\(RMSE=\sqrt{\underset{i=1, \ldots ,{ Sn}}{\mathrm{mean}}{{e}_{i}}^{2}}\)

(B.p1.1)

(Absolute error metrics)

 

\(MAE=\underset{i=1, \ldots ,{ Sn}}{\mathrm{mean}}\left|{e}_{i}\right|\)

(B.p2.1)

\(MdAE=\underset{i=1, \ldots ,{ Sn}}{\mathrm{median}}\left|{e}_{i}\right|\)

(B.p2.2)

\(MxAE=\underset{i=1, \ldots ,{ Sn}}{\mathrm{max}}\left|{e}_{i}\right|\)

(B.p2.3)

\(GMAE=\underset{i=1, \ldots ,{ Sn}}{\mathrm{geomean}}\left|{e}_{i}\right|\)

(B.p2.4x)

(Percentage error metrics)

 

\(MPE=\underset{i=1, \ldots ,{ Sn}}{\mathrm{mean}}\%{e}_{i}\)

(B.p4.1x)

\(MAPE=\underset{i=1, \ldots ,{ Sn}}{\mathrm{mean}}\left|\%{e}_{i}\right|\)

(B.p4.2x)

\(MdAPE=\underset{i=1, \ldots ,{ Sn}}{\mathrm{median}}\left|\%{e}_{i}\right|\)

(B.p4.3x)

\(RMSPE=\sqrt{\underset{i=1, \ldots ,{ Sn}}{\mathrm{mean}}\%{{e}_{i}}^{2}}\)

(B.p4.4x)

\(RMdSPE=\sqrt{\underset{i=1, \ldots ,{ Sn}}{\mathrm{median}}\%{{e}_{i}}^{2}}\)

(B.p4.5x)

(Symmetric percentage error metrics)

 

\(sMAPE=\underset{i=1, \ldots ,{ Sn}}{\mathrm{mean}}\left|sym\_\mathrm{\%}{e}_{i}\right|\)

(B.p3.0x)

\(nsMAPE=\underset{i=1, \ldots ,{ Sn}}{\mathrm{mean}}\left|\frac{sym\_\mathrm{\%}{e}_{i}}{2}\right|\)

(B.p3)

\(nsMdAPE=\underset{i=1, \ldots ,{ Sn}}{\mathrm{median}}\left|\frac{sym\_\mathrm{\%}{e}_{i}}{2}\right|\)

(B.p3.1x)

Probabilistic Error Metrics (Absolute scaled errors for time-series forecasting, not applicable for binary classification)

 

\(MASE=\underset{i=1, \ldots ,{ Sn}}{\mathrm{mean}}sca\_{e}_{i}\)

(B.px.1)

\(MdASE=\underset{i=1, \ldots ,{ Sn}}{\mathrm{median}}sca\_{e}_{i}\)

(B.px.2)

\(RMSSE=\sqrt{\underset{i=1, \ldots ,{ Sn}}{\mathrm{mean}}sca\_{{e}_{i}}^{2}}\)

(B.px.3)

Inter/intra-model complexity criteria based on probabilistic error metrics (k: number of model parameters)

 

\(AIC=2k-2\,\mathrm{ln}\,MSE\)

(B.i)

\(BIC=k\,\mathrm{ln}\,Sn-2\,\mathrm{ln}\,MSE\)

(B.ii)
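Transcribing Eqs. B.i and B.ii literally (note that other texts state these criteria via the maximized likelihood or as Sn·ln(MSE) + 2k; the sketch below follows the appendix's form only):

```python
import math

def aic(mse, k):
    """AIC exactly as written in Eq. B.i: 2k - 2 ln(MSE)."""
    return 2 * k - 2 * math.log(mse)

def bic(mse, k, Sn):
    """BIC exactly as written in Eq. B.ii: k ln(Sn) - 2 ln(MSE)."""
    return k * math.log(Sn) - 2 * math.log(mse)

print(aic(0.04, k=3), bic(0.04, k=3, Sn=200))
```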

Appendix 3: (Online) PToPI: Periodic Table of Performance Instruments (Full View)

The proposed binary-classification performance instruments exploratory table for a total of 57 performance instruments is provided online at https://github.com/gurol/ptopi as two files: the PToPI.xlsx spreadsheet and the ‘Fig. C.1.png’ high-resolution image. The full view (Fig. C.1) presents all the information such as canonical or high-level dependency equations. See the legend in Table 6 for the design elements used in PToPI.

Appendix 4: Case Study (Performance Evaluation in Android Mobile-Malware Classification) Selection Methodology

The case study described in “Case study: performance evaluation in android mobile-malware classification” surveys 78 academic studies on Android malware classification published between 2012 and 2018. The references are given in online Table E.1. In addition to 35 symposium, conference, and journal articles that we had already reviewed, 43 articles were included using the following methodology:

  1. Selecting relevant journal articles by searching the IEEE academic database (on 27 March 2018) for the query "((Android and malware) and (accuracy or precision or "True Positive" or "False Positive") and (Classification OR Detection))" in the articles’ title, abstract, or body.

  2. Selecting relevant conference/journal articles by searching Google Scholar (in May 2018) with the same keywords and reviewing the first ten related articles per year from 2012 to 2018, excluding patents.

Among the relevant surveyed studies, all articles were included in the performance-evaluation terminology findings where available. For the other statistics, only the related studies were included, as specified in Appendix 5.

Appendix 5: (Online) References of the Surveyed Studies and the Detailed Results of the Case Study in “Case study: performance evaluation in android mobile-malware classification”.

The detailed data and results are provided online at https://doi.org/10.17632/5c442vbjzg.3 via the Mendeley Data platform. In addition, the online Table E.1 (AppendixE_Table_E1.pdf at https://www.github.com/gurol/ptopi) lists the references of the surveyed studies selected using the methodology described in Appendix 4 above.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Canbek, G., Taskaya Temizel, T. & Sagiroglu, S. PToPI: A Comprehensive Review, Analysis, and Knowledge Representation of Binary Classification Performance Measures/Metrics. SN COMPUT. SCI. 4, 13 (2023). https://doi.org/10.1007/s42979-022-01409-1
