A two-tiered 2D visual tool for assessing classifier performance
Introduction
Many proposals have been made for measuring the performance of classification tasks (see [15], [23], [26] for extensive information on this matter). There are also several graphical representations and tools for model evaluation. In particular, ROC curves [3], [13], [25] are a 2D visual tool widely acknowledged as the default choice for assessing the intrinsic behavior of a classifier. Various aspects of ROC curves have been extensively studied by the machine learning community, including: (i) isometrics of relevant measures [16], [27], [28], (ii) cost curves [9], [10], [11], and (iii) confidence bands [22]. Several proposals have been made with the aim of extending the descriptive power of ROC curves, including those on detection error tradeoff [24], on instance-varying costs [14], on adapting ROC curves for regression tasks [20], on the relationship between ROC analysis and Bayesian models [6], and on explicitly representing the imbalance (e.g., coverage plots [17]). The area under the curve has been actively investigated as well (see, for instance, [4], [5], [19], [21]), for it is considered a good estimate of the discriminant power of a classifier. A further category of measures is aimed at checking the propensity of a classifier to label inputs as belonging to the positive or negative class (bias) and the extent to which the training set affects performance (variance). See also [8] for more information on bias and variance.
The vast majority of performance measures are affected by the imbalance between positive and negative samples. This concept can be quantified using the ratio between the number of negative and positive samples (i.e., the class ratio). If one wants to assess the intrinsic properties of a classifier, adopting a measure that depends on the class ratio may not be a reliable choice. In fact, although many practical problems are typically unbalanced, most of the existing performance measures (e.g., accuracy) become progressively meaningless as the class ratio increases or decreases (e.g., [12], [18]). This issue may be worsened by a lack of statistical significance of experimental results, which may occur for the minority class in the test set. While no practical solution exists to counter the latter issue, the former is typically dealt with by adopting a pair of measures (e.g., precision and recall, or specificity and sensitivity). ROC diagrams also follow this approach, the default choice being false positive rate (i.e., 1 − specificity) on the x axis and true positive rate (i.e., sensitivity) on the y axis.
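The dependence of accuracy on the class ratio can be illustrated with a minimal sketch. Here a classifier is summarized only by fixed class-conditional rates (sensitivity and specificity, which are class-ratio independent); as the proportion of negatives grows, accuracy drifts toward the specificity, regardless of how well positives are handled. The numbers below are illustrative, not taken from the article.

```python
# Minimal sketch: accuracy depends on the class ratio, whereas the
# class-conditional measures (sensitivity, specificity) it is built
# from do not. Illustrative rates, not from the article.
sensitivity, specificity = 0.90, 0.70

def accuracy(class_ratio, p=1000):
    """Accuracy on a test set with P positives and N = class_ratio * P negatives."""
    n = class_ratio * p
    tp = sensitivity * p   # correctly classified positives
    tn = specificity * n   # correctly classified negatives
    return (tp + tn) / (p + n)

for ratio in (1, 10, 100):
    print(f"N/P = {ratio:>3}: accuracy = {accuracy(ratio):.3f}")
# As N/P grows, accuracy converges to the specificity (0.70),
# and the classifier's behavior on positives becomes invisible.
```

This is precisely why a single class-ratio-dependent number is a poor descriptor of intrinsic performance, and why pairs of class-conditional measures are preferred.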
In this article, two measures (i.e., φ and δ) are proposed, which make it possible to assess the performance of classifiers from a bias vs. accuracy perspective. These measures are framed in two different kinds of 2D visual tools, i.e., standard and generalized ⟨φ, δ⟩ diagrams. The primary goal of the former is to highlight the intrinsic performance of a classifier, regardless of the imbalance of the dataset at hand, whereas the latter have been devised to investigate how the underlying statistics of the data affect the behavior of a classifier. To these ends, standard ⟨φ, δ⟩ diagrams rely on measures that are not affected by the class ratio, whereas generalized ones also account for the class ratio. In particular, in a standard scenario φ and δ give information about unbiased bias and accuracy, whereas in a generalized scenario φ and δ give information about biased (i.e., actual) bias and accuracy. As the terms “unbiased” and “biased” do not belong to the classical jargon, a full section will be devoted to illustrating the corresponding concepts. For now, let us concentrate on Table 1, which gives a sort of “roadmap” aimed at shedding light on the most relevant aspects of ⟨φ, δ⟩ measures and diagrams. We are confident of its usefulness to the interested reader.
The remainder of this article is organized as follows: Section 2 points out the existence of unbiased and biased spaces, devoted to highlighting intrinsic and actual properties of classifiers. Section 3 introduces φ and δ, first as measures and then as diagrams. This section also covers details concerning mutual information, break-even points and relevant isometrics. Section 4 generalizes ⟨φ, δ⟩ diagrams, allowing the class ratio to be explicitly represented. Section 5 illustrates experimental settings, and Section 6 summarizes some relevant use cases in which ⟨φ, δ⟩ diagrams are put into practice. In particular, two main scenarios are described therein: classifier assessment and feature ranking (the fact that ⟨φ, δ⟩ measures can also be used to assess features should not be surprising, as a feature can always be considered a simple kind of classifier in itself). Section 7 points out the strengths and weaknesses of this proposal, and Section 8 draws conclusions.
The introductory part of this work partially overlaps the one published in [1]. However, this was purposeful, as this article is intended to become a sort of reference for researchers who decide to adopt ⟨φ, δ⟩ diagrams for measuring the performance of binary classifiers or for performing feature importance analysis. Any other material, including (i) the semantics of the φ and δ axes for both unbiased and biased cases, (ii) a complete study on the relation that holds between ⟨φ, δ⟩ measures and mutual information, as well as (iii) isometrics of the most acknowledged performance measures (i.e., accuracy, precision, negative predictive value, sensitivity and specificity), is entirely unpublished. Notably, all details about the way relevant equations have been derived can be found in the supplementary material, i.e., in Appendix A and Appendix B. The former is devoted to standard ⟨φ, δ⟩ measures, the latter to generalized ones.
Not least of all, the reader should also be aware that this is a methodological article, aimed at illustrating and analyzing new measure spaces able to highlight at a glance some classifier or feature properties deemed relevant by the machine learning community. Notwithstanding this perspective, care has been taken to provide a full experimental section, with the goal of giving researchers a flavor of the inherent potential of ⟨φ, δ⟩ diagrams.
Unbiased and biased spaces for classifier assessment
Let Ξc(P, N) be the confusion matrix of a test run in which a classifier trained on a class c is fed with P positive samples and N negative samples, with a total of M samples. In particular, ξ00, ξ01, ξ10, and ξ11 represent true negatives (TN), false positives (FP), false negatives (FN), and true positives (TP), respectively. Assuming statistical significance, the confusion matrix, possibly averaged over multiple tests, is expected to give reliable information on the performance of the
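The bookkeeping described above can be sketched as follows; the helper function and the toy labels are illustrative, not part of the article. The indexing mirrors the definitions given for Ξc: ξij counts samples with true class i and predicted class j, so ξ00 = TN, ξ01 = FP, ξ10 = FN, and ξ11 = TP.

```python
# Sketch of the confusion-matrix entries defined in the text:
# xi[i][j] counts samples with true class i and predicted class j
# (0 = negative, 1 = positive), i.e.
# xi[0][0] = TN, xi[0][1] = FP, xi[1][0] = FN, xi[1][1] = TP.

def confusion_matrix(y_true, y_pred):
    xi = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        xi[t][p] += 1
    return xi

y_true = [1, 1, 1, 0, 0, 0, 0]   # P = 3 positives, N = 4 negatives
y_pred = [1, 1, 0, 0, 0, 1, 0]
xi = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = xi[0][0], xi[0][1], xi[1][0], xi[1][1]
print(tn, fp, fn, tp)   # → 3 1 1 2
```

Averaging such matrices over multiple test runs, as the text suggests, simply amounts to averaging the four entries element-wise.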
Standard ⟨φ, δ⟩ diagrams
In a classifier assessment scenario, a researcher is typically interested in understanding to what extent a classifier is able to approximate the oracle and whether it is biased towards the positive or the negative class. In a feature assessment scenario, a researcher is typically interested in assessing to what extent a feature is covariant or contravariant with the positive class, and whether it is characteristic or not for the dataset at hand. As pointed out, the solution proposed in this
Generalized ⟨φ, δ⟩ diagrams
In this section, a generalization of ⟨φ, δ⟩ diagrams is proposed, the corresponding space being named ⟨φb, δb⟩. Notably, all relevant properties identified for ⟨φ, δ⟩ diagrams are maintained in the new space, according to a specific design choice. In particular, it is shown (i) that a value measured on the δb axis corresponds to the actual accuracy (suitably remapped), (ii) that a value measured on the φb axis gives (an estimate of) the actual bias, and (iii) that the δb axis is the locus of
Experimental settings
We decided to focus mainly on real-world datasets, as they are typically more effective than artificial ones at highlighting the characteristics of a method. However, as the majority of real-world datasets contain non-binary features, we had to devise a way of dealing with nominal and floating-point features. To provide essential information on this aspect, let us briefly describe the solutions adopted to make these kinds of features compatible with ⟨φ, δ⟩ diagrams.
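The article does not spell out its binarization scheme in this snippet, so the following is only a plausible sketch under stated assumptions: nominal features are expanded into one binary indicator per category (one-vs-rest), and floating-point features are thresholded at their median. Both choices are hypothetical illustrations, not the authors' actual procedure.

```python
# Hypothetical binarization sketch (the article's actual scheme is
# not given in this snippet; these are common, assumed choices).
from statistics import median

def binarize_nominal(values):
    """One-vs-rest: one 0/1 indicator list per observed category."""
    return {c: [1 if v == c else 0 for v in values]
            for c in sorted(set(values))}

def binarize_numeric(values):
    """Threshold at the median: 1 where the value exceeds it."""
    m = median(values)
    return [1 if v > m else 0 for v in values]

print(binarize_nominal(["red", "blue", "red"]))
# → {'blue': [0, 1, 0], 'red': [1, 0, 1]}
print(binarize_numeric([0.1, 0.9, 0.4, 0.7]))
# → [0, 1, 0, 1]
```

Each resulting binary column can then be treated as a simple classifier in its own right, which is what makes it directly representable in a ⟨φ, δ⟩ diagram.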
Experiments with ⟨φ, δ⟩ diagrams
This section reports some relevant use cases aimed at illustrating the expressiveness of ⟨φ, δ⟩ diagrams. All datasets used for experiments are publicly available on well-known machine learning websites. Two separate subsections follow: one focused on classifier
Strengths and weaknesses of this proposal
⟨φ, δ⟩ diagrams come in two different forms. Similarly to what happens for ROC curves, standard ⟨φ, δ⟩ diagrams are aimed at representing the phenomenon under investigation in a space where the class ratio is not taken into account. Framed along the same perspective of coverage plots, generalized ⟨φ, δ⟩ diagrams are able to visualize relevant information in a space that accounts also for the class ratio.
In either form, ⟨φ, δ⟩ diagrams allow a researcher to see at a glance bias and accuracy,
Conclusions and future work
In this article, two measures have been proposed, i.e., φ and δ, which have been framed in both unbiased and biased spaces. For each space, a corresponding 2D visual environment has been devised and implemented, aimed at facilitating the task of assessing the properties of binary classifiers and of performing feature importance analysis. Isometrics and loci of points have been studied first in the standard (unbiased) space and then in the generalized (biased) space. By construction, relevant
Acknowledgments
This research work has been supported by LR7 2007 grant number: F71J11000590002 (Investment Funds for Basic Research) and by PIA 2010 grant number: 1492-118/2013 (Integrated Subsidized Packages), both funded by the local government of Sardinia. Further support to this work has been given by the DAAD-MIUR Joint Mobility Program, year 2015–16. The authors wish to thank Lorenza Saitta, Dominik Heider and Ursula Neumann for their support in discussing and developing the ideas reported in this
References

A direct measure of discriminant and characteristic capability for classifier building and assessment, Inf. Sci., 2015.
The use of the area under the ROC curve in the evaluation of machine learning algorithms, Pattern Recognit., 1997.
An introduction to ROC analysis, Pattern Recognit. Lett. (special issue: ROC analysis in pattern recognition), 2006.
A systematic analysis of performance measures for classification tasks, Inf. Process. Manag., 2009.
Analysis of a random forests model, J. Mach. Learn. Res., 2012.
Efficient AUC optimization for classification, Proceedings of the Eleventh European Conference on Principles and Practice of Knowledge Discovery in Databases, 2007.
AUC optimization vs. error rate minimization, Proceedings of the NIPS, 2003.
Optimal ROC-based classification and performance analysis under Bayesian uncertainty models, IEEE/ACM Trans. Comput. Biol. Bioinf., 2016.
Orange: data mining toolbox in Python, J. Mach. Learn. Res., 2013.
A unified bias-variance decomposition for zero-one and squared loss, Proceedings of the Seventeenth National Conference on Artificial Intelligence (AAAI '00), 2000.
Explicitly representing expected cost: an alternative to ROC representation, Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining (KDD '00).
What ROC curves cannot do (and cost curves can), Proceedings of the ROCAI.
Cost curves: an improved method for visualizing classifier performance, Mach. Learn.
A framework for comparative evaluation of classifiers in the presence of class imbalance, 3rd Int. Workshop on ROC Analysis in Machine Learning (ROCML-2006).