Elsevier

Information Sciences

Volumes 463–464, October 2018, Pages 323-343

A two-tiered 2D visual tool for assessing classifier performance

https://doi.org/10.1016/j.ins.2018.06.052

Highlights

  • A new kind of diagram, called the phi-delta diagram, is proposed.

  • These diagrams are mainly aimed at supporting classifier building and assessment.

  • Accuracy, bias, and further relevant information are immediately evident therein.

  • Two kinds of diagrams have been devised: one unaffected by class imbalance, the other accounting for it.

  • The former are an alternative to ROC curves, the latter to coverage plots.

Abstract

In this article, a new kind of 2D tool is proposed, namely ⟨φ, δ⟩ diagrams, able to highlight most of the information deemed relevant for classifier building and assessment. In particular, accuracy, bias and break-even points are immediately evident therein. These diagrams come in two different forms: the first is aimed at representing the phenomenon under investigation in a space where the imbalance between negative and positive samples is not taken into account; the second (a generalization of the first) visualizes the relevant information in a space that also accounts for the imbalance. According to a specific design choice, all properties found in the first space hold in the second as well. The combined use of φ and δ can give important information to researchers involved in building intelligent systems, in particular for classifier performance assessment and feature ranking/selection.

Introduction

Many proposals have been made for measuring the performance of classification tasks (see [15], [23], [26] for extensive information on this matter). There are also several graphical representations and tools for model evaluation. In particular, ROC curves [3], [13], [25] are a 2D visual tool widely acknowledged as the default choice for assessing the intrinsic behavior of a classifier. Various aspects of ROC curves have been extensively studied by the machine learning community, including: (i) isometrics of relevant measures [16], [27], [28], (ii) cost curves [9], [10], [11], and (iii) confidence bands [22]. Several proposals have been made with the aim of extending the descriptive power of ROC curves, including those on detection error tradeoff [24], on instance-varying costs [14], on adapting ROC curves for regression tasks [20], on the relationship between ROC analysis and Bayesian models [6], and on explicitly representing the imbalance (e.g., coverage plots [17]). The area under the curve has been actively investigated as well (see, for instance, [4], [5], [19], [21]), for it is considered a good estimate of the discriminant power of a classifier. A further category of measures is aimed at checking the propensity to classify inputs as belonging to the positive or negative class (bias) and to what extent the training set affects performance (variance). See also [8] for more information on bias and variance.

The vast majority of performance measures are affected by the imbalance between positive and negative samples. This imbalance can be expressed numerically as the ratio between the number of negative and positive samples (i.e., the class ratio). If one wants to assess the intrinsic properties of a classifier, adopting a measure that accounts for the class ratio may not be a reliable choice. In fact, although many practical problems are typically unbalanced, most existing performance measures (e.g., accuracy) become progressively less meaningful as the class ratio increases or decreases (e.g., [12], [18]). This problem may be worsened by a lack of statistical significance in experimental results, which may occur for minority-class test samples. While no practical solution exists to counter the latter issue, the former is typically dealt with by adopting a pair of measures (e.g., precision and recall, or specificity and sensitivity). ROC diagrams also follow this approach, the default choice being the false positive rate (i.e., 1 − specificity) on the x axis and the true positive rate (i.e., sensitivity) on the y axis.
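To make this effect concrete, the following sketch (with hypothetical counts and a hypothetical `confusion_rates` helper, not taken from the article) shows how the accuracy of a classifier with fixed sensitivity and specificity drifts toward its specificity as the class ratio N/P grows, even though the classifier's intrinsic behavior never changes:

```python
def confusion_rates(tn, fp, fn, tp):
    """Return (accuracy, sensitivity, specificity) from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sens = tp / (tp + fn)  # true positive rate
    spec = tn / (tn + fp)  # true negative rate
    return acc, sens, spec

# A classifier with fixed intrinsic behavior: sensitivity 0.70, specificity 0.90.
sens, spec = 0.70, 0.90
for class_ratio in (1, 10, 100):  # class_ratio = N / P
    P, N = 1000, 1000 * class_ratio
    tp, fn = sens * P, (1 - sens) * P
    tn, fp = spec * N, (1 - spec) * N
    acc, _, _ = confusion_rates(tn, fp, fn, tp)
    print(f"N/P = {class_ratio:3d}  accuracy = {acc:.3f}")
```

With a balanced test set the accuracy is 0.800; at N/P = 100 it approaches 0.900, i.e., the specificity alone, which is why a single class-ratio-dependent measure cannot characterize the classifier.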

In this article, two measures (i.e., φ and δ) are proposed, which make it possible to assess the performance of classifiers from a bias vs. accuracy perspective. These measures are framed in two different kinds of 2D visual tools, i.e., standard and generalized ⟨φ, δ⟩ diagrams. The primary goal of the former is to highlight the intrinsic performance of a classifier, regardless of the imbalance of the dataset at hand, whereas the latter have been devised to investigate how the underlying statistics of the data affect the behavior of a classifier.1 To these ends, standard ⟨φ, δ⟩ diagrams rely on measures that are not affected by the class ratio, whereas generalized ones account for the class ratio as well. In particular, in a standard scenario φ and δ give information about unbiased bias and accuracy, whereas in a generalized scenario they give information about biased (i.e., actual) bias and accuracy. As the terms “unbiased” and “biased” do not belong to the classical jargon, a full section will be devoted to illustrating the corresponding concepts. For now, let us concentrate on Table 1, which gives a sort of “roadmap” shedding light on the most relevant aspects of ⟨φ, δ⟩ measures and diagrams. We are confident of its usefulness to the interested reader.
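The formal definitions of φ and δ are given later in the article; purely as a hedged sketch consistent with the description above (δ as a class-ratio-independent accuracy remapped to [−1, +1], φ as the bias toward one class), one candidate formulation in terms of sensitivity and specificity might read as follows. The formulas here are an assumption for illustration, not the article's authoritative definitions:

```python
def phi_delta(tp, fn, tn, fp):
    """Sketch of standard <phi, delta> measures (assumed formulation).

    Assumes delta = sensitivity + specificity - 1 (unbiased accuracy
    remapped to [-1, +1]) and phi = sensitivity - specificity (unbiased
    bias). The article's own definitions should be taken as authoritative.
    """
    sens = tp / (tp + fn)  # true positive rate
    spec = tn / (tn + fp)  # true negative rate
    return sens - spec, sens + spec - 1.0  # (phi, delta)

# An oracle-like classifier lands at (phi, delta) = (0, +1);
# a classifier that always answers "positive" lands at (+1, 0).
```

Note that any such formulation built only on sensitivity and specificity is, by construction, unaffected by the class ratio, which is exactly the property required of the standard diagrams.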

The remainder of this article is organized as follows: Section 2 points out the existence of unbiased and biased spaces, devoted to highlighting intrinsic and actual properties of classifiers. Section 3 introduces φ and δ, first as measures and then as diagrams. This section also covers details concerning mutual information, break-even points and relevant isometrics. Section 4 generalizes ⟨φ, δ⟩ diagrams, allowing the class ratio to be explicitly represented. Section 5 illustrates the experimental settings, and Section 6 summarizes some relevant use cases in which ⟨φ, δ⟩ diagrams are put into practice. In particular, two main scenarios are described therein: classifier assessment and feature ranking (the fact that ⟨φ, δ⟩ measures can also be used to assess features should not be surprising, as a feature can always be considered a simple kind of classifier in itself). Section 7 points out the strengths and weaknesses of this proposal, and Section 8 draws conclusions.

The introductory part of this work partially overlaps with the one published in [1]. However, this is purposeful, as this article is intended to become a sort of reference for all researchers who decide to adopt ⟨φ, δ⟩ diagrams for measuring the performance of binary classifiers or for performing feature importance analysis. All other material, including (i) the semantics of the φ and δ axes for both the unbiased and biased cases, (ii) a complete study of the relation that holds between ⟨φ, δ⟩ measures and mutual information, and (iii) isometrics of the most acknowledged performance measures (i.e., accuracy, precision, negative predictive value, sensitivity and specificity), is entirely unpublished. Notably, all details about the way the relevant equations have been derived can be found in the supplementary material, i.e., in Appendix A and Appendix B. The former is devoted to standard ⟨φ, δ⟩ measures, the latter to generalized ones.

Finally, the reader should also be aware that this is a methodological article, aimed at illustrating and analyzing new measure spaces able to highlight at a glance classifier or feature properties deemed relevant by the machine learning community. Notwithstanding this perspective, care has been taken to provide a full experimental section, with the goal of giving researchers a flavor of the inherent potential of ⟨φ, δ⟩ diagrams.

Section snippets

Unbiased and biased spaces for classifier assessment

Let Ξc(P, N) be the confusion matrix of a test run in which a classifier ĉ trained on a class c is fed with P positive samples and N negative samples, for a total of M samples. In particular, ξ00, ξ01, ξ10, and ξ11 represent true negatives (TN), false positives (FP), false negatives (FN), and true positives (TP), respectively. Assuming statistical significance, the confusion matrix, possibly averaged over multiple tests, is expected to give reliable information on the performance of the
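The index convention for Ξc(P, N) can be sketched in code as follows (a minimal illustration with hypothetical example labels; 0 denotes the negative class and 1 the positive class):

```python
import numpy as np

def confusion_matrix(y_true, y_pred):
    """Build Xi_c(P, N) with the index convention used above:
    xi[0,0] = TN, xi[0,1] = FP, xi[1,0] = FN, xi[1,1] = TP
    (row = actual class, column = predicted class)."""
    xi = np.zeros((2, 2), dtype=int)
    for t, p in zip(y_true, y_pred):
        xi[t, p] += 1
    return xi

# Hypothetical test run: 3 negative and 3 positive samples.
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 0]
xi = confusion_matrix(y_true, y_pred)
P, N = xi[1].sum(), xi[0].sum()  # positives, negatives; M = P + N
```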

Standard ⟨φ, δ⟩ diagrams

In a classifier assessment scenario, a researcher is typically interested in understanding to what extent a classifier is able to approximate the oracle and whether it is biased towards the positive or the negative class. In a feature assessment scenario, a researcher is typically interested in assessing to what extent a feature is covariant or contravariant with the positive class, and whether it is characteristic or not for the dataset at hand. As pointed out, the solution proposed in this

Generalized ⟨φ, δ⟩ diagrams

In this section, a generalization of ⟨φ, δ⟩ diagrams is proposed, the corresponding space being named ⟨φb, δb⟩. Notably, all relevant properties identified for ⟨φ, δ⟩ diagrams are maintained in the new space, according to a specific design choice. In particular, it is shown (i) that a value measured on the δb axis corresponds to the actual accuracy remapped to [−1, +1], (ii) that a value measured on the φb axis gives (an estimate of) the actual bias, and (iii) that the δb axis is the locus of
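Property (i) above admits a simple sketch. Assuming the natural affine remapping of accuracy onto [−1, +1] (an assumption; the article's own derivation in Appendix B is authoritative), a reading on the δb axis would be:

```python
def delta_b_from_accuracy(tp, fn, tn, fp):
    """Sketch of the generalized delta_b axis reading (assumed remapping).

    Per property (i), a value on the delta_b axis is the actual
    (class-ratio-dependent) accuracy remapped to [-1, +1]; the natural
    affine remapping is delta_b = 2 * accuracy - 1.
    """
    acc = (tp + tn) / (tp + tn + fp + fn)
    return 2.0 * acc - 1.0

# For a balanced test set (P = N), 2*acc - 1 reduces to
# sensitivity + specificity - 1, i.e., it no longer depends on the class ratio.
```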

Experimental settings

We decided to focus mainly on real-world datasets, as they are typically more effective than artificial ones at highlighting the characteristics of a method. However, as the majority of real-world datasets contain non-binary features, we had to devise a way to deal with nominal and floating-point features. To provide essential information on this aspect, let us briefly describe the solutions adopted to make these kinds of features compatible with ⟨φ, δ⟩ diagrams.
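The article's actual discretization choices are detailed in this section; purely to illustrate the kind of preprocessing involved (the thresholding and encoding below are assumptions, not the article's procedure), a minimal sketch might binarize the two feature kinds like this:

```python
import numpy as np

def binarize_numeric(values, threshold=None):
    """Turn a floating-point feature into a binary one by thresholding.

    Splits at the median (or a caller-supplied threshold) as one
    plausible choice; the article's own discretization is authoritative.
    """
    values = np.asarray(values, dtype=float)
    if threshold is None:
        threshold = np.median(values)
    return (values > threshold).astype(int)

def binarize_nominal(values, category):
    """Turn a nominal feature into a binary one via one-vs-rest encoding."""
    return np.asarray([1 if v == category else 0 for v in values])
```

Once every feature is binary, each one can be read as a tiny classifier and plotted directly on a ⟨φ, δ⟩ diagram, which is what makes the feature-ranking use case possible.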

Experiments with ⟨φ, δ⟩ diagrams

This section reports some relevant use cases aimed at illustrating the expressiveness of ⟨φ, δ⟩ diagrams. All datasets used for experiments are publicly available on well-known machine learning web sites.5 Two separate subsections follow: one focused on classifier

Strengths and weaknesses of this proposal

⟨φ, δ⟩ diagrams come in two different forms. Similarly to ROC curves, standard ⟨φ, δ⟩ diagrams are aimed at representing the phenomenon under investigation in a space where the class ratio is not taken into account. Framed in the same perspective as coverage plots, generalized ⟨φ, δ⟩ diagrams are able to visualize relevant information in a space that also accounts for the class ratio.

In either form, ⟨φ, δ⟩ diagrams allow a researcher to see at a glance bias and accuracy,

Conclusions and future work

In this article, two measures have been proposed, i.e., φ and δ, which have been framed in both unbiased and biased spaces. For each space, a corresponding 2D visual environment has been devised and implemented, aimed at facilitating the task of assessing the properties of binary classifiers and of performing feature importance analysis. Isometrics and loci of points have been studied first in the standard (unbiased) space and then in the generalized (biased) space. By construction, relevant

Acknowledgments

This research work has been supported by LR7 2007 grant number: F71J11000590002 (Investment Funds for Basic Research) and by PIA 2010 grant number: 1492-118/2013 (Integrated Subsidized Packages), both funded by the local government of Sardinia. Further support to this work has been given by the DAAD-MIUR Joint Mobility Program, year 2015–16. The authors wish to thank Lorenza Saitta, Dominik Heider and Ursula Neumann for their support in discussing and developing the ideas reported in this

References (29)

  • C. Drummond et al., Explicitly representing expected cost: an alternative to ROC representation, in: Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining (KDD’00), 2000.

  • C. Drummond et al., What ROC curves cannot do (and cost curves can), in: Proceedings of the ROCAI, 2004.

  • C. Drummond et al., Cost curves: an improved method for visualizing classifier performance, Mach. Learn., 2006.

  • W. Elazmeh et al., A framework for comparative evaluation of classifiers in the presence of class imbalance, in: 3rd Int. Workshop on ROC Analysis in Machine Learning (ROCML-2006), 2006.