Pattern Recognition Letters

Volume 125, 1 July 2019, Pages 687-693

How reliable is your reliability diagram?

https://doi.org/10.1016/j.patrec.2019.07.012

Highlights

  • A standardized reliability diagram is proposed for assessing class probabilities.

  • Reliability diagrams are undesirably sensitive to the number of outcomes in each bin.

  • An effective approach for choosing an appropriate number of bins is proposed.

  • Simulation and example results demonstrate the effectiveness of the proposed methods.

Abstract

It is often necessary to evaluate probabilistic classifiers in terms of the quality of their class probability estimates. A popular tool for assessing class probabilities is the reliability diagram, which is based on data binning. While the reliability diagram is visually appealing, it is difficult to determine statistically whether the probabilities are reliable. In this paper, we propose a standardized reliability diagram for assessing a binary probabilistic classifier. The proposed method transforms the Poisson binomial distribution of each bin to an approximate normal distribution. The results of the method provide valuable inferences beyond those of the (unscaled) reliability diagram. Moreover, we show that the assessment results may be undesirably dependent on the sample size in each bin. As a remedy, we also introduce an approach that chooses an appropriate number of bins so that test results are relatively consistent regardless of the sample size. Simulation and example results demonstrate the effectiveness of the proposed approaches.

Introduction

In supervised learning, a probabilistic classifier produces class probability estimates for each instance. Class probabilities are useful for estimating the costs of classification decisions, so obtaining well-calibrated class probabilities is important in many data mining and pattern recognition applications.

To assess whether the probability estimates are well-calibrated, many approaches have been introduced, including graphical approaches (e.g., the reliability diagram) and scoring approaches (e.g., the Brier score). The reliability diagram [10] visualizes how well the class probabilities are estimated by (1) binning the probability outcomes and (2) plotting the observed relative frequency of the positive examples against the predicted average probability for each bin. If the probability estimates are reliable, the resulting points are expected to lie near the diagonal. Compared with scoring approaches, the reliability diagram is appealing because it visualizes the level of calibration of the probability outcomes within each probability interval. Reliability diagrams have been frequently employed to assess the quality of class probabilities [3], [4], [11].
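For concreteness, here is a minimal sketch of this construction in Python (this is not the paper's code; the equal-width binning scheme and all names are illustrative choices):

```python
import numpy as np
import matplotlib.pyplot as plt

def reliability_diagram(p, y, n_bins=10):
    """Plot observed positive frequency against mean predicted probability per bin."""
    p, y = np.asarray(p, dtype=float), np.asarray(y, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)   # K equal-width bins on [0, 1]
    idx = np.digitize(p, edges[1:-1])           # bin index 0..n_bins-1 for each p_i
    mean_pred, obs_freq = [], []
    for k in range(n_bins):
        mask = idx == k
        if mask.any():                          # skip empty bins
            mean_pred.append(p[mask].mean())    # predicted average probability
            obs_freq.append(y[mask].mean())     # observed relative frequency
    plt.plot([0, 1], [0, 1], "k--", label="perfect calibration")
    plt.plot(mean_pred, obs_freq, "o-", label="classifier")
    plt.xlabel("Mean predicted probability")
    plt.ylabel("Observed relative frequency")
    plt.legend()
    plt.show()
```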

Despite its popularity in practice, there is surprisingly little research on the reliability diagram as an evaluation measure. One drawback of the reliability diagram is that it is unclear how far the observed relative frequency may deviate from the predicted probability when the probability estimates are reliable. Some methods have been introduced to construct an interval covering the range of values that the observed relative frequency can take under the assumption that the probabilities in the bin are reliable. In [17], a confidence interval for the predicted probability of each bin is obtained by assuming that the number of positive examples in the bin follows a binomial distribution. However, the binomial distribution is unrealistic because it assumes that all probabilities within a bin are identical, which is unlikely to hold in most cases. In [3], the authors overcome this problem by constructing confidence bars from resampled observed frequencies for each bin.
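To make the binomial-interval idea of [17] concrete, here is a rough sketch; treating the bin's mean predicted probability as the common success probability, the 95% level, and the function name are my assumptions:

```python
from scipy.stats import binom

def binomial_band(p_bin, level=0.95):
    """Range of bin frequencies consistent with a common-probability binomial null."""
    n = len(p_bin)
    p_bar = sum(p_bin) / n                    # treat every outcome in the bin as p_bar
    lo, hi = binom.interval(level, n, p_bar)  # central interval on the count scale
    return lo / n, hi / n                     # convert counts back to frequencies

# e.g., 50 outcomes all predicted near 0.7; an observed frequency outside
# this band flags possible miscalibration in the bin.
print(binomial_band([0.7] * 50))
```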

While confidence bars/intervals provide statistical assessments at a given level of significance, the results may be strongly affected by the number of test instances in each bin. Since the power (i.e., one minus the probability of a type II error) of a statistical test generally increases with sample size, using a small number of bins and assigning many probabilities to each bin will make even small departures from the diagonal line appear significant. In other words, when assessing the same set of probability outcomes on test data, two reliability diagrams with confidence intervals based on different bin sizes may yield different assessments. Hence, it is important to control the power of the statistical tests when determining an appropriate bin size. To our knowledge, no published work addresses this problem for the reliability diagram.
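The power effect described above is easy to check numerically. The sketch below, under a simplified common-probability (binomial) null, shows the same 2% departure from the diagonal moving from insignificant to highly significant as the bin grows; the departure and sample sizes are illustrative:

```python
from scipy.stats import binomtest

# A constant 2% departure: predicted 0.50, observed relative frequency 0.52.
for n in (100, 1000, 10000):
    k = int(0.52 * n)                    # number of observed positives in the bin
    p = binomtest(k, n, 0.50).pvalue     # exact two-sided binomial test
    print(f"n={n:6d}  p-value={p:.4f}")
# The p-value shrinks as n grows, so the same departure eventually looks "significant".
```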

In this paper, we propose a standardized reliability diagram that provides an immediate visual interpretation of a probabilistic classifier. The proposed method assumes that the number of positive examples in each bin follows a Poisson binomial distribution. Unlike the method introduced in [3], the proposed method does not require a resampling step and therefore gives an immediate visual result. Moreover, we propose an approach to choosing the number of bins for consistent assessment with the (standardized) reliability diagram. Through simulation, we show that the selection approach leads to consistent test results across different sizes of test data. With an appropriate bin size, the standardized reliability diagram provides a simple but more effective and reliable visual assessment.

The rest of this paper is organized as follows. In Section 2, we briefly explain how a reliability diagram is made. In Section 3, we describe the detailed procedures of the proposed approaches. In Section 4, we investigate the effectiveness of the selection method using simulated data. In Section 5, we illustrate how the proposed methods are employed in practice using the letter data set. In Section 6, we provide a brief discussion and draw conclusions.

Section snippets

Reliability diagram

In a binary classification problem, a probabilistic classifier $f: \mathbb{R}^d \to [0,1]$ is trained, where $x \in \mathbb{R}^d$ is the input and $y \in \{0,1\}$ is the output. Suppose that the test data contain $N$ instances $(x_1,y_1),\ldots,(x_N,y_N)$ and the trained model produces the probability estimates $p_1,\ldots,p_N$, where $p_i$ is an estimate of $P(y_i=1 \mid x_i)$.

For a reliability diagram, the $N$ probabilities are grouped into bins $B_k$ ($k=1,2,\ldots,K$) such that the bins divide the unit interval into $K$ non-overlapping subintervals. We may choose the $B_k$

Poisson binomial random variable

For each bin $B_k$, we set the null hypothesis to be that the probability outcomes in $B_k$ are reliable. That is, the null hypothesis is $H_0: p_i = P(Y_i = 1 \mid x_i)$ for all $i \in I_k$, where $I_k$ denotes the index set of the outcomes assigned to $B_k$. Under the null hypothesis, each $Y_i$ at $x_i$ in $B_k$ is a Bernoulli random variable with success probability $p_i$, and $y_i$ is the observed value. Then $\sum_{i \in I_k} Y_i$, the sum of independent and non-identically distributed Bernoulli random variables, follows a Poisson binomial distribution with mean $\mu_k = \sum_{i \in I_k} p_i$ and variance $\sigma_k^2 = \sum_{i \in I_k} p_i(1-p_i)$.
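A natural standardization based on these moments, following the abstract's description of transforming the Poisson binomial distribution to the normal, is the per-bin statistic $z_k = \left(\sum_{i \in I_k} y_i - \mu_k\right)/\sigma_k$, which is approximately standard normal under $H_0$. A minimal sketch (the function and variable names are mine):

```python
import numpy as np

def standardized_bin_statistic(p_bin, y_bin):
    """z-score of the positive count under the Poisson binomial null."""
    p = np.asarray(p_bin, dtype=float)
    y = np.asarray(y_bin, dtype=float)
    mu = p.sum()                            # Poisson binomial mean: sum of p_i
    sigma = np.sqrt((p * (1 - p)).sum())    # Poisson binomial standard deviation
    return (y.sum() - mu) / sigma           # approx. N(0, 1) if the bin is reliable

# |z| > 1.96 suggests the bin's probabilities are unreliable at the 5% level.
```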

Simulation study

In this section, we use simulated data to show that SRD results with a fixed $K$ are sensitive to the size of the test data and that the proposed selection method effectively chooses $K^*$ values that give consistent test results. We used the following binary model for each outcome: $Y_i \sim \mathrm{Bernoulli}(q_i)$, where
$$\log\left(\frac{q_i}{1-q_i}\right) = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 r_i + \beta_4 (x_i \cdot r_i) = g(x_i, r_i),$$
where $x$ is a standard normal random variable and $r$ is a Bernoulli variable with success probability 0.5 that is independent of $x$. For each
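Data from a model of this form can be generated as below; the coefficient values are placeholders of my own, since the paper's $\beta$ values are not shown in this excerpt:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n, beta=(-0.5, 1.0, 0.5, -1.0, 0.5)):  # hypothetical coefficients
    x = rng.standard_normal(n)                      # x ~ N(0, 1)
    r = rng.binomial(1, 0.5, n)                     # r ~ Bernoulli(0.5), independent of x
    g = beta[0] + beta[1]*x + beta[2]*x**2 + beta[3]*r + beta[4]*x*r
    q = 1.0 / (1.0 + np.exp(-g))                    # inverse logit gives q_i
    y = rng.binomial(1, q)                          # Y_i ~ Bernoulli(q_i)
    return x, r, q, y
```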

An example using the letter data

This section illustrates how the standardized reliability diagram is used in practice. We used the letter data set from the UCI repository [8]. The data set contains 20,000 instances, each of which corresponds to one of the 26 upper-case letters and is described by 16 integer features. As in [11], we converted the problem into a binary one by treating the letters A-M as positive and N-Z as negative. This yields a well-balanced but difficult binary classification problem. We chose 5000 observations
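A sketch of this preprocessing (fetching through OpenML rather than directly from UCI is my substitution):

```python
from sklearn.datasets import fetch_openml

# The OpenML "letter" dataset mirrors the UCI letter-recognition data [8].
letter = fetch_openml("letter", version=1, as_frame=True)  # 20,000 x 16 integer features
X = letter.data
y = (letter.target.astype(str) <= "M").astype(int)         # A-M -> 1 (positive), N-Z -> 0
print(X.shape, y.mean())                                   # roughly balanced classes
```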

Discussion

From the simulation results, we learned that the number of bins ($K$) required for the standardized reliability diagram should be chosen based on the size of the test data ($N$). Also, the proposed selection method effectively chooses $K^*$ so that the inference from the reliability diagram is relatively insensitive to $N$.

In Section 5, the chosen bin size differed for each model. This may be an interesting point because it departs from the traditional approach that fixes the bin size (e.g., $K=10$

Conflict of interest

I, Hyukjun Gweon, confirm on behalf of all authors that no conflicts of interest exist for any of the authors.

References (19)

  • G. Collell et al.

    A simple plug-in bagging ensemble based on threshold-moving for classifying binary and multiclass imbalanced data

    Neurocomputing

    (2018)
  • Y. Hong

On computing the distribution function for the Poisson binomial distribution

    Comput. Stat. Data Anal.

    (2013)
  • Y. Benjamini et al.

    Controlling the false discovery rate: a practical and powerful approach to multiple testing

    J. R. Stat. Soc. Ser. B (Methodological)

    (1995)
  • P. Billingsley

    Probability and Measure

    (1995)
  • J. Bröcker et al.

    Increasing the reliability of reliability diagrams

    Weather Forecasting

    (2007)
  • J. Friedman et al.

    Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors)

    Ann. Stat.

    (2000)
  • D.W. Hosmer et al.

    Goodness of fit tests for the multiple logistic regression model

    Commun. Stat. - Theory Methods

    (1980)
  • M. Lichman, UCI Machine Learning Repository, 2013....
  • D. Meyer, E. Dimitriadou, K. Hornik, A. Weingessel, F. Leisch, e1071: misc functions of the department of statistics,...

Editor: Jiwen Lu.
