1 Introduction

In recent years, Decision Support Systems in various domains, such as retail, sport, or defence, have been incorporating Artificial Intelligence (AI) extensively (Zhou et al., 2021). However, the predictive models used in AI-based Decision Support Systems generally lack transparency and only provide probable results (David, 2017; Ribeiro et al., 2016). This can result in misuse (when users rely on the system excessively) or disuse (when users do not rely on it enough) (Alvarado-Valencia & Barrero, 2014; Buçinca et al., 2020).

The lack of transparency has led to the development of eXplainable Artificial Intelligence (XAI), which aims to create AI systems capable of explaining their reasoning to human users. The goal of explanations is to support users in identifying incorrect predictions, especially in critical areas such as medical diagnosis (Gunning & Aha, 2019). An explanation provided by XAI should highlight the underlying model’s strengths and weaknesses and provide insight into how it will perform in the future (David, 2017; Dimanov et al., 2020).

Explanations in XAI come in two types: local and global. Local explanations focus on the reasons behind individual predictions, while global explanations provide information about the entire model (Guidotti et al., 2018; Moradi & Samwald, 2021; Martens & Foster, 2014). Despite the apparent strength stemming from the possibility of providing explanations for each instance, local explanations typically have some drawbacks. For example, they can be unstable, meaning that the same model and instance may result in different explanations, or they can lack robustness, meaning that minor differences in the instance can lead to significantly different explanations (Slack et al., 2021; Rahnama & Boström, 2019). Instability and lack of robustness create issues when evaluating the quality of the explanations. Metrics like fidelity, which measure how well an explanation captures the behaviour of the underlying model, do not give an accurate picture of explanation quality since they depend heavily on the details of the explanation method (Slack et al., 2021; Moradi & Samwald, 2021; Hoffman et al., 2018; Carvalho et al., 2019; Adadi & Berrada, 2018; Wang et al., 2019; Mueller et al., 2019; Agarwal et al., 2022). Furthermore, even the best explanation techniques offer limited insight into model uncertainty and reliability. Recent research has emphasized the role of uncertainty estimation in enhancing the transparency of underlying models (Bhatt et al., 2021; Slack et al., 2021). Although achieving well-calibrated uncertainty has been underscored as a critical factor in fostering transparent decision-making, Bhatt et al. (2021) point out the challenges and complexities of obtaining accurately calibrated uncertainty estimates for complex problems. Moreover, as indicated by Slack et al. (2021), the focus has predominantly leaned towards adopting a well-calibrated underlying model (such as a Bayesian one) rather than relying on calibration techniques.

In local explanation methods for classification, the probability estimate output by most classifiers is commonly used as an indicator of the likelihood of each class. However, it is widely recognized that these classifiers are often poorly calibrated, resulting in probability estimates that do not faithfully represent the actual probability of correctness (Vovk, 2015). Specialized calibration techniques such as Platt Scaling (Platt, 1999) and Venn-Abers (VA) (Vovk & Petej, 2012) have been proposed to tackle these shortcomings. The VA method generates a probability range associated with each prediction, which can be refined into a properly calibrated probability estimate through regularisation.

When employing the VA approach for decision-making, it is essential to recognize that the technique provides intervals for the positive class. These intervals quantify the uncertainty within the probability estimate, offering valuable insights from an explanatory standpoint. The breadth of the interval directly corresponds to the model’s level of uncertainty: a narrower interval signifies more confidence in the probability estimate, whereas a broader interval indicates more substantial uncertainty in the estimate. The uncertainty information can be extended to the features, given that the feature weights are informed by the prediction’s probability estimate. Being able to quantify the uncertainty of feature weights can improve the quality and usefulness of explanations in XAI. Recently, a local explanation method, Calibrated Explanations, utilizing the intervals provided by VA to estimate feature uncertainty, was introduced for classification (Löfström et al., 2023).

In recent years, conformal prediction has increasingly been integrated into research on XAI methods, although not focusing on the uncertainty aspect per se. The focus has primarily been on interpretable models (Johansson et al., 2019), increasing the fidelity between model and explanations (Altmeyer et al., 2024), lowering the computational cost (Alkhatib et al., 2023) and explaining reject options (Artelt & Hammer, 2022; Artelt et al., 2022, 2023). Explaining reject options has been defined as explaining the uncertainty involved in making a decision.

Existing explanation methods most commonly focus on explaining decisions from classifiers, despite the fact that regression is widely used in highly critical situations. Due to the lack of specialized explanation techniques for regression, applying methods designed for classification to regression problems is not unusual, highlighting the need for well-founded explanation methods for regression (Letzgus et al., 2022).

The aim of this study is to propose an explanation method for regression with the same ability to quantify the uncertainty of feature weights that Calibrated Explanations, through VA, provides for classification. The conformal prediction framework (Vovk et al., 2005) provides several different techniques for quantifying uncertainty in a regression context. In this paper, the Conformal Predictive Systems (CPSs) technique (Vovk et al., 2019) for uncertainty estimation is used in Calibrated Explanations to allow the creation of calibrated explanations with uncertainty estimation for regression. A CPS is not only a very flexible technique, providing a rich set of tools for uncertainty quantification, but it also allows estimating the probability that the target is above any user-defined threshold. Based on this, a new form of probabilistic explanation for regression is also proposed in this paper. These approaches are user-friendly and model-agnostic, making them easy to use and applicable to diverse underlying models.

In summary, this paper introduces extensions of Calibrated Explanations aimed at regression, with the following characteristics:

  • Fast, reliable, stable and robust feature importance explanations for regression.

  • Calibration of the predictions from the underlying model through the application of CPSs.

  • Explanations with arbitrary forms of uncertainty quantification of the predictions from the underlying model and the feature importance weights through querying of the conformal predictive distribution (CPD) derived from the CPS.

  • Possibility of creating explanations of the probability that the prediction exceeds a user-defined threshold, with uncertainty quantification.

  • Rules with straightforward interpretation in relation to the feature values and the target.

  • Possibility to generate counterfactual rules with uncertainty quantification of the expected predictions (or probability of exceeding a threshold).

  • Possibility of creating conjunctive rules, conveying feature importance for the interaction of the included features.

  • Distribution as an open source Python package, making the proposed techniques easily accessible for both scientific and industrial purposes.

2 Background

2.1 Post-hoc explanation methods

The research area of XAI can be broadly categorized into two main types: developing inherently interpretable and transparent models and utilizing post-hoc methods to explain opaque models. Post-hoc explanation techniques seek to construct simplified and interpretable models that reveal the relationship between feature values and the model’s predictions. These explanations, which can be either local or global, often leverage visual aids such as pixel representations, feature importance plots, or word clouds, emphasizing the features, pixels, or words accountable for causing the model’s predictions (Molnar, 2022; Moradi & Samwald, 2021).

Two distinct types of explanations exist: factual explanations, where a feature value directly influences the prediction outcome, and counterfactual explanations, which explore the potential impact on predictions when altering a feature’s values (Mothilal et al., 2020; Guidotti, 2022; Wachter et al., 2017). Importantly, counterfactual explanations are intrinsically local. They are also particularly human-friendly, mirroring how human reasoning operates (Molnar, 2022).

2.2 Essential characteristics of explanations

Creating high-quality explanations in XAI requires a multidisciplinary approach that draws knowledge from both the Human-Computer Interaction and the Machine Learning fields. The quality of an explanation method depends on the goals it addresses, which may vary. For instance, assessing how users appreciate the explanation interface differs from evaluating whether the explanation accurately mirrors the underlying model (Löfström et al., 2022). However, specific characteristics are universally desirable for post-hoc explanation methods. It is crucial that an explanation method accurately reflects the underlying model, which is closely related to the requirement that an explanation method should have a high level of fidelity to the underlying model (Slack et al., 2021). Therefore, a reliable explanation must have feature weights that correspond accurately to the actual impact on the estimates to correctly reflect the model’s behavior (Bhatt et al., 2021).

Stability and robustness are two additional critical features of explanation methods (Dimanov et al., 2020; Agarwal et al., 2022; Alvarez-Melis & Jaakkola, 2018). Stability refers to the consistency of the explanations (Slack et al., 2021; Carvalho et al., 2019); the same instance and model should produce identical explanations across multiple runs. On the other hand, robustness refers to the ability of an explanation method to produce consistent results even when an instance undergoes small perturbations (Dimanov et al., 2020) or other circumstances change. Therefore, the essential characteristics of an explanation method in XAI are that it should be reliable, stable, and robust.

2.3 Explanations for classification and regression

The distinction between explanations for classification and regression lies in the nature of the insights they offer. In classification, the task involves predicting the specific class an instance belongs to from a set of predefined classes. The accompanying probability estimates reflect the model’s confidence level for each class. Various explanation techniques have been developed for classifiers to clarify the rationale behind the class predictions. Notable methods include SHAP (Lundberg & Lee, 2017), LIME (Ribeiro et al., 2016), and Anchor (Ribeiro et al., 2018). These techniques delve into the factors that contribute to the assignment of a particular class label. Typically, the explanations leverage the concept of feature importance, e.g., words in textual data or pixels in images.

In regression, the paradigm shifts as there are no predetermined classes or categorical values. Instead, each instance is associated with a numerical value, and the prediction strives to approximate this value. Consequently, explanations for regression models cannot rely on the framework of predefined classes. Nevertheless, explanation techniques designed for classifiers, as mentioned above, can often be applied to regression problems, provided these methods concentrate on attributing features to the predicted instance’s output.

2.4 Venn-Abers predictors

Probabilistic predictors compute class labels and associated probability distributions. Validating these predictions is challenging, but calibration focuses on aligning predicted and observed probabilities (Vovk et al., 2005). The goal is well-calibrated models where predicted probabilities match actual accuracy. Venn predictors (Vovk et al., 2004) produce multi-probabilistic predictions, converted to confidence-based probability intervals.

Inductive Venn prediction (Lambrou et al., 2015) involves a Venn taxonomy, categorizing calibration data for probability estimation. For a test instance falling into a given category, the estimated probability is the relative frequency of each class label among all calibration instances in that category.

Venn-Abers predictors (VA) (Vovk & Petej, 2012) offer automated taxonomy optimization via isotonic regression, thus introducing dynamic probability intervals. A two-class scoring classifier assigns a prediction score \(s_i\) to an object \(x_i\). A higher score implies higher belief in the positive class. In order to calibrate a model, some data must be set aside and used as a calibration set when using inductive VA predictors. Consequently, split the training set \(\{z_1, \dots , z_i, \dots , z_{n}\}\), with objects \(x_i\) and labels \(y_i\), into a proper training set \(Z_T\) and a calibration set \(\{z_{1}, \dots , z_q\}\). Train a scoring classifier on \(Z_T\) to compute s for \(\{x_{1},\dots ,x_q,x\}\), where x is the object of the test instance z. Inductive VA prediction is described in Algorithm 1.

Algorithm 1: Inductive VA prediction
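
To make the procedure concrete, the following minimal sketch implements inductive VA prediction for a single test score using scikit-learn's isotonic regression. The function and variable names are illustrative rather than taken from the calibrated-explanations package, and the final line uses a commonly used regularisation, \(p_1/(1-p_0+p_1)\), to turn the interval into a point estimate.

import numpy as np
from sklearn.isotonic import IsotonicRegression

def venn_abers(cal_scores, cal_labels, test_score):
    """Return the probability interval (p0, p1) and a regularised
    point estimate for a single test score."""
    bounds = {}
    for hypothetical_label in (0, 1):
        # Augment the calibration scores with the test score, tentatively
        # labelled with the hypothetical class, and fit isotonic regression.
        scores = np.append(cal_scores, test_score)
        labels = np.append(cal_labels, hypothetical_label)
        iso = IsotonicRegression(y_min=0, y_max=1, out_of_bounds="clip")
        iso.fit(scores, labels)
        bounds[hypothetical_label] = iso.predict([test_score])[0]
    p0, p1 = bounds[0], bounds[1]
    return (p0, p1), p1 / (1 - p0 + p1)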

Since the class label of the test instance must be either positive or negative in binary classification, and the lower and upper bounds are the relative frequencies calculated from the calibration set (including the test instance with the positive or negative label assigned), one of them must be the correctly calibrated probability estimate. Thus, the probability interval is well-calibrated provided the data is exchangeable.

In summary, VA produces a calibrated (regularized) probability estimate \(\mathcal {P}\) together with a probability interval with a lower and upper bound \([\mathcal {P}_l,\mathcal {P}_h]\).

2.5 Calibrated explanations for classification

Below is an introduction to Calibrated Explanations for classification (Löfström et al., 2023), which provides the foundation for this paper’s contribution. In the following descriptions, a factual explanation is composed of a calibrated prediction from the underlying model, accompanied by an uncertainty interval, and a collection of factual feature rules, each composed of a feature weight with an uncertainty interval and a factual condition covering that feature’s instance value. Counterfactual explanations only contain a collection of counterfactual feature rules, each composed of a prediction estimate with an uncertainty interval and a counterfactual condition covering alternative instance values for the feature. The prediction estimate represents a probability estimate for classification, whereas for regression, the prediction estimate will be expressed as a potential prediction.

2.5.1 Factual calibrated explanations for classification

Calibrated Explanations is applied to an underlying model with the intention of explaining its predictions for individual instances using rules conveying feature importances. The following is a high-level description of how Calibrated Explanations for classification works, following the original description in Löfström et al. (2023) closely:

Let us assume that a scoring classifier, trained using the proper training set \(Z_T\), exists for which a local explanation for test object x is wanted. Use VA as a calibrator and calibrate the underlying model for x to get the probability interval \([\mathcal {P}_l, \mathcal {P}_h]\) and the calibrated probability estimate \(\mathcal {P}\). For each feature f, use the calibrator to estimate probability intervals (\([\mathcal {P}'_{l.f}, \mathcal {P}'_{h.f}]\)) and calibrated probability estimates (\(\mathcal {P}'_{f}\)) for slightly perturbed versions of object x, changing one feature at a time in a systematic way (see the detailed description below). To get the feature weight (and uncertainty interval) for feature f, calculate the difference between \(\mathcal {P}\) and the average of all \(\mathcal {P}'_{f}\) (and \([\mathcal {P}'_{l.f}, \mathcal {P}'_{h.f}]\)):

$$\begin{aligned} & w_f = \mathcal {P} - \frac{1}{|V_f|-1}\sum \mathcal {P}'_{f}, \end{aligned}$$
(1)
$$\begin{aligned} & w_l^f = \mathcal {P} - \frac{1}{|V_f|-1}\sum \mathcal {P}'_{l.f}, \end{aligned}$$
(2)
$$\begin{aligned} & w_h^f = \mathcal {P} - \frac{1}{|V_f|-1}\sum \mathcal {P}'_{h.f}, \end{aligned}$$
(3)

where \(|V_f|-1\) is the number of perturbed values.

The feature weight is defined exactly as the difference between the calibrated probability estimate on the original test object x and the average calibrated probability estimate achieved on the perturbed versions of x. The upper and lower bounds are defined analogously using the probability intervals from the perturbed versions of x. As long as the same test object, underlying model and calibration set are used, the resulting explanation will also be the same.
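
As an illustration, assuming the calibrated estimates for the perturbed instances have already been collected, Eqs. (1)–(3) amount to the following computation (names are illustrative):

import numpy as np

def feature_weight(p, perturbed_p, perturbed_low, perturbed_high):
    """p: calibrated probability estimate for the test object x.
    perturbed_*: calibrated estimates and interval bounds for the perturbed
    versions of x, one entry per perturbed value of feature f.
    Returns (w_f, w_l^f, w_h^f) as defined in Eqs. (1)-(3)."""
    w_f = p - np.mean(perturbed_p)
    w_low = p - np.mean(perturbed_low)
    w_high = p - np.mean(perturbed_high)
    return w_f, w_low, w_high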

More formally, Algorithm 2 describes the steps that are pursued to achieve a factual explanation for a test object x.

Algorithm 2: Factual calibrated explanations for classification

2.5.2 Counterfactual calibrated explanations for classification

When creating factual explanations, the calibrator’s results from perturbed instances are averaged to calculate feature importance and uncertainty intervals for each feature. When generating counterfactual rules, the calibrator’s results for perturbed instances are instead used to form counterfactual rules. For categorical features, one counterfactual rule is created for each alternative categorical value, and for numerical features, (up to) two rules, representing \(\le\)-rules and >-rules, can be created. Each feature rule’s expected probability interval is already established as \([\mathcal {P}'_{l.f}, \mathcal {P}'_{h.f}]\), following the Calibrated Explanations process in steps 5 and 10 above, defining one feature rule for each alternative instance value. The condition will be similar to that in steps 5 and 10 above, but for the alternative instance value v. The feature weights from Equation (1) are mainly employed to sort counterfactual rules by impact. The calibrated probability estimate \(\mathcal {P}'_{f}\) is normally neglected in counterfactual rules for classification but is calculated and can be used.

2.5.3 Conjunctive calibrated explanations

Each individual rule only conveys the contribution of an individual feature. To counteract this shortcoming, conjunctive rules can be derived to estimate the joint contribution of combinations of features. This is done separately from the generation of the feature rules, by combining the established feature rules. For each combination of existing rules, new perturbed instances are created by applying the already established feature rule conditions, limiting the search space of candidate conjunctions to the most important existing rules. Calibration is performed following the same logic as for single-feature perturbed instances, making it possible to get well-calibrated conjunctive rules that take feature interaction into account. Algorithm 3 describes the process in more detail.

Algorithm 3: Conjunctive calibrated explanations
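
A rough sketch of the joint perturbation step, assuming a helper calibrate(x) that returns the calibrated estimate for a perturbed copy of x; the actual Algorithm 3 also handles intervals and rule conditions, so this is only an illustration of the idea, not the package implementation.

import numpy as np
from itertools import combinations

def conjunctive_weights(x, rules, calibrate, p, top_k=5):
    """rules: list of (feature_index, alternative_values) pairs for the most
    important established feature rules. p: calibrated estimate for x."""
    conjunctions = {}
    for (f1, vals1), (f2, vals2) in combinations(rules[:top_k], 2):
        estimates = []
        for v1 in vals1:
            for v2 in vals2:
                x_pert = list(x)
                x_pert[f1], x_pert[f2] = v1, v2   # apply both rule conditions
                estimates.append(calibrate(x_pert))
        # joint weight analogous to Eq. (1), but for two features at once
        conjunctions[(f1, f2)] = p - np.mean(estimates)
    return conjunctions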

3 Calibrated explanations for regression

The basic idea in Calibrated Explanations for classification is that each factual and counterfactual explanation is derived using three calibrated values: the calibrated probability estimate and the lower and upper bounds of the probability interval.

For regression, there are two natural use cases, where the obvious one is predicting the continuous target value directly, i.e., standard regression. Another use case is to instead predict the probability of the target being below (or above) a given threshold, basically viewing the problem as a binary classification problem.

Conformal Predictive Systems (CPSs) produce Conformal Predictive Distributions (CPDs), as mentioned in the introduction. CPDs are cumulative distribution functions which can be used for various purposes, such as deriving prediction intervals for specified confidence levels or obtaining the probability of the true target falling below (or above) any threshold.

3.1 Conformal predictive systems

Conformal prediction (Vovk et al., 2005) offers predictive confidence by generating prediction regions, which include the true target with a specified probability. These regions are sets of class labels for classification or prediction intervals for regression.

Errors arise when the true target falls outside the region, yet conformal predictors are automatically valid under exchangeability, yielding an error rate of \(\varepsilon\) over time. Thus, the key evaluation criterion is efficiency, gauged by the region’s size, with sharper regions providing greater insight. Conformal regressors (CRs), here exemplified by an inductive (split) CR (Papadopoulos et al., 2002), follow the steps in Algorithm 4.

Algorithm 4: Inductive (split) conformal regression

To individualize intervals, the normalized nonconformity function (Papadopoulos et al., 2008) augments the nonconformity score with \(\sigma _i\) and \(\beta\), where \(\sigma _i\) is the predicted difficulty of instance i and \(\beta\) is a parameter used to control the sensitivity of the nonconformity measure. The normalized nonconformity score is \(\frac{\left| y_i - h(x_i)\right| }{\sigma _i+\beta }\), and the interval is \(h(x_i) \pm \alpha _s(\sigma _i+\beta )\). This approach yields individualized prediction intervals, accommodating prediction difficulty and enhancing region informativeness.
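
A minimal sketch of an inductive (split) CR with optional normalization, following the description above; the index handling may differ in detail from library implementations such as crepes, and all names are illustrative.

import numpy as np

def split_conformal_interval(model, X_cal, y_cal, x, epsilon=0.1,
                             sigmas_cal=None, sigma_x=None, beta=0.01):
    residuals = np.abs(y_cal - model.predict(X_cal))
    if sigmas_cal is not None:                    # normalized nonconformity
        residuals = residuals / (sigmas_cal + beta)
    alphas = np.sort(residuals)
    q = len(alphas)
    # the ceil((1 - epsilon)(q + 1))-th smallest nonconformity score
    s = min(int(np.ceil((1 - epsilon) * (q + 1))) - 1, q - 1)
    alpha_s = alphas[s]
    y_hat = model.predict(x.reshape(1, -1))[0]
    half_width = alpha_s if sigma_x is None else alpha_s * (sigma_x + beta)
    return y_hat - half_width, y_hat + half_width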

The process of creating (normalized) inductive CPSs closely resembles the formation of inductive CRs (Vovk et al., 2019). The primary distinction lies in calculating nonconformity scores using actual errors, defined as:

$$\begin{aligned} \alpha_i = y_i - h\left( x_i\right) , \end{aligned}$$
(4)

or normalized errors:

$$\begin{aligned} \alpha_i = \frac{y_i - h\left( x_i\right) }{\sigma _i + \beta }, \end{aligned}$$
(5)

where \(\sigma _i\), \(x_i\), and \(\beta\) retain their prior definitions. The prediction for a test instance \(x_i\) (potentially with an estimated difficulty \(\sigma _i\)) then becomes the following CPD:

$$\begin{aligned} \displaystyle \mathcal {Q}(y) = {\left\{ \begin{array}{ll} \textstyle \frac{i+\tau }{q+1}, \text { if } y\in \left( \mathcal {C}_{(i)},\mathcal {C}_{(i+1)}\right) , & \text {for } i \in \{0,...,q\}\\ \textstyle \frac{i'-1+(i''-i'+2)\tau }{q+1}, \text {if } y = \mathcal {C}_{(i)}, & \text {for } i\in \{1,...,q\} \end{array}\right. } \end{aligned}$$
(6)

where \(\mathcal {C}_{(1)}, \ldots , \mathcal {C}_{(q)}\) are obtained from the calibration scores \(\alpha _1, \ldots , \alpha _q\), sorted in increasing order:

$$\begin{aligned} \mathcal {C}_{(i)} = h\left( x\right) +\alpha _i \end{aligned}$$

or, when using normalization:

$$\begin{aligned} \mathcal {C}_{(i)} = h\left( x\right) +\alpha_i \left(\sigma_i + \beta\right) \end{aligned}$$

with \(\mathcal {C}_{(0)}=-\infty\) and \(\mathcal {C}_{(q+1)}=\infty\). \(\tau\) is sampled from the uniform distribution \(\mathcal {U}(0,1)\) and its role is to allow the \(\mathcal {P}\)-values of target values to be uniformly distributed. \(i''\) is the highest index such that \(y = \mathcal {C}_{(i'')}\), while \(i'\) is the lowest index such that \(y = \mathcal {C}_{(i')}\) (in case of ties). For a specific value y, the function returns the estimated probability \(\mathcal {P}(\mathcal {Y} \le y)\), where \(\mathcal {Y}\) is a random variable corresponding to the true target.

Given a CPD:

  • A two-sided prediction interval for a chosen significance level \(\varepsilon\) can be obtained by \([\mathcal {C}_{\lfloor (\varepsilon /2)(q+1) \rfloor }, \mathcal {C}_{\lceil (1-\varepsilon /2)(q+1) \rceil }]\). Obviously, the interval does not have to be symmetric as long as the covered range of percentiles is \(1-\varepsilon\).

  • One-sided prediction intervals can be obtained by \([\mathcal {C}_{\lfloor \varepsilon (q+1) \rfloor }, \infty ]\) for a lower-bounded interval, and by \([-\infty , \mathcal {C}_{\lceil (1-\varepsilon )(q+1) \rceil }]\) for an upper-bounded interval.

  • Similarly, a point prediction corresponding to the median of the distribution can be obtained by \((\mathcal {C}_{\lceil 0.5(q+1) \rceil }+\mathcal {C}_{\lfloor 0.5(q+1) \rfloor })/2\). Since the median is an unbiased midpoint in the distribution measured on the calibration set, the median prediction can be seen as a calibration of the underlying model’s prediction. Unless the model is biased, the median will tend to be very close to the prediction of the underlying model.
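
To make these queries concrete, the following sketch builds the sorted CPD support points \(\mathcal{C}_{(1)}, \ldots, \mathcal{C}_{(q)}\) for a test object and derives the median, a two-sided interval and the probability \(\mathcal{P}(\mathcal{Y} \le y)\). For clarity, the tie-breaking term \(\tau\) in Eq. (6) is ignored, so the probability is only approximate, and the indexing may differ in detail from library implementations such as crepes; names are illustrative.

import numpy as np

def cpd_support(model, X_cal, y_cal, x, sigmas_cal=None, sigma_x=None, beta=0.01):
    """Sorted CPD support points C_(1), ..., C_(q) for test object x.
    If normalization is used, both sigmas_cal (calibration difficulties)
    and sigma_x (test difficulty) must be given."""
    y_hat = model.predict(x.reshape(1, -1))[0]
    errors = y_cal - model.predict(X_cal)            # signed errors, Eq. (4)
    if sigmas_cal is not None:
        errors = errors / (sigmas_cal + beta)        # normalized errors, Eq. (5)
        return np.sort(y_hat + errors * (sigma_x + beta))
    return np.sort(y_hat + errors)

def cpd_query(C, y=None, lower_pct=5, upper_pct=95):
    """Median, two-sided interval and, if y is given, an estimate of P(Y <= y)."""
    q = len(C)
    median = (C[int(np.floor(0.5 * (q + 1))) - 1] +
              C[int(np.ceil(0.5 * (q + 1))) - 1]) / 2
    low = C[max(int(np.floor(lower_pct / 100 * (q + 1))) - 1, 0)]
    high = C[min(int(np.ceil(upper_pct / 100 * (q + 1))) - 1, q - 1)]
    # P(Y <= y) estimated as the fraction of support points below y
    prob = None if y is None else np.searchsorted(C, y) / (q + 1)
    return median, (low, high), prob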

Fig. 1: A CPD with three different intervals, each representing \(90\%\) confidence: more than the \(10^{th}\) percentile; between the \(5^{th}\) and the \(95^{th}\) percentiles; less than the \(90^{th}\) percentile. The black dotted lines indicate how to determine the probability of the true target being smaller than 0.5, which in this case would be approximately \(80\%\) (Color figure online)

Figure 1 illustrates how the CPD can form one-sided and two-sided confidence intervals. It also illustrates how the probability of the true target falling below a given threshold can be determined, as well as connecting a probability with the threshold it corresponds to.

Compared to a CR, which can also provide valid confidence intervals from the underlying model, a CPS offers richer opportunities to define intervals and probabilities by querying the CPD. One particular strength of a CPS is its ability to calibrate the underlying model. As an example, if the underlying model is consistently overly optimistic, the median from the CPS will adjust for that and provide a calibrated prediction better adjusted to reality.

There are several different ways that difficulty (\(\sigma\)) can be estimated, such as:

  • The (Euclidean) distances to the k nearest neighbors.

  • The standard deviation of the targets of the k nearest neighbors.

  • The absolute errors of the k nearest neighbors.

  • The variance of the predictions of the constituent models, in case the underlying model is an ensemble.

3.2 Factual and counterfactual explanations for regression

In order to get factual Calibrated Explanations for regression, the probability interval \([\mathcal {P}_l, \mathcal {P}_h]\) and the calibrated probability estimate \(\mathcal {P}\) from VA are exchanged for a confidence interval and the median, both derived from the CPD. The confidence interval is defined by user-selected lower and upper percentiles, allowing dynamic selection of arbitrary confidence intervals.

Thus, to produce factual and counterfactual rules in the same way as for classification, the only thing that needs to be adjusted in Algorithm 2 (Sect. 2.5.1) is to exchange the VA calibrator for a CPS. Since the confidence interval from the CPS is based on the user-provided percentiles, the lower and upper percentiles are two necessary additional parameters. By default, the lower and upper percentiles are \([5^{th}, 95^{th}]\), resulting in a two-sided \(90\%\) confidence interval derived from the CPD. One-sided intervals can in practice be handled as a two-sided interval with either \(-\infty\) or \(\infty\) assigned as the lower or upper percentile. The calibrated probability estimate used in classification is exchanged for the median from the CPD, which in practice represents a calibration of the underlying model’s prediction, neutralizing any systematic bias in the underlying model. Consequently, using a CPS effectively enables factual Calibrated Explanations for regression with uncertainty quantification of both the prediction from the underlying model and each feature rule.

More formally, the confidence interval and the median are derived using Algorithm 5.

Algorithm 5: Calibrated explanations for regression

The input to the Calibrated Explanations differs between classification and regression: in classification, it is probability estimates; in regression, it is actual predicted values. Thus, factual Calibrated Explanations for regression will result in feature weights indicating changes in predictions rather than changes in probabilities.

3.3 Factual and counterfactual probabilistic calibrated explanations for regression

The simplest approach when trying to predict the probability that a target value is below (or above) a threshold is to treat the problem as a binary classification problem, with the target defined as

$$\begin{aligned} \dot{y}_i= {\left\{ \begin{array}{ll} \textstyle 1 & \text {if } y_i\le t\\ \textstyle 0 & \text {if } y_i>t, \end{array}\right. } \end{aligned}$$
(7)

where y are the regression targets, t the threshold, and \(\dot{y}\) the binary classification target. To obtain the probability, some form of probabilistic classifier is used.

The CPS makes it possible to query any regular regression model for the probability of the target falling below any given threshold. This effectively eliminates the need to treat the problem as a classification problem.

Utilizing this strength to create explanations is straightforward when only the probability is of interest. However, there is no obvious equivalent to the probability interval created by VA in classification or the confidence interval derived from a CPS in regression. Consequently, achieving a calibrated explanation with uncertainty quantification for this scenario is not as easy as creating factual and counterfactual explanations for classification or regression.

The fact that probabilistic predictions for regression can be achieved by viewing the problem as a classification problem holds a key to a solution. VA needs a score s for both the calibration and the test instances. By using a CPS as a probabilistic scoring function for both calibration and test instances, it becomes possible to use VA to calibrate the probability and provide a probability interval. The score used is the probability (from a CPD) of calibration and test instances being below the given threshold. The isotonic regressors used by VA also need a binary target for the calibration set, which is defined using Equation (7).

Since the CPS is defined using the calibration set, the probabilities achieved on that same calibration set will be biased and consequently not entirely trustworthy. To counteract that, the original calibration set is split into two halves: one half is used as the calibration set for the CPS while the other half is used as the calibration set for VA. Since half as many calibration instances are available for both the CPS and the VA, the informational efficiency compared to Calibrated Explanations for classification and regression will obviously be affected. However, the impact will primarily affect the extremes, i.e., when the threshold results in very low or very high probabilities. In these situations, the CPS may provide less fine-grained probability estimates, likely resulting in slightly more uncertainty when applying VA. The CPS can be pre-fitted at initialization of the CalibratedExplainer, whereas VA needs to be initialized for each threshold at explanation time, as described in Algorithm 6.

Algorithm 6: Initialization of CalibratedExplainer

Algorithm 7: Probabilistic calibrated explanations for regression

Algorithm 7 describes what is done at explanation time. If the same threshold t is used for a batch of test objects, the same calibrator, \(va_{\mathcal {P}}\), is re-used, improving computational performance as steps 3–5 only need to be done once.
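
An illustrative sketch of the construction for a single test object, reusing the cpd_support/cpd_query and venn_abers sketches given earlier; the package implementation follows Algorithms 6 and 7, pre-fitting the CPS at initialization and reusing the VA calibrator across test objects sharing the same threshold.

import numpy as np

def probabilistic_prediction(model, X_cal, y_cal, x, t):
    half = len(y_cal) // 2
    X_cps, y_cps = X_cal[:half], y_cal[:half]     # calibration set for the CPS
    X_va, y_va = X_cal[half:], y_cal[half:]       # calibration set for VA

    def score(obj):                               # P(Y <= t) from the CPD
        C = cpd_support(model, X_cps, y_cps, obj)
        return cpd_query(C, y=t)[2]

    cal_scores = np.array([score(xi) for xi in X_va])
    cal_labels = (y_va <= t).astype(int)          # binary targets, Eq. (7)
    return venn_abers(cal_scores, cal_labels, score(x))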

3.4 Properties of calibrated explanations for regression

The median from a CPD based on the calibration data can be seen as a form of calibration of the underlying model’s prediction, since it may adjust the prediction on the test instance to match what has previously been seen on the calibration set. The calibration will primarily affect systematic bias in the underlying model. Consequently, since Calibrated Explanations calibrates the underlying model, it will create calibrated predictions and explanations. In addition, VA provides uncertainty quantification of both the probability estimates from the underlying model and the feature importance weights through the intervals for probabilistic Calibrated Explanations for regression. By using equality rules for categorical features and binary rules for numerical features (as recommended above), interpreting the meaning of a rule with a corresponding feature weight in relation to the target and instance value is straightforward and unambiguous and follows the same logic as for classification.

The explanations are reliable because the rules straightforwardly define the relationship between the calibrated outcome and the feature weight (for factual explanations) or feature prediction estimate (for counterfactual explanations). The explanations are robust, i.e., consistent, as long as the feature rules cover any perturbations in feature values. Variation in predictions, e.g., when training using different training sets, can be expected to result in some variation in feature rules, corresponding to the variation in predictions. Obviously, the method does not guarantee robustness for perturbations violating a feature rule condition. Factual and counterfactual Calibrated Explanations for regression are stable as long as the same calibration set and model are used. Finally, depending on the size of the calibration set used to define the CPS, the generation of factual Calibrated Explanations for regression is comparable in speed to existing solutions such as LIME and SHAP. Generating probabilistic factual Calibrated Explanations for regression will be slower than Calibrated Explanations for classification: both require a VA to be trained, and, compared to classification, probabilistic explanations for regression have some additional overhead from using a CPS as well.

A minor difference between classification and regression is related to the discretizers that are used for numerical features. Both the BinaryEntropyDiscretizer and the EntropyDiscretizer (used for classification) require categorical target values for the calibration set, as they use a classification tree (with depths of one and three, respectively) to determine the best discretization. For regression, two new discretizers have been added, BinaryRegressorDiscretizer and RegressorDiscretizer, relying on regression trees, also with depths one and three. The discretizers are automatically assigned based on the kind of problem and explanation that is extracted. The same discretizers as used for standard factual and counterfactual Calibrated Explanations for regression must also be applied for probabilistic regression explanations, as this is motivated by the problem type.
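
As an illustration of the idea behind the BinaryRegressorDiscretizer (not the package's actual implementation), a depth-one regression tree can be used to find the threshold that defines the \(\le\)-rule and >-rule for a numerical feature:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def binary_threshold(feature_values, y_cal):
    tree = DecisionTreeRegressor(max_depth=1)
    tree.fit(np.asarray(feature_values).reshape(-1, 1), y_cal)
    # the root split of the depth-one tree is the discretization threshold
    return tree.tree_.threshold[0]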

If a difficulty estimator is used to get explanations based on normalized CPDs, \(\sigma\) is calculated using the DifficultyEstimator in crepes.extras and passed along to cps (and \(cps_{\mathcal {P}}\) for probabilistic regression explanations) both when fitting and obtaining median and interval values.

Finally, the calibrated predictions and their confidence intervals, which are an integral part of factual Calibrated Explanations, provide the same guarantees as the calibration model used, i.e., the same guarantees as VA for classification and CPSs for regression (or a combination of both for probabilistic regression). However, even if the uncertainty quantification in the form of intervals for the feature rules is also derived from the same calibration model, these feature rule intervals do not necessarily provide the same guarantees. The reason is that the perturbed instances (see steps 5 and 10) are artificial and the combination of feature values may not always exist naturally in the problem domain. Whenever that happens, the underlying model and the calibration model will indicate that it is a strange instance but may not estimate the degree of strangeness correctly as there is no evidence in the data to base a correct estimate on.

A Python implementation of the Calibrated Explanations solution described in this paper is freely available with a BSD3-style license from:

Since it is on PyPI and conda-forge, it can be installed with pip or conda commands. The GitHub repository includes Python scripts to run the examples in this paper, making the results here easily replicable. The repository also includes several notebooks with additional examples. This paper details calibrated-explanations as of version v0.4.0 (whereas the current version at the time of publication is v0.5.1).

Fig. 2: Code example on using calibrated-explanations for regression.

Using Calibrated Explanations with regression is done using almost identical function calls as for classification. An example of how to initialise a CalibratedExplainer and create factual and counterfactual explanations for standard and probabilistic regression from a trained model can be seen in Fig. 2. The parameter low_high_percentiles=(5,95) is the default value and can be left out or changed to some other uncertainty interval. In the example, all intervals are defined with 90% confidence. The difference between standard and probabilistic explanations only requires exchanging low_high_percentiles=(low,high) for threshold=your_threshold. The threshold parameter is None by default but takes precedence when a value is assigned.
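
A hedged sketch of the kind of calls described for Fig. 2 is given below. The class name CalibratedExplainer and the parameters low_high_percentiles and threshold are taken from the text, whereas the constructor arguments and the method names explain_factual, explain_counterfactual and plot are assumptions that may differ between package versions; consult the package documentation for the exact API.

from calibrated_explanations import CalibratedExplainer

# model is assumed to be a regressor already trained on the proper training set
explainer = CalibratedExplainer(model, X_cal, y_cal, mode='regression')  # signature assumed

# standard regression: two-sided 90% confidence intervals (the default)
factual = explainer.explain_factual(X_test, low_high_percentiles=(5, 95))
counterfactual = explainer.explain_counterfactual(X_test)

# probabilistic regression: probability of being below a user-defined threshold
probabilistic = explainer.explain_factual(X_test, threshold=your_threshold)

factual.plot()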

Fig. 3: Code example on using calibrated-explanations with normalization.

Normalization can be achieved using DifficultyEstimator from crepes.extras. It currently has four different ways to normalize, as seen in the example shown in Fig. 3, where alternatives 3 and 4 require an ensemble model, such as a RandomForestRegressor. Once the difficulty estimator is assigned, creating normalized explanations with standard and probabilistic regression is done in exactly the same way as without normalization (see Fig. 2).
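
A hedged sketch of the normalization setup described for Fig. 3: DifficultyEstimator comes from crepes.extras as stated above, but the exact fit() keyword arguments and the way the estimator is attached to the explainer (here a set_difficulty_estimator call) are assumptions that may differ between versions.

from crepes.extras import DifficultyEstimator

# difficulty as the standard deviation of the targets of the k nearest
# neighbours (alternative 2 in Sect. 3.1); keyword arguments assumed
de = DifficultyEstimator()
de.fit(X=X_proper_train, y=y_proper_train)

# attach the difficulty estimator to an already initialized explainer
explainer.set_difficulty_estimator(de)  # method name assumed

# normalized explanations are then created exactly as in the sketch above
factual_normalized = explainer.explain_factual(X_test)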

3.5 Summary of calibrated explanations

With the two solutions proposed here, Calibrated Explanations provides a number of possible use cases, which are summarized in Table 1.

Table 1 Summary of characteristics of Calibrated Explanations. All explanations include the calibrated prediction, with confidence intervals, of the explained instance. FR refers to factual explanations visualized using regular plots, FU refers to factual explanations visualized using uncertainty plots, and CF refers to counterfactual explanations and plots. Furthermore, CI refers to a confidence interval, Conjunctive rules indicates that conjunctive rules are possible, Conditional rules indicates support for users to create contextual explanations, Normalization indicates that normalization is supported and # alternative setups refers to the number of ways to run Calibrated Explanations, i.e., w/o normalization or with any of the four different ways to normalize. X marks a core alternative, I marks selectable interval type(s) used by the core alternatives, and O marks optional additions

Both factual and counterfactual explanations are composed of lists of feature rules with conditions and feature weights with confidence intervals (factual explanations) or feature prediction estimates with confidence intervals (counterfactual explanations), as described in Sect. 2.5. Conditional rules were introduced in version v0.3.1 and are described in a paper applying them to the analysis of fairness (Löfström & Löfström, 2024).

4 Experimental setup

The implementation of both the regression and the probabilistic regression solutions extends the calibrated-explanations Python package (Löfström et al., 2023) and relies on the ConformalPredictiveSystem from the crepes package (Boström, 2022). By default, ConformalPredictiveSystem is used without normalization, but DifficultyEstimator, provided by crepes.extras, is fully supported by calibrated-explanations, with normalization options corresponding to the list given at the end of Sect. 3.1 and in Fig. 3.

4.1 Presentation of calibrated explanations through plots

In this paper, three different kinds of plots for Calibrated Explanations are presented. The first two are used when visualizing factual Calibrated Explanations for standard regression. These plots are inspired by LIME; in particular, the rules in LIME have been seen as providing valuable information in the explanations.

  • Regular explanations, providing Calibrated Explanations without any uncertainty information. These explanations are directly comparable to other feature importance explanation techniques like LIME.

  • Uncertainty explanations, providing Calibrated Explanations including uncertainty intervals to highlight both the importance of a feature and the amount of uncertainty connected with its estimated importance.

For the reasons given in previous sections, Calibrated Explanations is meant to use binary rules with factual explanations (even if all discretizers used by LIME can also be used by Calibrated Explanations). One noteworthy aspect of Calibrated Explanations is that the feature weights only show how each feature separately affects the outcome. It is possible to see combined weights through conjunctions of features (combining two or three different rules into a conjunctive feature rule). It is important to clarify that the feature weights do not convey the same meaning as in attribution-based explanations, like SHAP.

The third kind of plot is a counterfactual plot showing preliminary prediction estimates for each feature when alternative feature values are used.

Feature rules are always ordered based on feature weight, starting with the most impactful rules. When plotting Calibrated Explanations, the user can choose to limit the number of rules to show. Factual explanations have one rule per feature. Counterfactual explanations, where Calibrated Explanations creates as many counterfactual rules as possible, may result in a much larger number of rules, especially for categorical features with many categories.

Internally, Calibrated Explanations uses the same representation for both classification and regression. However, the plots visualizing the explanations have been adapted to suit both standard and probabilistic factual Calibrated Explanations for regression.

4.1.1 Calibrated explanations plots

The same kinds of plots exist for regression as for classification. Compared to the plots used for classification, the regression plots differ in two essential aspects.

A common difference for both factual and counterfactual Calibrated Explanations for regression is that the feature weights represent changes in actual target values. For factual Calibrated Explanations for regression, this means that a feature importance of \(+100\) indicates that the actual feature value contributes \(+100\) to the prediction. For counterfactual Calibrated Explanations for regression, showing the prediction estimates with uncertainty intervals, the plot shows what the prediction is estimated to have been if the counterfactual condition were fulfilled.

A difference that only applies to the factual plots is that the top of the plot omits the probabilities for the different classes and instead shows the median m and the confidence interval [l, h] as the prediction.

4.1.2 Probabilistic calibrated explanations plots

Probabilistic factual Calibrated Explanations for regression represents feature importances as probabilities, just like Calibrated Explanations for classification. The only difference needed for the probabilistic plots for regression compared to classification is to change the probabilities for a class label into probabilities for being below (\(\mathcal {P}(y \le t)\)) or above (\(\mathcal {P}(y > t)\)) the given threshold.

4.2 Experiments

The evaluation is divided into an introduction to all different kinds of Calibrated Explanations for regression through plots and an evaluation of performance. All plots are from the California Housing data set (Pace & Barry, 1997). The underlying model in all experiments is a RandomForestRegressor from the sklearn package.

Our proposed algorithm is claimed to be fast, reliable, stable, and robust. These claims require validation in an evaluation of performance. The explanations are reliable due to the validity of the uncertainty estimates used, i.e., the results achieved by querying the CPD, and from the uncertainty quantification of the feature weights or feature prediction estimates. Speed, stability and robustness will be evaluated in an experiment using the California Housing data set on a fixed set of test instances. Each experiment is repeated 100 times using 500 instances as a calibration set (also used by SHAP and LIME) and 10 test instances. The target values were normalized, i.e., \(y\in [0,1]\). The following setups are evaluated:

  • FCER: Factual explanation.

  • CCER: Counterfactual explanation.

  • PFCER: Probabilistic factual explanation. The threshold is 0.5 for all instances, i.e., the mid-point of the interval of possible target values.

  • PCCER: Probabilistic counterfactual explanation. The threshold is 0.5 for all instances, i.e., the mid-point of the interval of possible target values.

  • LIME: LIME explanation.

  • LIME CPS: LIME explanation using the median from a CPD as prediction. The CPS was based on the underlying random forest regressor.

  • SHAP: SHAP explanation using the Explainer class.

  • SHAP CPS: SHAP explanation using the median from a CPD as prediction. The CPS was based on the underlying random forest regressor. Here, the Explainer class was used.

The evaluated metrics are:

  • Stability means that multiple runs on the same instance and model should produce consistent results. Stability is evaluated by generating explanations for the same predicted instances 100 times with different random seeds (using the iteration counter as random seed). The random seed is used to initialize numpy.random.seed() and by the discretizers. The largest variance in feature weight (or feature prediction estimate) can be expected among the most important features (by definition having higher absolute weights). The top feature for each test instance is identified as the feature being most important most often over the 100 runs (i.e., the mode of the feature ranks defined by the absolute feature weight). The variance for the top feature is measured over the 100 runs and the mean variance among the test instances is reported.

  • Robustness means that small variations in the input should not result in large variations in the explanations. Robustness is measured in a similar way as stability, but with the training and calibration sets being randomly drawn and a new model being fitted for each run, creating a natural variation in the predictions of the same instances without having to construct artificial instances. Again, the variance of the top feature is used to measure robustness. The same setups as for stability are used, except that each run uses a new model and calibration set and that the random seed was set to 42 in all experiments.

  • Run time is compared between the setups with regard to explanation generation time (in seconds per instance). Only the method call resulting in an explanation is measured; any overhead in initiating the explainer class is not considered. The closest equivalent to probabilistic factual Calibrated Explanations for regression would be to apply LIME and SHAP for classification to a thresholded classification model, as described in Sect. 3.3. Since VA is comparably slow and probabilistic Calibrated Explanations for regression combines both CPSs and VA, with fitting and calls to a CPS or a VA for each calibration instance, it can be expected to be slow.

FCER and PFCER without normalization are compared with the LIME and SHAP alternatives. Additionally, run time is compared across both standard and probabilistic factual and counterfactual Calibrated Explanations with and without normalization. The difficulty estimation uses 500 randomly drawn instances from the training set. Stability and robustness are less affected by normalization.
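
As a concrete illustration of how the stability (and, analogously, robustness) metric is computed for a single test instance, the sketch below assumes a hypothetical helper get_feature_weights(seed) that returns the feature weights produced in one run; the reported result is the mean of this variance over the test instances.

import numpy as np

def top_feature_variance(get_feature_weights, n_runs=100):
    weights = np.array([get_feature_weights(seed) for seed in range(n_runs)])
    top_per_run = np.argmax(np.abs(weights), axis=1)   # most important feature per run
    top_feature = np.bincount(top_per_run).argmax()    # mode over the runs
    return np.var(weights[:, top_feature])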

5 Results

The results are divided into two categories: 1) a presentation of Calibrated Explanations through plots, explaining and showcasing a number of different available ways Calibrated Explanations can be used and viewed; and 2) an evaluation of performance with comparisons to LIME and SHAP.

5.1 Presentation of calibrated explanations through plots

In the following subsections, a number of introductory examples of Calibrated Explanations are given for regression. First, factual and counterfactual explanations for regression are shown, followed by factual and counterfactual explanations for probabilistic regression.

5.1.1 Factual calibrated explanations for regression

The regular plot in Fig. 4 illustrates the calibrated prediction of the underlying model as the solid red line in the top bar together with the \(90\%\) confidence interval in light red. As can be seen, the house price is predicted to be \(\approx\) $285K and, with \(90\%\) confidence, the price can be expected to be between [$215K-$370K]. Turning to the feature rules, the solid black line represents the median in the top bar. The rule condition is shown to the left and the actual instance value to the right of the lower plot area. The fact that this house is located more northbound (latitude > 34.26) has a large negative impact on the price (reducing it by \(\approx\) $95K). On the other hand, since the median income is a bit higher (median income > 3.52), the price is pushed up by about $60K. Housing median age and population are two more features that clearly impact the price negatively.

Fig. 4: A regular plot for the California Housing data set. The top bar illustrates the median (the red line) and a confidence interval (the light red area), defined by the \(5^{th}\) and the \(95^{th}\) percentiles. The subplot below visualizes the weights associated with each feature. The weights indicate how much that rule contributes to the prediction. Negative weights in red indicate a negative impact on the prediction whereas positive weights in blue indicate a positive impact (Color figure online)

When one-sided intervals are used instead, only the top-bar is affected compared to when using regular plots. Figures 5a and 5b illustrate an upper bounded and a lower bounded explanation for the same instance, with the identical feature rule subplot omitted. As can be seen, the median (solid red line) is the same as before, while the confidence interval stretches one entire side of the bar. The upper bound (\(\approx\) $330K in Fig. 5a) is lower and the lower bound (\(\approx\) $240K in Fig. 5b) is higher compared to the two-sided plot in Fig. 4.

Fig. 5: The top bars of one-sided plots with confidence intervals bounded by the \(90^{th}\) upper percentile (Fig. 5a) and the \(10^{th}\) lower percentile (Fig. 5b). The red solid line represents the median. The weights (and consequently the entire subplot visualizing weights) are the same for these one-sided explanations as in Fig. 4

Fig. 6: An uncertainty plot for the California Housing data set. The top bar is the same as in Fig. 4, showing the median and the \([5^{th},95^{th}]\) percentiles confidence interval. In the subplot below, the uncertainty of the weights is highlighted, using the \([5^{th},95^{th}]\) percentiles confidence interval in light red or blue for each feature. The weights still indicate how much that rule contributes to the prediction but with a confidence interval highlighting the span of uncertainty for the impact of the feature value and rule combined

Figure 6 illustrates an uncertainty plot for the same instance as before. When including uncertainty quantification in the plot, the feature importance has a light colored area corresponding to the span of possible contribution within the confidence used. The grey area surrounding the solid black line represents the same confidence interval as seen in the top bar.

As can be seen, the northbound location still has a large negative impact, but the span of uncertainty about exactly how large the impact is covers about $150K, falling approximately within the interval [-$180K, -$30K]. The fact that part of the line is solid in color indicates that we can expect this feature to impact the price by at least -$30K, given the selected confidence level. Looking at the other features, we can see that all of them include the median in the uncertainty interval, meaning that with 90% confidence, these features may impact the price in both directions. Obviously, median income is more likely to have a positive impact and housing median age a negative one. Since no normalization has been used in this example, all the intervals are similar in width.

Fig. 7: A counterfactual plot for the California Housing data set. The large lightest red area in the background is the confidence interval defined by the \(5^{th}\) and the \(95^{th}\) percentiles. Each row represents a counterfactual rule with an interval in darker red indicating what confidence intervals a breach according to the rule condition would result in. The confidence intervals for the counterfactual rules are also defined by the \(5^{th}\) and the \(95^{th}\) percentiles. The solid lines represent the median values (Color figure online)

5.1.2 Counterfactual calibrated explanations for regression

Turning to counterfactual Calibrated Explanations for regression, Fig. 7 shows a counterfactual plot for the same instance as before. Here, the solid line and the very light area behind it, spanning from top to bottom, represent the median and the confidence interval of the calibrated prediction of the underlying model (i.e., the same as in Fig. 4). This is the ground truth that all the counterfactual feature rules should be contrasted against.

Contrary to factual Calibrated Explanations for regression, none of the rules cover the instance values in the counterfactual plot. Instead, there are several examples of the same feature being present in multiple rules. Here the interpretation is that the solid line and the lighter red bar for each rule are the expected median and confidence interval achieved if the instance would have had values according to the rule. As an example, with everything else the same but median income > 6.28, the expected price would be \(\approx\) $405K with a confidence interval of [$340K, $490K]. It is also clear that if the house had been located further south (latitude < 36.7), the price would go up, and if it had been even further north (latitude > 37.6), the price would have gone down even further. It is worth noting that since the counterfactual rules presented in Fig. 7 exclude the instance values, whereas the factual rules in Figs. 4 and 6 include them, the ordering of features may be completely different between the explanations, despite explaining the same instance.

So far, all examples (using both factual and counterfactual explanations) have used a standard CPS to construct the explanations, with the result that all confidence intervals are almost equal-sized. In Fig. 8, a difficulty estimator based on the standard deviation of the targets of the k nearest neighbors is used. The normalization will affect both the calibration of the underlying model, creating confidence intervals with varying sizes between instances, and the feature intervals. A crude assumption regarding the width of the feature intervals is that when the calibration set contains fewer instances covering an alternative feature value, the feature intervals will tend to be larger due to less information, and vice versa. This is not the whole picture, however, as difficulty in this example is defined based on the standard deviation of the neighboring instances' target values. As can be seen in Fig. 8, normalized counterfactual explanations may generate rules resulting in both narrower and wider confidence intervals than the non-normalized rules.

Fig. 8

A normalized counterfactual plot comparable to Fig. 7, resulting in rules with varying interval widths as a consequence of the normalization. Difficulty is estimated as the standard deviation of the targets of the k nearest neighbors

Similarly to factual Calibrated Explanations for regression, counterfactual explanations can also be one-sided. Figure 9 shows an upper-bounded explanation with \(90\%\) confidence. The interpretation of the first rule is that, with everything else as before but median income > 6.28, the price will be below \(\approx\) $450K with 90% certainty. Since the same CPS is used, the median is still the same as for a two-sided explanation.

Fig. 9

A one-sided counterfactual plot for the California Housing data set. Confidence intervals are defined by the \(90^{th}\) upper percentile only. The interpretation is that with \(90\%\) certainty, the true value of the original instance will fall within the lightest red area. If the counterfactual rule had been true for each feature individually, the true value would fall within that feature’s darker red area with approximately \(90\%\) certainty

5.1.3 Probabilistic factual calibrated explanations for regression

Fig. 10

A regular probabilistic regression plot for the California Housing data set. The plot shows the probability of the prediction for this instance being above the given threshold ($250K in this case). The explanation is similar to a regular plot used in Calibrated Explanations for classification, with the main differences being that it shows the probabilities of being below or above the threshold and that these probabilities are given by the CPD

Figure 10 shows a regular probabilistic regression plot for the same instance as above. This plot makes use of the possibility of querying the CPD for the probability of the target being below or above a given threshold. In this case, the threshold is set to a house price of $250K. Here, median income > 3.52 contributes strongly to the probability that the target is above $250K.
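The threshold query underlying this plot can be illustrated with a standard CPS from the crepes package: the CPD of a test instance is evaluated at the threshold, which can be read, approximately, as the probability that the target lies at or below the threshold, with the complement giving the probability of being above it. The sketch below is illustrative only; the exact form and shape of the returned values may differ between crepes versions.

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from crepes import ConformalPredictiveSystem

# California Housing; the target is expressed in units of $100K.
X, y = fetch_california_housing(return_X_y=True)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=2000, random_state=42)
X_cal, X_test, y_cal, y_test = train_test_split(X_rest, y_rest, test_size=1500, random_state=42)

model = RandomForestRegressor(random_state=42).fit(X_train, y_train)

# Fit a standard (non-normalized) CPS on the calibration residuals.
cps = ConformalPredictiveSystem()
cps.fit(y_cal - model.predict(X_cal))

# Evaluate the CPD of the first test instance at the threshold $250K
# (2.5 in the data set's unit of $100K).
threshold = 2.5
p_below = cps.predict(model.predict(X_test[:1]), y=threshold)  # approx. P(y <= threshold)
p_above = 1 - p_below
print(p_below, p_above)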

Fig. 11

An uncertainty probabilistic regression plot for the same explanation as in Fig. 10. The plot includes uncertainties for the feature weights

In Fig. 11, the same explanation is shown with uncertainties. As can be seen, the size of the uncertainty varies considerably between features, depending on the calibration by the VA calibrator.

5.1.4 Probabilistic counterfactual calibrated explanations for regression

Figure 12 shows a normalized probabilistic counterfactual plot for the same instance. In this case, the normalization was based on the variance of the predictions of the trees in the random forest. The most influential rule relates to median income, with a lower income increasing the probability of a lower price. The normalization affects the feature probability estimates and confidence intervals and may consequently also result in a different ordering of rules.
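The idea behind this normalization can be illustrated directly: the difficulty of an instance is taken to be the spread of the individual trees' predictions, so instances on which the forest disagrees receive wider intervals. A minimal numpy sketch of this difficulty estimate is given below; in the implementation, a corresponding learner-based estimator is available through DifficultyEstimator in crepes.extras.

import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1000, random_state=42)
rf = RandomForestRegressor(random_state=42).fit(X_train, y_train)

# Difficulty per test instance: variance of the individual trees' predictions.
# Instances where the trees disagree are treated as more difficult and are
# therefore given wider intervals after normalization.
tree_predictions = np.array([tree.predict(X_test) for tree in rf.estimators_])
sigmas_test = tree_predictions.var(axis=0)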

Fig. 12

A normalized probabilistic counterfactual plot for the same instance as before

The final example, shown in Fig. 13, illustrates both conjunctive rules, combining two feature conditions in one rule, and normalization using the variance of the predictions of the trees in the random forest. Here, the number of rules to plot has been increased to 15. We see that conjunctive rules often result in more influential rules than single-condition rules, as illustrated by the majority of the plotted rules being conjunctive.

Fig. 13

A normalized probabilistic counterfactual plot with conjunctive rules for the same instance as before

Factual or counterfactual rules can be generated without normalization or with any of the normalization options available in DifficultyEstimator in crepes.extras. Conjunctive rules can be added at any time after the explanations have been generated. All the examples shown here are from the same instance and the same underlying model, to showcase a subset of the ways in which the proposed solutions can be used; a schematic overview of the corresponding calls is sketched below. Further examples can be found in the code repository.
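The sketch below summarizes the variants demonstrated in Sect. 5.1 in one place. The class, method, and argument names (CalibratedExplainer, explain_factual, explain_counterfactual, difficulty_estimator, low_high_percentiles, threshold, add_conjunctions, plot) mirror the functionality described in the text, but should be read as an assumed interface; the code repository remains the authoritative reference.

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from crepes.extras import DifficultyEstimator
from calibrated_explanations import CalibratedExplainer   # assumed import

# Proper training set, calibration set (500 instances) and test set.
X, y = fetch_california_housing(return_X_y=True)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=2000, random_state=42)
X_cal, X_test, y_cal, y_test = train_test_split(X_rest, y_rest, test_size=1500, random_state=42)
model = RandomForestRegressor(random_state=42).fit(X_train, y_train)

# Optional normalization: k-nearest-neighbor difficulty estimator.
de = DifficultyEstimator()
de.fit(X=X_train, y=y_train, k=25)

# Assumed constructor arguments.
explainer = CalibratedExplainer(model, X_cal, y_cal, mode='regression',
                                difficulty_estimator=de)
x = X_test[:1]

# Factual explanation with two-sided 5th/95th percentile intervals (Figs. 4 and 6).
factual = explainer.explain_factual(x)

# Counterfactual explanation for the same instance (Figs. 7 and 8).
counterfactual = explainer.explain_counterfactual(x)

# One-sided, upper-bounded intervals at 90% confidence (Fig. 9); argument name assumed.
one_sided = explainer.explain_counterfactual(x, low_high_percentiles=(-float('inf'), 90))

# Probabilistic explanations relative to a threshold of $250K, i.e., 2.5 in the
# data set's unit of $100K (Figs. 10-13); argument name assumed.
prob_factual = explainer.explain_factual(x, threshold=2.5)
prob_counterfactual = explainer.explain_counterfactual(x, threshold=2.5)

# Conjunctive rules can be added after generation (Fig. 13); method name assumed.
prob_counterfactual.add_conjunctions()
prob_counterfactual.plot()   # assumed plotting helper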

5.2 Performance evaluation

Table 2 shows the results achieved regarding stability, robustness, and run time. Stability is measured using the mean variance when constructing explanations for the same instance using different random seeds, with lower values representing higher stability. It is evident that both SHAP setups and both setups for standard regression must be considered stable, since the mean variance is 0 (i.e., less than \(1e-31\)). LIME and probabilistic regression, on the other hand, have a non-negligible mean variance, meaning that they are not, in comparison, as stable. The reason why probabilistic regression is less stable is the sensitivity of the probabilities derived from the CPD: a relatively small change in prediction can easily result in a comparably much larger change in the probability of exceeding the threshold, especially if the target is close to the threshold (which is set to 0.5, i.e., the mid-point of the interval of possible target values). LIME and SHAP explanations using the median from a CPD result in similar stability levels as when using the underlying model.

Robustness is measured in a similar way as stability, but with a new model trained using different distributions of training and calibration instances between each run. The results on robustness should be seen in relation to the variance in predictions from the underlying model on the same instances: if the predictions that the explanations are based on fluctuate, then a somewhat similar degree of fluctuation can be expected in the feature weights as well, since these are defined using the predictions (the mean prediction variance is \(9.1e-5\)). All setups for Calibrated Explanations have higher mean variance compared to LIME and SHAP (i.e., are less robust). However, the explanations produced by the setups for Calibrated Explanations do not rely solely on the crisp feature weight used to measure the mean variance (i.e., the robustness metric) but also include the uncertainty interval, highlighting the degree of uncertainty associated with each feature weight.
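Both metrics are mean variances of the feature weights across repeated explanation runs; the only difference is what is varied between runs (the random seed of the explanation method for stability, the training/calibration split of the underlying model for robustness). A minimal sketch of the computation is given below, assuming a generic, hypothetical explain callable that returns one weight per feature and instance.

import numpy as np

def mean_weight_variance(explain, n_runs=10):
    """Mean variance of feature weights across repeated runs.

    'explain' is a callable taking a run index and returning an array of
    feature weights of shape (n_instances, n_features). For stability, only
    the random seed of the explanation method is varied between runs; for
    robustness, the model is retrained on a new train/calibration split
    before each run.
    """
    weights = np.array([explain(run) for run in range(n_runs)])
    # Variance over runs, averaged over instances and features.
    return weights.var(axis=0).mean()

# Illustrative usage with a dummy explanation function.
dummy_explain = lambda run: np.random.default_rng(run).normal(scale=1e-3, size=(100, 8))
print(mean_weight_variance(dummy_explain))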

Table 2 Evaluation of stability, robustness and run time (s)

Regarding run time, all setups have used the same calibration set of 500 instances, including LIME and SHAP. Of LIME and SHAP, LIME is the faster, and for LIME the difference between explaining the underlying model and using a CPS is small. For SHAP, the corresponding difference is fairly large. All setups using Calibrated Explanations are faster than both LIME and SHAP.

Table 3 Run time (s) for different kinds of explanations and normalization

Table 3 shows the average time in seconds per instance for creating an explanation with and without normalization for the different kinds of Calibrated Explanations. The most striking result is that using normalization adds a substantial overhead compared to not using normalization: on average a \(2-9\) times increase in run time. Counterfactual explanations are slightly more costly than factual ones, which is not surprising as they generally generate a larger number of rules. Standard explanations are slightly less than twice as fast as probabilistic explanations without normalization. With normalization, the difference is much smaller, stemming from the fact that only half the calibration set needs normalization, as the other half is used by VA to calibrate the probabilities.

Detailed results comparing stability and robustness for different kinds of difficulty estimation are not included, as the differences compared to not using normalization (see Table 2) are small. These results can be found in the evaluation/regression folder in the repository.

6 Concluding discussion

This paper extends Calibrated Explanations, previously introduced for classification, with support for regression. Two primary use cases are identified: standard regression and probabilistic regression, i.e., measuring the probability of exceeding a threshold. The proposed solution relies on Conformal Predictive Systems (CPS), making it possible to meet the different requirements of the two identified use cases. The proposed solutions provide access to factual and counterfactual explanations with the possibility of conveying uncertainty quantification for the feature rules, just like Calibrated Explanations for classification.

In the paper, the solutions have been demonstrated using several plots, showcasing some of the many ways in which the proposed solutions can be used. Furthermore, the paper also includes a comparison with some of the best-known state-of-the-art explanation methods (LIME and SHAP). The results demonstrate that the proposed solution for standard regression is fast, stable, and robust. The suggested solution is considered reliable for two reasons: 1) the calibration of the underlying model, and 2) the uncertainty quantification, highlighting the degree of uncertainty of both prediction and feature weights.

The solution proposed for building probabilistic explanations for regression does not share all the benefits seen for standard regression. Its performance is comparable to that of LIME, even if it is slightly faster. The main strength of this solution is that it provides the possibility of getting probabilistic explanations in relation to an arbitrary threshold from any standard regression model, without having to impose any restrictions on that model.

6.1 Future work

There are several directions for future work. An interesting area to look into is how this technique can be adapted to explanations of time-series problems. How to capture and convey the dependency between different time steps poses an interesting challenge.

There is also room for improvement regarding plot layout. Providing additional ways of visualization is a natural future development. This involves implementing support for explanations of image and text predictions, even if these improvements are more closely connected to classification problems.

Another direction for future work is to look into probabilistic explanations of the form \(\mathcal{P}(t_1 < y \le t_2)\). Such predictions would complement the interval predictions provided by CPS by allowing the user to specify the upper and lower bounds of the uncertainty interval and provide the probability of the true target being inside that interval.
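Since the CPD acts as a calibrated cumulative distribution function over possible target values, such a two-sided probability could in principle be obtained from two queries to the same CPD:

\[
\mathcal{P}(t_1 < y \le t_2) \;=\; Q(t_2) - Q(t_1),
\]

where \(Q\) denotes the CPD of the instance, evaluated at the two user-specified thresholds \(t_1 < t_2\).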

Currently, the average calibrated value is used to define the feature weights in Equations (1), (2), and (3) (see Sect. 2.5.1). There are alternatives to taking the average of the perturbed instances for a specific feature, and there is room for theoretical analysis of how the feature weights should be calculated to provide the best insights.

Finally, run time could probably be decreased by implementing the core in C++ or by relying on fast languages that are able to run Python code more efficiently, e.g., Mojo.