
Theory and algorithms for learning with rejection in binary classification

Published in: Annals of Mathematics and Artificial Intelligence

Abstract

We introduce a novel framework for classification with a rejection option that consists of simultaneously learning two functions: a classifier along with a rejection function. We present a full theoretical analysis of this framework including new data-dependent learning bounds in terms of the Rademacher complexities of the classifier and rejection families as well as consistency and calibration results. These theoretical guarantees guide us in designing new algorithms that can exploit different kernel-based hypothesis sets for the classifier and rejection functions. We compare our general framework with the special case of confidence-based rejection for which we also devise alternative loss functions and algorithms. We report the results of several experiments showing that our kernel-based algorithms can yield a notable improvement over the best existing confidence-based rejection algorithm.


Data availability statement

The data that support the findings of this study are publicly available at the UCI Data repository: https://archive.ics.uci.edu/ml/index.php

References

  1. Bartlett, P., Wegkamp, M.: Classification with a reject option using a hinge loss. J. Mach. Learn. Res. (2008)


  2. Beygelzimer, A., Langford, J., Ravikumar, P.: Error correcting tournaments. arXiv preprint (2008)

  3. Beygelzimer, A., Dani, V., Hayes, T., Langford, J., Zadrozny, B.: Error limiting reductions between classification tasks. In: International conference on machine learning (2005)

  4. Bounsiar, A., Grall, E., Beauseroy, P.: Kernel based rejection method for supervised classification. In: World academy of science, engineering and technology (2007)

  5. Cao, Y., Cai, T., Feng, L., Gu, L., Gu, J., An, B., Niu, G., Sugiyama, M.: Generalizing consistent multi-class classification with rejection to be compatible with arbitrary losses. In: Advances in neural information processing systems (2022)

  6. Capitaine, H.L., Frelicot, C.: An optimum class-rejective decision rule and its evaluation. In: International conference on pattern recognition (2010)

  7. Chaudhuri, K., Zhang, C.: Beyond disagreement-based agnostic active learning. In: Neural information processing systems (2014)

  8. Chow, C.K.: An optimum character recognition system using decision functions. IEEE Trans. Comput. (1957)


  9. Chow, C.K.: On optimum recognition error and reject trade-off. IEEE Trans. Comput. (1970)


  10. CVX Research, Inc.: CVX: Matlab software for disciplined convex programming, version 2.0 (2012)

  11. DeSalvo, G., Mohri, M., Syed, U.: Learning with deep cascades. In: Algorithmic learning theory (2015)

  12. Dubuisson, B., Masson, M.: Statistical decision rule with incomplete knowledge about classes. Pattern Recogn. (1993)

  13. El-Yaniv, R., Wiener, Y.: On the foundations of noise-free selective classification. J. Mach. Learn. Res. (2010)


  14. El-Yaniv, R., Wiener, Y.: Agnostic selective classification. In: Neural information processing systems (2011)

  15. Elkan, C.: The foundations of cost-sensitive learning. In: International joint conference on artificial intelligence (2001)

  16. Freund, Y., Mansour, Y., Schapire, R.: Generalization bounds for averaged classifiers. Ann. Stat. (2004)


  17. Fumera, G., Roli, F.: Support vector machines with embedded reject option. In: International conference on pattern recognition (2002)

  18. Fumera, G., Roli, F., Giacinto, G.: Multiple reject thresholds for improving classification reliability. In: International conference on advances in pattern recognition (2000)

  19. Grandvalet, Y., Keshet, J., Rakotomamonjy, A., Canu, S.: Support vector machines with a reject option. In: Neural information processing systems (2008)

  20. Herbei, R., Wegkamp, M.: Classification with reject option. Can. J. Stat. (2005)


  21. Koltchinskii, V., Panchenko, D.: Empirical margin distributions and bounding the generalization error of combined classifiers. Ann. Stat. (2002)


  22. Landgrebe, T., Tax, D., Paclik, P., Duin, R.: Interaction between classification and reject performance for distance-based reject-option classifiers. Pattern Recogn. Lett. (2005)


  23. Langford, J., Beygelzimer, A.: Sensitive error correcting output codes. In: Conference on learning theory (2005)

  24. Ledoux, M., Talagrand, M.: Probability in Banach Spaces: Isoperimetry and Processes. Springer, New York (1991)


  25. Lin, H.-T.: Reduction from cost-sensitive multiclass classification to one-versus-one binary classification. In: Asian conference on machine learning (2014)

  26. Littman, M., Li, L., Walsh, T.: Knows what it knows: A framework for self-aware learning. In: International conference on machine learning (2008)

  27. Long, P.M., Servedio, R.A.: Consistency versus realizable H-consistency for multiclass classification. In: International conference on machine learning (2013)

  28. Mao, A., Mohri, C., Mohri, M., Zhong, Y.: Two-stage learning to defer with multiple experts. In: NeurIPS (2023)

  29. Mao, A., Mohri, M., Zhong, Y.: Predictor-rejector multi-class abstention: Theoretical analysis and algorithms. CoRR to appear (2023a)

  30. Mao, A., Mohri, M., Zhong, Y.: Theoretically grounded loss functions and algorithms for score-based multi-class abstention. CoRR to appear (2023b)

  31. Melvin, I., Weston, J., Leslie, C.S., Noble, W.S.: Combining classifiers for improved classification of proteins from sequence or structure. BMC Bioinformatics (2008)

  32. Mohri, C., Andor, D., Choi, E., Collins, M.: Learning to reject with a fixed predictor: Application to decontextualization. CoRR abs/2301.09044 (2023)

  33. Mohri, M., Rostamizadeh, A., Talwalkar, A.: Foundations of Machine Learning. The MIT Press, Cambridge (2012)


  34. Mozannar, H., Sontag, D.: Consistent estimators for learning to defer to an expert. In: International conference on machine learning, pp. 7076–7087 (2020)

  35. Mozannar, H., Lang, H.,Wei, D., Sattigeri, P., Das, S., Sontag, D.: Who should predict? exact algorithms for learning to defer to humans. In: International conference on artificial intelligence and statistics, pp. 10520–10545 (2023)

  36. Narasimhan, H., Menon, A.K., Jitkrittum, W., Kumar, S.: Learning to reject meets ood detection: Are all abstentions created equal? arXiv preprint arXiv:2301.12386 (2023)

  37. Pereira, C.S., Pires, A.: On optimal reject rules and ROC curves. Pattern Recogn. Lett. (2005)


  38. Pietraszek, T.: Optimizing abstaining classifiers using ROC. In: International conference on machine learning (2005)

  39. Ramaswamy, H., Agarwal, S.: Convex calibration dimension for multiclass loss matrices. J. Mach. Learn. Res. (2016)

  40. Tax, D., Duin, R.: Growing a multi-class classifier with a reject option. Pattern Recogn. Lett. (2008)

  41. Tortorella, F.: An optimal reject rule for binary classifiers. In: International conference on advances in pattern recognition (2001)

  42. Trapeznikov, K., Saligrama, V.: Supervised sequential classification under budget constraints. In: Artificial intelligence and statistics (2013)

  43. Tu, H.-H., Lin, H.-T.: One-sided support vector regression for multiclass cost-sensitive classification. In: International conference on machine learning (2010)

  44. Wang, J., Trapeznikov, K., Saligrama, V.: An LP for sequential learning under budgets. In: Artificial intelligence and statistics (2014)

  45. Yuan, M., Wegkamp, M.: Classification methods with reject option based on convex risk minimization. J. Mach. Learn. Res. (2010)

  46. Yuan, M., Wegkamp, M.: SVMs with a reject option. Bernoulli (2011)

  47. Zadrozny, B., Langford, J., Abe, N.: Cost sensitive learning by cost-proportionate example weighting. In: IEEE International conference on data mining (2003)

  48. Zhang, C., Chaudhuri, K.: The extended Littlestone’s dimension for learning with mistakes and abstentions. In: Conference on learning theory (2016)


Funding

None

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Giulia DeSalvo.

Ethics declarations

Competing Interests

None.


Appendices

Appendix A Consistency of convex surrogates

In this appendix, we derive two theorems about the consistency of the convex surrogate, \(L_{\text {MH}}\). The first theorem shows that the convex surrogate is calibrated with respect to the Bayes solution and the second theorem upper bounds the excess risk of the rejection loss by the excess risk of the surrogate loss. For both theorems, we will analyze the expected surrogate loss, which can be written in terms of \(\eta (x)\):

$$\begin{aligned} \mathop {\mathrm {\mathbb {E}}}\limits _{(x, y) \sim \mathcal {D}} [L_{\text {MH}}( h, r, x, y)] = \mathop {\mathrm {\mathbb {E}}}\limits _{x} \left[ \eta (x)\phi ( - h(x), r(x)) + (1 - \eta (x)) \phi (h(x), r(x)) \right] , \end{aligned}$$
(1)

where \(\phi ( - h(x), r(x)) = \max \left( 1 + \frac{1}{2} \left( r(x) - h(x)\right) , c \, \left( 1 - \frac{1}{1 - 2c} r(x)\right) , 0\right) \). For simplicity, we also define

$$\begin{aligned} \mathcal {L}_\phi (\eta (x), h(x), r(x)) = \eta (x) \phi ( - h(x), r(x)) + (1 - \eta (x))\phi (h(x), r(x)). \end{aligned}$$
(2)
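For concreteness, \(\phi \) and \(\mathcal {L}_\phi \) are easy to evaluate numerically. The following minimal sketch (plain Python; the function names are our own) implements the two definitions above and checks \(\mathcal {L}_\phi \) at two specific points:

```python
def phi(u, v, c):
    # phi(u, v) = max(1 + (v + u)/2, c * (1 - v / (1 - 2c)), 0),
    # so that phi(-h(x), r(x)) matches the definition above.
    return max(1.0 + 0.5 * (v + u), c * (1.0 - v / (1.0 - 2.0 * c)), 0.0)

def L_phi(eta, h, r, c):
    # Conditional expected surrogate loss from Eq. (2).
    return eta * phi(-h, r, c) + (1.0 - eta) * phi(h, r, c)

c, eta = 0.2, 0.3
# At (h, r) = (3 - 2c, 1 - 2c), the eta-weighted branch vanishes and the
# loss reduces to (3 - 2c) * (1 - eta).
assert abs(L_phi(eta, 3 - 2*c, 1 - 2*c, c) - (3 - 2*c) * (1 - eta)) < 1e-9
# At (h, r) = (0, 2(c - 1)(1 - 2c)), both branches of the max agree and
# the loss equals (3 - 2c) * c, independently of eta.
assert abs(L_phi(eta, 0.0, 2*(c - 1)*(1 - 2*c), c) - (3 - 2*c) * c) < 1e-9
```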

The idea behind the proof of the first theorem below is to find the minimizer of \((u, v) \mapsto \mathcal {L}_\phi (\eta (x), u, v)\) for any fixed x in order to then re-write it in terms of the infimum of the expected rejection loss.

Theorem 3

Let \((h_\text {M}^*, r_\text {M}^*)\) denote a pair attaining the infimum of the expected surrogate loss, \(\mathop {\mathrm {\mathbb {E}}}\limits _{(x, y)} [ L_{\text {MH}}( h_\text {M}^*, r_\text {M}^*, x, y) ] = \inf _{(h, r )\in \text {meas}} \mathop {\mathrm {\mathbb {E}}}\limits _{(x, y)} [L_{\text {MH}}( h, r, x, y)]\). Then, for \(\beta = \frac{1}{1 - 2c}\) and \(\alpha = 1\),

  1. the surrogate loss \(L_{\text {MH}}\) is calibrated with respect to the Bayes classifier: \({{\,\textrm{sign}\,}}(h^*) = {{\,\textrm{sign}\,}}(h_\text {M}^*)\) and \({{\,\textrm{sign}\,}}(r^*) = {{\,\textrm{sign}\,}}(r_\text {M}^*)\);

  2. furthermore, the following equality holds for the infima over pairs of measurable functions:

    $$\begin{aligned} \inf _{(h, r)} \mathop {\mathrm {\mathbb {E}}}\limits _{(x, y) \sim \mathcal {D}} [ L_{\text {MH}}( h, r, x, y) ] = (3 - 2c) \inf _{(h, r)} \mathop {\mathrm {\mathbb {E}}}\limits _{(x, y) \sim \mathcal {D}} [ L( h, r, x, y) ]. \end{aligned}$$

Proof

Since the infimum of the expected surrogate loss is over all measurable functions, to determine \((h_\text {M}^*, r_\text {M}^*)\) it suffices to find, for any fixed x, the minimizer of \((u, v) \mapsto \mathcal {L}_\phi (\eta (x), u, v)\). For a fixed x, minimizing \(\mathcal {L}_\phi (\eta (x), u, v)\) with respect to \((u, v)\) is equivalent to solving seven LPs. One can check that the optimal points of these LPs are in the set \((u, v) \in \{ (0, (2c - 2)(1 - 2c)), (3 - 2c, 1 - 2c), ( - 3 + 2c, 1 - 2c) \}\). Evaluating \(\mathcal {L}_\phi (\eta (x), u, v)\) at these points, we find that

$$\begin{aligned}{} & {} \mathcal {L}_\phi (\eta (x), 3 - 2c, 1 - 2c) = (3 - 2c) (1 - \eta (x))\\{} & {} \mathcal {L}_\phi (\eta (x), - 3 + 2c, 1 - 2c) = (3 - 2c) \eta (x)\\{} & {} \mathcal {L}_\phi (\eta (x), 0, (2c - 2)(1 - 2c)) = (3 - 2c) c. \end{aligned}$$

Thus, we can conclude that the minimum of \(\mathcal {L}_\phi (\eta (x), u, v)\) equals \((3 - 2c) [ \eta (x) 1_{\eta (x) < c} + c1_{c\le \eta (x) \le 1 - c} + (1 - \eta (x) ) 1_{\eta (x) > 1 - c} ]\), which completes the proof. Below, for completeness, we show how to solve three of these LPs where \(\mathcal {L}_\phi (\eta (x), h, r) = 0\), \(\mathcal {L}_\phi (\eta (x), h, r) = c\left( 1 - \tfrac{1}{1 - 2c}r\right) \), and \(\mathcal {L}_\phi (\eta (x), h, r) = \eta (x)\left( 1 + \frac{1}{2}(r - h)\right) + (1 - \eta (x))\left( 1 + \frac{1}{2}(r + h)\right) \).

  1. For \(\mathcal {L}_\phi (\eta (x), h, r) = 0\), we have the following optimization problem

    $$\begin{aligned}{} & {} \min _{(h, r)} 0\\{} & {} \text {subject to: } c\left( 1 - \frac{1}{1 - 2c}r\right) \le 0, 1 + \frac{1}{2}(r - h)\le 0, 1 + \frac{1}{2}(r + h) \le 0 \end{aligned}$$

    Now the constraint \(c\left( 1 - \frac{1}{1 - 2c}r\right) \le 0\) implies that \(r\ge 1 - 2c > 0\). If we sum the remaining constraints \(1 + \frac{1}{2}(r - h)\le 0, 1 + \frac{1}{2}(r + h) \le 0\), they imply that \(r\le - 2\). Thus, this LP is not feasible.

  2. For \(\mathcal {L}_\phi (\eta (x), h, r) = c(1 - \frac{1}{1 - 2c}r)\), we have the following optimization problem

    $$\begin{aligned}{} & {} \min _{(h, r)} c\left( 1 - \frac{1}{1 - 2c}r\right) \\{} & {} \text {subject to: } c\left( 1 - \frac{1}{1 - 2c}r\right) \ge 0, 1 + \frac{1}{2}(r - h) \le c\left( 1 - \frac{1}{1 - 2c}r\right) , \\{} & {} \hspace{20mm} 1 + \frac{1}{2}(r + h) \le c\left( 1 - \frac{1}{1 - 2c}r\right) \end{aligned}$$

    Summing the last two constraints and solving for r, we have that \(r\le 2(c - 1)(1 - 2c)\le 0\). Since in this optimization problem we want to maximize r in order to minimize the objective, we can conclude that \(r_M^* = 2(c - 1)(1 - 2c)\) and that \(h_M^* = 0\).

  3. For \(\mathcal {L}_\phi (\eta (x), h, r) = \eta (x)(1 + \frac{1}{2}(r - h)) + (1 - \eta (x))(1 + \frac{1}{2}(r + h))\), we have the following problem

    $$\begin{aligned}{} & {} \min _{(h, r)} \eta (x)\left( 1 + \frac{1}{2}(r - h)\right) + (1 - \eta (x))\left( 1 + \frac{1}{2}(r + h)\right) \\{} & {} \text {subject to: }1 + \frac{1}{2}(r - h) \ge 0, 1 + \frac{1}{2}(r - h) \ge c\left( 1 - \frac{1}{1 - 2c}r\right) , \\{} & {} \hspace{17.6mm} 1 + \frac{1}{2}(r + h) \ge 0, 1 + \frac{1}{2}(r + h) \ge c\left( 1 - \frac{1}{1 - 2c}r\right) , \end{aligned}$$

    By simplifying the constraints, we can see that the feasibility region of the optimization problem lies between the lines \(2 + r \ge h\) and \(h \ge - (2 + r)\) and between \(2(1 - c) + \frac{1}{1 - 2c}r\ge h\) and \(h\ge - \left( 2(1 - c) + \frac{1}{1 - 2c}r\right) \). Notice that \( - (2 + r) = 2 + r\) at \(r = - 2\) and that \( - \left( 2(1 - c) + \frac{1}{1 - 2c}r\right) = 2(1 - c) + \frac{1}{1 - 2c}r\) at \(r = 2(c - 1)(1 - 2c)\). Since \( - 2 \le 2(c - 1)(1 - 2c)\) for \(0< c < 0.5\), we have that \(r = - 2\) is not in the feasibility region of the optimization problem. Thus, one of the optimality points is at \(r = 2(c - 1)(1 - 2c)\) and \(h = 0\). We also have that \(2 + r = 2(1 - c) + \frac{1}{1 - 2c}r\) at the point \(r = 1 - 2c\) and \(h = 3 - 2c\). Similarly, \( - (2 + r) = - \left( 2(1 - c) + \frac{1}{1 - 2c}r\right) \) at the point \(r = 1 - 2c\) and \(h = - (3 - 2c)\). Thus, all the optimality points are in the set \((h, r)\in \{ (0, 2(c - 1)(1 - 2c)), (3 - 2c, 1 - 2c), ( - (3 - 2c), 1 - 2c) \}\).
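The case analysis above can be sanity-checked numerically: evaluating \(\mathcal {L}_\phi \) at the three candidate points and over a coarse grid should recover the claimed minimum \((3 - 2c)\min (\eta (x), c, 1 - \eta (x))\), which is a compact rewriting of \((3 - 2c)\mathcal {L}^*(\eta (x))\). A sketch (plain Python; the names are our own):

```python
def phi(u, v, c):
    # max(1 + (v + u)/2, c * (1 - v / (1 - 2c)), 0), with u playing -h(x)
    return max(1.0 + 0.5 * (v + u), c * (1.0 - v / (1.0 - 2.0 * c)), 0.0)

def L_phi(eta, h, r, c):
    # conditional expected surrogate loss, Eq. (2)
    return eta * phi(-h, r, c) + (1.0 - eta) * phi(h, r, c)

def claimed_min(eta, c):
    # (3 - 2c) * L*(eta), with L*(eta) = min(eta, c, 1 - eta)
    return (3.0 - 2.0 * c) * min(eta, c, 1.0 - eta)

c = 0.2
# the three candidate optimal points found by solving the LPs above
cands = [(3 - 2*c, 1 - 2*c), (-(3 - 2*c), 1 - 2*c),
         (0.0, 2*(c - 1)*(1 - 2*c))]
grid = [-4.0 + 0.05 * k for k in range(161)]
for eta in (0.05, 0.3, 0.7, 0.95):
    best_cand = min(L_phi(eta, h, r, c) for h, r in cands)
    grid_min = min(L_phi(eta, h, r, c) for h in grid for r in grid)
    # the best candidate matches the claimed minimum, and no grid point
    # does better
    assert abs(best_cand - claimed_min(eta, c)) < 1e-9
    assert grid_min >= claimed_min(eta, c) - 1e-9
```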

We now provide a proof of the excess risk bound of our convex surrogate. It consists of analyzing three cases when \(\mathcal {L}^*(\eta (x)) = c\), \(\mathcal {L}^*(\eta (x)) = \eta (x)\), and \(\mathcal {L}^*(\eta (x)) = 1 - \eta (x)\) and then using the calibration results of the previous theorem.

Theorem 4

Let \(R_M (h, r) = \mathop {\mathrm {\mathbb {E}}}\limits _{(x, y) \sim \mathcal {D}}[ L_{\text {MH}}(h, r, x, y)]\) denote the expected surrogate loss of a pair \((h, r)\). Then, the excess risk of \((h, r)\) is upper bounded by its surrogate excess risk as follows:

$$\begin{aligned} R(h, r) - R^* \le \frac{1}{(1 - c)(1 - 2c)} \left( R_M(h, r) - R^*_M \right) . \end{aligned}$$

Proof

Conditioning on the label y and using the fact that the infimum is over all measurable functions, we can switch the infimum and expectation as follows:

$$\begin{aligned} R(h, r) - R(h^*, r^*) = \mathop {\mathrm {\mathbb {E}}}\limits _x\big [ (\eta (x) - \mathcal {L}^*(\eta (x))) 1_{{{\,\textrm{sgn}\,}}(h)\ne 1, r> 0} + (1 - \eta (x) - \mathcal {L}^*(\eta (x))) 1_{ {{\,\textrm{sgn}\,}}(h)\ne - 1, r > 0} + (c - \mathcal {L}^*(\eta (x))) 1_{r\le 0 } \big ] \end{aligned}$$
(3)

where \(\mathcal {L}^*(\eta (x)) = \eta (x) 1_{\eta (x) < c} + c 1_{c\le \eta (x) \le 1 - c} + (1 - \eta (x))1_{\eta (x) > 1 - c} \). We can thus focus on minimizing the components inside the expectation for a fixed x. From the calibration theorem, we have that \( \mathcal {L}_\phi ^*(\eta (x)) = (3 - 2c) \mathcal {L}^*(\eta (x))\). Since \(\mathcal {L}^*(\eta (x))\) admits three values, we can consider the following three cases: \(\mathcal {L}^*(\eta (x)) = c\), \(\mathcal {L}^*(\eta (x)) = \eta (x)\), and \(\mathcal {L}^*(\eta (x)) = 1 - \eta (x)\). Below, we describe one such case; the remaining cases can be analyzed by similar reasoning. When \(c\le \eta (x) \le 1 - c\), we have that \(\mathcal {L}^*(\eta (x)) = c\) and so \(r^*\le 0\). Since by the calibration theorem \({{\,\textrm{sign}\,}}({r^*}) = {{\,\textrm{sign}\,}}({ r^*_M})\), we have that \(r^*_M \le 0\) as well as \(\mathcal {L}_\phi ^*(\eta (x)) = (3 - 2c)c\). In this case, (3) can be written as \(R(h, r) - R(h^*, r^*) = \mathop {\mathrm {\mathbb {E}}}\limits _x \big ( (\eta (x) - c) 1_{{{\,\textrm{sgn}\,}}(h)\ne 1, r> 0} + (1 - \eta (x) - c) 1_{ {{\,\textrm{sgn}\,}}(h)\ne - 1, r > 0} \big )\). Note that the indicator functions on the right-hand side are mutually exclusive, so it suffices to show that each component is bounded. For \(\eta (x)\) and c satisfying \(c\le \eta (x) \le 1 - c\), we have \((\eta (x) - c) 1_{{{\,\textrm{sgn}\,}}(h)\ne 1, r> 0} \le (\eta (x) - c) 1_{h < 0, r > 0} \) and \((1 - \eta (x) - c) 1_{{{\,\textrm{sgn}\,}}(h)\ne - 1, r> 0}\le (1 - \eta (x) - c) 1_{h \ge 0, r > 0}\). Thus, for the first component, we want to show that

$$\begin{aligned} (\eta (x) - c) 1_{h< 0, r> 0} \le \frac{1}{(1 - c)(1 - 2c)} \left( \mathcal {L}_\phi (\eta (x), h, r) - (3 - 2c)c \right) 1_{h < 0, r > 0} \end{aligned}$$

and for the second component, we want to show that

$$\begin{aligned} (1 - \eta (x) - c) 1_{h \ge 0, r> 0} \le \frac{1}{(1 - c)(1 - 2c)} \left( \mathcal {L}_\phi (\eta (x), h, r) - (3 - 2c)c \right) 1_{h \ge 0, r > 0}. \end{aligned}$$

We will prove that the bound holds for each component if there exists a constant \(\kappa > 0\) such that the inequality \( 1 - 2c \le \kappa \big ( 1 - (3 - 2c)c \big )\) holds. Since \(1 - (3 - 2c)c = (1 - c)(1 - 2c)\), so that \(\frac{1 - 2c}{1 - (3 - 2c)c} = \frac{1}{1 - c}\), we can take \(\kappa = \frac{1}{1 - c}\). Now since \(\frac{1}{(1 - c)(1 - 2c)}\ge \frac{1}{1 - c}\), we have the inequality of the theorem in this case.

Focusing on the second component, we proceed by first finding the minimum of \(\mathcal {L}_\phi (\eta (x), h, r)1_{h\ge 0, r > 0} \) and then showing that the inequality is satisfied. The optimality points of minimizing \(\mathcal {L}_\phi (\eta (x), h, r)1_{h\ge 0, r > 0} \) are \((h, r)\in \{(3 - 2c, 1 - 2c), (0, 0), (2(1 - c), 0), (0, 1 - 2c) \}\). Evaluating \( \mathcal {L}_{\phi }\) at these optimal points, we have that \( \mathcal {L}_{\phi }(\eta (x), 3 - 2c, 1 - 2c) = (3 - 2c)(1 - \eta (x)) \), \( \mathcal {L}_{\phi }(\eta (x), 0, 1 - 2c) = \frac{3}{2} - c\), \( \mathcal {L}_{\phi }(\eta (x), 0, 0) = 1\), and \( \mathcal {L}_{\phi }(\eta (x), 2(1 - c), 0) = 1 + (1 - 2\eta (x))(1 - c)\). Now since \((3 - 2c)(1 - \eta (x))\ge 1 + (1 - 2\eta (x))(1 - c)\) for \(c\le \eta (x) \le 1 - c\) and since \(\frac{3}{2} - c > 1\) for \(c < \frac{1}{2}\), we can exclude \((3 - 2c, 1 - 2c)\) and \((0, 1 - 2c)\). Thus, depending on the value of \(\eta (x)\), the minimum is attained at \( \mathcal {L}_{\phi }(\eta (x), 0, 0) = 1\) or at \(\mathcal {L}_{\phi }(\eta (x), 2(1 - c), 0) = 1 + (1 - 2\eta (x))(1 - c)\). For \(\mathcal {L}_{\phi }(\eta (x), 2(1 - c), 0) = 1 + (1 - 2\eta (x))(1 - c)\), the inequality \(1 - \eta (x) - c \le 1 + (1 - 2\eta (x))(1 - c) - (3 - 2c)c\) holds for all \(c\le \eta (x) \le 1 - c\). While for \( \mathcal {L}_{\phi }(\eta (x), 0, 0) = 1\), since \(c\le \eta (x)\), we have that \(1 - \eta (x) - c \le 1 - 2c \le \kappa \big ( 1 - (3 - 2c)c \big )\) holds.

Now for the first component, we again proceed by first finding the minimum of \(\mathcal {L}_{\phi }(\eta (x), h, r)1_{h < 0, r > 0}\) and then by showing that the inequality is satisfied. By similar reasoning as in the calibration theorem, the optimality points are \((h, r)\in \{ ( - (3 - 2c), 1 - 2c), (0, 0), ( - 2(1 - c), 0), (0, 1 - 2c) \}\). Evaluating \( \mathcal {L}_{\phi }\) at these points, we have that \(\mathcal {L}_{\phi }(\eta (x), - (3 - 2c), 1 - 2c) = (3 - 2c)\eta (x) \), \( \mathcal {L}_{\phi }(\eta (x), 0, 1 - 2c) = \frac{3}{2} - c\), \( \mathcal {L}_{\phi }(\eta (x), 0, 0) = 1\), and \(\mathcal {L}_{\phi }(\eta (x), - 2(1 - c), 0) = \eta (x)(2 - c) + (1 - \eta (x))c\). By similar reasoning as above, we can again exclude the points \(( - (3 - 2c), 1 - 2c)\) and \((0, 1 - 2c)\). Depending on the value of \(\eta (x)\), the minimum is attained at \(\mathcal {L}_\phi (\eta (x), 0, 0) = 1\) or at \(\mathcal {L}_{\phi }(\eta (x), - 2(1 - c), 0) = \eta (x)(2 - c) + (1 - \eta (x))c\). For all \(c\le \eta (x) \le 1 - c\), the inequality \( \eta (x) - c \le \eta (x)(2 - c) + (1 - \eta (x))c - (3 - 2c)c\) holds. Now for \(\mathcal {L}_\phi (\eta (x), 0, 0) = 1\), we have that \( \eta (x) - c \le 1 - 2c \le \kappa \big ( 1 - (3 - 2c)c\big )\).
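The proof above establishes the bound conditionally on x, so the inequality can be spot-checked pointwise by random sampling (a plain-Python sketch; the function names are our own, and ties at \(h = 0\) or \(r = 0\) are avoided by continuous sampling):

```python
import random

def phi(u, v, c):
    # max(1 + (v + u)/2, c * (1 - v / (1 - 2c)), 0)
    return max(1.0 + 0.5 * (v + u), c * (1.0 - v / (1.0 - 2.0 * c)), 0.0)

def L_phi(eta, h, r, c):
    # conditional expected surrogate loss, Eq. (2)
    return eta * phi(-h, r, c) + (1.0 - eta) * phi(h, r, c)

def cond_risk(eta, h, r, c):
    # conditional rejection risk: abstain (cost c) when r <= 0, otherwise
    # predict sign(h) and pay the conditional misclassification probability
    if r <= 0:
        return c
    return eta if h < 0 else 1.0 - eta

random.seed(0)
c = 0.2
kappa = 1.0 / ((1.0 - c) * (1.0 - 2.0 * c))
for _ in range(10000):
    eta = random.random()
    h, r = random.uniform(-4, 4), random.uniform(-4, 4)
    lstar = min(eta, c, 1.0 - eta)          # Bayes conditional risk L*(eta)
    excess = cond_risk(eta, h, r, c) - lstar
    surrogate_excess = L_phi(eta, h, r, c) - (3.0 - 2.0 * c) * lstar
    assert excess <= kappa * surrogate_excess + 1e-9
```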

Appendix B Alternative convex surrogate functions

Alternative convex surrogate functions can be found using a concave lower bound formula described here. Let \(u \mapsto \Phi (u)\) and \(u \mapsto \Psi (u)\) be strictly increasing concave functions lower bounding \(1_{u > 0}\). Then, the following inequalities hold:

$$\begin{aligned}{} & {} L(h, r, x, y) \nonumber \\{} & {} \le 1_{yh(x) \le 0} 1_{r(x)> 0} + c \, 1_{r(x) \le 0} \nonumber \\{} & {} = \left( 1 - 1_{yh(x)> 0}\right) 1_{r(x)> 0} + c \, 1_{r(x) \le 0} \nonumber \\{} & {} = 1_{r(x)> 0} - 1_{yh(x)> 0} 1_{r(x)> 0} + c \, 1_{r(x)\le 0} \nonumber \\{} & {} = \left( 1 - 1_{r(x) \le 0}\right) - 1_{yh(x)> 0} 1_{r(x)> 0} + c \, 1_{r(x) \le 0} \nonumber \\{} & {} = 1 - (1 - c) 1_{r(x) \le 0} - 1_{yh(x)> 0}1_{r(x)> 0}\nonumber \\{} & {} = 1 - (1 - c) 1_{r(x) \le 0} - 1_{\min ( yh(x), r(x) ) > 0}\nonumber \\{} & {} \le 1 - (1 - c) \Phi ( - r(x)) - \Psi (\min ( yh(x), r(x) )) \nonumber \\{} & {} = 1 - (1 - c) \Phi ( - r(x)) - \min ( \Psi (yh(x)), \Psi (r(x) )). \end{aligned}$$
(4)

The last expression on the right-hand side of (4) defines a convex function of h and r, since the minimum of two concave functions is concave and the negation of a concave function is convex.
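As an illustration (our own choice of \(\Phi \) and \(\Psi \), not one prescribed above), take \(\Phi (u) = \Psi (u) = 1 - e^{-u}\), which is strictly increasing, concave, and lower bounds \(1_{u > 0}\). The resulting surrogate can be checked numerically to dominate the rejection loss:

```python
import math
import random

def Phi(u):
    # strictly increasing, concave, and <= indicator(u > 0)
    return 1.0 - math.exp(-u)

def rejection_loss(h, r, y, c):
    # abstain at cost c when r <= 0, otherwise 0-1 misclassification loss
    if r <= 0:
        return c
    return 1.0 if y * h <= 0 else 0.0

def surrogate(h, r, y, c):
    # last line of the chain of inequalities, with Phi = Psi
    return 1.0 - (1.0 - c) * Phi(-r) - min(Phi(y * h), Phi(r))

random.seed(0)
c = 0.25
for _ in range(10000):
    h, r = random.uniform(-3, 3), random.uniform(-3, 3)
    y = random.choice([-1, 1])
    # the convex surrogate upper bounds the rejection loss everywhere
    assert surrogate(h, r, y, c) >= rejection_loss(h, r, y, c) - 1e-9
```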

Appendix C Connections to cost-sensitive learning

In this section, we draw connections between the cost-sensitive learning framework and learning with rejection.

The standard cost-sensitive algorithms and theory are designed for unknown distributions; however, in our setting, there is some prior information about the distribution since the rejection label has measure zero, a fact that should be exploited to derive a finer analysis. Moreover, using cost-sensitive algorithms for the rejection setting might not produce any interesting solution since they would treat rejection as any other label and since it is unclear how they would perform with a label for which there is no training data [2, 3, 25]. To elaborate on this, we first introduce a natural model for multi-class classification with rejection which can be viewed as an instance of cost-sensitive models and discuss its properties. The hypothesis set commonly adopted in multi-class classification is that of scoring functions: a scoring function \(h(\cdot , y) :\mathcal {X}\rightarrow \mathbb {R}\) is learned for each class \(y \in \mathcal {Y}\) and the class predicted for \(x \in \mathcal {X}\) is the one with the highest score, that is argmax\(_{y \in \mathcal {Y}} h(x, y)\). This is also the hypothesis set adopted in the more complex multi-class classification scenario of structured prediction where misclassification is cost-sensitive: the loss \(L(y, y')\) of predicting \(y' \in \mathcal {Y}\) instead of the correct class \(y \in \mathcal {Y}\) depends on the pair \((y, y')\).

This suggests a natural model for multi-class classification with rejection. As in the standard multi-class case, we can introduce a scoring function for rejection \(r(x) = h(x,\texttt {r})\), where \(\texttt {r}\) is the rejection symbol. The label predicted, which is either a regular class label or the label \(\texttt {r}\) with the semantics of rejection, is the one with the highest score:

$$\begin{aligned} \textsf {h}(x) = \underset{y \in \mathcal {Y}\cup \{ \texttt {r} \}}{\text {argmax}}\, h(x, y). \end{aligned}$$
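In code, this prediction rule is simply an argmax over the augmented label set \(\mathcal {Y}\cup \{ \texttt {r} \}\); the scores below are made-up toy values:

```python
# Hypothetical scores h(x, y) for classes {0, 1, 2} plus the rejection
# symbol 'r'; the predicted label is the argmax over the augmented set.
def predict(scores):
    # scores: dict mapping label -> h(x, label)
    return max(scores, key=scores.get)

def r_of_x(scores):
    # implicit rejection function: max_y h(x, y) - h(x, 'r')
    return max(v for k, v in scores.items() if k != 'r') - scores['r']

scores = {0: 0.3, 1: 1.1, 2: -0.2, 'r': 0.8}
assert predict(scores) == 1 and r_of_x(scores) > 0    # classify as class 1
scores['r'] = 2.0
assert predict(scores) == 'r' and r_of_x(scores) < 0  # abstain
```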

Thus, the rejection function r is implicitly defined by \(r(x) = \max _{y \in \mathcal {Y}}h(x, y) - h(x, \texttt {r})\) and the rejection loss can be expressed by

$$\begin{aligned} L(h,r,x,y) =1_{h(x,y) \le \max _{y' \ne y}h(x,y')}1_{h(x,\texttt {r}) < \max _{y\in \mathcal {Y}}h(x,y)} + c1_{h(x,\texttt {r}) \ge \max _{y\in \mathcal {Y}}h(x,y)}. \end{aligned}$$

This loss can be upper bounded by the convex surrogate

$$\begin{aligned} L_{\text {SH}}(h,r,x,y)= \max \Big (0, 1 - [h(x,y) -\underset{{y' \ne y}}{\max }\, h(x,y')], c \big ( 1 - [h(x,y) - h(x,\texttt {r})] \big ) \Big ), \end{aligned}$$

which is closely related to the loss function used in StructSVM. Using \(L_{\text {SH}}\) and linear functions \(h(x, y) = {\varvec{w}}_y \cdot {\varvec{\Phi }}(x)\) for each class \(y \in \mathcal {Y}\cup \{ \texttt {r} \}\) with a norm-2 regularization leads to an algorithm defined by the following optimization problem

$$\begin{aligned} \underset{\text {W},{\varvec{w}}_r,{\varvec{\xi }}}{\text {min}} \hspace{2mm}{} & {} \frac{\lambda }{2}\sum \limits _{l=1}^{k} ||{\varvec{w}}_l||^{2}+ \frac{\lambda '}{2} ||{\varvec{w}}_r||^{2} + \sum \limits _{i = 1}^{m} \xi _i \\ \text {subject to: }{} & {} \xi _i \ge c(1-{\varvec{w}}_{y_i}\cdot \Phi (x_i)+ {\varvec{w}}_{r}\cdot \Phi (x_i) ), \\{} & {} \xi _i\ge 1 -{\varvec{w}}_{y_i}\cdot \Phi (x_i)+ {\varvec{w}}_{l}\cdot \Phi (x_i) ,\\{} & {} \xi _i \ge 0,i\in [1,m], \forall l \in \mathcal {Y}- \{y_i\}, \end{aligned}$$

where \(\text {W}=({\varvec{w}}_1,\ldots ,{\varvec{w}}_k)\) and \({\varvec{\xi }}=(\xi _1,\ldots ,\xi _m)\).

In principle, one can use the theory and learning bounds from structured prediction to derive the optimization problem above, but in the absence of rejection labels in the data, there is no incentive for the rejection scoring function to be large. More precisely, suppose that the dataset has only positive features, so that \(\Phi (x_i)\) has only positive elements. Considering the constraints of the optimization problem, \({\varvec{w}}_r\) appears only in \(\xi _i \ge c(1-{\varvec{w}}_{y_i}\cdot \Phi (x_i)+ {\varvec{w}}_{r}\cdot \Phi (x_i) )\) and, as a consequence of \(\Phi (x_i)\) being positive, these constraints push \({\varvec{w}}_{r}\) to be negative. Combining this with the fact that the objective minimizes \(||{\varvec{w}}_r||^2\), the optimization problem will find a solution in which \({\varvec{w}}_r\) is small and negative. One may also see this directly from the KKT conditions for \({\varvec{w}}_r\). Thus, for positive \(\Phi (x_i)\), the score for the rejection label will be a small negative number while the scores of the other class labels can be positive, which implies that this method is likely not to abstain very often. Thus, while very natural, this cost-sensitive formulation does not lead to a useful algorithm in this scenario. One may seek to modify the objective function to promote larger values for the rejection scoring function, but our attempts typically led to non-convex functions, and the absence of an \(\texttt {r}\) label in the training sample remained a problem.

There are existing cost-sensitive algorithms that can be used in the rejection setting [2, 3, 25], which are based on reductions stemming from the work of [23]. However, their guarantees relate the difference between the generalization error and the Bayes optimal error of the cost-sensitive problem to that of the reduced binary problem at the price of a multiplicative factor that typically depends on the quality of the reduction, which yields quantities that are not easy to compare with ours. Furthermore, as argued by [43], these algorithms can be quite complicated both in terms of their encoding structure and their algorithmic procedure, since they reduce the cost-sensitive problem first to a weighted binary classification problem, which is then converted into a binary classification problem via the Costing algorithm of [47], and which in turn is solved by a standard algorithm for binary classification. Note that the convex surrogate loss approach described in the previous paragraph is closer in nature to the cost-sensitive work of [43], but their algorithm does not apply to the rejection setting.

The calibration analysis in [39] is not helpful for the analysis of learning with rejection, since the main point of their paper is consistency guarantees and the study of a notion they introduce, the convex calibration dimension of the loss matrix, which characterizes when it is possible to design a convex surrogate that is calibrated. Instead, we need guarantees and an analysis for the specific convex surrogate \(L_{\text {MH}}\), and our main concern is not consistency. Additionally, the loss matrix for rejection as defined in [39] is small, so the analysis of its dimensionality is not relevant, and in fact their bound for the rejection loss is not tight.

Appendix D Algorithms with kernel-based hypotheses

In this section, we provide further details on the algorithms with kernel-based hypotheses.

1.1 D.1 Optimization problems

We derive the optimization problem first for loss \(L_{\text {MH}}\) and then for loss \(L_{\text {PH}}\). We find that both the primal and dual optimization problems for \(L_\text {MH}\) and \(L_\text {PH}\) are QPs.

Firstly, by the generalization of Corollary 9 to a uniform bound over \(\rho , \rho ' \in (0, 1)\) and by picking \(\Lambda = 1\) and \(\Lambda ' = 1\), we have that, for any \(\delta > 0\), with probability at least \(1 - \delta \), the following holds for all \(\rho , \rho ' \in (0, 1)\), \({\mathscr {H}}= \{ \varvec{x} \rightarrow \varvec{w} \cdot \Phi (x):\Vert \varvec{w} \Vert \le 1 \}\) and \({\mathscr {R}}= \{\varvec{x} \rightarrow {\varvec{u}}\cdot \Phi '(x):\Vert {\varvec{u}}\Vert \le 1 \}\):

$$\begin{aligned} R(h, r){} & {} \le \frac{1}{m} \sum \limits _{i = 1}^m \text {max} \left( 1 + \tfrac{\alpha }{2} \left( \tfrac{{\varvec{u}}\cdot \Phi '(x_i) }{\rho '} - \tfrac{y_i\varvec{w} \cdot \Phi (x_i)}{\rho } \right) , c \, \left( 1 - \tfrac{\beta {\varvec{u}}\cdot \Phi '(x_i)}{\rho '}\right) , 0 \right) \\{} & {} + \alpha \sqrt{\tfrac{(\kappa / \rho )^2}{m} } + \left( 2 \beta c + \alpha \right) \sqrt{\tfrac{(\kappa ' / \rho ')^2}{m}} + C(\rho , \rho ', m, \delta ), \end{aligned}$$

where \(C(\rho , \rho ', m, \delta ) = \sqrt{\frac{\log \frac{1}{\delta }}{2m}} + \sqrt{\frac{\log \log 1/\rho }{m}} + \sqrt{\frac{\log \log 1/\rho '}{m}} \). Secondly, under binary classification, the functions \(h/ \rho \) and \(r / \rho \) admit the same generalization error as h and r for any \(\rho \in (0, 1)\) and \(\rho ' \in (0, 1)\). Thus, with probability at least \(1 - \delta \), the following holds for all \(\rho \in (0, 1)\), \(\rho ' \in (0, 1)\), \(h\in {\mathscr {H}}= \{\varvec{x} \rightarrow \varvec{w} \cdot \Phi (x): \Vert \varvec{w} \Vert \le \frac{1}{\rho } \}\) and \(r\in {\mathscr {R}}= \{\varvec{x} \rightarrow {\varvec{u}}\cdot \Phi '(x): \Vert {\varvec{u}}\Vert \le \frac{1}{\rho '} \} \)

$$\begin{aligned} R(h, r){} & {} \le \frac{1}{m} \sum \limits _{i = 1}^m \text {max} \Big ( 1 + \frac{\alpha }{2}\big ( {\varvec{u}}\cdot \Phi '(x_i) - y_i\varvec{w} \cdot \Phi (x_i)\big ), c \, \big (1 - \beta {\varvec{u}}\cdot \Phi '(x_i) \big ), 0 \Big ) \\{} & {} + \alpha \sqrt{\frac{(\kappa / \rho )^2}{m} } + (2 \beta c + \alpha ) \sqrt{\frac{(\kappa ' / \rho ')^2}{m}} + C(\rho , \rho ', m, \delta ). \end{aligned}$$

For any \(\rho \in (0, 1)\) and \(\rho ' \in (0, 1)\), the sum on the right-hand side depends on \({\varvec{w}}\) and \({\varvec{u}}\), and so the bound leads to the following optimization problem:

$$\begin{aligned} \underset{ \begin{array}{c} \Vert \varvec{w} \Vert ^2 \le \frac{1}{\rho ^2} \\ \Vert {\varvec{u}}\Vert ^2 \le \frac{1}{\rho '^2} \end{array} }{\text {min}} \frac{1}{m} \sum _{i = 1}^m \text {max} \Big ( 1 + \frac{\alpha }{2}\big ({\varvec{u}}\cdot \Phi '(x_i) - y_i\varvec{w} \cdot \Phi (x_i) \big ), c \, \big (1 - \beta {\varvec{u}}\cdot \Phi '(x_i) \big ), 0 \Big ). \end{aligned}$$
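For concreteness, the objective above can be evaluated directly. The following is a minimal numpy sketch (the identity feature maps, data, and parameter values here are hypothetical illustrations, not the paper's implementation):

```python
import numpy as np

def lmh_objective(w, u, X_feat, Xp_feat, y, alpha, beta, c):
    """Empirical L_MH objective: the average over the sample of
    max(1 + (alpha/2)(u.Phi'(x) - y w.Phi(x)), c(1 - beta u.Phi'(x)), 0)."""
    r = Xp_feat @ u                      # rejection scores u . Phi'(x_i)
    h = X_feat @ w                       # classifier scores w . Phi(x_i)
    t1 = 1.0 + 0.5 * alpha * (r - y * h)
    t2 = c * (1.0 - beta * r)
    return np.mean(np.maximum.reduce([t1, t2, np.zeros_like(t1)]))

# toy check: one point, identity feature maps (assumed for illustration);
# h(x) = 2, r(x) = 0, so the max is max(0, c, 0) = c
X = np.array([[1.0]])
y = np.array([1.0])
val = lmh_objective(np.array([2.0]), np.array([0.0]), X, X, y,
                    alpha=1.0, beta=1.0, c=0.3)
print(val)  # 0.3
```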

Lastly, we introduce slack variables \(\xi _i\) for \(i\in [1, m]\) along with Lagrange multipliers \(\lambda \ge 0\) and \(\lambda ' \ge 0\) so that the primal optimization problem for \(L_{\text {MH}}\) is as follows:

$$\begin{aligned} \underset{\varvec{w}, \varvec{u}, \varvec{\xi }}{\text {min}}{} & {} \quad \frac{\lambda }{2} \Vert \varvec{w}\Vert ^2 + \frac{\lambda '}{2} \Vert \varvec{u}\Vert ^2 + \sum \limits _{i = 1}^{m} \xi _i \\ \text {subject to}{} & {} \quad \xi _i \ge c\big (1 - \beta (\varvec{u}\cdot \Phi '(x_i) + b') \big ), \\{} & {} \quad \xi _i \ge 1 + \frac{\alpha }{2}\big (\varvec{u}\cdot \Phi '(x_i) + b' - y_i (\varvec{w}\cdot \Phi (x_i) + b)\big ), \\{} & {} \quad \xi _i \ge 0, i \in [1, m], \end{aligned}$$

where we make explicit both the offset b of the classifier h and the offset \(b'\) of the rejection function r. Since \(K(x_i, x_j) = \Phi (x_i)\cdot \Phi (x_j)\) and \(K'(x_i, x_j) = \Phi '(x_i)\cdot \Phi '(x_j)\), the dual optimization problem is given by the following:

$$\begin{aligned} \underset{{\varvec{\eta }}, {\varvec{\zeta }}}{\text {max}}{} & {} \quad \lambda \lambda ' \sum _{i = 1}^m \eta _i + \lambda \lambda ' c \sum _{i = 1}^m \zeta _i - \frac{ \alpha ^2 \lambda '}{8} \sum _{i, j = 1}^m \eta _i \eta _j y_i y_j K(x_i, x_j) \\{} & {} \quad - \frac{\lambda }{2} \sum _{i, j = 1}^m \left( \frac{\alpha \eta _i}{2} - c\beta \zeta _i\right) \left( \frac{\alpha \eta _j}{2} - c\beta \zeta _j\right) K'(x_i, x_j) \\ \text {subject to}{} & {} \quad \sum _{i = 1}^m\eta _i y_i = 0, \sum \limits _{i = 1}^m \left( \frac{\alpha \eta _i}{2} - c\beta \zeta _i\right) = 0, \\{} & {} \quad \eta _i\ge 0, \zeta _i \ge 0, \eta _i + \zeta _i \le 1, i \in [1, m]. \end{aligned}$$

By reasoning similar to the above, we derive the optimization problem for the surrogate loss \(L_{\text {PH}}\). By introducing slack variables \(\xi _i\) for \(i\in [1, m]\) as well as Lagrange multipliers \(\lambda \ge 0\) and \(\lambda ' \ge 0\), we have the following primal optimization problem for \(L_{\text {PH}}\):

$$\begin{aligned} \underset{\varvec{w}, \varvec{u}, \varvec{\xi }, \varvec{\xi }'}{\text {min}}{} & {} \quad \frac{\lambda }{2} \Vert \varvec{w}\Vert ^2 + \frac{\lambda '}{2} \Vert \varvec{u}\Vert ^2 + \sum \limits _{i = 1}^{m} \xi _i + \sum \limits _{i = 1}^{m} \xi '_i \\ \text {subject to}{} & {} \quad \xi '_i \ge c(1 - \beta (\varvec{u}\cdot \Phi '(x_i) + b') ), \\{} & {} \quad \xi _i \ge 1 + \frac{\alpha }{2} (\varvec{u}\cdot \Phi '(x_i) + b' - y_i (\varvec{w}\cdot \Phi (x_i) + b)),\\{} & {} \quad \xi _i \ge 0, \xi _i' \ge 0, i \in [1, m]. \end{aligned}$$

Since \(K(x_i, x_j) = \Phi (x_i)\cdot \Phi (x_j)\) and \(K'(x_i, x_j) = \Phi '(x_i)\cdot \Phi '(x_j)\), the dual optimization problem of \(L_{\text {PH}}\) is given by the following:

$$\begin{aligned} \underset{{\varvec{\eta }}, {\varvec{\zeta }}}{\text {max}}{} & {} \quad \lambda \lambda ' \sum \limits _{i = 1}^m \eta _i + \lambda \lambda ' c \sum \limits _{i = 1}^m \zeta _i - \frac{\alpha ^2 \lambda '}{8} \sum \limits _{i, j = 1}^m \eta _i \eta _j y_i y_j K(x_i, x_j) \\{} & {} \quad - \frac{\lambda }{2 } \sum \limits _{i, j = 1}^m \Big (\frac{\alpha \eta _i}{2} - c\beta \zeta _i \Big ) \Big (\frac{\alpha \eta _j}{2} - c\beta \zeta _j \Big ) K'(x_i, x_j) \\ \text {subject to}{} & {} \quad \sum _{i = 1}^m \eta _i y_i = 0, \sum \limits _{i = 1}^m \Big (\frac{\alpha \eta _i}{2} - c\beta \zeta _i \Big ) = 0, \\{} & {} \quad 0 \le \eta _i \le 1, 0 \le \zeta _i \le 1, i \in [1, m]. \nonumber \end{aligned}$$
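The two duals above share the same concave quadratic objective in \(({\varvec{\eta }}, {\varvec{\zeta }})\); only the feasible sets differ, which is why both are QPs. As an illustrative sketch (not the paper's code), the shared objective can be assembled from the two Gram matrices as follows:

```python
import numpy as np

def dual_objective(eta, zeta, K, Kp, y, alpha, beta, c, lam, lamp):
    """Shared dual objective of the L_MH and L_PH problems:
    linear term minus two quadratic forms in eta and (alpha eta/2 - c beta zeta)."""
    v = 0.5 * alpha * eta - c * beta * zeta          # (alpha eta_i / 2 - c beta zeta_i)
    lin = lam * lamp * (eta.sum() + c * zeta.sum())
    quad_h = (alpha ** 2 * lamp / 8.0) * (eta * y) @ K @ (eta * y)
    quad_r = (lam / 2.0) * v @ Kp @ v
    return lin - quad_h - quad_r

# tiny hand-checkable instance (all values hypothetical):
# lin = 1, quad_h = 0.5, quad_r = 0.5, so the objective is 0
obj = dual_objective(np.array([1.0]), np.array([0.0]),
                     np.array([[1.0]]), np.array([[1.0]]), np.array([1.0]),
                     alpha=2.0, beta=1.0, c=0.5, lam=1.0, lamp=1.0)
print(obj)  # 0.0
```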

Appendix E Confidence-based rejection algorithms

In this section, we present the optimization problems studied in Section 5 and then report experimental results comparing these different confidence-based rejection algorithms.

E.1 Optimization problems

We first consider the algorithms that hold for \(\gamma \in [0,1-c]\). The DHL algorithm of [1] solves the following optimization problem:

$$\begin{aligned} \underset{{\varvec{\alpha }}, \varvec{\xi }, {\varvec{\beta }}}{\text {min}}{} & {} \quad \sum \limits _{i = 1}^m \left( \xi _i + \frac{1 - 2c}{c}\beta _i \right) \\ \text {subject to }{} & {} \sum \limits _{i, j = 1}^m \alpha _i\alpha _j K(x_i, x_j) \le (1 - c)^2 , \xi _i \ge 1 - y_i \sum \limits _{j = 1}^m \alpha _j K(x_j, x_i) \wedge \xi _i \ge 0, \\{} & {} \beta _i \ge - y_i \sum \limits _{j = 1}^m \alpha _j K(x_j, x_i) \wedge \beta _i \ge 0, i \in [1, m]. \end{aligned}$$

The optimization problem based on the hinge loss is given by:

$$\begin{aligned} \min _{{\varvec{\alpha }}, \varvec{\xi }}{} & {} \quad \sum _{i = 1}^m \xi _i\\ \text {subject to }{} & {} \sum \limits _{i, j = 1}^m \alpha _i\alpha _j K(x_i, x_j) \le (1 - c)^2 ,\\{} & {} \xi _i \ge 1 - y_i \sum \limits _{j = 1}^m \alpha _j K(x_j, x_i) \wedge \xi _i \ge 0, i\in [1, m], \end{aligned}$$

and its dual formulation is as follows:

$$\begin{aligned} \underset{{\varvec{\alpha }}, {\varvec{\eta }}, \zeta }{\text {max}}{} & {} \sum \limits _{i = 1}^m \alpha _i + \sum \limits _{i = 1}^m \eta _i - \zeta r^2 \\ \text {subject to }{} & {} 0 \le \zeta , 0 \le \alpha _i , 0 \le \eta _i, 0 \le \alpha _i + \eta _i \le 1, i\in [1, m]\\{} & {} \sum \limits _{i, j = 1}^m (\alpha _i + a\eta _i)(\alpha _j + a\eta _j)y_iy_j K(x_i, x_j) = (\zeta r)^2 . \end{aligned}$$

The above shows that the optimization problem solved by DHL is a QCQP, while the optimization problem based on the hinge loss is a QP.
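At test time, all the confidence-based algorithms above use the same prediction rule: reject when the score falls within the confidence threshold \(\gamma\) and otherwise predict its sign, paying cost c for a rejection and cost 1 for a misclassification. A minimal sketch of this evaluation rule (scores and values hypothetical):

```python
import numpy as np

def rejection_loss(scores, y, gamma, c):
    """Average rejection loss of a confidence-based rule:
    cost c when |h(x)| <= gamma (rejected), else the zero-one loss of sign(h(x))."""
    reject = np.abs(scores) <= gamma
    errors = (np.sign(scores) != y) & ~reject
    return np.mean(c * reject + errors)

# three points: one rejected (cost 0.25), one correct (0), one misclassified (1)
loss = rejection_loss(np.array([0.05, 2.0, -1.0]),
                      np.array([1.0, 1.0, 1.0]),
                      gamma=0.1, c=0.25)
print(loss)
```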

We now present the optimization problem based on the loss \(L_1\), which holds for all \(\gamma \in [0,1]\):

$$\begin{aligned} \min _{{\varvec{\alpha }}, \varvec{\xi }}{} & {} \hspace{2mm} \sum _{i = 1}^m \xi _i\\ \text {subject to }{} & {} \sum _{i, j = 1}^m \alpha _i\alpha _j K(x_i, x_j) \le 1, \\{} & {} \xi _i \ge 0, \xi _i \ge 1 - y_i (1 - c) \sum _{j = 1}^m \alpha _j K(x_j, x_i) , i\in [1, m]. \end{aligned}$$

Its dual formulation is as follows:

$$\begin{aligned} \max _{{\varvec{\alpha }}, \zeta }{} & {} \sum _i \alpha _i - \zeta (1 - c)^2\\ \text {subject to }{} & {} \sum _{i, j} \alpha _i \alpha _j y_i y_j K(x_i, x_j) = \zeta ^2 (1 - c)^2, 1\ge \alpha _i \ge 0, \zeta \ge 0, i\in [1, m], \end{aligned}$$

where we note that this optimization problem is a QCQP.

Table 1 Average rejection loss along with the standard deviations for confidence-based algorithms described in Section 5 across the four data sets for the nine cost values c

E.2 Empirical comparison of confidence-based rejection algorithms

We tested the confidence-based algorithms on four data sets from the UCI repository: australian, cod, skin, and liver. Table 4 shows the average rejection loss along with the standard deviations for the Hinge loss, \(L_1\) loss, and DHL confidence-based algorithms across the four data sets for the nine cost values c. The results were obtained using standard 5-fold cross-validation: for each data set, we split the data randomly into training, validation, and test sets in the ratio 3:1:1. We allowed the threshold \(\gamma \) to vary in \( \{ 0.1, 0.2, \ldots , 0.9 \}\) and the cost values in \(c \in \{ 0.05, 0.1, \ldots , 0.45\}\). All kernels are polynomial kernels with degree \(d \in \{1, 2, 3 \}\); unlike the experiments in Section 6, we did not use Gaussian kernels for this initial set of experiments. For a fixed cost value c, we find the combination of parameters \((\gamma , d)\) with the smallest average rejection loss on the validation set and report the average rejection loss for these parameters on the test set. Overall, these results show that DHL outperforms the Hinge and \(L_1\) algorithms on three out of the four data sets for most values of c. While the other algorithms are seemingly plausible alternatives, these preliminary results indicate that DHL is the superior algorithm in this confidence-based setting.
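The parameter-selection step of this protocol can be sketched as a grid search over \((\gamma, d)\). In the sketch below, `validation_loss` is a hypothetical stand-in for "train on the training split with these parameters and evaluate the rejection loss on the validation split"; it is not the paper's code:

```python
from itertools import product

def select_params(gammas, degrees, validation_loss):
    """Return the (gamma, d) pair with the smallest validation loss."""
    return min(product(gammas, degrees), key=lambda p: validation_loss(*p))

# toy stand-in loss: minimized at gamma = 0.2 and the smallest degree
best = select_params([0.1, 0.2, 0.3], [1, 2, 3],
                     lambda g, d: (g - 0.2) ** 2 + 0.01 * d)
print(best)  # (0.2, 1)
```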

Appendix F Experiments comparing DHL and CHR algorithms

In the following pages, we provide the results of several experiments described in Section 6. As a short summary, the CHR algorithm achieves better performance across all data sets for most values of the cost c. Table 5 shows the average rejection loss with standard deviations on the test set; CHR\(_\text {MH}\) stands for the CHR algorithm based on \(L_{\text {MH}}\). Table 7 reports the average fraction of the test points rejected, and Table 9 provides the classification error on the non-rejected points.

Table 2 Average rejection loss along with the standard deviations on the test set for the DHL algorithm and the CHR\(_\text {MH}\) algorithm across different costs
Table 3 Average rejection loss along with the standard deviations on the test set for the DHL algorithm and the CHR\(_\text {MH}\) algorithm across different costs
Table 4 Average fraction of points rejected along with the standard deviations for the DHL algorithm and the CHR\(_\text {MH}\) algorithm across different costs
Table 5 Average fraction of points rejected along with the standard deviations for the DHL algorithm and the CHR\(_\text {MH}\) algorithm across different costs
Table 6 Average classification error on non-rejected points along with the standard deviations for the DHL algorithm and the CHR\(_\text {MH}\) algorithm across different costs
Table 7 Average classification error on non-rejected points along with the standard deviations for the DHL algorithm and the CHR\(_\text {MH}\) algorithm across different costs

F.1 Experiments comparing CHR algorithms

In this section, we present the results of some initial experiments comparing the two CHR algorithms. Let CHR\(_\text {PH}\) stand for the CHR algorithm based on \(L_{\text {PH}}\). The experimental set-up is exactly the same as in Section 6, except that we used only polynomial kernels of degree \(d\in \{1, 2, 3\}\). Table 11 shows the average rejection loss with standard deviations on the test set for both algorithms. We find that on average CHR\(_\text {MH}\) performs slightly better than CHR\(_\text {PH}\), as expected, since the loss \(L_{\text {PH}}\) is an upper bound on the loss \(L_{\text {MH}}\).
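The ordering between the two losses holds pointwise: writing \(a = 1 + \frac{\alpha}{2}(r(x) - y h(x))\) and \(b = c(1 - \beta r(x))\), we have \(L_{\text{MH}} = \max(a, b, 0) \le \max(a, 0) + \max(b, 0) = L_{\text{PH}}\), since each positive part dominates its argument. A quick numerical check on hypothetical score values:

```python
import numpy as np

def l_mh(a, b):
    # max(a, b, 0): the MH surrogate on the two hinge terms a and b
    return np.maximum.reduce([a, b, np.zeros_like(a)])

def l_ph(a, b):
    # max(a, 0) + max(b, 0): the PH surrogate on the same two terms
    return np.maximum(a, 0.0) + np.maximum(b, 0.0)

rng = np.random.default_rng(0)
a, b = rng.normal(size=1000), rng.normal(size=1000)
print(bool(np.all(l_ph(a, b) >= l_mh(a, b))))  # True
```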

Table 8 Average rejection loss along with the standard deviations on the test set for the CHR\(_\text {MH}\) algorithm and the CHR\(_\text {PH}\) algorithm across the seven data sets for the nine cost values c using polynomial kernels


Cite this article

Cortes, C., DeSalvo, G. & Mohri, M. Theory and algorithms for learning with rejection in binary classification. Ann Math Artif Intell 92, 277–315 (2024). https://doi.org/10.1007/s10472-023-09899-2
