
A Natural Threshold Model for Ordinal Regression

Published in Neural Processing Letters.

Abstract

The threshold model is one of the most commonly used ordinal regression methods. It projects patterns onto a real axis and uses a list of thresholds to divide the axis into consecutive intervals, one for each category. We propose using a fixed sequence of natural thresholds (i.e., \(0.5, 1.5, \ldots \)) to simplify the model. We prove that, after minor changes to the neural network structure, using fixed thresholds does not degrade performance. The natural thresholds reduce the logical and implementation complexity of the model, so the loss function can easily be customized for any particular evaluation metric. We design loss functions for various ordinal regression evaluation metrics and achieve state-of-the-art results on many ordinal regression tasks.
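As a minimal illustration of the natural-threshold idea (our own sketch, not code from the paper): with thresholds fixed at \(0.5, 1.5, \ldots, Q-1.5\), assigning a score to a category reduces to counting how many thresholds lie below the score, which amounts to rounding with clipping. The helper name below is hypothetical.

```python
import bisect

def predict_natural(s: float, Q: int) -> int:
    """Assign a score s to a category in {1, ..., Q} using the fixed
    natural thresholds 0.5, 1.5, ..., Q - 1.5 (illustrative sketch)."""
    thresholds = [q + 0.5 for q in range(Q - 1)]
    # s in [b_{q-1}, b_q) is assigned category q (with b_0 = -inf, b_Q = +inf).
    return bisect.bisect_right(thresholds, s) + 1

print(predict_natural(1.7, 5))  # → 3, since 1.7 lies in [1.5, 2.5)
```

Because the thresholds are fixed, nothing about them needs to be learned; only the scoring network is trained.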


Data Availability

The MORPH II data set and the Image Aesthetics data set are public, and the diamond color classification data set is non-public.

Code Availability

We will publish the code on GitHub after the paper is accepted.


Acknowledgements

The work is supported by Anhui Center for Applied Mathematics, the Major Project of Science & Technology of Anhui Province (Nos. 202203a05020050, 202103a07020011), the NSF of China (No. 11871447), the Strategic Priority Research Program of Chinese Academy of Sciences (No. XDC 08010100), and the National Key Research and Development Program of MOST of China (No. 2018AAA0101001).

Funding

Funding was received to assist with the preparation of this manuscript.

Author information

Corresponding author

Correspondence to Zhouwang Yang.

Ethics declarations

Conflict of interest

The authors have no relevant financial or non-financial interests to disclose.


Appendix

1.1 Proof of Lemma 1

Lemma 1

Let the prediction function \( Pred _{\mathcal {L}}\) be order–preserving and surjective. Then, for every \(\alpha \in \Gamma \), there exist thresholds \(-\infty< b_1< b_2< \cdots< b_{Q-1} < +\infty \) such that

$$\begin{aligned} Pred _{\mathcal {L}}(s, \alpha ) = c_q,\ s \in [b_{q-1}, b_q), \end{aligned}$$

where \(q \in \{1, \ldots , Q\}, b_0 = -\infty , b_{Q} = +\infty \). In particular, we call \({\mathbf {b}} = (b_1, \ldots , b_{Q-1})\) the threshold vector of \( Pred _{\mathcal {L}}\).

Proof

For every \(\alpha \in \Gamma \), let \(b_q=\sup \{ s | Pred _{\mathcal {L}}(s, \alpha ) \preceq c_q\}\), \(q \in \{1, 2, \ldots , Q-1\}\). Because \(c_{q-1} \preceq c_{q}\), it follows that \(\{ s | Pred _{\mathcal {L}}(s, \alpha ) \preceq c_{q-1}\} \subseteq \{s | Pred _{\mathcal {L}}(s, \alpha ) \preceq c_{q}\}\) and \(\sup \{ s | Pred _{\mathcal {L}}(s, \alpha ) \preceq c_{q-1}\} \le \sup \{s | Pred _{\mathcal {L}}(s, \alpha ) \preceq c_{q}\}\). That means \(b_{q-1} \le b_{q}\), so \(b_1 \le b_2 \le \cdots \le b_{Q-1}\).

Since \( Pred _{\mathcal {L}}\) is order–preserving, we have \( Pred _{\mathcal {L}}(s, \alpha ) \preceq c_q\) when \(b_{q-1}< s < b_q\). By the definition of \(b_{q-1}\), \( Pred _{\mathcal {L}}(s, \alpha ) \preceq c_{q-1}\) does not hold on this interval. Thus, \( Pred _{\mathcal {L}}(s, \alpha ) = c_q\) when \(b_{q-1}< s < b_q\).

Since \( Pred _{\mathcal {L}}\) is surjective, for every \(c_q \in {\mathcal {Y}}\), the Lebesgue measure of \(\{s | Pred _{\mathcal {L}}(s, \alpha ) = c_q\}\) is greater than 0. Thus, \(-\infty< b_1< b_2< \cdots< b_{Q-1} < +\infty \).

To prove that \({\mathcal {L}}(b_{q-1}, \alpha , c_{q-1})={\mathcal {L}}(b_{q-1}, \alpha , c_q)\), we first assume that \({\mathcal {L}}(b_{q-1}, \alpha , c_{q-1}) < {\mathcal {L}}(b_{q-1}, \alpha , c_q)\). Since \({\mathcal {L}}(s, \alpha , y)\) is continuous in s, there exists \(\epsilon > 0\) small enough such that \({\mathcal {L}}(b_{q-1}+\epsilon , \alpha , c_{q-1}) < {\mathcal {L}}(b_{q-1}+\epsilon , \alpha , c_q)\). Thus, \( Pred _{\mathcal {L}}(b_{q-1}+\epsilon , \alpha ) = c_{q-1}\), which contradicts \( Pred _{\mathcal {L}}(s, \alpha ) = c_q\) for \(b_{q-1}< s < b_q\). The same contradiction arises for \({\mathcal {L}}(b_{q-1}, \alpha , c_{q-1}) > {\mathcal {L}}(b_{q-1}, \alpha , c_q)\). So \({\mathcal {L}}(b_{q-1}, \alpha , c_{q-1})={\mathcal {L}}(b_{q-1}, \alpha , c_q)\) and \( Pred _{\mathcal {L}}(b_{q-1}, \alpha ) = c_q\). \(\square \)
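Lemma 1 can also be checked numerically for a concrete order-preserving loss. In the hedged sketch below (the loss \(|s-q|\) is our illustrative stand-in for the paper's general \({\mathcal {L}}\)), the argmin prediction changes category only at isolated points, so each category's preimage is indeed an interval:

```python
def pred_from_loss(s: float, Q: int) -> int:
    # Pred_L(s) = argmin_q L(s, c_q), with the illustrative loss
    # L(s, c_q) = |s - q| standing in for the paper's general L.
    return min(range(1, Q + 1), key=lambda q: abs(s - q))

# Scan a grid of scores; Lemma 1 says each category's preimage is an
# interval, so the prediction changes at exactly Q - 1 points (the b_q).
Q = 4
grid = [i / 100 for i in range(-200, 701)]
preds = [pred_from_loss(s, Q) for s in grid]
changes = [grid[i] for i in range(1, len(preds)) if preds[i] != preds[i - 1]]
print(changes)  # the Q - 1 change points lie near 1.5, 2.5, 3.5
```

For this particular loss the recovered thresholds are the midpoints between consecutive category indices, consistent with the interval structure the lemma guarantees.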

1.2 Proof of Lemma 2

Lemma 2

If a single hidden layer neural network has one input unit, Q ReLU activation hidden units, and one output unit, it can represent any strictly increasing piecewise linear function with Q segments.

Proof

For any strictly increasing piecewise linear function f(x) with Q segments, let its \(Q-1\) segmentation points be \(t_2< t_3< \cdots < t_{Q}\) and the slopes of the Q segments be \(w_1,\ w_2,\ \ldots ,\ w_Q\), respectively. Then, the single hidden layer neural network

$$\begin{aligned} g(x)=&-w_1 \cdot ReLU(-x+t_2)+w_2 \cdot ReLU(x-t_2) \nonumber \\&+\sum _{i=3}^{Q} (w_i-w_{i-1}) \cdot ReLU(x-t_i) + f(t_2) \end{aligned}$$

satisfies \(g(t_i)=f(t_i)\) for \(i \in \{2, \ldots , Q\}\) and has slope \(w_q\) on the q-th segment. Since f(x) and g(x) are piecewise linear with the same segmentation points, the same slopes, and the same value at \(t_2\), they are identical. \(\square \)
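The construction can be verified numerically. The sketch below is our own (the target f and all numeric values are assumptions chosen for illustration); it builds g from the segmentation points and slopes exactly as in the proof and checks that it reproduces f:

```python
def relu(x: float) -> float:
    return max(x, 0.0)

def make_g(ts, ws, f_t2):
    """Single-hidden-layer ReLU network from the proof of Lemma 2.
    ts = [t_2, ..., t_Q] segmentation points, ws = [w_1, ..., w_Q] slopes,
    f_t2 = f(t_2)."""
    def g(x):
        y = -ws[0] * relu(-x + ts[0]) + ws[1] * relu(x - ts[0]) + f_t2
        for j in range(1, len(ts)):            # terms i = 3, ..., Q
            y += (ws[j + 1] - ws[j]) * relu(x - ts[j])
        return y
    return g

def f(x):  # assumed target: strictly increasing, 3 segments, t_2=1, t_3=2
    if x < 1:
        return 0.5 * (x - 1)
    if x < 2:
        return 2.0 * (x - 1)
    return 2.0 + 1.0 * (x - 2)

g = make_g([1.0, 2.0], [0.5, 2.0, 1.0], f(1.0))
assert all(abs(g(x) - f(x)) < 1e-9 for x in [-3, 0, 1, 1.5, 2, 2.7, 5])
```

The first two ReLU terms fix the slopes on the two segments adjacent to \(t_2\); each remaining unit adds the slope change at one further segmentation point.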

1.3 Proof of Theorem 1

Theorem 1

Let \(({\mathcal {S}}_1, {\mathcal {L}}_1)\) be any threshold model where the prediction function \( Pred _{{\mathcal {L}}_1}\) derived from \({\mathcal {L}}_1\) is order–preserving and surjective. For any given threshold vector \({\mathbf {d}}\), there exists a threshold model \(({\mathcal {S}}_2, {\mathcal {L}}_2)\), which takes \({\mathbf {d}}\) as the threshold vector, and its performance is not weaker than \(({\mathcal {S}}_1, {\mathcal {L}}_1)\).

Proof

According to Lemma 1, for every \(\alpha \in \Gamma \) there exists a threshold vector \({\mathbf {b}}\) of \( Pred _{{\mathcal {L}}_1}\) such that \( Pred _{{\mathcal {L}}_1}(s, \alpha ) = c_q\) when \(s \in [b_{q-1}, b_q)\). Let \(\phi (x, \alpha )\) be a strictly increasing piecewise linear function with Q segments such that \(\phi (b_q, \alpha )=d_q\) and the slopes of the leftmost and rightmost segments are 1. \(\phi (x, \alpha )\) is determined by \(\alpha \), because its segmentation points \({\mathbf {b}}\) are determined by \(\alpha \) and its values at those points are \({\mathbf {d}}\).

Let

$$\begin{aligned} {\mathcal {L}}_2(s, \alpha , y) = {\mathcal {L}}_1(\phi ^{-1}(s, \alpha ), \alpha , y). \end{aligned}$$
(1)

Then, the prediction function \( Pred _{{\mathcal {L}}_2}\) takes \({\mathbf {d}}\) as the threshold vector. Examples of this loss-function transform are shown in Fig. 10.

Let

$$\begin{aligned} {\mathcal {S}}_2({\mathbf {x}},(\theta ,{\mathbf {w}}))&= \varphi ({\mathcal {S}}_1({\mathbf {x}},\theta ),{\mathbf {w}}), \nonumber \\ \widetilde{{\mathcal {S}}_1}({\mathbf {x}},(\theta ,{\mathbf {w}}, \alpha ))&= \phi ^{-1}(\varphi ({\mathcal {S}}_1({\mathbf {x}},\theta ),{\mathbf {w}}),\alpha ), \end{aligned}$$
(2)

where \(\varphi (x,{\mathbf {w}})\) is a single hidden layer neural network with one input unit, Q ReLU activation hidden units and one output unit, and its parameter vector is \({\mathbf {w}}\). According to Eqs. 1 and 2,

$$\begin{aligned}&{\mathcal {L}}_1(\widetilde{{\mathcal {S}}_1}({\mathbf {x}},(\theta , {\mathbf {w}}, \alpha )),\alpha ,y) \nonumber \\&={\mathcal {L}}_1(\phi ^{-1}(\varphi ({\mathcal {S}}_1({\mathbf {x}},\theta ),{\mathbf {w}}),\alpha ),\alpha ,y) \nonumber \\&={\mathcal {L}}_2({\mathcal {S}}_2({\mathbf {x}},(\theta ,{\mathbf {w}})),\alpha ,y) \end{aligned}$$
(3)

always holds.

Denote

$$\begin{aligned} (\theta _1^*, \alpha _1^*)=\arg \min _{\theta , \alpha } \sum _{i=1}^{n}{\mathcal {L}}_1({\mathcal {S}}_1({\mathbf {x}}_i,\theta ),\alpha ,y_i) \end{aligned}$$

and

$$\begin{aligned} (\theta _2^*, {\mathbf {w}}_2^*, \alpha _2^*)=\arg \min _{\theta , {\mathbf {w}}, \alpha } \sum _{i=1}^{n}{\mathcal {L}}_2({\mathcal {S}}_2({\mathbf {x}}_i,(\theta , {\mathbf {w}})),\alpha ,y_i). \end{aligned}$$

According to Eq. 3, \((\theta _2^*, \, {\mathbf {w}}_2^*, \, \alpha _2^*)\) is also the optimal solution for

$$\begin{aligned} \min _{\theta ,\alpha ,{\mathbf {w}}} \sum _{i=1}^{n}{\mathcal {L}}_1(\widetilde{{\mathcal {S}}_1}({\mathbf {x}}_i,(\theta ,{\mathbf {w}},\alpha )),\alpha ,y_i). \end{aligned}$$

Therefore,

$$\begin{aligned}&\sum _{i=1}^{n}{\mathcal {L}}_1(\widetilde{{\mathcal {S}}_1}({\mathbf {x}}_i,(\theta _2^*,{\mathbf {w}}_2^*, \alpha _2^*)),\alpha _2^*,y_i) \nonumber \\&= \min _{\theta ,\alpha ,{\mathbf {w}}} \sum _{i=1}^{n} {\mathcal {L}}_1(\phi ^{-1}(\varphi ({\mathcal {S}}_1({\mathbf {x}},\theta ),{\mathbf {w}}),\alpha ),\alpha ,y) \nonumber \\&\le \min _{\theta ,\alpha } \sum _{i=1}^{n}{\mathcal {L}}_1({\mathcal {S}}_1({\mathbf {x}}_i,\theta ),\alpha ,y_i) \nonumber \\&= \sum _{i=1}^{n}{\mathcal {L}}_1({\mathcal {S}}_1({\mathbf {x}}_i,\theta _1^*),\alpha _1^*,y_i). \end{aligned}$$

The inequality holds because, by Lemma 2, there exists a \({\mathbf {w}}_0\) such that \(\varphi (x,{\mathbf {w}}_0)=\phi (x, \alpha )\). Since the minimum loss of the threshold model \((\widetilde{{\mathcal {S}}_1}, {\mathcal {L}}_1)\) on the training set does not exceed that of \(({\mathcal {S}}_1, {\mathcal {L}}_1)\), the performance of \((\widetilde{{\mathcal {S}}_1}, {\mathcal {L}}_1)\) is not weaker than that of \(({\mathcal {S}}_1, {\mathcal {L}}_1)\).

Fig. 10

Examples of the loss function transform. The thresholds before transformation are \(b=(-1,1.5,10)\), and the thresholds after transformation are \(d=(0.5,1.5,2.5)\). After transforming the score (that is, the x–axis), the loss functions on each threshold segment in the left panel are mapped to the parts of the same color in the right panel
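The map in Fig. 10 can be sketched in code (our own illustration, reusing the caption's thresholds \(b=(-1,1.5,10)\) and \(d=(0.5,1.5,2.5)\)): \(\phi \) is the strictly increasing piecewise linear function with \(\phi (b_q)=d_q\) and unit slope on the two unbounded segments, and \({\mathcal {L}}_2(s, \alpha , y)\) is then obtained by evaluating \({\mathcal {L}}_1\) at \(\phi ^{-1}(s, \alpha )\), as in Eq. 1.

```python
def make_phi(b, d):
    """Strictly increasing piecewise-linear map with phi(b[q]) = d[q] and
    slope 1 on the leftmost and rightmost (unbounded) segments, as in the
    proof of Theorem 1.  b and d must be strictly increasing."""
    def phi(x):
        if x <= b[0]:
            return d[0] + (x - b[0])      # unit slope on the left
        if x >= b[-1]:
            return d[-1] + (x - b[-1])    # unit slope on the right
        for i in range(len(b) - 1):       # interpolate on [b_i, b_{i+1}]
            if b[i] <= x <= b[i + 1]:
                t = (x - b[i]) / (b[i + 1] - b[i])
                return d[i] + t * (d[i + 1] - d[i])
    return phi

# Thresholds from the caption of Fig. 10.
phi = make_phi([-1.0, 1.5, 10.0], [0.5, 1.5, 2.5])
print(phi(-1.0), phi(1.5), phi(10.0))  # → 0.5 1.5 2.5
```

Because \(\phi \) is a piecewise linear bijection, its inverse is obtained simply by swapping the two threshold vectors, i.e. `make_phi(d, b)`, which is exactly the map composed with \({\mathcal {L}}_1\) in Eq. 1.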

Note that

$$\begin{aligned}&Pred _{{\mathcal {L}}_1}(\widetilde{{\mathcal {S}}_1}({\mathbf {x}},(\theta ,{\mathbf {w}}, \alpha )),\alpha ) \nonumber \\&= \arg \min _{c_q \in {\mathcal {Y}}} {\mathcal {L}}_1(\widetilde{{\mathcal {S}}_1}({\mathbf {x}},(\theta , {\mathbf {w}}, \alpha )),\alpha ,c_q) \nonumber \\&= \arg \min _{c_q \in {\mathcal {Y}}} {\mathcal {L}}_2({\mathcal {S}}_2({\mathbf {x}},(\theta ,{\mathbf {w}})),\alpha ,c_q) \nonumber \\&= Pred _{{\mathcal {L}}_2}({\mathcal {S}}_2({\mathbf {x}},(\theta ,{\mathbf {w}})),\alpha ). \end{aligned}$$

The threshold models \(({\mathcal {S}}_2, {\mathcal {L}}_2)\) and \((\widetilde{{\mathcal {S}}_1}, {\mathcal {L}}_1)\) therefore make the same prediction for every \({\mathbf {x}} \in {\mathcal {X}}\) when they share the same parameters, so their performances are the same. Therefore, the performance of \(({\mathcal {S}}_2, {\mathcal {L}}_2)\) is not weaker than that of \(({\mathcal {S}}_1, {\mathcal {L}}_1)\). \(\square \)


Cite this article

Wang, X., Song, Y. & Yang, Z. A Natural Threshold Model for Ordinal Regression. Neural Process Lett 55, 4933–4949 (2023). https://doi.org/10.1007/s11063-022-11073-4
