
A Natural Threshold Model for Ordinal Regression

Published in Neural Processing Letters.

Abstract

The threshold model is one of the most commonly used ordinal regression methods. It projects patterns onto a real axis and uses a list of thresholds to divide the axis into consecutive intervals, one for each category. We propose using a fixed sequence of natural thresholds (i.e., \(0.5, 1.5, \ldots \)) to simplify the model. We prove that, after minor changes to the neural network structure, using fixed thresholds does not degrade performance. The natural thresholds reduce the logical and implementation complexity of the model, so the loss function can easily be customized for any particular evaluation metric. We design loss functions for various ordinal regression evaluation metrics and achieve state-of-the-art results on many ordinal regression tasks.
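As a minimal illustration of the natural-threshold idea (our own sketch, not code from the paper): with thresholds fixed at \(0.5, 1.5, \ldots, Q-1.5\), assigning a score to a category reduces to counting how many thresholds lie below the score, which amounts to rounding with clipping. The helper name below is hypothetical.

```python
import bisect

def predict_natural(s: float, Q: int) -> int:
    """Assign a score s to a category in {1, ..., Q} using the fixed
    natural thresholds 0.5, 1.5, ..., Q - 1.5 (illustrative sketch)."""
    thresholds = [q + 0.5 for q in range(Q - 1)]
    # s in [b_{q-1}, b_q) is assigned category q (with b_0 = -inf, b_Q = +inf).
    return bisect.bisect_right(thresholds, s) + 1

print(predict_natural(1.7, 5))  # → 3, since 1.7 lies in [1.5, 2.5)
```

Because the thresholds are fixed, nothing about them needs to be learned; only the scoring network is trained.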


Data Availability

The MORPH II data set and the Image Aesthetics data set are public, and the diamond color classification data set is non-public.

Code Availability

We will publish the code on GitHub after the paper is accepted.


Acknowledgements

The work is supported by Anhui Center for Applied Mathematics, the Major Project of Science & Technology of Anhui Province (Nos. 202203a05020050, 202103a07020011), the NSF of China (No. 11871447), the Strategic Priority Research Program of Chinese Academy of Sciences (No. XDC 08010100), and the National Key Research and Development Program of MOST of China (No. 2018AAA0101001).

Funding

Funding was received to assist with the preparation of this manuscript.

Author information

Corresponding author

Correspondence to Zhouwang Yang.

Ethics declarations

Conflict of interest

The authors have no relevant financial or non-financial interests to disclose.


Appendix

1.1 Proof of Lemma 1

Lemma 1

Let the prediction function \( Pred _{\mathcal {L}}\) be order–preserving and surjective. Then, for every \(\alpha \in \Gamma \), there exist thresholds \(-\infty< b_1< b_2< \cdots< b_{Q-1} < +\infty \) such that

$$\begin{aligned} Pred _{\mathcal {L}}(s, \alpha ) = c_q,\ s \in [b_{q-1}, b_q), \end{aligned}$$

where \(q \in \{1, \ldots , Q\}, b_0 = -\infty , b_{Q} = +\infty \). In particular, we call \({\mathbf {b}} = (b_1, \ldots , b_{Q-1})\) the threshold vector of \( Pred _{\mathcal {L}}\).

Proof

For every \(\alpha \in \Gamma \), let \(b_q=\sup \{ s | Pred _{\mathcal {L}}(s, \alpha ) \preceq c_q\}\), \(q \in \{1, 2, \ldots , Q-1\}\). Because \(c_{q-1} \preceq c_{q}\), it follows that \(\{ s | Pred _{\mathcal {L}}(s, \alpha ) \preceq c_{q-1}\} \subseteq \{s | Pred _{\mathcal {L}}(s, \alpha ) \preceq c_{q}\}\) and \(\sup \{ s | Pred _{\mathcal {L}}(s, \alpha ) \preceq c_{q-1}\} \le \sup \{s | Pred _{\mathcal {L}}(s, \alpha ) \preceq c_{q}\}\). That means \(b_{q-1} \le b_{q}\), so \(b_1 \le b_2 \le \cdots \le b_{Q-1}\).

Since \( Pred _{\mathcal {L}}\) is order–preserving, we have \( Pred _{\mathcal {L}}(s, \alpha ) \preceq c_q\) when \(b_{q-1}< s < b_q\). By the definition of \(b_{q-1}\), \( Pred _{\mathcal {L}}(s, \alpha ) \preceq c_{q-1}\) does not hold on this interval. Thus, \( Pred _{\mathcal {L}}(s, \alpha ) = c_q\) when \(b_{q-1}< s < b_q\).

Since \( Pred _{\mathcal {L}}\) is surjective, for every \(c_q \in {\mathcal {Y}}\), the Lebesgue measure of \(\{s | Pred _{\mathcal {L}}(s, \alpha ) = c_q\}\) is greater than 0. Thus, \(-\infty< b_1< b_2< \cdots< b_{Q-1} < +\infty \).

To prove that \({\mathcal {L}}(b_{q-1}, \alpha , c_{q-1})={\mathcal {L}}(b_{q-1}, \alpha , c_q)\), we first assume that \({\mathcal {L}}(b_{q-1}, \alpha , c_{q-1}) < {\mathcal {L}}(b_{q-1}, \alpha , c_q)\). Since \({\mathcal {L}}(s, \alpha , y)\) is continuous in s, there exists \(\epsilon > 0\) small enough such that \({\mathcal {L}}(b_{q-1}+\epsilon , \alpha , c_{q-1}) < {\mathcal {L}}(b_{q-1}+\epsilon , \alpha , c_q)\). Thus, \( Pred _{\mathcal {L}}(b_{q-1}+\epsilon , \alpha ) = c_{q-1}\), which contradicts \( Pred _{\mathcal {L}}(s, \alpha ) = c_q\) for \(b_{q-1}< s < b_q\). The same contradiction arises for \({\mathcal {L}}(b_{q-1}, \alpha , c_{q-1}) > {\mathcal {L}}(b_{q-1}, \alpha , c_q)\). So \({\mathcal {L}}(b_{q-1}, \alpha , c_{q-1})={\mathcal {L}}(b_{q-1}, \alpha , c_q)\) and \( Pred _{\mathcal {L}}(b_{q-1}, \alpha ) = c_q\). \(\square \)
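Lemma 1 can also be checked numerically for a concrete order-preserving loss. In the hedged sketch below (the loss \(|s-q|\) is our illustrative stand-in for the paper's general \({\mathcal {L}}\)), the argmin prediction changes category only at isolated points, so each category's preimage is indeed an interval:

```python
def pred_from_loss(s: float, Q: int) -> int:
    # Pred_L(s) = argmin_q L(s, c_q), with the illustrative loss
    # L(s, c_q) = |s - q| standing in for the paper's general L.
    return min(range(1, Q + 1), key=lambda q: abs(s - q))

# Scan a grid of scores; Lemma 1 says each category's preimage is an
# interval, so the prediction changes at exactly Q - 1 points (the b_q).
Q = 4
grid = [i / 100 for i in range(-200, 701)]
preds = [pred_from_loss(s, Q) for s in grid]
changes = [grid[i] for i in range(1, len(preds)) if preds[i] != preds[i - 1]]
print(changes)  # the Q - 1 change points lie near 1.5, 2.5, 3.5
```

For this particular loss the recovered thresholds are the midpoints between consecutive category indices, consistent with the interval structure the lemma guarantees.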

1.2 Proof of Lemma 2

Lemma 2

If a single hidden layer neural network has one input unit, Q ReLU activation hidden units, and one output unit, it can represent any strictly increasing piecewise linear function with Q segments.

Proof

For any strictly increasing piecewise linear function f(x) with Q segments, let its \(Q-1\) segmentation points be \(t_2< t_3< \cdots < t_{Q}\) and the slopes of the Q segments be \(w_1,\ w_2,\ \ldots ,\ w_Q\), respectively. Then, the single hidden layer neural network

$$\begin{aligned} g(x)=&-w_1 \cdot ReLU(-x+t_2)+w_2 \cdot ReLU(x-t_2) \nonumber \\&+\sum _{i=3}^{Q} (w_i-w_{i-1}) \cdot ReLU(x-t_i) + f(t_2) \end{aligned}$$

satisfies \(g(t_i)=f(t_i)\) for \(i \in \{2, \ldots , Q\}\) and has slope \(w_q\) on the q-th segment. Since f(x) and g(x) are piecewise linear with the same segmentation points, the same slopes, and the same value at \(t_2\), they are identical. \(\square \)
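The construction can be verified numerically. The sketch below is our own (the target f and all numeric values are assumptions chosen for illustration); it builds g from the segmentation points and slopes exactly as in the proof and checks that it reproduces f:

```python
def relu(x: float) -> float:
    return max(x, 0.0)

def make_g(ts, ws, f_t2):
    """Single-hidden-layer ReLU network from the proof of Lemma 2.
    ts = [t_2, ..., t_Q] segmentation points, ws = [w_1, ..., w_Q] slopes,
    f_t2 = f(t_2)."""
    def g(x):
        y = -ws[0] * relu(-x + ts[0]) + ws[1] * relu(x - ts[0]) + f_t2
        for j in range(1, len(ts)):            # terms i = 3, ..., Q
            y += (ws[j + 1] - ws[j]) * relu(x - ts[j])
        return y
    return g

def f(x):  # assumed target: strictly increasing, 3 segments, t_2=1, t_3=2
    if x < 1:
        return 0.5 * (x - 1)
    if x < 2:
        return 2.0 * (x - 1)
    return 2.0 + 1.0 * (x - 2)

g = make_g([1.0, 2.0], [0.5, 2.0, 1.0], f(1.0))
assert all(abs(g(x) - f(x)) < 1e-9 for x in [-3, 0, 1, 1.5, 2, 2.7, 5])
```

The first two ReLU terms fix the slopes on the two segments adjacent to \(t_2\); each remaining unit adds the slope change at one further segmentation point.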

1.3 Proof of Theorem 1

Theorem 1

Let \(({\mathcal {S}}_1, {\mathcal {L}}_1)\) be any threshold model where the prediction function \( Pred _{{\mathcal {L}}_1}\) derived from \({\mathcal {L}}_1\) is order–preserving and surjective. For any given threshold vector \({\mathbf {d}}\), there exists a threshold model \(({\mathcal {S}}_2, {\mathcal {L}}_2)\), which takes \({\mathbf {d}}\) as the threshold vector, and its performance is not weaker than \(({\mathcal {S}}_1, {\mathcal {L}}_1)\).

Proof

According to Lemma 1, for every \(\alpha \in \Gamma \) there exists a threshold vector \({\mathbf {b}}\) of \( Pred _{{\mathcal {L}}_1}\) such that \( Pred _{{\mathcal {L}}_1}(s, \alpha ) = c_q\) when \(s \in [b_{q-1}, b_q)\). Let \(\phi (x, \alpha )\) be a strictly increasing piecewise linear function with Q segments such that \(\phi (b_q, \alpha )=d_q\) and the slopes of the leftmost and rightmost segments are 1. \(\phi (x, \alpha )\) is determined by \(\alpha \), because its segmentation points \({\mathbf {b}}\) are determined by \(\alpha \) and its values at those points are \({\mathbf {d}}\).

Let

$$\begin{aligned} {\mathcal {L}}_2(s, \alpha , y) = {\mathcal {L}}_1(\phi ^{-1}(s, \alpha ), \alpha , y). \end{aligned}$$
(1)

Then, the prediction function \( Pred _{{\mathcal {L}}_2}\) takes \({\mathbf {d}}\) as the threshold vector. Examples of this loss-function transform are shown in Fig. 10.

Let

$$\begin{aligned} {\mathcal {S}}_2({\mathbf {x}},(\theta ,{\mathbf {w}}))&= \varphi ({\mathcal {S}}_1({\mathbf {x}},\theta ),{\mathbf {w}}), \nonumber \\ \widetilde{{\mathcal {S}}_1}({\mathbf {x}},(\theta ,{\mathbf {w}}, \alpha ))&= \phi ^{-1}(\varphi ({\mathcal {S}}_1({\mathbf {x}},\theta ),{\mathbf {w}}),\alpha ), \end{aligned}$$
(2)

where \(\varphi (x,{\mathbf {w}})\) is a single hidden layer neural network with one input unit, Q ReLU activation hidden units and one output unit, and its parameter vector is \({\mathbf {w}}\). According to Eqs. 1 and 2,

$$\begin{aligned}&{\mathcal {L}}_1(\widetilde{{\mathcal {S}}_1}({\mathbf {x}},(\theta , {\mathbf {w}}, \alpha )),\alpha ,y) \nonumber \\&={\mathcal {L}}_1(\phi ^{-1}(\varphi ({\mathcal {S}}_1({\mathbf {x}},\theta ),{\mathbf {w}}),\alpha ),\alpha ,y) \nonumber \\&={\mathcal {L}}_2({\mathcal {S}}_2({\mathbf {x}},(\theta ,{\mathbf {w}})),\alpha ,y) \end{aligned}$$
(3)

always holds.

Denote

$$\begin{aligned} (\theta _1^*, \alpha _1^*)=\arg \min _{\theta , \alpha } \sum _{i=1}^{n}{\mathcal {L}}_1({\mathcal {S}}_1({\mathbf {x}}_i,\theta ),\alpha ,y_i) \end{aligned}$$

and

$$\begin{aligned} (\theta _2^*, {\mathbf {w}}_2^*, \alpha _2^*)=\arg \min _{\theta , {\mathbf {w}}, \alpha } \sum _{i=1}^{n}{\mathcal {L}}_2({\mathcal {S}}_2({\mathbf {x}}_i,(\theta , {\mathbf {w}})),\alpha ,y_i). \end{aligned}$$

According to Eq. 3, \((\theta _2^*, \, {\mathbf {w}}_2^*, \, \alpha _2^*)\) is also the optimal solution for

$$\begin{aligned} \min _{\theta ,\alpha ,{\mathbf {w}}} \sum _{i=1}^{n}{\mathcal {L}}_1(\widetilde{{\mathcal {S}}_1}({\mathbf {x}}_i,(\theta ,{\mathbf {w}},\alpha )),\alpha ,y_i). \end{aligned}$$

Therefore,

$$\begin{aligned}&\sum _{i=1}^{n}{\mathcal {L}}_1(\widetilde{{\mathcal {S}}_1}({\mathbf {x}}_i,(\theta _2^*,{\mathbf {w}}_2^*, \alpha _2^*)),\alpha _2^*,y_i) \nonumber \\&= \min _{\theta ,\alpha ,{\mathbf {w}}} \sum _{i=1}^{n} {\mathcal {L}}_1(\phi ^{-1}(\varphi ({\mathcal {S}}_1({\mathbf {x}},\theta ),{\mathbf {w}}),\alpha ),\alpha ,y) \nonumber \\&\le \min _{\theta ,\alpha } \sum _{i=1}^{n}{\mathcal {L}}_1({\mathcal {S}}_1({\mathbf {x}}_i,\theta ),\alpha ,y_i) \nonumber \\&= \sum _{i=1}^{n}{\mathcal {L}}_1({\mathcal {S}}_1({\mathbf {x}}_i,\theta _1^*),\alpha _1^*,y_i). \end{aligned}$$

The inequality holds because, by Lemma 2, there exists a \({\mathbf {w}}_0\) such that \(\varphi (x,{\mathbf {w}}_0)=\phi (x, \alpha )\). Since the minimum loss of the threshold model \((\widetilde{{\mathcal {S}}_1}, {\mathcal {L}}_1)\) on the training set does not exceed that of \(({\mathcal {S}}_1, {\mathcal {L}}_1)\), the performance of \((\widetilde{{\mathcal {S}}_1}, {\mathcal {L}}_1)\) is not weaker than that of \(({\mathcal {S}}_1, {\mathcal {L}}_1)\).

Fig. 10

Examples of the loss function transform. The thresholds before transformation are \(b=(-1,1.5,10)\), and the thresholds after transformation are \(d=(0.5,1.5,2.5)\). After transforming the score (that is, the x–axis), the loss functions on each threshold segment in the left panel are mapped to the parts of the same color in the right panel
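The map in Fig. 10 can be sketched in code (our own illustration, reusing the caption's thresholds \(b=(-1,1.5,10)\) and \(d=(0.5,1.5,2.5)\)): \(\phi \) is the strictly increasing piecewise linear function with \(\phi (b_q)=d_q\) and unit slope on the two unbounded segments, and \({\mathcal {L}}_2(s, \alpha , y)\) is then obtained by evaluating \({\mathcal {L}}_1\) at \(\phi ^{-1}(s, \alpha )\), as in Eq. 1.

```python
def make_phi(b, d):
    """Strictly increasing piecewise-linear map with phi(b[q]) = d[q] and
    slope 1 on the leftmost and rightmost (unbounded) segments, as in the
    proof of Theorem 1.  b and d must be strictly increasing."""
    def phi(x):
        if x <= b[0]:
            return d[0] + (x - b[0])      # unit slope on the left
        if x >= b[-1]:
            return d[-1] + (x - b[-1])    # unit slope on the right
        for i in range(len(b) - 1):       # interpolate on [b_i, b_{i+1}]
            if b[i] <= x <= b[i + 1]:
                t = (x - b[i]) / (b[i + 1] - b[i])
                return d[i] + t * (d[i + 1] - d[i])
    return phi

# Thresholds from the caption of Fig. 10.
phi = make_phi([-1.0, 1.5, 10.0], [0.5, 1.5, 2.5])
print(phi(-1.0), phi(1.5), phi(10.0))  # → 0.5 1.5 2.5
```

Because \(\phi \) is a piecewise linear bijection, its inverse is obtained simply by swapping the two threshold vectors, i.e. `make_phi(d, b)`, which is exactly the map composed with \({\mathcal {L}}_1\) in Eq. 1.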

Note that

$$\begin{aligned}&Pred _{{\mathcal {L}}_1}(\widetilde{{\mathcal {S}}_1}({\mathbf {x}},(\theta ,{\mathbf {w}}, \alpha )),\alpha ) \nonumber \\&= \arg \min _{c_q \in {\mathcal {Y}}} {\mathcal {L}}_1(\widetilde{{\mathcal {S}}_1}({\mathbf {x}},(\theta , {\mathbf {w}}, \alpha )),\alpha ,c_q) \nonumber \\&= \arg \min _{c_q \in {\mathcal {Y}}} {\mathcal {L}}_2({\mathcal {S}}_2({\mathbf {x}},(\theta ,{\mathbf {w}})),\alpha ,c_q) \nonumber \\&= Pred _{{\mathcal {L}}_2}({\mathcal {S}}_2({\mathbf {x}},(\theta ,{\mathbf {w}})),\alpha ). \end{aligned}$$

The threshold models \(({\mathcal {S}}_2, {\mathcal {L}}_2)\) and \((\widetilde{{\mathcal {S}}_1}, {\mathcal {L}}_1)\) therefore make the same prediction for every \({\mathbf {x}} \in {\mathcal {X}}\) when they share the same parameters, so their performances are the same. Therefore, the performance of \(({\mathcal {S}}_2, {\mathcal {L}}_2)\) is not weaker than that of \(({\mathcal {S}}_1, {\mathcal {L}}_1)\). \(\square \)


Cite this article

Wang, X., Song, Y. & Yang, Z. A Natural Threshold Model for Ordinal Regression. Neural Process Lett 55, 4933–4949 (2023). https://doi.org/10.1007/s11063-022-11073-4
