Abstract
The threshold model is one of the most commonly used ordinal regression methods. It projects patterns onto a real axis and uses a list of thresholds to divide the axis into consecutive intervals, one per category. We propose using a fixed sequence of natural thresholds (i.e., \(0.5, 1.5, \ldots \)) to simplify the model. We prove that fixing the thresholds does not degrade performance after minor changes to the neural network structure. The natural thresholds reduce both the logical and the implementation complexity of the model, so that the loss function can easily be customized for any particular metric. We design loss functions for various ordinal regression evaluation metrics and achieve state-of-the-art results on many ordinal regression tasks.
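As an illustration of the natural-threshold scheme (a minimal sketch of the idea, not the authors' released code; the function name is ours), prediction with the fixed thresholds \(0.5, 1.5, \ldots \) reduces to rounding the projected score:

```python
import numpy as np

def natural_threshold_predict(scores, num_classes):
    """Map real-valued scores to ordinal categories 0..Q-1 using the
    fixed natural thresholds 0.5, 1.5, ..., Q-1.5: category q covers
    the interval [q-0.5, q+0.5), i.e. floor(s + 0.5), then clipping."""
    return np.clip(np.floor(scores + 0.5), 0, num_classes - 1).astype(int)

print(natural_threshold_predict(np.array([-2.0, 0.4, 0.5, 1.49, 3.7, 9.0]), 5))
# → [0 0 1 1 4 4]
```

Because the thresholds are fixed, no threshold parameters need to be learned or ordered during training; only the score function is optimized.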
Data Availability
The MORPH II and Image Aesthetics data sets are public; the diamond color classification data set is non-public.
Code Availability
We will publish the code on GitHub after the paper is accepted.
Acknowledgements
This work was supported by the Anhui Center for Applied Mathematics, the Major Project of Science & Technology of Anhui Province (Nos. 202203a05020050, 202103a07020011), the NSF of China (No. 11871447), the Strategic Priority Research Program of Chinese Academy of Sciences (No. XDC 08010100), and the National Key Research and Development Program of MOST of China (No. 2018AAA0101001).
Funding
Funding was received to assist with the preparation of this manuscript.
Ethics declarations
Conflict of interest
The authors have no relevant financial or non-financial interests to disclose.
Appendix
1.1 Proof of Lemma 1
Lemma 1
Let the prediction function \( Pred _{\mathcal {L}}\) be order-preserving and surjective. Then, for every \(\alpha \in \Gamma \), there exist \(-\infty< b_1< b_2< \cdots< b_{Q-1} < +\infty \) such that

$$ Pred _{\mathcal {L}}(s, \alpha ) = c_q \quad \text {for all } s \in [b_{q-1}, b_q), $$

where \(q \in \{1, \ldots , Q\}\), \(b_0 = -\infty \), \(b_{Q} = +\infty \). In particular, we call \({\mathbf {b}} = (b_1, \ldots , b_{Q-1})\) the threshold vector of \( Pred _{\mathcal {L}}\).
Proof
For \(\forall \alpha \in \Gamma \), let \(b_q=\sup \{ s | Pred _{\mathcal {L}}(s, \alpha ) \preceq c_q\}\), \(q \in \{1, 2, \ldots , Q-1\}\). Because \(c_{q-1} \preceq c_{q}\), it follows that \(\{ s | Pred _{\mathcal {L}}(s, \alpha ) \preceq c_{q-1}\} \subseteq \{s | Pred _{\mathcal {L}}(s, \alpha ) \preceq c_{q}\}\) and \(\sup \{ s | Pred _{\mathcal {L}}(s, \alpha ) \preceq c_{q-1}\} \le \sup \{s | Pred _{\mathcal {L}}(s, \alpha ) \preceq c_{q}\}\). That means \(b_{q-1} \le b_{q}\), so \(b_1 \le b_2 \le \cdots \le b_{Q-1}\).
Since \( Pred _{\mathcal {L}}\) is order-preserving, we have \( Pred _{\mathcal {L}}(s, \alpha ) \preceq c_q\) when \(b_{q-1}< s < b_q\). By the definition of \(b_{q-1}\), \( Pred _{\mathcal {L}}(s, \alpha ) \preceq c_{q-1}\) does not hold for such s. Thus, \( Pred _{\mathcal {L}}(s, \alpha ) = c_q\) when \(b_{q-1}< s < b_q\).
Since \( Pred _{\mathcal {L}}\) is surjective, for \(\forall c_q \in {\mathcal {Y}}\), the Lebesgue measure of \(\{s | Pred _{\mathcal {L}}(s, \alpha ) = c_q\}\) is greater than 0. Thus, \(-\infty< b_1< b_2< \cdots< b_{Q-1} < +\infty \).
To prove that \({\mathcal {L}}(b_{q-1}, \alpha , c_{q-1})={\mathcal {L}}(b_{q-1}, \alpha , c_q)\), assume first that \({\mathcal {L}}(b_{q-1}, \alpha , c_{q-1}) < {\mathcal {L}}(b_{q-1}, \alpha , c_q)\). Since \({\mathcal {L}}(s, \alpha , y)\) is continuous in s, there exists \(\epsilon > 0\) small enough that \({\mathcal {L}}(b_{q-1}+\epsilon , \alpha , c_{q-1}) < {\mathcal {L}}(b_{q-1}+\epsilon , \alpha , c_q)\). Thus, \( Pred _{\mathcal {L}}(b_{q-1}+\epsilon , \alpha ) = c_{q-1}\), which contradicts \( Pred _{\mathcal {L}}(s, \alpha ) = c_q\) for \(b_{q-1}< s < b_q\). The same contradiction arises for \({\mathcal {L}}(b_{q-1}, \alpha , c_{q-1}) > {\mathcal {L}}(b_{q-1}, \alpha , c_q)\). So \({\mathcal {L}}(b_{q-1}, \alpha , c_{q-1})={\mathcal {L}}(b_{q-1}, \alpha , c_q)\) and \( Pred _{\mathcal {L}}(b_{q-1}, \alpha ) = c_q\). \(\square \)
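The construction in this proof can be checked numerically. The sketch below is our illustration, using a toy squared-distance loss whose argmin predictor is order-preserving and surjective, so Lemma 1 applies; it recovers the thresholds \(b_q = \sup \{ s \mid Pred (s) \preceq c_q \}\) on a fine grid and verifies the boundary equality \({\mathcal {L}}(b_q, c_q) = {\mathcal {L}}(b_q, c_{q+1})\):

```python
import numpy as np

# Toy loss: squared distance between score s and category index q.
def loss(s, q):
    return (s - q) ** 2

# Argmin predictor over Q = 4 categories (ties broken toward smaller q).
def pred(s, Q=4):
    return min(range(Q), key=lambda q: loss(s, q))

# Recover b_q = sup{s : pred(s) <= q} on a fine grid.
grid = np.linspace(-2, 5, 70001)
labels = np.array([pred(s) for s in grid])
thresholds = [grid[labels <= q].max() for q in range(3)]
print(np.round(thresholds, 2))  # ≈ [0.5, 1.5, 2.5], the category midpoints

# Boundary equality from the lemma: L(b_q, c_q) == L(b_q, c_{q+1}).
for q, b in enumerate(thresholds):
    assert abs(loss(b, q) - loss(b, q + 1)) < 1e-3
```

For this loss the thresholds land at the category midpoints, where the losses of the two adjacent categories tie, exactly as the lemma predicts.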
1.2 Proof of Lemma 2
Lemma 2
If a single hidden layer neural network has one input unit, Q ReLU activation hidden units, and one output unit, then it can represent any strictly increasing Q-segment linear function.
Proof
For any strictly increasing Q-segment linear function f(x), let its \(Q-1\) segmentation points be \(t_2< t_3< \cdots < t_{Q}\) and the slopes of its Q segments be \(w_1,\ w_2,\ \ldots ,\ w_Q\), respectively. Then, the single hidden layer neural network

$$ g(x) = f(t_2) - w_1\, \mathrm {ReLU}(t_2 - x) + w_2\, \mathrm {ReLU}(x - t_2) + \sum _{i=3}^{Q} (w_i - w_{i-1})\, \mathrm {ReLU}(x - t_i) $$

uses exactly Q hidden units, has the same segmentation points and segment slopes as f(x), and satisfies \(g(t_2)=f(t_2)\); hence \(g(t_i)=f(t_i)\) for all \(i \in \{2, \ldots , Q\}\). Thus, f(x) and g(x) are identical. \(\square \)
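Lemma 2 can also be verified numerically. The sketch below is our illustration (the breakpoints, slopes, and the particular Q-unit construction are assumptions for the example, with Q = 4): it builds a 4-segment strictly increasing linear function and a network with 4 ReLU hidden units, and checks that they coincide on a grid:

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

# Breakpoints t2 < t3 < t4 and positive segment slopes w1..w4.
t2, t3, t4 = -1.0, 0.5, 2.0
w1, w2, w3, w4 = 0.5, 2.0, 1.0, 3.0

def f(x):  # reference piecewise linear function, anchored at f(t2) = 0
    return np.where(x < t2, w1 * (x - t2),
           np.where(x < t3, w2 * (x - t2),
           np.where(x < t4, w2 * (t3 - t2) + w3 * (x - t3),
                    w2 * (t3 - t2) + w3 * (t4 - t3) + w4 * (x - t4))))

def g(x):  # single hidden layer: 4 ReLU units, linear output
    return (-w1 * relu(t2 - x) + w2 * relu(x - t2)
            + (w3 - w2) * relu(x - t3) + (w4 - w3) * relu(x - t4))

xs = np.linspace(-5, 5, 1001)
assert np.allclose(f(xs), g(xs))
print("f and g agree on [-5, 5]")
```

The first unit supplies the leftmost slope, the second the slope after \(t_2\), and each remaining unit adds the slope change at one breakpoint, for Q units in total.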
1.3 Proof of Theorem 1
Theorem 1
Let \(({\mathcal {S}}_1, {\mathcal {L}}_1)\) be any threshold model whose prediction function \( Pred _{{\mathcal {L}}_1}\) derived from \({\mathcal {L}}_1\) is order-preserving and surjective. For any given threshold vector \({\mathbf {d}}\), there exists a threshold model \(({\mathcal {S}}_2, {\mathcal {L}}_2)\) that takes \({\mathbf {d}}\) as its threshold vector and whose performance is not weaker than that of \(({\mathcal {S}}_1, {\mathcal {L}}_1)\).
Proof
According to Lemma 1, for every \(\alpha \in \Gamma \) of \({\mathcal {L}}_1\), there exists a threshold vector \({\mathbf {b}}\) of \( Pred _{{\mathcal {L}}_1}\) such that \( Pred _{{\mathcal {L}}_1}(s, \alpha ) = c_q\) when \(s \in [b_{q-1}, b_q)\). Let \(\phi (x, \alpha )\) be a strictly increasing Q-segment linear function such that \(\phi (b_q, \alpha )=d_q\) and the slopes of the leftmost and rightmost segments are 1. \(\phi (x, \alpha )\) is determined by \(\alpha \), because the segmentation points \({\mathbf {b}}\) are determined by \(\alpha \) and the values at the segmentation points are \({\mathbf {d}}\).
Let

$$ {\mathcal {L}}_2(s, \alpha , y) = {\mathcal {L}}_1\big (\phi ^{-1}(s, \alpha ), \alpha , y\big ). \quad (1) $$

Then, the prediction function \( Pred _{{\mathcal {L}}_2}\) takes \({\mathbf {d}}\) as its threshold vector. Examples of this loss function transformation are shown in Fig. 10.
Let

$$ \widetilde{{\mathcal {S}}_1} = \big \{\, \phi ^{-1}\big (\varphi (f({\mathbf {x}}, \theta ), {\mathbf {w}}), \alpha \big ) \;\big |\; f(\cdot , \theta ) \in {\mathcal {S}}_1 \,\big \}, \quad (2) $$

where \(\varphi (x,{\mathbf {w}})\) is a single hidden layer neural network with one input unit, Q ReLU activation hidden units, and one output unit, and its parameter vector is \({\mathbf {w}}\). According to Eqs. 1 and 2,

$$ {\mathcal {L}}_2\big (\varphi (f({\mathbf {x}}, \theta ), {\mathbf {w}}), \alpha , y\big ) = {\mathcal {L}}_1\big (\phi ^{-1}\big (\varphi (f({\mathbf {x}}, \theta ), {\mathbf {w}}), \alpha \big ), \alpha , y\big ) \quad (3) $$

is always established.
Denote

$$ (\theta _1^*, \alpha _1^*) = \mathop {\arg \min }\limits _{\theta , \alpha } \sum _{i} {\mathcal {L}}_1\big (f({\mathbf {x}}_i, \theta ), \alpha , y_i\big ) $$

and

$$ (\theta _2^*, {\mathbf {w}}_2^*, \alpha _2^*) = \mathop {\arg \min }\limits _{\theta , {\mathbf {w}}, \alpha } \sum _{i} {\mathcal {L}}_2\big (\varphi (f({\mathbf {x}}_i, \theta ), {\mathbf {w}}), \alpha , y_i\big ). $$

According to Eq. 3, \((\theta _2^*, \, {\mathbf {w}}_2^*, \, \alpha _2^*)\) is also the optimal solution for

$$ \mathop {\min }\limits _{\theta , {\mathbf {w}}, \alpha } \sum _{i} {\mathcal {L}}_1\big (\phi ^{-1}\big (\varphi (f({\mathbf {x}}_i, \theta ), {\mathbf {w}}), \alpha \big ), \alpha , y_i\big ). $$

Therefore,

$$ \sum _{i} {\mathcal {L}}_2\big (\varphi (f({\mathbf {x}}_i, \theta _2^*), {\mathbf {w}}_2^*), \alpha _2^*, y_i\big ) \le \sum _{i} {\mathcal {L}}_1\big (f({\mathbf {x}}_i, \theta _1^*), \alpha _1^*, y_i\big ). $$
The inequality holds because, according to Lemma 2, there exists a \({\mathbf {w}}_0\) such that \(\varphi (x,{\mathbf {w}}_0)=\phi (x)\). Since the minimum loss of the threshold model \((\widetilde{{\mathcal {S}}_1}, {\mathcal {L}}_1)\) on the training set does not exceed that of \(({\mathcal {S}}_1, {\mathcal {L}}_1)\), the performance of \((\widetilde{{\mathcal {S}}_1}, {\mathcal {L}}_1)\) is not weaker than that of \(({\mathcal {S}}_1, {\mathcal {L}}_1)\).
Note that

$$ Pred _{{\mathcal {L}}_2}\big (\varphi (f({\mathbf {x}}, \theta ), {\mathbf {w}}), \alpha \big ) = Pred _{{\mathcal {L}}_1}\big (\phi ^{-1}\big (\varphi (f({\mathbf {x}}, \theta ), {\mathbf {w}}), \alpha \big ), \alpha \big ). $$

Hence, the predictions of the threshold models \(({\mathcal {S}}_2, {\mathcal {L}}_2)\) and \((\widetilde{{\mathcal {S}}_1}, {\mathcal {L}}_1)\) for every \({\mathbf {x}} \in {\mathcal {X}}\) are the same when they have the same parameters, so their performances are the same. Therefore, the performance of \(({\mathcal {S}}_2, {\mathcal {L}}_2)\) is not weaker than that of \(({\mathcal {S}}_1, {\mathcal {L}}_1)\). \(\square \)
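The transform \(\phi \) used in this proof is easy to realize in practice. The sketch below is our illustration (the function name and sample thresholds are ours; it assumes \({\mathbf {b}}\) and \({\mathbf {d}}\) are sorted): a strictly increasing piecewise linear map with \(\phi (b_q)=d_q\) and slope 1 on the two outer segments:

```python
import numpy as np

def phi(x, b, d):
    """Strictly increasing piecewise linear map with phi(b[q]) = d[q] and
    slope 1 on the leftmost and rightmost segments; between b[0] and b[-1]
    it interpolates linearly segment by segment."""
    x = np.asarray(x, dtype=float)
    inner = np.interp(x, b, d)        # linear pieces between the b[q]
    left = d[0] + (x - b[0])          # slope-1 extrapolation on the left
    right = d[-1] + (x - b[-1])       # slope-1 extrapolation on the right
    return np.where(x < b[0], left, np.where(x > b[-1], right, inner))

b = np.array([-3.0, 0.2, 4.0])        # learned thresholds of (S1, L1)
d = np.array([0.5, 1.5, 2.5])         # natural thresholds
print(phi(b, b, d))                   # → [0.5 1.5 2.5]
```

Composing the score function with such a map moves any learned threshold vector onto the natural thresholds without changing which interval a score falls into.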
Cite this article
Wang, X., Song, Y. & Yang, Z. A Natural Threshold Model for Ordinal Regression. Neural Process Lett 55, 4933–4949 (2023). https://doi.org/10.1007/s11063-022-11073-4