Conditional ordinal random fields for structured ordinal-valued label prediction

Abstract

Predicting the labels of structured data such as sequences or images is an important problem in statistical machine learning and data mining. The conditional random field (CRF) is perhaps one of the most successful approaches to structured label prediction via conditional probabilistic modeling. Such models traditionally assume that each label is a random variable over a nominal category set (e.g., class categories) in which all categories are symmetric and unrelated to one another. In this paper we consider the different situation of ordinal-valued labels, where each label category carries a particular meaning of preference or order. This setup fits many interesting problems and datasets in which one wants to predict labels that represent degrees of intensity or relevance. We propose an intuitive and principled CRF-like model that can effectively handle ordinal-scale labels within an underlying correlation structure. Unlike standard log-linear CRFs, learning the proposed model entails non-convex optimization; nevertheless, the model can be learned accurately using efficient gradient search. We demonstrate the improved prediction performance of the proposed model on several intriguing sequence and image label prediction tasks.

Notes

  1. This article extends our earlier conference paper (Kim and Pavlovic 2010). Whereas that work was limited to sequence data, focusing on facial emotion intensity prediction for video sequences, here we extend the model to lattice-structured image data and provide a more detailed exposition and additional evaluations.

  2. This is mainly due to density integrability issues.

  3. We use the notation \(\mathbf{x}\) interchangeably for a structured observation \(\mathbf{x}=\{\mathbf{x}_r\}\) and for a vector; the intended meaning is clear from context.

  4. This can be seen as a general form of the popular one-vs-all or one-vs-one treatment for the multi-class problem.

  5. For simplicity, we often drop the dependency on \({\varvec{\theta }}\) in notations.

  6. A clique of a graph is a set of fully connected nodes; a maximal clique is one that cannot be enlarged while remaining fully connected.

  7. The potential function is the inner product of the model parameters and the feature vector, and it expresses the goodness of a state/label configuration with respect to the current model. Node potentials measure this for individual sites (nodes), while edge potentials capture the relation between adjoining sites (e.g., smoothness in label variation); see the first sketch following these notes.

  8. It is also possible to use \(\exp (\delta _k)\) in place of \(\delta _k^2\), which can be beneficial for avoiding additional modes in the objective function; see the second sketch following these notes.

  9. We also tested a static approach, the Gaussian process ordinal regressor (GPOR) of Chu and Ghahramani (2005). However, its test performance on this dataset was far worse than that of the SVOR.

  10. We performed the paired \(t\) test for CRF versus CORF; the \(p\)-value was 0.0020 for both the 0/1 loss and the absolute loss. See the third sketch following these notes.

  11. Facial emotion intensity prediction is particularly important for better understanding of facial emotions. A typical problem is the facial action unit (AU) analysis in computer vision and cognitive science where one aims to identify/recognize which actions of individual muscles or activations of groups of muscles cause a specific facial emotion. The intensity labeling by human experts is accurate but very costly, and hence, automatic emotion intensity prediction is highly advantageous.

  12. Because our graphical model is undirected, one needs to include edge potentials for both directions: for an edge \(e=(r,s)\), one for \(r \rightarrow s\) and another for \(s \rightarrow r\).
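
The following minimal sketches illustrate Notes 7, 8, and 10 above. They are not the paper's code; all variable names, dimensions, and numeric values are illustrative assumptions.

For Note 7, a log-linear node/edge potential in the spirit of the text (the feature dimension d, label set size R, and parameter names are hypothetical):

```python
import numpy as np

# Hypothetical sizes: d-dimensional site features, R ordinal label categories.
d, R = 8, 5
rng = np.random.default_rng(0)
W_node = rng.normal(size=(R, d))  # node-potential parameters: one weight vector per label
W_edge = rng.normal(size=(R, R))  # edge-potential parameters: one weight per label pair

def node_potential(x_r, c):
    """Goodness of label c at site r: inner product of parameters and features."""
    return W_node[c] @ x_r

def edge_potential(c_r, c_s):
    """Compatibility of labels on adjoining sites, e.g., rewarding smooth label variation."""
    return W_edge[c_r, c_s]

# Per Note 12, an undirected edge e=(r,s) contributes potentials in both
# directions: edge_potential(y_r, y_s) and edge_potential(y_s, y_r).
```

For Note 8, the two ways of keeping consecutive cut-points ordered. The construction \(b_c = b_1 + \sum_{j=1}^{c-1} \delta_j^2\) is inferred from gradients (24)–(25) in the Appendix, so treat it as an assumption:

```python
import numpy as np

def cutpoints(b1, delta, squared=True):
    """Ordered cut-points b_1 < b_2 < ... from unconstrained increments delta.

    squared=True uses delta_j**2 increments; squared=False uses exp(delta_j),
    which Note 8 suggests can help avoid adding extra modes to the objective.
    """
    inc = delta ** 2 if squared else np.exp(delta)
    return np.concatenate(([b1], b1 + np.cumsum(inc)))

print(cutpoints(-1.0, np.array([0.5, 1.2, 0.3])))  # increasing sequence of cut-points
```

For Note 10, the paired \(t\) test can be reproduced with standard tooling; the loss arrays below are placeholders, not the paper's measurements:

```python
import numpy as np
from scipy import stats

loss_crf = np.array([0.31, 0.28, 0.35, 0.30, 0.33])    # hypothetical per-trial CRF losses
loss_corf = np.array([0.24, 0.22, 0.27, 0.25, 0.26])   # hypothetical per-trial CORF losses

t_stat, p_value = stats.ttest_rel(loss_crf, loss_corf)  # paired t test
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")           # small p => significant difference
```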

References

  • Buffoni D, Calauzenes C, Gallinari P, Usunier N (2011) Learning scoring functions with order-preserving losses and standardized supervision. In: Getoor L, Scheffer T (eds) Proceedings of the 28th international conference on machine learning (ICML-11), ICML ’11, ACM, New York, pp 825–832

  • Chu W, Ghahramani Z (2005) Gaussian processes for ordinal regression. J Mach Learn Res 6:1019–1041

  • Chu W, Keerthi SS (2005) New approaches to support vector ordinal regression. In: De Raedt L, Wrobel S (eds) Proceedings of the 22nd international machine learning conference, ACM Press, New York

  • Crammer K, Singer Y (2001) On the algorithmic implementation of multiclass kernel-based vector machines. J Mach Learn Res 2:265–292

  • Gunawardana A, Mahajan M, Acero A, Platt JC (2005) Hidden conditional random fields for phone classification. In: International conference on speech communication and technology, Lisbon, pp 1117–1120

  • He X, Zemel RS, Carreira-Perpiñán MÁ (2004) Multiscale conditional random fields for image labeling. In: IEEE conference on computer vision and pattern recognition, pp 695–702

  • Herbrich R, Graepel T, Obermayer K (2000) Large margin rank boundaries for ordinal regression. In: Smola AJ, Bartlett PL (eds) Advances in large margin classifiers. MIT Press, Cambridge

  • Hu Y, Li M, Yu N (2008) Multiple-instance ranking: learning to rank images for image retrieval. In: Computer vision and pattern recognition, Anchorage, USA

  • Ionescu C, Bo L, Sminchisescu C (2009) Structural SVM for visual localization and continuous state estimation. In: International conference on computer vision, pp 1157–1164

  • Jing Y, Baluja S (2008) Pagerank for product image search. In: Proceedings of the 17th international conference on World Wide Web, Beijing, China

  • Jordan MI (2004) Graphical models. Stat Sci 19:140–155

  • Kim M, Pavlovic V (2009) Discriminative learning for dynamic state prediction. IEEE Trans Pattern Anal Mach Intell 31(10):1847–1861

  • Kim M, Pavlovic V (2010) Structured output ordinal regression for dynamic facial emotion intensity prediction. In: Daniilidis K, Maragos P, Paragios N (eds) European conference on computer vision, Crete, Greece, pp 649–662

  • Kumar S, Hebert M (2006) Discriminative random fields. Int J Comput Vis 68:179–201

  • Lafferty J, McCallum A, Pereira FCN (2001) Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: International conference on machine learning, Morgan Kaufmann, Williamstown, pp 282–289

  • Lien J, Kanade T, Cohn J, Li C (2000) Detection, tracking, and classification of action units in facial expression. J Robot Auton Syst 31(3):131–146

  • Liu Y, Liu Y, Chan KCC (2011) Ordinal regression via manifold learning. In: Twenty-fifth AAAI conference on artificial intelligence, pp 398–403

  • Locarnini RA, Mishonov AV, Antonov JI, Boyer TP, Garcia HE (2006) World ocean atlas 2005. In: Levitus S (ed) NOAA Atlas NESDIS. US Government Printing Office, Washington, DC, pp 61–64

  • Mao Y, Lebanon G (2009) Generalized isotonic conditional random fields. Mach Learn 77(2–3):225–248

  • Pavlovic V, Rehg JM, Maccormick J (2000) Learning switching linear models of human motion. In: Advances in neural information processing systems, pp 981–987

  • Qin T, Liu TY, Zhang XD, Wang DS, Li H (2008) Global ranking using continuous conditional random fields. In: Koller D, Schuurmans D, Bengio Y, Bottou L (eds) Advances in neural information processing systems. Morgan Kaufmann, San Francisco

  • Shan C, Gong S, McOwan PW (2005) Conditional mutual information based boosting for facial expression recognition. In: British machine vision conference

  • Shashua A, Levin A (2003) Ranking with large margin principle: two approaches. In: Thrun S, Saul L, Schölkopf B (eds) Advances in neural information processing systems

  • Tian TP, Li R, Sclaroff S (2005) Articulated pose estimation in a learned smooth space of feasible solutions. In: IEEE workshop in computer vision and pattern recognition

  • Tian Y (2004) Evaluation of face resolution for expression analysis. In: IEEE computer vision and pattern recognition workshop on face processing in video

  • Viola P, Jones MJ (2004) Robust real-time face detection. Int J Comput Vis 57(2):137–154

  • Vishwanathan S, Schraudolph N, Schmidt M, Murphy K (2006) Accelerated training of conditional random fields with stochastic meta-descent. In: Cohen W, Moore A (eds) Proceedings of the 23rd international machine learning conference, Omni Press, Edinburgh

  • Wang S, Quattoni A, Morency LP, Demirdjian D, Darrell T (2006) Hidden conditional random fields for gesture recognition. In: Computer vision and pattern recognition

  • Weiss Y (2001) Comparing the mean field method and belief propagation for approximate inference in MRFs. In: Saad D, Opper M (eds) Advanced mean field methods. MIT Press, Cambridge

  • Weston J, Wang C, Weiss R, Berenzweig A (2012) Latent collaborative retrieval. In: Langford J, Pineau J (eds) Proceedings of the 29th international conference on machine learning (ICML-12), Omnipress, ICML ’12, pp 9–16

  • Yang P, Liu Q, Metaxas DN (2009) Rankboost with l1 regularization for facial expression recognition and intensity estimation. In: International conference on computer vision, pp 1018–1025

  • Yedidia J, Freeman W, Weiss Y (2003) Understanding belief propagation and its generalizations. In: Exploring artificial intelligence in the new millennium, chap 8. Morgan Kaufmann, San Francisco, pp 239–269

Author information

Corresponding author

Correspondence to Minyoung Kim.

Additional information

Communicated by Johannes Fürnkranz.

Appendix

The gradients of \(z_k(r,c)\) in (16) with respect to the model parameters \(\mathbf{a}\), \(\sigma_0\), \(b_1\), and \(\delta_j\), for \(k=0,1\) and \(j=1,\dots,R-2\), are summarized as follows:

$$\begin{aligned}
\frac{\partial z_k(r,c)}{\partial \mathbf{a}} &= -\frac{1}{\sigma_0^2}\,{\varvec{\phi }}(\mathbf{x}_r), &\quad (20)\\
\frac{\partial z_k(r,c)}{\partial \sigma_0} &= -\frac{2\big(b_{c-k} - \mathbf{a}^{\top}{\varvec{\phi }}(\mathbf{x}_r)\big)}{\sigma_0^3}, &\quad (21)\\
\frac{\partial z_0(r,c)}{\partial b_1} &= \begin{cases} 0 & \text{if } c=R \\ \frac{1}{\sigma_0^2} & \text{otherwise,} \end{cases} &\quad (22)\\
\frac{\partial z_1(r,c)}{\partial b_1} &= \begin{cases} 0 & \text{if } c=1 \\ \frac{1}{\sigma_0^2} & \text{otherwise,} \end{cases} &\quad (23)\\
\frac{\partial z_0(r,c)}{\partial \delta_j} &= \begin{cases} 0 & \text{if } c \in \{1,\dots,j\} \cup \{R\} \\ \frac{2\delta_j}{\sigma_0^2} & \text{otherwise,} \end{cases} &\quad (24)\\
\frac{\partial z_1(r,c)}{\partial \delta_j} &= \begin{cases} 0 & \text{if } c \in \{1,\dots,j+1\} \\ \frac{2\delta_j}{\sigma_0^2} & \text{otherwise.} \end{cases} &\quad (25)
\end{aligned}$$
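
A hedged NumPy transcription of (20)–(25) follows; this is a sketch, not the paper's code. It assumes \(z_k(r,c) = \big(b_{c-k} - \mathbf{a}^{\top}{\varvec{\phi }}(\mathbf{x}_r)\big)/\sigma_0^2\) with cut-points \(b_c = b_1 + \sum_{j=1}^{c-1}\delta_j^2\) and the boundary conventions \(b_0=-\infty\), \(b_R=+\infty\); these definitions are inferred from the gradients above, since (16) is not reproduced here.

```python
import numpy as np

def z_and_grads(phi, a, sigma0, b1, delta, c, k, R):
    """z_k(r,c) and its gradients per (20)-(25); c in {1,...,R}, k in {0,1}.

    phi, a : feature vector phi(x_r) and weights (NumPy arrays of equal length)
    delta  : increments delta_1, ..., delta_{R-2} (NumPy array)
    """
    idx = c - k                      # index of the cut-point b_{c-k} in use
    if idx == 0 or idx == R:         # b_0 = -inf, b_R = +inf: no parameter dependence
        z = -np.inf if idx == 0 else np.inf
        return z, np.zeros_like(a), 0.0, 0.0, np.zeros_like(delta)
    b = b1 + np.sum(delta[:idx - 1] ** 2)              # assumed cut-point construction
    z = (b - a @ phi) / sigma0 ** 2
    dz_da = -phi / sigma0 ** 2                         # (20)
    dz_dsigma0 = -2.0 * (b - a @ phi) / sigma0 ** 3    # (21)
    dz_db1 = 1.0 / sigma0 ** 2                         # (22), (23)
    dz_ddelta = np.zeros_like(delta)                   # (24), (25): zero unless j < c - k
    dz_ddelta[:idx - 1] = 2.0 * delta[:idx - 1] / sigma0 ** 2
    return z, dz_da, dz_dsigma0, dz_db1, dz_ddelta

# Example with hypothetical numbers: R=4 labels, 3-dim features, R-2=2 increments.
phi = np.array([0.2, -0.1, 0.5]); a = np.array([1.0, 0.3, -0.7])
print(z_and_grads(phi, a, sigma0=0.8, b1=-0.5, delta=np.array([0.4, 0.9]), c=2, k=0, R=4))
```

A finite-difference check on any single entry is a cheap way to validate such a transcription against (20)–(25).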

Cite this article

Kim, M. Conditional ordinal random fields for structured ordinal-valued label prediction. Data Min Knowl Disc 28, 378–401 (2014). https://doi.org/10.1007/s10618-013-0305-2
