
1 Introduction

In the single-label classification (SLC) problem, each data instance is assigned to one class out of two or more classes. However, in many real-world tasks, an object can have multiple labels. For example, a news article may cover multiple topics, an image may be annotated with multiple tags and a medical diagnosis may lead to multiple diseases. Multi-label classification (MLC) [1] deals with the task of assigning each such instance to all of its relevant classes.

Traditional methods for MLC either transform the MLC problem into several SLC problems (problem transformation methods) or adapt an SLC method to multi-label datasets (algorithm adaptation methods). These methods assume label independence and may give inconsistent outputs. For example, an instance may be assigned to two mutually exclusive labels. A method that can correct such inconsistencies by exploiting the dependencies among labels is likely to give improved performance.

We present a framework based on the conditional random field (CRF) that tries to correct the erroneous output from a multi-label classifier by using the dependencies among labels. Results of our studies show that capturing dependencies among the class labels significantly improves the performance of MLC.

The rest of the paper is organised as follows. In Sect. 2, we present a brief review of methods for using the label dependencies in MLC. Section 3 presents the proposed framework that uses the CRF to capture the label dependencies and then uses them to correct the errors in the output of an MLC model. In Sect. 4, we present our experimental studies and results.

2 Approaches to Capture Label Correlations

Capturing label correlations and using them for multi-label learning is important for MLC. We review some of the methods for capturing the correlations among labels.

Classifier chain [2] is based on the chain rule decomposition of the joint probability distribution, where each factor in the chain decomposition is realized using a binary classifier. The input to a classifier in the chain is augmented with the outputs of the previous binary classifiers in the chain. The limitation of this method is that its performance depends on the chain order. The ensemble of classifier chains [2] mitigates this dependence on the chain order by averaging the predictions obtained using different chain orders. A Bayesian network is used in [3] to learn the relationships among the labels. The classifier chain method is then applied with the topological ordering of labels in the Bayesian network as the chain order, and the feature vector is augmented with the output of the parent class classifier. In [4], a cyclic directed graphical model is used to capture the relationships among labels. The model is built by learning a binary classifier for each label given all other labels and the input features, and Gibbs sampling is then used for inference. In [5], a two-stage binary relevance method is used, in which the input to the second stage of binary classifiers is augmented with the outputs of the binary classifiers in the first stage.
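To make the chaining idea concrete, the following is a minimal sketch of a single classifier chain built with scikit-learn; the toy data, the base classifier and the random chain order are illustrative assumptions, not the setup used in [2]. An ensemble of chains would simply average the predictions of several such chains trained with different random orders.

```python
# A minimal classifier chain sketch (illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))                  # toy feature matrix
Y = (rng.random((200, 4)) > 0.5).astype(int)    # toy N x m binary label matrix

# Each classifier in the chain sees the original features augmented with the
# predictions of the classifiers that come earlier in the (random) chain order.
chain = ClassifierChain(LogisticRegression(max_iter=1000),
                        order="random", random_state=0)
chain.fit(X, Y)
Y_pred = chain.predict(X)                       # N x m binary predictions
```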

Methods for MLC using undirected graphical models have been proposed in [6,7,8,9]. In [6], a pairwise Markov random field is used for joint prediction of labels. Similarly, a pairwise CRF is used in [7, 8]: in [7], a tree-structured graph is constructed to identify the set of informative label pairs, whereas in [8] a fully connected graph with pairwise clique potentials is used.

3 Enhancing Multi-label Classification Using Label Dependencies

We propose a two-stage framework for multi-label classification. In the first stage, one of the MLC classifiers such as Binary Relevance (BR) [1], ML-kNN [10] or an ensemble of classifier chains (ECC) is used. In the second stage, the output of the MLC in the first stage is refined by using the dependencies among labels captured by a CRF model.

Let \( \mathcal {D}= \left\{ \left( \mathbf x _{n},\mathbf y _{n} \right) ,1\le n\le N \right\} \) be the multi-label data, where \(\mathbf x _{n}\in \mathfrak {R}^{d} \) is the d-dimensional input instance and \(\mathbf y _{n} = \left\{ y_{1}, y_{2},\ldots ,y_{m}\right\} \) is the m-dimensional desired output vector. Here, m is the number of class labels and \(y_{j}\in \left\{ 0,1 \right\} \). MLC deals with learning the mapping \(h : \mathfrak {R}^{d}\rightarrow \left\{ 0,1 \right\} ^{\textit{m}}\).

In the BR method for MLC, the multi-label dataset is transformed into m binary classification datasets. In the \(j^{th}\) dataset, the instances that belong to the \(j^{th}\) class are treated as positive instances and the remaining instances as negative instances. Any SLC method can be used to build each of the m classifiers, and the prediction for a test instance is obtained from the outputs of the m classifiers. The ML-kNN method is an algorithm adaptation method based on the k-nearest neighbour (kNN) classification for SLC. For a given test instance, ML-kNN first identifies its k-nearest neighbours. The prediction is then obtained using the Bayes rule based on the statistical information obtained from these neighbours.

Let \(\mathbf s = \left\{ s_{1}, s_{2},...,s_{m}\right\} \) be the set of confidence scores obtained from the first stage where \( s_{j} \in [0,1]\) is the output of the classifier corresponding to the \(j^{th}\) class for a given instance \(\mathbf x \).
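As an illustration of the first stage, the following is a minimal sketch of a BR classifier built from \(L_{2}\)-regularized logistic regression that returns the confidence scores \(\mathbf s \) for each test instance; the function name and interface are our own and only indicate how the scores feeding the CRF could be produced.

```python
# A minimal sketch of a first-stage BR model producing confidence scores s.
import numpy as np
from sklearn.linear_model import LogisticRegression

def first_stage_scores(X_train, Y_train, X_test):
    """Return an (n_test, m) array of scores s_j in [0, 1], one per label."""
    m = Y_train.shape[1]
    scores = np.zeros((X_test.shape[0], m))
    for j in range(m):
        # One L2-regularized binary classifier per label (binary relevance).
        clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
        clf.fit(X_train, Y_train[:, j])
        scores[:, j] = clf.predict_proba(X_test)[:, 1]   # P(y_j = 1 | x)
    return scores
```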

3.1 Conditional Random Field

Conditional Random Field (CRF) [11] is a discriminative undirected probabilistic graphical model that directly models the conditional probability distribution \(p(\mathbf y | \mathbf s )\), where \(\mathbf y \) is the set of output variables and \(\mathbf s \) is the set of observed input variables, as shown in Fig. 1. In the proposed method, the set of confidence scores \(\mathbf s \) obtained from the first stage is used as the input to the CRF. The graph associated with the CRF encodes the dependencies among the output variables: an edge between two nodes indicates that the corresponding variables are dependent on each other. The conditional probability distribution \(p(\mathbf y | \mathbf s )\) is given by the normalized product of clique potentials.

Fig. 1. A factor graph representation of the proposed CRF based model. The unshaded circles represent the class variables \(\mathbf y \), the shaded circles represent the input variables \(\mathbf s \), the edges among the class nodes represent the dependencies among class variables, and the solid blocks represent the factors associated with those variables.

We use a CRF with pairwise potentials to model the dependencies among the labels \( \mathbf y \) using the output \(\mathbf s \) from the first stage. Let \(G = \left( V,E \right) \) be the graph associated with the CRF. The nodes V of the graph represent the class variables and the edges E represent the dependence relationships among the class variables. The conditional distribution \(p(\mathbf y | \mathbf s )\) is given by

$$\begin{aligned} p(\mathbf y | \mathbf s ) = \frac{1}{Z\left( \mathbf s \right) }\prod _{i \in V}\varPhi _{i}\left( y_{i}, \mathbf s \right) \prod _{\left( i,j \right) \in E}\psi _{ij}\left( y_{i},y_{j}, \mathbf s \right) \end{aligned}$$
(1)

where \(\varPhi _{i}\) is the node potential associated with the \(i^{th}\) node and \(\psi _{ij}\) is the edge potential associated with the edge \(\left( i,j \right) \). The normalization constant \(Z\left( \mathbf s \right) \), also known as the partition function, is given by

$$\begin{aligned} Z\left( \mathbf s \right) = \sum _{ \mathbf y } \left[ \prod _{i \in V}\varPhi _{i}\left( y_{i}, \mathbf s \right) \prod _{\left( i,j \right) \in E}\psi _{ij}\left( y_{i},y_{j}, \mathbf s \right) \right] \end{aligned}$$
(2)

For the binary variable \(y_{i}\in \left\{ 0,1 \right\} \), the node potential \(\varPhi _{i}\) for different assignments of \(y_{i}\) is given by

$$\begin{aligned} \varPhi _{i}\left( y_{i},\mathbf s \right) = \left( e^{f_{i}\left( \mathbf s \right) v_{i}^{0}}, e^{f_{i}\left( \mathbf s \right) v_{i}^{1}} \right) \end{aligned}$$
(3)

where \(v_{i}^{0}\) and \( v_{i}^{1} \) are the node parameters corresponding to the state \(y_{i} = 0\) and \(y_{i} = 1\) respectively, and \(f_{i}\left( \mathbf s \right) = s_{i}\) is the node feature.

Similarly, the edge potential \(\psi _{ij}\) for the different assignments \(\left( y_{i},y_{j} \right) \in \left\{ 0,1 \right\} ^{2}\) of the edge \((i,j)\) is defined by

$$\begin{aligned} \psi _{ij}\left( y_{i},y_{j},\mathbf s \right) = \begin{pmatrix} e^{\mathbf f _{ij}\left( \mathbf s \right) \mathbf w _{ij}^{0,0}} &amp; e^{\mathbf f _{ij}\left( \mathbf s \right) \mathbf w _{ij}^{0,1}}\\ e^{\mathbf f _{ij}\left( \mathbf s \right) \mathbf w _{ij}^{1,0}} &amp; e^{\mathbf f _{ij}\left( \mathbf s \right) \mathbf w _{ij}^{1,1}} \end{pmatrix} \end{aligned}$$
(4)

where \(\mathbf f _{ij}\left( \mathbf s \right) = \left[ s_{i}, s_{j} \right] ^{T}\) are the edge features and \((\mathbf w _{ij}^{0,0}, \mathbf w _{ij}^{0,1}, \mathbf w _{ij}^{1,0}, \mathbf w _{ij}^{1,1}) \) are the edge parameters.

Let \( \varvec{ \theta } = \left[ \mathbf v , \mathbf w \right] \) be the combined parameter vector and let the corresponding feature functions be combined as \( F\left( \mathbf s ,\mathbf y \right) \). Eq. (1) can now be written succinctly as

$$\begin{aligned} p\left( \mathbf y |\mathbf s \right) = \frac{1}{Z\left( \varvec{ \theta } , \mathbf s \right) } exp\left( \varvec{ \theta } ^{T}F\left( \mathbf s ,\mathbf y \right) \right) \end{aligned}$$
(5)
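For a small number of labels m, the distribution defined by Eqs. (1)-(4) can be evaluated exactly by enumerating all \(2^{m}\) label vectors. The sketch below does this purely for illustration; the array layouts chosen for the parameters v and w are assumptions, and exhaustive enumeration is infeasible for larger label sets, which motivates the pseudo-likelihood and approximate inference used in the following subsections.

```python
# A brute-force evaluation of p(y | s) for a pairwise CRF (small m only).
import itertools
import numpy as np

def joint_distribution(s, edges, v, w):
    """
    s     : (m,) first-stage confidence scores
    edges : list of (i, j) pairs forming the CRF graph
    v     : (m, 2) node parameters, v[i, a] = v_i^a               (Eq. 3)
    w     : dict (i, j) -> (2, 2, 2) edge parameters,
            w[(i, j)][a, b] is the weight vector w_ij^{a,b}       (Eq. 4)
    Returns a dict mapping each label vector y to p(y | s).
    """
    m = len(s)
    unnorm = {}
    for y in itertools.product([0, 1], repeat=m):
        log_phi = sum(s[i] * v[i, y[i]] for i in range(m))              # node terms
        log_psi = sum(np.dot([s[i], s[j]], w[(i, j)][y[i], y[j]])       # edge terms
                      for (i, j) in edges)
        unnorm[y] = np.exp(log_phi + log_psi)
    Z = sum(unnorm.values())                                            # Eq. (2)
    return {y: val / Z for y, val in unnorm.items()}
```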

3.2 Objective Function

The objective function for learning the CRF parameters, the negative log likelihood (nll), is given by

$$\begin{aligned} nll\left( \varvec{ \theta } \right) = -\sum _{n=1}^{N} log \ p\left( \mathbf y _{n}|\mathbf s _{n} \right) = - \sum _{n=1}^{N}\left[ \varvec{ \theta } ^{T}F\left( \mathbf s _{n},\mathbf y _{n} \right) -log \ Z\left( \varvec{ \theta } , \mathbf s _{n} \right) \right] \end{aligned}$$
(6)

The gradient of the negative log likelihood [12] is given by

$$\begin{aligned} \nabla nll\left( \varvec{ \theta } \right) = - \sum _{n=1}^{N}\left[ F\left( \mathbf s _{n},\mathbf y _{n} \right) - E_\mathbf{y '}\left[ F\left( \mathbf s ,\mathbf y ' \right) \right] \right] \end{aligned}$$
(7)

where \( E_\mathbf{y '}\left[ F\left( \mathbf s ,\mathbf y ' \right) \right] = \sum _\mathbf{y '} p\left( \mathbf y '|\mathbf s \right) F\left( \mathbf s ,\mathbf y ' \right) \) is the expectation of the feature functions under the model. To compute this expectation, we have to run an inference algorithm to obtain the model distribution \(p\left( \mathbf y '|\mathbf s \right) \) over all values of \(\mathbf y '\), which makes computing the gradient very expensive. Two main solutions to address this issue are: (a) using an approximate inference algorithm such as loopy belief propagation and (b) using a surrogate objective function such as the pseudo-likelihood. We consider the second approach, which uses the pseudo-likelihood. The negative log pseudo-likelihood (nlpl) for a CRF is given by

$$\begin{aligned} nlpl\left( \varvec{ \varvec{ \theta }} \right) = -\sum _{n = 1}^{N} log \ PL\left( \mathbf {y}_{n}|\mathbf s _{n} \right) =-\sum _{n = 1}^{N} \sum _{i\in V}log \ p\left( y_{i,n}|\mathbf {y}_{\mathcal {N}_{i},n}, \mathbf s _{n};\varvec{\varvec{ \theta } } \right) \end{aligned}$$
(8)

where \(\mathbf {y}_{\mathcal {N}_{i},n}\) is the set of values of the neighbours \(\mathcal {N}_{i}\) of the \(i^{th}\) node for the \(n^{th}\) instance. The negative log pseudo-likelihood is a convex function of the parameters \(\varvec{ \theta }\) and is known to be a consistent estimator, i.e., it recovers the same parameters as the maximum likelihood estimate of \(\varvec{ \theta }\) when the number of instances goes to infinity [15].

In concise notation, the local conditional distribution can be written as

$$\begin{aligned} p\left( y_{i}|\mathbf {y}_{\mathcal {N}_{i}}, \mathbf s ;\varvec{\varvec{ \theta } } \right) = \frac{1}{Z_{i}\left( \varvec{ \theta }_{i} , \mathbf s \right) } exp\left( \varvec{ \theta }_{i} ^{T}F_{i}\left( \mathbf s ,\mathbf y \right) \right) \end{aligned}$$
(9)

where \(\varvec{ \theta }_{i} = \left( \mathbf {v}_{i}, \left\{ \mathbf {w}_{ij} \right\} _{j\in \mathcal {N}_{i}} \right) \) are the parameters corresponding to the \(i^{th}\) node and its neighbours, \(Z_{i}\) is the local partition function, and \(F_{i}\) is the local feature vector. The local partition function \(Z_{i}\) can be computed by summing only over the values of \(y_{i}\).
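The sketch below computes the negative log pseudo-likelihood of Eq. (8) for a single instance, reusing the illustrative parameter layout from the earlier sketch; each local conditional only sums over the two states of \(y_{i}\), which is what makes this surrogate objective cheap compared with the full likelihood.

```python
# Negative log pseudo-likelihood of one instance for the pairwise CRF (sketch).
import numpy as np

def nlpl_single(y, s, edges, v, w):
    """y: (m,) observed binary labels; s, edges, v, w as in the earlier sketch."""
    m = len(s)
    neighbours = {i: [] for i in range(m)}
    for (i, j) in edges:
        neighbours[i].append(j)
        neighbours[j].append(i)

    def local_score(i, yi):
        # Node term plus the edge terms involving node i, with the
        # neighbouring labels fixed to their observed values.
        score = s[i] * v[i, yi]
        for j in neighbours[i]:
            a, b = ((i, j) if (i, j) in w else (j, i))
            ya, yb = ((yi, y[j]) if a == i else (y[j], yi))
            score += np.dot([s[a], s[b]], w[(a, b)][ya, yb])
        return score

    nlpl = 0.0
    for i in range(m):
        s0, s1 = local_score(i, 0), local_score(i, 1)
        log_Zi = np.logaddexp(s0, s1)            # local partition function Z_i
        nlpl -= (s1 if y[i] == 1 else s0) - log_Zi
    return nlpl
```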

3.3 CRF Structure and Parameter Learning

The structure of the CRF can be learnt by minimizing the negative log pseudo-likelihood function with \(L_{1}\) regularization [13]. The \(L_{1}\) norm based regularization is known to give a sparse solution. We impose the \(L_{1}\) regularization on each group of parameters associated with an edge in the graph [14]. This induces sparsity in the edge weight parameters, where all parameters associated with a specific edge go to zero simultaneously. Using an \(L_{2}\) regularizer for the node parameters, the regularization term \(R(\varvec{ \theta } )\) can be written as

$$\begin{aligned} R(\varvec{ \theta } ) = \lambda _{1}\left\| \mathbf v \right\| _{2}^{2} + \lambda _{2} \sum _{b \in E}\left\| \mathbf w _{b} \right\| _{2} \end{aligned}$$
(10)

where \(\mathbf w _{b} = (\mathbf w _{ij}^{0,0}, \mathbf w _{ij}^{0,1}, \mathbf w _{ij}^{1,0}, \mathbf w _{ij}^{1,1}) \) is the set of weight parameters for the different configurations of the edge \(b = (i,j)\). The parameters of the CRF are found by minimizing the regularized loss function given below

$$\begin{aligned} \varvec{ \theta } ^{*} = arg min_{\varvec{ \theta } }(nlpl\left( \varvec{ \theta } \right) + R(\varvec{ \theta } ) ) \end{aligned}$$
(11)

We use the projected quasi-Newton method [16] to solve the above optimization problem. The structure of the CRF then corresponds to all edges in the graph that have non-zero weight parameters. After fixing the structure of the CRF, the \(L_{2}\) norm regularization is used over the edge parameters, and the limited-memory BFGS method [17] is used to further fine-tune the model's parameters for the given structure. After training the model, the loopy belief propagation method is used to obtain the final predictions.
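The group penalty in Eq. (10) drives the entire weight block of a weak edge to zero at once, which is how the structure is selected. The sketch below illustrates only this sparsity mechanism with a proximal (group soft-thresholding) gradient step; it is not the projected quasi-Newton procedure [16] or the L-BFGS refinement [17] used in our experiments.

```python
# Illustration of group sparsity on edge weight blocks (sketch only).
import numpy as np

def group_soft_threshold(w_edge, step, lam2):
    """Proximal operator of step * lam2 * ||w_edge||_2 for one edge block."""
    norm = np.linalg.norm(w_edge)
    if norm <= step * lam2:
        return np.zeros_like(w_edge)      # the whole edge is dropped
    return (1.0 - step * lam2 / norm) * w_edge

def proximal_step(w, grads, step, lam2):
    """One proximal gradient step; w and grads map each edge to its weight block."""
    return {e: group_soft_threshold(w[e] - step * grads[e], step, lam2) for e in w}

# After convergence, the learnt structure is the set of edges whose block
# survived the thresholding:
#   structure = [e for e, w_e in w.items() if np.linalg.norm(w_e) > 0]
```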

4 Experiments

We performed the experiments on the following multi-label datasets from Mulan [18]: Emotion, Enron, Medical, Scene and Yeast.

The evaluation metrics used to compare the various methods are: Accuracy, Subset-accuracy (exact match) and Hamming loss [1].
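For completeness, the following is a minimal sketch of the three metrics on binary label matrices of shape (N, m), assuming the standard example-based (Jaccard) definition of Accuracy from [1].

```python
# Example-based multi-label evaluation metrics (sketch).
import numpy as np

def accuracy(Y_true, Y_pred):
    """Mean Jaccard similarity between true and predicted label sets."""
    inter = np.logical_and(Y_true, Y_pred).sum(axis=1)
    union = np.logical_or(Y_true, Y_pred).sum(axis=1)
    return np.mean(np.where(union > 0, inter / np.maximum(union, 1), 1.0))

def subset_accuracy(Y_true, Y_pred):
    """Fraction of instances whose full label vector is predicted exactly."""
    return np.mean(np.all(Y_true == Y_pred, axis=1))

def hamming_loss(Y_true, Y_pred):
    """Fraction of individual label assignments that are wrong (lower is better)."""
    return np.mean(Y_true != Y_pred)
```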

Table 1. Accuracy comparison of different single-stage MLC methods (BR, ML-kNN and ECC) with the proposed two-stage method using the CRF.
Table 2. Performance comparison of the proposed method (\(CRF_{ECC}\)) with other state-of-the-art methods: Collective Multi-Label classification (CML) [8], Meta Binary Relevance (MBR) [5] and Conditional Dependency Network (CDN) [4].

We compared the performance of the proposed method with different existing methods for MLC. The BR, ML-kNN and ECC based MLC methods are used in the first stage. Logistic regression with \(L_{2}\) regularization is used as the base classifier for the BR method, and SVMs are used as the base classifiers for ECC. For ML-kNN, we used the code released by its authors. We used the UGM toolbox [19] for the CRF implementation, and the other MLC methods were implemented using MEKA. All hyper-parameters were tuned using cross-validation.

The performance of the proposed two-stage method using different MLCs in the first stage is presented in Table 1. For all three MLC methods, the CRF based two-stage method is able to enhance the performance. The improvement is more significant on datasets that have a high correlation among class labels. Table 2 presents the comparison of the proposed method against the other existing methods. The proposed method performs better than all the other methods, which shows the effectiveness of capturing label dependencies for MLC.

5 Conclusion

In this paper, we proposed a two-stage framework for multi-label classification using the conditional random field. It captures the dependencies among labels to improve the MLC performance. An optimization-based framework is used for learning the structure of the CRF. Experimental results show the effectiveness of the proposed method on benchmark multi-label datasets.