Abstract
Multi-label classification (MLC) deals with the task of assigning an instance to all its relevant classes. This task becomes challenging in the presence of label dependencies. MLC methods that assume label independence do not use the dependencies among labels. We present a two-stage framework that improves the performance of MLC by using label dependencies. In the first stage, a standard MLC method is used to obtain confidence scores for the labels. In the second stage, a conditional random field (CRF) refines the first-stage output by using the dependencies among labels. An optimization-based framework is used to learn the structure and parameters of the CRF. Experiments show that the proposed model performs better than the state-of-the-art methods for MLC.
1 Introduction
In the single-label classification (SLC) problem, each data instance is assigned to one class out of two or more classes. However, in real-world tasks, an object can have multiple labels. For example, a news article may cover multiple topics, an image may have multiple labels and a patient may be diagnosed with multiple diseases. Multi-label classification (MLC) [1] deals with the task of assigning each such instance to all its relevant classes.
Traditional methods for MLC either transform the MLC problem into several SLC problems (problem transformation methods) or adapt an SLC method for multi-label datasets (algorithm adaptation methods). These methods assume label independence and may give inconsistent outputs. For example, an instance may be assigned two mutually exclusive labels. A method that exploits the label dependencies to correct such inconsistencies is likely to give improved performance.
We present a framework based on the conditional random field (CRF) that tries to correct the erroneous output from a multi-label classifier by using the dependencies among labels. Results of our studies show that capturing dependencies among the class labels significantly improves the performance of MLC.
The rest of the paper is organised as follows. In Sect. 2, we present a brief review of methods for using the label dependencies in MLC. Section 3 presents the proposed framework, which uses a CRF to capture the label dependencies and then uses them to correct the errors in the output of an MLC model. In Sect. 4, we present our experimental studies and results. Section 5 concludes the paper.
2 Approaches to Capture Label Correlations
Capturing label correlations and using them for multi-label learning is important for MLC. We review some of the methods for capturing the correlations among labels.
The classifier chain method [2] is based on the chain rule decomposition of the joint probability distribution, where each factor in the decomposition is realized using a binary classifier. The input to a classifier in the chain is augmented with the outputs of the preceding binary classifiers in the chain. The limitation of this method is that its performance depends on the chain order. The ensemble of classifier chains [2] mitigates this dependence by averaging the predictions obtained using different chain orders. In [3], a Bayesian network is used to learn the relationships among the labels. The classifier chain method is then applied with the topological ordering of the labels in the Bayesian network as the chain order, and the feature vector is augmented with the outputs of the classifiers for the parent labels. In [4], a cyclic directed graphical model is used to capture the relationships among labels. The model is built by learning a binary classifier for each label given all other labels and the input features, and Gibbs sampling is used for inference. In [5], a two-stage binary relevance method is used, in which the input to the second stage of binary classifiers is augmented with the outputs of the binary classifiers in the first stage.
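The input augmentation used by a classifier chain can be illustrated with a minimal Python sketch. The names `chain_predict` and `toy_chain` and the hand-written rules standing in for trained binary classifiers are hypothetical and are not taken from [2]; the sketch only shows how each prediction is appended to the input of the next classifier.

```python
from typing import Callable, List, Sequence

# Hypothetical type: a binary classifier maps a feature vector to 0 or 1.
BinaryClassifier = Callable[[Sequence[float]], int]


def chain_predict(x: Sequence[float], chain: List[BinaryClassifier]) -> List[int]:
    """Predict labels in chain order, feeding earlier predictions to later classifiers."""
    augmented = list(x)
    labels = []
    for clf in chain:
        y = clf(augmented)
        labels.append(y)
        augmented.append(float(y))  # augment the input for the next classifier
    return labels


# Toy stand-ins for trained classifiers (assumptions for illustration only):
# label 1 fires when the first feature is positive,
# label 2 copies label 1, and label 3 fires only when label 2 did not.
toy_chain = [
    lambda z: int(z[0] > 0),
    lambda z: int(z[-1] == 1.0),
    lambda z: int(z[-1] == 0.0),
]

prediction = chain_predict([0.5, -1.0], toy_chain)
```

Because every classifier sees the predictions made before it, reordering `toy_chain` would generally change the result, which is the order dependence noted above.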
Methods for MLC using undirected graphical models have been proposed in [6,7,8,9]. In [6], a pairwise Markov random field is used for joint prediction of labels. A pairwise CRF is used in [7, 8]: in [7], a tree-structured graph is constructed to identify the set of informative label pairs, whereas in [8], a fully connected graph with pairwise clique potentials is used.
3 Enhancing Multi-label Classification Using Label Dependencies
We propose a two-stage framework for multi-label classification. In the first stage, an MLC method such as Binary Relevance (BR) [1], ML-kNN [10] or an ensemble of classifier chains (ECC) [2] is used. In the second stage, the output of the first-stage MLC is refined by using the dependencies among labels captured by a CRF model.
Let \( \mathcal {D}= \left\{ \left( \mathbf x _{n},\mathbf y _{n} \right) ,1\le n\le N \right\} \) be the multi-label data where \(\mathbf x _{n}\in \mathfrak {R}^{d} \) is the d-dimensional input instance and \(\mathbf y = \left\{ y_{1}, y_{2},...,y_{m}\right\} \) is the m-dimensional desired output vector. Here, m is the number of class labels and \(y_{j}\in \left\{ 0,1 \right\} \). MLC deals with learning the mapping \(h : \mathfrak {R}^{d}\rightarrow \left\{ 0,1 \right\} ^{\textit{m}}\).
In the BR method for MLC, the multi-label dataset is transformed into m binary classification datasets. In the \(j^{th}\) dataset, the instances are considered as the positive instances if they belong to the \(j^{th}\) class, otherwise they are considered as the negative instances. Any SLC method can be used to build each of the m classifiers. Prediction for a test instance is obtained from the outputs of the m classifiers. The ML-kNN method is an algorithm adaptation method based on the k-nearest neighbour (kNN) classification for SLC. For a given test instance, the ML-kNN first identifies its k-nearest neighbours. Then the prediction is obtained using the Bayes rule based on the statistical information obtained from the neighbours.
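The binary relevance transformation described above can be sketched in a few lines of Python. The function name `binary_relevance_datasets` and the toy data are hypothetical; any SLC learner could then be trained on each of the returned datasets.

```python
from typing import List, Sequence, Tuple


def binary_relevance_datasets(
    X: Sequence[Sequence[float]], Y: Sequence[Sequence[int]]
) -> List[List[Tuple[Sequence[float], int]]]:
    """Return m binary datasets; dataset j pairs every instance with its j-th label."""
    m = len(Y[0])
    return [[(x, y[j]) for x, y in zip(X, Y)] for j in range(m)]


# Toy multi-label data: 3 instances, 2 features, m = 3 labels.
X = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]
Y = [[1, 0, 1], [0, 1, 1], [0, 0, 1]]

datasets = binary_relevance_datasets(X, Y)
# Targets of the second binary problem: positive only for the second instance.
targets_label_2 = [target for _, target in datasets[1]]
```

Each binary dataset reuses the same feature vectors and differs only in its targets, which is why BR cannot represent dependencies among labels.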
Let \(\mathbf s = \left\{ s_{1}, s_{2},...,s_{m}\right\} \) be the set of confidence scores obtained from the first stage where \( s_{j} \in [0,1]\) is the output of the classifier corresponding to the \(j^{th}\) class for a given instance \(\mathbf x \).
3.1 Conditional Random Field
A conditional random field (CRF) [11] is a discriminative undirected probabilistic graphical model that directly models the conditional probability distribution \(p(\mathbf y | \mathbf s )\), where y is the set of output variables and s is the set of observed input variables, as shown in Fig. 1. In the proposed method, the set of confidence scores \(\mathbf s \) obtained from the first stage is used as the input to the CRF. The graph associated with the CRF encodes the dependencies among the output variables. An edge between two nodes in the graph indicates that the corresponding variables are dependent on each other. The conditional probability distribution \(p(\mathbf y | \mathbf s )\) is given by the normalized product of clique potentials.
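We use a CRF with pairwise potentials to model the dependencies among the labels \( \mathbf y \) using the output \(\mathbf s \) from the first stage. Let \(G = \left( V,E \right) \) be the graph associated with the CRF. The nodes V of the graph represent the class variables and the edges E represent the dependence relationships among the class variables. The conditional distribution \(p(\mathbf y | \mathbf s )\) is given by
\[ p(\mathbf y | \mathbf s ) = \frac{1}{Z\left( \mathbf s \right) } \prod _{i\in V} \varPhi _{i}\left( y_{i},\mathbf s \right) \prod _{(i,j)\in E} \psi _{ij}\left( y_{i},y_{j},\mathbf s \right) \tag{1} \]
where \(\varPhi _{i}\) is the node potential associated with the \(i^{th}\) node and \(\psi _{ij}\) is the edge potential associated with the \(\left( i,j \right) \) edge. The normalization constant \(Z\left( \mathbf s \right) \), also known as the partition function, is given by
\[ Z\left( \mathbf s \right) = \sum _{\mathbf y '} \prod _{i\in V} \varPhi _{i}\left( y'_{i},\mathbf s \right) \prod _{(i,j)\in E} \psi _{ij}\left( y'_{i},y'_{j},\mathbf s \right) \]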
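For the binary variable \(y_{i}\in \left\{ 0,1 \right\} \), the node potential \(\varPhi _{i}\) for different assignments of \(y_{i}\) is given by
\[ \varPhi _{i}\left( y_{i},\mathbf s \right) = \begin{cases} \exp \left( v_{i}^{0} f_{i}\left( \mathbf s \right) \right) , & y_{i} = 0 \\ \exp \left( v_{i}^{1} f_{i}\left( \mathbf s \right) \right) , & y_{i} = 1 \end{cases} \]
where \(v_{i}^{0}\) and \( v_{i}^{1} \) are the node parameters corresponding to the state \(y_{i} = 0\) and \(y_{i} = 1\) respectively, and \(f_{i}\left( \mathbf s \right) = s_{i}\) is the node feature.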
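Similarly, the edge potential \(\psi _{ij}\) for the different assignments \((y_{i},y_{j}) \in \left\{ 00,01,10,11\right\} \) of the edge \((i,j)\) is defined by
\[ \psi _{ij}\left( y_{i} = a, y_{j} = b,\mathbf s \right) = \exp \left( \left( \mathbf w _{ij}^{a,b}\right) ^{T} \mathbf f _{ij}\left( \mathbf s \right) \right) , \quad a,b\in \left\{ 0,1 \right\} \]
where \(\mathbf f _{ij}\left( \mathbf s \right) = \left[ s_{i}, s_{j} \right] ^{T}\) are the edge features and \((\mathbf w _{ij}^{0,0}, \mathbf w _{ij}^{0,1}, \mathbf w _{ij}^{1,0}, \mathbf w _{ij}^{1,1}) \) are the edge parameters.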
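As an illustration of the normalized product of node and edge potentials, the following Python sketch computes \(p(\mathbf y | \mathbf s )\) by enumerating all labelings of a very small graph. It assumes exponential potentials of the form \(\exp (v_{i}^{y_{i}} s_{i})\) and \(\exp ((\mathbf w _{ij}^{y_{i},y_{j}})^{T} [s_{i}, s_{j}]^{T})\); the function names and the toy parameter values are hypothetical, and exhaustive enumeration is only feasible for a handful of labels.

```python
import itertools
import math


def log_score(y, s, v, w, edges):
    """Unnormalized log-probability: sum of log node and log edge potentials."""
    total = 0.0
    for i, yi in enumerate(y):
        total += v[i][yi] * s[i]                 # log node potential: v_i^{y_i} * s_i
    for (i, j) in edges:
        w_ij = w[(i, j)][(y[i], y[j])]           # parameters for this edge state
        total += w_ij[0] * s[i] + w_ij[1] * s[j]  # log edge potential: w^T [s_i, s_j]
    return total


def conditional_distribution(s, v, w, edges):
    """Enumerate all 2^m labelings and normalize by the partition function."""
    labelings = list(itertools.product([0, 1], repeat=len(s)))
    scores = [log_score(y, s, v, w, edges) for y in labelings]
    log_z = math.log(sum(math.exp(sc) for sc in scores))
    return {y: math.exp(sc - log_z) for y, sc in zip(labelings, scores)}


# Toy example (assumed values): two labels that rarely co-occur.
s = [0.9, 0.6]                                   # first-stage confidence scores
v = [[-1.0, 1.0], [-1.0, 1.0]]                   # v[i][state]
edges = [(0, 1)]
w = {(0, 1): {(0, 0): [0.0, 0.0], (0, 1): [0.0, 0.0],
              (1, 0): [0.0, 0.0], (1, 1): [-2.0, -2.0]}}

dist = conditional_distribution(s, v, w, edges)
most_probable = max(dist, key=dist.get)
```

In this toy setting both first-stage scores are above 0.5, but the edge parameters penalize the joint assignment (1, 1), so the most probable labeling keeps only the label with the higher score.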
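Let \( \varvec{ \theta } = \left[ \mathbf v , \mathbf w \right] \) be the combined parameter vector and let the respective feature functions be combined as \( F\left( \mathbf s ,\mathbf y \right) \). Eq. (1) can now be written succinctly as
\[ p(\mathbf y | \mathbf s ) = \frac{\exp \left( \varvec{ \theta }^{T} F\left( \mathbf s ,\mathbf y \right) \right) }{Z\left( \mathbf s \right) } \]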
3.2 Objective Function
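The objective function for learning the CRF parameters, the negative log likelihood (nll), is given by
\[ \mathrm {nll}\left( \varvec{ \theta } \right) = -\sum _{n=1}^{N} \log p\left( \mathbf y _{n}|\mathbf s _{n} \right) = -\sum _{n=1}^{N} \left[ \varvec{ \theta }^{T} F\left( \mathbf s _{n},\mathbf y _{n} \right) - \log Z\left( \mathbf s _{n} \right) \right] \]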
The gradient of the negative log likelihood [12] is given by
\[ \nabla _{\varvec{ \theta }}\, \mathrm {nll}\left( \varvec{ \theta } \right) = -\sum _{n=1}^{N} \left[ F\left( \mathbf s _{n},\mathbf y _{n} \right) - E_\mathbf{y '}\left[ F\left( \mathbf s _{n},\mathbf y ' \right) \right] \right] \]
where \( E_\mathbf{y '}\left[ F\left( \mathbf s ,\mathbf y ' \right) \right] = \sum _\mathbf{y '} p\left( \mathbf y '|\mathbf s \right) F\left( \mathbf s ,\mathbf y ' \right) \) are the expectations of the feature functions under the model. To find these expectations, we have to run an inference algorithm to compute the model distribution \(p\left( \mathbf y '|\mathbf s \right) \) for all values of \(\mathbf y '\). This makes computing the gradient very expensive. Two main solutions to address this issue are: (a) use an approximate inference algorithm such as loopy belief propagation and (b) use a surrogate objective function such as the pseudo-likelihood. We consider the second approach. The negative log pseudo-likelihood (nlpl) for a CRF is given by
\[ \mathrm {nlpl}\left( \varvec{ \theta } \right) = -\sum _{n=1}^{N} \sum _{i=1}^{m} \log p\left( y_{i,n}|\mathbf {y}_{\mathcal {N}_{i},n},\mathbf s _{n} \right) \]
where \(\mathbf {y}_{\mathcal {N}_{i},n}\) is the set of labels of the neighbours \(\mathcal {N}_{i}\) of the \(i^{th}\) node for the \(n^{th}\) instance. The negative log pseudo-likelihood is a convex function of the parameters \(\varvec{ \theta }\) and is known to be a consistent estimator, i.e., it returns the same set of parameters as the maximum likelihood estimate for \(\varvec{ \theta }\) when the number of instances goes to infinity [15].
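Using the concise notation,
\[ \mathrm {nlpl}\left( \varvec{ \theta } \right) = -\sum _{n=1}^{N} \sum _{i=1}^{m} \left[ \varvec{ \theta }_{i}^{T} F_{i}\left( \mathbf s _{n}, y_{i,n}, \mathbf {y}_{\mathcal {N}_{i},n} \right) - \log Z_{i}\left( \mathbf s _{n}, \mathbf {y}_{\mathcal {N}_{i},n} \right) \right] \]
where \(\varvec{ \theta }_{i} = \left( \mathbf {v}_{i}, \left\{ \mathbf {w}_{ij} \right\} _{j\in \mathcal {N}_{i}} \right) \) are the parameters corresponding to the \(i^{th}\) node and its neighbours, \(Z_{i}\) is the local partition function, and \(F_{i}\) is the local feature vector. The local partition function \(Z_{i}\) can be computed by summing only over the values of \(y_{i}\).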
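The following Python sketch shows why the pseudo-likelihood is cheap to evaluate: each local partition function is a sum over only the two states of \(y_{i}\), with the neighbouring labels fixed to their observed values. It assumes the same exponential node and edge potentials as in Sect. 3.1; the function names and the toy data are hypothetical.

```python
import math


def local_log_score(i, state, y, s, v, w, edges):
    """Log of the factors involving node i when y_i = state and neighbours are fixed."""
    total = v[i][state] * s[i]
    for (a, b) in edges:
        if i not in (a, b):
            continue
        ya = state if a == i else y[a]
        yb = state if b == i else y[b]
        w_ab = w[(a, b)][(ya, yb)]
        total += w_ab[0] * s[a] + w_ab[1] * s[b]
    return total


def neg_log_pseudo_likelihood(data, v, w, edges):
    """Sum over instances and nodes of -log p(y_i | observed neighbours, s)."""
    nlpl = 0.0
    for s, y in data:
        for i in range(len(y)):
            scores = [local_log_score(i, st, y, s, v, w, edges) for st in (0, 1)]
            log_z_i = math.log(math.exp(scores[0]) + math.exp(scores[1]))  # local Z_i
            nlpl -= scores[y[i]] - log_z_i
    return nlpl


# Toy data (assumed values): (first-stage scores, true labels) for two instances.
data = [([0.9, 0.8], (1, 1)), ([0.7, 0.6], (1, 1))]
edges = [(0, 1)]
zero_v = [[0.0, 0.0], [0.0, 0.0]]
zero_w = {(0, 1): {(a, b): [0.0, 0.0] for a in (0, 1) for b in (0, 1)}}
fitted_v = [[-1.0, 1.0], [-1.0, 1.0]]

baseline = neg_log_pseudo_likelihood(data, zero_v, zero_w, edges)
better = neg_log_pseudo_likelihood(data, fitted_v, zero_w, edges)
```

With all parameters at zero every local conditional equals 0.5, so `baseline` equals the number of label terms times log 2; node parameters that favour the observed labels give a lower value.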
3.3 CRF Structure and Parameter Learning
The structure of a CRF can be learnt by minimizing the negative log pseudo-likelihood function with \(L_{1}\) regularization [13]. The \(L_{1}\) norm based regularization is known to give a sparse solution. We impose \(L_{1}\) regularization on each group of parameters associated with an edge in the graph [14]. This causes group sparsity in the edge weight parameters, where all parameters associated with a specific edge go to zero simultaneously. Using an \(L_{2}\) regularizer for the node parameters, the regularization term \(R(\varvec{ \theta } )\) can be written as
\[ R\left( \varvec{ \theta } \right) = \lambda _{1} \left\| \mathbf v \right\| _{2}^{2} + \lambda _{2} \sum _{b} \left\| \mathbf w _{b} \right\| _{2} \]
where \(\mathbf w _{b} = (\mathbf w _{ij}^{0,0}, \mathbf w _{ij}^{0,1}, \mathbf w _{ij}^{1,0}, \mathbf w _{ij}^{1,1}) \) is the set of weight parameters for the different configurations of the edge \(b = (i,j)\), and \(\lambda _{1}\) and \(\lambda _{2}\) control the strength of the node and edge regularization, respectively. Parameters of the CRF are found by minimizing the regularized loss function as given below
\[ \min _{\varvec{ \theta }} \; \mathrm {nlpl}\left( \varvec{ \theta } \right) + R\left( \varvec{ \theta } \right) \]
We use the projected quasi-Newton method [16] to solve the above optimization problem. The structure of the CRF then corresponds to all edges in the graph that have non-zero weight parameters. After fixing the structure of the CRF, \(L_{2}\) norm regularization is used over the edge parameters, and the limited-memory BFGS method [17] is used to further fine-tune the model's parameters for the given structure. After training the model, loopy belief propagation is used to obtain the final predictions.
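A minimal Python sketch of the group-sparse regularizer and of reading the structure off the learnt edge parameters is given below. It assumes a squared \(L_{2}\) penalty on the node parameters and an unsquared \(L_{2}\) norm per edge group; the names `regularizer`, `learned_structure`, `lam_node` and `lam_edge` and all numeric values are hypothetical, and the optimization itself (projected quasi-Newton, L-BFGS) is not shown.

```python
import math


def group_norm(w_b):
    """L2 norm of all parameters belonging to one edge (all four configurations)."""
    return math.sqrt(sum(p * p for vec in w_b.values() for p in vec))


def regularizer(v, w, lam_node, lam_edge):
    """Squared L2 penalty on node parameters plus a group penalty per edge."""
    node_term = sum(p * p for params in v for p in params)
    edge_term = sum(group_norm(w_b) for w_b in w.values())
    return lam_node * node_term + lam_edge * edge_term


def learned_structure(w, tol=1e-8):
    """Keep an edge only if its whole parameter group is non-zero."""
    return [edge for edge, w_b in w.items() if group_norm(w_b) > tol]


def zero_group():
    return {(a, b): [0.0, 0.0] for a in (0, 1) for b in (0, 1)}


# Toy parameters (assumed values) on a fully connected graph over three labels.
w = {(0, 1): zero_group(), (0, 2): zero_group(), (1, 2): zero_group()}
w[(0, 1)][(0, 0)] = [0.3, 0.4]
w[(1, 2)][(1, 1)] = [0.0, -1.0]
v = [[1.0, 0.0], [0.0, 0.0], [0.0, 2.0]]

penalty = regularizer(v, w, lam_node=0.1, lam_edge=2.0)
structure = learned_structure(w)
```

Because the edge penalty acts on the norm of a whole group, an edge such as (0, 2) whose parameters are all zero drops out of the structure entirely, while edges with any non-zero parameter are retained.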
4 Experiments
We performed the experiments on the following multi-label datasets from Mulan [18]: Emotion, Enron, Medical, Scene and Yeast.
The evaluation metrics used to compare the various methods are: Accuracy, Subset-accuracy (exact match) and Hamming loss [1].
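For reference, the three metrics can be sketched in Python as follows, using the example-based definitions commonly given for multi-label evaluation (accuracy as the size of the intersection over the size of the union of the true and predicted label sets, averaged over instances). The function names are hypothetical, and scoring an instance with empty true and predicted label sets as 1 is an assumed convention.

```python
def hamming_loss(Y_true, Y_pred):
    """Fraction of instance-label pairs that are predicted incorrectly."""
    n, m = len(Y_true), len(Y_true[0])
    wrong = sum(t != p for yt, yp in zip(Y_true, Y_pred) for t, p in zip(yt, yp))
    return wrong / (n * m)


def subset_accuracy(Y_true, Y_pred):
    """Fraction of instances whose predicted label vector matches exactly."""
    exact = sum(list(yt) == list(yp) for yt, yp in zip(Y_true, Y_pred))
    return exact / len(Y_true)


def accuracy(Y_true, Y_pred):
    """Mean over instances of |true AND predicted| / |true OR predicted|."""
    total = 0.0
    for yt, yp in zip(Y_true, Y_pred):
        inter = sum(1 for t, p in zip(yt, yp) if t == 1 and p == 1)
        union = sum(1 for t, p in zip(yt, yp) if t == 1 or p == 1)
        total += inter / union if union else 1.0  # assumed convention for empty sets
    return total / len(Y_true)


# Toy example: two instances, three labels.
Y_true = [[1, 0, 1], [0, 1, 0]]
Y_pred = [[1, 0, 0], [0, 1, 0]]

hl = hamming_loss(Y_true, Y_pred)
sa = subset_accuracy(Y_true, Y_pred)
acc = accuracy(Y_true, Y_pred)
```

Hamming loss is lower-is-better, whereas accuracy and subset accuracy are higher-is-better; subset accuracy is the strictest of the three because a single wrong label makes the whole instance count as an error.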
We compared the performance of the proposed method with that of different existing methods for MLC. The BR, ML-kNN and ECC based MLC methods are used in the first stage. Logistic regression with \(L_{2}\) regularization is used as the base classifier for the BR method, and SVMs are used as the base classifiers for ECC. For ML-kNN, we used the code released on the internet by the author. We used the UGM toolbox [19] for the CRF implementation. Other MLC methods were implemented using MEKA. All hyper-parameters are tuned using cross-validation.
The performance of the proposed two-stage method using different MLC methods in the first stage is presented in Table 1. For all three MLC methods, the CRF-based two-stage method is able to enhance the performance. The improvement is more significant for datasets that have a high correlation among class labels. Table 2 presents the comparison of the proposed method against the other existing methods. The proposed method performs better than all other methods. This shows the effectiveness of capturing label dependencies for MLC.
5 Conclusion
In this paper, we proposed a two-stage framework for multi-label classification using the conditional random field. It captures the dependencies among labels to improve the MLC performance. An optimization-based framework is used for learning the structure of the CRF. Experimental results show the effectiveness of the proposed method on benchmark multi-label datasets.
References
1. Zhang, M.-L., Zhou, Z.-H.: A review on multi-label learning algorithms. IEEE Trans. Knowl. Data Eng. 26(8), 1819–1837 (2014)
2. Read, J., Pfahringer, B., Holmes, G., Frank, E.: Classifier chains for multi-label classification. Mach. Learn. 85(3), 333–359 (2011)
3. Zhang, M.-L., Zhang, K.: Multi-label learning by exploiting label dependency. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 999–1008. ACM (2010)
4. Guo, Y., Gu, S.: Multi-label classification using conditional dependency networks. In: IJCAI Proceedings-International Joint Conference on Artificial Intelligence, vol. 22, p. 1300 (2011)
5. Godbole, S., Sarawagi, S.: Discriminative methods for multi-labeled classification. In: Dai, H., Srikant, R., Zhang, C. (eds.) PAKDD 2004. LNCS, vol. 3056, pp. 22–30. Springer, Heidelberg (2004). doi:10.1007/978-3-540-24775-3_5
6. Arias, J., Gamez, J.A., Nielsen, T.D., Puerta, J.M.: A scalable pairwise class interaction framework for multidimensional classification. Int. J. Approximate Reasoning 68, 194–210 (2016)
7. Li, X., Zhao, F., Guo, Y.: Multi-label image classification with a probabilistic label enhancement model. In: Proceedings of Uncertainty in Artificial Intelligence (2014)
8. Ghamrawi, N., McCallum, A.: Collective multi-label classification. In: Proceedings of the 14th ACM International Conference on Information and Knowledge Management, pp. 195–200. ACM (2005)
9. Naeini, M.P., Batal, I., Liu, Z., Hong, C., Hauskrecht, M.: An optimization-based framework to learn conditional random fields for multi-label classification. In: Proceedings of the 2014 SIAM International Conference on Data Mining, pp. 992–1000. Society for Industrial and Applied Mathematics (2014)
10. Zhang, M.-L., Zhou, Z.-H.: ML-KNN: a lazy learning approach to multi-label learning. Pattern Recogn. 40(7), 2038–2048 (2007)
11. Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the Eighteenth International Conference on Machine Learning, ICML, vol. 1, pp. 282–289 (2001)
12. Murphy, K.P.: Machine Learning: A Probabilistic Perspective. MIT Press, Cambridge (2012)
13. Schmidt, M.W., et al.: Structure learning in random fields for heart motion abnormality detection. In: CVPR, vol. 1(1) (2008)
14. Yuan, M., Lin, Y.: Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 68(1), 49–67 (2006)
15. Besag, J.: Efficiency of pseudolikelihood estimation for simple Gaussian fields. Biometrika 64(3), 616–618 (1977)
16. Schmidt, M.W., Van Den Berg, E., Friedlander, M.P., Murphy, K.P.: Optimizing costly functions with simple constraints: a limited-memory projected quasi-Newton algorithm. In: AISTATS, vol. 5 (2009)
17. Liu, D.C., Nocedal, J.: On the limited memory BFGS method for large scale optimization. Math. Prog. 45(1), 503–528 (1989)
18. Tsoumakas, G., Spyromitros-Xioufis, E., Vilcek, J., Vlahavas, I.: Mulan: a Java library for multi-label learning. J. Mach. Learn. Res. 12, 2411–2414 (2011)
19. Schmidt, M.: UGM: a Matlab toolbox for probabilistic undirected graphical models (2007). http://www.cs.ubc.ca/~schmidtm/Software/UGM.html
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Singh, A.K., Chandra Sekhar, C. (2017). A Two-Stage Conditional Random Field Model Based Framework for Multi-Label Classification. In: Shankar, B., Ghosh, K., Mandal, D., Ray, S., Zhang, D., Pal, S. (eds) Pattern Recognition and Machine Intelligence. PReMI 2017. Lecture Notes in Computer Science(), vol 10597. Springer, Cham. https://doi.org/10.1007/978-3-319-69900-4_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-69899-1
Online ISBN: 978-3-319-69900-4