
1 Introduction

In the single-label classification (SLC) problem, each data instance is assigned to one class out of two or more classes. However, in many real-world tasks, an object can have multiple labels. For example, a news article may cover multiple topics, an image may be annotated with multiple tags and a medical diagnosis may lead to multiple diseases. Multi-label classification (MLC) [1] deals with the task of assigning each such instance to all of its relevant classes.

Traditional methods for MLC either transform the MLC problem into several SLC problems (problem transformation methods) or adapt an SLC method to multi-label datasets (algorithm adaptation methods). These methods assume label independence and may give inconsistent outputs. For example, an instance may be assigned to two mutually exclusive labels. A method that can correct such inconsistencies by exploiting the dependencies among labels is likely to give improved performance.

We present a framework based on the conditional random field (CRF) that tries to correct the erroneous output from a multi-label classifier by using the dependencies among labels. Results of our studies show that capturing dependencies among the class labels significantly improves the performance of MLC.

The rest of the paper is organised as follows. In Sect. 2, we present a brief review of methods for using the label dependencies in MLC. Section 3 presents the proposed framework that uses the CRF to capture the label dependencies and then uses them to correct the errors in the output of an MLC model. In Sect. 4, we present our experimental studies and results.

2 Approaches to Capture Label Correlations

Capturing label correlations and using them for multi-label learning is important for MLC. We review some of the methods for capturing the correlations among labels.

Classifier chain [2] is based on the chain rule decomposition of the joint probability distribution, where each factor in the chain decomposition is realized using a binary classifier. The input to a classifier in the chain is augmented with the outputs of the previous binary classifiers in the chain. The limitation of this method is that its performance depends on the chain order. The ensemble of classifier chains [2] mitigates this dependence on the chain order by averaging the predictions obtained using different chain orders. A Bayesian network is used in [3] to learn the relationships among the labels. The classifier chain method is then applied with the topological ordering of labels in the Bayesian network as the chain order, and the feature vector is augmented with the output of the parent class classifier. In [4], a cyclic directed graphical model is used to capture the relationships among labels. The model is built by learning a binary classifier for each label given all other labels and the input features, and Gibbs sampling is then used for inference. In [5], a two-stage binary relevance method is used, in which the input to the second stage of binary classifiers is augmented with the outputs of the binary classifiers in the first stage.
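To make the chaining idea concrete, the following is a minimal sketch of a single classifier chain built with scikit-learn; the toy data, the base classifier and the random chain order are illustrative assumptions, not the setup used in [2]. An ensemble of chains would simply average the predictions of several such chains trained with different random orders.

```python
# A minimal classifier chain sketch (illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))                  # toy feature matrix
Y = (rng.random((200, 4)) > 0.5).astype(int)    # toy N x m binary label matrix

# Each classifier in the chain sees the original features augmented with the
# predictions of the classifiers that come earlier in the (random) chain order.
chain = ClassifierChain(LogisticRegression(max_iter=1000),
                        order="random", random_state=0)
chain.fit(X, Y)
Y_pred = chain.predict(X)                       # N x m binary predictions
```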

Methods for MLC using undirected graphical models have been proposed in [6,7,8,9]. In [6], a pairwise Markov random field is used for joint prediction of labels. Similarly, a pairwise CRF is used in [7, 8]: in [7], a tree-structured graph is constructed to identify the set of informative label pairs, whereas in [8] a fully connected graph with pairwise clique potentials is used.

3 Enhancing Multi-label Classification Using Label Dependencies

We propose a two-stage framework for multi-label classification. In the first stage, one of the MLC classifiers such as Binary Relevance (BR) [1], ML-kNN [10] or an ensemble of classifier chains (ECC) is used. In the second stage, the output of the MLC in the first stage is refined by using the dependencies among labels captured by a CRF model.

Let \( \mathcal {D}= \left\{ \left( \mathbf x _{n},\mathbf y _{n} \right) ,1\le n\le N \right\} \) be the multi-label data, where \(\mathbf x _{n}\in \mathfrak {R}^{d} \) is the d-dimensional input instance and \(\mathbf y _{n} = \left\{ y_{1}, y_{2},\ldots ,y_{m}\right\} \) is the m-dimensional desired output vector. Here, m is the number of class labels and \(y_{j}\in \left\{ 0,1 \right\} \). MLC deals with learning the mapping \(h : \mathfrak {R}^{d}\rightarrow \left\{ 0,1 \right\} ^{\textit{m}}\).

In the BR method for MLC, the multi-label dataset is transformed into m binary classification datasets. In the \(j^{th}\) dataset, the instances that belong to the \(j^{th}\) class are treated as positive instances and the remaining instances as negative instances. Any SLC method can be used to build each of the m classifiers, and the prediction for a test instance is obtained from the outputs of the m classifiers. The ML-kNN method is an algorithm adaptation method based on the k-nearest neighbour (kNN) classification for SLC. For a given test instance, ML-kNN first identifies its k-nearest neighbours. The prediction is then obtained using the Bayes rule based on the statistical information obtained from these neighbours.

Let \(\mathbf s = \left\{ s_{1}, s_{2},...,s_{m}\right\} \) be the set of confidence scores obtained from the first stage where \( s_{j} \in [0,1]\) is the output of the classifier corresponding to the \(j^{th}\) class for a given instance \(\mathbf x \).
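As an illustration of the first stage, the following is a minimal sketch of a BR classifier built from \(L_{2}\)-regularized logistic regression that returns the confidence scores \(\mathbf s \) for each test instance; the function name and interface are our own and only indicate how the scores feeding the CRF could be produced.

```python
# A minimal sketch of a first-stage BR model producing confidence scores s.
import numpy as np
from sklearn.linear_model import LogisticRegression

def first_stage_scores(X_train, Y_train, X_test):
    """Return an (n_test, m) array of scores s_j in [0, 1], one per label."""
    m = Y_train.shape[1]
    scores = np.zeros((X_test.shape[0], m))
    for j in range(m):
        # One L2-regularized binary classifier per label (binary relevance).
        clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
        clf.fit(X_train, Y_train[:, j])
        scores[:, j] = clf.predict_proba(X_test)[:, 1]   # P(y_j = 1 | x)
    return scores
```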

3.1 Conditional Random Field

Conditional Random Field (CRF) [11] is a discriminative undirected probabilistic graphical model that directly models the conditional probability distribution \(p(\mathbf y | \mathbf s )\), where \(\mathbf y \) is the set of output variables and \(\mathbf s \) is the set of observed input variables, as shown in Fig. 1. In the proposed method, the set of confidence scores \(\mathbf s \) obtained from the first stage is used as the input to the CRF. The graph associated with the CRF encodes the dependencies among the output variables: an edge between two nodes indicates that the corresponding variables are dependent on each other. The conditional probability distribution \(p(\mathbf y | \mathbf s )\) is given by the normalized product of clique potentials.

Fig. 1. A factor graph representation of the proposed CRF based model. The unshaded circles represent the class variables \(\mathbf y \), the shaded circles represent the input variables \(\mathbf s \), the edges among the class nodes represent the dependencies among class variables, and the solid blocks represent the factors associated with those variables.

We use a CRF with pairwise potentials to model the dependencies among the labels \( \mathbf y \) using the output \(\mathbf s \) from the first stage. Let \(G = \left( V,E \right) \) be the graph associated with the CRF. The nodes V of the graph represent the class variables and the edges E represent the dependence relationships among the class variables. The conditional distribution \(p(\mathbf y | \mathbf s )\) is given by

$$\begin{aligned} p(\mathbf y | \mathbf s ) = \frac{1}{Z\left( \mathbf s \right) }\prod _{i \in V}\varPhi _{i}\left( y_{i}, \mathbf s \right) \prod _{\left( i,j \right) \in E}\psi _{ij}\left( y_{i},y_{j}, \mathbf s \right) \end{aligned}$$
(1)

where \(\varPhi _{i}\) is the node potential associated with the \(i^{th}\) node and \(\psi _{ij}\) is the edge potential associated with the edge \(\left( i,j \right) \). The normalization constant \(Z\left( \mathbf s \right) \), also known as the partition function, is given by

$$\begin{aligned} Z\left( \mathbf s \right) = \sum _{ \mathbf y } \left[ \prod _{i \in V}\varPhi _{i}\left( y_{i}, \mathbf s \right) \prod _{\left( i,j \right) \in E}\psi _{ij}\left( y_{i},y_{j}, \mathbf s \right) \right] \end{aligned}$$
(2)

For the binary variable \(y_{i}\in \left\{ 0,1 \right\} \), the node potential \(\varPhi _{i}\) for different assignments of \(y_{i}\) is given by

$$\begin{aligned} \varPhi _{i}\left( y_{i},\mathbf s \right) = \left( e^{f_{i}\left( \mathbf s \right) v_{i}^{0}}, e^{f_{i}\left( \mathbf s \right) v_{i}^{1}} \right) \end{aligned}$$
(3)

where \(v_{i}^{0}\) and \( v_{i}^{1} \) are the node parameters corresponding to the state \(y_{i} = 0\) and \(y_{i} = 1\) respectively, and \(f_{i}\left( \mathbf s \right) = s_{i}\) is the node feature.

Similarly, the edge potential \(\psi _{ij}\) for the different assignments \(\left( y_{i},y_{j} \right) \in \left\{ 0,1 \right\} ^{2}\) of the edge \((i,j)\) is defined by

$$\begin{aligned} \psi _{ij}\left( y_{i},y_{j},\mathbf s \right) = \begin{pmatrix} e^{\mathbf f _{ij}\left( \mathbf s \right) \mathbf w _{ij}^{0,0}} &amp; e^{\mathbf f _{ij}\left( \mathbf s \right) \mathbf w _{ij}^{0,1}}\\ e^{\mathbf f _{ij}\left( \mathbf s \right) \mathbf w _{ij}^{1,0}} &amp; e^{\mathbf f _{ij}\left( \mathbf s \right) \mathbf w _{ij}^{1,1}} \end{pmatrix} \end{aligned}$$
(4)

where \(\mathbf f _{ij}\left( \mathbf s \right) = \left[ s_{i}, s_{j} \right] ^{T}\) are the edge features and \((\mathbf w _{ij}^{0,0}, \mathbf w _{ij}^{0,1}, \mathbf w _{ij}^{1,0}, \mathbf w _{ij}^{1,1}) \) are the edge parameters.

Let \( \varvec{ \theta } = \left[ \mathbf v , \mathbf w \right] \) be the combined parameter vector and let the corresponding feature functions be combined as \( F\left( \mathbf s ,\mathbf y \right) \). Eq. (1) can now be written succinctly as

$$\begin{aligned} p\left( \mathbf y |\mathbf s \right) = \frac{1}{Z\left( \varvec{ \theta } , \mathbf s \right) } exp\left( \varvec{ \theta } ^{T}F\left( \mathbf s ,\mathbf y \right) \right) \end{aligned}$$
(5)
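For a small number of labels m, the distribution defined by Eqs. (1)-(4) can be evaluated exactly by enumerating all \(2^{m}\) label vectors. The sketch below does this purely for illustration; the array layouts chosen for the parameters v and w are assumptions, and exhaustive enumeration is infeasible for larger label sets, which motivates the pseudo-likelihood and approximate inference used in the following subsections.

```python
# A brute-force evaluation of p(y | s) for a pairwise CRF (small m only).
import itertools
import numpy as np

def joint_distribution(s, edges, v, w):
    """
    s     : (m,) first-stage confidence scores
    edges : list of (i, j) pairs forming the CRF graph
    v     : (m, 2) node parameters, v[i, a] = v_i^a               (Eq. 3)
    w     : dict (i, j) -> (2, 2, 2) edge parameters,
            w[(i, j)][a, b] is the weight vector w_ij^{a,b}       (Eq. 4)
    Returns a dict mapping each label vector y to p(y | s).
    """
    m = len(s)
    unnorm = {}
    for y in itertools.product([0, 1], repeat=m):
        log_phi = sum(s[i] * v[i, y[i]] for i in range(m))              # node terms
        log_psi = sum(np.dot([s[i], s[j]], w[(i, j)][y[i], y[j]])       # edge terms
                      for (i, j) in edges)
        unnorm[y] = np.exp(log_phi + log_psi)
    Z = sum(unnorm.values())                                            # Eq. (2)
    return {y: val / Z for y, val in unnorm.items()}
```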

3.2 Objective Function

The objective function for learning the CRF parameters, the negative log likelihood (nll), is given by

$$\begin{aligned} nll\left( \varvec{ \theta } \right) = -\sum _{n=1}^{N} log \ p\left( \mathbf y _{n}|\mathbf s _{n} \right) = - \sum _{n=1}^{N}\left[ \varvec{ \theta } ^{T}F\left( \mathbf s _{n},\mathbf y _{n} \right) -log \ Z\left( \varvec{ \theta } , \mathbf s _{n} \right) \right] \end{aligned}$$
(6)

The gradient of the negative log likelihood [12] is given by

$$\begin{aligned} \nabla nll\left( \varvec{ \theta } \right) = - \sum _{n=1}^{N}\left[ F\left( \mathbf s _{n},\mathbf y _{n} \right) - E_\mathbf{y '}\left[ F\left( \mathbf s ,\mathbf y ' \right) \right] \right] \end{aligned}$$
(7)

where \( E_\mathbf{y '}\left[ F\left( \mathbf s ,\mathbf y ' \right) \right] = \sum _\mathbf{y '} p\left( \mathbf y '|\mathbf s \right) F\left( \mathbf s ,\mathbf y ' \right) \) is the expectation of the feature functions under the model. To compute this expectation, we have to run an inference algorithm to obtain the model distribution \(p\left( \mathbf y '|\mathbf s \right) \) over all values of \(\mathbf y '\), which makes computing the gradient very expensive. Two main solutions to address this issue are: (a) using an approximate inference algorithm such as loopy belief propagation and (b) using a surrogate objective function such as the pseudo-likelihood. We consider the second approach, which uses the pseudo-likelihood. The negative log pseudo-likelihood (nlpl) for a CRF is given by

$$\begin{aligned} nlpl\left( \varvec{ \varvec{ \theta }} \right) = -\sum _{n = 1}^{N} log \ PL\left( \mathbf {y}_{n}|\mathbf s _{n} \right) =-\sum _{n = 1}^{N} \sum _{i\in V}log \ p\left( y_{i,n}|\mathbf {y}_{\mathcal {N}_{i},n}, \mathbf s _{n};\varvec{\varvec{ \theta } } \right) \end{aligned}$$
(8)

where \(\mathbf {y}_{\mathcal {N}_{i},n}\) is the set of values of the neighbours \(\mathcal {N}_{i}\) of the \(i^{th}\) node for the \(n^{th}\) instance. The negative log pseudo-likelihood is a convex function of the parameters \(\varvec{ \theta }\) and is known to be a consistent estimator, i.e., it recovers the same parameters as the maximum likelihood estimate of \(\varvec{ \theta }\) when the number of instances goes to infinity [15].

In concise notation, the local conditional distribution can be written as

$$\begin{aligned} p\left( y_{i}|\mathbf {y}_{\mathcal {N}_{i}}, \mathbf s ;\varvec{\varvec{ \theta } } \right) = \frac{1}{Z_{i}\left( \varvec{ \theta }_{i} , \mathbf s \right) } exp\left( \varvec{ \theta }_{i} ^{T}F_{i}\left( \mathbf s ,\mathbf y \right) \right) \end{aligned}$$
(9)

where \(\varvec{ \theta }_{i} = \left( \mathbf {v}_{i}, \left\{ \mathbf {w}_{ij} \right\} _{j\in \mathcal {N}_{i}} \right) \) are the parameters corresponding to the \(i^{th}\) node and its neighbours, \(Z_{i}\) is the local partition function, and \(F_{i}\) is the local feature vector. The local partition function \(Z_{i}\) can be computed by summing only over the values of \(y_{i}\).
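The sketch below computes the negative log pseudo-likelihood of Eq. (8) for a single instance, reusing the illustrative parameter layout from the earlier sketch; each local conditional only sums over the two states of \(y_{i}\), which is what makes this surrogate objective cheap compared with the full likelihood.

```python
# Negative log pseudo-likelihood of one instance for the pairwise CRF (sketch).
import numpy as np

def nlpl_single(y, s, edges, v, w):
    """y: (m,) observed binary labels; s, edges, v, w as in the earlier sketch."""
    m = len(s)
    neighbours = {i: [] for i in range(m)}
    for (i, j) in edges:
        neighbours[i].append(j)
        neighbours[j].append(i)

    def local_score(i, yi):
        # Node term plus the edge terms involving node i, with the
        # neighbouring labels fixed to their observed values.
        score = s[i] * v[i, yi]
        for j in neighbours[i]:
            a, b = ((i, j) if (i, j) in w else (j, i))
            ya, yb = ((yi, y[j]) if a == i else (y[j], yi))
            score += np.dot([s[a], s[b]], w[(a, b)][ya, yb])
        return score

    nlpl = 0.0
    for i in range(m):
        s0, s1 = local_score(i, 0), local_score(i, 1)
        log_Zi = np.logaddexp(s0, s1)            # local partition function Z_i
        nlpl -= (s1 if y[i] == 1 else s0) - log_Zi
    return nlpl
```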

3.3 CRF Structure and Parameter Learning

The structure of the CRF can be learnt by minimizing the negative log pseudo-likelihood function with \(L_{1}\) regularization [13]. The \(L_{1}\) norm based regularization is known to give a sparse solution. We impose the \(L_{1}\) regularization on each group of parameters associated with an edge in the graph [14]. This induces sparsity in the edge weight parameters, where all parameters associated with a specific edge go to zero simultaneously. Using an \(L_{2}\) regularizer for the node parameters, the regularization term \(R(\varvec{ \theta } )\) can be written as

$$\begin{aligned} R(\varvec{ \theta } ) = \lambda _{1}\left\| \mathbf v \right\| _{2}^{2} + \lambda _{2} \sum _{b \in E}\left\| \mathbf w _{b} \right\| _{2} \end{aligned}$$
(10)

where \(\mathbf w _{b} = (\mathbf w _{ij}^{0,0}, \mathbf w _{ij}^{0,1}, \mathbf w _{ij}^{1,0}, \mathbf w _{ij}^{1,1}) \) is the set of weight parameters for the different configurations of the edge \(b = (i,j)\). The parameters of the CRF are found by minimizing the regularized loss function given below

$$\begin{aligned} \varvec{ \theta } ^{*} = arg min_{\varvec{ \theta } }(nlpl\left( \varvec{ \theta } \right) + R(\varvec{ \theta } ) ) \end{aligned}$$
(11)

We use the projected quasi-Newton method [16] to solve the above optimization problem. The structure of the CRF then corresponds to all edges in the graph that have non-zero weight parameters. After fixing the structure of the CRF, the \(L_{2}\) norm regularization is used over the edge parameters, and the limited-memory BFGS method [17] is used to further fine-tune the model's parameters for the given structure. After training the model, the loopy belief propagation method is used to obtain the final predictions.
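The group penalty in Eq. (10) drives the entire weight block of a weak edge to zero at once, which is how the structure is selected. The sketch below illustrates only this sparsity mechanism with a proximal (group soft-thresholding) gradient step; it is not the projected quasi-Newton procedure [16] or the L-BFGS refinement [17] used in our experiments.

```python
# Illustration of group sparsity on edge weight blocks (sketch only).
import numpy as np

def group_soft_threshold(w_edge, step, lam2):
    """Proximal operator of step * lam2 * ||w_edge||_2 for one edge block."""
    norm = np.linalg.norm(w_edge)
    if norm <= step * lam2:
        return np.zeros_like(w_edge)      # the whole edge is dropped
    return (1.0 - step * lam2 / norm) * w_edge

def proximal_step(w, grads, step, lam2):
    """One proximal gradient step; w and grads map each edge to its weight block."""
    return {e: group_soft_threshold(w[e] - step * grads[e], step, lam2) for e in w}

# After convergence, the learnt structure is the set of edges whose block
# survived the thresholding:
#   structure = [e for e, w_e in w.items() if np.linalg.norm(w_e) > 0]
```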

4 Experiments

We performed the experiments on the following multi-label datasets from Mulan [18]: Emotion, Enron, Medical, Scene and Yeast.

The evaluation metrics used to compare the various methods are: Accuracy, Subset-accuracy (exact match) and Hamming loss [1].
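For completeness, the following is a minimal sketch of the three metrics on binary label matrices of shape (N, m), assuming the standard example-based (Jaccard) definition of Accuracy from [1].

```python
# Example-based multi-label evaluation metrics (sketch).
import numpy as np

def accuracy(Y_true, Y_pred):
    """Mean Jaccard similarity between true and predicted label sets."""
    inter = np.logical_and(Y_true, Y_pred).sum(axis=1)
    union = np.logical_or(Y_true, Y_pred).sum(axis=1)
    return np.mean(np.where(union > 0, inter / np.maximum(union, 1), 1.0))

def subset_accuracy(Y_true, Y_pred):
    """Fraction of instances whose full label vector is predicted exactly."""
    return np.mean(np.all(Y_true == Y_pred, axis=1))

def hamming_loss(Y_true, Y_pred):
    """Fraction of individual label assignments that are wrong (lower is better)."""
    return np.mean(Y_true != Y_pred)
```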

Table 1. Accuracy comparison of different single-stage MLC methods (BR, ML-kNN and ECC) with the proposed two-stage method using the CRF.
Table 2. Performance comparison of the proposed method (\(CRF_{ECC}\)) with other state-of-the-art methods: Collective Multi-Label classification (CML) [8], Meta Binary Relevance (MBR) [5] and Conditional Dependency Network (CDN) [4].

We compared the performance of the proposed method with different existing methods for MLC. The BR, ML-kNN and ECC based MLC methods are used in the first stage. Logistic regression with \(L_{2}\) regularization is used as the base classifier for the BR method, and SVMs are used as the base classifiers for ECC. For ML-kNN, we used the code released by its authors. We used the UGM toolbox [19] for the CRF implementation, and the other MLC methods were implemented using MEKA. All hyper-parameters were tuned using cross-validation.

The performance of the proposed two-stage method using different MLCs in the first stage is presented in Table 1. For all three MLC methods, the CRF based two-stage method is able to enhance the performance. The improvement is more significant on datasets that have a high correlation among class labels. Table 2 presents the comparison of the proposed method against the other existing methods. The proposed method performs better than all the other methods, which shows the effectiveness of capturing label dependencies for MLC.

5 Conclusion

In this paper, we proposed a two-stage framework for multi-label classification using the conditional random field. It captures the dependencies among labels to improve the MLC performance. An optimization-based framework is used for learning the structure of the CRF. Experimental results show the effectiveness of the proposed method on benchmark multi-label datasets.