1 Introduction

In supervised learning based approaches to multi-class pattern classification, a training example represented by a corresponding feature vector is associated with a single class (label) describing its semantics. However, for many real-world objects, the single-label assumption may not be appropriate. In the task of image annotation, an image can have multiple labels, referred to as its relevant label set. Likewise, in the text categorization task, a news article can be associated with a number of topics such as “military”, “business” and “international”. Multi-label learning has been used for a number of applications such as automatic multimedia content annotation [1, 2] and bioinformatics [3, 4]. The multi-label learning task involves building models that can predict the relevant label set for a test example.

The existing techniques for multi-label learning are predominantly supervised learning based approaches. These techniques require a large amount of labeled data for building a classifier. As labeling data is both time-consuming and costly, relying only on labeled data is often impractical. Unlabeled data, on the other hand, is cheap and easily available, and the information in it can be used to build better classifiers. In recent times, semi-supervised learning techniques have been found to be effective in building classifiers.

A number of techniques have been proposed for semi-supervised multi-label learning (SSMLL) [5,6,7]. Most of these methods are transductive in nature and aim at predicting the label set for the existing unlabeled data. It is important to exploit the correlations inherent among the labels to boost the performance of a multi-label classifier. For example, natural scene images commonly contain both “hill” and “tree”.

In [5], a graph-based learning framework that accounts for label consistency over the graph and the correlation among labels is presented. By optimising an objective function, a closed-form solution for predicting the labels of the unlabeled data is obtained. In [6], TRAnsductive Multi-label classification (TRAM) is formulated as an optimization problem of estimating label compositions, and a closed-form solution is obtained. This method is extended to estimate the cardinality of the predicted label set for the unlabeled examples based on the estimated label compositions. In [8], a non-negative matrix factorization algorithm is used to solve the problem, where the basic hypothesis is that “two examples which have high similarity in the input space should have similar label memberships”. All the above methods build a graph with nodes representing the labeled and unlabeled examples, and a similarity measure used as the weight of an edge between two nodes. Graph-based methods form the majority of semi-supervised learning approaches [9] because of their effectiveness and efficiency.

The rest of the paper is organized as follows. We present the graph based methods for semi-supervised learning from multi-label data in Sect. 2. In Sect. 3, we propose two methods to propagate the label correlation in multi-label learning. Studies on benchmark datasets that demonstrate the effectiveness of the proposed methods are presented in Sect. 4, and Sect. 5 concludes the paper.

2 Graph Based Methods for Semi-supervised Learning with Multi-label Data

In semi-supervised multi-label learning, the training set \(\mathcal {D}= \{ (\mathbf{x}_1, Y_1), \ldots , (\mathbf{x}_i, Y_i), \ldots , (\mathbf{x}_L,Y_L), \mathbf{x}_{L+1}, \ldots , \mathbf{x}_{L+j}, \ldots , \mathbf{x}_{L+U} \}\) consists of L labeled examples \(\mathcal {D}_l=\{(\mathbf{x}_1, Y_1), \ldots , (\mathbf{x}_i, Y_i), \ldots , (\mathbf{x}_L,Y_L)\}\) and U unlabeled examples \(\mathcal {D}_u=\{ \mathbf{x}_{L+1}, \ldots , \mathbf{x}_{L+j}, \ldots , \mathbf{x}_{L+U} \}\). The total number of examples is \(N = L + U\). The task involves learning a family of K functions, \(f_k: \mathcal {X} \times \mathcal {Y} \longrightarrow \mathbb {R}\), where \(f_k(\mathbf{x}_i, y_k)\) is a measure of the confidence of the \(k^{th}\) label \(y_k \in \mathcal {Y}\) being a label of \(\mathbf{x}_i\). The label vector \(\mathbf{y}_i = (y_{i1}, \ldots , y_{iK})^T\) is a K-dimensional vector with \(y_{ij} \in \{0, 1\}\), where 0 indicates that the \(j^{th}\) label is not associated with the example \(\mathbf {x}_i\) and 1 indicates that the label \(y_j\) belongs to the label set of \(\mathbf {x}_i\). Let \(\mathbf {W}\) denote the \(N \times N\) weight matrix, where \(w_{ij}\) represents the similarity between \(\mathbf{x}_i\) and \(\mathbf{x}_j\). The matrix \(\mathbf {\varDelta } =\mathbf{D} -\mathbf{W}\) is called the combinatorial graph Laplacian matrix, where \(\mathbf{D}\) is an \(N \times N\) diagonal matrix with entries \({d}_{ii}= \sum _{j=1}^N w_{ij}\). The normalized combinatorial Laplacian is \(\mathbf{L}=\mathbf{D}^{-1/2}\mathbf{\varDelta }\mathbf{D}^{-1/2}\). We define \(\mathbf {\Lambda }\), an \(N \times N\) diagonal matrix with \({\lambda }_{ii}=\infty \) for \(i \le L\), and \(\lambda _{ii}= 0\) otherwise. The vector \(\mathbf{f} = [f_1 \, f_2 \cdots f_N]^T\) holds the confidence scores for each of the N examples. This formulation is for the binary setting; it extends to the multi-label setting as described below.
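A minimal sketch of constructing these graph quantities (the k-NN Gaussian weight matrix \(\mathbf{W}\), the combinatorial Laplacian \(\mathbf{\varDelta }\), and the normalized Laplacian \(\mathbf{L}\)) is given below; the values of k and the Gaussian width \(\sigma \) are illustrative assumptions (the width is tuned on validation data in our experiments).

```python
import numpy as np

def build_graph(X, k=15, sigma=1.0):
    """Build the k-NN Gaussian weight matrix W, the combinatorial
    Laplacian Delta = D - W, and the normalized Laplacian L."""
    N = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T  # pairwise squared distances
    W = np.exp(-dist2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)                           # no self-loops
    # keep only each node's k nearest neighbours, then symmetrize
    idx = np.argsort(dist2, axis=1)[:, 1:k + 1]
    mask = np.zeros((N, N), dtype=bool)
    mask[np.repeat(np.arange(N), k), idx.ravel()] = True
    W = np.where(mask | mask.T, W, 0.0)
    d = W.sum(axis=1)
    Delta = np.diag(d) - W                             # combinatorial Laplacian
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = d_inv_sqrt @ Delta @ d_inv_sqrt                # normalized Laplacian
    return W, Delta, L
```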

In [9], the objective function in the Gaussian Random Field (GRF) method for graph-based semi-supervised learning in the single-label setting is formulated as follows:

$$\begin{aligned}&E (\mathbf f )= E_l (\mathbf f )+ \alpha E_s (\mathbf f )\,\, \text {where}\end{aligned}$$
(1)
$$\begin{aligned}&E_l (\mathbf f )= \infty \sum _{i \in \mathcal {L}} (f_i-y_i)^2= (\mathbf f -\mathbf y )^T \varLambda (\mathbf f -\mathbf y ) \end{aligned}$$
(2)
$$\begin{aligned}&E_s (\mathbf f )=\frac{1}{2} \sum _{i, j \in \mathcal {L}\cup {\mathcal {U}}} {w}_{ij} (f_i-f_j)^2 = \mathbf f ^T \varDelta \mathbf f \end{aligned}$$
(3)

Here, \(E_l (\mathbf f )\) is the term that penalizes deviation from the already assigned labels (labeled data), and \(E_s (\mathbf f )\) is the penalty term that enforces smoothness of labels over the graph. The second term encodes the assumption that if two examples \(\mathbf {x}_i\) and \(\mathbf {x}_j\) are similar, the predictions \(f_i\) and \(f_j\) should be close as well.
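With the infinite penalty in \(E_l\) clamping the labeled scores to their given values, minimizing \(E(\mathbf{f})\) yields the harmonic solution \(\mathbf{f}_u = \mathbf{\varDelta }_{uu}^{-1}\mathbf{W}_{ul}\mathbf{y}_l\) for the unlabeled nodes [9]. A minimal sketch follows; the index arrays `labeled` and `unlabeled` are assumptions, and \(\mathbf{y}_l\) may be a vector or, in the multi-label case, an \(L \times K\) matrix.

```python
import numpy as np

def grf_solve(Delta, W, y_l, labeled, unlabeled):
    """Harmonic GRF solution: Delta_uu f_u = W_ul y_l."""
    Delta_uu = Delta[np.ix_(unlabeled, unlabeled)]
    W_ul = W[np.ix_(unlabeled, labeled)]
    return np.linalg.solve(Delta_uu, W_ul @ y_l)  # scores for unlabeled nodes
```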

In [10], the authors propose the Local and Global Consistency (LGC) method, where the two terms are modified as given below.

$$\begin{aligned}&E_l (\mathbf f )= \sum _{i \in \mathcal {L}\cup \mathcal {U}} (f_i-y_i)^2= (\mathbf f -\mathbf y )^T (\mathbf f -\mathbf y ) \end{aligned}$$
(4)
$$\begin{aligned}&E_s (\mathbf f )=\frac{1}{2} \sum _{i, j \in \mathcal {L}\cup {\mathcal {U}}} {w}_{ij} (\frac{f_i}{\sqrt{d_{ii}}}-\frac{f_j}{\sqrt{d_{jj}}})^2 = \mathbf f ^T L \mathbf f \end{aligned}$$
(5)

In the multi-label setting, the matrix \(\mathbf Y \) is an \(N \times K\) matrix such that \(y_{ik}\) is equal to 1 if a labeled example \(\mathbf x _i\) has label k associated with it, and 0 otherwise. This corresponds to the given ground truth. Similarly, the prediction matrix \(\mathbf{F}\) is an \(N \times K\) matrix where \(f_{ik}\) indicates the confidence that the label \(y_k\) is associated with the example \(\mathbf {x}_i\).

The approach in [5] introduces a term \(E_c (\mathbf {F})\) corresponding to a regularizer for the label correlation. In the \(K \times K\) label correlation matrix \(\mathbf C \), the entry \({c}_{kl}\) represents the correlation between label \(y_k\) and label \(y_l\), which can be estimated from label co-occurrence. The term \(E_c(\mathbf {F})\) is defined as follows:

$$\begin{aligned} E_c (\mathbf{F}) = \frac{1}{2} \sum _{i=1}^N \sum _{k, l =1}^{K} { c}_{kl} ({f}_{ik}-{f}_{il})^2= -tr \mathbf{(FC^{'}F^T)} \end{aligned}$$
(6)

where \(\mathbf C ^{'}= \mathbf C -\mathbf D _c\) and \(\mathbf D _c\) is a diagonal matrix with diagonal entries \((\mathbf{D}_c)_{ii}= \sum _{j=1}^K c_{ij}\). Here, \(tr (\mathbf{M})\) is the trace of the matrix \(\mathbf {M}\). The term \(E_c(\mathbf {F})\) quantifies smoothness in the label space rather than in the input space. If the correlation between two labels \(y_k\) and \(y_l\) is high, the predictions \(f_{ik}\) and \(f_{il}\) should be similar.
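For completeness, the identity in (6) follows from the symmetry of \(\mathbf{C}\); the factor \(\frac{1}{2}\) matches the convention used for \(E_s\):

$$\begin{aligned} \frac{1}{2} \sum _{i=1}^N \sum _{k, l =1}^{K} c_{kl} (f_{ik}-f_{il})^2 = \sum _{i=1}^N \sum _{k, l =1}^{K} c_{kl} \left( f_{ik}^2 - f_{ik}f_{il} \right) = tr \mathbf{(FD_cF^T)} - tr \mathbf{(FCF^T)} = -tr \mathbf{(FC^{'}F^T)} \end{aligned}$$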

In the multi-label case, the terms \(E_l (\mathbf F )\), \( E_s (\mathbf F )\), and \(E_c (\mathbf F )\) are computed as follows:

$$\begin{aligned}&E_l (\mathbf F ) = tr \mathbf { ( (F-Y)^T \Lambda (F- Y))} \end{aligned}$$
(7)
$$\begin{aligned}&E_s (\mathbf F ) = tr \mathbf { (F^ T \varDelta F)} \end{aligned}$$
(8)
$$\begin{aligned}&E_c (\mathbf F ) = -tr\mathbf { (FC^{'}F^T )} \end{aligned}$$
(9)
$$\begin{aligned}&E (\mathbf F ) = tr \mathbf{( (F - Y)^T \Lambda (F- Y)) }+\alpha \cdot tr \mathbf{(F^T \varDelta F) }- \beta \cdot tr \mathbf{(FC^{'}F^T)} \end{aligned}$$
(10)

where \({\alpha }\) and \({\beta }\) are trade-off parameters. The formulation in (10) is referred to as Multi-Label Correlation Gaussian Random Field (MLC-GRF), and its solution is obtained by solving a Sylvester equation [11].

Similarly, the objective function in the Multi-Label Correlation Local and Global Consistency (MLC-LGC) method is formulated in Eq. (11), where \(\mu \) and \(\nu \) are hyper-parameters. The solution to the optimization problem in (11) is given by Eq. (12).

$$\begin{aligned}&E (\mathbf{F}) = tr{ \mathbf ( (F-Y)^T (F-Y))}+ \mu \cdot tr \mathbf{(F^TLF)} -\nu \cdot tr \mathbf{(FC^{'}F^T)} \end{aligned}$$
(11)
$$\begin{aligned}&(\mu \mathbf{L}+\mathbf{I})\mathbf{F}- \nu \mathbf{F}\mathbf{C}^{'} = \mathbf{Y} \end{aligned}$$
(12)

This is also a Sylvester equation, similar to the one obtained for the MLC-GRF method.
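Equation (12) has the form \(\mathbf{A}\mathbf{F} + \mathbf{F}\mathbf{B} = \mathbf{Q}\) with \(\mathbf{A} = \mu \mathbf{L} + \mathbf{I}\), \(\mathbf{B} = -\nu \mathbf{C}^{'}\) and \(\mathbf{Q} = \mathbf{Y}\), and can be handed to any standard Sylvester solver. The sketch below uses SciPy (an illustration, not our MATLAB implementation); the same routine applies to MLC-GRF with \(\mathbf{A} = \mathbf{\Lambda } + \alpha \mathbf{\varDelta }\) and \(\mathbf{Q} = \mathbf{\Lambda }\mathbf{Y}\), after replacing the infinite entries of \(\mathbf{\Lambda }\) by a suitably large constant.

```python
import numpy as np
from scipy.linalg import solve_sylvester

def mlc_lgc_solve(L, Cp, Y, mu=1.0, nu=0.1):
    """Solve (mu*L + I) F - nu * F Cp = Y (Eq. 12) for the N x K matrix F.
    L: normalized Laplacian; Cp: the K x K matrix C'; mu, nu: assumed values."""
    A = mu * L + np.eye(L.shape[0])   # N x N left coefficient
    B = -nu * Cp                      # K x K right coefficient
    return solve_sylvester(A, B, Y)   # solves A F + F B = Y
```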

3 Proposed Methods for Semi-supervised Multi-label Learning

3.1 Label Correlation Propagation-GRF (CP-GRF)

As discussed in the previous sections, incorporating label correlation information can help improve the performance of the classifier. Our fundamental hypothesis here is that the predictions for a given example should be label-correlation consistent, i.e., if two labels are correlated and the prediction score for one of the labels is high, the score for the other label should also be high. Thus, the labels of one example are propagated to the correlated labels of that example as follows:

$$\begin{aligned} f^{(t+1)}_{ik} = f^{initial}_{ik} + \alpha \sum _{l=1}^K f^{(t)}_{il}c_{lk}, \quad i = 1, 2, \ldots , N, \;\; k = 1, 2, \ldots , K \end{aligned}$$
(13)

Here, \(f^{(t)}\) represents the prediction for a given example at iteration t. The prediction score of a label for a given example is obtained both from the initial prediction (\(f^{initial}\)) and from the other labels, based on the correlation between the labels. The parameter \(\alpha \) balances the two terms and is chosen by cross-validation. This iterative update is repeated till convergence. Equation (13) in matrix form is given below.

$$\begin{aligned} \mathbf{F}^{(t+1)} = \mathbf{F}^{initial}+ \alpha \mathbf{F}^{(t)} \mathbf{C} \end{aligned}$$
(14)

The CP-GRF method is similar to the PageRank approach [12], where the rank of a webpage is proportional to the ranks of its incoming neighbours. The CP-GRF method takes higher-order label correlations into consideration by iteratively propagating the second-order correlations. This method involves propagation on the label correlation graph for each example. The CP-GRF method is expected to perform well when the number of labels is large and the labels have significant correlations. The initial predictions can in principle be obtained from any typical multi-label classifier. In order to validate our method, \(\mathbf {F}^{initial}\) is obtained from the GRF method; a sketch of the update is given below.
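A minimal sketch of this propagation (\(\alpha \), the iteration cap and the tolerance are illustrative assumptions):

```python
import numpy as np

def cp_propagate(F_init, C, alpha=0.1, max_iter=100, tol=1e-6):
    """CP update of Eq. (14): F <- F_init + alpha * F C, run to convergence.
    F_init: N x K initial scores (e.g. from GRF); C: K x K correlations."""
    F = F_init.copy()
    for _ in range(max_iter):
        F_new = F_init + alpha * F @ C
        if np.linalg.norm(F_new - F) < tol:   # stop at convergence
            return F_new
        F = F_new
    return F
```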

3.2 Weighted Label Correlation Propagation-GRF (WCP-GRF)

The WCP-GRF method is an extension of the CP-GRF method. In this method, the correlation is propagated not only from the other correlated labels of an example but also from the examples close to it. The hypothesis here is that label correlation is a local effect, i.e., the predictions for a given label will be influenced by the predictions for the correlated labels of the neighbours. Let \(kNN(\mathbf{x}_i)\) represent the k-nearest neighbours of \(\mathbf{x}_i\). The update equations in the proposed WCP-GRF method are given by:

$$\begin{aligned}&f_{ik}^{(t+1)} = f_{ik}^{initial}+ \alpha \sum _{\mathbf {x}_j \in kNN (\mathbf {x}_i) } \sum _{l=1}^K w_{ij}f^{(t)}_{jl}c_{lk} \end{aligned}$$
(15)
$$\begin{aligned}&\mathbf{F}^{(t+1)} = \mathbf{F}^{initial}+ \alpha \mathbf{WF}^{(t)} \mathbf{C} \end{aligned}$$
(16)

As seen in (15), in each iteration there is a contribution from the initial prediction (from any multi-label classifier) as well as from the correlated labels of the k nearest neighbours in the feature space. Since the contribution is only from the nearest neighbours, the WCP-GRF method uses the k-NN based weight matrix \(\mathbf {W}\), i.e., the weight entries are 0 if the two nodes are not among the k-nearest neighbours of each other. The matrix update in the WCP-GRF method is given by Eq. (16). Here, the initial predictions \(\mathbf {F}^{initial}\) are obtained from the GRF method, hence the name WCP-GRF. We use the iterative method and terminate at convergence or after a sufficient number of iterations; a sketch follows.
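A minimal sketch of the weighted propagation (\(\mathbf{W}\) is the k-NN weight matrix from before; \(\alpha \), the iteration cap and the tolerance are illustrative assumptions):

```python
import numpy as np

def wcp_propagate(F_init, W, C, alpha=0.1, max_iter=100, tol=1e-6):
    """WCP update of Eq. (16): F <- F_init + alpha * W F C.
    W: k-NN weights; F_init: N x K scores; C: K x K correlations."""
    F = F_init.copy()
    for _ in range(max_iter):
        F_new = F_init + alpha * W @ F @ C
        if np.linalg.norm(F_new - F) < tol:   # stop at convergence
            return F_new
        F = F_new
    return F
```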

The remaining proposed methods are the same as those discussed above but use the normalized graph Laplacian instead of the combinatorial graph Laplacian, taking their initial predictions from the LGC method. The Label Correlation Propagation-LGC (CP-LGC) method is the same as in Sect. 3.1 except that the initial predictions come from the LGC method. Similarly, the Weighted Label Correlation Propagation-LGC (WCP-LGC) method is the same as in Sect. 3.2 except that the initial predictions are taken from the LGC method.

4 Experiments and Results

Details of the benchmark datasets used for comparison of the various methods are given in Table 1. The evaluation metrics used to compare the methods are: One error, Coverage, Average precision, Hamming loss and Ranking loss [13]. The label correlation matrix \(\mathbf{C}\) is calculated using cosine similarity on the labeled data. For most of the datasets, the label correlation for the majority of label pairs is low; however, there are a few pairs of labels with high correlation.
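A minimal sketch of this estimation (zeroing the diagonal, i.e. ignoring self-correlation, is our assumption here):

```python
import numpy as np

def label_correlation(Y_l, eps=1e-12):
    """Cosine similarity between label columns of the L x K labeled matrix Y_l."""
    norms = np.linalg.norm(Y_l, axis=0)                # per-label column norms
    C = (Y_l.T @ Y_l) / (np.outer(norms, norms) + eps)
    np.fill_diagonal(C, 0.0)                           # assumption: drop self-correlation
    return C
```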

Table 1. Details of the datasets used for experimentation

Each dataset has been divided into 10% labeled data, 70% unlabeled data and the rest as test data. All hyper-parameters were selected by choosing the maximum average precision on the validation data over 5 runs. All results correspond to an average of 5 runs, and the standard deviation of each metric has also been recorded. The weight matrix used is a k-nearest neighbours matrix with a Gaussian function, where the width parameter is chosen based on performance on the validation data. The number of nearest neighbours (k) is fixed at 15 as it does not affect the results much. For Hamming loss, the number of labels for a given example is chosen based on the average label cardinality of the dataset. All the techniques have been implemented in MATLAB and run on an 8-core machine with 32 GB RAM. The parameters \( \beta \), \(\mu \) and \(\nu \) were chosen based on cross-validation. The performance of the different methods for multi-label classification on the different datasets is presented in Table 2. In Fig. 1, the average precision for different sizes of the labeled dataset is plotted for the various datasets. The test ratio is fixed at 20% and the labeled data size is varied. For all the datasets, the average precision increases with the size of the labeled dataset.
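As a small illustration of this protocol (a sketch; the function name and fixed seed are assumptions), the split can be realized as:

```python
import numpy as np

def split_indices(N, labeled_frac=0.1, unlabeled_frac=0.7, seed=0):
    """Random 10/70/20 split into labeled, unlabeled and test index sets."""
    perm = np.random.default_rng(seed).permutation(N)
    n_l, n_u = int(labeled_frac * N), int(unlabeled_frac * N)
    return perm[:n_l], perm[n_l:n_l + n_u], perm[n_l + n_u:]
```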

Table 2. Performance of the transductive methods for multi-label classification on different datasets
Fig. 1. Variation of average precision with the size of the labeled data for different datasets

In Fig. 2, the variation of average precision with the size of the unlabeled dataset is plotted for two of the datasets. We keep the labeled ratio fixed at 5% and the test ratio at 20%, vary the unlabeled dataset size, and observe the effect on average precision. Again, we observe that the average precision increases with the amount of unlabeled data, showing the importance of unlabeled data.

Fig. 2. Variation of average precision with the size of the unlabeled dataset for different datasets

We observe the following for the transductive methods:

  • In most of the cases, the weighted label correlation based methods perform better than the other methods.

  • In general, the LGC based methods perform better than the GRF based methods. This is expected, as the normalized combinatorial Laplacian is a better representation than the conventional combinatorial graph Laplacian.

  • An increase in the labeled data, and in the unlabeled data (up to a certain extent), results in an increase in performance.

5 Summary and Conclusion

In this paper, we introduced the problem of semi-supervised multi-label learning and discussed some of the recent graph-based semi-supervised methods. We proposed label correlation based propagation methods to improve the predictions. The proposed methods outperform the state-of-the-art methods. Extensive experiments validate our hypothesis on the importance of accounting for label correlation, and also show the value of using both labeled and unlabeled data. In future work, we would like to incorporate higher-order correlations directly.