
The use of data-derived label hierarchies in multi-label classification


An Erratum to this article was published on 05 July 2016

Abstract

In contrast to traditional (multi-class) learning approaches that assume label independence, multi-label learning approaches must deal with the existing label dependencies and relations. Many approaches try to model these dependencies during learning and integrate them into the final predictive model, without making a clear distinction between the learning process and the process of modeling the label dependencies. Moreover, the label relations incorporated in the learned model are not directly visible and cannot be (re)used in conjunction with other learning approaches. In this paper, we investigate the use of label hierarchies in multi-label classification, constructed in a data-driven manner. We start from flat label sets and construct label hierarchies from the label sets that appear in the annotations of the training data by using a hierarchical clustering approach. The obtained hierarchies are then used in conjunction with hierarchical multi-label classification (HMC) approaches (two local model approaches for HMC, based on SVMs and PCTs, and two global model approaches, based on PCTs for HMC and ensembles thereof). The experimental results reveal that the use of a data-derived label hierarchy can significantly improve the performance of single predictive models in multi-label classification as compared to the use of a flat label set, while this does not hold for the ensemble models.
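The general idea of data-derived hierarchy construction can be illustrated with a short sketch (not the authors' exact procedure): the label columns of a toy training matrix are clustered agglomeratively, and the resulting dendrogram is read off as a label hierarchy. The label names, the Jaccard distance, and the complete-linkage criterion are assumptions made only for this example.

```python
# A minimal sketch of deriving a label hierarchy from training annotations
# by hierarchical (agglomerative) clustering of the label columns.
# Data and parameter choices are illustrative assumptions, not the exact
# procedure used in the paper.
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree
from scipy.spatial.distance import pdist

# Toy label matrix Y: rows are examples, columns are labels (1 = relevant).
Y = np.array([[1, 1, 0, 0],
              [1, 0, 0, 1],
              [0, 1, 1, 0],
              [0, 0, 1, 1],
              [1, 1, 0, 1]])
label_names = ["l1", "l2", "l3", "l4"]   # hypothetical label names

# Distance between two labels = Jaccard distance of their example sets.
dist = pdist(Y.T, metric="jaccard")

# Agglomerative clustering of the labels (linkage criterion is a free choice).
Z = linkage(dist, method="complete")

# Turn the dendrogram into an explicit hierarchy: every internal node becomes
# a meta-label whose children are original labels or other meta-labels.
def to_hierarchy(node):
    if node.is_leaf():
        return label_names[node.id]
    return {"meta_%d" % node.id: [to_hierarchy(node.get_left()),
                                  to_hierarchy(node.get_right())]}

print(to_hierarchy(to_tree(Z)))
```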


Notes

  1. The PCT framework is implemented in the CLUS system, which is available at http://www.cs.kuleuven.be/~dtai/clus.

  2. We use the term parent(λ) for the direct parent label of λ (the label at the previous level that is directly connected to λ) and the term ancestor for all labels on the path from the root of the hierarchy to parent(λ) (including parent(λ)).

  3. http://mulan.sourceforge.net/

  4. http://clus.sourceforge.net

References

  • Bauer, E., & Kohavi, R. (1999). An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36(1), 105–139.


  • Blockeel, H., Raedt, L.D., & Ramon, J. (1998). Top-down induction of clustering trees. In Proceedings of the 15th international conference on machine learning (pp. 55–63).

  • Boutell, M.R., Luo, J., Shen, X., & Brown, C.M. (2004). Learning multi-label scene classification. Pattern Recognition, 37(9), 1757–1771.


  • Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140.


  • Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.


  • Breiman, L., Friedman, J., Olshen, R., & Stone, C.J. (1984). Classification and regression trees. Chapman & Hall/CRC.

  • Brinker, K., Fürnkranz, J., & Hüllermeier, E. (2006). A unified model for multilabel classification and ranking. In Proceedings of the 17th european conference on artificial intelligence (ECAI 2006), Riva del Garda, Italy (pp. 489–493).

  • Chang, C.C., & Lin, C.J. (2001). LIBSVM: A library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

  • Clare, A., & King, R.D. (2001). Knowledge discovery in multi-label phenotype data. In Proceedings of the 5th european conference on PKDD (pp. 42–53).

  • Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7, 1–30.


  • Duygulu, P., Barnard, K., de Freitas, J., & Forsyth, D. (2002). Object recognition as machine translation: learning a lexicon for a fixed image vocabulary. In Proceedings of the 7th european conference on computer vision (pp. 349–354).

  • Elisseeff, A., & Weston, J. (2005). A kernel method for Multi-Labelled classification. In Proceedings of the annual ACM conference on research and development in information retrieval (pp. 274–281).

  • Friedman, M. (1940). A comparison of alternative tests of significance for the problem of m rankings. Annals of Mathematical Statistics, 11, 86–92.


  • Gibaja, E., & Ventura, S. (2015). A tutorial on multilabel learning. ACM Computing Surveys, 47(3), 52:1–52:38.


  • Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I.H. (2009). The WEKA data mining software: an update. SIGKDD Explorations, 11, 10–18.


  • Katakis, I., Tsoumakas, G., & Vlahavas, I. (2008). Multilabel text classification for automated tag suggestion. In Proceedings of the ECML/PKDD discovery challenge (pp. 124–135).

  • Klimt, B., & Yang, Y. (2004). The Enron corpus: a new dataset for email classification research. In Proceedings of the 15th european conference on machine learning (pp. 217–226).

  • Kocev, D. (2011). Ensembles for predicting structured outputs. Ph.D. thesis, IPS Jožef Stefan, Ljubljana, Slovenia.

  • Kocev, D., Vens, C., Struyf, J., & Džeroski, S. (2007). Ensembles of multi-objective decision trees. In Proceedings of the 18th european conference on machine learning (pp. 624–631).

  • Kocev, D., Vens, C., Struyf, J., & Džeroski, S. (2013). Tree ensembles for predicting structured outputs. Pattern Recognition, 46(3), 817–833.


  • Kong, X., & Yu, P.S. (2011). An ensemble-based approach to fast classification of multilabel data streams. In Proceedings of the 7th international conference on collaborative computing: Networking, Applications and Worksharing (pp. 95–104).

  • Levatić, J., Kocev, D., & Džeroski, S. (2014). The importance of the label hierarchy in hierarchical multi-label classification. Journal of Intelligent Information Systems, 45(2), 247–271.


  • Li, P., Li, H., & Wu, M. (2013). Multi-label ensemble based on variable pairwise constraint projection. Information Sciences, 222(0), 269–281.


  • Madjarov, G., Dimitrovski, I., Gjorgjevikj, D., & Džeroski, S. (2015). Evaluation of different data-derived label hierarchies in multi-label classification. In New frontiers in mining complex patterns, Lecture Notes in Computer Science, Vol. 8983 (pp. 19–37). Springer International Publishing.

  • Madjarov, G., Kocev, D., Gjorgjevikj, D., & Džeroski, S. (2012). An extensive experimental comparison of methods for multi-label learning. Pattern Recognition, 45(9), 3084–3104.


  • Nemenyi, P.B. (1963). Distribution-free multiple comparisons. Ph.D. thesis, Princeton University.

  • Quinlan, J.R. (1993). C4.5: Programs for machine learning. Morgan Kaufmann.

  • Read, J., Pfahringer, B., Holmes, G., & Frank, E. (2009). Classifier chains for multi-label classification. In Proceedings of the 20th european conference on machine learning (pp. 254–269).

  • Silla, C.N., Jr., & Freitas, A.A. (2011). A survey of hierarchical classification across different application domains. Data Mining and Knowledge Discovery, 22, 31–72.


  • Snoek, C.G.M., Worring, M., van Gemert, J.C., Geusebroek, J.M., & Smeulders, A.W.M. (2006). The challenge problem for automated detection of 101 semantic concepts in multimedia. In Proceedings of the 14th annual ACM international conference on multimedia (pp. 421–430).

  • Srivastava, A., & Zane-Ulman, B. (2005). Discovering recurring anomalies in text reports regarding complex space systems. In Proceedings of the IEEE aerospace conference (pp. 55–63).

  • Trohidis, K., Tsoumakas, G., Kalliris, G., & Vlahavas, I. (2008). Multilabel classification of music into emotions. In Proceedings of the 9th international conference on music information retrieval (pp. 320–330).

  • Tsoumakas, G., & Katakis, I. (2007). Multi-label classification: an overview. International Journal of Data Warehousing and Mining, 3(3), 1–13.


  • Tsoumakas, G., Katakis, I., & Vlahavas, I. (2008). Effective and efficient multilabel classification in domains with large number of labels. In Proceedings of the ECML/PKDD workshop on mining multidimensional data (pp. 30–44).

  • Tsoumakas, G., & Vlahavas, I. (2007). Random k-labelsets: an ensemble method for multilabel classification. In Proceedings of the 18th european conference on machine learning (pp. 406–417).

  • Vens, C., Struyf, J., Schietgat, L., Džeroski, S., & Blockeel, H. (2008). Decision trees for hierarchical multi-label classification. Machine Learning, 73(2), 185–214.


  • Zhang, M.L., & Zhou, Z.H. (2014). A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 26(8), 1819–1837.



Acknowledgments

We would like to acknowledge the support of the European Commission through the project MAESTRA - Learning from Massive, Incompletely annotated, and Structured Data (Grant number ICT-2013-612944).

Author information


Corresponding author

Correspondence to Gjorgji Madjarov.

Appendices

Appendix A: Evaluation measures

In this section, we present the measures used to evaluate the predictive performance of the compared methods in our experiments. In the definitions below, \(\mathcal{Y}_{i}\) denotes the set of true labels of example \(\mathbf{x_{i}}\) and \(h(\mathbf{x_{i}})\) denotes the set of predicted labels for the same example. All definitions refer to the multi-label setting.

A.1 Example-based measures

Hamming loss evaluates how many times an example-label pair is misclassified, i.e., a label not belonging to the example is predicted or a label belonging to the example is not predicted. The smaller the value of \(hamming\_loss(h)\), the better the performance; the performance is perfect when \(hamming\_loss(h) = 0\). This metric is defined as:

$$ hamming\_loss(h)=\frac{1}{N}\sum\limits^{N}_{i=1}\frac{1}{Q}\left|h(\mathbf{x_{i}}){\Delta} \mathcal{Y}_{i}\right| $$
(7)

where Δ stands for the symmetric difference between two sets, N is the number of examples and Q is the total number of possible class labels.

Accuracy for a single example \(\mathbf{x_{i}}\) is defined by the Jaccard similarity coefficient between the label sets \(h(\mathbf{x_{i}})\) and \(\mathcal{Y}_{i}\). Accuracy is micro-averaged across all examples.

$$ accuracy(h)=\frac{1}{N}\sum\limits^{N}_{i=1}\frac{\left|h(\mathbf{x_{i}})\bigcap \mathcal{Y}_{i}\right|}{\left|h(\mathbf{x_{i}})\bigcup \mathcal{Y}_{i}\right|} $$
(8)

Precision is defined as:

$$ precision(h)=\frac{1}{N}\sum\limits^{N}_{i=1}\frac{\left|h(\mathbf{x_{i}})\bigcap \mathcal{Y}_{i}\right|}{\left|h(\mathbf{x_{i}})\right|} $$
(9)

Recall is defined as:

$$ recall(h)=\frac{1}{N}\sum\limits^{N}_{i=1}\frac{\left|h(\mathbf{x_{i}})\bigcap \mathcal{Y}_{i}\right|}{\left|\mathcal{Y}_{i}\right|} $$
(10)

The \(F_{1}\) score is the harmonic mean between precision and recall and is defined as:

$$ F_{1}=\frac{1}{N}\sum\limits^{N}_{i=1}\frac{2 \times \left|h(\mathbf{x_{i}}) \cap \mathcal{Y}_{i}\right|}{\left|h(\mathbf{x_{i}})\right| + \left|\mathcal{Y}_{i}\right|} $$
(11)

\(F_{1}\) is an example-based metric and its value is an average over all examples in the dataset. \(F_{1}\) reaches its best value at 1 and its worst at 0.

Subset accuracy or classification accuracy is defined as follows:

$$ subset\_accuracy(h)=\frac{1}{N}\sum\limits^{N}_{i=1}I(h(\mathbf{x_{i}})=\mathcal{Y}_{i}) $$
(12)

where \(I(true) = 1\) and \(I(false) = 0\). This is a very strict evaluation measure as it requires the predicted set of labels to be an exact match of the true set of labels.
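To make these definitions concrete, here is a minimal sketch that computes the example-based measures on a small hypothetical dataset, with the true and predicted label sets represented as Python sets of label indices.

```python
# Example-based measures (Appendix A.1) on hypothetical data.
# Y_true[i] and Y_pred[i] are the sets of true / predicted labels of example i.
Y_true = [{0, 1}, {2}, {1, 3}]
Y_pred = [{0, 1}, {2, 3}, {1}]
Q = 4                                   # total number of possible labels
N = len(Y_true)

# Symmetric difference for hamming loss, intersection/union for the others.
hamming_loss = sum(len(t ^ p) for t, p in zip(Y_true, Y_pred)) / (N * Q)
accuracy = sum(len(t & p) / len(t | p) for t, p in zip(Y_true, Y_pred)) / N
# Empty predictions are counted as 0 to avoid division by zero.
precision = sum(len(t & p) / len(p) for t, p in zip(Y_true, Y_pred) if p) / N
recall = sum(len(t & p) / len(t) for t, p in zip(Y_true, Y_pred)) / N
f1 = sum(2 * len(t & p) / (len(t) + len(p)) for t, p in zip(Y_true, Y_pred)) / N
subset_accuracy = sum(t == p for t, p in zip(Y_true, Y_pred)) / N

print(hamming_loss, accuracy, precision, recall, f1, subset_accuracy)
```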

A.2 Label-based measures

Macro precision (precision averaged across all labels) is defined as:

$$ macro\_precision=\frac{1}{Q}\sum\limits^{Q}_{j=1}\frac{tp_{j}}{tp_{j} + fp_{j}} $$
(13)

where \(tp_{j}\) and \(fp_{j}\) are the numbers of true positives and false positives for the label \(\lambda_{j}\) considered as a binary class.

Macro recall (recall averaged across all labels) is defined as:

$$ macro\_recall=\frac{1}{Q}\sum\limits^{Q}_{j=1}\frac{tp_{j}}{tp_{j} + fn_{j}} $$
(14)

where \(tp_{j}\) is defined as for macro precision and \(fn_{j}\) is the number of false negatives for the label \(\lambda_{j}\) considered as a binary class.

Macro \(F_{1}\) is the harmonic mean of precision and recall, calculated per label and then averaged across all labels. If \(p_{j}\) and \(r_{j}\) are the precision and recall for the label \(\lambda_{j}\), the macro \(F_{1}\) is

$$ macro\_F_{1}=\frac{1}{Q}\sum\limits^{Q}_{j=1}\frac{2\times p_{j} \times r_{j}}{p_{j} + r_{j}} $$
(15)

Micro precision (precision averaged over all the example/label pairs) is defined as:

$$ micro\_precision=\frac{{\sum}^{Q}_{j=1}{tp_{j}}}{{\sum}^{Q}_{j=1}{tp_{j}} + {\sum}^{Q}_{j=1}{fp_{j}}} $$
(16)

where \(tp_{j}\) and \(fp_{j}\) are defined as for macro precision.

Micro recall (recall averaged over all the example/label pairs) is defined as:

$$ micro\_recall=\frac{{\sum}^{Q}_{j=1}{tp_{j}}}{{\sum}^{Q}_{j=1}{tp_{j}} + {\sum}^{Q}_{j=1}{fn_{j}}} $$
(17)

where \(tp_{j}\) and \(fn_{j}\) are defined as for macro recall.

Micro \(F_{1}\) is the harmonic mean between micro precision and micro recall. Micro \(F_{1}\) is defined as:

$$ micro\_F_{1}=\frac{2 \times micro\_precision \times micro\_recall}{micro\_precision + micro\_recall} $$
(18)
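The difference between the macro- and micro-averaged measures is easy to see in code: macro averaging computes the measure per label and then averages, while micro averaging pools the confusion counts first. The sketch below uses hypothetical per-label counts of true positives, false positives and false negatives.

```python
# Label-based measures (Appendix A.2) from hypothetical per-label counts.
tp = [10, 3, 50, 1]                     # true positives per label
fp = [2, 1, 10, 4]                      # false positives per label
fn = [5, 2, 8, 0]                       # false negatives per label
Q = len(tp)

# Macro: compute precision/recall per label, then average over labels.
p = [tp[j] / (tp[j] + fp[j]) for j in range(Q)]
r = [tp[j] / (tp[j] + fn[j]) for j in range(Q)]
macro_precision = sum(p) / Q
macro_recall = sum(r) / Q
macro_f1 = sum(2 * p[j] * r[j] / (p[j] + r[j]) for j in range(Q)) / Q

# Micro: pool the counts over all labels, then compute the measures once.
micro_precision = sum(tp) / (sum(tp) + sum(fp))
micro_recall = sum(tp) / (sum(tp) + sum(fn))
micro_f1 = (2 * micro_precision * micro_recall
            / (micro_precision + micro_recall))

print(macro_precision, macro_recall, macro_f1)
print(micro_precision, micro_recall, micro_f1)
```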

A.3 Ranking-based measures

One error evaluates how many times the top-ranked label is not in the set of relevant labels of the example. The metric \(one\_error(f)\) takes values between 0 and 1. The smaller the value of \(one\_error(f)\), the better the performance. This evaluation metric is defined as:

$$ one\_error(f)=\frac{1}{N}\sum\limits^{N}_{i=1}\left[\!\!\!\left[\hspace{1 mm}\left[\arg\max_{\lambda \in \mathcal{Y}} f(\mathbf{x_{i}}, \lambda)\right] \notin \mathcal{Y}_{i} \hspace{1 mm}\right]\!\!\!\right] $$
(19)

where \(\lambda \in \mathcal{L} = \left\{\lambda_{1}, \lambda_{2}, ..., \lambda_{Q}\right\}\) and, for any predicate \(\pi\), \([\![\pi]\!]\) equals 1 if \(\pi\) holds and 0 otherwise. Note that, for single-label classification problems, one error is identical to ordinary classification error.

Coverage evaluates how far, on average, we need to go down the list of ranked labels in order to cover all the relevant labels of the example. The smaller the value of \(coverage(f)\), the better the performance.

$$ coverage(f)=\frac{1}{N}\sum\limits^{N}_{i=1}\max_{\lambda \in \mathcal{Y}_{i}} rank_{f}(\mathbf{x_{i}}, \lambda) - 1 $$
(20)

where \(rank_{f}(\mathbf{x_{i}}, \lambda)\) denotes the position of the label \(\lambda\) in the ranking: it maps the outputs of \(f(\mathbf{x_{i}}, \lambda)\) for any \(\lambda \in \mathcal{L}\) to \(\{1, 2, ..., Q\}\) so that \(f(\mathbf{x_{i}}, \lambda_{m}) > f(\mathbf{x_{i}}, \lambda_{n})\) implies \(rank_{f}(\mathbf{x_{i}}, \lambda_{m}) < rank_{f}(\mathbf{x_{i}}, \lambda_{n})\). The smallest possible value of \(coverage(f)\) is \(l_{c} - 1\), where \(l_{c}\) is the label cardinality of the given dataset.

Ranking loss evaluates the average fraction of label pairs that are reversely ordered for an example and is given by:

$$ ranking\ loss(f)=\frac{1}{N}\sum\limits^{N}_{i=1}\frac{\left| D_{i}\right|}{\left| \mathcal{Y}_{i}\right|\left|\bar{\mathcal{Y}_{i}}\right|} $$
(21)

where \(D_{i} = \{(\lambda_{m}, \lambda_{n}) | f(\mathbf{x_{i}}, \lambda_{m}) \leq f(\mathbf{x_{i}}, \lambda_{n}), (\lambda_{m}, \lambda_{n}) \in \mathcal{Y}_{i} \times \bar{\mathcal{Y}_{i}}\}\), while \(\bar{\mathcal{Y}_{i}}\) denotes the complementary set of \(\mathcal{Y}_{i}\) in \(\mathcal{L}\). The smaller the value of \(ranking\_loss(f)\), the better the performance; the performance is perfect when \(ranking\_loss(f) = 0\).

Average precision is the average fraction of labels ranked above a relevant label \(\lambda \in \mathcal{Y}_{i}\) that are themselves in \(\mathcal{Y}_{i}\). The performance is perfect when \(avg\_precision(f) = 1\); the larger the value of \(avg\_precision(f)\), the better the performance. This metric is defined as:

$$ avg\_precision(f)=\frac{1}{N}\sum\limits^{N}_{i=1}\frac{1}{\left| \mathcal{Y}_{i}\right|}\sum\limits_{\lambda \in \mathcal{Y}_{i}}\frac{\left| \mathcal{L}_{i}\right|}{rank_{f}(\mathbf{x_{i}}, \lambda)} $$
(22)

where \(\mathcal{L}_{i}=\{\lambda^{\prime} | rank_{f}(\mathbf{x_{i}}, \lambda^{\prime}) \leq rank_{f}(\mathbf{x_{i}}, \lambda), \lambda^{\prime} \in \mathcal{Y}_{i}\}\) and \(rank_{f}(\mathbf{x_{i}}, \lambda)\) is defined as for coverage above.
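A minimal sketch of the ranking-based measures follows, assuming that \(f\) is given as a matrix of real-valued scores with one row per example and one column per label; the scores and label sets below are hypothetical.

```python
import numpy as np

# Ranking-based measures (Appendix A.3). scores[i, j] is f(x_i, lambda_j);
# Y_true[i] is the set of relevant labels of example i (hypothetical data).
scores = np.array([[0.9, 0.2, 0.7, 0.1],
                   [0.1, 0.8, 0.3, 0.6],
                   [0.4, 0.5, 0.9, 0.2]])
Y_true = [{0, 2}, {1}, {1, 2, 3}]
N, Q = scores.shape

# rank_f(x_i, lambda_j): 1 for the highest-scoring label, Q for the lowest.
ranks = np.argsort(np.argsort(-scores, axis=1), axis=1) + 1

one_error = np.mean([np.argmax(scores[i]) not in Y_true[i] for i in range(N)])
coverage = np.mean([max(ranks[i, j] for j in Y_true[i]) - 1 for i in range(N)])

ranking_loss, avg_precision = 0.0, 0.0
for i in range(N):
    rel, irr = Y_true[i], set(range(Q)) - Y_true[i]
    # Fraction of (relevant, irrelevant) pairs that are reversely ordered.
    reversed_pairs = sum(scores[i, m] <= scores[i, n] for m in rel for n in irr)
    ranking_loss += reversed_pairs / (len(rel) * len(irr))
    # For each relevant label, the fraction of labels ranked at or above it
    # that are also relevant.
    avg_precision += sum(
        sum(ranks[i, l] <= ranks[i, lam] for l in rel) / ranks[i, lam]
        for lam in rel) / len(rel)
ranking_loss /= N
avg_precision /= N

print(one_error, coverage, ranking_loss, avg_precision)
```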

Appendix B: Complete results from the experimental evaluation

In this section, we present the complete results from the experimental evaluation, organized by group of evaluation measures. Tables 4, 5, 6, 7 and 8 give the performance of the compared methods on each of the datasets in terms of the example-based, label-based and ranking-based evaluation measures. The first column of each table lists the dataset, while the remaining columns show the performance of each method on that dataset. The best results per dataset are shown in boldface. For the bookmarks dataset, HOMER did not manage to construct a predictive model within one week under the available resources; the corresponding entries in the tables are marked with DNF (Did Not Finish).

To assess whether the overall differences in performance across the different approaches are statistically significant, we also employed the corrected Friedman test (Friedman 1940) and the post-hoc Nemenyi test (Nemenyi 1963), as recommended by Demšar (2006). We present the results from the Nemenyi post-hoc test with average rank diagrams (Demšar 2006), given in Figs. 6, 7 and 8. A critical diagram contains an enumerated axis on which the average ranks of the algorithms are drawn; the algorithms are depicted along the axis in such a manner that the best-ranking ones are at the right-most side of the diagram. The average ranks of the algorithms that do not differ significantly (at the significance level of p = 0.05) are connected with a line. For the bookmarks dataset, we penalize HOMER, which did not finish, by assigning it the lowest rank for each evaluation measure.
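The following sketch shows, on hypothetical performance scores, how the quantities behind the critical diagrams are obtained: the average ranks, the Friedman statistic (with the Iman-Davenport correction), and the Nemenyi critical distance at the 0.05 level. The score values are invented; the critical values \(q_{\alpha}\) are those tabulated by Demšar (2006).

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# Hypothetical performance scores (higher is better): one row per dataset,
# one column per algorithm. Real values would come from Tables 4-8.
scores = np.array([[0.71, 0.69, 0.74, 0.73],
                   [0.55, 0.58, 0.60, 0.57],
                   [0.80, 0.79, 0.83, 0.81],
                   [0.62, 0.61, 0.66, 0.64],
                   [0.90, 0.88, 0.91, 0.89]])
n_datasets, k = scores.shape

# Average ranks (rank 1 = best), as drawn on the critical diagrams.
ranks = np.apply_along_axis(lambda row: rankdata(-row), 1, scores)
avg_ranks = ranks.mean(axis=0)

# Friedman test; the corrected (Iman-Davenport) statistic is derived from it.
chi2, p_value = friedmanchisquare(*scores.T)
f_f = (n_datasets - 1) * chi2 / (n_datasets * (k - 1) - chi2)

# Nemenyi critical distance at alpha = 0.05; q_alpha values from Demsar (2006).
q_alpha = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850,
           7: 2.949, 8: 3.031, 9: 3.102, 10: 3.164}
cd = q_alpha[k] * np.sqrt(k * (k + 1) / (6.0 * n_datasets))

print("average ranks:", avg_ranks)
print("Friedman chi2 = %.3f, p = %.3f, F_F = %.3f, CD = %.3f"
      % (chi2, p_value, f_f, cd))
```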

B.1 Results on the example-based evaluation measures

Table 4 The performance of the multi-label classification approaches in terms of the example-based evaluation measures
Table 5 The performance of the multi-label classification approaches in terms of the example-based evaluation measures
Fig. 6 The critical diagrams for the example-based evaluation measures: the results from the Nemenyi post-hoc test at the 0.05 significance level on all the datasets

B.2 Results on the label-based evaluation measures

The large difference between the micro-based and macro-based evaluation measures is due to the averaging strategy applied to the obtained predictions, and it is more pronounced on the large datasets with a highly unbalanced number of examples per label. Namely, the micro-based measures are computed over the pooled predictions for all example/label pairs, while the macro-based measures are first computed per label (over all examples) and then averaged across labels, which means that for the macro-based measures the labels with a small number of examples are equally important as the labels with a large number of examples.
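As a hypothetical numerical illustration of this effect, consider one frequent label that is predicted well and one rare label that is predicted poorly; the counts below are invented.

```python
# Hypothetical illustration: micro vs. macro F1 under label imbalance.
# Label A is frequent and predicted well; label B is rare and predicted badly.
tp = {"A": 900, "B": 1}
fp = {"A": 50, "B": 9}
fn = {"A": 100, "B": 9}

def f1(t, p, n):
    precision, recall = t / (t + p), t / (t + n)
    return 2 * precision * recall / (precision + recall)

macro_f1 = (f1(tp["A"], fp["A"], fn["A"]) + f1(tp["B"], fp["B"], fn["B"])) / 2
micro_f1 = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))

print(macro_f1, micro_f1)   # macro F1 ~ 0.51, micro F1 ~ 0.91
```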

Table 6 The performance of the multi-label classification approaches in terms of the label-based evaluation measures
Table 7 The performance of the multi-label classification approaches in terms of the label-based evaluation measures
Fig. 7 The critical diagrams for the label-based evaluation measures: the results from the Nemenyi post-hoc test at the 0.05 significance level on all the datasets

B.3 Results on the ranking-based evaluation measures

Fig. 8 The critical diagrams for the ranking-based evaluation measures: the results from the Nemenyi post-hoc test at the 0.05 significance level on all the datasets

Table 8 The performance of the multi-label classification approaches in terms of the ranking-based evaluation measures



About this article


Cite this article

Madjarov, G., Gjorgjevikj, D., Dimitrovski, I. et al. The use of data-derived label hierarchies in multi-label classification. J Intell Inf Syst 47, 57–90 (2016). https://doi.org/10.1007/s10844-016-0405-8
