
The use of data-derived label hierarchies in multi-label classification


An Erratum to this article was published on 05 July 2016

Abstract

In contrast to traditional (multi-class) learning approaches that assume label independence, multi-label learning approaches must deal with the existing label dependencies and relations. Many approaches try to model these dependencies during learning and integrate them into the final predictive model, without making a clear distinction between the learning process and the process of modeling the label dependencies. Moreover, the label relations incorporated in the learned model are not directly visible and cannot be (re)used in conjunction with other learning approaches. In this paper, we investigate the use of label hierarchies in multi-label classification, constructed in a data-driven manner. We start from flat label sets and construct label hierarchies from the label sets that appear in the annotations of the training data by using a hierarchical clustering approach. The obtained hierarchies are then used in conjunction with hierarchical multi-label classification (HMC) approaches (two local model approaches for HMC, based on SVMs and PCTs, and two global model approaches, based on PCTs for HMC and ensembles thereof). The experimental results reveal that the use of a data-derived label hierarchy can significantly improve the performance of single predictive models in multi-label classification as compared to the use of a flat label set, while this does not hold for the ensemble models.
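The general idea of data-derived hierarchy construction can be illustrated with a short sketch (not the authors' exact procedure): the label columns of a toy training matrix are clustered agglomeratively, and the resulting dendrogram is read off as a label hierarchy. The label names, the Jaccard distance, and the complete-linkage criterion are assumptions made only for this example.

```python
# A minimal sketch of deriving a label hierarchy from training annotations
# by hierarchical (agglomerative) clustering of the label columns.
# Data and parameter choices are illustrative assumptions, not the exact
# procedure used in the paper.
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree
from scipy.spatial.distance import pdist

# Toy label matrix Y: rows are examples, columns are labels (1 = relevant).
Y = np.array([[1, 1, 0, 0],
              [1, 0, 0, 1],
              [0, 1, 1, 0],
              [0, 0, 1, 1],
              [1, 1, 0, 1]])
label_names = ["l1", "l2", "l3", "l4"]   # hypothetical label names

# Distance between two labels = Jaccard distance of their example sets.
dist = pdist(Y.T, metric="jaccard")

# Agglomerative clustering of the labels (linkage criterion is a free choice).
Z = linkage(dist, method="complete")

# Turn the dendrogram into an explicit hierarchy: every internal node becomes
# a meta-label whose children are original labels or other meta-labels.
def to_hierarchy(node):
    if node.is_leaf():
        return label_names[node.id]
    return {"meta_%d" % node.id: [to_hierarchy(node.get_left()),
                                  to_hierarchy(node.get_right())]}

print(to_hierarchy(to_tree(Z)))
```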


Notes

  1. The PCT framework is implemented in the CLUS system, which is available at http://www.cs.kuleuven.be/~dtai/clus.

  2. We use the term parent(λ) for the direct parent label of λ (the label at the previous level that is directly connected to λ) and the term ancestor for all labels on the path from the root of the hierarchy to parent(λ) (including parent(λ)).

  3. http://mulan.sourceforge.net/

  4. http://clus.sourceforge.net

References

  • Bauer, E., & Kohavi, R. (1999). An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36(1), 105–139.


  • Blockeel, H., Raedt, L.D., & Ramon, J. (1998). Top-down induction of clustering trees. In Proceedings of the 15th international conference on machine learning (pp. 55–63).

  • Boutell, M.R., Luo, J., Shen, X., & Brown, C.M. (2004). Learning multi-label scene classification. Pattern Recognition, 37(9), 1757–1771.


  • Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140.


  • Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.


  • Breiman, L., Friedman, J., Olshen, R., & Stone, C.J. (1984). Classification and regression trees. Chapman & Hall/CRC.

  • Brinker, K., Fürnkranz, J., & Hüllermeier, E. (2006). A unified model for multilabel classification and ranking. In Proceedings of the 17th european conference on artificial intelligence (ECAI 2006), Riva del Garda, Italy (pp. 489–493).

  • Chang, C.C., & Lin, C.J. (2001). LIBSVM: A library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

  • Clare, A., & King, R.D. (2001). Knowledge discovery in multi-label phenotype data. In Proceedings of the 5th european conference on PKDD (pp. 42–53).

  • Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7, 1–30.


  • Duygulu, P., Barnard, K., de Freitas, J., & Forsyth, D. (2002). Object recognition as machine translation: learning a lexicon for a fixed image vocabulary. In Proceedings of the 7th european conference on computer vision (pp. 349–354).

  • Elisseeff, A., & Weston, J. (2005). A kernel method for Multi-Labelled classification. In Proceedings of the annual ACM conference on research and development in information retrieval (pp. 274–281).

  • Friedman, M. (1940). A comparison of alternative tests of significance for the problem of m rankings. Annals of Mathematical Statistics, 11, 86–92.


  • Gibaja, E., & Ventura, S. (2015). A tutorial on multilabel learning. ACM Computing Surveys, 47(3), 52:1–52:38.


  • Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I.H. (2009). The WEKA data mining software: an update. SIGKDD Explorations, 11, 10–18.


  • Katakis, I., Tsoumakas, G., & Vlahavas, I. (2008). Multilabel text classification for automated tag suggestion. In Proceedings of the ECML/PKDD discovery challenge (pp. 124–135).

  • Klimt, B., & Yang, Y. (2004). The Enron corpus: a new dataset for email classification research. In Proceedings of the 15th european conference on machine learning (pp. 217–226).

  • Kocev, D. (2011). Ensembles for predicting structured outputs. Ph.D. thesis, IPS Jožef Stefan, Ljubljana, Slovenia.

  • Kocev, D., Vens, C., Struyf, J., & Džeroski, S. (2007). Ensembles of multi-objective decision trees. In Proceedings of the 18th european conference on machine learning (pp. 624–631).

  • Kocev, D., Vens, C., Struyf, J., & Džeroski, S. (2013). Tree ensembles for predicting structured outputs. Pattern Recognition, 46(3), 817–833.


  • Kong, X., & Yu, P.S. (2011). An ensemble-based approach to fast classification of multilabel data streams. In Proceedings of the 7th international conference on collaborative computing: Networking, Applications and Worksharing (pp. 95–104).

  • Levatić, J., Kocev, D., & Džeroski, S. (2014). The importance of the label hierarchy in hierarchical multi-label classification. Journal of Intelligent Information Systems, 45(2), 247–271.


  • Li, P., Li, H., & Wu, M. (2013). Multi-label ensemble based on variable pairwise constraint projection. Information Sciences, 222(0), 269–281.


  • Madjarov, G., Dimitrovski, I., Gjorgjevikj, D., & Džeroski, S. (2015). Evaluation of different data-derived label hierarchies in multi-label classification. In New frontiers in mining complex patterns, Lecture Notes in Computer Science, Vol. 8983 (pp. 19–37). Springer International Publishing.

  • Madjarov, G., Kocev, D., Gjorgjevikj, D., & Džeroski, S. (2012). An extensive experimental comparison of methods for multi-label learning. Pattern Recognition, 45(9), 3084–3104.


  • Nemenyi, P.B. (1963). Distribution-free multiple comparisons. Ph.D. thesis, Princeton University.

  • Quinlan, J.R. (1993). C4.5: Programs for machine learning. Morgan Kaufmann.

  • Read, J., Pfahringer, B., Holmes, G., & Frank, E. (2009). Classifier chains for multi-label classification. In Proceedings of the 20th european conference on machine learning (pp. 254–269).

  • Silla, C.N., Jr., & Freitas, A.A. (2011). A survey of hierarchical classification across different application domains. Data Mining and Knowledge Discovery, 22, 31–72.


  • Snoek, C.G.M., Worring, M., van Gemert, J.C., Geusebroek, J.M., & Smeulders, A.W.M. (2006). The challenge problem for automated detection of 101 semantic concepts in multimedia. In Proceedings of the 14th annual ACM international conference on multimedia (pp. 421–430).

  • Srivastava, A., & Zane-Ulman, B. (2005). Discovering recurring anomalies in text reports regarding complex space systems. In Proceedings of the IEEE aerospace conference (pp. 55–63).

  • Trohidis, K., Tsoumakas, G., Kalliris, G., & Vlahavas, I. (2008). Multilabel classification of music into emotions. In Proceedings of the 9th international conference on music information retrieval (pp. 320–330).

  • Tsoumakas, G., & Katakis, I. (2007). Multi-label classification: an overview. International Journal of Data Warehousing and Mining, 3(3), 1–13.


  • Tsoumakas, G., Katakis, I., & Vlahavas, I. (2008). Effective and efficient multilabel classification in domains with large number of labels. In Proceedings of the ECML/PKDD workshop on mining multidimensional data (pp. 30–44).

  • Tsoumakas, G., & Vlahavas, I. (2007). Random k-labelsets: an ensemble method for multilabel classification. In Proceedings of the 18th european conference on machine learning (pp. 406–417).

  • Vens, C., Struyf, J., Schietgat, L., Džeroski, S., & Blockeel, H. (2008). Decision trees for hierarchical multi-label classification. Machine Learning, 73(2), 185–214.


  • Zhang, M.L., & Zhou, Z.H. (2014). A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 26(8), 1819–1837.



Acknowledgments

We would like to acknowledge the support of the European Commission through the project MAESTRA - Learning from Massive, Incompletely annotated, and Structured Data (Grant number ICT-2013-612944).

Author information


Corresponding author

Correspondence to Gjorgji Madjarov.

Appendices

Appendix A: Evaluation measures

In this section, we present the measures used to evaluate the predictive performance of the compared methods in our experiments. In the definitions below, \(\mathcal{Y}_{i}\) denotes the set of true labels of example \(\mathbf{x_{i}}\) and \(h(\mathbf{x_{i}})\) denotes the set of predicted labels for the same example. All definitions refer to the multi-label setting.

A.1 Example-based measures

Hamming loss evaluates how many times an example-label pair is misclassified, i.e., a label not belonging to the example is predicted or a label belonging to the example is not predicted. The smaller the value of \(hamming\_loss(h)\), the better the performance; the performance is perfect when \(hamming\_loss(h) = 0\). This metric is defined as:

$$ hamming\_loss(h)=\frac{1}{N}\sum\limits^{N}_{i=1}\frac{1}{Q}\left|h(\mathbf{x_{i}}){\Delta} \mathcal{Y}_{i}\right| $$
(7)

where Δ stands for the symmetric difference between two sets, N is the number of examples and Q is the total number of possible class labels.

Accuracy for a single example \(\mathbf{x_{i}}\) is defined by the Jaccard similarity coefficient between the label sets \(h(\mathbf{x_{i}})\) and \(\mathcal{Y}_{i}\). Accuracy is micro-averaged across all examples.

$$ accuracy(h)=\frac{1}{N}\sum\limits^{N}_{i=1}\frac{\left|h(\mathbf{x_{i}})\bigcap \mathcal{Y}_{i}\right|}{\left|h(\mathbf{x_{i}})\bigcup \mathcal{Y}_{i}\right|} $$
(8)

Precision is defined as:

$$ precision(h)=\frac{1}{N}\sum\limits^{N}_{i=1}\frac{\left|h(\mathbf{x_{i}})\bigcap \mathcal{Y}_{i}\right|}{\left|h(\mathbf{x_{i}})\right|} $$
(9)

Recall is defined as:

$$ recall(h)=\frac{1}{N}\sum\limits^{N}_{i=1}\frac{\left|h(\mathbf{x_{i}})\bigcap \mathcal{Y}_{i}\right|}{\left|\mathcal{Y}_{i}\right|} $$
(10)

The \(F_{1}\) score is the harmonic mean between precision and recall and is defined as:

$$ F_{1}=\frac{1}{N}\sum\limits^{N}_{i=1}\frac{2 \times \left|h(\mathbf{x_{i}}) \cap \mathcal{Y}_{i}\right|}{\left|h(\mathbf{x_{i}})\right| + \left|\mathcal{Y}_{i}\right|} $$
(11)

\(F_{1}\) is an example-based metric and its value is an average over all examples in the dataset. \(F_{1}\) reaches its best value at 1 and its worst at 0.

Subset accuracy or classification accuracy is defined as follows:

$$ subset\_accuracy(h)=\frac{1}{N}\sum\limits^{N}_{i=1}I(h(\mathbf{x_{i}})=\mathcal{Y}_{i}) $$
(12)

where \(I(true) = 1\) and \(I(false) = 0\). This is a very strict evaluation measure as it requires the predicted set of labels to be an exact match of the true set of labels.
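To make these definitions concrete, here is a minimal sketch that computes the example-based measures on a small hypothetical dataset, with the true and predicted label sets represented as Python sets of label indices.

```python
# Example-based measures (Appendix A.1) on hypothetical data.
# Y_true[i] and Y_pred[i] are the sets of true / predicted labels of example i.
Y_true = [{0, 1}, {2}, {1, 3}]
Y_pred = [{0, 1}, {2, 3}, {1}]
Q = 4                                   # total number of possible labels
N = len(Y_true)

# Symmetric difference for hamming loss, intersection/union for the others.
hamming_loss = sum(len(t ^ p) for t, p in zip(Y_true, Y_pred)) / (N * Q)
accuracy = sum(len(t & p) / len(t | p) for t, p in zip(Y_true, Y_pred)) / N
# Empty predictions are counted as 0 to avoid division by zero.
precision = sum(len(t & p) / len(p) for t, p in zip(Y_true, Y_pred) if p) / N
recall = sum(len(t & p) / len(t) for t, p in zip(Y_true, Y_pred)) / N
f1 = sum(2 * len(t & p) / (len(t) + len(p)) for t, p in zip(Y_true, Y_pred)) / N
subset_accuracy = sum(t == p for t, p in zip(Y_true, Y_pred)) / N

print(hamming_loss, accuracy, precision, recall, f1, subset_accuracy)
```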

A.2 Label-based measures

Macro precision (precision averaged across all labels) is defined as:

$$ macro\_precision=\frac{1}{Q}\sum\limits^{Q}_{j=1}\frac{tp_{j}}{tp_{j} + fp_{j}} $$
(13)

where \(tp_{j}\) and \(fp_{j}\) are the numbers of true positives and false positives for the label \(\lambda_{j}\) considered as a binary class.

Macro recall (recall averaged across all labels) is defined as:

$$ macro\_recall=\frac{1}{Q}\sum\limits^{Q}_{j=1}\frac{tp_{j}}{tp_{j} + fn_{j}} $$
(14)

where \(tp_{j}\) is defined as for macro precision and \(fn_{j}\) is the number of false negatives for the label \(\lambda_{j}\) considered as a binary class.

Macro \(F_{1}\) is the harmonic mean of precision and recall, calculated per label and then averaged across all labels. If \(p_{j}\) and \(r_{j}\) are the precision and recall for the label \(\lambda_{j}\), the macro \(F_{1}\) is

$$ macro\_F_{1}=\frac{1}{Q}\sum\limits^{Q}_{j=1}\frac{2\times p_{j} \times r_{j}}{p_{j} + r_{j}} $$
(15)

Micro precision (precision averaged over all the example/label pairs) is defined as:

$$ micro\_precision=\frac{{\sum}^{Q}_{j=1}{tp_{j}}}{{\sum}^{Q}_{j=1}{tp_{j}} + {\sum}^{Q}_{j=1}{fp_{j}}} $$
(16)

where \(tp_{j}\) and \(fp_{j}\) are defined as for macro precision.

Micro recall (recall averaged over all the example/label pairs) is defined as:

$$ micro\_recall=\frac{{\sum}^{Q}_{j=1}{tp_{j}}}{{\sum}^{Q}_{j=1}{tp_{j}} + {\sum}^{Q}_{j=1}{fn_{j}}} $$
(17)

where \(tp_{j}\) and \(fn_{j}\) are defined as for macro recall.

Micro \(F_{1}\) is the harmonic mean between micro precision and micro recall. Micro \(F_{1}\) is defined as:

$$ micro\_F_{1}=\frac{2 \times micro\_precision \times micro\_recall}{micro\_precision + micro\_recall} $$
(18)
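The difference between the macro- and micro-averaged measures is easy to see in code: macro averaging computes the measure per label and then averages, while micro averaging pools the confusion counts first. The sketch below uses hypothetical per-label counts of true positives, false positives and false negatives.

```python
# Label-based measures (Appendix A.2) from hypothetical per-label counts.
tp = [10, 3, 50, 1]                     # true positives per label
fp = [2, 1, 10, 4]                      # false positives per label
fn = [5, 2, 8, 0]                       # false negatives per label
Q = len(tp)

# Macro: compute precision/recall per label, then average over labels.
p = [tp[j] / (tp[j] + fp[j]) for j in range(Q)]
r = [tp[j] / (tp[j] + fn[j]) for j in range(Q)]
macro_precision = sum(p) / Q
macro_recall = sum(r) / Q
macro_f1 = sum(2 * p[j] * r[j] / (p[j] + r[j]) for j in range(Q)) / Q

# Micro: pool the counts over all labels, then compute the measures once.
micro_precision = sum(tp) / (sum(tp) + sum(fp))
micro_recall = sum(tp) / (sum(tp) + sum(fn))
micro_f1 = (2 * micro_precision * micro_recall
            / (micro_precision + micro_recall))

print(macro_precision, macro_recall, macro_f1)
print(micro_precision, micro_recall, micro_f1)
```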

A.3 Ranking-based measures

One error evaluates how many times the top-ranked label is not in the set of relevant labels of the example. The metric \(one\_error(f)\) takes values between 0 and 1. The smaller the value of \(one\_error(f)\), the better the performance. This evaluation metric is defined as:

$$ one\_error(f)=\frac{1}{N}\sum\limits^{N}_{i=1}\left[\!\!\!\left[\hspace{1 mm}\left[\arg\max_{\lambda \in \mathcal{Y}} f(\mathbf{x_{i}}, \lambda)\right] \notin \mathcal{Y}_{i} \hspace{1 mm}\right]\!\!\!\right] $$
(19)

where \(\lambda \in \mathcal{L} = \left\{\lambda_{1}, \lambda_{2}, ..., \lambda_{Q}\right\}\) and, for any predicate \(\pi\), \([\![\pi]\!]\) equals 1 if \(\pi\) holds and 0 otherwise. Note that, for single-label classification problems, one error is identical to ordinary classification error.

Coverage evaluates how far, on average, we need to go down the list of ranked labels in order to cover all the relevant labels of the example. The smaller the value of \(coverage(f)\), the better the performance.

$$ coverage(f)=\frac{1}{N}\sum\limits^{N}_{i=1}\max_{\lambda \in \mathcal{Y}_{i}} rank_{f}(\mathbf{x_{i}}, \lambda) - 1 $$
(20)

where \(rank_{f}(\mathbf{x_{i}}, \lambda)\) denotes the position of the label \(\lambda\) in the ranking: it maps the outputs of \(f(\mathbf{x_{i}}, \lambda)\) for any \(\lambda \in \mathcal{L}\) to \(\{1, 2, ..., Q\}\) so that \(f(\mathbf{x_{i}}, \lambda_{m}) > f(\mathbf{x_{i}}, \lambda_{n})\) implies \(rank_{f}(\mathbf{x_{i}}, \lambda_{m}) < rank_{f}(\mathbf{x_{i}}, \lambda_{n})\). The smallest possible value of \(coverage(f)\) is \(l_{c} - 1\), where \(l_{c}\) is the label cardinality of the given dataset.

Ranking loss evaluates the average fraction of label pairs that are reversely ordered for an example and is given by:

$$ ranking\ loss(f)=\frac{1}{N}\sum\limits^{N}_{i=1}\frac{\left| D_{i}\right|}{\left| \mathcal{Y}_{i}\right|\left|\bar{\mathcal{Y}_{i}}\right|} $$
(21)

where \(D_{i} = \{(\lambda_{m}, \lambda_{n}) | f(\mathbf{x_{i}}, \lambda_{m}) \leq f(\mathbf{x_{i}}, \lambda_{n}), (\lambda_{m}, \lambda_{n}) \in \mathcal{Y}_{i} \times \bar{\mathcal{Y}_{i}}\}\), while \(\bar{\mathcal{Y}_{i}}\) denotes the complementary set of \(\mathcal{Y}_{i}\) in \(\mathcal{L}\). The smaller the value of \(ranking\_loss(f)\), the better the performance; the performance is perfect when \(ranking\_loss(f) = 0\).

Average precision is the average fraction of labels ranked above a relevant label \(\lambda \in \mathcal{Y}_{i}\) that are themselves in \(\mathcal{Y}_{i}\). The performance is perfect when \(avg\_precision(f) = 1\); the larger the value of \(avg\_precision(f)\), the better the performance. This metric is defined as:

$$ avg\_precision(f)=\frac{1}{N}\sum\limits^{N}_{i=1}\frac{1}{\left| \mathcal{Y}_{i}\right|}\sum\limits_{\lambda \in \mathcal{Y}_{i}}\frac{\left| \mathcal{L}_{i}\right|}{rank_{f}(\mathbf{x_{i}}, \lambda)} $$
(22)

where \(\mathcal{L}_{i}=\{\lambda^{\prime} | rank_{f}(\mathbf{x_{i}}, \lambda^{\prime}) \leq rank_{f}(\mathbf{x_{i}}, \lambda), \lambda^{\prime} \in \mathcal{Y}_{i}\}\) and \(rank_{f}(\mathbf{x_{i}}, \lambda)\) is defined as for coverage above.
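A minimal sketch of the ranking-based measures follows, assuming that \(f\) is given as a matrix of real-valued scores with one row per example and one column per label; the scores and label sets below are hypothetical.

```python
import numpy as np

# Ranking-based measures (Appendix A.3). scores[i, j] is f(x_i, lambda_j);
# Y_true[i] is the set of relevant labels of example i (hypothetical data).
scores = np.array([[0.9, 0.2, 0.7, 0.1],
                   [0.1, 0.8, 0.3, 0.6],
                   [0.4, 0.5, 0.9, 0.2]])
Y_true = [{0, 2}, {1}, {1, 2, 3}]
N, Q = scores.shape

# rank_f(x_i, lambda_j): 1 for the highest-scoring label, Q for the lowest.
ranks = np.argsort(np.argsort(-scores, axis=1), axis=1) + 1

one_error = np.mean([np.argmax(scores[i]) not in Y_true[i] for i in range(N)])
coverage = np.mean([max(ranks[i, j] for j in Y_true[i]) - 1 for i in range(N)])

ranking_loss, avg_precision = 0.0, 0.0
for i in range(N):
    rel, irr = Y_true[i], set(range(Q)) - Y_true[i]
    # Fraction of (relevant, irrelevant) pairs that are reversely ordered.
    reversed_pairs = sum(scores[i, m] <= scores[i, n] for m in rel for n in irr)
    ranking_loss += reversed_pairs / (len(rel) * len(irr))
    # For each relevant label, the fraction of labels ranked at or above it
    # that are also relevant.
    avg_precision += sum(
        sum(ranks[i, l] <= ranks[i, lam] for l in rel) / ranks[i, lam]
        for lam in rel) / len(rel)
ranking_loss /= N
avg_precision /= N

print(one_error, coverage, ranking_loss, avg_precision)
```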

Appendix B: Complete results from the experimental evaluation

In this section, we present the complete results from the experimental evaluation, organized by group of evaluation measures. Tables 4, 5, 6, 7 and 8 give the performance of the compared methods on each of the datasets in terms of the example-based, label-based and ranking-based evaluation measures. The first column of each table lists the dataset, while the remaining columns show the performance of each method on that dataset. The best results per dataset are shown in boldface. For the bookmarks dataset, HOMER did not manage to construct a predictive model within one week under the available resources; the corresponding entries in the tables are marked with DNF (Did Not Finish).

To assess whether the overall differences in performance across the different approaches are statistically significant, we also employed the corrected Friedman test (Friedman 1940) and the post-hoc Nemenyi test (Nemenyi 1963), as recommended by Demšar (2006). We present the results from the Nemenyi post-hoc test with average rank diagrams (Demšar 2006), given in Figs. 6, 7 and 8. A critical diagram contains an enumerated axis on which the average ranks of the algorithms are drawn; the algorithms are depicted along the axis in such a manner that the best-ranking ones are at the right-most side of the diagram. The average ranks of the algorithms that do not differ significantly (at the significance level of p = 0.05) are connected with a line. For the bookmarks dataset, we penalize HOMER, which did not finish, by assigning it the lowest rank for each evaluation measure.
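The following sketch shows, on hypothetical performance scores, how the quantities behind the critical diagrams are obtained: the average ranks, the Friedman statistic (with the Iman-Davenport correction), and the Nemenyi critical distance at the 0.05 level. The score values are invented; the critical values \(q_{\alpha}\) are those tabulated by Demšar (2006).

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# Hypothetical performance scores (higher is better): one row per dataset,
# one column per algorithm. Real values would come from Tables 4-8.
scores = np.array([[0.71, 0.69, 0.74, 0.73],
                   [0.55, 0.58, 0.60, 0.57],
                   [0.80, 0.79, 0.83, 0.81],
                   [0.62, 0.61, 0.66, 0.64],
                   [0.90, 0.88, 0.91, 0.89]])
n_datasets, k = scores.shape

# Average ranks (rank 1 = best), as drawn on the critical diagrams.
ranks = np.apply_along_axis(lambda row: rankdata(-row), 1, scores)
avg_ranks = ranks.mean(axis=0)

# Friedman test; the corrected (Iman-Davenport) statistic is derived from it.
chi2, p_value = friedmanchisquare(*scores.T)
f_f = (n_datasets - 1) * chi2 / (n_datasets * (k - 1) - chi2)

# Nemenyi critical distance at alpha = 0.05; q_alpha values from Demsar (2006).
q_alpha = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850,
           7: 2.949, 8: 3.031, 9: 3.102, 10: 3.164}
cd = q_alpha[k] * np.sqrt(k * (k + 1) / (6.0 * n_datasets))

print("average ranks:", avg_ranks)
print("Friedman chi2 = %.3f, p = %.3f, F_F = %.3f, CD = %.3f"
      % (chi2, p_value, f_f, cd))
```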

B.1 Results on the example-based evaluation measures

Table 4 The performance of the multi-label classification approaches in terms of the example-based evaluation measures
Table 5 The performance of the multi-label classification approaches in terms of the example-based evaluation measures
Fig. 6 The critical diagrams for the example-based evaluation measures: the results from the Nemenyi post-hoc test at the 0.05 significance level on all the datasets

B.2 Results on the label-based evaluation measures

The large difference between the micro-based and macro-based evaluation measures is due to the averaging strategy applied to the obtained predictions, and it is more pronounced on the large datasets with a highly unbalanced number of examples per label. Namely, the micro-based measures are computed over the pooled predictions for all example/label pairs, while the macro-based measures are first computed per label (over all examples) and then averaged across labels, which means that for the macro-based measures the labels with a small number of examples are equally important as the labels with a large number of examples.
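As a hypothetical numerical illustration of this effect, consider one frequent label that is predicted well and one rare label that is predicted poorly; the counts below are invented.

```python
# Hypothetical illustration: micro vs. macro F1 under label imbalance.
# Label A is frequent and predicted well; label B is rare and predicted badly.
tp = {"A": 900, "B": 1}
fp = {"A": 50, "B": 9}
fn = {"A": 100, "B": 9}

def f1(t, p, n):
    precision, recall = t / (t + p), t / (t + n)
    return 2 * precision * recall / (precision + recall)

macro_f1 = (f1(tp["A"], fp["A"], fn["A"]) + f1(tp["B"], fp["B"], fn["B"])) / 2
micro_f1 = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))

print(macro_f1, micro_f1)   # macro F1 ~ 0.51, micro F1 ~ 0.91
```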

Table 6 The performance of the multi-label classification approaches in terms of the label-based evaluation measures
Table 7 The performance of the multi-label classification approaches in terms of the label-based evaluation measures
Fig. 7 The critical diagrams for the label-based evaluation measures: the results from the Nemenyi post-hoc test at the 0.05 significance level on all the datasets

B.3 Results on the ranking-based evaluation measures

Fig. 8 The critical diagrams for the ranking-based evaluation measures: the results from the Nemenyi post-hoc test at the 0.05 significance level on all the datasets

Table 8 The performance of the multi-label classification approaches in terms of the ranking-based evaluation measures



About this article


Cite this article

Madjarov, G., Gjorgjevikj, D., Dimitrovski, I. et al. The use of data-derived label hierarchies in multi-label classification. J Intell Inf Syst 47, 57–90 (2016). https://doi.org/10.1007/s10844-016-0405-8
