Abstract
Supervised topic modeling algorithms have been successfully applied to multi-label document classification tasks. Representative models include labeled latent Dirichlet allocation (L-LDA) and dependency-LDA. However, these models neglect the class frequency information of words (i.e., the number of classes in which a word occurs in the training data), which is important for classification. To address this, we propose a word-weighting method, the class frequency weight (CF-weight), which weights words using class frequency knowledge. The CF-weight is based on the intuition that a word with a higher (lower) class frequency is less (more) discriminative. In this study, the CF-weight is used to improve L-LDA and dependency-LDA. Experiments on real-world multi-label datasets demonstrate that the CF-weight based algorithms are competitive with existing supervised topic models.
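The abstract defines class frequency (the number of distinct classes in whose training documents a word occurs) and the intuition that a high class frequency implies low discriminative power, but it does not give the exact weighting formula. The minimal sketch below computes class frequencies from a labeled corpus and applies a hypothetical inverse-log weight; the names `class_frequencies` and `cf_weight` and the logarithmic form are illustrative assumptions, not the paper's actual CF-weight.

```python
from collections import defaultdict
import math

def class_frequencies(docs, labels):
    """For each word, count the number of distinct classes in whose
    training documents it occurs (the 'class frequency' of the abstract).

    docs   : list of token lists
    labels : list of label sets aligned with docs (multi-label)
    """
    classes_of_word = defaultdict(set)
    for tokens, label_set in zip(docs, labels):
        for word in set(tokens):
            classes_of_word[word].update(label_set)
    return {word: len(cls) for word, cls in classes_of_word.items()}

def cf_weight(cf, num_classes):
    """Hypothetical CF-weight, monotonically decreasing in class frequency,
    so words spread over many classes receive small weights. The inverse-log
    form is an assumption; the paper's exact formula is not in the abstract."""
    return math.log(1.0 + num_classes / cf)

# Toy usage: 'the' occurs under every class and gets the smallest weight.
docs = [["the", "goal", "keeper"], ["the", "stock", "market"], ["the", "goal", "line"]]
labels = [{"sport"}, {"finance"}, {"sport"}]
cf = class_frequencies(docs, labels)
weights = {w: cf_weight(f, num_classes=2) for w, f in cf.items()}
print(weights)
```

In this toy run, a word appearing under both classes (e.g., "the") receives a lower weight than a word confined to a single class (e.g., "goal"), matching the stated intuition that higher class frequency means lower discriminative power.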
References
Blei DM, McAuliffe JD, 2007. Supervised topic models. 20th Int Conf on Neural Information Processing Systems, p.121–128.
Blei DM, Ng AY, Jordan MI, 2003. Latent Dirichlet allocation. J Mach Learn Res, 3:993–1022.
Chang CC, Lin CJ, 2016. LIBSVM—a Library for Support Vector Machines. https://www.csie.ntu.edu.tw/~cjlin/libsvm/ [Accessed on May 22, 2018].
Debole F, Sebastiani F, 2004. Supervised term weighting for automated text categorization. In: Sirmakessis S (Ed.), Text Mining and Its Applications. Springer, Berlin, p.81–97. https://doi.org/10.1007/978-3-540-45219-5_7
Ghahramani Z, 2001. An introduction to hidden Markov models and Bayesian networks. Int J Patt Recogn Artif Intell, 15(1):9–42. https://doi.org/10.1142/S0218001401000836
Griffiths TL, Steyvers M, 2004. Finding scientific topics. Proc Nat Acad Sci USA, 101(Suppl 1):5228–5235. https://doi.org/10.1073/pnas.0307752101
Guan H, Zhou JY, Guo MY, 2009. A class-feature-centroid classifier for text categorization. 18th Int Conf on World Wide Web, p.201–210. https://doi.org/10.1145/1526709.1526737
Kim D, Kim S, Oh A, 2012. Dirichlet process with mixed random measures: a nonparametric topic model for labeled data. 29th Int Conf on Machine Learning, p.675–682.
Lacoste-Julien S, Sha F, Jordan MI, 2008. DiscLDA: discriminative learning for dimensionality reduction and classification. 21st Int Conf on Neural Information Processing Systems, p.897–904.
Lee S, Kim J, Myaeng SH, 2015. An extension of topic models for text classification: a term weighting approach. Int Conf on Big Data and Smart Computing, p.217–224. https://doi.org/10.1109/35021BIGCOMP.2015.7072834
Li XM, Ouyang JH, Zhou XT, 2015a. Centroid prior topic model for multi-label classification. Patt Recogn Lett, 62:8–13. https://doi.org/10.1016/j.patrec.2015.04.012
Li XM, Ouyang JH, Zhou XT, 2015b. Supervised topic models for multi-label classification. Neurocomputing, 149:811–819. https://doi.org/10.1016/j.neucom.2014.07.053
Machine Learning & Knowledge Discovery Group, 2011. Learning from Multi-label Data. http://mlkd.csd.auth.gr/multilabel.html [Accessed on May 12, 2018].
Madsen RE, Kauchak D, Elkan C, 2005. Modeling word burstiness using the Dirichlet distribution. 22nd Int Conf on Machine Learning, p.545–552. https://doi.org/10.1145/1102351.1102420
Petterson J, Smola A, Caetano T, et al., 2010. Word features for latent Dirichlet allocation. 23rd Int Conf on Neural Information Processing Systems, p.1921–1929.
Ramage D, Hall D, Nallapati R, et al., 2009. Labeled LDA: a supervised topic model for credit attribution in multilabeled corpora. Conf on Empirical Methods in Natural Language Processing, p.248–256. https://doi.org/10.3115/1699510.1699543
Ramage D, Manning CD, Dumais S, 2011. Partially labeled topic models for interpretable text mining. 17th ACM SIGKDD Int Conf on Knowledge Discovery and Data Mining, p.457–465. https://doi.org/10.1145/2020408.2020481
Reisinger J, Waters A, Silverthorn B, et al., 2010. Spherical topic models. 27th Int Conf on Machine Learning, p.1–8.
Rubin TN, Chambers A, Smyth P, et al., 2012. Statistical topic models for multi-label document classification. Mach Learn, 88(1–2):157–208. https://doi.org/10.1007/s10994-011-5272-5
Salton G, Buckley C, 1988. Term-weighting approaches in automatic text retrieval. Inform Process Manag, 24(5):513–523. https://doi.org/10.1016/0306-4573(88)90021-0
Shang LF, Chan KP, Pan GD, 2011. DTTM: a discriminative temporal topic model for facial expression recognition. 7th Int Conf on Advances in Visual Computing, p.596–606. https://doi.org/10.1007/978-3-642-24028-7_55
Tsoumakas G, Spyromitros-Xioufis E, Vilcek J, et al., 2011a. Mulan: a Java library for multi-label learning. J Mach Learn Res, 12(7):2411–2414.
Tsoumakas G, Katakis I, Vlahavas I, 2011b. Random k-labelsets for multilabel classification. IEEE Trans Knowl Data Eng, 23(7):1079–1089. https://doi.org/10.1109/TKDE.2010.164
Wilson AT, Chew PA, 2010. Term weighting schemes for latent Dirichlet allocation. Human Language Technologies: Annual Conf of the North American Chapter of the Association for Computational Linguistics, p.465–473.
Zhu J, Ahmed A, Xing EP, 2012. MedLDA: maximum margin supervised topic models. 26th Annual Int Conf on Machine Learning, p.1257–1264. https://doi.org/10.1145/1553374.1553535
Additional information
Project supported by the National Natural Science Foundation of China (No. 61602204)
About this article
Cite this article
Zou, Yp., Ouyang, Jh. & Li, Xm. Supervised topic models with weighted words: multi-label document classification. Frontiers Inf Technol Electronic Eng 19, 513–523 (2018). https://doi.org/10.1631/FITEE.1601668
Key words
- Supervised topic model
- Multi-label classification
- Class frequency
- Labeled latent Dirichlet allocation (L-LDA)
- Dependency-LDA