Abstract
Supervised topic modeling algorithms have been successfully applied to multi-label document classification tasks. Representative models include labeled latent Dirichlet allocation (L-LDA) and dependency-LDA. However, these models neglect the class frequency information of words (i.e., the number of classes in which a word occurs in the training data), which is important for classification. To address this, we propose a word-weighting method, the class frequency weight (CF-weight), which weights words using class frequency knowledge. The CF-weight is based on the intuition that a word with a higher (lower) class frequency is less (more) discriminative. In this study, the CF-weight is used to improve L-LDA and dependency-LDA. Experiments on real-world multi-label datasets demonstrate that the CF-weight based algorithms are competitive with existing supervised topic models.
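The abstract defines class frequency (the number of distinct classes in whose training documents a word occurs) and the intuition that a high class frequency implies low discriminative power, but it does not give the exact weighting formula. The minimal sketch below computes class frequencies from a labeled corpus and applies a hypothetical inverse-log weight; the names `class_frequencies` and `cf_weight` and the logarithmic form are illustrative assumptions, not the paper's actual CF-weight.

```python
from collections import defaultdict
import math

def class_frequencies(docs, labels):
    """For each word, count the number of distinct classes in whose
    training documents it occurs (the 'class frequency' of the abstract).

    docs   : list of token lists
    labels : list of label sets aligned with docs (multi-label)
    """
    classes_of_word = defaultdict(set)
    for tokens, label_set in zip(docs, labels):
        for word in set(tokens):
            classes_of_word[word].update(label_set)
    return {word: len(cls) for word, cls in classes_of_word.items()}

def cf_weight(cf, num_classes):
    """Hypothetical CF-weight, monotonically decreasing in class frequency,
    so words spread over many classes receive small weights. The inverse-log
    form is an assumption; the paper's exact formula is not in the abstract."""
    return math.log(1.0 + num_classes / cf)

# Toy usage: 'the' occurs under every class and gets the smallest weight.
docs = [["the", "goal", "keeper"], ["the", "stock", "market"], ["the", "goal", "line"]]
labels = [{"sport"}, {"finance"}, {"sport"}]
cf = class_frequencies(docs, labels)
weights = {w: cf_weight(f, num_classes=2) for w, f in cf.items()}
print(weights)
```

In this toy run, a word appearing under both classes (e.g., "the") receives a lower weight than a word confined to a single class (e.g., "goal"), matching the stated intuition that higher class frequency means lower discriminative power.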
References
Blei DM, McAuliffe JD, 2007. Supervised topic models. 20th Int Conf on Neural Information Processing Systems, p.121–128.
Blei DM, Ng AY, Jordan MI, 2003. Latent Dirichlet allocation. J Mach Learn Res, 3:993–1022.
Chang CC, Lin CJ, 2016. LIBSVM—a Library for Support Vector Machines. https://www.csie.ntu.edu.tw/~cjlin/libsvm/ [Accessed on May 22, 2018].
Debole F, Sebastiani F, 2004. Supervised term weighting for automated text categorization. In: Sirmakessis S (Ed.), Text Mining and Its Applications. Springer, Berlin, p.81–97. https://doi.org/10.1007/978-3-540-45219-5_7
Ghahramani Z, 2001. An introduction to hidden Markov models and Bayesian networks. Int J Patt Recogn Artif Intell, 15(1):9–42. https://doi.org/10.1142/S0218001401000836
Griffiths TL, Steyvers M, 2004. Finding scientific topics. Proc Nat Acad Sci USA, 101(Suppl 1):5228–5235. https://doi.org/10.1073/pnas.0307752101
Guan H, Zhou JY, Guo MY, 2009. A class-feature-centroid classifier for text categorization. 18th Int Conf on World Wide Web, p.201–210. https://doi.org/10.1145/1526709.1526737
Kim D, Kim S, Oh A, 2012. Dirichlet process with mixed random measures: a nonparametric topic model for labeled data. 29th Int Conf on Machine Learning, p.675–682.
Lacoste-Julien S, Sha F, Jordan MI, 2008. DiscLDA: discriminative learning for dimensionality reduction and classification. 21st Int Conf on Neural Information Processing Systems, p.897–904.
Lee S, Kim J, Myaeng SH, 2015. An extension of topic models for text classification: a term weighting approach. Int Conf on Big Data and Smart Computing, p.217–224. https://doi.org/10.1109/35021BIGCOMP.2015.7072834
Li XM, Ouyang JH, Zhou XT, 2015a. Centroid prior topic model for multi-label classification. Patt Recogn Lett, 62:8–13. https://doi.org/10.1016/j.patrec.2015.04.012
Li XM, Ouyang JH, Zhou XT, 2015b. Supervised topic models for multi-label classification. Neurocomputing, 149:811–819. https://doi.org/10.1016/j.neucom.2014.07.053
Machine Learning & Knowledge Discovery Group, 2011. Learning from Multi-label Data. http://mlkd.csd.auth.gr/multilabel.html [Accessed on May 12, 2018].
Madsen RE, Kauchak D, Elkan C, 2005. Modeling word burstiness using the Dirichlet distribution. 22nd Int Conf on Machine Learning, p.545–552. https://doi.org/10.1145/1102351.1102420
Petterson J, Smola A, Caetano T, et al., 2010. Word features for latent Dirichlet allocation. 23rd Int Conf on Neural Information Processing Systems, p.1921–1929.
Ramage D, Hall D, Nallapati R, et al., 2009. Labeled LDA: a supervised topic model for credit attribution in multilabeled corpora. Conf on Empirical Methods in Natural Language Processing, p.248–256. https://doi.org/10.3115/1699510.1699543
Ramage D, Manning CD, Dumais S, 2011. Partially labeled topic models for interpretable text mining. 17th ACM SIGKDD Int Conf on Knowledge Discovery and Data Mining, p.457–465. https://doi.org/10.1145/2020408.2020481
Reisinger J, Waters A, Silverthorn B, et al., 2010. Spherical topic models. 27th Int Conf on Machine Learning, p.1–8.
Rubin TN, Chambers A, Smyth P, et al., 2012. Statistical topic models for multi-label document classification. Mach Learn, 88(1–2):157–208. https://doi.org/10.1007/s10994-011-5272-5
Salton G, Buckley C, 1988. Term-weighting approaches in automatic text retrieval. Inform Process Manag, 24(5):513–523. https://doi.org/10.1016/0306-4573(88)90021-0
Shang LF, Chan KP, Pan GD, 2011. DTTM: a discriminative temporal topic model for facial expression recognition. 7th Int Conf on Advances in Visual Computing, p.596–606. https://doi.org/10.1007/978-3-642-24028-7_55
Tsoumakas G, Spyromitros-Xioufis E, Vilcek J, et al., 2011a. Mulan: a Java library for multi-label learning. J Mach Learn Res, 12(7):2411–2414.
Tsoumakas G, Katakis I, Vlahavas I, 2011b. Random k-labelsets for multilabel classification. IEEE Trans Knowl Data Eng, 23(7):1079–1089. https://doi.org/10.1109/TKDE.2010.164
Wilson AT, Chew PA, 2010. Term weighting schemes for latent Dirichlet allocation. Human Language Technologies: Annual Conf of the North American Chapter of the Association for Computational Linguistics, p.465–473.
Zhu J, Ahmed A, Xing EP, 2012. MedLDA: maximum margin supervised topic models. 26th Annual Int Conf on Machine Learning, p.1257–1264. https://doi.org/10.1145/1553374.1553535
Additional information
Project supported by the National Natural Science Foundation of China (No. 61602204)
About this article
Cite this article
Zou, Yp., Ouyang, Jh. & Li, Xm. Supervised topic models with weighted words: multi-label document classification. Frontiers Inf Technol Electronic Eng 19, 513–523 (2018). https://doi.org/10.1631/FITEE.1601668
Key words
- Supervised topic model
- Multi-label classification
- Class frequency
- Labeled latent Dirichlet allocation (L-LDA)
- Dependency-LDA