Abstract
We propose a class-based mixture of topic models for classifying documents using both labeled and unlabeled examples (i.e., in a semi-supervised fashion). Most topic models incorporate documents' class labels by generating them after generating the words. In these models, the training class labels have only a small effect on the estimated topics, as each label is effectively treated as just another word among a huge set of word features. In this paper, we propose to increase the influence of class labels on topic models by generating the words in each document conditioned on the class label. We show that our generative process improves classification performance at only a small cost in test-set log-likelihood. Within our framework, we provide a principled mechanism for controlling the relative contributions of the class labels and the word space to the likelihood function. Experiments show that our approach achieves better classification accuracy than standard semi-supervised and supervised topic models.
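To make the class-conditional idea concrete, below is a minimal, hypothetical sketch: a semi-supervised EM loop for a class-conditional unigram mixture, in which unlabeled documents contribute through their expected class assignments and a `word_weight` knob plays the role of the mechanism balancing the label and word-space contributions to the likelihood. This is a simplification for illustration, not the paper's topic model; all names (`semi_supervised_em`, `word_weight`, `alpha`) are our own assumptions.

```python
# Hypothetical sketch of the core idea (not the paper's exact model):
# words are generated conditioned on the class label, and unlabeled
# documents enter the likelihood via EM over their unknown labels.
import numpy as np

def semi_supervised_em(X_lab, y_lab, X_unl, n_classes,
                       word_weight=1.0, n_iter=50, alpha=1e-2):
    """X_lab, X_unl: (docs, vocab) word-count arrays; y_lab: int labels."""
    V = X_lab.shape[1]
    # Initialize class priors and class-conditional word distributions
    # from the labeled documents alone (Dirichlet smoothing via alpha).
    pi = np.bincount(y_lab, minlength=n_classes) + alpha
    pi = pi / pi.sum()
    beta = np.full((n_classes, V), alpha)
    for c in range(n_classes):
        beta[c] += X_lab[y_lab == c].sum(axis=0)
    beta /= beta.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # E-step: class posteriors for unlabeled docs. `word_weight`
        # scales the word log-likelihood relative to the class prior,
        # controlling how much the word space drives the posterior.
        log_post = np.log(pi) + word_weight * (X_unl @ np.log(beta).T)
        log_post -= log_post.max(axis=1, keepdims=True)
        q = np.exp(log_post)
        q /= q.sum(axis=1, keepdims=True)

        # M-step: re-estimate parameters from the labeled counts plus
        # the expected counts contributed by the unlabeled documents.
        pi = np.bincount(y_lab, minlength=n_classes) + q.sum(axis=0) + alpha
        pi /= pi.sum()
        for c in range(n_classes):
            beta[c] = alpha + X_lab[y_lab == c].sum(axis=0) + q[:, c] @ X_unl
        beta /= beta.sum(axis=1, keepdims=True)
    return pi, beta
```

In this sketch, shrinking `word_weight` below 1 lets the label-driven class prior dominate the unlabeled-document posteriors, while values above 1 let the high-dimensional word space dominate, mirroring the trade-off the abstract describes.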

Notes
Again, with a minor abuse of notation, we sometimes write \(v_{d}=y_d\) to indicate \(v_{dc}=1\) for \(c=y_d\) and \(v_{dc}=0\) otherwise (see the short example after these notes).
Note that having such a separate labeled validation set is not entirely realistic in the semi-supervised setting, where labels may be scarce. Thus, in some sense, we are comparing against the "upper bound" performance achievable by ssLDA.
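As a concrete instance of the one-hot convention in Note 1 (a hypothetical example with four classes):

```python
import numpy as np

# Note 1's convention: v_d = y_d abbreviates the one-hot vector with
# v_{dc} = 1 at c = y_d and 0 elsewhere. Hypothetical: C = 4, y_d = 2.
y_d = 2
v_d = np.eye(4, dtype=int)[y_d]
print(v_d)  # [0 0 1 0]
```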
Cite this article
Soleimani, H., Miller, D.J. Exploiting the value of class labels on high-dimensional feature spaces: topic models for semi-supervised document classification. Pattern Anal Applic 22, 299–309 (2019). https://doi.org/10.1007/s10044-017-0629-4