
Exploiting the value of class labels on high-dimensional feature spaces: topic models for semi-supervised document classification

  • Theoretical Advances
  • Published in Pattern Analysis and Applications

Abstract

We propose a class-based mixture of topic models for classifying documents using both labeled and unlabeled examples (i.e., in a semi-supervised fashion). Most topic models incorporate documents’ class labels by generating them after generating the words. In these models, the training class labels have little effect on the estimated topics, because each label is effectively treated as just another word among a huge set of word features. In this paper, we increase the influence of class labels on topic models by generating the words in each document conditioned on the class label. We show that this generative process improves classification performance at only a small cost in test set log-likelihood. Within our framework, we provide a principled mechanism to control the relative contributions of the class labels and the word space to the likelihood function. Experiments show that our approach achieves better classification accuracy than standard semi-supervised and supervised topic models.
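To make the modeling contrast concrete, the following is a minimal sketch in assumed notation (document-topic proportions \(\theta_d\), topic-word distributions \(\beta\), and a trade-off weight \(\lambda\)); these symbols and equations are illustrative, not the paper's own. In the common "downstream" formulation the label is generated after the words, so it enters the likelihood as a single term against \(N_d\) word terms:

\[ p(\mathbf{w}_d, y_d) = p(y_d \mid \bar{z}_d) \prod_{n=1}^{N_d} \sum_{k} \theta_{dk}\, \beta_{k w_{dn}} . \]

Conditioning word generation on the class instead ties every word to the label through class-specific topic-word distributions:

\[ p(\mathbf{w}_d, y_d) = p(y_d) \prod_{n=1}^{N_d} \sum_{k} \theta_{dk}\, \beta^{(y_d)}_{k w_{dn}} . \]

A weighted semi-supervised objective of the kind alluded to above, over a labeled set \(L\) and an unlabeled set \(U\), might then read

\[ \mathcal{L} = \sum_{d \in L} \Bigl[ \lambda \log p(y_d \mid \mathbf{w}_d) + \log p(\mathbf{w}_d) \Bigr] + \sum_{d \in U} \log p(\mathbf{w}_d) , \]

where \(\lambda\) controls how strongly the class labels, relative to the word space, drive the fit.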




Notes

  1. Again, with minor abuse of notation, we sometimes write \(v_{d}=y_d\) to indicate \(v_{dc}=1\) for \(c=y_d\) and \(v_{dc}=0\) otherwise.

  2. Note that having such a separate labeled validation set is not very realistic in the semi-supervised setting, where labels may be scarce. Thus, in some sense, we are comparing against the “upper bound” performance achievable by ssLDA.

  3. http://people.csail.mit.edu/jrennie/20Newsgroups/.

  4. http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html.

  5. http://webscope.sandbox.yahoo.com/catalog.php?datatype=l&did=11.

  6. http://wiki.dbpedia.org/Downloads.


Author information


Corresponding author

Correspondence to Hossein Soleimani.

Appendix

Tables 2 and 3 show test set correct classification rate and test set log-likelihood for MCCTM and the baselines on all datasets.

Table 2 Comparison of test set correct classification rate (test CCR)
Table 3 Comparison of test set log-likelihood on all datasets
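As a concrete reference for the two metrics reported in Tables 2 and 3, here is a minimal sketch (Python, with stand-in arrays; the variable names, shapes, and values are assumptions for illustration, not the paper's code) of how test CCR and test set log-likelihood can be computed from per-class document log-likelihoods:

import numpy as np
from scipy.special import logsumexp

# Stand-in inputs: log p(w_d | c) for each test document d and class c,
# plus log class priors. Shapes and values are illustrative only.
rng = np.random.default_rng(0)
log_lik = rng.normal(size=(100, 4))    # (n_docs, n_classes)
log_prior = np.log(np.full(4, 0.25))   # uniform class prior
y_true = rng.integers(0, 4, size=100)  # stand-in true labels

# Test CCR: fraction of documents whose MAP class matches the true label.
log_joint = log_lik + log_prior        # log p(w_d, c)
y_pred = log_joint.argmax(axis=1)
ccr = (y_pred == y_true).mean()

# Test set log-likelihood: marginalize the class out of the joint,
# log p(w_d) = logsumexp_c [log p(w_d | c) + log p(c)], summed over documents.
test_ll = logsumexp(log_joint, axis=1).sum()

print(f"test CCR = {ccr:.3f}, test log-likelihood = {test_ll:.1f}")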


About this article


Cite this article

Soleimani, H., Miller, D.J. Exploiting the value of class labels on high-dimensional feature spaces: topic models for semi-supervised document classification. Pattern Anal Applic 22, 299–309 (2019). https://doi.org/10.1007/s10044-017-0629-4

