Abstract
We propose a class-based mixture of topic models for classifying documents using both labeled and unlabeled examples (i.e., in a semi-supervised fashion). Most topic models incorporate documents' class labels by generating them after generating the words. In these models, the training class labels have only a small effect on the estimated topics, as each label is effectively treated as just another word among a huge set of word features. In this paper, we propose to increase the influence of class labels on topic models by generating the words in each document conditioned on the class label. We show that our generative process improves classification performance at only a small cost in test-set log-likelihood. Within our framework, we provide a principled mechanism for controlling the relative contributions of the class labels and the word space to the likelihood function. Experiments show that our approach achieves better classification accuracy than standard semi-supervised and supervised topic models.
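To make the class-conditional idea concrete, below is a minimal, hypothetical sketch: a semi-supervised EM loop for a class-conditional unigram mixture, in which unlabeled documents contribute through their expected class assignments and a `word_weight` knob plays the role of the mechanism balancing the label and word-space contributions to the likelihood. This is a simplification for illustration, not the paper's topic model; all names (`semi_supervised_em`, `word_weight`, `alpha`) are our own assumptions.

```python
# Hypothetical sketch of the core idea (not the paper's exact model):
# words are generated conditioned on the class label, and unlabeled
# documents enter the likelihood via EM over their unknown labels.
import numpy as np

def semi_supervised_em(X_lab, y_lab, X_unl, n_classes,
                       word_weight=1.0, n_iter=50, alpha=1e-2):
    """X_lab, X_unl: (docs, vocab) word-count arrays; y_lab: int labels."""
    V = X_lab.shape[1]
    # Initialize class priors and class-conditional word distributions
    # from the labeled documents alone (Dirichlet smoothing via alpha).
    pi = np.bincount(y_lab, minlength=n_classes) + alpha
    pi = pi / pi.sum()
    beta = np.full((n_classes, V), alpha)
    for c in range(n_classes):
        beta[c] += X_lab[y_lab == c].sum(axis=0)
    beta /= beta.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # E-step: class posteriors for unlabeled docs. `word_weight`
        # scales the word log-likelihood relative to the class prior,
        # controlling how much the word space drives the posterior.
        log_post = np.log(pi) + word_weight * (X_unl @ np.log(beta).T)
        log_post -= log_post.max(axis=1, keepdims=True)
        q = np.exp(log_post)
        q /= q.sum(axis=1, keepdims=True)

        # M-step: re-estimate parameters from the labeled counts plus
        # the expected counts contributed by the unlabeled documents.
        pi = np.bincount(y_lab, minlength=n_classes) + q.sum(axis=0) + alpha
        pi /= pi.sum()
        for c in range(n_classes):
            beta[c] = alpha + X_lab[y_lab == c].sum(axis=0) + q[:, c] @ X_unl
        beta /= beta.sum(axis=1, keepdims=True)
    return pi, beta
```

In this sketch, shrinking `word_weight` below 1 lets the label-driven class prior dominate the unlabeled-document posteriors, while values above 1 let the high-dimensional word space dominate, mirroring the trade-off the abstract describes.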

Notes
Again, with a minor abuse of notation, we sometimes write \(v_{d}=y_d\) to indicate \(v_{dc}=1\) for \(c=y_d\) and \(v_{dc}=0\) otherwise (see the short example after these notes).
Note that having such a separate labeled validation set is not entirely realistic in the semi-supervised setting, where labels may be scarce. Thus, in some sense, we are comparing against the "upper bound" performance achievable by ssLDA.
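As a concrete instance of the one-hot convention in Note 1 (a hypothetical example with four classes):

```python
import numpy as np

# Note 1's convention: v_d = y_d abbreviates the one-hot vector with
# v_{dc} = 1 at c = y_d and 0 elsewhere. Hypothetical: C = 4, y_d = 2.
y_d = 2
v_d = np.eye(4, dtype=int)[y_d]
print(v_d)  # [0 0 1 0]
```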
Cite this article
Soleimani, H., Miller, D.J. Exploiting the value of class labels on high-dimensional feature spaces: topic models for semi-supervised document classification. Pattern Anal Applic 22, 299–309 (2019). https://doi.org/10.1007/s10044-017-0629-4