Improving text categorization bootstrapping via unsupervised learning

Published: 14 October 2009


We propose a text-categorization bootstrapping algorithm in which categories are described by relevant seed words. Our method introduces two unsupervised techniques to improve the initial categorization step of the bootstrapping scheme: (i) latent semantic spaces, which are used to estimate the similarity among documents and words, and (ii) the Gaussian mixture algorithm, which differentiates relevant and nonrelevant category information using statistics from unlabeled examples. In particular, this second step maps the similarity scores to class posterior probabilities, and therefore reduces sensitivity to keyword-dependent variations in scores. The algorithm was evaluated on two text categorization tasks and obtained good performance using only the category names as initial seeds. Notably, the performance of the proposed method proved to be equivalent to a purely supervised approach trained on 70 to 160 labeled documents per category.
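The second step in the abstract, mapping similarity scores to class posterior probabilities, can be illustrated with a minimal sketch. It assumes a two-component one-dimensional Gaussian mixture fitted with EM to the scores of a single category, where the higher-mean component stands for "relevant". The function names, the initialisation, the variance floor, and the toy scores are all hypothetical and are not taken from the paper.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_two_gaussians(scores, iterations=100, min_var=1e-4):
    """Fit a two-component 1-D Gaussian mixture to similarity scores with EM.

    Returns (weights, means, variances), each a list of length 2.
    """
    n = len(scores)
    ordered = sorted(scores)
    half = n // 2
    # Assumed initialisation: lower half of the scores vs. upper half.
    means = [sum(ordered[:half]) / half, sum(ordered[half:]) / (n - half)]
    grand_mean = sum(scores) / n
    spread = max(sum((s - grand_mean) ** 2 for s in scores) / n, min_var)
    variances = [spread, spread]
    weights = [0.5, 0.5]

    for _ in range(iterations):
        # E-step: responsibility of each component for each score.
        resp = []
        for s in scores:
            p = [weights[k] * gaussian_pdf(s, means[k], variances[k]) for k in (0, 1)]
            total = p[0] + p[1]
            resp.append([p[0] / total, p[1] / total] if total > 0 else [0.5, 0.5])
        # M-step: re-estimate weight, mean and variance of each component.
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue  # component collapsed; keep its previous parameters
            weights[k] = nk / n
            means[k] = sum(r[k] * s for r, s in zip(resp, scores)) / nk
            var = sum(r[k] * (s - means[k]) ** 2 for r, s in zip(resp, scores)) / nk
            variances[k] = max(var, min_var)  # floor avoids degenerate spikes
    return weights, means, variances


def relevance_posterior(score, weights, means, variances):
    """Posterior probability that `score` belongs to the higher-mean component."""
    p = [weights[k] * gaussian_pdf(score, means[k], variances[k]) for k in (0, 1)]
    total = p[0] + p[1]
    if total == 0:
        return 0.5
    relevant = 0 if means[0] > means[1] else 1
    return p[relevant] / total


# Toy similarity scores for one category: many low values, a few high ones.
scores = [0.05, 0.08, 0.10, 0.12, 0.07, 0.09, 0.11, 0.06, 0.80, 0.85, 0.78, 0.82]
params = fit_two_gaussians(scores)
for s in (0.08, 0.82):
    print(s, round(relevance_posterior(s, *params), 3))
```

Because the mixture is fitted separately to each category's scores, the resulting posterior depends on where a score falls relative to that category's own score distribution. This is one plausible reading of how the mapping reduces sensitivity to keyword-dependent score variations.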





    Julien Velcin

    With the overwhelming amount of information available nowadays, the task of classifying it is becoming more and more costly. This is especially the case when dealing with texts, because the labeling process is particularly difficult for the experts. One solution is bootstrapping: boosting the learning process by using a preliminary step. The authors propose in this paper an original bootstrapping strategy based on unsupervised learning. Their strategy is twofold. First, the cosine distance between a list of word seeds and the unlabeled instances in the latent semantic indexing (LSI) space is calculated. Second, this distance is mapped into class posterior probabilities via a Gaussian mixture model (GMM). The evaluation is well written. It uses two well-known datasets, Reuters and 20 Newsgroups, and an additional original Wikipedia benchmark. The authors show that their algorithm obtains results that are comparable with a standard support vector machine (SVM) classifier, but without using any labels. Furthermore, the word seeds contain only one word for each category: the category name. As the authors stress in the conclusion, it seems natural that using complementary words should lead to even better results.

    Online Computing Reviews Service
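The first step the review describes, comparing seed words with unlabeled documents in an LSI space, can be sketched as follows. This is only an illustration under stated assumptions: the tiny term-document count matrix, the vocabulary, the rank of 2, and the function name `lsi_seed_similarity` are hypothetical, and the projection onto the top left singular vectors is one common LSI convention rather than the paper's confirmed formulation. The sketch reports cosine similarity; a cosine distance would be one minus that value.

```python
import numpy as np


def lsi_seed_similarity(term_doc, seed_vector, rank):
    """Cosine similarity between a seed pseudo-document and every document,
    measured after projecting both onto the top-`rank` left singular vectors."""
    u, _, _ = np.linalg.svd(term_doc, full_matrices=False)
    basis = u[:, :rank]            # terms x rank
    docs = basis.T @ term_doc      # rank x n_docs
    seed = basis.T @ seed_vector   # rank
    norms = np.linalg.norm(docs, axis=0) * np.linalg.norm(seed)
    return (seed @ docs) / np.maximum(norms, 1e-12)  # guard against zero norms


# Hypothetical vocabulary and term-document counts (rows: terms, columns: documents).
vocabulary = ["space", "orbit", "nasa", "hockey", "goal", "team"]
term_doc = np.array([
    [1, 0, 0, 0],  # space
    [1, 1, 0, 0],  # orbit
    [0, 1, 0, 0],  # nasa
    [0, 0, 1, 0],  # hockey
    [0, 0, 1, 1],  # goal
    [0, 0, 0, 1],  # team
], dtype=float)

# Seed for a "space" category: only the category name, as in the evaluation.
seed = np.zeros(len(vocabulary))
seed[vocabulary.index("space")] = 1.0

similarities = lsi_seed_similarity(term_doc, seed, rank=2)
print(np.round(similarities, 3))
```

In this toy matrix the second document never contains the word "space", yet it scores close to 1 because it shares "orbit" with a document that does. That is the effect a latent semantic space is meant to capture when the seed is a single word.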



    Published In

    ACM Transactions on Speech and Language Processing   Volume 6, Issue 1
    October 2009
    24 pages
    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 14 October 2009
    Accepted: 01 July 2009
    Revised: 01 June 2009
    Received: 01 July 2008


    Author Tags

    1. Text categorization
    2. bootstrapping
    3. unsupervised machine learning


    • Research-article
    • Research
    • Refereed


