
Improving text categorization bootstrapping via unsupervised learning

Published: 14 October 2009

Abstract

We propose a text-categorization bootstrapping algorithm in which categories are described by relevant seed words. Our method introduces two unsupervised techniques to improve the initial categorization step of the bootstrapping scheme: (i) using latent semantic spaces to estimate the similarity among documents and words, and (ii) the Gaussian mixture algorithm, which differentiates relevant and nonrelevant category information using statistics from unlabeled examples. In particular, this second step maps the similarity scores to class posterior probabilities, and therefore reduces sensitivity to keyword-dependent variations in scores. The algorithm was evaluated on two text categorization tasks, and obtained good performance using only the category names as initial seeds. In particular, the performance of the proposed method proved to be equivalent to a pure supervised approach trained on 70--160 labeled documents per category.
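The initial categorization step described above can be sketched as follows: represent documents and a category's seed word in a latent semantic space, then score each document by cosine similarity to the seed. This is a minimal illustrative sketch using scikit-learn's truncated SVD as the latent space; the toy corpus, seed words, and dimensionality are assumptions, not the paper's actual setup.

```python
# Sketch of seed-word bootstrapping, step one: score unlabeled documents
# against each category seed in a latent semantic (LSA) space.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the striker scored a late goal in the football match",
    "the team won the league after a penalty shootout",
    "parliament passed the new tax bill after a long debate",
    "the senator campaigned on economic policy reform",
]
seeds = {"sport": "football", "politics": "parliament"}  # category name as seed

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)                   # term-document matrix
svd = TruncatedSVD(n_components=2, random_state=0)   # latent semantic space
D = svd.fit_transform(X)                             # documents projected into it

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

for label, word in seeds.items():
    # treat the seed word as a one-word pseudo-document and project it too
    w = svd.transform(vectorizer.transform([word]))[0]
    scores = [cosine(d, w) for d in D]
    print(label, [round(s, 2) for s in scores])
```

Because the latent space captures second-order co-occurrence, a document can score well against a seed word it never contains; the raw scores, however, vary in scale from seed to seed, which is what the second (GMM) step is meant to normalize.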




    Reviews

    Julien Velcin

With the overwhelming amount of information available nowadays, the task of classifying it is becoming more and more costly. This is especially the case for text, because the labeling process is particularly difficult for experts. One solution is bootstrapping: boosting the learning process with a preliminary step. In this paper, the authors propose an original bootstrapping strategy based on unsupervised learning. Their strategy is twofold. First, the cosine distance between a list of seed words and the unlabeled instances is calculated in the latent semantic indexing (LSI) space. Second, this distance is mapped into class posterior probabilities via a Gaussian mixture model (GMM). The evaluation is well written. It uses two well-known datasets, Reuters and 20 Newsgroups, as well as an additional original Wikipedia benchmark. The authors show that their algorithm obtains results comparable with a standard support vector machine (SVM) classifier, but without using any labels. Furthermore, the seed list contains only one word for each category: the category name. As the authors stress in the conclusion, it seems natural that using complementary words should lead to even better results.

Online Computing Reviews Service
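The score-to-posterior mapping the review describes can be sketched with a two-component, one-dimensional Gaussian mixture fitted on the similarity scores of unlabeled documents: one component models the non-relevant mass, the other the relevant mass, and the posterior of the higher-mean component serves as P(relevant | score). The synthetic score distribution below is an assumption for illustration only.

```python
# Sketch of step two: map raw similarity scores to class posterior
# probabilities with a 2-component 1-D Gaussian mixture (EM under the hood).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# synthetic unlabeled scores: a large low-scoring (non-relevant) cluster
# and a smaller high-scoring (relevant) cluster
scores = np.concatenate([
    rng.normal(0.10, 0.05, 300),   # non-relevant documents
    rng.normal(0.60, 0.08, 60),    # relevant documents
]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(scores)
relevant = int(np.argmax(gmm.means_.ravel()))  # component with the higher mean

def posterior_relevant(score):
    """P(relevant | score); normalizes away seed-dependent score scales."""
    return gmm.predict_proba(np.array([[score]]))[0, relevant]

print(round(posterior_relevant(0.65), 3))  # high score -> posterior near 1
print(round(posterior_relevant(0.05), 3))  # low score  -> posterior near 0
```

Because the mixture is refitted per category on that category's own score distribution, a "high" score for one seed word and a "high" score for another are placed on the same probabilistic footing, which is the sensitivity reduction claimed in the abstract.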



    Published In

    ACM Transactions on Speech and Language Processing   Volume 6, Issue 1
    October 2009
    24 pages
    ISSN:1550-4875
    EISSN:1550-4883
    DOI:10.1145/1596515

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 14 October 2009
    Accepted: 01 July 2009
    Revised: 01 June 2009
    Received: 01 July 2008
    Published in TSLP Volume 6, Issue 1


    Author Tags

    1. Text categorization
    2. bootstrapping
    3. unsupervised machine learning

    Qualifiers

    • Research-article
    • Research
    • Refereed


    Cited By

    • (2024)TRGNN: Text-Rich Graph Neural Network for Few-Shot Document Filtering2024 International Joint Conference on Neural Networks (IJCNN)10.1109/IJCNN60899.2024.10650066(1-9)Online publication date: 30-Jun-2024
    • (2020)Enhanced Bootstrapping Algorithm for Automatic Annotation of TweetsInternational Journal of Cognitive Informatics and Natural Intelligence10.4018/IJCINI.202004010314:2(35-60)Online publication date: 1-Oct-2020
    • (2020)An Attention-based Deep Relevance Model for Few-shot Document FilteringACM Transactions on Information Systems10.1145/341997239:1(1-35)Online publication date: 6-Oct-2020
    • (2020)Seed-Guided Deep Document ClusteringAdvances in Information Retrieval10.1007/978-3-030-45439-5_1(3-16)Online publication date: 14-Apr-2020
    • (2019)Filtering and Classifying Relevant Short Text with a Few Seed WordsData and Information Management10.2478/dim-2019-00113:3(165-186)Online publication date: Sep-2019
    • (2019)The Application of Artificial Intelligence Technologies as a Substitute for Reading and to Support and Enhance the Authoring of Scientific Review ArticlesIEEE Access10.1109/ACCESS.2019.29177197(65263-65276)Online publication date: 2019
    • (2018)Seed-Guided Topic Model for Document Filtering and ClassificationACM Transactions on Information Systems10.1145/323825037:1(1-37)Online publication date: 6-Dec-2018
    • (2017)Intensional Learning to Efficiently Build up Automatically Annotated Emotion CorporaIEEE Transactions on Affective Computing10.1109/TAFFC.2017.2764470(1-1)Online publication date: 2017
    • (2016)Effective Document Labeling with Very Few Seed WordsProceedings of the 25th ACM International on Conference on Information and Knowledge Management10.1145/2983323.2983721(85-94)Online publication date: 24-Oct-2016
    • (2016)Exploiting a Bootstrapping Approach for Automatic Annotation of Emotions in Texts2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)10.1109/DSAA.2016.78(726-734)Online publication date: Oct-2016
