
Improving text categorization bootstrapping via unsupervised learning

Published: 14 October 2009

Abstract

We propose a text-categorization bootstrapping algorithm in which categories are described by relevant seed words. Our method introduces two unsupervised techniques to improve the initial categorization step of the bootstrapping scheme: (i) using latent semantic spaces to estimate the similarity among documents and words, and (ii) the Gaussian mixture algorithm, which differentiates relevant and nonrelevant category information using statistics from unlabeled examples. In particular, this second step maps the similarity scores to class posterior probabilities, and therefore reduces sensitivity to keyword-dependent variations in scores. The algorithm was evaluated on two text categorization tasks, and obtained good performance using only the category names as initial seeds. In particular, the performance of the proposed method proved to be equivalent to a pure supervised approach trained on 70--160 labeled documents per category.
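The initial categorization step described above can be sketched as follows: represent documents and a category's seed word in a latent semantic space, then score each document by cosine similarity to the seed. This is a minimal illustrative sketch using scikit-learn's truncated SVD as the latent space; the toy corpus, seed words, and dimensionality are assumptions, not the paper's actual setup.

```python
# Sketch of seed-word bootstrapping, step one: score unlabeled documents
# against each category seed in a latent semantic (LSA) space.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the striker scored a late goal in the football match",
    "the team won the league after a penalty shootout",
    "parliament passed the new tax bill after a long debate",
    "the senator campaigned on economic policy reform",
]
seeds = {"sport": "football", "politics": "parliament"}  # category name as seed

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)                   # term-document matrix
svd = TruncatedSVD(n_components=2, random_state=0)   # latent semantic space
D = svd.fit_transform(X)                             # documents projected into it

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

for label, word in seeds.items():
    # treat the seed word as a one-word pseudo-document and project it too
    w = svd.transform(vectorizer.transform([word]))[0]
    scores = [cosine(d, w) for d in D]
    print(label, [round(s, 2) for s in scores])
```

Because the latent space captures second-order co-occurrence, a document can score well against a seed word it never contains; the raw scores, however, vary in scale from seed to seed, which is what the second (GMM) step is meant to normalize.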




    Reviews

    Julien Velcin

With the overwhelming amount of information available nowadays, the task of classifying it is becoming more and more costly. This is especially the case for text, because the labeling process is particularly difficult for experts. One solution is bootstrapping: boosting the learning process with a preliminary step. In this paper, the authors propose an original bootstrapping strategy based on unsupervised learning. Their strategy is twofold. First, the cosine distance between a list of seed words and the unlabeled instances is calculated in the latent semantic indexing (LSI) space. Second, this distance is mapped into class posterior probabilities via a Gaussian mixture model (GMM). The evaluation is well written. It uses two well-known datasets, Reuters and 20 Newsgroups, as well as an additional original Wikipedia benchmark. The authors show that their algorithm obtains results comparable with a standard support vector machine (SVM) classifier, but without using any labels. Furthermore, the seed list contains only one word for each category: the category name. As the authors stress in the conclusion, it seems natural that using complementary words should lead to even better results.

Online Computing Reviews Service
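The score-to-posterior mapping the review describes can be sketched with a two-component, one-dimensional Gaussian mixture fitted on the similarity scores of unlabeled documents: one component models the non-relevant mass, the other the relevant mass, and the posterior of the higher-mean component serves as P(relevant | score). The synthetic score distribution below is an assumption for illustration only.

```python
# Sketch of step two: map raw similarity scores to class posterior
# probabilities with a 2-component 1-D Gaussian mixture (EM under the hood).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# synthetic unlabeled scores: a large low-scoring (non-relevant) cluster
# and a smaller high-scoring (relevant) cluster
scores = np.concatenate([
    rng.normal(0.10, 0.05, 300),   # non-relevant documents
    rng.normal(0.60, 0.08, 60),    # relevant documents
]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(scores)
relevant = int(np.argmax(gmm.means_.ravel()))  # component with the higher mean

def posterior_relevant(score):
    """P(relevant | score); normalizes away seed-dependent score scales."""
    return gmm.predict_proba(np.array([[score]]))[0, relevant]

print(round(posterior_relevant(0.65), 3))  # high score -> posterior near 1
print(round(posterior_relevant(0.05), 3))  # low score  -> posterior near 0
```

Because the mixture is refitted per category on that category's own score distribution, a "high" score for one seed word and a "high" score for another are placed on the same probabilistic footing, which is the sensitivity reduction claimed in the abstract.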



    Published In

    ACM Transactions on Speech and Language Processing   Volume 6, Issue 1
    October 2009
    24 pages
    ISSN:1550-4875
    EISSN:1550-4883
    DOI:10.1145/1596515

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 14 October 2009
    Accepted: 01 July 2009
    Revised: 01 June 2009
    Received: 01 July 2008
    Published in TSLP Volume 6, Issue 1


    Author Tags

    1. Text categorization
    2. bootstrapping
    3. unsupervised machine learning

    Qualifiers

    • Research-article
    • Research
    • Refereed


    Cited By

    • (2024)TRGNN: Text-Rich Graph Neural Network for Few-Shot Document Filtering2024 International Joint Conference on Neural Networks (IJCNN)10.1109/IJCNN60899.2024.10650066(1-9)Online publication date: 30-Jun-2024
    • (2020)Enhanced Bootstrapping Algorithm for Automatic Annotation of TweetsInternational Journal of Cognitive Informatics and Natural Intelligence10.4018/IJCINI.202004010314:2(35-60)Online publication date: 1-Oct-2020
    • (2020)An Attention-based Deep Relevance Model for Few-shot Document FilteringACM Transactions on Information Systems10.1145/341997239:1(1-35)Online publication date: 6-Oct-2020
    • (2020)Seed-Guided Deep Document ClusteringAdvances in Information Retrieval10.1007/978-3-030-45439-5_1(3-16)Online publication date: 14-Apr-2020
    • (2019)Filtering and Classifying Relevant Short Text with a Few Seed WordsData and Information Management10.2478/dim-2019-00113:3(165-186)Online publication date: Sep-2019
    • (2019)The Application of Artificial Intelligence Technologies as a Substitute for Reading and to Support and Enhance the Authoring of Scientific Review ArticlesIEEE Access10.1109/ACCESS.2019.29177197(65263-65276)Online publication date: 2019
    • (2018)Seed-Guided Topic Model for Document Filtering and ClassificationACM Transactions on Information Systems10.1145/323825037:1(1-37)Online publication date: 6-Dec-2018
    • (2017)Intensional Learning to Efficiently Build up Automatically Annotated Emotion CorporaIEEE Transactions on Affective Computing10.1109/TAFFC.2017.2764470(1-1)Online publication date: 2017
    • (2016)Effective Document Labeling with Very Few Seed WordsProceedings of the 25th ACM International on Conference on Information and Knowledge Management10.1145/2983323.2983721(85-94)Online publication date: 24-Oct-2016
    • (2016)Exploiting a Bootstrapping Approach for Automatic Annotation of Emotions in Texts2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)10.1109/DSAA.2016.78(726-734)Online publication date: Oct-2016
