Semi-supervised single-label text categorization using centroid-based classifiers

Published: 11 March 2007
DOI: 10.1145/1244002.1244189

Abstract

In this paper we study the effect of using unlabeled data in conjunction with a small portion of labeled data on the accuracy of a centroid-based classifier used to perform single-label text categorization. We chose centroid-based methods because they are very fast compared with other classification methods, yet still achieve accuracy close to that of state-of-the-art methods. Efficiency is particularly important for very large domains, such as regular news feeds or the web.
We propose combining Expectation-Maximization (EM) with a centroid-based method to incorporate information about the unlabeled data during the training phase. We also propose an alternative to EM, based on incrementally updating a centroid-based method with the unlabeled documents during the training phase.
We show that these approaches can greatly improve accuracy relative to a simple centroid-based method, in particular when very small amounts of labeled data are available (as little as a single document per class).
Using one synthetic and three real-world datasets, we show that, if the initial model of the data is sufficiently precise, using unlabeled data improves performance. On the other hand, using unlabeled data degrades performance if the initial model is not precise enough.
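
The two training schemes described in the abstract can be illustrated with a minimal sketch. The abstract does not give the update rules, so everything here is an assumption made for illustration: documents are taken to be dense term-weight vectors (for example tf-idf), similarity is cosine, unlabeled documents receive hard class assignments rather than probabilistic ones, and all function names (`em_centroids`, `incremental_centroids`, and the helpers) are hypothetical. It is not the authors' implementation.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def add_to(total, vec):
    """Add the unit-length version of vec to the running class sum, in place."""
    for i, x in enumerate(normalize(vec)):
        total[i] += x

def class_sums(docs, labels, n_classes):
    """Sum the normalized document vectors of each class."""
    sums = [[0.0] * len(docs[0]) for _ in range(n_classes)]
    for doc, label in zip(docs, labels):
        add_to(sums[label], doc)
    return sums

def classify(doc, centroids):
    """Return the index of the centroid with the highest cosine similarity."""
    unit = normalize(doc)
    sims = [sum(a * b for a, b in zip(unit, c)) for c in centroids]
    return max(range(len(centroids)), key=lambda k: sims[k])

def em_centroids(labeled, labels, unlabeled, n_classes, max_iter=20):
    """EM-style loop (assumed hard assignments): label, rebuild, repeat."""
    centroids = [normalize(s) for s in class_sums(labeled, labels, n_classes)]
    previous = None
    for _ in range(max_iter):
        # E-like step: assign each unlabeled document to its nearest centroid.
        guesses = [classify(doc, centroids) for doc in unlabeled]
        if guesses == previous:  # assignments stopped changing
            break
        # M-like step: recompute centroids from labeled plus guessed documents.
        sums = class_sums(list(labeled) + list(unlabeled),
                          list(labels) + guesses, n_classes)
        centroids = [normalize(s) for s in sums]
        previous = guesses
    return centroids

def incremental_centroids(labeled, labels, unlabeled, n_classes):
    """Single pass: classify each unlabeled document, then fold it into its class."""
    sums = class_sums(labeled, labels, n_classes)
    for doc in unlabeled:
        guess = classify(doc, [normalize(s) for s in sums])
        add_to(sums[guess], doc)
    return [normalize(s) for s in sums]
```

Both functions return one unit-length centroid per class, which can then be passed to `classify` for new documents. The sketch also makes the abstract's caveat concrete: every guessed label comes from the current centroids, so an imprecise initial model feeds its own mistakes back into the centroids.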


    Published In

    SAC '07: Proceedings of the 2007 ACM symposium on Applied computing
    March 2007
    1688 pages
    ISBN:1595934804
    DOI:10.1145/1244002

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. centroid-based models
    2. online learning
    3. semi-supervised learning
    4. single-label text categorization

    Qualifiers

    • Article

    Conference

    SAC07

    Cited By

    • (2023) DENT: A Tool for Tagging Stack Overflow Posts with Deep Learning Energy Patterns. Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 2157-2161. DOI: 10.1145/3611643.3613092. Online publication date: 30-Nov-2023
    • (2023) BERT-Enhanced Graph Convolutional Network for News Text Classification. 2023 IEEE Smart World Congress (SWC), pages 1-8. DOI: 10.1109/SWC57546.2023.10448791. Online publication date: 28-Aug-2023
    • (2023) Efficient Clustering-Based electrocardiographic biometric identification. Expert Systems with Applications: An International Journal, 219:C. DOI: 10.1016/j.eswa.2023.119609. Online publication date: 1-Jun-2023
    • (2022) Text Vectorization Method Based on Concept Mining Using Clustering Techniques. 2022 VI International Conference on Information Technologies in Engineering Education (Inforino), pages 1-10. DOI: 10.1109/Inforino53888.2022.9782908. Online publication date: 12-Apr-2022
    • (2021) Bert-Enhanced Text Graph Neural Network for Classification. Entropy, 23:11 (1536). DOI: 10.3390/e23111536. Online publication date: 18-Nov-2021
    • (2020) MaxMin clustering for historical analogy. SN Applied Sciences, 2:8. DOI: 10.1007/s42452-020-03202-2. Online publication date: 28-Jul-2020
    • (2020) Multilabel graph-based classification for missing labels. International Journal on Digital Libraries. DOI: 10.1007/s00799-020-00295-3. Online publication date: 12-Oct-2020
    • (2019) Assessing Centroid-Based Classification Models for Intrusion Detection System Using Composite Indicators. Procedia Computer Science, 161, pages 665-676. DOI: 10.1016/j.procs.2019.11.170. Online publication date: 2019
    • (2019) UTTAMA: An Intrusion Detection System Based on Feature Clustering and Feature Transformation. Foundations of Science, 25:4, pages 1049-1075. DOI: 10.1007/s10699-019-09589-5. Online publication date: 5-Mar-2019
    • (2017) Text Classification. Handbook of Research on Machine Learning Innovations and Trends, pages 740-761. DOI: 10.4018/978-1-5225-2229-4.ch033. Online publication date: 2017
