Article

Semi-supervised single-label text categorization using centroid-based classifiers

Authors:

Ana Cardoso-Cachopo,

Arlindo L. Oliveira

SAC '07: Proceedings of the 2007 ACM symposium on Applied computing

Pages 844 - 851

https://doi.org/10.1145/1244002.1244189

Published: 11 March 2007

Abstract

In this paper we study the effect of using unlabeled data in conjunction with a small portion of labeled data on the accuracy of a centroid-based classifier used to perform single-label text categorization. We chose centroid-based methods because they are very fast compared with other classification methods, yet still achieve accuracy close to that of state-of-the-art methods. Efficiency is particularly important for very large domains, such as regular news feeds or the web.
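As an illustration of what a centroid-based classifier does, here is a minimal sketch in Python. It assumes one common variant (each class centroid is the normalized sum of its unit-length document vectors, and a document is assigned to the class whose centroid has the highest cosine similarity); the abstract does not say which variant the authors use. The function names and the toy term-weight dictionaries are hypothetical.

```python
import math
from collections import defaultdict

def normalize(vec):
    """Scale a sparse vector (dict: term -> weight) to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else {}

def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors (their dot product)."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def train_centroids(docs, labels):
    """Assumed variant: sum the normalized documents of each class, then normalize."""
    sums = defaultdict(lambda: defaultdict(float))
    for doc, label in zip(docs, labels):
        for t, w in normalize(doc).items():
            sums[label][t] += w
    return {label: normalize(acc) for label, acc in sums.items()}

def classify(centroids, doc):
    """Return the single label whose centroid is most similar to the document."""
    d = normalize(doc)
    return max(sorted(centroids), key=lambda c: cosine(centroids[c], d))

# Toy term-weight vectors (hypothetical data, not from the paper).
docs = [{"goal": 2, "match": 1}, {"vote": 3, "party": 1}, {"goal": 1, "team": 2}]
labels = ["sport", "politics", "sport"]
centroids = train_centroids(docs, labels)
sport_guess = classify(centroids, {"team": 1, "goal": 1})
politics_guess = classify(centroids, {"vote": 1})
```

Training is a single pass over the documents and classification needs one similarity computation per class, which is the source of the speed advantage mentioned above.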

We propose the combination of Expectation-Maximization with a centroid-based method to incorporate information about the unlabeled data during the training phase. We also propose an alternative to EM, based on the incremental update of a centroid-based method with the unlabeled documents during the training phase.
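The two ideas can be sketched as follows. This is an assumption-laden illustration, not the authors' algorithm: the EM variant shown uses hard assignments (each unlabeled document is given its single most similar class in the E-step, and centroids are rebuilt from labeled plus guessed documents in the M-step), and the incremental variant adds each unlabeled document to its nearest centroid as soon as it is classified. All function names and the toy data are hypothetical.

```python
import math

def normalize(vec):
    """Scale a sparse vector (dict: term -> weight) to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else {}

def cosine(a, b):
    """Dot product of two unit-length sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def add_doc(sums, label, doc):
    """Add a normalized document to the running sum of its class."""
    acc = sums.setdefault(label, {})
    for t, w in normalize(doc).items():
        acc[t] = acc.get(t, 0.0) + w

def fit(docs, labels):
    """Per-class sums of normalized document vectors."""
    sums = {}
    for doc, label in zip(docs, labels):
        add_doc(sums, label, doc)
    return sums

def to_centroids(sums):
    return {label: normalize(acc) for label, acc in sums.items()}

def nearest(centroids, doc):
    d = normalize(doc)
    return max(sorted(centroids), key=lambda c: cosine(centroids[c], d))

def em_centroids(labeled, labels, unlabeled, max_iter=20):
    """Hard-assignment EM sketch: alternate guessing labels and rebuilding centroids."""
    centroids = to_centroids(fit(labeled, labels))
    guesses = None
    for _ in range(max_iter):
        new_guesses = [nearest(centroids, d) for d in unlabeled]  # E-step
        if new_guesses == guesses:  # assignments stable: stop
            break
        guesses = new_guesses
        centroids = to_centroids(  # M-step
            fit(labeled + unlabeled, list(labels) + guesses))
    return centroids

def incremental_centroids(labeled, labels, unlabeled):
    """Incremental sketch: fold each unlabeled document into its nearest centroid."""
    sums = fit(labeled, labels)
    for doc in unlabeled:
        add_doc(sums, nearest(to_centroids(sums), doc), doc)
    return to_centroids(sums)

# Toy data (hypothetical): one labeled document per class.
labeled = [{"goal": 1, "match": 1}, {"vote": 1, "party": 1}]
labels = ["sport", "politics"]
unlabeled = [{"goal": 1, "team": 1}, {"team": 1, "goal": 1, "league": 1},
             {"vote": 1, "election": 1}]
supervised = to_centroids(fit(labeled, labels))
em_model = em_centroids(labeled, labels, unlabeled)
inc_model = incremental_centroids(labeled, labels, unlabeled)
```

In this toy run the term "league" never occurs in the labeled documents, so the supervised centroids carry no weight for it, while both semi-supervised models pick it up for the "sport" class from the unlabeled documents. Note that the incremental variant depends on the order in which unlabeled documents arrive, and either variant can reinforce early mistakes when the initial centroids are poor.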

We show that these approaches can greatly improve accuracy relative to a simple centroid-based method, in particular when very small amounts of labeled data are available (as few as a single document per class).

Using one synthetic and three real-world datasets, we show that, if the initial model of the data is sufficiently precise, using unlabeled data improves performance. On the other hand, using unlabeled data degrades performance if the initial model is not precise enough.



Index Terms

Semi-supervised single-label text categorization using centroid-based classifiers
1. Information systems
  1. Information retrieval
    1. Retrieval models and ranking


Published In


SAC '07: Proceedings of the 2007 ACM symposium on Applied computing

March 2007

1688 pages

ISBN: 1595934804

DOI: 10.1145/1244002

Conference Chairs:
Yookun Cho, Seoul National University, Seoul, Korea
Roger L. Wainwright, University of Tulsa, Tulsa, Oklahoma
Hisham M. Haddad, Kennesaw State University, Kennesaw, Georgia
Sung Y. Shin, South Dakota State University, Brookings, South Dakota

Program Chair:
Yong Wan Koo, The University of Suwon, Gyeonggi-do, Korea

Copyright © 2007 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGAPP: ACM Special Interest Group on Applied Computing

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

Article

Conference

SAC07

Sponsor:

SIGAPP

SAC07: The 2007 ACM Symposium on Applied Computing

March 11 - 15, 2007

Seoul, Korea

Acceptance Rates

Overall Acceptance Rate 1,650 of 6,669 submissions, 25%



