Evaluation and Construction of Training Corpuses for Text Classification: A Preliminary Study

Zhou, Shuigeng; Guan, Jihong

doi:10.1007/3-540-36271-1_9

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2553)

Included in the following conference series:

International Conference on Application of Natural Language to Information Systems

Abstract

Text classification is becoming increasingly important with the rapid growth of available online information. The quality of a training corpus has been observed to affect the performance of the trained classifier. This paper proposes an approach to building high-quality training corpuses for better classification performance: it first explores the properties of training corpuses and then gives an algorithm for constructing them semi-automatically. Preliminary experimental results validate our approach: classifiers trained on corpuses constructed this way achieve good performance while the size of the training corpus is significantly reduced. Our approach can be used to build efficient and lightweight classification systems.
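The abstract does not give the details of the construction algorithm, so the following is only a minimal sketch of how the central claim (a smaller training corpus with comparable classifier performance) could be checked. Everything in it is an assumption made for illustration and not the authors' method: the nearest-centroid classifier, the hypothetical `reduce_corpus` rule that keeps the documents most similar to their class centroid, and the toy documents.

```python
from collections import Counter
import math


def vectorize(text):
    """Bag-of-words term counts for a lowercased, whitespace-tokenized text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroids(corpus):
    """Sum the term vectors of each class; corpus is a list of (text, label)."""
    cents = {}
    for text, label in corpus:
        cents.setdefault(label, Counter()).update(vectorize(text))
    return cents


def classify(text, cents):
    """Assign the label whose class centroid is most similar to the text."""
    vec = vectorize(text)
    return max(sorted(cents), key=lambda label: cosine(vec, cents[label]))


def reduce_corpus(corpus, keep_per_class):
    """Hypothetical selection rule (an assumption, not the paper's algorithm):
    keep the documents closest to their own class centroid."""
    cents = centroids(corpus)
    kept = []
    for label in sorted(cents):
        docs = [text for text, lab in corpus if lab == label]
        docs.sort(key=lambda text: cosine(vectorize(text), cents[label]),
                  reverse=True)
        kept.extend((text, label) for text in docs[:keep_per_class])
    return kept


def accuracy(train, test):
    """Fraction of test documents classified correctly after training on train."""
    cents = centroids(train)
    hits = sum(classify(text, cents) == label for text, label in test)
    return hits / len(test)


# Toy data, invented for this sketch.
train = [
    ("stock market shares rise", "finance"),
    ("bank raises interest rates", "finance"),
    ("market shares fall as bank rates rise", "finance"),
    ("team wins football match", "sport"),
    ("football team loses final match", "sport"),
    ("coach praises team after match win", "sport"),
]
test = [
    ("bank shares rise", "finance"),
    ("football match final", "sport"),
]

reduced = reduce_corpus(train, keep_per_class=2)
print("corpus sizes:", len(train), len(reduced))
print("accuracy (full, reduced):", accuracy(train, test), accuracy(reduced, test))
```

On a real collection the comparison would be run on a held-out test set with a proper evaluation measure; the toy data here only shows the shape of the experiment.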

This work was supported by Hubei Provincial Natural Science Foundation (No. 2001ABB050) and the Natural Science Foundation of China (NSFC) (No. 60173027).

Author information

Authors and Affiliations

State Key Lab of Software Engineering, Wuhan University, 430072, Wuhan, China
Shuigeng Zhou & Jihong Guan
School of Computer Science, Wuhan University, 430079, Wuhan, China
Jihong Guan

Editor information

Editors and Affiliations

Department of Computer and Systems Sciences, Royal Institute of Technology, Forum 100, 16440, Kista, Sweden
Birger Andersson, Maria Bergholtz & Paul Johannesson

About this paper

Cite this paper

Zhou, S., Guan, J. (2002). Evaluation and Construction of Training Corpuses for Text Classification: A Preliminary Study. In: Andersson, B., Bergholtz, M., Johannesson, P. (eds) Natural Language Processing and Information Systems. NLDB 2002. Lecture Notes in Computer Science, vol 2553. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36271-1_9

DOI: https://doi.org/10.1007/3-540-36271-1_9
Published: 28 February 2003
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-00307-6
Online ISBN: 978-3-540-36271-5
eBook Packages: Springer Book Archive
