Abstract
In traditional text categorization systems based on the Vector Space Model, a feature vector usually carries no context information, which limits the performance of the system. To make use of more information, it is natural to select bi-gram features in addition to unigram features. However, the longer the features are, the more important the feature selection algorithm becomes for achieving a good balance in the feature space. This paper proposes two feature extraction methods that achieve a better feature balance for document categorization. Experiments show that our extended bi-gram features improve system performance greatly.
This paper is supported by Natural Science Foundation grants No. 60272019 and No. 60321002.
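To make the unigram-plus-bi-gram representation mentioned in the abstract concrete, the following is a minimal sketch of a term-count feature vector built from both single words and adjacent word pairs. It is not the paper's extended bi-gram method or either of its two feature extraction methods, which the abstract does not specify. The whitespace tokenizer, the underscore joiner, and the function names `tokenize` and `unigram_bigram_counts` are assumptions made for illustration only.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (an assumed, deliberately simple tokenizer)."""
    return text.lower().split()


def unigram_bigram_counts(text):
    """Count unigrams and adjacent-word bi-grams in one document.

    Bi-grams are stored as "first_second" strings so that they share a
    single feature dictionary with the unigrams (an illustrative choice).
    """
    tokens = tokenize(text)
    features = Counter(tokens)  # unigram counts
    # Pair each token with its successor to form the bi-grams.
    features.update(f"{a}_{b}" for a, b in zip(tokens, tokens[1:]))
    return features


counts = unigram_bigram_counts("text categorization with bi-gram features")
```

A document of n tokens yields at most n unigram and n - 1 bi-gram features, so the feature space grows quickly once bi-grams are added. This growth is why the abstract stresses the role of feature selection.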
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zhang, X., Zhu, X. (2005). Extended Bi-gram Features in Text Categorization. In: Marques, J.S., Pérez de la Blanca, N., Pina, P. (eds) Pattern Recognition and Image Analysis. IbPRIA 2005. Lecture Notes in Computer Science, vol 3523. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11492542_47
DOI: https://doi.org/10.1007/11492542_47
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-26154-4
Online ISBN: 978-3-540-32238-2
eBook Packages: Computer Science, Computer Science (R0)