Extended Bi-gram Features in Text Categorization

Zhang, Xian; Zhu, Xiaoyan

doi:10.1007/11492542_47

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 3523)

Included in the following conference series:

Iberian Conference on Pattern Recognition and Image Analysis

Abstract

In traditional text categorization systems based on the Vector Space Model, a feature vector usually carries no context information, which limits the performance of the system. To make use of more information, it is natural to select bi-gram features in addition to unigram features. However, the longer the feature, the more important the feature selection algorithm becomes for achieving a good balance in the feature space. This paper proposes two feature extraction methods that achieve a better feature balance for document categorization. Experiments show that our extended bi-gram features greatly improve system performance.

This paper is supported by Natural Science Foundation grants No. 60272019 and No. 60321002.
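
To make the baseline idea in the abstract concrete, the following is a minimal sketch of adding plain bi-gram features to unigram features in a bag-of-words representation. It is not the paper's extended bi-gram method, which the abstract does not specify. The function names, the whitespace tokenizer, and the underscore-joined bi-gram naming are all assumptions made for illustration.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def unigram_bigram_features(text):
    """Return term counts over unigrams plus adjacent-word bi-grams.

    Hypothetical helper: bi-grams are joined with "_" purely for readability.
    """
    tokens = tokenize(text)
    features = Counter(tokens)  # unigram counts
    # Add counts for each pair of adjacent tokens.
    features.update(f"{a}_{b}" for a, b in zip(tokens, tokens[1:]))
    return features


doc = "stock market falls as market fears grow"
features = unigram_bigram_features(doc)
print(features["market"])        # unigram count
print(features["stock_market"])  # bi-gram count
```

Because every adjacent word pair becomes a candidate feature, the feature space grows quickly once bi-grams are included, which is why the abstract stresses feature selection.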

Author information

Authors and Affiliations

Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, P.R. China
Xian Zhang & Xiaoyan Zhu

Editor information

Editors and Affiliations

Instituto Superior Técnico & Instituto de Sistemas e Robótica, 1049-001, Lisboa, Portugal
Jorge S. Marques
ETSI Informática y de Telecomunicación, University of Granada, 18071, Granada, Spain
Nicolás Pérez de la Blanca
Instituto Superior Técnico, CERENA-Centro de Recursos Naturais e Ambiente, Av. Rovisco Pais, 1049-001, Lisboa, Portugal
Pedro Pina

About this paper

Cite this paper

Zhang, X., Zhu, X. (2005). Extended Bi-gram Features in Text Categorization. In: Marques, J.S., Pérez de la Blanca, N., Pina, P. (eds) Pattern Recognition and Image Analysis. IbPRIA 2005. Lecture Notes in Computer Science, vol 3523. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11492542_47

DOI: https://doi.org/10.1007/11492542_47
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-26154-4
Online ISBN: 978-3-540-32238-2
eBook Packages: Computer Science, Computer Science (R0)