A New Type of Feature – Loose N-Gram Feature in Text Categorization

Zhang, Xian; Zhu, Xiaoyan

doi:10.1007/978-3-540-72847-4_49

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 4477)

Abstract

This paper introduces a new type of feature for text categorization. Motivated by a linguistic observation, the Loose N-gram feature is defined as co-occurring words within a limited range, which distinguishes it from traditional features such as words, phrases, or n-grams. It retains useful context information while also offering considerable classification ability. The features generated by our algorithm have acceptable statistical characteristics and can therefore effectively avoid the sparseness problem. Experimental results show that the Loose N-gram feature is helpful and promising in statistical text categorization systems, especially for categorization tasks that rely more heavily on semantic information. This new type of feature could also be helpful in Information Retrieval research.
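
The abstract defines a Loose N-gram only as words co-occurring within a limited range, and the paper's actual generation algorithm is not reproduced here. The following is therefore a minimal sketch of one possible reading, not the authors' method. It assumes whitespace-tokenized input, treats a feature as an unordered group of `n` distinct words, and uses a hypothetical `window` parameter (in tokens) for the "limited range"; the function name `loose_ngrams` is likewise invented for illustration.

```python
from collections import Counter
from itertools import combinations


def loose_ngrams(tokens, n=2, window=4):
    """Count unordered groups of n distinct words found within `window` consecutive tokens.

    Illustrative assumption only: the abstract does not specify ordering,
    window size, or how repeated words are handled.
    """
    counts = Counter()
    for start in range(len(tokens)):
        span = tokens[start:start + window]
        # Anchor each group on the first token of the span, so a given
        # group of token positions is counted exactly once.
        head, rest = span[0], span[1:]
        for combo in combinations(rest, n - 1):
            words = set((head,) + combo)
            if len(words) == n:  # skip groups that repeat a word
                counts[tuple(sorted(words))] += 1
    return counts


tokens = "the cat sat on the mat".split()
features = loose_ngrams(tokens, n=2, window=3)
```

Unlike a contiguous bigram, a pair such as `("cat", "on")` is counted here even though another word separates the two tokens, which is the kind of context a loose feature is meant to retain. Pairs whose words are further apart than the window, such as `("cat", "mat")`, are not counted.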

Author information

Authors and Affiliations

Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
Xian Zhang & Xiaoyan Zhu

Editor information

Joan Martí, José Miguel Benedí, Ana Maria Mendonça, Joan Serrat

About this paper

Cite this paper

Zhang, X., Zhu, X. (2007). A New Type of Feature – Loose N-Gram Feature in Text Categorization. In: Martí, J., Benedí, J.M., Mendonça, A.M., Serrat, J. (eds) Pattern Recognition and Image Analysis. IbPRIA 2007. Lecture Notes in Computer Science, vol 4477. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-72847-4_49

DOI: https://doi.org/10.1007/978-3-540-72847-4_49
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-72846-7
Online ISBN: 978-3-540-72847-4
eBook Packages: Computer Science, Computer Science (R0)
