Abstract
This paper introduces a new type of feature for text categorization. Motivated by a linguistic observation, the Loose N-gram feature is defined as co-occurring words within a limited range, which makes it quite different from traditional features such as words, phrases, or n-grams. This kind of feature retains useful context information and also has considerable classification ability. The features generated by our algorithm have acceptable statistical characteristics and can therefore effectively avoid the sparseness problem. Experimental results show that the Loose N-gram feature is helpful and promising in statistical text categorization systems, especially for categorization tasks that rely more heavily on semantic information. This new type of feature could also be helpful in Information Retrieval research.
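The abstract defines a Loose N-gram only as "co-occurring words within limited range", and the paper's actual extraction algorithm is not shown here. As a rough illustration of that definition, the following minimal sketch counts unordered groups of n distinct words that fall within a sliding window of tokens. The function name `loose_ngrams`, the parameters `n` and `window`, and the choice to ignore word order are all assumptions made for this example, not details taken from the paper.

```python
from collections import Counter
from itertools import combinations


def loose_ngrams(tokens, n=2, window=4):
    """Count unordered groups of n distinct words co-occurring within `window` tokens.

    Hypothetical helper: the windowing and the unordered representation are
    assumptions, not the authors' published procedure.
    """
    counts = Counter()
    for start in range(len(tokens)):
        span = tokens[start:start + window]
        # Anchor each group on the first token of the span so that a given
        # co-occurrence is counted once, not once per overlapping window.
        first, rest = span[0], span[1:]
        for others in combinations(rest, n - 1):
            feature = tuple(sorted({first, *others}))
            if len(feature) == n:  # skip groups that repeat a word
                counts[feature] += 1
    return counts


tokens = "stock prices fell as oil prices rose".split()
features = loose_ngrams(tokens, n=2, window=3)
for feature, count in sorted(features.items()):
    print(feature, count)
```

Unlike a strict bigram, a pair such as ("as", "prices") is counted here whether the two words are adjacent or separated by another word, and regardless of which comes first. Widening `window` admits more distant co-occurrences at the cost of more candidate features, so some frequency-based filtering would likely be needed in practice.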
Copyright information
© 2007 Springer Berlin Heidelberg
About this paper
Cite this paper
Zhang, X., Zhu, X. (2007). A New Type of Feature – Loose N-Gram Feature in Text Categorization. In: Martí, J., Benedí, J.M., Mendonça, A.M., Serrat, J. (eds) Pattern Recognition and Image Analysis. IbPRIA 2007. Lecture Notes in Computer Science, vol 4477. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-72847-4_49
DOI: https://doi.org/10.1007/978-3-540-72847-4_49
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-72846-7
Online ISBN: 978-3-540-72847-4
eBook Packages: Computer Science, Computer Science (R0)