A New Type of Feature – Loose N-Gram Feature in Text Categorization

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 4477)

Abstract

This paper introduces a new type of feature for text categorization. Based on a linguistic observation, the Loose N-gram feature is defined as words co-occurring within a limited range, which makes it quite different from traditional features such as words, phrases, or n-grams. This kind of feature retains useful context information and also has considerable classification ability. The features generated by our algorithm have acceptable statistical characteristics and can therefore effectively avoid the sparseness problem. Experimental results show that the Loose N-gram feature is helpful and promising in statistical text categorization systems, especially for categorization tasks that rely more on semantic information. This new type of feature could also be helpful in Information Retrieval research.
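
To make the definition concrete, the following is a minimal sketch of one way to count words that co-occur within a limited range. It is not the algorithm from the paper, which the abstract does not describe. The function name `loose_ngrams`, the fixed token window, the choice to ignore word order, and the whitespace tokenization are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def loose_ngrams(tokens, n=2, window=4):
    """Count groups of n distinct words that co-occur inside a sliding window.

    Assumptions (not from the paper): tokens are pre-split words, the
    "limited range" is a fixed window of `window` tokens, and word order
    inside the window is ignored.
    """
    counts = Counter()
    for start in range(len(tokens)):
        span = tokens[start:start + window]
        first = span[0]
        # Anchor every group on the window's first token so that overlapping
        # windows do not count the same co-occurrence more than once.
        others = sorted(set(span[1:]) - {first})
        for rest in combinations(others, n - 1):
            counts[tuple(sorted((first,) + rest))] += 1
    return counts


# Hypothetical example sentence, split on whitespace.
features = loose_ngrams(
    "the stock price of the company rose sharply".split(), window=3
)
```

Here `features[("price", "the")]` is 2, because "price" falls within the window of both occurrences of "the", even though the two words are never adjacent. A strict bigram model would miss this pair, which is the kind of context a loose feature is meant to keep.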

Editor information

Joan Martí, José Miguel Benedí, Ana Maria Mendonça, Joan Serrat

Copyright information

© 2007 Springer Berlin Heidelberg

About this paper

Cite this paper

Zhang, X., Zhu, X. (2007). A New Type of Feature – Loose N-Gram Feature in Text Categorization. In: Martí, J., Benedí, J.M., Mendonça, A.M., Serrat, J. (eds) Pattern Recognition and Image Analysis. IbPRIA 2007. Lecture Notes in Computer Science, vol 4477. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-72847-4_49

  • DOI: https://doi.org/10.1007/978-3-540-72847-4_49

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-72846-7

  • Online ISBN: 978-3-540-72847-4

  • eBook Packages: Computer Science, Computer Science (R0)
