
Categorizing Web Information on Subject with Statistical Language Modeling

  • Conference paper
Web Information Systems – WISE 2004 (WISE 2004)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3306)


Abstract

With the rapid growth of information available on the Internet, finding relevant information quickly on the Web has become increasingly difficult. Text classification, one of the most useful tools for processing web information, has received growing attention in recent years. Instead of using traditional classification models, we apply n-gram language models to classify Chinese Web text by subject. We investigate several factors that have an important effect on the performance of n-gram models, including the order n, the smoothing technique, and the granularity of the textual representation unit in Chinese (character or word). The experimental results indicate that the word-based bi-gram model and the character-based tri-gram model outperform the others, achieving an F1 score of approximately 90%.
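The general idea of classifying text with n-gram language models can be illustrated with a minimal sketch: one character-level n-gram model is trained per subject, and a new text is assigned to the subject whose model gives it the highest log-probability. This is not the authors' implementation. The class name `NGramClassifier`, the helper `char_ngrams`, the toy training data, and the use of add-alpha (Laplace) smoothing are all assumptions made for illustration; the abstract does not say which smoothing techniques were compared.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n):
    """Return the character n-grams of text, padded with n-1 start markers."""
    padded = "\x02" * (n - 1) + text
    return [padded[i:i + n] for i in range(len(text))]


class NGramClassifier:
    """Hypothetical per-class character n-gram model with add-alpha smoothing."""

    def __init__(self, n=3, alpha=1.0):
        self.n = n
        self.alpha = alpha  # smoothing constant (an assumption, not from the paper)
        self.ngram_counts = defaultdict(Counter)    # label -> n-gram counts
        self.context_counts = defaultdict(Counter)  # label -> (n-1)-gram context counts
        self.doc_counts = Counter()                 # label -> number of training docs
        self.vocab = set()                          # characters seen in training

    def fit(self, docs, labels):
        for text, label in zip(docs, labels):
            self.doc_counts[label] += 1
            self.vocab.update(text)
            for gram in char_ngrams(text, self.n):
                self.ngram_counts[label][gram] += 1
                self.context_counts[label][gram[:-1]] += 1
        return self

    def log_score(self, text, label):
        """Log prior of the label plus the smoothed log-likelihood of the text."""
        v = len(self.vocab) + 1  # +1 reserves probability mass for unseen characters
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for gram in char_ngrams(text, self.n):
            num = self.ngram_counts[label][gram] + self.alpha
            den = self.context_counts[label][gram[:-1]] + self.alpha * v
            score += math.log(num / den)
        return score

    def predict(self, text):
        return max(self.doc_counts, key=lambda label: self.log_score(text, label))


# Toy data, invented for this example only.
docs = ["股票市场今日上涨", "银行利率下调", "球队赢得比赛", "足球比赛进球"]
labels = ["finance", "finance", "sports", "sports"]
clf = NGramClassifier(n=2).fit(docs, labels)
print(clf.predict("比赛进球"))
```

Working on characters avoids the need for Chinese word segmentation; a word-based variant would tokenize each document first and build n-grams over the resulting word list instead of over characters.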




Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Zhou, X., Wang, T., Zhou, H., Chen, H. (2004). Categorizing Web Information on Subject with Statistical Language Modeling. In: Zhou, X., Su, S., Papazoglou, M.P., Orlowska, M.E., Jeffery, K. (eds) Web Information Systems – WISE 2004. WISE 2004. Lecture Notes in Computer Science, vol 3306. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30480-7_41


  • DOI: https://doi.org/10.1007/978-3-540-30480-7_41

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-23894-2

  • Online ISBN: 978-3-540-30480-7

  • eBook Packages: Springer Book Archive
