Filtering Contents with Bigrams and Named Entities to Improve Text Classification

  • Conference paper
Information Retrieval Technology (AIRS 2005)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3689)

Abstract

We present a new method for classifying “noisy” documents, based on filtering their contents with bigrams and named entities. The method is applied to call-for-tender documents, but we claim it would be useful for many other Web collections that also contain non-topical content. Different variations of the method are discussed. We obtain the best results by filtering out a window around the least relevant bigrams. We find a significant increase in the micro-F1 measure on our collection of calls for tenders, as well as on the “4-Universities” collection. Another approach, which rejects sentences based on the presence of certain named entities, also shows a moderate increase. Finally, we try combining the two approaches, but have not obtained conclusive results so far.
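
The window-based filtering mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how bigram relevance is scored or how wide the window is, so both are assumptions here: the function name `filter_bigram_windows`, the precomputed set `noisy_bigrams`, and the `window` parameter are hypothetical and are not taken from the paper.

```python
from typing import List, Set, Tuple

Bigram = Tuple[str, str]


def filter_bigram_windows(
    tokens: List[str],
    noisy_bigrams: Set[Bigram],
    window: int = 2,
) -> List[str]:
    """Drop each low-relevance bigram plus `window` tokens on either side.

    `noisy_bigrams` is assumed to be computed beforehand by some relevance
    score; how that score is defined is outside this sketch.
    """
    drop = [False] * len(tokens)
    for i in range(len(tokens) - 1):
        if (tokens[i], tokens[i + 1]) in noisy_bigrams:
            # Mark the bigram itself and its surrounding window for removal.
            start = max(0, i - window)
            end = min(len(tokens), i + 2 + window)
            for j in range(start, end):
                drop[j] = True
    return [tok for tok, dropped in zip(tokens, drop) if not dropped]


if __name__ == "__main__":
    text = "supply of office furniture please submit bids before the closing date"
    kept = filter_bigram_windows(text.split(), {("submit", "bids")}, window=1)
    print(" ".join(kept))
```

The filtered token list would then be passed to whatever classifier is in use; overlapping windows are simply merged, since each token is marked at most once.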




Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Paradis, F., Nie, JY. (2005). Filtering Contents with Bigrams and Named Entities to Improve Text Classification. In: Lee, G.G., Yamada, A., Meng, H., Myaeng, S.H. (eds) Information Retrieval Technology. AIRS 2005. Lecture Notes in Computer Science, vol 3689. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11562382_11

  • DOI: https://doi.org/10.1007/11562382_11

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-29186-2

  • Online ISBN: 978-3-540-32001-2

  • eBook Packages: Computer Science, Computer Science (R0)
