Filtering Contents with Bigrams and Named Entities to Improve Text Classification

Paradis, François; Nie, Jian-Yun

doi:10.1007/11562382_11

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3689)

Included in the following conference series:

Asia Information Retrieval Symposium

Abstract

We present a new method for classifying “noisy” documents, based on filtering their contents with bigrams and named entities. The method is applied to call-for-tender documents, but we claim it would be useful for many other Web collections that also contain non-topical content. We discuss several variations of the method. The best results are obtained by filtering out a window around the least relevant bigrams: we find a significant increase in the micro-F1 measure on our collection of calls for tenders, as well as on the “4-Universities” collection. Another approach, which rejects sentences based on the presence of certain named entities, also shows a moderate increase. Finally, we try combining the two approaches, but the results so far are inconclusive.
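The window-based filtering mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how bigram relevance is scored, how wide the window is, or how "least relevant" is decided, so everything below is an assumption made for illustration: the function name `filter_low_relevance_windows`, the precomputed `bigram_scores` dictionary, the fixed `threshold`, and the symmetric `window` are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple

Bigram = Tuple[str, str]


def filter_low_relevance_windows(
    tokens: List[str],
    bigram_scores: Dict[Bigram, float],
    threshold: float,
    window: int = 2,
) -> List[str]:
    """Drop every token that lies within `window` positions of a bigram
    whose relevance score is below `threshold`.

    `bigram_scores` is assumed to be computed beforehand (for example from
    training data); the scoring function itself is outside this sketch.
    """
    drop = [False] * len(tokens)
    for i in range(len(tokens) - 1):
        score = bigram_scores.get((tokens[i], tokens[i + 1]))
        # Bigrams without a score are left untouched (an assumption here).
        if score is not None and score < threshold:
            start = max(0, i - window)
            end = min(len(tokens), i + 2 + window)
            for j in range(start, end):
                drop[j] = True
    return [tok for tok, dropped in zip(tokens, drop) if not dropped]


if __name__ == "__main__":
    text = "please submit your bid before friday for road paving services"
    scores = {("submit", "your"): 0.1, ("road", "paving"): 0.9}
    kept = filter_low_relevance_windows(text.split(), scores, threshold=0.5, window=1)
    print(kept)  # ['before', 'friday', 'for', 'road', 'paving', 'services']
```

In this toy run the procedural bigram ("submit", "your") scores below the threshold, so it and one token on either side are removed, while the topical tokens around ("road", "paving") survive and would be passed on to the classifier.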

Author information

Authors and Affiliations

Université de Montréal, Canada
François Paradis & Jian-Yun Nie

Editor information

Editors and Affiliations

Department of Computer Science and Engineering, Pohang University of Science and Technology, San 31, Hyoja-dong, Nam-gu, 790-784, Pohang, Korea
Gary Geunbae Lee
Computer and Communication Media Research, NEC Corp., Miyazaki 4-1-1, Miyamae-ku, 216-8555, Kawasaki, Japan
Akio Yamada
Human-Computer Communications Laboratory, Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Hong Kong
Helen Meng
School of Engineering, Information and Communications University, 119, Munjiro, Yuseong-gu, 305-732, Daejeon, Korea
Sung Hyon Myaeng

About this paper

Cite this paper

Paradis, F., Nie, JY. (2005). Filtering Contents with Bigrams and Named Entities to Improve Text Classification. In: Lee, G.G., Yamada, A., Meng, H., Myaeng, S.H. (eds) Information Retrieval Technology. AIRS 2005. Lecture Notes in Computer Science, vol 3689. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11562382_11

DOI: https://doi.org/10.1007/11562382_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29186-2
Online ISBN: 978-3-540-32001-2
eBook Packages: Computer Science, Computer Science (R0)
