Abstract
We present a new method for classifying “noisy” documents by filtering their content with bigrams and named entities. We apply the method to call-for-tender documents, but we expect it to be useful for many other Web collections that also contain non-topical content. We discuss several variations of the method. The best results come from filtering out a window of text around the least relevant bigrams, which yields a significant increase in the micro-F1 measure both on our collection of calls for tenders and on the “4-Universities” collection. A second approach, which rejects sentences based on the presence of certain named entities, yields a moderate increase. Finally, we tried combining the two approaches, but the results so far are inconclusive.
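The window-based filtering mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say how bigram relevance is scored, how the cut-off is chosen, or how wide the window is, so the `relevance` table, the `threshold` and `window` parameters, and the function name `filter_by_bigrams` are all hypothetical. The sketch also assumes that bigrams without a score are kept.

```python
from typing import Dict, List, Tuple

Bigram = Tuple[str, str]


def filter_by_bigrams(
    tokens: List[str],
    relevance: Dict[Bigram, float],  # hypothetical precomputed bigram scores
    threshold: float,                # hypothetical cut-off for "least relevant"
    window: int = 1,                 # hypothetical number of extra tokens on each side
) -> List[str]:
    """Remove every low-scoring bigram together with `window` tokens around it."""
    drop = [False] * len(tokens)
    for i in range(len(tokens) - 1):
        score = relevance.get((tokens[i], tokens[i + 1]))
        # Assumption: bigrams with no known score are left untouched.
        if score is not None and score < threshold:
            start = max(0, i - window)
            end = min(len(tokens), i + 2 + window)
            for j in range(start, end):
                drop[j] = True
    return [tok for tok, dropped in zip(tokens, drop) if not dropped]


# Toy example: "contact us" is treated as a non-topical bigram.
tokens = ["laptop", "computers", "required", "please",
          "contact", "us", "for", "details"]
relevance = {("contact", "us"): 0.02, ("laptop", "computers"): 0.9}

filtered = filter_by_bigrams(tokens, relevance, threshold=0.1, window=1)
print(filtered)  # ['laptop', 'computers', 'required', 'details']
```

With `window=0` only the bigram itself would be removed; a wider window discards more of the surrounding non-topical text, at the risk of also removing nearby topical terms.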
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Paradis, F., Nie, JY. (2005). Filtering Contents with Bigrams and Named Entities to Improve Text Classification. In: Lee, G.G., Yamada, A., Meng, H., Myaeng, S.H. (eds) Information Retrieval Technology. AIRS 2005. Lecture Notes in Computer Science, vol 3689. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11562382_11
DOI: https://doi.org/10.1007/11562382_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29186-2
Online ISBN: 978-3-540-32001-2
eBook Packages: Computer Science, Computer Science (R0)