Abstract
This paper describes an approach that uses Google News as a source of information to generate a learning corpus for an information filtering task. The INFILE (INformation FILtering Evaluation) track of the CLEF (Cross-Language Evaluation Forum) 2009 campaign served as the framework. Because information filtering can be seen as a document classification task, a supervised learning scheme was followed. Two learning corpora were tested: one using the text of the topics as training data for a classifier, and another in which the training data were generated from Google News pages retrieved using the topic keywords as queries. The results show that generating learning data from Google News does not improve on the results obtained using only the topic descriptions as the learning corpus.
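The two-corpus setup described in the abstract can be sketched roughly as follows. This is only an illustrative toy, not the authors' system: the abstract does not name the classifier or its features, so a simple bag-of-words cosine similarity against per-topic term profiles is assumed here. The topics, the `NEWS_PAGES` dictionary (a hypothetical in-memory stand-in for pages retrieved from a news search, with no network access), the `threshold` value, and all function names are assumptions introduced for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_profile(texts):
    """Sum term counts over all training texts of one topic."""
    profile = Counter()
    for text in texts:
        profile.update(tokenize(text))
    return profile


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def train(corpus):
    """Map each topic id to a term profile built from its training texts."""
    return {topic: build_profile(texts) for topic, texts in corpus.items()}


def filter_document(model, document, threshold=0.2):
    """Return the topics whose profile is similar enough to the document."""
    doc_vec = Counter(tokenize(document))
    return sorted(t for t, p in model.items() if cosine(p, doc_vec) >= threshold)


# Hypothetical topics: a description plus keywords (not real INFILE topics).
TOPICS = {
    "T1": {"description": "doping scandals in professional cycling",
           "keywords": ["doping", "cycling"]},
    "T2": {"description": "safety of nuclear power plants",
           "keywords": ["nuclear", "safety"]},
}

# Hypothetical stand-in for pages retrieved from a news search; no network call.
NEWS_PAGES = {
    "doping cycling": ["Cyclist suspended after a positive doping test in the tour"],
    "nuclear safety": ["Inspectors review reactor safety at the nuclear plant"],
}

# Corpus A: topic text only. Corpus B: pages retrieved with the topic keywords.
corpus_topics = {t: [d["description"]] for t, d in TOPICS.items()}
corpus_news = {t: NEWS_PAGES[" ".join(d["keywords"])] for t, d in TOPICS.items()}

model_topics = train(corpus_topics)
model_news = train(corpus_news)

article = "A cycling team withdrew from the race after a doping test"
print(filter_document(model_topics, article))
print(filter_document(model_news, article))
```

Training both models on the same topics and filtering the same incoming article makes the comparison in the abstract concrete: only the source of the training text differs between the two runs.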
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Montejo-Ráez, A., Perea-Ortega, J.M., Díaz-Galiano, M.C., Ureña-López, L.A. (2010). Experiments with Google News for Filtering Newswire Articles. In: Peters, C., et al. Multilingual Information Access Evaluation I. Text Retrieval Experiments. CLEF 2009. Lecture Notes in Computer Science, vol 6241. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15754-7_46
DOI: https://doi.org/10.1007/978-3-642-15754-7_46
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15753-0
Online ISBN: 978-3-642-15754-7
eBook Packages: Computer Science, Computer Science (R0)