
NLP-Driven Constructive Learning for Filtering an IR Document Stream

  • Conference paper
Evaluation of Multilingual and Multi-modal Information Retrieval (CLEF 2006)


Feature engineering is one of the most important challenges in knowledge acquisition, since any inductive learning system depends on an efficient representation model to find good solutions to a given problem. We present an NLP-driven constructive learning method for building features based on noun-phrase structures, which are assumed to carry the highest discriminatory information. The method was tested at the CLEF 2006 Ad-Hoc monolingual (Portuguese) IR track. A classification model was obtained using this representation scheme over a small subset of the relevance judgments and was used to filter false-positive documents returned by the IR system. The goal was to increase overall precision. The experiment achieved a MAP gain of 41.3%, on average, over three selected topics. The best F1-measure for the text classification task over the proposed text representation model was 77.1%. The results suggest that relevant linguistic features can be exploited by NLP techniques in a domain-specific application and can be used successfully in text categorization, which can act as an important supporting process for other high-level IR tasks.
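The evaluation idea in the abstract can be illustrated with a minimal sketch: a classifier rejects some documents from a ranked list, and average precision (AP, whose mean over topics gives MAP) is compared before and after filtering, alongside the F1-measure of the classifier's "relevant" predictions. This is not the authors' implementation. The document identifiers, the relevance set, the rejected set, and all function names below are hypothetical toy values chosen only for illustration, and they do not reproduce the reported figures.

```python
from typing import List, Sequence, Set


def average_precision(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Average precision of a ranked list against the set of relevant documents."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document
    return precision_sum / len(relevant)


def filter_ranking(ranking: Sequence[str], rejected: Set[str]) -> List[str]:
    """Drop documents the classifier labelled as false positives, keeping rank order."""
    return [doc_id for doc_id in ranking if doc_id not in rejected]


def f1_measure(predicted: Set[str], actual: Set[str]) -> float:
    """F1 of a predicted positive set against the actual positive set."""
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)


# Toy data (hypothetical): one topic, six retrieved documents.
ranking = ["d1", "d2", "d3", "d4", "d5", "d6"]
relevant = {"d2", "d4", "d5"}   # assumed relevance judgments
rejected = {"d1", "d3"}         # assumed classifier output: predicted false positives

ap_before = average_precision(ranking, relevant)
filtered = filter_ranking(ranking, rejected)
ap_after = average_precision(filtered, relevant)
relative_gain = (ap_after - ap_before) / ap_before

# Documents the classifier kept are its "relevant" predictions.
f1 = f1_measure(set(filtered), relevant)

print(f"AP before: {ap_before:.3f}, AP after: {ap_after:.3f}, gain: {relative_gain:.1%}")
print(f"F1 of the filter on this topic: {f1:.3f}")
```

In this sketch the filter helps because it removes only non-relevant documents ranked above relevant ones. A filter that also rejected relevant documents would lower recall and could reduce AP, which is why the classifier's F1 matters for the downstream gain.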






Editor information

Carol Peters, Paul Clough, Fredric C. Gey, Jussi Karlgren, Bernardo Magnini, Douglas W. Oard, Maarten de Rijke, Maximilian Stempfhuber


Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Arcoverde, J.M.A., Nunes, M.d.G.V. (2007). NLP-Driven Constructive Learning for Filtering an IR Document Stream. In: Peters, C., et al. Evaluation of Multilingual and Multi-modal Information Retrieval. CLEF 2006. Lecture Notes in Computer Science, vol 4730. Springer, Berlin, Heidelberg.



  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-74998-1

  • Online ISBN: 978-3-540-74999-8

  • eBook Packages: Computer Science, Computer Science (R0)
