An Approach for Adding Noise-Tolerance to Restricted-Domain Information Retrieval

  • Conference paper
Natural Language Processing and Information Systems (NLDB 2010)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6177)

Abstract

The corpora of Information Retrieval (IR) systems consist of text documents that often come from rather heterogeneous sources, such as Web sites or OCR (Optical Character Recognition) systems. Faithfully converting these sources into flat text files is not a trivial task, since noise can easily be introduced through spelling or typesetting errors. If the corpus is large enough, redundancy helps to control the effects of noise, because the same text often appears both with and without noise throughout the corpus. Conversely, noise becomes a serious problem in restricted-domain IR, where the corpus is usually small and has little or no redundancy. Noise therefore hinders the retrieval task in restricted domains, and erroneous results are likely. To overcome this situation, this paper presents an approach that uses restricted-domain resources, such as Knowledge Organization Systems (KOS), to add noise tolerance to existing IR systems. To show the suitability of the approach in a real restricted-domain case study, a set of experiments has been carried out for the agricultural domain.
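
The abstract does not describe how the KOS is applied, so the following is only a minimal sketch of one plausible reading, not the authors' method: noisy tokens are mapped to the closest term of a small domain vocabulary using Levenshtein edit distance. The function names, the sample vocabulary, and the distance threshold of 2 are all assumptions made for illustration.

```python
from typing import Iterable, Optional


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def nearest_kos_term(token: str, kos_terms: Iterable[str],
                     max_distance: int = 2) -> Optional[str]:
    """Return the vocabulary term closest to token, or None when no
    term lies within max_distance edits (hypothetical threshold)."""
    best_term, best_distance = None, max_distance + 1
    for term in kos_terms:
        distance = edit_distance(token.lower(), term.lower())
        if distance < best_distance:
            best_term, best_distance = term, distance
    return best_term


def normalize(text: str, kos_terms: Iterable[str]) -> str:
    """Replace each whitespace-separated token with its nearest
    vocabulary term, leaving unmatched tokens unchanged."""
    terms = list(kos_terms)
    return " ".join(nearest_kos_term(tok, terms) or tok
                    for tok in text.split())


# Hypothetical agricultural vocabulary, for illustration only.
KOS_TERMS = ["fertilizer", "irrigation", "maize"]
print(normalize("irigation of maize", KOS_TERMS))
```

A fixed threshold is a blunt instrument: with a limit of 2 edits, a short ordinary word such as "made" would also be mapped to "maize". A practical system would likely scale the threshold with token length or restrict matching to tokens that the index does not already know.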

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Vila, K., Díaz, J., Fernández, A., Ferrández, A. (2010). An Approach for Adding Noise-Tolerance to Restricted-Domain Information Retrieval. In: Hopfe, C.J., Rezgui, Y., Métais, E., Preece, A., Li, H. (eds) Natural Language Processing and Information Systems. NLDB 2010. Lecture Notes in Computer Science, vol 6177. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13881-2_1

  • DOI: https://doi.org/10.1007/978-3-642-13881-2_1

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-13880-5

  • Online ISBN: 978-3-642-13881-2

  • eBook Packages: Computer Science, Computer Science (R0)
