An Approach for Adding Noise-Tolerance to Restricted-Domain Information Retrieval

Vila, Katia; Díaz, Josval; Fernández, Antonio; Ferrández, Antonio

doi:10.1007/978-3-642-13881-2_1

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6177)

Included in the following conference series:

International Conference on Application of Natural Language to Information Systems

Abstract

The corpora of Information Retrieval (IR) systems consist of text documents that often come from heterogeneous sources, such as Web sites or OCR (Optical Character Recognition) systems. Faithfully converting these sources into flat text files is not a trivial task, since spelling or typesetting errors easily introduce noise. If the corpus is large enough, redundancy helps to control the effects of noise, because the same text often appears both with and without noise throughout the corpus. Conversely, noise becomes a serious problem in restricted-domain IR, where the corpus is usually small and has little or no redundancy. Noise therefore hinders the retrieval task in restricted domains, and erroneous results are likely. To overcome this situation, this paper presents an approach for using restricted-domain resources, such as Knowledge Organization Systems (KOS), to add noise tolerance to existing IR systems. To show the suitability of our approach in a real restricted-domain case study, a set of experiments has been carried out for the agricultural domain.
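As a rough illustration of the general idea of leaning on a domain vocabulary to absorb noise, the sketch below maps a noisy token to the closest term in a small term list when the edit distance is within a threshold. This is not the method of the paper, whose details the abstract does not give: the choice of Levenshtein distance, the `normalize_token` helper, the `max_distance` threshold, and the toy `KOS_TERMS` set are all assumptions made purely for illustration.

```python
from typing import Iterable


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalize_token(token: str, vocabulary: Iterable[str],
                    max_distance: int = 1) -> str:
    """Return the closest vocabulary term, or the token itself if none is near.

    Hypothetical helper: the vocabulary stands in for terms taken from a KOS.
    """
    terms = set(vocabulary)
    lowered = token.lower()
    if not terms:
        return token
    if lowered in terms:
        return lowered
    # Ties are broken alphabetically so the result is deterministic.
    best = min(terms, key=lambda term: (edit_distance(lowered, term), term))
    return best if edit_distance(lowered, best) <= max_distance else token


# Toy agricultural vocabulary, invented for this example.
KOS_TERMS = {"maize", "wheat", "irrigation", "fertilizer"}

if __name__ == "__main__":
    for noisy in ["ma1ze", "irrigat1on", "tractor"]:
        print(noisy, "->", normalize_token(noisy, KOS_TERMS))
```

In this sketch an OCR-style confusion such as `ma1ze` is pulled back to a vocabulary term, while an out-of-vocabulary word such as `tractor` is left unchanged. A real system would need a threshold that scales with term length and a far larger term list.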

Author information

Authors and Affiliations

Department of Informatics, University of Matanzas, Varadero Road, 40100, Matanzas, Cuba
Katia Vila, Josval Díaz & Antonio Fernández
Department of Software and Computing Systems, University of Alicante, San Vicente del Raspeig Road, 03690, Alicante, Spain
Antonio Ferrández

Editor information

Editors and Affiliations

School of Engineering, Cardiff University, UK
Christina J. Hopfe & Haijiang Li
Informatics Research Institute, University of Salford, M5 4WT, Greater Manchester, UK
Yacine Rezgui
Centre National des Arts et Métiers
Elisabeth Métais
School of Computer Science, Cardiff University, UK
Alun Preece

About this paper

Cite this paper

Vila, K., Díaz, J., Fernández, A., Ferrández, A. (2010). An Approach for Adding Noise-Tolerance to Restricted-Domain Information Retrieval. In: Hopfe, C.J., Rezgui, Y., Métais, E., Preece, A., Li, H. (eds) Natural Language Processing and Information Systems. NLDB 2010. Lecture Notes in Computer Science, vol 6177. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13881-2_1

DOI: https://doi.org/10.1007/978-3-642-13881-2_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-13880-5
Online ISBN: 978-3-642-13881-2
eBook Packages: Computer Science, Computer Science (R0)
