Abstract
The corpora of Information Retrieval (IR) systems consist of text documents that often come from rather heterogeneous sources, such as Web sites or OCR (Optical Character Recognition) systems. Faithfully converting these sources into flat text files is not a trivial task, since noise is easily introduced through spelling or typesetting errors. If the corpus is large enough, redundancy helps to control the effects of noise, because the same text often appears both with and without noise throughout the corpus. Conversely, noise becomes a serious problem in restricted-domain IR, where the corpus is usually small and has little or no redundancy. Noise therefore hinders the retrieval task in restricted domains, and erroneous results are likely to be obtained. To overcome this situation, this paper presents an approach for using restricted-domain resources, such as Knowledge Organization Systems (KOS), to add noise-tolerance to existing IR systems. To show the suitability of our approach in a real restricted-domain case study, a set of experiments has been carried out for the agricultural domain.
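The abstract does not spell out how a KOS is used to tolerate noise, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes that noisy tokens are mapped to the closest term of a domain vocabulary by edit distance (Levenshtein distance), with a length-relative threshold. The function names, the threshold `max_ratio`, and the vocabulary `KOS_TERMS` are all hypothetical.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalize_term(token: str, vocabulary: Sequence[str],
                   max_ratio: float = 0.25) -> str:
    """Return the closest vocabulary term if it is near enough to the
    token; otherwise return the token unchanged.

    max_ratio is an assumed tuning knob: the largest allowed edit
    distance expressed as a fraction of the token length.
    """
    token_l = token.lower()
    best_term, best_dist = None, None
    for term in vocabulary:
        dist = edit_distance(token_l, term.lower())
        if best_dist is None or dist < best_dist:
            best_term, best_dist = term, dist
    if best_term is not None and best_dist <= max_ratio * len(token_l):
        return best_term
    return token


# Hypothetical stand-in for terms drawn from an agricultural KOS.
KOS_TERMS = ["maize", "wheat", "irrigation", "fertilizer"]

if __name__ == "__main__":
    for noisy in ["irrigatlon", "whcat", "tractor"]:
        print(noisy, "->", normalize_term(noisy, KOS_TERMS))
```

In this sketch a token that is far from every vocabulary term is left untouched, so words outside the domain vocabulary are not forced onto an unrelated term. A real system would also need to decide whether normalization happens at indexing time, at query time, or both.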
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Vila, K., Díaz, J., Fernández, A., Ferrández, A. (2010). An Approach for Adding Noise-Tolerance to Restricted-Domain Information Retrieval. In: Hopfe, C.J., Rezgui, Y., Métais, E., Preece, A., Li, H. (eds) Natural Language Processing and Information Systems. NLDB 2010. Lecture Notes in Computer Science, vol 6177. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13881-2_1
DOI: https://doi.org/10.1007/978-3-642-13881-2_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-13880-5
Online ISBN: 978-3-642-13881-2
eBook Packages: Computer Science, Computer Science (R0)