Abstract
Conventional Information Retrieval Systems (IRSs), also called text indexers, deal with plain-text documents or documents with only a very elementary structure. Such systems answer queries very efficiently, but they cannot take into account the tags that mark different sections, or can do so only to a very limited extent.
In contrast, the documents in a corpus nowadays often have a rich structure. They are encoded in XML (Extensible Markup Language) [1] or in some other format that can be converted to XML more or less easily. A classical IRS built over such a corpus cannot exploit this structure, so its results do not improve.
In addition, several of these corpora are very large: they include hundreds or thousands of documents, which in turn contain millions or hundreds of millions of words. There is therefore a need for efficient and flexible IRSs that work with large structured corpora.
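As an illustration of the difference between a flat text index and one that keeps track of structure, the following is a minimal sketch and not the system described in the paper. The sample document, its tag names (`doc`, `title`, `body`) and the functions `build_index` and `search` are all assumptions introduced for the example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample document; the tag names are illustrative only.
SAMPLE = """<doc id="d1">
  <title>Galician corpus</title>
  <body>A large corpus of modern Galician text</body>
</doc>"""


def build_index(xml_text):
    """Map each lowercased word to the set of (doc id, tag) pairs where it occurs."""
    root = ET.fromstring(xml_text)
    doc_id = root.get("id")
    index = defaultdict(set)
    for elem in root.iter():
        # elem.text holds only the text directly inside the element,
        # before its first child, so whitespace-only containers add nothing.
        for word in (elem.text or "").lower().split():
            index[word].add((doc_id, elem.tag))
    return index


def search(index, word, tag=None):
    """Return sorted hits for a word, optionally restricted to one tag."""
    hits = index.get(word.lower(), set())
    if tag is None:
        return sorted(hits)  # flat query: structure is ignored
    return sorted(h for h in hits if h[1] == tag)  # structure-aware query


index = build_index(SAMPLE)
print(search(index, "corpus"))
print(search(index, "corpus", tag="title"))
```

A flat indexer would store only the word-to-document mapping; recording the enclosing tag alongside each posting is what allows a query to be restricted to one section.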
Partially supported by Ministerio de Educación y Ciencia (MEC) and FEDER (TIN2004-07246-C02-01 and TIN2004-07246-C02-02), by MEC (HF2002-81), and by Xunta de Galicia (PGIDIT02PXIB30501PR, PGIDIT02SIN01E and PGIDIT03SIN30501PR).
References
XML (2/5/2005), http://www.w3c.org
Martínez, M.S.L.: CORGA (Corpus de Referencia del Gallego Actual). In: Proc. of Hizkuntza-corpusak. Oraina eta feroa, Borovets, Bulgaria, September 2003, pp. 500–504 (2003)
Davies, M.: Un corpus anotado de 100.000.000 palabras del español histórico y moderno. In: Proceedings of Sociedad Española para el Procesamiento del Lenguaje Natural, Valladolid, Spain, pp. 21–27 (2002)
Davies, M.: Relational n-gram databases as a basis for unlimited annotation on large corpora. In: Proceedings from the Workshop on Shallow Processing of Large Corpora, Lancaster, England, March 2003, pp. 23–33 (2003)
Chaudhri, A.B., Rashid, A., Zicari, R.: XML Data Management, Native XML and XML-Enabled Database Systems. Addison-Wesley, Reading (2003)
Oracle (2/5/2005), http://www.oracle.com
Tamino (2/5/2005), http://www.softwareag.com
Vilares, J., Alonso, M.A., Vilares, M.: Morphological and syntactic processing for Text Retrieval. In: Galindo, F., Takizawa, M., Traunmüller, R. (eds.) DEXA 2004. LNCS, vol. 3180, pp. 371–380. Springer, Heidelberg (2004)
Alonso, M.A., Vilares, J., Darriba, V.M.: On the Usefulness of Extracting Syntactic Dependencies for Text Indexing. In: O’Neill, M., Sutcliffe, R.F.E., Ryan, C., Eaton, M., Griffith, N.J.L. (eds.) AICS 2002. LNCS (LNAI), vol. 2464, pp. 3–11. Springer, Heidelberg (2002)
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Barcala, F.M., Molinero, M.A., Domínguez, E. (2005). Information Retrieval and Large Text Structured Corpora. In: Moreno Díaz, R., Pichler, F., Quesada Arencibia, A. (eds) Computer Aided Systems Theory – EUROCAST 2005. EUROCAST 2005. Lecture Notes in Computer Science, vol 3643. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11556985_14
DOI: https://doi.org/10.1007/11556985_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29002-5
Online ISBN: 978-3-540-31829-3
eBook Packages: Computer Science, Computer Science (R0)