Information Retrieval and Large Text Structured Corpora

Barcala, Fco. Mario; Molinero, Miguel A.; Domínguez, Eva

doi:10.1007/11556985_14


Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3643)

Included in the following conference series:

International Conference on Computer Aided Systems Theory


Abstract

Conventional Information Retrieval Systems (IRSs), also called text indexers, deal with plain text documents or documents with a very elementary structure. Such systems can answer queries very efficiently, but they cannot take into account the tags that mark different sections of a document, or at best can do so only to a very limited extent.

In contrast, the documents that make up a corpus nowadays often have a rich structure. They are marked up in XML (Extensible Markup Language) [1] or in some other format that can be converted to XML fairly easily. A classical IRS built over such a corpus cannot exploit this structure, so its results are no better than they would be on plain text.

In addition, many of these corpora are very large: they comprise hundreds or thousands of documents, which in turn contain millions or hundreds of millions of words. There is therefore a need for efficient and flexible IRSs that work with large structured corpora.

Partially supported by Ministerio de Educación y Ciencia (MEC) and FEDER (TIN2004-07246-C02-01 and TIN2004-07246-C02-02), by MEC (HF2002-81), and by Xunta de Galicia (PGIDIT02PXIB30501PR, PGIDIT02SIN01E and PGIDIT03SIN30501PR).



References

[1] XML (2/5/2005), http://www.w3c.org
[2] Martínez, M.S.L.: CORGA (Corpus de Referencia del Gallego Actual). In: Proc. of Hizkuntza-corpusak. Oraina eta feroa, Borovets, Bulgaria, September 2003, pp. 500–504 (2003)
[3] Davies, M.: Un corpus anotado de 100.000.000 palabras del español histórico y moderno. In: Proceedings of Sociedad Española para el Procesamiento del Lenguaje Natural, Valladolid, Spain, pp. 21–27 (2002)
[4] Davies, M.: Relational n-gram databases as a basis for unlimited annotation on large corpora. In: Proceedings from the Workshop on Shallow Processing of Large Corpora, Lancaster, England, March 2003, pp. 23–33 (2003)
[5] Chaudhri, A.B., Rashid, A., Zicari, R.: XML Data Management, Native XML and XML-Enabled Database Systems. Addison-Wesley, Reading (2003)
[6] Oracle (2/5/2005), http://www.oracle.com
[7] Tamino (2/5/2005), http://www.softwareag.com
[8] Vilares, J., Alonso, M.A., Vilares, M.: Morphological and syntactic processing for Text Retrieval. In: Galindo, F., Takizawa, M., Traunmüller, R. (eds.) DEXA 2004. LNCS, vol. 3180, pp. 371–380. Springer, Heidelberg (2004)
[9] Alonso, M.A., Vilares, J., Darriba, V.M.: On the Usefulness of Extracting Syntactic Dependencies for Text Indexing. In: O’Neill, M., Sutcliffe, R.F.E., Ryan, C., Eaton, M., Griffith, N.J.L. (eds.) AICS 2002. LNCS (LNAI), vol. 2464, pp. 3–11. Springer, Heidelberg (2002)


Author information

Authors and Affiliations

Centro Ramón Piñeiro, Ctra. Santiago-Noia km. 3, A Barcia, 15896, Santiago de Compostela, Spain
Fco. Mario Barcala & Eva Domínguez
Depto. de Informática, Universidade de Vigo, Campus As Lagoas, s/n, 32004, Ourense, Spain
Miguel A. Molinero


Editor information

Editors and Affiliations

Instituto Universitario de Ciencias y Tecnologías Cibernéticas, Universidad de Las Palmas de Gran Canaria, Campus de Tafira, 35017, Las Palmas de Gran Canaria, Las Palmas, Spain
Roberto Moreno Díaz
Systems Theory, Johannes Kepler University Linz, Altenbergerstrasse 69, A-4040, Linz, Austria
Franz Pichler
Universidad de Las Palmas de Gran Canaria, Instituto Universitario de Ciencias y Tecnologías Cibernéticas, Campus de Tafira, 35017, Las Palmas de Gran Canaria, Las Palmas, Spain
Alexis Quesada Arencibia


Cite this paper

Barcala, F.M., Molinero, M.A., Domínguez, E. (2005). Information Retrieval and Large Text Structured Corpora. In: Moreno Díaz, R., Pichler, F., Quesada Arencibia, A. (eds) Computer Aided Systems Theory – EUROCAST 2005. EUROCAST 2005. Lecture Notes in Computer Science, vol 3643. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11556985_14


DOI: https://doi.org/10.1007/11556985_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29002-5
Online ISBN: 978-3-540-31829-3
eBook Packages: Computer Science, Computer Science (R0)
