Information Retrieval and Large Text Structured Corpora

  • Conference paper
Computer Aided Systems Theory – EUROCAST 2005 (EUROCAST 2005)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3643)

Abstract

Conventional Information Retrieval Systems (IRSs), also called text indexers, deal with plain-text documents or documents with a very elementary structure. Such systems answer queries very efficiently, but they cannot take into account the tags that mark different sections, or at best their ability to do so is very limited.

In contrast, the documents that make up a corpus nowadays often have a rich structure. They are marked up in XML (Extensible Markup Language) [1] or in some other format that can be converted to XML more or less easily. Classical IRSs built to work with such corpora therefore cannot exploit this structure, and their results do not improve because of it.
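To illustrate the distinction drawn above, here is a minimal sketch, not the authors' system, that contrasts a flat term index with one that also records the enclosing XML tag. The two-document corpus, the tag names (`article`, `title`, `body`) and the helper names are all hypothetical and chosen only for illustration.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical corpus; documents and tag names are illustrative assumptions.
CORPUS = {
    "doc1": "<article><title>XML retrieval</title>"
            "<body>Indexing plain text is fast.</body></article>",
    "doc2": "<article><title>Plain text indexing</title>"
            "<body>XML adds structure to a corpus.</body></article>",
}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def build_indexes(corpus):
    """Return a flat index (term -> doc ids) and a structure-aware
    index ((tag, term) -> doc ids) built from the same documents."""
    flat = defaultdict(set)
    structured = defaultdict(set)
    for doc_id, xml_text in corpus.items():
        root = ET.fromstring(xml_text)
        for element in root.iter():
            # Only text directly inside the element is indexed here.
            for term in tokenize(element.text or ""):
                flat[term].add(doc_id)
                structured[(element.tag, term)].add(doc_id)
    return flat, structured

flat_index, structured_index = build_indexes(CORPUS)

# A flat index cannot tell a title match from a body match.
print(sorted(flat_index["xml"]))
# A structure-aware index can restrict the query to one section.
print(sorted(structured_index[("title", "xml")]))
```

The flat lookup returns every document containing the term, while the tag-qualified lookup returns only the documents where the term occurs in the requested section. A real system for corpora of the size discussed here would need a far more compact index than in-memory Python sets.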

In addition, several of these corpora are very large: they include hundreds or thousands of documents, which in turn contain millions or hundreds of millions of words. There is therefore a need for efficient and flexible IRSs that work with large structured corpora.

Partially supported by Ministerio de Educación y Ciencia (MEC) and FEDER (TIN2004-07246-C02-01 and TIN2004-07246-C02-02), by MEC (HF2002-81), and by Xunta de Galicia (PGIDIT02PXIB30501PR, PGIDIT02SIN01E and PGIDIT03SIN30501PR).

References

  1. XML (2/5/2005), http://www.w3c.org

  2. Martínez, M.S.L.: CORGA (Corpus de Referencia del Gallego Actual). In: Proc. of Hizkuntza-corpusak. Oraina eta geroa, Borovets, Bulgaria, September 2003, pp. 500–504 (2003)

  3. Davies, M.: Un corpus anotado de 100.000.000 palabras del español histórico y moderno. In: Proceedings of Sociedad Española para el Procesamiento del Lenguaje Natural, Valladolid, Spain, pp. 21–27 (2002)

  4. Davies, M.: Relational n-gram databases as a basis for unlimited annotation on large corpora. In: Proceedings from the Workshop on Shallow Processing of Large Corpora, Lancaster, England, March 2003, pp. 23–33 (2003)

  5. Chaudhri, A.B., Rashid, A., Zicari, R.: XML Data Management, Native XML and XML-Enabled Database Systems. Addison-Wesley, Reading (2003)

  6. Oracle (2/5/2005), http://www.oracle.com

  7. Tamino (2/5/2005), http://www.softwareag.com

  8. Vilares, J., Alonso, M.A., Vilares, M.: Morphological and syntactic processing for Text Retrieval. In: Galindo, F., Takizawa, M., Traunmüller, R. (eds.) DEXA 2004. LNCS, vol. 3180, pp. 371–380. Springer, Heidelberg (2004)

  9. Alonso, M.A., Vilares, J., Darriba, V.M.: On the Usefulness of Extracting Syntactic Dependencies for Text Indexing. In: O’Neill, M., Sutcliffe, R.F.E., Ryan, C., Eaton, M., Griffith, N.J.L. (eds.) AICS 2002. LNCS (LNAI), vol. 2464, pp. 3–11. Springer, Heidelberg (2002)


Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Barcala, F.M., Molinero, M.A., Domínguez, E. (2005). Information Retrieval and Large Text Structured Corpora. In: Moreno Díaz, R., Pichler, F., Quesada Arencibia, A. (eds) Computer Aided Systems Theory – EUROCAST 2005. EUROCAST 2005. Lecture Notes in Computer Science, vol 3643. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11556985_14

  • DOI: https://doi.org/10.1007/11556985_14

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-29002-5

  • Online ISBN: 978-3-540-31829-3

  • eBook Packages: Computer Science, Computer Science (R0)
