Compilation of a Spanish Representative Corpus

Gelbukh, Alexander; Sidorov, Grigori; Chanona-Hernández, Liliana

doi:10.1007/3-540-45715-1_27

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2276)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

Because of Zipf's law, even a very large corpus contains very few occurrences (tokens) of most of its distinct words (types). Only a corpus that contains enough occurrences of even rare words can provide the statistical information needed to study the contextual usage of words. We call such a corpus representative and suggest using the Internet to compile it. We describe the corresponding algorithm and its application to Spanish, and discuss different concepts of a representative corpus.
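
As a rough illustration of the sparsity described above (this is not the authors' compilation algorithm), the following sketch assumes an idealized Zipf law with exponent 1, in which the expected frequency of the type at rank r is proportional to 1/r. The function names, vocabulary size, corpus size, and occurrence threshold are all hypothetical and chosen only for illustration.

```python
def zipf_expected_counts(num_types: int, num_tokens: int) -> list[float]:
    """Expected token count per rank under an idealized Zipf law (exponent 1).

    Assumption: frequency of the type at rank r is proportional to 1/r,
    normalized so that the counts sum to num_tokens.
    """
    harmonic = sum(1.0 / r for r in range(1, num_types + 1))
    return [num_tokens / (r * harmonic) for r in range(1, num_types + 1)]


def fraction_of_rare_types(counts: list[float], min_occurrences: float) -> float:
    """Share of types whose expected count falls below min_occurrences."""
    rare = sum(1 for c in counts if c < min_occurrences)
    return rare / len(counts)


if __name__ == "__main__":
    # Hypothetical corpus: 100,000 types and 10 million tokens.
    counts = zipf_expected_counts(num_types=100_000, num_tokens=10_000_000)
    share = fraction_of_rare_types(counts, min_occurrences=50)
    print(f"Types with fewer than 50 expected occurrences: {share:.1%}")
```

Under these assumptions, most types fall below the threshold even though the corpus is large, which is the motivation for a corpus built so that rare words also have enough occurrences.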

Work done under partial support of CONACyT, CGEPI/COFAA-IPN, and SNI, Mexico.

Author information

Authors and Affiliations

Center for Computing Research, National Polytechnic Institute, Mexico
Alexander Gelbukh, Grigori Sidorov & Liliana Chanona-Hernández

Editor information

Editors and Affiliations

CIC Centro de Investigacion en Computacion, IPN Instituto Politecnico Nacional, Col Zacatenco, CP 07738, Mexico DF, Mexico
Alexander Gelbukh

About this paper

Cite this paper

Gelbukh, A., Sidorov, G., Chanona-Hernández, L. (2002). Compilation of a Spanish Representative Corpus. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2002. Lecture Notes in Computer Science, vol 2276. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45715-1_27

DOI: https://doi.org/10.1007/3-540-45715-1_27
Published: 05 February 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43219-7
Online ISBN: 978-3-540-45715-2
eBook Packages: Springer Book Archive
