Abstract
The methods of corpus linguistics can reveal a great deal of information about word use and language structure through careful processing of very large corpora. This information can be used to add organizational structure to digital libraries, both in terms of individual document content and inter-document relations. The structure discovered by corpus linguistics methods reflects the actual use of words and language style in particular domains and genres, rather than being constrained by pre-built categories. The data presented here demonstrate the power of simple word classification methods for discovering semantically related word clusters. Work in progress based on the new balanced entropy principle overcomes a number of limitations of current classification methods and should discover more detailed and accurate information about word relations and text structure.
Copyright information
© 1995 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Futrelle, R.P., Zhang, X., Sekiya, Y. (1995). Corpus Linguistics for establishing the natural language content of Digital Library documents. In: Adam, N.R., Bhargava, B.K., Yesha, Y. (eds) Digital Libraries Current Issues. DL 1994. Lecture Notes in Computer Science, vol 916. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0026855
DOI: https://doi.org/10.1007/BFb0026855
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-59282-2
Online ISBN: 978-3-540-49230-6
eBook Packages: Springer Book Archive