Corpus Linguistics for establishing the natural language content of Digital Library documents

Futrelle, Robert P.; Zhang, Xiaolan; Sekiya, Yumiko

doi:10.1007/BFb0026855

Classification and Indexing
Conference paper
First Online: 01 January 2005

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 916)

Abstract

The methods of corpus linguistics can reveal a great deal of information about word use and language structure through careful processing of very large corpora. This information can be used to add organizational structure to digital libraries, both for individual document content and for inter-document relations. The structure discovered by corpus linguistics methods reflects the actual use of words and language style in particular domains and genres, rather than being constrained by pre-built categories. The data presented here demonstrate the power of simple word classification methods for discovering semantically related word clusters. Work in progress based on the new balanced entropy principle overcomes a number of limitations of current classification methods and should discover more detailed and accurate information about word relations and text structure.
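
As a rough illustration of the kind of simple word classification the abstract refers to, the sketch below groups words by the contexts in which they occur. It is not the authors' method: the abstract does not specify their algorithm or the balanced entropy principle, so a generic context-vector comparison with cosine similarity is swapped in here. The function names (context_vectors, cosine, most_similar), the window size, and the toy corpus are all hypothetical and chosen only for this example.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(sentences, window=1):
    """Map each word to a Counter of (offset, neighbor) contexts.

    Assumption: whitespace tokenization and a fixed symmetric window;
    a real corpus would need proper tokenization and far more data.
    """
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            for offset in range(-window, window + 1):
                j = i + offset
                if offset != 0 and 0 <= j < len(tokens):
                    vectors[word][(offset, tokens[j])] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(word, vectors, top_n=3):
    """Rank the other words by similarity of their contexts to `word`."""
    target = vectors.get(word, Counter())
    scores = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    # Highest similarity first; ties broken alphabetically.
    return sorted(scores, key=lambda pair: (-pair[1], pair[0]))[:top_n]


# Hypothetical toy corpus, invented for illustration only.
TOY_CORPUS = [
    "the protein binds the receptor",
    "the enzyme binds the receptor",
    "the protein activates the gene",
    "the enzyme activates the gene",
]

if __name__ == "__main__":
    vecs = context_vectors(TOY_CORPUS)
    print(most_similar("protein", vecs))
```

In this toy corpus "protein" and "enzyme" appear in identical contexts, so they rank as each other's nearest neighbor. With a very large corpus, the same idea can surface related word clusters without any pre-built categories, which is the point the abstract makes.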

Author information

Authors and Affiliations

Biological Knowledge Laboratory and Scientific Database Project, College of Computer Science, Northeastern University, 161 Cullinane Hall, Boston, MA 02115
Robert P. Futrelle, Xiaolan Zhang & Yumiko Sekiya

Editor information

Nabil R. Adam, Bharat K. Bhargava, Yelena Yesha

About this paper

Cite this paper

Futrelle, R.P., Zhang, X., Sekiya, Y. (1995). Corpus Linguistics for establishing the natural language content of Digital Library documents. In: Adam, N.R., Bhargava, B.K., Yesha, Y. (eds) Digital Libraries Current Issues. DL 1994. Lecture Notes in Computer Science, vol 916. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0026855

DOI: https://doi.org/10.1007/BFb0026855
Published: 18 June 2005
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-59282-2
Online ISBN: 978-3-540-49230-6
eBook Packages: Springer Book Archive
