Abstract
Small corpora pose problems for traditional statistical analysis because their data are sparse. We discuss a methodology for classifying words in edited, plain-text corpora that has the potential to work on relatively small corpora. This approach, which we call occurrence-based processing, records which contexts occur around a given word but pays no attention to the number of times each context occurs. We obtain good results on an artificial language and compare them with Elman's connectionist analysis of the same artificial language. We obtain more modest results on real-world corpora, but they are sufficient to draw some methodological and language-theoretic conclusions.
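The idea of recording which contexts occur, rather than how often, can be illustrated with a minimal sketch. The abstract does not specify the details, so everything concrete here is an assumption made for illustration: a context is taken to be the pair of immediately adjacent words, the boundary markers `<s>` and `</s>` are hypothetical, and the Jaccard overlap used to compare words is a stand-in similarity measure, not necessarily the one the authors use.

```python
from collections import defaultdict


def context_sets(sentences):
    """Map each word to the SET of (left, right) neighbor pairs seen around it.

    Using a set means a context is either present or absent; how many
    times it occurs is deliberately discarded.
    """
    contexts = defaultdict(set)
    for sentence in sentences:
        # Hypothetical boundary markers so edge words also get a context.
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for i in range(1, len(tokens) - 1):
            contexts[tokens[i]].add((tokens[i - 1], tokens[i + 1]))
    return dict(contexts)


def overlap(a, b):
    """Jaccard overlap of two context sets (an assumed similarity measure)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


corpus = [
    "the dog sees the cat",
    "the cat sees the dog",
    "the dog sees the cat",  # repeated sentence: contributes no new contexts
    "the dog chases the cat",
    "the cat chases the dog",
]
ctx = context_sets(corpus)

print(overlap(ctx["dog"], ctx["cat"]))      # nouns share all their contexts
print(overlap(ctx["sees"], ctx["chases"]))  # verbs share all their contexts
print(overlap(ctx["dog"], ctx["sees"]))     # noun vs. verb: nothing shared
```

Because only the presence of a context is stored, repeating a sentence leaves the representation unchanged, which is the property that makes the approach plausible for sparse data in this toy setting.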
References
A.V. Aho, J.E. Hopcroft and J.D. Ullman, Data Structures and Algorithms (Addison-Wesley, Reading, MA, 1983).
P.A. Bensch, Neostructuralism: A commentary on the correlations between the work of Zelig Harris and Jeffrey Elman, CRL Newsletter, 5.2, Center for Research in Language, University of California, San Diego (1991).
P.A. Bensch, Occurrence-based word categorization, Doctoral Dissertation, University of California, San Diego (1993).
M. Brent, Semantic classification of verbs from their syntactic contexts: Automated lexicography with implications for child language acquisition, Proceedings of the 12th Meeting of the Cognitive Science Society (1990) pp. 428–437.
E. Brill and M. Marcus, Tagging an unfamiliar text with minimum human supervision, Working Notes of the AAAI Fall Symp. on Probabilistic Approaches to Natural Language, ed. R. Goldman (AAAI Press, 1992).
N. Chomsky, Aspects of the Theory of Syntax (MIT Press, Cambridge, MA, 1965).
N. Chomsky, Lectures on Government and Binding: The Pisa Lectures (Foris, Dordrecht, Holland, 1981).
R.O. Duda and P.E. Hart, Pattern Classification and Scene Analysis (Wiley, New York, 1973).
J.L. Elman, Representation and structure in connectionist models, CRL Technical Report 8903, Center for Research in Language, University of California, San Diego (1989).
J.L. Elman, Finding structure in time, Cognitive Sci. 14(1990)179–211.
S. Finch and N. Chater, Bootstrapping syntactic categories using statistical methods, Background and Experiments in Machine Learning of Natural Language, ed. W. Daelemans and D. Powers (Institute for Language Technology and AI, Tilburg University).
T. Givon, SYNTAX: A Functional-Typological Introduction, Vol. I (Benjamins, Amsterdam, 1984).
J. Grimshaw, Form, function, and the language acquisition device, The Logical Problem of Language Acquisition, ed. C.L. Baker and J.J. McCarthy (MIT Press, Cambridge, MA, 1981).
R. Grishman, L. Hirschman and N.T. Nhan, Discovery procedures for sublanguage selectional patterns: Initial experiments, Comput. Linguistics 12.3(1986)205–215.
Z. Harris, A Grammar of English on Mathematical Principles (Wiley, New York, 1982).
Z. Harris, The Form of Information in Science: Analysis of an Immunology Sublanguage (Kluwer Academic, Dordrecht, The Netherlands, 1989).
T.C. Hu, Combinatorial Algorithms (Addison-Wesley, Reading, MA, 1982).
J. Macnamara, Names for Things: A Study of Child Language (Bradford Books/MIT Press, Cambridge, MA, 1982).
S. Pinker, Language Learnability and Language Development (Harvard University Press, Cambridge, MA, 1984).
S. Pinker, Learnability and Cognition: The Acquisition of Argument Structure (MIT Press, Cambridge, MA, 1989).
A. Radford, Transformational Grammar: A First Course (Cambridge University Press, Cambridge, 1988).
H. Schutze, Part-of-speech induction from scratch, manuscript to be presented at ACL93.
J.R. Taylor, Linguistic Categorization: Prototypes in Linguistic Theory (Clarendon Press, Oxford, 1989).
Cite this article
Bensch, P.A., Savitch, W.J. An occurrence-based model of word categorization. Ann Math Artif Intell 14, 1–16 (1995). https://doi.org/10.1007/BF01530891
DOI: https://doi.org/10.1007/BF01530891