Using Word Association Norms to Measure Corpus Representativeness

Rapp, Reinhard

doi:10.1007/978-3-642-54906-9_1

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8403)

Abstract

An obvious way to measure how representative a corpus is of a person's language environment would be to observe this person over a longer period of time, record all written or spoken input, and compare these data to the corpus in question. As this is not very practical, we suggest a more indirect approach. Previous work suggests that people’s word associations can be derived from corpus statistics. These word associations are known to some degree, as psychologists have collected them from test persons in large-scale experiments. The outputs of these experiments are tables of word associations, the so-called word association norms. In this paper we assume that the more representative a corpus is of the language environment of the test persons, the better the associations generated from it should match people’s associations. That is, we compare the corpus-generated associations to the association norms collected from humans, and take the similarity between the two as a measure of corpus representativeness. To our knowledge, this is the first attempt to do so.
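
The comparison described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the association measure, the norm data, or the similarity score, so everything below is an assumption made for illustration: raw co-occurrence counts within a small window stand in for the corpus-derived associations, the score is the share of stimulus words whose strongest corpus associate equals the primary human response, and the function names, the toy corpus (already stripped of function words), and the toy norms are all hypothetical.

```python
from collections import Counter, defaultdict


def cooccurrence_counts(sentences, window=2):
    """Count how often each word pair occurs within `window` tokens of each other."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts


def top_association(counts, stimulus):
    """Return the most frequent co-occurring word, or None if the stimulus is unseen."""
    neighbours = counts.get(stimulus)
    if not neighbours:
        return None
    # Break ties alphabetically so the result is deterministic.
    return min(neighbours, key=lambda w: (-neighbours[w], w))


def representativeness(sentences, norms, window=2):
    """Share of stimuli whose corpus associate matches the primary human response."""
    counts = cooccurrence_counts(sentences, window)
    hits = sum(
        1 for stimulus, response in norms.items()
        if top_association(counts, stimulus) == response
    )
    return hits / len(norms)


# Hypothetical toy data, not taken from the paper.
toy_corpus = [
    "bread butter jam".split(),
    "warm bread butter".split(),
    "doctor nurse hospital".split(),
    "nurse doctor ward".split(),
]
toy_norms = {"bread": "butter", "doctor": "nurse", "king": "queen"}

score = representativeness(toy_corpus, toy_norms)
print(f"{score:.2f}")  # prints 0.67: "king" never occurs in the toy corpus
```

Under this assumed scoring, a corpus that better reflects the test persons' language environment would reproduce more of the primary responses and so obtain a higher score. A real experiment would likely replace raw counts with a statistical association measure and use published norms.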

Author information

Authors and Affiliations

Laboratoire d’Informatique Fondamentale, Aix-Marseille Université, 163 Avenue de Luminy - Case 901, 13288, Marseille Cedex 9, France
Reinhard Rapp

Editor information

Editors and Affiliations

Center for Computing Research, National Polytechnic Institute, Av. Juan Dios Bátiz, Col. Nueva Industrial Vallejo, 07738, Mexico D.F., Mexico
Alexander Gelbukh

About this paper

Cite this paper

Rapp, R. (2014). Using Word Association Norms to Measure Corpus Representativeness. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2014. Lecture Notes in Computer Science, vol 8403. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-54906-9_1

DOI: https://doi.org/10.1007/978-3-642-54906-9_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-54905-2
Online ISBN: 978-3-642-54906-9
eBook Packages: Computer Science, Computer Science (R0)
