Abstract
An obvious way to measure how representative a corpus is of a person's language environment would be to observe this person over a long period of time, record all written and spoken input, and compare these data to the corpus in question. As this is not practical, we suggest a more indirect approach. Previous work suggests that people's word associations can be derived from corpus statistics. These word associations are known to some degree, as psychologists have collected them from test persons in large-scale experiments. The results of these experiments are tables of word associations, the so-called word association norms. In this paper we assume that the more representative a corpus is of the language environment of the test persons, the better the associations generated from it should match people's associations. That is, we compare the corpus-generated associations to the association norms collected from humans, and take the similarity between the two as a measure of corpus representativeness. To our knowledge, this is the first attempt to do so.
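The comparison described in the abstract can be illustrated with a minimal sketch. It is not the paper's actual procedure: it assumes that a stimulus word's corpus-generated association is simply the word co-occurring with it most often within a small window, and that similarity is the fraction of stimuli whose corpus-generated association equals the primary human response. The function names, the window size, and the toy data are all hypothetical.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered word pair occurs within `window` tokens."""
    pair_counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pair_counts[(word, tokens[j])] += 1
    return pair_counts


def corpus_association(stimulus, pair_counts):
    """Return the word co-occurring most often with `stimulus`, or None.

    Raw co-occurrence frequency is an assumption made for brevity; a real
    system would likely use a statistical association measure instead.
    """
    candidates = {r: c for (s, r), c in pair_counts.items() if s == stimulus}
    if not candidates:
        return None
    # Highest count first; ties are broken alphabetically for determinism.
    return min(candidates, key=lambda r: (-candidates[r], r))


def representativeness(sentences, norms, window=2):
    """Fraction of stimuli whose corpus association matches the human norm."""
    if not norms:
        return 0.0
    pair_counts = cooccurrence_counts(sentences, window)
    hits = sum(
        1
        for stimulus, response in norms.items()
        if corpus_association(stimulus, pair_counts) == response
    )
    return hits / len(norms)


# Hypothetical toy corpus (tokenised sentences) and toy association norms
# (stimulus -> primary human response).
toy_corpus = [
    ["bread", "and", "butter"],
    ["butter", "on", "bread"],
    ["black", "and", "white"],
    ["black", "white", "photo"],
    ["white", "cat"],
]
toy_norms = {"bread": "butter", "black": "white", "cat": "dog"}

score = representativeness(toy_corpus, toy_norms)
print(f"match rate: {score:.2f}")
```

Under these assumptions, a higher match rate would indicate a corpus closer to the language environment of the test persons; scores for different corpora are only comparable when computed against the same norms.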
Copyright information
© 2014 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Rapp, R. (2014). Using Word Association Norms to Measure Corpus Representativeness. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2014. Lecture Notes in Computer Science, vol 8403. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-54906-9_1
DOI: https://doi.org/10.1007/978-3-642-54906-9_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-54905-2
Online ISBN: 978-3-642-54906-9
eBook Packages: Computer Science, Computer Science (R0)