
Using Word Association Norms to Measure Corpus Representativeness

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8403)

Abstract

An obvious way to measure how representative a corpus is of a person’s language environment would be to observe that person over an extended period, record all written and spoken input, and compare these data to the corpus in question. As this is not very practical, we suggest a more indirect approach. Previous work suggests that people’s word associations can be derived from corpus statistics. These word associations are known to some degree, as psychologists have collected them from test persons in large-scale experiments. The output of these experiments is a set of tables of word associations, the so-called word association norms. In this paper we assume that the more representative a corpus is of the language environment of the test persons, the better the associations generated from it should match people’s associations. That is, we compare the corpus-generated associations to the association norms collected from humans and take the similarity between the two as a measure of corpus representativeness. To our knowledge, this is the first attempt to do so.
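
The comparison described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the association measure or the similarity score, so everything below is an assumption made for illustration: associations are predicted from raw co-occurrence counts within a small window, and the score is the fraction of stimulus words whose top predicted associate equals the primary human response. The function names, the toy corpora, and the toy norms are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def cooccurrence_counts(tokens, window=1):
    """Count, for each word, the words seen within `window` positions of it."""
    counts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[word][tokens[j]] += 1
    return counts


def predicted_association(stimulus, counts):
    """Return the most frequent neighbour of `stimulus`, or None if unseen.

    Assumption: raw co-occurrence frequency stands in for whatever
    association measure the paper actually uses. Ties are broken
    alphabetically so the result is deterministic.
    """
    neighbours = counts.get(stimulus)
    if not neighbours:
        return None
    return min(neighbours, key=lambda w: (-neighbours[w], w))


def representativeness(tokens, norms, window=1):
    """Fraction of stimuli whose predicted associate matches the human norm."""
    counts = cooccurrence_counts(tokens, window)
    hits = sum(
        1
        for stimulus, response in norms.items()
        if predicted_association(stimulus, counts) == response
    )
    return hits / len(norms)


# Hypothetical toy data: primary human responses and two tiny "corpora".
norms = {"bread": "butter", "cold": "hot"}
corpus_a = "bread butter bread butter cold hot cold hot".split()
corpus_b = "bread knife bread knife cold hot cold hot".split()

score_a = representativeness(corpus_a, norms)
score_b = representativeness(corpus_b, norms)
print(score_a, score_b)
```

Under these assumptions, the corpus whose co-occurrence patterns reproduce more of the human responses receives the higher score, which is the intuition behind using association norms as a yardstick for representativeness.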




Copyright information

© 2014 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Rapp, R. (2014). Using Word Association Norms to Measure Corpus Representativeness. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2014. Lecture Notes in Computer Science, vol 8403. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-54906-9_1


  • DOI: https://doi.org/10.1007/978-3-642-54906-9_1

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-54905-2

  • Online ISBN: 978-3-642-54906-9

  • eBook Packages: Computer Science, Computer Science (R0)
