Statistical Comparability: Methodological Caveats

  • Chapter
Building and Using Comparable Corpora

Abstract

The notion of comparable corpora presupposes a notion of comparability. This paper explicates that notion with respect to statistical methods: statistical comparison relies on statistical tests, which in turn require certain properties of the data under analysis. Linguistic data, however, do not automatically meet these requirements. In corpus linguistics and other linguistic fields, statistical methods are often applied without any prior check of their applicability. The paper gives some warnings and shows examples of corresponding test procedures. A number of other frequently used terms and concepts, such as representativeness, homogeneity, and balanced corpora, play a central role in corpus-linguistic argumentation and are also analysed, as they concern the compilation and use of comparable corpora.
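To make the point about applicability checks concrete, the following is a minimal sketch and not the chapter's own procedure. It assumes two samples of sentence lengths (the numbers are invented for illustration), screens them for asymmetry with a moment-based skewness, and compares them with a rank-based Mann-Whitney U test, which does not assume normally distributed data. The function names, the skewness threshold of 1.0, and the normal approximation without tie correction are all assumptions of this sketch; in real work a formal normality test would replace the crude skewness screen.

```python
import math
from statistics import mean, pstdev


def skewness(xs):
    """Moment-based skewness; close to 0 for a symmetric sample."""
    m, s = mean(xs), pstdev(xs)
    return sum((x - m) ** 3 for x in xs) / (len(xs) * s ** 3)


def midranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(a, b):
    """Return (U for sample a, two-sided p from a normal approximation).

    Simplification: the variance is not corrected for ties, and no
    continuity correction is applied.
    """
    n1, n2 = len(a), len(b)
    ranks = midranks(list(a) + list(b))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return u1, p


if __name__ == "__main__":
    # Invented sentence lengths (in words) for two hypothetical corpora.
    corpus_a = [4, 5, 6, 6, 7, 8, 9, 11, 14, 31]
    corpus_b = [9, 10, 12, 13, 15, 16, 18, 21, 25, 40]
    for name, sample in (("A", corpus_a), ("B", corpus_b)):
        g1 = skewness(sample)
        # Arbitrary illustrative threshold, not a formal test.
        verdict = "strongly skewed" if abs(g1) > 1.0 else "roughly symmetric"
        print(f"corpus {name}: skewness = {g1:.2f} ({verdict})")
    u, p = mann_whitney_u(corpus_a, corpus_b)
    print(f"Mann-Whitney U = {u:.1f}, approximate two-sided p = {p:.4f}")
```

The design choice here is to check the data first and only then pick the test: when a sample looks clearly asymmetric, a rank-based comparison avoids relying on a normality assumption that the data may not satisfy.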

Author information

Corresponding author

Correspondence to Reinhard Köhler.

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Köhler, R. (2013). Statistical Comparability: Methodological Caveats. In: Sharoff, S., Rapp, R., Zweigenbaum, P., Fung, P. (eds) Building and Using Comparable Corpora. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20128-8_4

  • DOI: https://doi.org/10.1007/978-3-642-20128-8_4

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-20127-1

  • Online ISBN: 978-3-642-20128-8

  • eBook Packages: Computer Science, Computer Science (R0)
