Abstract
The notion of comparable corpora implies the notion of comparability. The present paper aims to explicate this notion with respect to statistical methods: statistical comparison requires statistical tests, which in turn require certain properties of the data under analysis. Linguistic data, however, do not automatically meet these requirements. In corpus linguistics and other linguistic fields, statistical methods are often applied without any prior check of their applicability. The paper gives some warnings and shows examples of corresponding test procedures. A number of other frequently used terms and concepts, such as representativeness, homogeneity, and balanced corpora, play a central role in corpus-linguistic argumentation and are also analysed, as they concern the compilation and use of comparable corpora.
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Köhler, R. (2013). Statistical Comparability: Methodological Caveats. In: Sharoff, S., Rapp, R., Zweigenbaum, P., Fung, P. (eds) Building and Using Comparable Corpora. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20128-8_4
DOI: https://doi.org/10.1007/978-3-642-20128-8_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20127-1
Online ISBN: 978-3-642-20128-8
eBook Packages: Computer Science, Computer Science (R0)