A short history of statistical association: From correlation to correspondence analysis to copulas

https://doi.org/10.1016/j.jmva.2021.104901

Abstract

We present in three parts different concepts of correlation and statistical association, with some historical notes, starting with Galton’s notion of correlation, subsequently improved by Pearson. Continuing in this first part, we discuss the correlation ratio, the intraclass correlation, multiple correlation, and redundancy analysis. Throughout we use the classic data set of Galton on the heights of parents and their children. In the second part we explain how these same data can be studied from a multivariate viewpoint, using canonical correlation analysis, Procrustes correlation and simple/multiple correspondence analysis. For correspondence analysis, we use the same data as categorized by Galton into intervals of heights for the parents and their children. In this part we also make an incursion into the continuous form of correspondence analysis. The third part is dedicated to bivariate distributions, where we give the main results of bivariate distributions with given marginals, commenting on the correlations of Spearman and Kendall. Seeing that a bivariate distribution can be generated using a copula, we fit Galton’s data to two copulas: the Gaussian copula and the copula which has the best fit.

Introduction

The very early history of Statistics goes back to ancient times, in the Greek, Roman and Arab civilizations, and mostly involves counting: counting the population, the extent of properties and wealth. As the motto goes: “Statisticians count!” But trying to understand the interrelationship between statistical variables can perhaps be traced back to the early Egyptians with their 8th- and 9th-century Nilometers [67], which continuously measured the height of the Nile. These data allowed them to predict whether the harvest would be normal, or whether there would be extremes of drought or flooding, information that would determine the levels of taxation. An inherent association between the Nile’s height and expected tax income was thus assumed.

So how can the relationship between two or more variables be described and how can the strength of this association be measured? This paper presents a relaxed review of this fundamental problem of variable association in Statistics, interspersed with some historical details from the latter part of the 19th century onwards. There are three main parts: Section 2 deals with the concept of bivariate correlation and its origins in the work of Bravais and Galton, subsequently improved by Pearson, as well as many variations on this theme developed by other pioneering statisticians such as Fisher. Section 3 sets out the history and concepts underlying correlation between two or more sets of variables, with correspondence analysis as the special case for categorical variables. Section 4 describes the general modelling of bivariate association in the form of copulas. Throughout we will use the famous data [29], [30] on mothers’, fathers’, daughters’ and sons’ heights used by Francis Galton, the father of regression analysis.

Section snippets

Classical statistics: correlation

It is convenient and natural to require that a measure of inter-variable association lies between 0 and 1, such that 0 indicates a total lack of association, or relationship, and 1 a perfect association. In some cases positive and negative relationships can be distinguished, in which case a convenient measure would lie between −1 and 1, with −1 representing perfect negative association. Both the absence and presence of an association require mathematical definitions and guidelines to their
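As a minimal sketch of a measure bounded by −1 and 1, the following computes the Pearson product-moment correlation in plain Python. The function name `pearson_r` and the height values are illustrative assumptions: the numbers are made up for this example and are not Galton's data.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squares of the centred values.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical heights in inches (made up for illustration, NOT Galton's data).
parent = [64.0, 66.5, 68.0, 69.5, 71.0, 72.5]
child = [65.5, 66.0, 68.5, 68.0, 70.5, 70.0]

# An exact increasing line gives +1, an exact decreasing line gives -1
# (up to floating-point rounding); noisy data fall strictly in between.
r_pos = pearson_r(parent, [2.0 * p + 1.0 for p in parent])
r_neg = pearson_r(parent, [-0.5 * p + 100.0 for p in parent])
r_mid = pearson_r(parent, child)
```

The two limiting cases correspond to perfect positive and perfect negative linear association; a value near 0 indicates no linear association, which is weaker than independence.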

Canonical analysis of sets of variables

In Section 2.8 we considered bivariate responses and bivariate explanatory variables, by looking for the combination of the explanatory variables that explained the maximum variance of the responses. This is an example of canonical analysis and brings us to defining what can be called the mother of all classical multivariate methods: canonical correlation analysis, abbreviated here as CCorA, the relationship between two sets of variables. The method has its origins in 1875, the same epoch of

Continuous bivariate distributions and copulas

When Karl Pearson studied Galton’s height data as well as his own data set on heights of parents and children, he was careful to analyse them in the context of the bivariate normal distribution. For Pearson this was a mathematical model that correctly summarized the observed data. A more general approach to model-based measurement of statistical association consists in using the theory of bivariate distributions with given marginals, which is related to the theory of copulas.
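To illustrate the idea of a bivariate distribution with given marginals, here is a hedged sketch that couples two arbitrary marginals through a Gaussian copula and checks the sample Kendall correlation against the known value (2/π)·arcsin(ρ) for that copula. This is not the fitting procedure used in the paper; the function names, the choice of marginals (exponential and normal), and all parameter values are assumptions made for this example.

```python
import math
import random
from statistics import NormalDist


def sample_gaussian_copula(n, rho, inv_cdf_x, inv_cdf_y, seed=0):
    """Draw n pairs whose dependence is a Gaussian copula with parameter rho
    and whose marginals are defined by the two inverse CDFs."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    eps = 1e-12  # keep uniforms strictly inside (0, 1)
    pairs = []
    for _ in range(n):
        # Correlated standard normals.
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        # Probability integral transform: uniforms carrying the dependence.
        u = min(max(std_normal.cdf(z1), eps), 1.0 - eps)
        v = min(max(std_normal.cdf(z2), eps), 1.0 - eps)
        # Impose the chosen marginals through their quantile functions.
        pairs.append((inv_cdf_x(u), inv_cdf_y(v)))
    return pairs


def kendall_tau(pairs):
    """Sample Kendall tau by counting concordant minus discordant pairs (O(n^2))."""
    n = len(pairs)
    score = 0
    for i in range(n):
        xi, yi = pairs[i]
        for j in range(i + 1, n):
            xj, yj = pairs[j]
            prod = (xi - xj) * (yi - yj)
            if prod > 0:
                score += 1
            elif prod < 0:
                score -= 1
    return score / (n * (n - 1) / 2)


rho = 0.5  # assumed copula parameter


def exp_inv(u):
    """Quantile function of the exponential distribution with rate 1."""
    return -math.log(1.0 - u)


norm_inv = NormalDist(mu=68.0, sigma=2.5).inv_cdf  # hypothetical normal marginal

sample = sample_gaussian_copula(1000, rho, exp_inv, norm_inv, seed=1)
tau_hat = kendall_tau(sample)
tau_theory = (2.0 / math.pi) * math.asin(rho)  # equals 1/3 when rho = 0.5
```

Because Kendall's coefficient depends only on ranks, changing `exp_inv` or `norm_inv` to any other increasing quantile function leaves `tau_hat` unchanged for the same seed: the marginals and the dependence structure are specified separately.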

Suppose that H(x,y)=

Summary and conclusion

We have presented in three parts the different approaches to the concept of statistical association, starting from the correlation coefficient, proposed in regression terms by Galton and then improved by Pearson in the rigorous definition that we know today. We used the original data of Galton on the heights of parents and children, to illustrate the correlation coefficient and some associated methods. This coefficient has been extremely useful but is often interpreted incorrectly. The

References (69)

  • Blaikie N., Analyzing Quantitative Data: From Description to Explanation (2003)
  • Böckenholt U. et al., Canonical analysis of contingency tables with linear constraints, Psychometrika (1990)
  • Bravais A., Analyse mathématique sur les probabilités des erreurs de situation d’un point. Mémoires présentés par divers savants à l’Académie des Sciences de l’Institut de France, Sci. Math. Phys. (1846)
  • Carroll J.D., Generalization of canonical correlation analysis to three or more sets of variables, in: Proceedings of...
  • Cramér H., Mathematical Methods of Statistics (1946)
  • Cuadras C.M., Interpreting an inequality in multiple regression, Am. Stat. (1993)
  • Cuadras C.M. et al., A continuous general multivariate distribution and its properties, Comm. Statist. Theory Methods (1981)
  • Cuadras C.M. et al., A comparison of different methods for representing categorical data, Comm. Statist. Simulation Comput. (2006)
  • Cuadras C.M. et al., Two generalized bivariate FGM distributions and rank reduction, Comm. Statist. Theory Methods (2020)
  • Cuadras C.M. et al., The importance of geometry in multivariate analysis and some applications
  • Cuadras C.M. et al., Continuous extensions of matrix formulations in correspondence analysis, with applications to the FGM family of distributions
  • Cuadras C.M. et al., Some multivariate measures based on distances and their entropy versions
  • Donner A. et al., A comparison of confidence interval methods for the intraclass correlation coefficient, Biometrics (1986)
  • Durante F. et al., Principles of Copula Theory (2016)
  • Edwards A.W.F. et al., A method for cluster analysis, Biometrics (1965)
  • Farlie D.J.G., The performance of some correlation coefficients for a general bivariate distribution, Biometrika (1960)
  • Fisher R.A., The precision of discriminant functions, Ann. Eugen. (1940)
  • Fisher R.A., Statistical Methods for Research Workers (1950)
  • Fox J., Polycor: polychoric and polyserial correlations (2019)
  • Fréchet M., Sur les tableaux de corrélation dont les marges sont données, Ann. Univ. Lyon. A Sci. Math. Astronomie (1951)
  • Galton F., Regression towards mediocrity in hereditary stature, J. Anthropol. Inst. (1886)
  • Galton F., Galton height data (2017)
  • Gifi A., Nonlinear Multivariate Analysis (1981)
  • Gilula Z. et al., Canonical analysis of contingency tables by maximum likelihood, J. Amer. Statist. Assoc. (1986)