A short history of statistical association: From correlation to correspondence analysis to copulas
Introduction
The very early history of Statistics goes back to ancient times, in the Greek, Roman and Arab civilizations, and mostly involves counting: counting the population, the extent of properties and wealth. As the motto goes: “Statisticians count!”. But trying to understand the interrelationship between statistical variables can perhaps be traced back to the early Egyptians with their 8th and 9th century Nilometers [67], which continuously measured the height of the Nile. These data allowed them to predict whether the harvest would be normal, or whether there would be extremes of drought or flooding, information that would determine the levels of taxation. An inherent association between the Nile’s height and expected tax income was thus assumed.
So how can the relationship between two or more variables be described and how can the strength of this association be measured? This paper presents a relaxed review of this fundamental problem of variable association in Statistics, interspersed with some historical details from the latter part of the 19th century onwards. There are three main parts: Section 2 deals with the concept of bivariate correlation and its origins in the work of Bravais and Galton, subsequently improved by Pearson, as well as many variations on this theme developed by other pioneering statisticians such as Fisher. Section 3 exposes the history and concepts underlying correlation between two or more sets of variables, with correspondence analysis as the special case for categorical variables. Section 4 describes the general modelling of bivariate association in the form of copulas. Throughout we will use the famous data [29], [30] on mothers’, fathers’, daughters’ and sons’ heights used by Francis Galton, the father of regression analysis.
Section snippets
Classical statistics: correlation
It is convenient and natural to require that a measure of inter-variable association lies between 0 and 1, such that 0 indicates a total lack of association, or relationship, and 1 a perfect association. In some cases positive and negative relationships can be distinguished, in which case a convenient measure would lie between −1 and +1, with −1 representing perfect negative association. Both the absence and presence of an association require mathematical definitions and guidelines to their …
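As a concrete illustration of these bounds, here is a minimal sketch of the sample (Pearson) correlation coefficient in plain Python; the function name `pearson_r` and the toy data are our own illustration, not the paper's:

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient; always lies in [-1, +1]."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Perfectly linear data attains the bounds (up to floating-point rounding):
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # close to +1: perfect positive association
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))   # close to -1: perfect negative association
```

Any exact linear relationship between the two variables drives the coefficient to one of its two bounds; values strictly between them indicate imperfect association.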
Canonical analysis of sets of variables
In Section 2.8 we considered bivariate responses and bivariate explanatory variables, by looking for the combination of the explanatory variables that explained the maximum variance of the responses. This is an example of canonical analysis and brings us to defining what can be called the mother of all classical multivariate methods: canonical correlation analysis, abbreviated here as CCorA, the relationship between two sets of variables. The method has its origins in 1875, the same epoch as …
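The core computation of canonical correlation analysis can be sketched as follows: centre both sets of variables, whiten each set by the inverse square root of its covariance, and take the singular values of the whitened cross-covariance. This is a minimal numpy illustration of our own (the function names and simulated data are not from the paper):

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations between the column sets X and Y,
    computed as singular values of the whitened cross-covariance."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Sxx, Syy, Sxy = X.T @ X, Y.T @ Y, X.T @ Y

    def inv_sqrt(S):
        # Inverse matrix square root via eigendecomposition (S symmetric PD)
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    M = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    return np.linalg.svd(M, compute_uv=False)

# Two variable sets sharing one latent factor Z:
rng = np.random.default_rng(0)
Z = rng.standard_normal((200, 1))
X = np.hstack([Z + 0.1 * rng.standard_normal((200, 1)),
               rng.standard_normal((200, 1))])
Y = np.hstack([Z + 0.1 * rng.standard_normal((200, 1)),
               rng.standard_normal((200, 1))])
rho = canonical_correlations(X, Y)
print(rho)  # leading canonical correlation close to 1, second near 0
```

The leading singular value is the maximal correlation attainable between a linear combination of the X columns and a linear combination of the Y columns.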
Continuous bivariate distributions and copulas
When Karl Pearson studied Galton’s height data as well as his own data set on heights of parents and children, he was careful to analyse them in the context of the bivariate normal distribution. For Pearson this was a mathematical model that correctly summarized the observed data. A more general approach to model-based measurement of statistical association consists in using the theory of bivariate distributions with given marginals, which is related to the theory of copulas.
Suppose that …
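The key idea behind copulas, following Sklar's theorem, is that dependence can be specified on uniform marginals, after which arbitrary marginals are imposed by quantile transformation. A minimal sketch using a Gaussian copula (a numpy/scipy illustration of our own; the function `gaussian_copula_sample` and the height marginal are assumptions for this example, not the paper's code):

```python
import numpy as np
from scipy.stats import norm

def gaussian_copula_sample(rho, n, rng):
    """Sample n pairs (U, V) from a bivariate Gaussian copula with
    correlation parameter rho; U and V are each uniform on (0, 1)."""
    cov = np.array([[1.0, rho], [rho, 1.0]])
    z = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    # Probability-integral transform: normal -> uniform marginals
    return norm.cdf(z)

rng = np.random.default_rng(1)
uv = gaussian_copula_sample(0.8, 5000, rng)

# Any marginals can now be imposed; e.g. a normal "heights" marginal
# (mean 68 inches, sd 2.5 -- illustrative values only):
heights = norm.ppf(uv[:, 0], loc=68.0, scale=2.5)
```

The copula thus separates the model of association (here, the parameter rho) from the model of the marginal distributions.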
Summary and conclusion
We have presented in three parts the different approaches to the concept of statistical association, starting from the correlation coefficient, proposed in regression terms by Galton and then improved by Pearson in the rigorous definition that we know today. We used the original data of Galton on the heights of parents and children, to illustrate the correlation coefficient and some associated methods. This coefficient has been extremely useful but is often interpreted incorrectly. The …
References (69)
Correspondence analysis and diagonal expansions in terms of distribution functions. J. Statist. Plann. Inference (2002).
On the covariance between functions. J. Multivariate Anal. (2002).
Constructing copula functions with weighted geometric means. J. Statist. Plann. Inference (2009).
Contributions to the diagonal expansion of a bivariate copula with continuous extensions. J. Multivariate Anal. (2015).
A parametric approach to correspondence analysis. Linear Algebra Appl. (2006).
On the relationship between Spearman’s rho and Kendall’s tau for pairs of continuous random variables. J. Statist. Plann. Inference (2007).
Correspondence analysis, association analysis, and generalized nonindependence analysis of contingency tables: Saturated and unsaturated models, and appropriate graphical displays.
A simple method for obtaining the maximal correlation coefficient and related characterizations. J. Multivariate Anal. (2013).
Biplots of compositional data. J. R. Stat. Soc. Ser. C. Appl. Stat. (2002).
L’Analyse des Données. Tome II. L’Analyse des Correspondances (1976).
Analyzing Quantitative Data: From Description to Explanation.
Canonical analysis of contingency tables with linear constraints. Psychometrika.
Analyse mathématique sur les probabilités des erreurs de situation d’un point. Mémoires présentés par divers savants à l’Académie des Sciences de l’Institut de France. Sci. Math. Phys.
Mathematical Methods of Statistics.
Interpreting an inequality in multiple regression. Am. Stat.
A continuous general multivariate distribution and its properties. Comm. Statist. Theory Methods.
A comparison of different methods for representing categorical data. Comm. Statist. Simulation Comput.
Two generalized bivariate FGM distributions and rank reduction. Comm. Statist. Theory Methods.
The importance of geometry in multivariate analysis and some applications.
Continuous extensions of matrix formulations in correspondence analysis, with applications to the FGM family of distributions.
Some multivariate measures based on distances and their entropy versions.
A comparison of confidence interval methods for the intraclass correlation coefficient. Biometrics.
Principles of Copula Theory.
A method for cluster analysis. Biometrics.
The performance of some correlation coefficients for a general bivariate distribution. Biometrika.
The precision of discriminant functions. Ann. Eugen.
Statistical Methods for Research Workers.
Polycor: polychoric and polyserial correlations.
Sur les tableaux de corrélation dont les marges sont données. Ann. Univ. Lyon. A Sci. Math. Astronomie.
Regression towards mediocrity in hereditary stature. J. Anthropol. Inst.
Galton height data.
Nonlinear Multivariate Analysis.
Canonical analysis of contingency tables by maximum likelihood. J. Amer. Statist. Assoc.