A short history of statistical association: From correlation to correspondence analysis to copulas
Introduction
The very early history of Statistics goes back to ancient times, in the Greek, Roman and Arab civilizations, and mostly involves counting: counting the population, the extent of properties and wealth. As the motto goes: “Statisticians count!”. But trying to understand the interrelationship between statistical variables can perhaps be traced back to the early Egyptians with their 8th and 9th century Nilometers [67], which continuously measured the height of the Nile. These data allowed them to predict whether the harvest would be normal, or whether there would be extremes of drought or flooding, information that would determine the levels of taxation. An inherent association between the Nile’s height and expected tax income was thus assumed.
So how can the relationship between two or more variables be described and how can the strength of this association be measured? This paper presents a relaxed review of this fundamental problem of variable association in Statistics, interspersed with some historical details from the latter part of the 19th century onwards. There are three main parts: Section 2 deals with the concept of bivariate correlation and its origins in the work of Bravais and Galton, subsequently improved by Pearson, as well as many variations on this theme developed by other pioneering statisticians such as Fisher. Section 3 exposes the history and concepts underlying correlation between two or more sets of variables, with correspondence analysis as the special case for categorical variables. Section 4 describes the general modelling of bivariate association in the form of copulas. Throughout we will use the famous data [29], [30] on mothers’, fathers’, daughters’ and sons’ heights used by Francis Galton, the father of regression analysis.
Section snippets
Classical statistics: correlation
It is convenient and natural to require that a measure of inter-variable association lies between 0 and 1, such that 0 indicates a total lack of association, or relationship, and 1 a perfect association. In some cases positive and negative relationships can be distinguished, in which case a convenient measure would lie between −1 and +1, with −1 representing perfect negative association. Both the absence and presence of an association require mathematical definitions and guidelines to their …
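As a concrete illustration of these bounds, here is a minimal sketch of the sample (Pearson) correlation coefficient in plain Python; the function name `pearson_r` and the toy data are our own illustration, not the paper's:

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient; always lies in [-1, +1]."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Perfectly linear data attains the bounds (up to floating-point rounding):
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # close to +1: perfect positive association
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))   # close to -1: perfect negative association
```

Any exact linear relationship between the two variables drives the coefficient to one of its two bounds; values strictly between them indicate imperfect association.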
Canonical analysis of sets of variables
In Section 2.8 we considered bivariate responses and bivariate explanatory variables, by looking for the combination of the explanatory variables that explained the maximum variance of the responses. This is an example of canonical analysis and brings us to defining what can be called the mother of all classical multivariate methods: canonical correlation analysis, abbreviated here as CCorA, the relationship between two sets of variables. The method has its origins in 1875, the same epoch as …
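The core computation of canonical correlation analysis can be sketched as follows: centre both sets of variables, whiten each set by the inverse square root of its covariance, and take the singular values of the whitened cross-covariance. This is a minimal numpy illustration of our own (the function names and simulated data are not from the paper):

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations between the column sets X and Y,
    computed as singular values of the whitened cross-covariance."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Sxx, Syy, Sxy = X.T @ X, Y.T @ Y, X.T @ Y

    def inv_sqrt(S):
        # Inverse matrix square root via eigendecomposition (S symmetric PD)
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    M = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    return np.linalg.svd(M, compute_uv=False)

# Two variable sets sharing one latent factor Z:
rng = np.random.default_rng(0)
Z = rng.standard_normal((200, 1))
X = np.hstack([Z + 0.1 * rng.standard_normal((200, 1)),
               rng.standard_normal((200, 1))])
Y = np.hstack([Z + 0.1 * rng.standard_normal((200, 1)),
               rng.standard_normal((200, 1))])
rho = canonical_correlations(X, Y)
print(rho)  # leading canonical correlation close to 1, second near 0
```

The leading singular value is the maximal correlation attainable between a linear combination of the X columns and a linear combination of the Y columns.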
Continuous bivariate distributions and copulas
When Karl Pearson studied Galton’s height data as well as his own data set on heights of parents and children, he was careful to analyse them in the context of the bivariate normal distribution. For Pearson this was a mathematical model that correctly summarized the observed data. A more general approach to model-based measurement of statistical association consists in using the theory of bivariate distributions with given marginals, which is related to the theory of copulas.
Suppose that …
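The key idea behind copulas, following Sklar's theorem, is that dependence can be specified on uniform marginals, after which arbitrary marginals are imposed by quantile transformation. A minimal sketch using a Gaussian copula (a numpy/scipy illustration of our own; the function `gaussian_copula_sample` and the height marginal are assumptions for this example, not the paper's code):

```python
import numpy as np
from scipy.stats import norm

def gaussian_copula_sample(rho, n, rng):
    """Sample n pairs (U, V) from a bivariate Gaussian copula with
    correlation parameter rho; U and V are each uniform on (0, 1)."""
    cov = np.array([[1.0, rho], [rho, 1.0]])
    z = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    # Probability-integral transform: normal -> uniform marginals
    return norm.cdf(z)

rng = np.random.default_rng(1)
uv = gaussian_copula_sample(0.8, 5000, rng)

# Any marginals can now be imposed; e.g. a normal "heights" marginal
# (mean 68 inches, sd 2.5 -- illustrative values only):
heights = norm.ppf(uv[:, 0], loc=68.0, scale=2.5)
```

The copula thus separates the model of association (here, the parameter rho) from the model of the marginal distributions.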
Summary and conclusion
We have presented in three parts the different approaches to the concept of statistical association, starting from the correlation coefficient, proposed in regression terms by Galton and then improved by Pearson in the rigorous definition that we know today. We used the original data of Galton on the heights of parents and children, to illustrate the correlation coefficient and some associated methods. This coefficient has been extremely useful but is often interpreted incorrectly. The …
References (69)
Correspondence analysis and diagonal expansions in terms of distribution functions. J. Statist. Plann. Inference (2002).
On the covariance between functions. J. Multivariate Anal. (2002).
Constructing copula functions with weighted geometric means. J. Statist. Plann. Inference (2009).
Contributions to the diagonal expansion of a bivariate copula with continuous extensions. J. Multivariate Anal. (2015).
A parametric approach to correspondence analysis. Linear Algebra Appl. (2006).
On the relationship between Spearman’s rho and Kendall’s tau for pairs of continuous random variables. J. Statist. Plann. Inference (2007).
Correspondence analysis, association analysis, and generalized nonindependence analysis of contingency tables: Saturated and unsaturated models, and appropriate graphical displays.
A simple method for obtaining the maximal correlation coefficient and related characterizations. J. Multivariate Anal. (2013).
Biplots of compositional data. J. R. Stat. Soc. Ser. C. Appl. Stat. (2002).
L’Analyse des Données. Tome II. L’Analyse des Correspondances (1976).
Analyzing Quantitative Data: From Description to Explanation.
Canonical analysis of contingency tables with linear constraints. Psychometrika.
Analyse mathématique sur les probabilités des erreurs de situation d’un point. Mémoires présentés par divers savants à l’Académie des Sciences de l’Institut de France. Sci. Math. Phys.
Mathematical Methods of Statistics.
Interpreting an inequality in multiple regression. Am. Stat.
A continuous general multivariate distribution and its properties. Comm. Statist. Theory Methods.
A comparison of different methods for representing categorical data. Comm. Statist. Simulation Comput.
Two generalized bivariate FGM distributions and rank reduction. Comm. Statist. Theory Methods.
The importance of geometry in multivariate analysis and some applications.
Continuous extensions of matrix formulations in correspondence analysis, with applications to the FGM family of distributions.
Some multivariate measures based on distances and their entropy versions.
A comparison of confidence interval methods for the intraclass correlation coefficient. Biometrics.
Principles of Copula Theory.
A method for cluster analysis. Biometrics.
The performance of some correlation coefficients for a general bivariate distribution. Biometrika.
The precision of discriminant functions. Ann. Eugen.
Statistical Methods for Research Workers.
Polycor: polychoric and polyserial correlations.
Sur les tableaux de corrélation dont les marges sont données. Ann. Univ. Lyon. A Sci. Math. Astronomie.
Regression towards mediocrity in hereditary stature. J. Anthropol. Inst.
Galton height data.
Nonlinear Multivariate Analysis.
Canonical analysis of contingency tables by maximum likelihood. J. Amer. Statist. Assoc.