Two types of single-peaked data: Correspondence analysis as an alternative to principal component analysis
Introduction
In this paper we explore the added value of correspondence analysis (CA) over principal component analysis (PCA) for analyzing one-dimensional, single-peaked responses, i.e. data conforming to a one-dimensional unfolding model. We will discuss continuous, binary and graded responses.
Single-peaked (unimodal) data naturally arise in a variety of research settings, e.g. marketing research (for example, DeSarbo et al. (2002)), ecological research (for example, De’ath (1999)), and archeology (for example, Kendall (1971)). In psychology single-peaked response curves can be found, for instance, in attitude measurement (Roberts et al., 2000): people with moderate tolerance towards abortion are less likely to agree with items that are either very much in favor of abortion or very much against it.
The essence of an unfolding model is that the probability of agreement with a certain item is inversely related to the distance between the position of the item on the latent continuum and the position of the respondent; the closer an item is located near the respondent’s position on the latent continuum, the more likely the respondent will agree with it. In this case, the latent continuum is called bipolar: ranging from a negative extreme (very much against abortion), via a neutral midpoint (neither against nor in favor of abortion), to a positive extreme (very much in favor of abortion). In the unfolding literature the positions of respondents or objects on this continuum are referred to as ideal points (Coombs, 1964).
There is a vast literature on the inappropriateness of PCA for analyzing data conforming to an unfolding model (Coombs and Kao, 1960, Ross and Cliff, 1964, Davison, 1977, Van Schuur and Kiers, 1994, Van Schuur and Kruijtbosch, 1995, Andrich, 1996, Rost and Luo, 1997, Andrich and Styles, 1998, Roberts et al., 1999, Roberts et al., 2000, Maraun and Rossi, 2001). The main conclusions from this literature are, first, that PCA of one-dimensional unfolding data results in a two-component solution, leading to erroneous conclusions about the dimensionality of the data. Second, the component scores of the persons with extreme positions on the latent scale underestimate the true positions. Third, component loadings of the items with extreme positions on the latent scale are underestimated, resulting in a non-optimal item selection. In the two-component PCA solution both persons and items lie either on a semi-circle (cf. Davison (1977)) or on a semi-circle with inwardly folded endpoints (cf. Roberts et al. (2000)). This inward bending of the endpoints is what is meant by underestimation of the true positions of the extreme persons and items. The current paper offers CA as an alternative to PCA and relates the different problems with PCA, described above, to the distinction between two types of unfolding models. To start with the latter: on the one hand, we have models that are a quadratic function of the person-to-item distances, and on the other hand, we have models that are an exponential function of these distances.
The often quoted paper by Davison (1977, see also Maraun and Rossi (2001)) discusses the quadratic unfolding model. For data conforming to this type of model, PCA only suffers the aforementioned “extra-component” problem, but not the problem of underestimation of the locations of extreme persons or items. That is, in the two-component solution, the person and item locations lie on a semi-circle. Furthermore, the inter-item correlation matrix of this type of data shows a “simplex-like” pattern, also referred to as Robinson pattern (Hubert et al., 1998). That is, when the items are ordered according to their location on the latent scale, the correlations along the diagonal of the matrix will be highly positive, moving downward and to the left, the correlations will decrease first to zero, and will decrease further to negative in the lower left-hand corner.
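The Robinson pattern described above can be illustrated with a small simulation. The sketch below is only an illustration, not the simulation design of this paper: it assumes an error-free quadratic response of the form `-(theta - delta)**2`, standard normal person locations, and ten equally spaced item locations. The function name and all parameter values are hypothetical.

```python
import numpy as np

def quadratic_item_correlations(n_persons=1000, n_items=10, seed=0):
    """Inter-item correlations for error-free quadratic unfolding data (illustrative assumptions)."""
    rng = np.random.default_rng(seed)
    theta = rng.standard_normal(n_persons)      # assumed person locations
    delta = np.linspace(-3.0, 3.0, n_items)     # assumed item locations, ordered along the scale
    # Each response is highest (zero) where the person location equals the item location.
    responses = -(theta[:, None] - delta[None, :]) ** 2
    # Columns are items, so correlate over columns.
    return np.corrcoef(responses, rowvar=False)

corr = quadratic_item_correlations()
print(np.round(corr, 2))
```

With the items ordered by location, neighbouring items should correlate strongly and positively, while items at opposite ends of the scale should correlate negatively.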
The papers from the field of unfolding item response theory (IRT) (for example, Andrich (1996), Rost and Luo (1997), Andrich and Styles (1998) and Roberts et al. (2000)) discuss exponential unfolding models. For data conforming to this type of model, PCA suffers from both the “extra-component” problem and the problem of underestimation of the locations of extreme persons and items. That is, in the two-component solution, the person and item locations lie on a semi-circle with inwardly bending extremes, a pattern which can be described as a “horseshoe” pattern (cf. Greenacre (1984, p. 226–232)). The problem of the inwardly bending extremes in the PCA solution has also been discussed in the field of ecology (Swan, 1970, Noy-Meir and Austin, 1970, Hill, 1973, De’ath, 1999).
In this paper CA is proposed as an alternative to PCA, since CA is known to represent single-peaked data correctly: Ter Braak (1985) showed that CA approximates the maximum likelihood solution of the Gaussian ordination model; Heiser (1981) showed that CA recovers the person and item order of error-free ratings conforming to an unfolding model. In Section 1.1, the use of CA as an unfolding technique is explained further.
It is known that when unfolding data are strongly one-dimensional, a two-dimensional CA representation will show what is often referred to as the “arch-effect”, where the items and persons are ordered along an arch (but also on the first dimension) according to their position on the scale (for example, Hill (1974) and Hill and Gauch (1980)). We prefer the term “arch” to “horseshoe”, to stress the importance of the outward bending of the extremes of the arch. In that case, the first dimension reflects the correct order of items and persons, as opposed to solutions with inwardly bending extremes (which is the usual shape of a horseshoe), where the order of the items and persons gets mixed up at the endpoints of the first dimension.
When rating scale data are analyzed with CA, the variables are usually “doubled” to create pairs of variables, which form the positive and the negative poles of the rating scale (see for example, Greenacre (1993, Chapter 19), and Greenacre (2007, Chapter 23)). We will explain this type of data coding in CA in Section 1.2. In this paper we show that when unfolding data are doubled, CA, like PCA, is hampered by the undesirable inward bending of the extremes.
In this paper CA with and without doubling is compared to PCA with and without varimax rotation. For this purpose, we simulated continuous, binary, and graded responses using three different unfolding models, which are described in Section 1.3. The first is the quadratic unfolding model as discussed by Davison (1977), which results in continuous responses. Of the exponential unfolding model two variants are compared: the Gaussian ordination model (Ihm and van Groenewoud, 1984), which results in binary responses and the generalized graded unfolding model (Roberts et al., 2000), which results in graded responses. Furthermore, an empirical data set concerning the measurement of personality development was analyzed.
In this section we explain to what types of data CA can be applied, and we explain the rationale behind using CA as an unfolding technique. CA is a multivariate technique primarily developed for the analysis of contingency table data (Benzécri, 1992, Greenacre, 1984). However, the technique can be applied to a broader range of data types, as long as the entries of the table contain measures of association strength between row entries and column entries. The association measure is assumed to be some non-negative quantity, where lack of association is indicated by a zero entry (Heiser, 2001).
In the current paper, we use CA as an unfolding technique. Typically an advantage of CA in this context is that it simultaneously scales both persons and items. Of the three most common normalizations in CA (i.e. row principal, column principal, and symmetrical normalization) we choose row principal normalization, so that a person is represented as the centroid (weighted average, with weights proportional to the ratings) of the items he has rated. This approach results in an interpretation of person scores as ideal points (Coombs, 1964). A higher rating of a given person on a given item results in a smaller person-to-item distance in the CA solution. Hence the expected responses are a single-peaked function of the person scores in the CA solution.
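The centroid interpretation of row principal normalization can be sketched with a plain SVD-based CA. This is a minimal sketch under stated assumptions, not the software used in this paper: the function name `ca_row_principal` and the toy ratings table are hypothetical, and the table is assumed to have no empty rows or columns.

```python
import numpy as np

def ca_row_principal(table, n_dims=1):
    """CA via SVD; rows in principal coordinates, columns in standard coordinates."""
    N = np.asarray(table, dtype=float)
    P = N / N.sum()                       # correspondence matrix
    r = P.sum(axis=1)                     # row (person) masses
    c = P.sum(axis=0)                     # column (item) masses
    # Standardized residuals from the independence model r c'.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    U, sv, Vt = U[:, :n_dims], sv[:n_dims], Vt[:n_dims]
    row_scores = (U / np.sqrt(r)[:, None]) * sv   # principal coordinates of persons
    col_scores = Vt.T / np.sqrt(c)[:, None]       # standard coordinates of items
    return row_scores, col_scores, sv

# Hypothetical single-peaked ratings: 4 persons (rows) by 4 items (columns).
ratings = np.array([[5, 3, 1, 0],
                    [3, 5, 3, 1],
                    [1, 3, 5, 3],
                    [0, 1, 3, 5]])
person_scores, item_scores, singular_values = ca_row_principal(ratings)

# Centroid check: each person score is the rating-weighted average of the item scores.
profiles = ratings / ratings.sum(axis=1, keepdims=True)
centroids = profiles @ item_scores
print(np.allclose(person_scores, centroids))
```

The final check compares the person scores with the weighted averages of the item scores, which is the property that motivates reading person scores as ideal points.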
The distances between persons in a CA solution with row principal normalization approximate chi-square distances from below (Meulman, 1982, p. 33). The chi-square distance between two persons differs from the usual Euclidean distance, in that for each item, the squared difference between the persons’ scores is weighted by the inverse marginal proportion (i.e. the mass) of each item. As a consequence, persons and items with low mass tend to lie more in the periphery of the CA solution (see for example Ter Braak and Prentice (1988), reprinted in Ter Braak and Prentice (2004, p. 262–263)). In the context of attitude items with ratings ranging from 0 (totally disagree) to 5 (totally agree), an item that few persons choose, i.e. an extreme item, will have a low mass. Analogously, a person who agrees with only one item is very likely to have an extreme opinion, and will have a low mass. We will show that for these extreme items and persons CA (without doubling) typically results in appropriate scale values.
In this section we discuss two types of data coding in CA: undoubled and doubled data. These two approaches are also known as, respectively, asymmetric and symmetric treatment of response categories (see Gifi (1990, p. 294–295)).
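The weighting in the chi-square distance can be made concrete with a few lines of code. The sketch below computes the distance between the profiles (ratings divided by the row total) of two persons, assuming a non-negative table without empty rows or columns; the function name and the toy table are hypothetical.

```python
def chi_square_distance(table, r, s):
    """Chi-square distance between the profiles of rows r and s of a non-negative table."""
    total = sum(sum(row) for row in table)
    n_cols = len(table[0])
    # Column masses: marginal proportions of the items.
    col_mass = [sum(row[j] for row in table) / total for j in range(n_cols)]
    profile_r = [v / sum(table[r]) for v in table[r]]
    profile_s = [v / sum(table[s]) for v in table[s]]
    # Squared profile differences are weighted by the inverse item mass.
    squared = sum((a - b) ** 2 / m for a, b, m in zip(profile_r, profile_s, col_mass))
    return squared ** 0.5

# Two persons who each agree with a different single item.
toy_table = [[1, 0],
             [0, 1]]
print(chi_square_distance(toy_table, 0, 1))
```

Because the squared differences are divided by the item masses, a disagreement on a rarely endorsed item contributes more to the distance than the same disagreement on a popular item.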
Asymmetric treatment of response categories implies performing CA on the raw data table, where, for the simple case of binary responses, disagreement is denoted with 0 and agreement with 1. In effect, only agreement implies similarity between respondents, and not disagreement.
Symmetric treatment of the response categories demands a type of recoding of the data commonly known as “doubling” (see, for example, Benzécri (1992) or Greenacre (1984)). This type of data coding complements a respondent’s original ratings with the reverse of these ratings, which are obtained by subtracting the ratings from the maximum rating. For example, for a person with the ratings 0, 2, and 4 on three items with a six-point scale ranging from 0 to 5, the complete set of doubled scores would be 0, 2, 4 along with 5, 3, 1. In effect, both shared agreement with a certain item and shared disagreement imply similarity between respondents. An argument for this procedure is that agreement with a statement is the same as disagreement with the opposite of this statement, so that not all items need to be worded in the same direction.
However, in Heiser (1981, Chapter 4) as well as in Benzécri (1992, p. 391) it is stressed that if response categories are thought to give an asymmetrical type of information, CA should be performed on the undoubled (raw) data. (We assume that in the final paragraph on p. 391 of Benzécri (1992) the word “not” is missing by mistake after the word “is” in the sentence “But the presence or absence in a plant of a quality such as being a perennial is of the same nature”.) Even when not all items are worded in the same direction, no reverse scoring is needed, as long as the disagree category is coded with a zero score. In this case, the “attraction power” of items to persons, which is reflected in small person-to-item distances in the CA solution, is determined by high ratings. As a consequence, the proximity between persons in the CA solution depends on (the level of) shared agreement, and not on shared disagreement.
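The doubling step itself is simple to express in code. The following minimal sketch reproduces the worked example from the text; the function name `double_ratings` is hypothetical.

```python
def double_ratings(ratings, max_rating):
    """Return the original ratings followed by their reversals (max_rating - rating)."""
    reversed_ratings = [max_rating - r for r in ratings]
    return list(ratings) + reversed_ratings

# Ratings 0, 2 and 4 on a six-point scale that runs from 0 to 5.
doubled = double_ratings([0, 2, 4], max_rating=5)
print(doubled)
```

Each item thus contributes a positive pole (the original rating) and a negative pole (the reversed rating) to the table that CA analyzes.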
The argument for asymmetric treatment of the response categories is that a respondent can have only one reason for agreeing with a certain statement, but either one of two different reasons for disagreeing. That is, a respondent disagrees with the statement when he is either too “positive” to agree with the statement or too “negative”.
In the following we discuss the three different unfolding models that were used to generate single-peaked response data. We classify these models as either one of two different types of unfolding models, that is, quadratic or exponential. The first model is a quadratic function of the person-to-item distances, whereas the second and the third model are exponential functions of these distances.
To recognize single-peaked data empirically, Davison (1977) postulated predictions about the correlations and factor structure of responses to various items where the responses fit a metric, unidimensional unfolding model. Two models were compared. The first model produces error-free data:
$$x_{ij} = m_j - a_j(\theta_i - \delta_j)^2, \tag{1}$$
where
$x_{ij}$ is the response of person $i$ on item $j$;
$a_j$ is the discrimination parameter for item $j$;
$\theta_i$ is the ideal point for person $i$ on the underlying continuum;
$\delta_j$ is the location of item $j$ on the underlying continuum;
$m_j$ is the maximum of the curve for item $j$.
The discrimination parameter for a given item $j$, $a_j$, indicates the steepness of the response curve. In ecology, the inverse of the discrimination parameter is called the tolerance of species $j$, which is a measure of ecological amplitude. That is, the steeper the response curve, the smaller a species’ tolerance. Note that $a_j > 0$, otherwise the response curve would have a minimum instead of a maximum.
The second model discussed by Davison produces fallible data:
$$x_{ij} = m_j - a_j(\theta_i - \delta_j)^2 + e_{ij}, \tag{2}$$
where
$e_{ij}$ is a random normal deviate;
$\sigma_e^2$ is the variance of $e_{ij}$.
Under the assumption of model (1) with equal discrimination parameters ($a_j = a$), it follows from the results of Ross and Cliff (1964) that the matrix $X$ with elements $x_{ij}$ has rank 3. One of the three components involves the squared item locations $\delta_j^2$, which are constant across the rows of $X$, another involves the squared ideal points $\theta_i^2$, which are constant across the columns of $X$, and the third involves the $\theta_i$ and $\delta_j$ themselves. In addition, Ross and Cliff showed that centering the columns of $X$ reduces its rank to two. Building on these results, Davison (1977) concluded that (a) the item by item correlation matrix displays a simplex-like pattern, (b) the signs of first-order partial correlations can be specified in an empirically testable manner, and (c) the items will have a semi-circular, two-factor structure. Along the semi-circle, variables will be ordered by their positions on the latent dimension. This ordering is influenced by the amount of error included in the model: the most extreme items become mixed up with the last but one extreme items. These conclusions were based on data sets with 100 persons and 10 items, where the items had fixed equally spaced true scale values ranging from −3.00 to +3.00, and the 100 person scores were randomly sampled from a normal distribution. It turned out that the correlations and factor structure were robust to non-normality of the person score distribution.
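The rank results can be checked numerically. The sketch below assumes an error-free quadratic model of the form `m - a * (theta - delta)**2` with a common discrimination `a` and a common maximum `m`; the standard normal person distribution and the values of `a` and `m` are assumptions made here for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = rng.standard_normal(100)        # 100 person locations (assumed standard normal)
delta = np.linspace(-3.0, 3.0, 10)      # 10 equally spaced item locations
a, m = 1.0, 5.0                         # assumed common discrimination and maximum

# Error-free quadratic unfolding responses, persons in rows and items in columns.
X = m - a * (theta[:, None] - delta[None, :]) ** 2
X_col_centered = X - X.mean(axis=0, keepdims=True)

# An explicit tolerance guards against floating-point noise in the small singular values.
rank_raw = np.linalg.matrix_rank(X, tol=1e-8)
rank_centered = np.linalg.matrix_rank(X_col_centered, tol=1e-8)
print(rank_raw, rank_centered)
```

Expanding the square shows why: the data consist of a term depending only on the item, a term depending only on the person, and a person-by-item product, and column centering removes the item-only term.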
It should be noted that in CA not only the columns are centered, but the rows as well (double centering; Gifi (1990, Chapter 8)). For this case, Schoenemann (1970) showed that double centering of $X$ further reduces the rank to one, and that the $\theta$- and $\delta$-scores are recovered up to a scale factor. Therefore, when we generate data under the Davison model, we will obtain exactly one component with non-zero inertia in CA, due to the double centering. However, the joint scale of the scores depends on the chosen normalization, and may not be equal to the original one.
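The effect of double centering can be sketched in the same way. Note that the example uses plain (unweighted) double centering of an error-free quadratic data matrix, not the weighted centering that CA applies to a table of proportions; the data-generating values are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
theta = rng.standard_normal(100)        # assumed person locations
delta = np.linspace(-3.0, 3.0, 10)      # assumed item locations
X = 5.0 - (theta[:, None] - delta[None, :]) ** 2

# Double centering: remove row means and column means, add back the grand mean.
X_dc = X - X.mean(axis=1, keepdims=True) - X.mean(axis=0, keepdims=True) + X.mean()
rank_dc = np.linalg.matrix_rank(X_dc, tol=1e-8)

# The single remaining component carries the person and item locations up to scale and sign.
U, s, Vt = np.linalg.svd(X_dc, full_matrices=False)
person_scores = U[:, 0] * s[0]
item_scores = Vt[0]
r_person = abs(np.corrcoef(person_scores, theta)[0, 1])
r_item = abs(np.corrcoef(item_scores, delta)[0, 1])
print(rank_dc, round(r_person, 6), round(r_item, 6))
```

After double centering only the product of the centered person and item locations remains, which is why one component suffices and the recovered scores correlate perfectly (in absolute value) with the generating locations.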
Here it will be shown that CA approximates the Gaussian ordination model. This is a well-known model in the field of ecology for the single-peaked relationships between the abundance of a species and some environmental variable. However, it could also model the single-peaked relationships between the attitude of a person and some attitude item. Results follow from Ter Braak (1987) and Ihm and van Groenewoud (1984). We will start with the Gaussian ordination model as proposed by Ihm and van Groenewoud. This model is somewhat more general than the standard model since it has an extra parameter ($\kappa_i$) to account for different masses of the persons. The response of person $i$ on item $j$ is approximated by a model using maximum likelihood given a binomial (or multinomial) distribution. The Gaussian ordination model is
$$\pi_{ij} = \kappa_i m_j \exp\left[-a_j(\theta_i - \delta_j)^2\right], \tag{3}$$
where
$\pi_{ij}$ is the probability that person $i$ agrees with item $j$;
$\theta_i$ is the ideal point for person $i$ on the underlying continuum;
$\delta_j$ is the location of item $j$ on the underlying continuum;
$m_j$ is the maximum of the curve for item $j$;
$a_j$ is the discrimination parameter for item $j$.
Assuming $a_j = a$ (equal discrimination parameters) we can rewrite (3) into
$$\pi_{ij} = r_i c_j \exp(2a\theta_i\delta_j), \tag{4}$$
with $r_i = \kappa_i \exp(-a\theta_i^2)$ and $c_j = m_j \exp(-a\delta_j^2)$.
Using the Taylor expansion of first order we obtain
$$\pi_{ij} \approx r_i c_j (1 + 2a\theta_i\delta_j). \tag{5}$$
The least-squares estimate of $r_i c_j$ is $p_{i+}p_{+j}$, the product of the marginal proportions of person $i$ and item $j$.
Inserting this expression in (5) we obtain
$$\pi_{ij} \approx p_{i+}p_{+j}(1 + 2a\theta_i\delta_j), \tag{6}$$
which is the CA model with one component. Note that the first-order Taylor expansion works well for small values of the interaction term $2a\theta_i\delta_j$. But the relation of CA with the Gaussian ordination model holds true as well for large values (Ter Braak, 1985, Ter Braak, 1987). See also Ter Braak (1988) and Zhu et al. (2005) for this link in constrained CA.
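The quality of the first-order Taylor step can be examined numerically. The sketch below assumes a Gaussian response curve parameterized as `m * exp(-a * (theta - delta)**2)`, so that the interaction term is `2 * a * theta * delta`; this parameterization, the function names, and the chosen values are assumptions for illustration, and the person mass parameter is omitted.

```python
import math

def gaussian_response(theta, delta, m=1.0, a=0.5):
    """Single-peaked curve m * exp(-a * (theta - delta)**2) (assumed parameterization)."""
    return m * math.exp(-a * (theta - delta) ** 2)

def bilinear_approximation(theta, delta, m=1.0, a=0.5):
    """Row term times column term times (1 + interaction), from exp(t) ~ 1 + t."""
    row_term = math.exp(-a * theta ** 2)
    col_term = m * math.exp(-a * delta ** 2)
    return row_term * col_term * (1.0 + 2.0 * a * theta * delta)

def relative_error(theta, delta):
    exact = gaussian_response(theta, delta)
    return abs(exact - bilinear_approximation(theta, delta)) / exact

small = relative_error(0.1, 0.2)   # small interaction term
large = relative_error(2.0, 2.0)   # large interaction term
print(small, large)
```

The approximation is accurate when the product of person and item locations is close to zero and deteriorates quickly away from it, which is why the text stresses that the link between CA and Gaussian ordination for large values rests on other results.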
The generalized graded unfolding model (GGUM) is a parametric item response model that has been well developed and incorporates features such as variable item discrimination and variable threshold parameters for the response categories (Roberts et al., 2000). The GGUM allows for binary or graded responses, but will be used in the current paper to generate responses on a six-point rating scale. One premise of the GGUM is that for each person there are two subjective responses associated with each observable response, except for the totally agree response. These subjective responses can be seen as two distinct reasons for a person’s response. For instance, when a person strongly disagrees with a certain item, this could be for either of two reasons. If on the underlying continuum the item is located more to the right extreme than the person, the person disagrees from below the item. However, if the item is located more to the left extreme than the person, the person disagrees from above the item. The probability that a person will respond using a particular observable answer category is defined as the sum of the probabilities associated with the two corresponding subjective responses. Specifically, the model has the form:
$$P(Z_j = z \mid \theta_i) = \frac{\exp\left\{\alpha_j\left[z(\theta_i - \delta_j) - \sum_{k=0}^{z}\tau_{jk}\right]\right\} + \exp\left\{\alpha_j\left[(M - z)(\theta_i - \delta_j) - \sum_{k=0}^{z}\tau_{jk}\right]\right\}}{\sum_{w=0}^{C}\left(\exp\left\{\alpha_j\left[w(\theta_i - \delta_j) - \sum_{k=0}^{w}\tau_{jk}\right]\right\} + \exp\left\{\alpha_j\left[(M - w)(\theta_i - \delta_j) - \sum_{k=0}^{w}\tau_{jk}\right]\right\}\right)}, \tag{7}$$
where
$Z_j$ is an observable response to attitude item $j$;
$z = 0$ corresponds to the strongest level of disagreement;
$z = C$ corresponds to the strongest level of agreement;
$M$ is the number of subjective response categories minus 1;
$C$ is the number of response categories minus 1;
$\theta_i$ is the location of person $i$ on the attitude continuum;
$\delta_j$ is the location of item $j$ on the attitude continuum;
$\alpha_j$ is the discrimination of attitude statement $j$; and
$\tau_{jk}$ is the location of the $k$th subjective response category threshold on the attitude continuum relative to the location of item $j$.
Method
The aim of the present research is to compare the performance of CA (with and without doubled items) and PCA (with and without varimax rotation) in terms of the recovery of the “true” scale values. Three types of scale values were of interest: person scale values, item scale values, and scale values of persons and items taken together, referred to as the joint scale.
We chose to include CA with doubled items, with the aim of testing the presumption that, in the case of unfolding data, asymmetric
The three benchmark datasets
This section of results consists of two parts. First the matrices of inter-item correlations for the three benchmark datasets are compared. Second, the results of the two types of CA are compared to the results of the two types of PCA.
The inter-item correlations for the benchmark data conforming to model 1 are displayed in Table 1. The correlation matrix shows a strong Robinson pattern. The inter-item correlations for the benchmark datasets conforming to model 2 and 3 are similar with respect
Discussion
Across all analyses, CA without doubling performs best for unfolding data generated with three different single-peaked models. We have to make one reservation however.
Both the analysis results for the three unfolding benchmark datasets and the results of the simulation study showed that in the case of the model 1 data CA recovered the joint scale poorly, whereas CA of the doubled data recovered the joint scale well. This is an exception in the current and existing results referred to in this
Acknowledgement
This research was conducted while Mark de Rooij was sponsored by the Netherlands Organisation for Scientific Research (NWO), Innovational Grant, no. 452-06-002.
References (42)
Correspondence analysis.
Ter Braak and Prentice (1988). A theory of gradient analysis. Advances in Ecological Research.
Ter Braak and Prentice (2004). A theory of gradient analysis. Advances in Ecological Research.
Zhu et al. (2005). Constrained ordination analysis with flexible response functions. Ecological Modelling.
et al. (2001). The developmental profile. Journal of Personality Disorders.
Andrich (1996). A hyperbolic cosine latent trait model for unfolding polytomous responses: Reconciling Thurstone and Likert methodologies. British Journal of Mathematical and Statistical Psychology.
Andrich and Styles (1998). The structural relationship between attitude and behavior statements from the unfolding perspective. Psychological Methods.
Benzécri (1992). Correspondence Analysis Handbook.
Coombs (1964). A Theory of Data.
Coombs and Kao (1960). On a connection between factor analysis and multidimensional unfolding. Psychometrika.
Davison (1977). On a metric unidimensional unfolding model for attitudinal and developmental data. Psychometrika.
De’ath (1999). Principal curves: A technique for indirect and direct gradient analysis. Ecology.
DeSarbo et al. (2002). A gravity based multidimensional scaling model for deriving spatial structures underlying consumer preference/choice judgements. Journal of Consumer Research.
Gifi (1990). Nonlinear Multivariate Analysis.
Greenacre (1984). Theory and Applications of Correspondence Analysis.
Greenacre (1993). Correspondence Analysis in Practice.
Greenacre (2007). Correspondence Analysis in Practice.
Hill (1973). Reciprocal averaging: An eigenvector method of ordination. Journal of Ecology.
Hill (1974). Correspondence analysis: A neglected multivariate method. Applied Statistics.