Summary
The traditional approach to the interpretation of the results from a Principal Component Analysis implicitly discards variables that are weakly correlated with the most important and/or most interesting Principal Components. Some authors argue that this practice is potentially misleading and that it is preferable to take a variable selection approach, comparing variable subsets according to appropriate approximation criteria. In this paper, we propose algorithms for the comparison of all possible subsets according to some of the most important comparison criteria proposed to date. The computational effort of the proposed algorithms is studied and it is shown that, given current computer technology, they are feasible for problems involving up to thirty variables. A freely available software implementation can be downloaded from the Internet.
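As a rough illustration of the all-subsets idea, the following brute-force sketch (not the paper's pivot-based algorithms) scores every k-variable subset with one simple variance-explained criterion of the kind compared in this literature: the fraction of the total variance of all p variables recovered by regressing them on the subset. All function names are illustrative.

```python
import itertools
import numpy as np

def explained_variance(S, subset):
    """Fraction of total variance of all p variables recovered by
    projecting onto the variables in `subset` (a variance-explained
    criterion: tr(S_{.1} S_{11}^{-1} S_{1.}) / tr(S))."""
    K = list(subset)
    S_Kall = S[K, :]                      # k x p block S_{1.}
    S_KK = S[np.ix_(K, K)]                # k x k block S_{11}
    return np.trace(S_Kall.T @ np.linalg.solve(S_KK, S_Kall)) / np.trace(S)

def best_subset(S, k):
    """Brute-force comparison of all k-variable subsets."""
    p = S.shape[0]
    scored = [(explained_variance(S, c), c)
              for c in itertools.combinations(range(p), k)]
    return max(scored)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 6))
X[:, 5] = X[:, 0] + 0.01 * rng.standard_normal(100)   # near-redundant variable
S = np.cov(X, rowvar=False)
score, subset = best_subset(S, 3)
```

Exhaustive enumeration visits 2**p - 2 proper subsets, roughly 10**9 when p = 30, which is why the careful per-pivot updating schemes studied in the paper matter.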
Notes
2. A different operator to update \(RV({\bf{X}},\,{{\bf{X}}_{1'}}{\bf{M}})\) can be found in Bonifas et al. (1984). However, the operator presented there appears to be computationally more demanding than the one we use, since it requires the computation of k + 1 bilinear forms involving the n-dimensional rows extracted from the matrix \({\bf{L}}_{1'1'}^{ - 1}{\bf{X}}_{1'}^{T}\), where \({{\bf{L}}_{1'1'}}\) is the lower-triangular Cholesky factor of \({{\bf{S}}_{1'1'}}\), i.e. \({{\bf{S}}_{1'1'}} = {{\bf{L}}_{1'1'}}{\bf{L}}_{1'1'}^{T}\).
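For reference, the RV coefficient of Robert & Escoufier (1976) can also be computed directly from the two data matrices, without any incremental updating; a minimal sketch (the function name is illustrative):

```python
import numpy as np

def rv_coefficient(X, Y):
    """RV coefficient of Robert & Escoufier (1976) between two
    data matrices with the same number of rows:
    tr(Sxy Sxy') / sqrt(tr(Sxx^2) tr(Syy^2)), after column-centring."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Sxy = Xc.T @ Yc
    Sxx = Xc.T @ Xc
    Syy = Yc.T @ Yc
    num = np.trace(Sxy @ Sxy.T)
    den = np.sqrt(np.trace(Sxx @ Sxx) * np.trace(Syy @ Syy))
    return num / den

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 4))
rv_self = rv_coefficient(X, X)        # a configuration matches itself
rv_scaled = rv_coefficient(X, 2.5 * X)  # RV is invariant to rescaling
```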
References
Beale, E.M.L., Kendall, M.G. & Mann, D.W. (1967), “The Discarding of Variables in Multivariate Analysis”, Biometrika, 54, 357–366.
Bonifas, L., Escoufier, Y., Gonzalez, P.L. & Sabatier R. (1984), “Choix de Variables en Analyse en Composantes Principales”, Revue de Statistique Appliquée, 32, 2, 5–15.
Cadima, J.F. & Jolliffe, I.T. (1995), “Loadings and Correlations in the Interpretation of Principal Components”, Journal of Applied Statistics, 22, 2, 203–214.
Cadima, J.F. & Jolliffe, I.T. (2001), “Variable Selection and the Interpretation of Principal Subspaces”, To appear in Journal of Agricultural, Biological and Environmental Statistics.
Duarte Silva, A.P. (1998), A Leaps and Bounds Algorithm for Variable Selection in Two-Group Discriminant Analysis, in “Advances in Data Science and Classification”, IFCS, Springer, 227–232.
Duarte Silva, A.P. (2001), “Efficient Variable Screening for Multivariate Analysis”, Journal of Multivariate Analysis, 76, 1, 35–62.
Fenneteau, H. & Bialès, C. (1993), Analyse Statistique des Données. Applications et Cas pour le Marketing, Ellipses, Paris.
Furnival, G.M. (1971), “All Possible Regressions with Less Computation”, Technometrics, 13, 403–408.
Furnival, G.M. & Wilson, R.W. (1974), “Regressions by Leaps and Bounds”, Technometrics, 16, 499–511.
Jeffers, J.N.R. (1967), “Two Case Studies in the Application of Principal Components Analysis”, Applied Statistics, 16, 225–236.
Jolliffe, I.T. (1972), “Discarding Variables in a Principal Component Analysis, I: Artificial Data”, Applied Statistics, 21, 160–173.
Jolliffe, I.T. (1973), “Discarding Variables in a Principal Component Analysis, II: Real Data”, Applied Statistics, 22, 21–31.
Krzanowski, W.J. (1987), “Selection of Variables to Preserve Multivariate Data Structure using Principal Components”, Applied Statistics, 36, 22–33.
Lebart, L., Morineau, A. & Piron, M. (1995), Statistique Exploratoire Multidimensionnelle, Dunod, Paris.
Lawless, J. & Singhai, K. (1978), “Efficient Screening of Nonnormal Regression Models”, Biometrics, 34, 318–327.
McCabe, G.P. (1975), “Computations for Variable Selection in Discriminant Analysis”, Technometrics, 17, 103–109.
McCabe, G.P. (1984), “Principal Variables”, Technometrics, 26, 2, 137–144.
Minhoto, M.J. (1998), A Redução de Dimensionalidade Através de Subconjuntos de Variáveis Observadas, Unpublished Master Thesis. Universidade Técnica de Lisboa. Instituto Superior de Agronomia, Lisboa.
Morrison, D.F. (1990), Multivariate Statistical Methods, 3rd ed., McGraw-Hill, New York.
Ramsay, J.O., ten Berge, J. & Styan, G.P.H. (1984), “Matrix Correlation”, Psychometrika, 49, 3, 403–423.
Robert, P. & Escoufier, Y. (1976), “A Unifying Tool for Linear Multivariate Statistical Methods: The RV-Coefficient”, Applied Statistics, 25, 3, 257–265.
Appendix: Counting Operations
Formulae (30)–(35) show the number of floating point operations per pivot required to update the comparison criteria (from a k-dimensional subset to a (k + 1)-dimensional subset) and the matrix and/or vector elements, associated with t given variables, that will be needed in future steps.
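Each pivot of this kind amounts to a rank-one (Schur complement) update of the conditional covariance of the candidate variables. A minimal sketch of that single update, checked against direct conditioning (illustrative only; the paper's recursions in (30)–(35) also update the criterion values themselves):

```python
import numpy as np

def pivot_update(S_cond, u):
    """One pivot step: move the candidate variable with local index `u`
    into the selected subset.  Given S22|1, the conditional covariance
    of the candidate variables on the current k-variable subset, return
    the conditional covariance of the remaining candidates given the
    enlarged (k+1)-variable subset: a rank-one Schur complement update."""
    s_uu = S_cond[u, u]
    s_u = S_cond[:, u]
    S_new = S_cond - np.outer(s_u, s_u) / s_uu
    keep = [i for i in range(S_cond.shape[0]) if i != u]
    return S_new[np.ix_(keep, keep)]

# Check against direct conditioning on the subset {0, 1}.
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 5))
S = A.T @ A / 30                       # a positive definite "covariance"
S22_1 = S[1:, 1:] - np.outer(S[1:, 0], S[0, 1:]) / S[0, 0]   # condition on var 0
after_pivot = pivot_update(S22_1, 0)   # now also select var 1
direct = S[2:, 2:] - S[2:, :2] @ np.linalg.solve(S[:2, :2], S[:2, 2:])
```

That the iterated update agrees with direct conditioning is the quotient property of Schur complements, which is what makes the pivot-by-pivot scheme valid.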
[Table: floating point operations per pivot, by criterion; formulae (30)–(35).]
The results in (30), (31) and (35) follow from a direct count of the number of operations required to implement (9), (13) and (16) and to update the different elements of S22|1 and (va)2|1 that will be needed later. The results in (32) and (33) follow from the fact that \({{\bf{S}}_{2'2'\left| {1'} \right.}}\) and \({\bf{S}}_{2'2'}^{ - 1}\) each have \({{\left( {p - k - 1} \right)\,\left( {p - k} \right)} \over 2} = {1 \over 2}\left( {{{\left( {p - k - 1} \right)}^2} + \left( {p - k - 1} \right)} \right)\) distinct elements that need to be updated and squared or cross-multiplied; for these two criteria, any sequencing of the F-MC algorithm leads to the same effort. To compute the effort required to update the \(tr\,\left( {{{\left( {{\bf{S}}_{11}^{ - 1}{{\left( {{{\bf{S}}^2}} \right)}_{11}}} \right)}^2}} \right)\) criterion, we first note that the matrix \({\bf{S}}_{1'1'}^{ - 1}{\left( {{{\bf{S}}^2}} \right)_{1'1'}}\), not being in general symmetric, has \((k + 1)^2\) distinct elements. Therefore the number of operations required to update this criterion using (6), (14) and (15) equals \(\left( {k + 1} \right)k + k + {\left( {k + 1} \right)^2} + {1 \over 2}\left( {k + 1} \right)\left( {k + 2} \right) = {5 \over 2}{k^2} + {{11} \over 2}k + 2\). Updating the additional elements of \({\left( {{\bf{S}}_{11}^{ - 1}{{\left( {{{\bf{S}}^2}} \right)}_{1 \bullet }}} \right)^2}\), \(\left( {{\bf{S}}_{11}^{ - 1}\,{{\bf{S}}_{12}}} \right)\) and S22|1 that will be needed in future steps requires \({1 \over 2}{t^2} + \left( {{7 \over 2} + 3k} \right)t\) more operations. The result in (34) is the sum of these two quantities.
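The closing algebra of the per-pivot count can be double-checked with exact rational arithmetic; a small sketch (function names are illustrative):

```python
from fractions import Fraction

def trace_sq_update_ops(k):
    """Per-pivot cost assembled term by term as in the appendix:
    (k+1)k + k + (k+1)^2 + (k+1)(k+2)/2."""
    return (k + 1) * k + k + (k + 1)**2 + Fraction((k + 1) * (k + 2), 2)

def closed_form(k):
    """The closed form stated in the text: (5/2)k^2 + (11/2)k + 2."""
    return Fraction(5, 2) * k**2 + Fraction(11, 2) * k + 2

checks = all(trace_sq_update_ops(k) == closed_form(k) for k in range(200))
```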
Formulae (36)–(41) show the total number of floating point operations required by the different criteria.
[Table: total number of floating point operations, by criterion; formulae (36)–(41).]
These results were derived with the help of the known equalities (42)–(43) and the result (44), which can easily be proved by induction.
In particular, the results in (36) and (41) follow directly from (30), (35) and (44). The result (37) follows from (31), (42), (44) and noticing that from a k-dimensional subset there are \(\binom{p - t}{k}\) different pivots on the variable \(X_u = X_{t+1}\). Result (38) can be derived from (32), (42), (43) and (44). Result (39) follows from (33), (42), (43), (44), the addition of the \({{{p^2}\left( {p + 1} \right)} \over 2}\) operations required for the initial inversion of S, and noticing that for this criterion only \(\binom{p - 1}{p - k - 1}\) pivots to k-dimensional subsets (k = 1, 2, …, p − 2) are necessary. This latter remark follows from the fact that all subsets including one given variable can be evaluated simultaneously with the complementary subsets, using the relation \(tr\left( {{\bf{S}}_{11}^{ - 1}\,{{\bf{S}}_{11\left| 2 \right.}}} \right) = 2k - p + tr\left( {{\bf{S}}_{22}^{ - 1}\,{{\bf{S}}_{22\left| 1 \right.}}} \right)\), which follows from (4). Finally, (40) was derived from (34), (42), (43) and (44), adding the \({{{p^3} + p} \over 2}\) operations needed for the initial computation of \({\bf{S}}^2\).
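The remark that complementary subsets can be evaluated together can be illustrated by counting subsets rather than literal pivots (a simplification of the counts in (36)–(41); function names are illustrative): pairing each subset that contains a fixed variable with its complement roughly halves the work.

```python
from math import comb

def subsets_without_pairing(p):
    """All proper non-empty subsets: every cardinality k = 1, ..., p-1."""
    return 2**p - 2

def subsets_with_pairing(p):
    """Subsets visited directly when each subset containing a fixed
    variable is scored together with its complement, as in the text:
    sum over k = 1, ..., p-2 of C(p-1, p-k-1)."""
    return sum(comb(p - 1, p - k - 1) for k in range(1, p - 1))

# The paired count collapses to 2^(p-1) - 2, half the unpaired total.
halved = all(subsets_with_pairing(p) == 2**(p - 1) - 2 for p in range(3, 20))
```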
Duarte Silva, A.P. Discarding Variables in a Principal Component Analysis: Algorithms for All-Subsets Comparisons. Computational Statistics 17, 251–271 (2002). https://doi.org/10.1007/s001800200105