Discarding Variables in a Principal Component Analysis: Algorithms for All-Subsets Comparisons

Computational Statistics

Summary

The traditional approach to the interpretation of the results from a Principal Component Analysis implicitly discards variables that are weakly correlated with the most important and/or most interesting Principal Components. Some authors argue that this practice is potentially misleading and that it is preferable to take a variable selection approach, comparing variable subsets according to appropriate approximation criteria. In this paper, we propose algorithms for the comparison of all possible subsets according to some of the most important comparison criteria proposed to date. The computational effort of the proposed algorithms is studied and it is shown that, given current computer technology, they are feasible for problems involving up to thirty variables. A free-domain software implementation can be downloaded from the Internet.

Notes

  1. Bonifas, Escoufier, Gonzalez & Sabatier (1984) present a generalization of criterion (5) that is applicable to analyses based on several different metrics.

  2. A different operator to update RV(X, X1 M) can be found in Bonifas et al. (1984). However, the operator presented there appears to be computationally more demanding than the one we use, since it requires the computation of k + 1 bilinear forms involving the n-dimensional rows extracted from the matrix \({\bf{L}}_{{1^\prime }{1^\prime }}^{ - 1}{\bf{X}}_{{1^\prime }}^{\bf{T}}\), where \({{\bf{S}}_{{1^\prime }{1^\prime }}} = {{\bf{L}}_{{1^\prime }{1^\prime }}}{\bf{L}}_{{1^\prime }{1^\prime }}^{\bf{T}}\) is the Choleski decomposition of \({{\bf{S}}_{{1^\prime }{1^\prime }}}\).

References

  • Beale, E.M.L., Kendall, M.G. & Mann, D.W. (1967), “The Discarding of Variables in Multivariate Analysis”, Biometrika, 54, 357–366.

  • Bonifas, L., Escoufier, Y., Gonzalez, P.L. & Sabatier, R. (1984), “Choix de Variables en Analyse en Composantes Principales”, Revue de Statistique Appliquée, 32, 2, 5–15.

  • Cadima, J.F. & Jolliffe, I.T. (1995), “Loadings and Correlations in the Interpretation of Principal Components”, Journal of Applied Statistics, 22, 2, 203–214.

  • Cadima, J.F. & Jolliffe, I.T. (2001), “Variable Selection and the Interpretation of Principal Subspaces”, to appear in Journal of Agricultural, Biological and Environmental Statistics.

  • Duarte Silva, A.P. (1998), “A Leaps and Bounds Algorithm for Variable Selection in Two-Group Discriminant Analysis”, in Advances in Data Science and Classification, IFCS, Springer, 227–232.

  • Duarte Silva, A.P. (2001), “Efficient Variable Screening for Multivariate Analysis”, Journal of Multivariate Analysis, 76, 1, 35–62.

  • Fenneteau, H. & Bialès, C. (1993), Analyse Statistique des Données. Applications et Cas pour le Marketing, Ellipses, Paris.

  • Furnival, G.M. (1971), “All Possible Regressions with Less Computation”, Technometrics, 13, 403–408.

  • Furnival, G.M. & Wilson, R.W. (1974), “Regressions by Leaps and Bounds”, Technometrics, 16, 499–511.

  • Jeffers, J.N.R. (1967), “Two Case Studies in the Application of Principal Components Analysis”, Applied Statistics, 16, 225–236.

  • Jolliffe, I.T. (1972), “Discarding Variables in a Principal Component Analysis, I: Artificial Data”, Applied Statistics, 21, 160–173.

  • Jolliffe, I.T. (1973), “Discarding Variables in a Principal Component Analysis, II: Real Data”, Applied Statistics, 22, 21–31.

  • Krzanowski, W.J. (1987), “Selection of Variables to Preserve Multivariate Data Structure using Principal Components”, Applied Statistics, 36, 22–33.

  • Lebart, L., Morineau, A. & Piron, M. (1995), Statistique Exploratoire Multidimensionnelle, Dunod, Paris.

  • Lawless, J. & Singhal, K. (1978), “Efficient Screening of Nonnormal Regression Models”, Biometrics, 34, 318–327.

  • McCabe, G.P. (1975), “Computations for Variable Selection in Discriminant Analysis”, Technometrics, 17, 103–109.

  • McCabe, G.P. (1984), “Principal Variables”, Technometrics, 26, 2, 137–144.

  • Minhoto, M.J. (1998), A Redução de Dimensionalidade Através de Subconjuntos de Variáveis Observadas, unpublished Master's thesis, Universidade Técnica de Lisboa, Instituto Superior de Agronomia, Lisboa.

  • Morrison, D.F. (1990), Multivariate Statistical Methods, 3rd ed., McGraw-Hill, New York.

  • Ramsay, J.O., ten Berge, J. & Styan, G.P.H. (1984), “Matrix Correlation”, Psychometrika, 49, 3, 403–423.

  • Robert, P. & Escoufier, Y. (1976), “A Unifying Tool for Linear Multivariate Statistical Methods: The RV-Coefficient”, Applied Statistics, 25, 3, 257–265.

Author information

Correspondence to António Pedro Duarte Silva.

Appendix: Counting Operations

Formulae (30)–(35) give the number of floating point operations per pivot required to update each comparison criterion (when moving from a k-dimensional subset to a (k + 1)-dimensional subset), together with the matrix and/or vector elements associated with t given variables that will be needed in future steps.

Criterion and floating point operations per pivot

$$\left| \mathbf{S}_{11} \right| \qquad \frac{1}{2}t^2 + \frac{3}{2}t + 1$$
(30)
$$tr\,\mathbf{S}_{22|1} \qquad \left( 2 + t \right)\left( p - k - 1 \right) - \frac{1}{2}t^2 - \frac{1}{2}t$$
(31)
$$\left\| \mathbf{S}_{22|1} \right\|^2 \qquad \left( p - k - 1 \right)^2 + 2\left( p - k - 1 \right)$$
(32)
$$tr\left( \mathbf{S}_{22}^{-1}\,\mathbf{S}_{22|1} \right) \qquad \frac{3}{2}\left( p - k - 1 \right)^2 + \frac{7}{2}\left( p - k - 1 \right)$$
(33)
$$tr\left( \left( \mathbf{S}_{11}^{-1}\left( \mathbf{S}^2 \right)_{11} \right)^2 \right) \qquad \frac{5}{2}k^2 + \left( \frac{11}{2} + 3t \right)k + \frac{1}{2}t^2 + \frac{7}{2}t + 2$$
(34)
$$tr\left( \mathbf{S}_{11}^{-1}\,\mathbf{S}\mathbf{G}_{11} \right) \qquad \frac{1}{2}t^2 + \left( \frac{3}{2} + q \right)t + 2q$$
(35)

The results in (30), (31) and (35) follow from a direct count of the number of operations required to implement (9), (13) and (16) and to update the different elements of \(\mathbf{S}_{22|1}\) and \((v_a)_{2|1}\) that will be needed later.

The results in (32) and (33) follow from the fact that \(\mathbf{S}_{2'2'|1'}\) and \(\mathbf{S}_{2'2'}^{-1}\) each have \(\frac{(p - k - 1)(p - k)}{2} = \frac{1}{2}\left( (p - k - 1)^2 + (p - k - 1) \right)\) different elements that need to be updated and squared or cross-multiplied. For these two criteria, any sequencing of the F-MC algorithm leads to the same effort.

To compute the effort required to update the \(tr\left( \left( \mathbf{S}_{11}^{-1}\left( \mathbf{S}^2 \right)_{11} \right)^2 \right)\) criterion, we first note that the matrix \(\mathbf{S}_{1'1'}^{-1}\left( \mathbf{S}^2 \right)_{1'1'}\), not being symmetric in general, has \((k + 1)^2\) different elements. Therefore the number of operations required to update this criterion using (6), (14) and (15) equals \((k + 1)k + k + (k + 1)^2 + \frac{1}{2}(k + 1)(k + 2) = \frac{5}{2}k^2 + \frac{11}{2}k + 2\). In order to update the additional elements of \(\left( \mathbf{S}_{11}^{-1}\left( \mathbf{S}^2 \right)_{1 \bullet} \right)^2\), \(\mathbf{S}_{11}^{-1}\,\mathbf{S}_{12}\) and \(\mathbf{S}_{22|1}\) that will be needed in future steps, \(\frac{1}{2}t^2 + \left( \frac{7}{2} + 3k \right)t\) more operations are required. The result in (34) is the sum of these two quantities.
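As a worked check of the count behind (34), the two quantities quoted above can be expanded term by term. This is only an expansion of the stated expressions, not an additional result:

$$(k + 1)k + k + (k + 1)^2 + \frac{1}{2}(k + 1)(k + 2) = \left( k^2 + 2k \right) + \left( k^2 + 2k + 1 \right) + \left( \frac{1}{2}k^2 + \frac{3}{2}k + 1 \right) = \frac{5}{2}k^2 + \frac{11}{2}k + 2$$

$$\frac{5}{2}k^2 + \frac{11}{2}k + 2 + \frac{1}{2}t^2 + \left( \frac{7}{2} + 3k \right)t = \frac{5}{2}k^2 + \left( \frac{11}{2} + 3t \right)k + \frac{1}{2}t^2 + \frac{7}{2}t + 2$$

The cross term 3kt is grouped with k in (34), which is why the coefficient of k there depends on t.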

Formulae (36)–(41) give the total number of floating point operations required by the different criteria.

Criterion and total number of floating point operations

$$\left| \mathbf{S}_{11} \right| \qquad 4\left( 2^p \right) - \frac{1}{2}p^2 - \frac{5}{2}p - 4$$
(36)
$$tr\,\mathbf{S}_{22|1} \qquad \left( \frac{3}{2}p - 1 \right)\left( 2^p \right) - \frac{1}{2}p^2 - \frac{3}{2}p + 1$$
(37)
$$\left\| \mathbf{S}_{22|1} \right\|^2 \qquad \left( \frac{1}{4}p^2 + \frac{5}{4}p \right)\left( 2^p \right) - p^2 - 2p$$
(38)
$$tr\left( \mathbf{S}_{22}^{-1}\,\mathbf{S}_{22|1} \right) \qquad \left( \frac{3}{16}p^2 + \frac{11}{16}p - \frac{7}{8} \right)\left( 2^p \right) + \frac{1}{2}p^3 - p^2 - \frac{1}{2}p + 2$$
(39)
$$tr\left( \left( \mathbf{S}_{11}^{-1}\left( \mathbf{S}^2 \right)_{11} \right)^2 \right) \qquad \left( \frac{5}{8}p^2 + \frac{19}{8}p - 2 \right)\left( 2^p \right) + \frac{1}{2}p^3 - \frac{1}{2}p^2 - p + 2$$
(40)
$$tr\left( \mathbf{S}_{11}^{-1}\,\mathbf{S}\mathbf{G}_{11} \right) \qquad \left( 3 + 3q \right)\left( 2^p \right) - \frac{1}{2}p^2 - \left( \frac{5}{2} + q \right)p - \left( 3 + 3q \right)$$
(41)
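To get a feel for how these totals grow with p, the sketch below evaluates three of the closed forms, (36), (38) and (40), for a few problem sizes and converts them into a rough running time. The function name `total_flops` and the machine speed `ASSUMED_FLOPS_PER_SECOND` are illustrative assumptions and do not come from the paper; the timing is only an order-of-magnitude guide.

```python
from fractions import Fraction as F

# Hypothetical sustained speed, chosen only for illustration (not from the paper).
ASSUMED_FLOPS_PER_SECOND = 1e9


def total_flops(p):
    """Evaluate closed forms (36), (38) and (40) for p variables.

    Keys are the equation numbers of the totals in the appendix.
    Fractions keep the arithmetic exact before converting to floats.
    """
    two_p = 2 ** p
    return {
        36: 4 * two_p - F(p * p, 2) - F(5 * p, 2) - 4,
        38: F(p * p + 5 * p, 4) * two_p - p * p - 2 * p,
        40: (F(5 * p * p + 19 * p - 16, 8) * two_p
             + F(p ** 3, 2) - F(p * p, 2) - p + 2),
    }


if __name__ == "__main__":
    for p in (10, 20, 30):
        for eq, flops in total_flops(p).items():
            seconds = float(flops) / ASSUMED_FLOPS_PER_SECOND
            print(f"p={p:2d}  ({eq})  {float(flops):.3e} flops  ~{seconds:.3g} s")
```

All three totals are dominated by the factor 2^p, so each additional variable roughly doubles the effort; the criteria differ mainly in the polynomial that multiplies 2^p.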

These results were derived with the help of the known equalities

$$\sum_{a = 1}^{p - 1} \binom{p}{a}\,a = \left( 2^{p - 1} - 1 \right)p$$
(42)
$$\sum_{a = 1}^{p - 1} \binom{p}{a}\,a^2 = \left( p^2 + p \right)\left( 2^{p - 2} \right) - p^2$$
(43)

and the result

$$\sum_{t = 0}^{p - 1} 2^{p - t - 1}\left( a t^2 + b t + c \right) = (3a + b + c)\left( 2^p \right) - a p^2 - (2a + b)p - (3a + b + c)$$
(44)

which can be easily proved by induction.
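A brute-force numerical check is no substitute for the induction proof, but it is a quick way to catch transcription slips in (42)–(44). The sketch below compares both sides of each identity for small p using exact rational arithmetic; the function names are illustrative and not part of the paper.

```python
from fractions import Fraction
from itertools import product
from math import comb


def check_42(p):
    """Identity (42): sum_{a=1}^{p-1} C(p, a) * a == (2^(p-1) - 1) * p."""
    lhs = sum(comb(p, a) * a for a in range(1, p))
    return lhs == (2 ** (p - 1) - 1) * p


def check_43(p):
    """Identity (43): sum_{a=1}^{p-1} C(p, a) * a^2 == (p^2 + p) 2^(p-2) - p^2."""
    lhs = sum(comb(p, a) * a * a for a in range(1, p))
    # Fraction keeps 2^(p-2) exact when p = 1.
    rhs = (p * p + p) * Fraction(2) ** (p - 2) - p * p
    return lhs == rhs


def check_44(p, a, b, c):
    """Identity (44) for a quadratic a*t^2 + b*t + c weighted by 2^(p-t-1)."""
    lhs = sum(2 ** (p - t - 1) * (a * t * t + b * t + c) for t in range(p))
    s = 3 * a + b + c
    rhs = s * 2 ** p - a * p * p - (2 * a + b) * p - s
    return lhs == rhs


if __name__ == "__main__":
    coefficients = [Fraction(-3, 8), Fraction(0), Fraction(1, 2), Fraction(2)]
    ok = all(check_42(p) and check_43(p) for p in range(1, 16))
    ok = ok and all(
        check_44(p, a, b, c)
        for p in range(1, 16)
        for a, b, c in product(coefficients, repeat=3)
    )
    print("identities (42)-(44) hold on the tested range:", ok)
```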

In particular, the results in (36) and (41) follow directly from (30), (35) and (44). Result (37) follows from (31), (42) and (44), noting that from a k-dimensional subset there are \(\binom{p - t}{k}\) different pivots on the variable \(X_u = X_{t+1}\). Result (38) can be derived from (32), (42), (43) and (44).

Result (39) follows from (33), (42), (43) and (44), from the addition of the \(\frac{p^2(p + 1)}{2}\) operations required for the initial inversion of \(\mathbf{S}\), and from noting that for this criterion only \(\binom{p - 1}{p - k - 1}\) pivots to k-dimensional subsets (k = 1, 2, …, p − 2) are necessary. This latter remark follows from the fact that all subsets including one given variable can be evaluated simultaneously with the complementary subsets, using the relation \(tr\left( \mathbf{S}_{11}^{-1}\,\mathbf{S}_{11|2} \right) = 2k - p + tr\left( \mathbf{S}_{22}^{-1}\,\mathbf{S}_{22|1} \right)\), which follows from (4).

Finally, (40) was derived from (34), (42), (43) and (44), adding the \(\frac{p^3 + p}{2}\) operations needed for the initial computation of \(\mathbf{S}^2\).
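The derivation above can be cross-checked numerically by summing the per-pivot counts (30)–(35) over all pivots and comparing with the closed forms (36)–(41). The sketch below does this under two stated assumptions that are not taken verbatim from the paper: the pivot on variable \(X_{t+1}\) is counted \(\binom{p - t - 1}{k}\) times from k-dimensional subsets (an indexing chosen because it reproduces the published totals), and the last term of (31) is read as \(-\frac{1}{2}t\). Function names are illustrative.

```python
from fractions import Fraction as F
from math import comb


def per_pivot(p, k, t, q):
    """Per-pivot counts, keyed by the equation number of the matching total."""
    r = p - k - 1
    return {
        36: F(1, 2) * t * t + F(3, 2) * t + 1,                      # (30)
        37: (2 + t) * r - F(1, 2) * t * t - F(1, 2) * t,            # (31)
        38: r * r + 2 * r,                                          # (32)
        40: (F(5, 2) * k * k + (F(11, 2) + 3 * t) * k
             + F(1, 2) * t * t + F(7, 2) * t + 2),                  # (34)
        41: F(1, 2) * t * t + (F(3, 2) + q) * t + 2 * q,            # (35)
    }


def direct_totals(p, q):
    """Sum the per-pivot counts over all pivots (assumed pivot counts)."""
    totals = dict.fromkeys((36, 37, 38, 40, 41), F(0))
    for t in range(p):
        for k in range(p - t):
            n_pivots = comb(p - t - 1, k)  # assumption, see lead-in
            for key, cost in per_pivot(p, k, t, q).items():
                totals[key] += n_pivots * cost
    totals[40] += F(p ** 3 + p, 2)  # initial computation of S^2
    # (39): initial inversion of S plus C(p-1, p-k-1) pivots for k = 1..p-2.
    totals[39] = F(p * p * (p + 1), 2) + sum(
        comb(p - 1, p - k - 1)
        * (F(3, 2) * (p - k - 1) ** 2 + F(7, 2) * (p - k - 1))   # (33)
        for k in range(1, p - 1)
    )
    return totals


def closed_totals(p, q):
    """Closed forms (36)-(41)."""
    two_p = 2 ** p
    return {
        36: 4 * two_p - F(p * p, 2) - F(5 * p, 2) - 4,
        37: (F(3 * p, 2) - 1) * two_p - F(p * p, 2) - F(3 * p, 2) + 1,
        38: F(p * p + 5 * p, 4) * two_p - p * p - 2 * p,
        39: (F(3 * p * p + 11 * p - 14, 16) * two_p
             + F(p ** 3, 2) - p * p - F(p, 2) + 2),
        40: (F(5 * p * p + 19 * p - 16, 8) * two_p
             + F(p ** 3, 2) - F(p * p, 2) - p + 2),
        41: (3 + 3 * q) * two_p - F(p * p, 2) - (F(5, 2) + q) * p - (3 + 3 * q),
    }


if __name__ == "__main__":
    agree = all(direct_totals(p, q) == closed_totals(p, q)
                for p in range(1, 13) for q in range(1, 4))
    print("direct sums match (36)-(41):", agree)
```

Exact fractions are used instead of floats so that agreement means the polynomials coincide on the tested range, not merely that they are numerically close.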

About this article

Cite this article

Duarte Silva, A.P. Discarding Variables in a Principal Component Analysis: Algorithms for All-Subsets Comparisons. Computational Statistics 17, 251–271 (2002). https://doi.org/10.1007/s001800200105

  • DOI: https://doi.org/10.1007/s001800200105
