Summary
The traditional approach to the interpretation of the results from a Principal Component Analysis implicitly discards variables that are weakly correlated with the most important and/or most interesting Principal Components. Some authors argue that this practice is potentially misleading and that it is preferable to take a variable selection approach, comparing variable subsets according to appropriate approximation criteria. In this paper, we propose algorithms for the comparison of all possible subsets according to some of the most important comparison criteria proposed to date. The computational effort of the proposed algorithms is studied and it is shown that, given current computer technology, they are feasible for problems involving up to thirty variables. A freely available software implementation can be downloaded from the Internet.
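As a rough illustration of the all-subsets idea, the following brute-force sketch (not the paper's pivot-based algorithms) scores every k-variable subset with one simple variance-explained criterion of the kind compared in this literature: the fraction of the total variance of all p variables recovered by regressing them on the subset. All function names are illustrative.

```python
import itertools
import numpy as np

def explained_variance(S, subset):
    """Fraction of total variance of all p variables recovered by
    projecting onto the variables in `subset` (a variance-explained
    criterion: tr(S_{.1} S_{11}^{-1} S_{1.}) / tr(S))."""
    K = list(subset)
    S_Kall = S[K, :]                      # k x p block S_{1.}
    S_KK = S[np.ix_(K, K)]                # k x k block S_{11}
    return np.trace(S_Kall.T @ np.linalg.solve(S_KK, S_Kall)) / np.trace(S)

def best_subset(S, k):
    """Brute-force comparison of all k-variable subsets."""
    p = S.shape[0]
    scored = [(explained_variance(S, c), c)
              for c in itertools.combinations(range(p), k)]
    return max(scored)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 6))
X[:, 5] = X[:, 0] + 0.01 * rng.standard_normal(100)   # near-redundant variable
S = np.cov(X, rowvar=False)
score, subset = best_subset(S, 3)
```

Exhaustive enumeration visits 2**p - 2 proper subsets, roughly 10**9 when p = 30, which is why the careful per-pivot updating schemes studied in the paper matter.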
Notes
2. A different operator to update \(RV({\bf{X}},\,{{\bf{X}}_{1'}}{\bf{M}})\) can be found in Bonifas et al. (1984). However, the operator presented there appears to be computationally more demanding than the one we use, since it requires the computation of k + 1 bilinear forms involving the n-dimensional rows extracted from the matrix \({\bf{L}}_{1'1'}^{ - 1}{\bf{X}}_{1'}^{T}\), where \({{\bf{L}}_{1'1'}}\) is the lower-triangular Cholesky factor of \({{\bf{S}}_{1'1'}}\), i.e. \({{\bf{S}}_{1'1'}} = {{\bf{L}}_{1'1'}}{\bf{L}}_{1'1'}^{T}\).
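For reference, the RV coefficient of Robert & Escoufier (1976) can also be computed directly from the two data matrices, without any incremental updating; a minimal sketch (the function name is illustrative):

```python
import numpy as np

def rv_coefficient(X, Y):
    """RV coefficient of Robert & Escoufier (1976) between two
    data matrices with the same number of rows:
    tr(Sxy Sxy') / sqrt(tr(Sxx^2) tr(Syy^2)), after column-centring."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Sxy = Xc.T @ Yc
    Sxx = Xc.T @ Xc
    Syy = Yc.T @ Yc
    num = np.trace(Sxy @ Sxy.T)
    den = np.sqrt(np.trace(Sxx @ Sxx) * np.trace(Syy @ Syy))
    return num / den

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 4))
rv_self = rv_coefficient(X, X)        # a configuration matches itself
rv_scaled = rv_coefficient(X, 2.5 * X)  # RV is invariant to rescaling
```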
References
Beale, E.M.L., Kendall, M.G. & Mann, D.W. (1967), “The Discarding of Variables in Multivariate Analysis”, Biometrika, 54, 357–366.
Bonifas, L., Escoufier, Y., Gonzalez, P.L. & Sabatier R. (1984), “Choix de Variables en Analyse en Composantes Principales”, Revue de Statistique Appliquée, 32, 2, 5–15.
Cadima, J.F. & Jolliffe, I.T. (1995), “Loadings and Correlations in the Interpretation of Principal Components”, Journal of Applied Statistics, 22, 2, 203–214.
Cadima, J.F. & Jolliffe, I.T. (2001), “Variable Selection and the Interpretation of Principal Subspaces”, To appear in Journal of Agricultural, Biological and Environmental Statistics.
Duarte Silva, A.P. (1998), A Leaps and Bounds Algorithm for Variable Selection in Two-Group Discriminant Analysis, in “Advances in Data Science and Classification”, IFCS, Springer, 227–232.
Duarte Silva, A.P. (2001), “Efficient Variable Screening for Multivariate Analysis”, Journal of Multivariate Analysis, 76, 1, 35–62.
Fenneteau, H. & Bialès, C. (1993), Analyse Statistique des Données. Applications et Cas pour le Marketing, Ellipses, Paris.
Furnival, G.M. (1971), “All Possible Regressions with Less Computation”, Technometrics, 13, 403–408.
Furnival, G.M. & Wilson, R.W. (1974), “Regressions by Leaps and Bounds”, Technometrics, 16, 499–511.
Jeffers, J.N.R. (1967), “Two Case Studies in the Application of Principal Components Analysis”, Applied Statistics, 16, 225–236.
Jolliffe, I.T. (1972), “Discarding Variables in a Principal Component Analysis, I: Artificial Data”, Applied Statistics, 21, 160–173.
Jolliffe, I.T. (1973), “Discarding Variables in a Principal Component Analysis, II: Real Data”, Applied Statistics, 22, 21–31.
Krzanowski, W.J. (1987), “Selection of Variables to Preserve Multivariate Data Structure using Principal Components”, Applied Statistics, 36, 22–33.
Lebart, L., Morineau, A. & Piron, M. (1995), Statistique Exploratoire Multidimensionnelle, Dunod, Paris.
Lawless, J. & Singhai, K. (1978), “Efficient Screening of Nonnormal Regression Models”, Biometrics, 34, 318–327.
McCabe, G.P. (1975), “Computations for Variable Selection in Discriminant Analysis”, Technometrics, 17, 103–109.
McCabe, G.P. (1984), “Principal Variables”, Technometrics, 26, 2, 137–144.
Minhoto, M.J. (1998), A Redução de Dimensionalidade Através de Subconjuntos de Variáveis Observadas, Unpublished Master Thesis. Universidade Técnica de Lisboa. Instituto Superior de Agronomia, Lisboa.
Morrison, D.F. (1990), Multivariate Statistical Methods, 3rd ed., McGraw-Hill, New York.
Ramsay, J.O., ten Berge, J. & Styan, G.P.H. (1984), “Matrix Correlation”, Psychometrika, 49, 3, 403–423.
Robert, P. & Escoufier, Y. (1976), “A Unifying Tool for Linear Multivariate Statistical Methods: The RV-Coefficient”, Applied Statistics, 25, 3, 257–265.
Appendix: Counting Operations
Formulae (30)–(35) show the number of floating point operations per pivot required to update the comparison criteria (from a k-dimensional subset to a (k + 1)-dimensional subset) and the matrix and/or vector elements, associated with t given variables, that will be needed in future steps.
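Each pivot of this kind amounts to a rank-one (Schur complement) update of the conditional covariance of the candidate variables. A minimal sketch of that single update, checked against direct conditioning (illustrative only; the paper's recursions in (30)–(35) also update the criterion values themselves):

```python
import numpy as np

def pivot_update(S_cond, u):
    """One pivot step: move the candidate variable with local index `u`
    into the selected subset.  Given S22|1, the conditional covariance
    of the candidate variables on the current k-variable subset, return
    the conditional covariance of the remaining candidates given the
    enlarged (k+1)-variable subset: a rank-one Schur complement update."""
    s_uu = S_cond[u, u]
    s_u = S_cond[:, u]
    S_new = S_cond - np.outer(s_u, s_u) / s_uu
    keep = [i for i in range(S_cond.shape[0]) if i != u]
    return S_new[np.ix_(keep, keep)]

# Check against direct conditioning on the subset {0, 1}.
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 5))
S = A.T @ A / 30                       # a positive definite "covariance"
S22_1 = S[1:, 1:] - np.outer(S[1:, 0], S[0, 1:]) / S[0, 0]   # condition on var 0
after_pivot = pivot_update(S22_1, 0)   # now also select var 1
direct = S[2:, 2:] - S[2:, :2] @ np.linalg.solve(S[:2, :2], S[:2, 2:])
```

That the iterated update agrees with direct conditioning is the quotient property of Schur complements, which is what makes the pivot-by-pivot scheme valid.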
[Table: floating point operations per pivot, by criterion; formulae (30)–(35).]
The results in (30), (31) and (35) follow from a direct count of the number of operations required to implement (9), (13) and (16) and to update the different elements of S22|1 and (va)2|1 that will be needed later. The results in (32) and (33) follow from the fact that \({{\bf{S}}_{2'2'\left| {1'} \right.}}\) and \({\bf{S}}_{2'2'}^{ - 1}\) each have \({{\left( {p - k - 1} \right)\,\left( {p - k} \right)} \over 2} = {1 \over 2}\left( {{{\left( {p - k - 1} \right)}^2} + \left( {p - k - 1} \right)} \right)\) distinct elements that need to be updated and squared or cross-multiplied; for these two criteria, any sequencing of the F-MC algorithm leads to the same effort. To compute the effort required to update the \(tr\,\left( {{{\left( {{\bf{S}}_{11}^{ - 1}{{\left( {{{\bf{S}}^2}} \right)}_{11}}} \right)}^2}} \right)\) criterion, we first note that the matrix \({\bf{S}}_{1'1'}^{ - 1}{\left( {{{\bf{S}}^2}} \right)_{1'1'}}\), not being in general symmetric, has \((k + 1)^2\) distinct elements. Therefore the number of operations required to update this criterion using (6), (14) and (15) equals \(\left( {k + 1} \right)k + k + {\left( {k + 1} \right)^2} + {1 \over 2}\left( {k + 1} \right)\left( {k + 2} \right) = {5 \over 2}{k^2} + {{11} \over 2}k + 2\). Updating the additional elements of \({\left( {{\bf{S}}_{11}^{ - 1}{{\left( {{{\bf{S}}^2}} \right)}_{1 \bullet }}} \right)^2}\), \(\left( {{\bf{S}}_{11}^{ - 1}\,{{\bf{S}}_{12}}} \right)\) and S22|1 that will be needed in future steps requires \({1 \over 2}{t^2} + \left( {{7 \over 2} + 3k} \right)t\) more operations. The result in (34) is the sum of these two quantities.
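The closing algebra of the per-pivot count can be double-checked with exact rational arithmetic; a small sketch (function names are illustrative):

```python
from fractions import Fraction

def trace_sq_update_ops(k):
    """Per-pivot cost assembled term by term as in the appendix:
    (k+1)k + k + (k+1)^2 + (k+1)(k+2)/2."""
    return (k + 1) * k + k + (k + 1)**2 + Fraction((k + 1) * (k + 2), 2)

def closed_form(k):
    """The closed form stated in the text: (5/2)k^2 + (11/2)k + 2."""
    return Fraction(5, 2) * k**2 + Fraction(11, 2) * k + 2

checks = all(trace_sq_update_ops(k) == closed_form(k) for k in range(200))
```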
Formulae (36)–(41) show the total number of floating point operations required by the different criteria.
[Table: total number of floating point operations, by criterion; formulae (36)–(41).]
These results were derived with the help of the known equalities (42)–(43) and the result (44), which can easily be proved by induction.
In particular, the results in (36) and (41) follow directly from (30), (35) and (44). The result (37) follows from (31), (42), (44) and noticing that from a k-dimensional subset there are \(\binom{p - t}{k}\) different pivots on the variable \(X_u = X_{t+1}\). Result (38) can be derived from (32), (42), (43) and (44). Result (39) follows from (33), (42), (43), (44), the addition of the \({{{p^2}\left( {p + 1} \right)} \over 2}\) operations required for the initial inversion of S, and noticing that for this criterion only \(\binom{p - 1}{p - k - 1}\) pivots to k-dimensional subsets (k = 1, 2, …, p − 2) are necessary. This latter remark follows from the fact that all subsets including one given variable can be evaluated simultaneously with the complementary subsets, using the relation \(tr\left( {{\bf{S}}_{11}^{ - 1}\,{{\bf{S}}_{11\left| 2 \right.}}} \right) = 2k - p + tr\left( {{\bf{S}}_{22}^{ - 1}\,{{\bf{S}}_{22\left| 1 \right.}}} \right)\), which follows from (4). Finally, (40) was derived from (34), (42), (43) and (44), adding the \({{{p^3} + p} \over 2}\) operations needed for the initial computation of \({\bf{S}}^2\).
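The remark that complementary subsets can be evaluated together can be illustrated by counting subsets rather than literal pivots (a simplification of the counts in (36)–(41); function names are illustrative): pairing each subset that contains a fixed variable with its complement roughly halves the work.

```python
from math import comb

def subsets_without_pairing(p):
    """All proper non-empty subsets: every cardinality k = 1, ..., p-1."""
    return 2**p - 2

def subsets_with_pairing(p):
    """Subsets visited directly when each subset containing a fixed
    variable is scored together with its complement, as in the text:
    sum over k = 1, ..., p-2 of C(p-1, p-k-1)."""
    return sum(comb(p - 1, p - k - 1) for k in range(1, p - 1))

# The paired count collapses to 2^(p-1) - 2, half the unpaired total.
halved = all(subsets_with_pairing(p) == 2**(p - 1) - 2 for p in range(3, 20))
```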
Duarte Silva, A.P. Discarding Variables in a Principal Component Analysis: Algorithms for All-Subsets Comparisons. Computational Statistics 17, 251–271 (2002). https://doi.org/10.1007/s001800200105