Exact methods for variable selection in principal component analysis: Guide functions and pre-selection

https://doi.org/10.1016/j.csda.2012.06.014

Abstract

A variable selection problem is analysed for use in Principal Component Analysis (PCA). In this case, the set of original variables is divided into disjoint groups. The problem resides in the selection of variables, but with the restriction that the selected set should contain at least one variable from each group. The objective function under consideration is the sum of the first k eigenvalues of the correlation matrix of the subset of selected variables. This problem, with no known prior references, presents two difficulties beyond those of the standard variable selection problem: the cost of evaluating the objective function, and the restriction that the subset of selected variables should contain elements from all of the groups. Two Branch & Bound methods are proposed to obtain exact solutions; they incorporate two strategies: the first is the use of "fast" guide functions as alternatives to the objective function; the second is the pre-selection of variables that help to comply with the latter restriction. The computational tests show that both strategies are very efficient and achieve significant reductions in calculation times.

Introduction

Data sets with large numbers of variables are processed in many disciplines such as economics, sociology, engineering, medicine and biology, among others. The researcher has to process large amounts of data classified by many variables, which are often difficult to summarize or interpret. One useful approach involves the reduction of data dimensionality, while trying to preserve as much of the original information as possible. A common way of doing this is through Principal Component Analysis (PCA).

PCA is widely applied in data mining to investigate data structures. Its purpose is to construct components, each of which captures the maximal amount of variation in the data left unexplained by the other components. Standard results guarantee that retaining the k Principal Components (PCs) with the largest associated variance produces the k-subset of linear combinations of the n original variables which, according to various criteria, represents the best approximation of the original variables (see, for example, Jolliffe, 2002). The user therefore expects that the information in the data can be summarized into a few principal components. Once the principal components have been determined, all further analysis can be carried out on them instead of on the original data, as they summarize the relevant information. Thus, PCA is frequently considered the first step of a statistical data analysis that aims at data compression: decreasing the dimensionality of the data, but losing as little information as possible.
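As a brief illustration of this reduction (a minimal sketch under our own assumptions; the data and names are hypothetical, not from the paper), the PCs of standardized data are the eigenvectors of the correlation matrix, ordered by the variance (eigenvalue) each one explains:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))             # 100 cases, 5 variables

R = np.corrcoef(X, rowvar=False)          # correlation matrix
eigvals, eigvecs = np.linalg.eigh(R)      # ascending eigenvalues
order = np.argsort(eigvals)[::-1]         # reorder: largest first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2                                     # number of PCs retained
retained = eigvals[:k].sum() / eigvals.sum()   # variation kept by k PCs

Z = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize the data
scores = Z @ eigvecs[:, :k]               # cases projected onto the k PCs
```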

While PCA is highly effective at reduced-dimensional representation, it does not provide a real reduction of dimensionality in terms of the original variables, since all n original variables are required to define a single Principal Component (PC). As McCabe (1984) has stated, "interpretation of the results and possible subsequent data collection and analysis still involve all of the variables". Moreover, Cadima and Jolliffe (1995, 2001) have shown that a PC can provide a misleading measure of variable importance, in terms of preserving information, because it is based on the assumption that the resultant of a linear combination (PC) is dominated by the vectors (variables) with large magnitude coefficients in that linear combination (high PC loadings). This assumption ignores the influence of the magnitude of each vector (the standard deviation of each variable) and the relative positions of the vectors (the pattern of correlations between the variables). One way of achieving a simple interpretation is to reduce the number of variables, that is, to look for a subset of the n variables that approximates, as far as possible, the k retained PCs. We consider the combinatorial problem of identifying, for any arbitrary integer p ($k \le p \le n$), a p-variable subset which is optimal with respect to a given criterion.

In most cases, the inclusion of all the variables in a statistical analysis is, at best, unnecessary and, at worst, a serious impediment to the correct interpretation of the data. From a computational point of view, variable selection is a Nondeterministic Polynomial-Time Hard (NP-hard) problem (Kohavi, 1995; Cotta et al., 2004): no polynomial-time algorithm is known that guarantees an optimal solution. This means that when the size of the problem is large, finding an optimal solution is, in practice, unfeasible. Two different methodological approaches have been developed for variable selection problems: optimal or exact techniques (enumerative techniques), which are able to guarantee an optimal solution, but which are only applicable to small-sized sets; and heuristic techniques, which are able to find good solutions (although unable to guarantee the optimum) within a reasonable amount of time.

The problem of selecting a subset of variables from a larger candidate pool abounds in areas such as multiple linear regression (Furnival and Wilson, 1974; Miller, 2002; Gatu and Kontoghiorghes, 2006; Hofmann et al., 2007; Gatu et al., 2007), logistic regression (Pacheco et al., 2009), polynomial regression (Peixoto, 1987; Brusco et al., 2009b; Brusco and Steinley, 2010), factor analysis (Kano and Harada, 2000; Hogarty et al., 2004), cluster analysis (Brusco and Cradit, 2001; Steinley and Brusco, 2008; Krzanowski and Hand, 2009), and discriminant analysis (McCabe, 1975; McKay and Campbell, 1982a, 1982b; Pacheco et al., 2006).

Specifically, the problem of variable selection in PCA has been investigated by Jolliffe (1972, 1973), Robert and Escoufier (1976), McCabe (1984), Bonifas et al. (1984), Krzanowski (1987a), Falguerolles and Jmel (1993), Mori et al. (1994), Jolliffe (2002), Duarte Silva (2002), Cadima et al. (2004), Mori et al. (2007) and Brusco et al. (2009a), among others. These studies sought to obtain PCs based on a subset of variables, in such a way that they retain as much information as possible in comparison to PCs based on all the variables. To address this problem, it is necessary to tackle two secondary problems: (1) the establishment of an objective criterion that can measure the quality or fitness of every subset of variables; and (2) the development of solution procedures for finding optimal, or at least near-optimal, subsets based on these criteria. The methods proposed by Jolliffe (1972, 1973) consider PC loadings, and those of McCabe (1984) and Falguerolles and Jmel (1993) use a partial covariance matrix to select a subset of variables which, as far as possible, maintains information on all variables. Robert and Escoufier (1976) and Bonifas et al. (1984) used the RV-coefficient, and Krzanowski (1987a, 1987b) used Procrustes analysis to evaluate the closeness between the configuration of PCs computed from the selected variables and the configuration based on all of the variables. Tanaka and Mori (1997) discussed a method called "modified PCA" (MPCA) to derive PCs which are computed using only a selected subset of variables but which represent all of the variables, including those that were not selected. Since MPCA naturally includes variable selection procedures in its analysis, its criteria can be used directly to detect a reasonable subset of variables. Further criteria may be considered based, for example, on the influence analysis of variables and on predictive residuals, using the concepts reported in Tanaka and Mori (1997) and in Krzanowski (1987a, 1987b), respectively; for more details see Iizuka et al. (2003).

Thus, the existence of several methods and criteria is one of the typical features of variable selection in multivariate methods without external variables, such as PCA (here, an "external variable" is a variable to be predicted or explained using the information derived from other variables). Moreover, the existing methods and criteria often provide different results (selected subsets of variables), which is a further typical feature. This occurs because each criterion has its own rationale for selecting variables; we cannot say, therefore, that one criterion is better than any other. These features are not observed in multivariate methods with external variable(s), such as multiple regression analysis (Mori et al., 2007).

In practical applications of variable selection, it is desirable to have a computational environment in which those who want to select variables can apply a suitable method for their own selection purposes without difficulty, and/or can try various methods and choose the best one by comparing the results.

In many studies, the initial variables are divided into previously defined groups. In these cases it is required, or at least recommended, to use variables from all of the groups considered. This happens, for example, in the construction of composite indicators, which are used in several areas (economy, society, quality of life, nature, technology, etc.) as measures of the evolution of regions or countries in those areas. Composite indicators should try to cover all points of view of the analysed phenomenon (which may be identified with the different groups of variables) and should therefore contain at least one variable from each group (or satisfy other similar conditions), so that they encompass all the points of view.

The importance of composite indicators is explained in Nardo et al. (2005a,b) and Bandura (2008), among other references. The convenience of using variables from all of the groups considered is explicitly indicated at least in Nardo et al. (2005a), Ramajo-Hernández and Márquez-Paniagua (2001) and López-García and Castro-Núñez (2004). In several of the examples mentioned in these references and links (see http://composite-indicators.jrc.ec.europa.eu/articles_books.htm) there are predefined groups of variables and every group participates in the final composite indicator. For example, Tangian (2007) builds a composite indicator to measure working conditions; 10 groups of variables are considered (physical environment, health, time factors, etc.). In Chan et al. (2005) a composite indicator is built as an analytic tool to examine the quality of life in Hong Kong; the index is released annually and consists of 20 variables grouped into three groups: social (10 variables), economic (7) and environmental (3). In Nardo et al. (2005a,b) a Technology Achievement Index is proposed, with the following groups: creation of technology (2 variables), diffusion of recent innovations (2), diffusion of old innovations (2) and human skills (2). In López-García and Castro-Núñez (2004) an indicator of regional economic activity in Spain is built, with the following groups: agriculture (2 variables), construction (1), industry (4), merchandise services (9) and non-merchandise services (2). Several more examples can be found in the literature; some can be taken from Bandura (2008), an annual survey of about 170 international composite indicators.

Several restrictions can be considered in order to ensure the "participation" of every group in the final solution. In this work we consider the simplest one: to take at least one variable from each group. This constraint has been considered in Calzada et al. (2011). Nevertheless, other harder and/or more sophisticated restrictions could be considered, depending on different questions (the initial size of every group, the opinion of experts, etc.). In any case, this simplest constraint suffices to illustrate the difficulty of these kinds of problems.

A variable selection problem is studied in this work for subsequent use in PCA, but with the restriction that the set of selected variables contain at least one variable from each of the previously defined groups. The objective function under consideration is the percentage of the total variation explained by the first k Principal Components of the selected variables, that is, the sum of the first k eigenvalues divided by the sum of all eigenvalues. As stated in Cadima and Jolliffe (2001), the first few principal components of a set of variables are derived variables with optimal properties in terms of approximating the original variables. Although other choices are possible (see Jolliffe, 2002, Section 10.1; Jolliffe, 1989), the principal components of interest are conventionally those with the largest variance. How many of these to keep can be determined in a variety of ways (Jolliffe, 2002, Section 6.1, and Richman, 1992, review many of the methods that have been suggested). This paper deals with the k principal components simultaneously because, if we retain k PCs, we are interested in interpreting the space spanned by those PCs rather than the individual PCs themselves. The restriction that there should be at least one variable from each group in the selected set ensures that all viewpoints are considered. In addition, it helps to avoid the selection of variables that are highly correlated with each other, since, in general, variables in the same group are usually more highly correlated with each other than with variables from other groups. There are no references in the literature for this specific problem.
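To make the criterion concrete, the following minimal Python sketch (our illustration, not the authors' implementation; all names are ours) evaluates the objective for a candidate subset S: the sum of the k largest eigenvalues of the correlation submatrix divided by the sum of all its eigenvalues, which for a correlation matrix of p variables equals its trace, p.

```python
import numpy as np

def explained_variation(corr, subset, k):
    """Objective f(S): fraction of the total variation of the selected
    variables captured by their first k principal components, i.e. the
    k largest eigenvalues of the correlation submatrix divided by the
    sum of all its eigenvalues (the trace, equal to len(subset))."""
    sub = corr[np.ix_(subset, subset)]   # correlation submatrix for S
    eigvals = np.linalg.eigvalsh(sub)    # ascending order (symmetric matrix)
    return eigvals[-k:].sum() / eigvals.sum()
```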

One difficulty in this problem is as follows: although the objective function under consideration is known (the percentage of the total variation explained by the first k Principal Components), it requires a high number of calculations every time it is evaluated (the calculation and summation of eigenvalues). If an exploration is performed in which many solutions are analysed, the total computation time of the search might be excessive, due to the evaluation time of the objective function at each solution. This difficulty can be overcome by designing methods in which the search is not guided directly by the objective function, but by alternative functions related to it that require fewer calculations. The other main difficulty comes from the restriction that selected subsets should have at least one element from each group. The process of constructing solutions (in the context of both exact and heuristic methods) should take this restriction into account, as otherwise the constructed solutions might not comply with it and would have to be rejected. The inclusion of a strategy which will, as far as possible, prevent the generation of unfeasible solutions, and consequently avoid unnecessary calculations, therefore appears necessary.
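As a simple illustration (our own sketch, with hypothetical names), the feasibility condition reduces to requiring that a candidate subset intersect every group:

```python
def covers_all_groups(subset, groups):
    """Feasibility check: `subset` must contain at least one variable
    from each group, where `groups` partitions the variable indices."""
    chosen = set(subset)
    return all(chosen & set(g) for g in groups)

# Example with three groups of two variables each: {0, 3, 5} touches
# every group and is feasible; {0, 1, 5} misses the second group.
groups = [[0, 1], [2, 3], [4, 5]]
assert covers_all_groups([0, 3, 5], groups)
assert not covers_all_groups([0, 1, 5], groups)
```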

In this paper, two Branch and Bound-based methods are proposed. Two alternative guide functions that differ from the objective function will be used in these exact methods, specifically in the branching process. In addition, a pre-selection strategy will also be used in the branching process, which will favour the generation of feasible solutions. As will be confirmed later on, the alternative guide functions and the pre-selection manage to reduce total computing time considerably, thereby achieving faster and more efficient explorations.

The remainder of this work is organized as follows: Section 2 is devoted to the problem definition and to some theoretical results that will be used later on. Section 3 describes the proposed guide functions. Section 4 describes the two Branch and Bound methods. The computational experiments are reported in Section 5; they analyse, above all, the effect of the guide functions and the pre-selection strategy. Finally, Section 6 presents the conclusions of the study.


Prior definitions

Consider a data matrix, X, corresponding to m cases and characterized by n variables. We shall label the set of variables $V = \{1, 2, \ldots, n\}$ (the variables are identified by their indices for the sake of simplicity).

Let $x_{ij}$ be the value of variable $j$ in case $i$, $i = 1, \ldots, m$; $j = 1, \ldots, n$.

Let $x_j$ be the column vector with the values of variable $j$, in other words $x_j = (x_{1j}, x_{2j}, \ldots, x_{mj})'$, $j = 1, \ldots, n$.

It is known that $X = (x_{ij})_{i=1,\ldots,m;\, j=1,\ldots,n} = (x_1 \; x_2 \; \cdots \; x_n)$.

Consider the correlation matrix $\Sigma = (\rho_{ij})_{i,j=1,\ldots,n}$; in other words, $\rho_{ij}$ is the correlation coefficient between variables $i$ and $j$.
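These definitions translate directly into code. A minimal sketch in Python/numpy (our illustration; the data and names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 8                    # m cases, n variables
X = rng.normal(size=(m, n))      # data matrix X = (x_1 x_2 ... x_n)

# Correlation matrix Sigma = (rho_ij): rho_ij is the Pearson
# correlation between the column vectors x_i and x_j.
Sigma = np.corrcoef(X, rowvar=False)
assert Sigma.shape == (n, n)
```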

Search for values and guide functions

The problem defined by (1), (2), (3), (4) involves an important difficulty, due to the computation time needed to calculate the function $f(S)$ for any $S \subseteq V$. The methods that determine the largest k eigenvalues of a positive-definite symmetric matrix of order p, and hence their sum, are iterative methods based on the QR decomposition (Francis, 1961–1962; Kublanovskaya, 1961). These methods usually need several iterations to converge, and $\Theta(p^3)$ operations are required for each iteration. If the
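The snippet above is cut off before the paper's guide functions are defined, but the general idea of a cheaper search guide can be illustrated. The sketch below (our own assumption, not necessarily one of the paper's guide functions) uses the squared Frobenius norm of the correlation submatrix, computable in $\Theta(p^2)$ time with no eigendecomposition: since the eigenvalues of a correlation submatrix of order p sum to p, a larger value of $\sum_i \lambda_i^2 = \|\Sigma_S\|_F^2$ indicates a spectrum concentrated in the leading eigenvalues and hence, typically, a larger share of variation explained by the first k PCs.

```python
import numpy as np

def guide_frobenius(corr, subset):
    """Cheap Theta(p^2) surrogate for the objective (an illustrative
    guide function, not necessarily the paper's): the squared Frobenius
    norm of the correlation submatrix. Because the eigenvalues sum to
    p = len(subset), a larger sum of squared eigenvalues (= this norm)
    means a more concentrated spectrum and, typically, a larger
    fraction of variation captured by the leading components."""
    sub = corr[np.ix_(subset, subset)]
    return float((sub ** 2).sum())
```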

First Branch and Bound method

Corollary 1 and Corollary 2 (in the Appendix) allow the design of an exact Branch and Bound (BnB) based method, similar to others found in the literature for several variable selection problems (Brusco et al., 2009a, 2009b; Brusco and Steinley, 2011). This method makes it possible to find the optimal solution, $S_p$ and $g(p)$, for all values of p (verifying $p \ge k$, $p \ge q$, $p \le n$, where $q$ is the number of groups), when the value of n is moderate.

The BnB algorithm performs a recursive analysis of the set of solutions. This
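Although the snippet is cut off, the overall recursive scheme can be sketched. The following Python code is a schematic of a generic depth-first BnB over variable subsets, written under our own assumptions; it is not the paper's exact algorithm, whose bounds come from Corollaries 1 and 2. It combines the three ingredients discussed so far: the objective, a pruning bound, and pre-selection-style feasibility pruning.

```python
import numpy as np

def branch_and_bound(corr, groups, p, k):
    """Schematic depth-first BnB over p-variable subsets (our sketch,
    not the paper's exact algorithm). Variables are scanned in index
    order and either included or excluded; a branch is pruned when it
    cannot reach p variables, cannot cover every group, or cannot beat
    the incumbent under an (assumed) admissible upper bound."""
    n = corr.shape[0]
    best = {"value": -np.inf, "subset": None}

    def f(subset):
        # Objective: share of variation explained by the first k PCs.
        ev = np.linalg.eigvalsh(corr[np.ix_(subset, subset)])
        return ev[-k:].sum() / ev.sum()

    def upper_bound(subset):
        # Placeholder bound, trivially admissible since f(S) <= 1.
        # The paper derives sharper, cheaper bounds (Corollaries 1-2).
        return 1.0

    def can_complete(subset, nxt):
        # Pre-selection-style pruning: enough variables must remain to
        # reach size p, and every group must still be coverable.
        if len(subset) + (n - nxt) < p:
            return False
        chosen, remaining = set(subset), set(range(nxt, n))
        return all(chosen & set(g) or remaining & set(g) for g in groups)

    def recurse(subset, nxt):
        if len(subset) == p:
            if all(set(subset) & set(g) for g in groups):  # feasible?
                val = f(subset)
                if val > best["value"]:
                    best.update(value=val, subset=list(subset))
            return
        if not can_complete(subset, nxt) or upper_bound(subset) <= best["value"]:
            return  # prune this branch
        recurse(subset + [nxt], nxt + 1)  # branch 1: include variable nxt
        recurse(subset, nxt + 1)          # branch 2: exclude variable nxt

    recurse([], 0)
    return best
```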

Design of correlation matrix samples

A series of correlation matrix samples will be generated for the different computational tests. The process of generating these matrices, similar to that used in Brusco et al. (2009a, 2009b), consists of designing population correlation matrices L; from each population correlation matrix L, a set of vectors following a normal distribution with correlation matrix L is generated, and finally the corresponding sample correlation matrix Σ is obtained.

The method of generating
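A minimal sketch of this sampling scheme (our illustration; the block structure and parameter values are assumptions, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_correlation(L, m):
    """Given a population correlation matrix L, draw m multivariate
    normal vectors with correlation matrix L and return the sample
    correlation matrix, as in the generation scheme described above."""
    n = L.shape[0]
    data = rng.multivariate_normal(np.zeros(n), L, size=m)
    return np.corrcoef(data, rowvar=False)

# Example: a simple block-structured population matrix with two groups,
# within-group correlation 0.7 and zero between-group correlation.
L = np.eye(4)
L[0, 1] = L[1, 0] = 0.7
L[2, 3] = L[3, 2] = 0.7
Sigma = sample_correlation(L, m=200)
```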

Conclusions

A variable selection problem has been analysed in this work for use in Principal Component Analysis (PCA). In this case, the set of original variables is divided (partitioned) into disjoint groups. Each group corresponds to a variable type. The problem consists in the selection of variables, but with the restriction that the set of variables that is selected should have at least one variable of each type or group.

This problem, as previously explained, has a wide scope of application, above all

Acknowledgements

This work was partially supported by FEDER and the Spanish Ministry of Science and Innovation (Project ECO2008-06159/ECON), the Regional Government of Castile-Leon (Spain) (Project BU008A10-2), the University of Burgos and “CajaBurgos” (Spain, Internal Projects). Their support is gratefully acknowledged.

References (53)

  • Brusco, M.J., et al., 2001. A variable-selection heuristic for k-means clustering. Psychometrika.
  • Brusco, M.J., et al., 2009. Variable neighborhood search heuristics for selecting a subset of variables in principal component analysis. Psychometrika.
  • Brusco, M.J., et al., 2010. Neighborhood search heuristics for selecting hierarchically well-formulated subsets in polynomial regression. Naval Research Logistics.
  • Brusco, M.J., et al., 2009. An exact algorithm for hierarchically well-formulated subsets in second-order polynomial regression. Technometrics.
  • Cadima, J., et al., 1995. Loadings and correlations in the interpretation of principal components. Journal of Applied Statistics.
  • Cadima, J., et al., 2001. Variable selection and the interpretation of principal subspaces. Journal of Agricultural, Biological, and Environmental Statistics.
  • Calzada, J.M., et al., 2011. Boletín de Coyuntura Económica. No. 4, Caja Rural...
  • Chan, Y.K., et al., 2005. Quality of life in Hong Kong: the CUHK Hong Kong quality of life index. Social Indicators Research.
  • Cotta, C., et al., 2004. Evolutionary search of thresholds for robust feature set selection: application to the analysis of microarray data. Lecture Notes in Computer Science.
  • Duarte Silva, A.P., 2002. Discarding variables in principal component analysis: algorithms for all-subsets comparisons. Computational Statistics.
  • Falguerolles, A., et al., 1993. Un critère de choix de variables en analyse en composantes principales fondé sur des modèles graphiques gaussiens particuliers. Canadian Journal of Statistics.
  • Francis, J.G.F., 1961–1962. The QR transformation, parts I and II. The Computer Journal.
  • Furnival, G.M., et al., 1974. Regressions by leaps and bounds. Technometrics.
  • Gatu, C., et al., 2006. Branch-and-bound algorithms for computing the best-subset regression models. Journal of Computational and Graphical Statistics.
  • Hogarty, K.Y., et al., 2004. Selection of variables in exploratory factor analysis: an empirical comparison of a stepwise and traditional approach. Psychometrika.
  • Iizuka, M., et al., 2003. Computer intensive trials to determine the number of variables in PCA. Journal of the Japanese Society of Computational Statistics.