Pattern Recognition Letters

Volume 33, Issue 15, 1 November 2012, Pages 1991-1999

Hypergraph based information-theoretic feature selection

https://doi.org/10.1016/j.patrec.2012.03.021

Abstract

In many data analysis tasks, one is often confronted with the problem of selecting features from very high dimensional data. The feature selection problem is essentially a combinatorial optimization problem, which is computationally expensive. To render it tractable, it is frequently assumed that features influence the class variable either independently or through pairwise interactions only. In this paper, we instead draw on recent work on hypergraph clustering to select the most informative feature subset (mIFS) from a set of objects using high-order (rather than pairwise) similarities. There are two novel ingredients. First, we use a new information-theoretic criterion, referred to as the multidimensional interaction information (MII), to measure the significance of different feature combinations with respect to the class labels. The advantage of MII is that it incorporates third- or higher-order feature interactions. Second, we use hypergraph clustering to extract the most informative feature subset (mIFS), which has both low redundancy and strong discriminating power; the size of the subset is determined automatically. Experimental results demonstrate the effectiveness of our feature selection method on a number of standard data sets.

Highlights

• We combine MII and hypergraph cluster analysis for feature selection.
• The MII criterion can capture third- or higher-order feature interactions.
• The optimal size of the feature subset is determined automatically by hypergraph cluster analysis.

Introduction

High-dimensional data pose a significant challenge for pattern recognition. The most popular methods for reducing dimensionality are variance-based subspace methods such as PCA (Jollife, 1986). However, the extracted PCA feature vectors only capture sets of features with a significant combined variance, and this renders them relatively ineffective for classification tasks. Hence it is crucial to identify a smaller subset of features that are informative for classification and clustering. Recently, mutual information has been shown to provide a principled way of measuring the mutual dependence of two variables, and has been used by a number of researchers to develop information-theoretic feature selection criteria. For example, Battiti (2002) has developed the Mutual Information-Based Feature Selection (MIFS) criterion, where the features are selected in a greedy manner. Given a set of existing selected features S, at each step it locates the feature xi that maximizes the relevance to the class, I(xi; C). The selection is regulated by a proportional term βI(xi; S) that measures the overlap information between the candidate feature and the existing features. The parameter β may significantly affect the features selected, and its control remains an open problem. Peng et al. (2005), on the other hand, use the so-called Maximum-Relevance Minimum-Redundancy (MRMR) criterion, which is equivalent to MIFS with β = 1/(n − 1). The Joint Mutual Information (JMI) criterion of Yang and Moody (1999) is based on conditional MI and selects features by checking whether they bring additional information to an existing feature set; this method effectively rejects redundant features. Kwak and Choi (2002) improve MIFS by developing MIFS-U under the assumption of a uniform distribution of information for input features. Their method estimates the MI using a Parzen window, which is less computationally demanding and also provides better estimates.
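To make the greedy mechanism concrete, the following Python sketch (our illustration, not code from any of the cited papers) implements an MIFS-style selection loop. Here discrete_mi is a simple histogram-based MI estimator; the function names, the bin count and the default β are our own assumptions.

    import numpy as np

    def discrete_mi(x, y, bins=8):
        """Histogram-based estimate of the mutual information I(x; y) in nats."""
        pxy, _, _ = np.histogram2d(x, y, bins=bins)
        pxy = pxy / pxy.sum()
        px = pxy.sum(axis=1, keepdims=True)   # marginal p(x)
        py = pxy.sum(axis=0, keepdims=True)   # marginal p(y)
        nz = pxy > 0
        return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

    def mifs(X, c, k, beta=0.5, bins=8):
        """Greedy MIFS: at each step add the feature x_i maximizing
        I(x_i; C) - beta * sum over selected s of I(x_i; x_s)."""
        relevance = [discrete_mi(X[:, i], c, bins) for i in range(X.shape[1])]
        selected, remaining = [], list(range(X.shape[1]))
        while len(selected) < k and remaining:
            scores = [relevance[i]
                      - beta * sum(discrete_mi(X[:, i], X[:, s], bins) for s in selected)
                      for i in remaining]
            best = remaining[int(np.argmax(scores))]
            selected.append(best)
            remaining.remove(best)
        return selected

With β fixed to 1/(n − 1), as noted above, each step reduces to the MRMR trade-off between relevance and average redundancy.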

However, the above MI-based feature selection methods suffer from three limitations. Firstly, the number of selected features needs to be specified in advance, and in real applications it is hard to estimate the number of useful features before the feature selection process. The second weakness is that they assume that each individually relevant feature should be dependent on the target class. This means that if a single feature is considered to be relevant it should be correlated with the target class; otherwise the feature is treated as irrelevant (Cheng et al., 2008). So only a small set of relevant features is selected, and larger feature combinations are not considered. The third weakness is that most of these methods focus on ranking features based on an information criterion and select the best K features in a greedy way. Here, commencing from an empty feature pool, features are added to the pool one by one until the user-defined number is reached. However, several authors have found that combining the individually best features does not necessarily give the best classification performance (Cover, 1974, Cover and Thomas, 1991).

Recently, graph-based methods, such as spectral embedding (Belkin and Niyogi, 2002), spectral clustering (Shi and Malik, 2000), and semi-supervised learning (Chung, 1997, Kulis et al., 2005), have played an important role in machine learning due to their ability to encode the similarity relationships among data. Various applications of graph-based methods can be found in clustering (Shi and Malik, 2000, Jain and Zhang, 2007), data mining (Jin et al., 2006), manifold learning (Zhu et al., 2003), subspace learning (He et al., 2006) and speech recognition (Bach and Jordan, 2006). A preliminary step for all these graph-based methods is to establish a graph over the training data. Data samples are represented as vertices of the graph, and the edges represent the pairwise similarity relationships between them. The methods used to establish the graph and to measure vertex similarities (i.e. edge weights) critically determine the performance of the subsequent graph-based learning algorithm. There are different similarity-based methods that can be used to determine the edge weights, and different methods may lead to different learning results; as a result, we need to carefully select the most suitable measure. In general, the Euclidean distance (Jiang et al., 2004) and Pearson's correlation coefficient (Rao, 1965) are both widely used as distance or similarity measures. However, both are distance-based measures, which only account for the proximity of data that follow a particular distribution (Yu et al., 2006). They are not effective at capturing functional similarity, such as positive or negative correlation and interdependency.

In feature selection, the attractive property of graph representations is that they provide a universal and flexible framework that reflects the underlying manifold structure and the relationships between feature vectors. The best known methods are the Fisher score (Bishop, 1995) and the Laplacian score (He et al., 2006), both of which belong to the general graph-based feature selection framework. In addition, Zhang and Hancock (2011) have developed a graph-based information-theoretic feature selection method which is based on dominant set clustering and the multidimensional interaction information (MII) criterion. The Laplacian score uses a nearest neighbor graph to model the local geometric structure of the data, where the pairwise similarities between features are calculated using the heat kernel. In this framework, the feature subset is selected based on the score of the entire feature subset, and the score is calculated in a trace ratio form. It has been proven (He et al., 2006) that, with label information, the Laplacian score becomes equal to the Fisher score. In order to discover both the geometrical and discriminant structure of the data manifold, Nie et al. (2008) construct two weighted graphs to capture the similarity structure of the data. The first is the intra-class or within-class similarity graph Gw, while the second is the inter-class or between-class similarity graph Gb. Graphs Gw and Gb are characterized by the weight matrices Aw and Ab. They employ the trace ratio criterion over the graphs to locate the optimal feature subset.

However, in many situations the graph representation of relational patterns can lead to a substantial loss of information. This is because, in real-world problems, objects and their features tend to exhibit multiple relationships rather than simple pairwise ones. As an illustration, consider the following feature selection problem. Given four features X1, X2, X3, X4 and a class set C, one may construct a graph in which each node corresponds to a feature, and each edge has a pairwise weight corresponding to the mutual information (MI) between the features connected by that edge. The pairwise weight can be defined as

Wi,j = I(Xi; C) + I(Xj; C) − I(Xi; Xj) + I(Xi; Xj|C).

Here the mutual information terms I(Xi; C) and I(Xj; C) measure the relevance of the individual features to the class, I(Xi; Xj) measures the degree of redundancy between the two features, and I(Xi; Xj|C) measures the influence of the two-feature combination on the class set C. Now assume the existing selected feature subset is {X1, X4}, with I(X2; C) = I(X3; C), I(X2; X1|C) = I(X3; X1|C), I(X2; X4|C) = I(X3; X4|C), and I(X1, X4, X2) ≫ I(X1, X2) + I(X4, X2). The last inequality indicates that X2 has a strong affinity with the joint subset {X1, X4}, although it has only a small individual affinity to each of its members. Because pairwise weights cannot express this third-order affinity, X2 may be discarded and X3 selected instead, even though the combination {X1, X4, X2} would produce a better cluster than {X1, X4, X3}.
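As a hedged illustration of how such pairwise weights could be estimated from data, the sketch below reuses the discrete_mi estimator given earlier; cond_mi averages per-class MI estimates, which is one standard way of estimating I(Xi; Xj|C), and all names are our own.

    import numpy as np

    def cond_mi(x, y, c, bins=8):
        """Conditional MI I(x; y | c) for a discrete class label c,
        estimated as the class-prior-weighted average of per-class MI."""
        classes, counts = np.unique(c, return_counts=True)
        priors = counts / counts.sum()
        return float(sum(p * discrete_mi(x[c == k], y[c == k], bins)
                         for p, k in zip(priors, classes)))

    def pairwise_weight(X, c, i, j, bins=8):
        """W_{i,j} = I(Xi; C) + I(Xj; C) - I(Xi; Xj) + I(Xi; Xj | C)."""
        return (discrete_mi(X[:, i], c, bins) + discrete_mi(X[:, j], c, bins)
                - discrete_mi(X[:, i], X[:, j], bins)
                + cond_mi(X[:, i], X[:, j], c, bins))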

A natural way of remedying the information loss described above is to represent the data set as a hypergraph instead of a graph. Hypergraph representations allow vertices to be multiply connected by hyperedges and can hence capture multiple or higher-order relationships between features. Due to their effectiveness in representing multiple relationships, hypergraph-based methods have been applied to various practical problems, such as partitioning circuit netlists (Hagen and Kahng, 1992), clustering (Agarwal et al., 2006, Zhou et al., 2007), clustering categorical data (Gibson et al., 2000), and image segmentation (Agarwal et al., 2005).

For the task of feature selection addressed in this paper, we propose a hypergraph-based feature selection algorithm consisting of two steps. First, we construct a hypergraph in which each node corresponds to a feature, and each hyperedge has a weight corresponding to the multidimensional interaction information (MII) among the features it connects. Second, we apply hypergraph clustering to locate the most informative feature subset (mIFS), which has both low redundancy and strong discriminating power. The advantage of MII is that it incorporates third- or higher-order feature interactions.
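This preview does not reproduce the paper's exact MII formula, so the sketch below shows only the classical three-way interaction information (in McGill's form) as a plausible stand-in for a third-order hyperedge weight. It reuses discrete_mi and cond_mi from the earlier sketches; the sign convention and the pairing of every two features with the class are our assumptions.

    import numpy as np

    def interaction_info(x, y, c, bins=8):
        """Three-way interaction information I(x; y; c) = I(x; y | c) - I(x; y).
        Positive values indicate that x and y are synergistic about c, i.e. the
        kind of higher-order dependency a hyperedge weight should reward."""
        return cond_mi(x, y, c, bins) - discrete_mi(x, y, bins)

    def hyperedge_weights(X, c, bins=8):
        """Hypothetical weights for the hyperedges {X_i, X_j, C}."""
        n = X.shape[1]
        W3 = np.zeros((n, n))
        for i in range(n):
            for j in range(i + 1, n):
                W3[i, j] = W3[j, i] = interaction_info(X[:, i], X[:, j], c, bins)
        return W3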

In summary, there are three main contributions in this paper. The first is that we develop a hypergraph representation based on the attributes of feature vectors, i.e. a feature hypergraph. With this representation, the structural information latent in the data can be modeled more effectively. The second is that, unlike most existing graph or hypergraph methods, which use distance metrics (e.g. Euclidean distance or Pearson's correlation coefficient) to weight edges or hyperedges, we determine the weights of the hyperedges using an information measure referred to as the multidimensional interaction information (MII). There are two advantages of MII. First, it effectively reflects functional similarity, such as positive or negative correlation and interdependency among features. Second, it is sensitive to the relations between feature combinations, and as a result can be used to seek third- or even higher-order dependencies between the relevant features. The third contribution is that we locate the most informative feature subset (mIFS) by hypergraph cluster analysis. In contrast with existing feature selection methods, our proposed method is able to determine the number of relevant features automatically.

The remainder of this paper is organized as follows. Section 2 describes the relevant background on hypergraphs. Section 3 describes how to combine the multidimensional interaction information (MII) criterion and hypergraph cluster analysis to locate the most informative feature subset (mIFS). The classification methods are presented in Section 4. In Section 5, we first give a description of the real-world benchmark data sets. We then examine the performance of our proposed hypergraph-based information-theoretic feature selection method, and compare the classification results with those obtained by alternative feature selection methods. Finally, conclusions are presented in Section 6.

Section snippets

Hypergraph fundamentals

A hypergraph is defined as a triplet H = (V, E, W), where V = {1, … , n} is the node set, E is a set of non-empty subsets of V, the hyperedges, and W is a weight function which associates a real value with each hyperedge. A hypergraph is a generalization of a graph: unlike graph edges, which consist of pairs of vertices, hyperedges can be arbitrarily sized sets of vertices. Examples of a hypergraph are shown in Fig. 1. For the hypergraph, the vertex set is V = {v1, v2, v3, v4, v5}, where each vertex represents a
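Since the snippet above is truncated in this preview, the following self-contained sketch illustrates only the standard incidence-matrix view of a weighted hypergraph; the vertex labels follow the text, but the hyperedge memberships and weights are invented for illustration and do not reproduce Fig. 1.

    import numpy as np

    vertices = ["v1", "v2", "v3", "v4", "v5"]
    hyperedges = {"e1": ["v1", "v2", "v3"],   # a 3-vertex hyperedge
                  "e2": ["v3", "v4"],         # an ordinary pairwise edge
                  "e3": ["v2", "v4", "v5"]}
    weights = {"e1": 1.2, "e2": 0.7, "e3": 0.9}   # hypothetical hyperedge weights

    # Incidence matrix H: H[v, e] = 1 iff vertex v belongs to hyperedge e.
    H = np.zeros((len(vertices), len(hyperedges)))
    for j, members in enumerate(hyperedges.values()):
        for v in members:
            H[vertices.index(v), j] = 1.0

    w = np.array([weights[e] for e in hyperedges])
    vertex_degree = H @ w             # d(v): total weight of hyperedges containing v
    hyperedge_degree = H.sum(axis=0)  # delta(e): number of vertices in each hyperedge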

Feature selection using hypergraph cluster analysis

In this paper we aim to utilize hypergraph cluster analysis to perform feature selection. Using a hypergraph representation of the features, there are two steps to the algorithm, namely (a) computing the weight matrix W based on the multidimensional interaction information (MII) among feature vectors, and (b) hypergraph cluster analysis to select the most informative feature subset (mIFS). In the remainder of this paper we describe these elements of our feature selection algorithm in more
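The snippet ends before the algorithmic details, so the following is only a generic sketch of dominant-set-style clustering by replicator dynamics on a pairwise weight matrix; a hypergraph version in the spirit of Bulò et al. (2009) in the reference list would replace the linear payoff W @ x with the gradient of a higher-order polynomial over the hyperedge weights, and the support threshold below is a heuristic of our own.

    import numpy as np

    def dominant_cluster(W, iters=500, tol=1e-8):
        """Replicator dynamics x_i <- x_i * (W x)_i / (x^T W x); the support of
        the fixed point is read off as the cluster, here interpreted as the
        most informative feature subset, whose size emerges automatically."""
        n = W.shape[0]
        x = np.full(n, 1.0 / n)
        for _ in range(iters):
            x_new = x * (W @ x)
            x_new /= x_new.sum()
            if np.abs(x_new - x).sum() < tol:
                break
            x = x_new
        return np.flatnonzero(x > 1.0 / (10 * n))   # heuristic support cutoff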

Classification

After finding the discriminating features, we run classification experiments on them using two classifiers. For multi-class data sets, we use a linear SVM with LIBSVM (Chang and Lin, 2001). For binary-class data sets, we apply the variational EM (VBEM) algorithm (Bishop, 2006) to fit a mixture of Gaussians model to the selected feature subset. After learning the mixture model, we use the class a posteriori probabilities, see Eq. (11), to classify each test sample. Given a
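As a rough scikit-learn rendering of this two-classifier scheme (not the authors' implementation): SVC(kernel="linear") is backed by LIBSVM, and BayesianGaussianMixture is scikit-learn's variational-Bayes Gaussian mixture standing in for the VBEM fit; the 0/1 class labels and the component count are assumptions.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.mixture import BayesianGaussianMixture

    def classify(X_train, y_train, X_test, selected, n_classes):
        """Classify using only the selected feature columns: a linear SVM for
        multi-class data, one variational GMM per class for binary data."""
        Xtr, Xte = X_train[:, selected], X_test[:, selected]
        if n_classes > 2:
            return SVC(kernel="linear").fit(Xtr, y_train).predict(Xte)
        # Fit a variational-Bayes Gaussian mixture to each class, then assign
        # each test point to the class maximizing log-prior + log-likelihood.
        models = [BayesianGaussianMixture(n_components=2, random_state=0)
                  .fit(Xtr[y_train == k]) for k in (0, 1)]
        log_prior = np.log([np.mean(y_train == k) for k in (0, 1)])
        scores = np.column_stack([m.score_samples(Xte) for m in models]) + log_prior
        return scores.argmax(axis=1)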

Experiments and comparisons

The data sets used to test the performance of our proposed algorithm are benchmark data sets from the UCI Machine Learning Repository and Statlog. Table 1 summarizes the sizes and properties of the ten data sets.

Our proposed feature selection method (referred to as MII + HG) combines the multidimensional interaction information (MII) criterion with hypergraph cluster analysis. It involves applying the MII criterion as the weight measure and then using hypergraph cluster

Conclusions

In this paper, we have presented a new hypergraph-based information-theoretic approach to feature selection. The proposed feature selection method offers three major advantages. First, the MII criterion is applied to measure the weight of the hyperedges; it takes into account high-order feature interactions, overcoming the problem of selecting redundant features. As a result, the features associated with the greatest amount of joint information can be preserved. Second, hypergraph clustering

References (35)

  • Jain, V. et al., 2007. A spectral approach to shape-based retrieval of articulated 3D models. Comput. Aided Des.
  • Agarwal, S., Lim, J., Zelnik-Manor, L., Perona, P., Kriegman, D., Belongie, S., 2005. Beyond pairwise clustering. In: ...
  • Agarwal, S., Branson, K., Belongie, S., Zhang, C., Yan, S., 2006. Higher order learning with graphs. In: Proceedings of the ...
  • Bach, F.R., Jordan, M.I., 2006. Learning spectral clustering, with application to speech separation. J. Machine Learn. Res.
  • Battiti, R., 2002. Using mutual information for selecting features in supervised neural net learning. IEEE Trans. Neural Networks.
  • Baum, L.E. et al., 1967. An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model for ecology. Bull. Amer. Math. Soc.
  • Belkin, M., Niyogi, P., 2002. Laplacian eigenmaps and spectral techniques for embedding and clustering. Adv. Neural Inform. Process. Systems.
  • Bishop, C.M., 1995. Neural Networks for Pattern Recognition.
  • Bishop, C.M., 2006. Pattern Recognition and Machine Learning.
  • Bulò, S.R. et al., 2009. A game-theoretic approach to hypergraph clustering. Adv. Neural Inform. Process. Systems.
  • Chang, C.-C., Lin, C.-J., 2001. LIBSVM: A library for support vector ...
  • Cheng, H., Qin, Z., Qian, W., Liu, W., 2008. Conditional mutual information based feature selection. In: IEEE ...
  • Chung, F., 1997. Spectral Graph Theory. Regional Conference Series in Mathematics, American Mathematical Society, 92 ...
  • Cover, T.M., 1974. The best two independent measurements are not the two best. IEEE Trans. Systems Man Cybernet.
  • Cover, T.M., Thomas, J.A., 1991. Elements of Information Theory.
  • Gibson, D. et al., 2000. Clustering categorical data: An approach based on dynamical systems. The VLDB J. Internat. J. Very Large Data Bases.
  • Hagen, L., Kahng, A., 1992. New spectral methods for ratio cut partitioning and clustering. IEEE Trans. Comput. Aided Des. Integrat. Circuits Systems.
1 Edwin Hancock is supported by the EU FET project SIMBAD and by a Royal Society Wolfson Research Merit Award.
