Consensus unsupervised feature ranking from multiple views

https://doi.org/10.1016/j.patrec.2007.11.012

Abstract

Feature ranking is a form of feature selection that orders the features by their relevance and importance to the problem at hand. The topic has been well studied for supervised classification, but very little work addresses unsupervised clustering, where the labels of the instances are unknown beforehand. Feature ranking for unsupervised clustering is therefore a challenging task: there are no instance labels to guide the computation of feature relevance. This paper explores feature ranking in the unsupervised clustering setting. We propose a novel consensus unsupervised feature ranking approach, termed unsupervised feature ranking from multiple views (FRMV). FRMV first obtains multiple rankings of all features from different views of the same data set and then aggregates them into a single consensus ranking. Experimental results on several real data sets demonstrate that FRMV is often able to identify a better feature ranking than that obtained by a single feature ranking approach.

Introduction

Data analysis often deals with complex data sets containing a large number of features (Schena et al., 1995, Baldi and Hatfield, 2002, Yu and Liu, 2004). For example, classification problems in molecular biology may involve thousands of features. Commonly, not all of these features are useful for the learning task: some are redundant, and some are irrelevant or noisy, which may even degrade the performance of the learning algorithm. More seriously, the high dimensionality of a data set causes several problems of its own. Data instances in high dimensions become very sparse, and most of them appear equally far from the centroids of clusters, so the performance of any learning algorithm that uses distance to measure the similarity of instances may degrade significantly (Fukunaga, 1990, Blum and Langley, 1997). To remove noisy features and mitigate the curse of dimensionality, an important step in analyzing high-dimensional data sets is selecting a meaningful subset of the features. This process is commonly termed feature selection, and is also called variable selection (Liu and Motoda, 1998). Good feature selection offers several advantages for a learning algorithm, such as lower computational cost, better classification accuracy, and more comprehensible results.

However, feature selection is not an easy task. Finding a good subset of the feature vector can require an exponential number of evaluations, which is intractable when the data set has a large number of features (Liu and Yu, 2005, Pudil et al., 1994, Kim et al., 2000, Kudo and Sklansky, 2000, Debuse and Rayward-Smith, 1997). Another important issue is how to evaluate a candidate feature subset (Liu and Yu, 2005, Kim et al., 2000). Traditional feature selection approaches are supervised, assuming the labels of all instances are known beforehand. When wrapper approaches are adopted and labels are available, a candidate feature subset is evaluated by the classification accuracy it yields on unseen instances (Kohavi and John, 1997): the data set is divided into a training set and a test set, the classifier is trained on the former, and its predictive error rate is estimated on the latter. When filter approaches are adopted, the relevance of a feature subset is computed from the correlations between its features and the labels of the instances (Yu and Liu, 2003). In many situations, however, no label information about the data instances is available, so labels cannot be used to estimate the quality of a candidate feature subset (Dy and Brodley, 2004). The absence of instance labels makes feature selection for unsupervised clustering considerably more difficult.
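
To make the two evaluation styles concrete, the sketch below scores a candidate subset both ways in the supervised setting. The data set, the subset of feature indices, and the choice of a k-NN classifier are illustrative assumptions, not anything prescribed by the paper.

```python
# Wrapper vs. filter evaluation of one candidate feature subset (a sketch).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
subset = [0, 2]  # a hypothetical candidate feature subset

# Wrapper: train a classifier on the subset and estimate its predictive
# accuracy on held-out instances (cf. Kohavi and John, 1997).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = KNeighborsClassifier(n_neighbors=3).fit(X_tr[:, subset], y_tr)
wrapper_score = clf.score(X_te[:, subset], y_te)

# Filter: score the subset by the correlation between each of its features
# and the class labels, without training any classifier.
filter_score = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])

print(f"wrapper accuracy = {wrapper_score:.3f}, filter score = {filter_score:.3f}")
```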

Feature ranking is a relaxed version of feature selection: it ranks all features with respect to their relevance, after which the top-ranked features are chosen manually as the working feature vector (see the sketch below). Feature ranking can therefore be viewed as a flexible form of feature subset selection. It has been well studied in the supervised classification area (Guyon et al., 2004, Stoppiglia et al., 2003). In this paper, we propose a novel unsupervised feature ranking approach, termed unsupervised feature ranking from multiple views (FRMV). FRMV aggregates multiple feature rankings obtained from different views of the same data into a single consensus ranking, and is therefore often able to achieve a better feature ranking than a single feature ranking approach. We tested FRMV on several real data sets, and the experimental results indicate its potential and effectiveness.
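
A minimal sketch of this "rank, then cut manually" usage: order features by relevance scores and keep the top k. The scores here are placeholders; FRMV produces its rankings differently (Section 3).

```python
import numpy as np

scores = np.array([0.12, 0.85, 0.40, 0.73, 0.05])  # hypothetical relevances
ranking = np.argsort(-scores)                      # best feature first
k = 2                                              # cut-off chosen manually
working_features = ranking[:k]
print(working_features)                            # [1 3]
```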

This work makes two contributions. First, we extend the feature ranking methodology to the unsupervised data clustering area. Second, we propose a stable and robust unsupervised feature ranking approach based on ensembles of multiple feature rankings obtained from different views of the same data set. To the best of our knowledge, very little work has used ensemble learning for feature selection, apart from Jong et al. (2004). In (Jong et al., 2004), a supervised feature ranking approach is proposed: several rules are extracted from the data set by genetic algorithms, each rule corresponding to one ranking of the features, and all rankings are then aggregated into a consensus one by majority voting. Unlike (Jong et al., 2004), where the labels of all data items are known beforehand and the population of diverse feature rankings is produced by genetic algorithms, our approach is unsupervised and the population of diverse feature rankings is obtained by the random subspace method (Ho, 1998).
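
The sketch below illustrates the overall pipeline just summarized: build diverse feature rankings from random subspaces (Ho, 1998) of the same data, then aggregate them into one consensus ranking. The per-view relevance score (correlation of each feature with k-means cluster labels) and the mean-rank aggregation are illustrative assumptions; the paper's exact scoring and aggregation schemes are those detailed in Section 3.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def rank_features_in_view(X, view, n_clusters=3):
    """Cluster one random subspace, then rank every feature against it."""
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(X[:, view])
    # Assumed relevance score: |correlation| of each feature with the
    # cluster labels induced by this view.
    scores = np.array([abs(np.corrcoef(X[:, j], labels)[0, 1])
                       for j in range(X.shape[1])])
    order = np.argsort(-scores)                  # best feature first
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(order) + 1)  # rank of f_j in this view
    return ranks

def frmv_consensus(X, n_views=20, subspace_size=None):
    """Aggregate rankings from many random subspaces (Borda-style mean rank)."""
    n = X.shape[1]
    subspace_size = subspace_size or max(2, n // 2)
    all_ranks = [rank_features_in_view(
                     X, rng.choice(n, subspace_size, replace=False))
                 for _ in range(n_views)]
    mean_rank = np.mean(all_ranks, axis=0)
    return np.argsort(mean_rank)                 # consensus ordering
```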

The remainder of this paper is organized as follows. Section 2 briefly reviews related work. Section 3 describes FRMV in detail. Section 4 reports experimental results on several real data sets. Section 5 concludes the paper.

Related work

In this section, the literature on unsupervised feature selection and unsupervised clustering ensembles is briefly reviewed.

Unsupervised feature ranking from multiple views

This section describes FRMV. Before going further, we introduce the notation used throughout this paper. Let D = {d_1, d_2, …, d_N} denote a data set containing N unlabeled instances, where d_ij is the value of feature f_j in the ith instance, and let F = {f_1, f_2, …, f_n} be the set of all features. RF^(k) = {rank^(k)(f_1), rank^(k)(f_2), …, rank^(k)(f_n)} represents the kth ranking of all features, where rank^(k)(f_j) is the rank of feature f_j in the kth feature ranking and 1 ≤ rank^(k)(f_j) ≤ n.
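
A tiny worked example of this notation: two rankings RF^(1) and RF^(2) of n = 4 features, stored as rank vectors whose jth entry holds rank^(k)(f_j), combined here by average rank. The averaging step is only an illustration of aggregating rankings, not the paper's exact rule.

```python
import numpy as np

RF1 = np.array([2, 1, 4, 3])        # rank^(1)(f_1), ..., rank^(1)(f_4)
RF2 = np.array([1, 2, 3, 4])        # rank^(2)(f_1), ..., rank^(2)(f_4)
mean_rank = (RF1 + RF2) / 2         # [1.5, 1.5, 3.5, 3.5]
consensus = np.argsort(mean_rank)   # feature indices, best first
print(consensus)                    # [0 1 2 3] (ties broken by index)
```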

Experimental results and analysis

Nine UCI data sets are selected to test the performance of FRMV (Blake and Merz). Their names and characteristics are shown in Table 1. In our experiments, we use the Rand Index to measure the accuracy of a clustering solution I (Rand, 1971). The Rand Index of the clustering solution I against the accurate partition I^(accurate) is calculated as

Rand(I, I^(accurate)) = 2 · (n_00 + n_11) / (n · (n − 1)),

where n_11 is the number of pairs of instances that are in the same group in both I and I^(accurate), n_00 denotes the number of pairs that are in different groups in both I and I^(accurate), and n is the number of instances.
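
A direct implementation of the formula above, written from the definition in the text; the label vectors in the usage line are hypothetical.

```python
from itertools import combinations

def rand_index(labels_pred, labels_true):
    """Rand(I, I_acc) = 2 * (n00 + n11) / (n * (n - 1))."""
    n11 = n00 = 0
    for i, j in combinations(range(len(labels_pred)), 2):
        same_pred = labels_pred[i] == labels_pred[j]
        same_true = labels_true[i] == labels_true[j]
        n11 += same_pred and same_true               # together in both
        n00 += (not same_pred) and (not same_true)   # apart in both
    n = len(labels_pred)
    return 2.0 * (n00 + n11) / (n * (n - 1))

print(rand_index([0, 0, 1, 1, 2], [0, 0, 1, 2, 2]))  # 0.8
```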

Conclusions

This paper studied the problem of feature ranking in the unsupervised clustering area. We have proposed a consensus unsupervised feature ranking approach that combines multiple rankings of the full feature set into a single consensus ranking. The proposed approach was tested on several real data sets, and the experimental results demonstrate that an ensemble of multiple feature rankings is able to achieve a better ranking than the one obtained by a single feature ranking approach.

Acknowledgement

This project is supported by Project No. 7002023, City University of Hong Kong. The authors would like to thank the reviewers for their comments and suggestions.

References (36)

  • Fern, X.Z., Brodley, C.E., 2003. Clustering ensembles for high dimensional data clustering. In: Proc. Internat. Conf....
  • Fischer, J., et al., 2003. Bagging for path-based clustering. IEEE Trans. Pattern Anal. Machine Intell.
  • Fred, A., et al., 2005. Combining multiple clusterings using evidence accumulation. IEEE Trans. Pattern Anal. Machine Intell.
  • Fukunaga, K., 1990. Introduction to Statistical Pattern Recognition.
  • Guyon, I., et al., 2003. An introduction to variable and feature selection. J. Mach. Learn. Res.
  • Guyon, I., et al., 2004. Gene selection for cancer classification using support vector machines. Mach. Learn.
  • Hall, M.A., 2000. Correlation based feature selection for discrete and numeric class machine learning. In: Proc....
  • Hall, M.A., Smith, L.A., 1997. Feature subset selection: a correlation based filter approach. In: Proc. Internat. Conf....