Elsevier

Pattern Recognition

Volume 42, Issue 7, July 2009, Pages 1330-1339

Feature selection with dynamic mutual information

https://doi.org/10.1016/j.patcog.2008.10.028

Abstract

Feature selection plays an important role in data mining and pattern recognition, especially for large-scale data. Over the past years, various metrics have been proposed to measure the relevance between features. Because mutual information is nonlinear and can effectively capture the dependencies between features, it is one of the most widely used measures in feature selection, and many promising feature selection algorithms based on mutual information with different parameters have been developed. In this paper, we first introduce a general criterion function for mutual-information-based feature selectors, which subsumes most of the information measures used in previous algorithms. In traditional selectors, mutual information is estimated on the whole sampling space; this, however, cannot exactly represent the relevance among features. To cope with this problem, the second purpose of this paper is to propose a new feature selection algorithm based on dynamic mutual information, which is estimated only on the unlabeled instances. To verify the effectiveness of our method, several experiments are carried out on sixteen UCI datasets using four typical classifiers. The experimental results indicate that our algorithm achieves better results than other methods in most cases.

Introduction

Since data mining is capable of identifying new, potentially useful information in datasets, it has been widely used in many areas, such as decision support, pattern recognition and financial forecasting [1]. With the emergence of the Internet and bioinformatics, datasets are growing larger and larger, which may cause traditional mining or learning algorithms to become slow and unable to process information effectively. One potent way to mitigate this problem is to reduce the amount of data with sampling techniques [2], [3]. In many applications, however, there are relatively few data entries while the dimensionality reaches hundreds or more; in such cases, sampling is not a good choice. Theoretically, having more features implies more discriminative power in classification. In practice this is not always true, because not all features are important for understanding or representing the underlying phenomena of interest [4], [5]. Meanwhile, high dimensionality of data may cause the "curse of dimensionality" problem [6]. Thus, feature reduction (or dimensionality reduction) has been proposed to address this issue.

Feature reduction refers to the study of methods for reducing the number of dimensions describing data. Its general purpose is to employ fewer features to represent the data and reduce the computational cost, without deteriorating discriminative capability. Since feature reduction brings many advantages to learning algorithms, such as avoiding over-fitting, resisting noise and strengthening prediction performance, it has attracted great attention and many reduction algorithms have been developed over the past years. Generally, they can be divided into two broad categories [7]: feature transform (or feature extraction) and feature selection (or variable selection). Feature transform constructs new features by projecting the original feature space onto a lower-dimensional one. Principal component analysis and independent component analysis are two widely used feature transform methods [5]. Although feature transform can yield the smallest number of dimensions, its major drawbacks are its high computational overhead and the difficulty of interpreting its output.

Feature selection is the process of choosing a subset of the original feature space according to discrimination capability, so as to improve the quality of data. Unlike feature transform, the reduced set of original features obtained by feature selection makes the results easier to interpret in data analysis. Due to this advantage, feature selection has been widely applied in many domains, such as text categorization [8], image retrieval [9], [10], bioinformatics [11], [12], and intrusion detection [13]. Roughly speaking, there are three kinds of feature selection methods [14], [15], [16]: wrapper, filter and embedded methods. In the embedded model, feature selection is integrated into the training process of a given learning algorithm; one typical embedded method is C4.5 [17]. Wrappers choose features with high prediction performance as estimated by a specified learning algorithm. Because they take prediction capability into consideration, wrappers can achieve better results than the other approaches. Unfortunately, wrapper methods are less general and need more computational resources, because they are tightly coupled with the specified learning algorithm. Consequently, they are often intractable for large-scale problems.

In contrast, filter selection methods identify a feature subset from the original space on the basis of given evaluation criteria that are independent of any learning algorithm. Due to their computational efficiency, filter methods are very popular for high-dimensional data. Nowadays, a considerable number of filter selection algorithms and tools, such as Relief [18] and CFS [4], have been developed, and more efficient approaches are still emerging. It is noticeable that, among the different evaluation criteria, the information metric has been studied most comprehensively. The main reason is that information entropy is a good measure for quantifying the uncertainty of a feature.
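
To make the information measure concrete, the following is a minimal sketch (not code from the paper; the function name and the toy data are illustrative) of how the mutual information between a discrete feature and the class label can be estimated from empirical counts:

    from collections import Counter
    from math import log2

    def mutual_information(feature_values, labels):
        """Estimate I(f; C) for a discrete feature f and class labels C
        from empirical (maximum-likelihood) probabilities."""
        n = len(labels)
        count_f = Counter(feature_values)                 # marginal counts of feature values
        count_c = Counter(labels)                         # marginal counts of class labels
        count_fc = Counter(zip(feature_values, labels))   # joint counts
        mi = 0.0
        for (v, c), n_vc in count_fc.items():
            p_vc = n_vc / n                               # joint probability p(v, c)
            # add p(v, c) * log2( p(v, c) / (p(v) * p(c)) )
            mi += p_vc * log2(n_vc * n / (count_f[v] * count_c[c]))
        return mi

    # toy example: a feature that perfectly separates two equally frequent classes
    f = ['a', 'a', 'b', 'b']
    y = [0, 0, 1, 1]
    print(mutual_information(f, y))   # 1.0 bit

Features irrelevant to the class give values close to zero, while features that determine the class give values up to the class entropy H(C).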

Most feature selection methods based on the information metric offer different forms of evaluation functions. One purpose of this paper is to provide a general measurement for the state-of-the-art techniques built on information theory; under this framework, we discuss the relationships between them one by one. After analyzing these algorithms, we observe that the information entropy in them is estimated on the whole sampling space, which is fixed once the data are given. That is to say, the values of the information metric remain constant throughout the selection procedure. However, this cannot accurately represent the degree of relevance between features as the selection procedure proceeds. As we know, each instance in a dataset is either unlabeled or labeled, in a mutually exclusive fashion, with respect to the target classes. For labeled instances, any candidate feature is redundant or irrelevant, because these instances can already be completely classified or recognized by the selected features. Thus, the information metric should be re-estimated only on the unlabeled instances, rather than on the whole instance space. Based on this observation, we put forward a new feature selection algorithm using dynamic mutual information, which is the second purpose of this paper.

The rest of this paper is organized as follows. Section 2 briefly reviews previous work on filter feature selection algorithms. In Section 3, some basic concepts about feature selection and information theory are given. Section 4 first introduces a general information criterion for feature selection methods, and then discusses the relations between different methods under this framework. Section 5 presents a new feature selection algorithm using dynamic mutual information. Several experiments conducted to evaluate the effectiveness of our approach are reported in Section 6. Finally, conclusions and future work are given.

Section snippets

Related work

So far, many selection methods have been proposed to identify salient features. This section briefly reviews the state of the art. Unless stated otherwise, we focus our attention on filter feature selection methods; for the others, interested readers can refer to the previous literature (e.g., [7], [11], [14], [15], [16]) for more information.

Typically, filter selection methods work within a framework comprising four components: subset generation, subset evaluation, stopping criterion and result validation.

Preliminaries

In this section, several basic concepts about mutual information are given first, and then the formalism of feature selection is presented.

A general criterion of feature selection on mutual information

As stated above, each classifier is a mapping from F to C. Hence, a feature is relevant to the classes if it embodies important information about them; otherwise it is irrelevant or redundant [49]. Since mutual information is good at quantifying how much information is shared by two random variables, it is often taken as an evaluation criterion to measure the relevance between features and the class labels. In this context, features f ∈ F with high predictive power will have larger mutual information with the class labels.
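
For reference, the underlying quantities can be written in the usual information-theoretic notation (this is standard material; the specific parameterized form of the general criterion is given in Section 4 and is not reproduced here):

    H(C) = -\sum_{c \in C} p(c) \log p(c)

    H(C \mid f) = -\sum_{v \in f} p(v) \sum_{c \in C} p(c \mid v) \log p(c \mid v)

    I(f; C) = H(C) - H(C \mid f)
            = \sum_{v \in f} \sum_{c \in C} p(v, c) \log \frac{p(v, c)}{p(v)\, p(c)}

A greedy filter then repeatedly picks f^* = \arg\max_{f \in F \setminus S} J(f), where in the simplest case J(f) = I(f; C) and S is the set of already selected features; many of the methods unified by the general criterion further penalize redundancy with terms such as I(f; s), s \in S.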

A new selector using dynamic mutual information

In this section, a feature subset selection algorithm based on dynamic mutual information is proposed. Before delving into the details of our algorithm, let us briefly review the concept of feature relevance.

Given a dataset T=D(F,C), the classification learning task is to characterize the relationship between F and C so that this relationship can be used to predict future cases. Therefore, any good feature subset induced by a selection algorithm should preserve this existing relationship between F and C.
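
As a rough illustration of this idea (a sketch only: the helper mutual_information is the estimator sketched earlier, and the rule used here for deciding when an instance counts as already discriminated is an assumption; the paper's exact procedure is specified later in this section), the selector can be written as a greedy loop that shrinks the set of unlabeled instances after every pick:

    def dynamic_mi_selection(X, y, mutual_information):
        """Greedy filter sketch: after each selected feature, drop the instances
        that the selected features already discriminate, and re-estimate mutual
        information only on the remaining 'unlabeled' instances."""
        n_samples, n_features = len(X), len(X[0])
        selected, remaining = [], list(range(n_features))
        active = list(range(n_samples))                   # indices of unlabeled instances

        while remaining and active:
            # relevance is re-estimated dynamically, only on the active instances
            scores = {j: mutual_information([X[i][j] for i in active],
                                            [y[i] for i in active])
                      for j in remaining}
            best = max(scores, key=scores.get)
            if scores[best] <= 0.0:                       # no candidate adds information
                break
            selected.append(best)
            remaining.remove(best)

            # assumption: an instance is 'labeled' once all active instances sharing
            # its selected-feature pattern belong to a single class
            patterns = {}
            for i in active:
                key = tuple(X[i][j] for j in selected)
                patterns.setdefault(key, set()).add(y[i])
            active = [i for i in active
                      if len(patterns[tuple(X[i][j] for j in selected)]) > 1]
        return selected

With the estimator from the earlier sketch, dynamic_mi_selection(X, y, mutual_information) returns the indices of the selected features in the order they were chosen.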

Simulation experiments

In this section, a series of experiments is carried out to evaluate the effectiveness of the proposed method. To this end, a brief description of the benchmark datasets and the experimental design is given first, and then the simulation results are presented and discussed.

Conclusion and future work

In this paper, a general criterion function for feature selection algorithms based on mutual information is first introduced. Under this general scheme, the relationship between this function and the other information measurements used in up-to-date methods is also discussed. In spite of their different forms, these measurements can be roughly grouped into three categories in terms of this criterion.

The second objective of this paper is to propose a new feature selection algorithm based on dynamic mutual information, which is estimated only on the unlabeled instances.

Acknowledgments

The authors are grateful to anonymous referees for their valuable and constructive comments, and Prof. Ying Kim for her valuable suggestions. This work is supported by the Doctor Point Foundation of Educational Department (20060183044) and Science Foundation for Young Teachers of Northeast Normal University (20081003).


References (63)

  • G.H. John et al., Irrelevant feature and the subset selection problem
  • D. Huang et al., Effective feature selection scheme using mutual information, Neurocomputing (2005)
  • W.W. Cohen, Fast effective rule induction
  • U. Fayyad et al., From data mining to knowledge discovery in databases, AI Magazine (1996)
  • M. Lindenbaum et al., Selective sampling for nearest neighbor classifiers, Machine Learning (2004)
  • A.I. Schein et al., Active learning for logistic regression: an evaluation, Machine Learning (2007)
  • M.A. Hall, Correlation-based feature subset selection for machine learning, Ph.D. Dissertation, Department of Computer...
  • I.K. Fodor, A survey of dimension reduction techniques, Technical Report UCRL-ID-148494, Lawrence Livermore National...
  • R. Bellman, Adaptive Control Processes: A Guided Tour (1961)
  • G. Forman, An extensive empirical study of feature selection metrics for text classification, Journal of Machine Learning Research (2003)
  • J.G. Dy et al., Unsupervised feature selection applied to content-based retrieval of lung images, IEEE Transactions on Pattern Analysis and Machine Intelligence (2003)
  • D.L. Swets et al., Efficient content-based image retrieval using automatic feature selection
  • Y. Saeys et al., A review of feature selection techniques in bioinformatics, Bioinformatics (2007)
  • E. Xing et al., Feature selection for high-dimensional genomic microarray data
  • W. Lee et al., Adaptive intrusion detection: a data mining approach, AI Review (2000)
  • I. Guyon et al., An introduction to variable and feature selection, Journal of Machine Learning Research (2003)
  • H. Liu et al., Toward integrating feature selection algorithms for classification and clustering, IEEE Transactions on Knowledge and Data Engineering (2005)
  • R. Quinlan, C4.5: Programs for Machine Learning (1993)
  • K. Kira, L. Rendell, A practical approach to feature selection, in: Proceedings of the 9th International Conference on...
  • G. Qu et al., A new dependency and correlation analysis for features, IEEE Transactions on Knowledge and Data Engineering (2005)
  • I. Kononenko, Estimating attributes: analysis and extensions of relief


About the Author—HUAWEN LIU received his B.Sc. degree in computer science from Jiangxi Normal University in 1999, and his M.Sc. degree in computer science from Jilin University, P.R. China, in 2007. At present, he is a Ph.D. candidate at Jilin University. His research interests include data mining, machine learning, pattern recognition and rough sets.

About the Author—JIGUI SUN received his M.Sc. degree in mathematics and his Ph.D. degree in computer science from Jilin University in 1988 and 1993, respectively. He joined the College of Computer Science and Technology of Jilin University as a lecturer in 1993. Currently, he is a professor at Jilin University, P.R. China, and dean of the Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education. His research interests mainly include artificial intelligence, machine learning, data mining, automated reasoning, intelligent planning and intelligent decision support systems.

About the Author—LEI LIU received his B.Sc. and M.Sc. degrees in computer science from Jilin University, P.R. China, in 1982 and 1985, respectively. He then worked at the College of Computer Science and Technology, Jilin University, as a lecturer, and is now a professor there. Over the past years, more than 200 of his papers have been accepted or published by journals and conferences. His research interests mainly cover the semantic web, data mining, computational language and programming theory.

About the Author—HUIJIE ZHANG received her B.Sc. and M.Sc. degrees in computer science from Jilin University in 1998 and 2004, respectively. Currently, she is a lecturer in the Department of Computer Science, Northeast Normal University, P.R. China. Her research interests include Geographical Information Systems (GIS), data mining and pattern recognition.
