Pattern Recognition

Volume 61, January 2017, Pages 511-523

An efficient semi-supervised representatives feature selection algorithm based on information theory

https://doi.org/10.1016/j.patcog.2016.08.011

Highlights

  • A relevance gain framework through which the relevance of features can be measured on the unlabeled data.

  • A partition of the directed acyclic graph that clusters the redundant features.

  • An extension of existing Markov blanket algorithms to exploit the information in the unlabeled data.

Abstract

Feature selection (FS) plays an important role in data mining and recognition, especially for large-scale text, image and biological data. In supervised feature selection, the Markov blanket provides a sound and complete solution for selecting optimal features, capturing both the relevance of features to the class and the conditional independence relationships among features. However, incomplete label information makes it particularly difficult to acquire the optimal feature subset. In this paper, we propose a novel algorithm called the Semi-supervised Representatives Feature Selection algorithm based on information theory (SRFS), which is independent of any classification learning algorithm and can rapidly and effectively identify and remove irrelevant and redundant features. More importantly, through the relevance gain, the unlabeled data are utilized in the Markov blanket analysis in the same way as the labeled data. Our results on several benchmark datasets demonstrate that SRFS can significantly improve upon state-of-the-art supervised and semi-supervised algorithms.

Introduction

Data mining is capable of finding meaningful information and patterns in datasets, and has been widely applied in various fields, such as text/image retrieval engines, machinery signal processing, medical diagnosis support and the financial market [1], [2], [3], [4], [5]. An increasingly serious problem is that traditional data mining and learning algorithms cannot handle these ever-growing datasets rapidly and effectively. Although more features should, in theory, carry more discriminative power, this is not always true: when the dataset's intrinsic manifold is relatively fixed, the additional features are irrelevant or redundant [6], [7], [8]. One of the key solutions to this problem is to reduce the dimensionality of the dataset by identifying and removing such meaningless features before any further analysis.

After dimensionality reduction, the dataset's computational complexity is reduced while its discriminative capability is preserved or even improved [9], [10], [11]. It has been shown, in both theory and practice, that dimensionality reduction can enhance learning efficiency, increase predictive accuracy, and reduce computational complexity. There are two major ways to achieve dimensionality reduction: feature extraction and feature selection. The former projects data into a new reduced subspace, where the original meanings of the features are changed [12], [13], [14], [15]. The latter selects the optimal feature subset while removing the irrelevant and redundant features; more importantly, the selected features preserve their original meanings, which may explain why researchers are particularly interested in feature selection [16], [17], [18], [19]. Feature selection can be divided into four categories: wrapper, embedded, filter, and hybrid. The wrapper approach selects the feature subset with the highest prediction performance under a specified learning algorithm [20], [21], [22]. Similarly, the embedded approach selects the best feature subset during the learning process of a specified learning algorithm [23], [24]. The filter approach chooses a feature subset from the original feature space according to pre-specified evaluation criteria or the dataset's intrinsic properties, evaluating the relevance of each feature or subset using only the data themselves [25], [26], [27], [28]. The hybrid approach uses both an independent criterion and a learning algorithm to evaluate candidate feature subsets, combining the advantages of the wrapper and filter approaches [29], [30], [31]. In this paper, we are particularly interested in filter feature selection, which is more general and more computationally efficient; a minimal sketch of a filter-style selector is given below.
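
To make the filter idea concrete, the following minimal sketch ranks features by their mutual information with the class label and keeps the top d, independently of any downstream classifier. The function name and parameters are illustrative, not from the paper.

```python
# A minimal filter-style selector: score each feature by its mutual
# information with the class label, then keep the d highest-scoring ones.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def filter_select(X, y, d):
    """Return indices of the d features most relevant to the labels y."""
    scores = mutual_info_classif(X, y)   # per-feature relevance to the class
    return np.argsort(scores)[::-1][:d]  # indices of the top-d scores
```

Because the score is computed one feature at a time, such a ranking can miss features that are informative only jointly, a weakness of feature-score filters noted again in the next paragraph.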

Depending on whether label information is used, feature selection methods can be classified into supervised and unsupervised approaches. Supervised FS requires a large amount of labeled data. For example, the Markov blanket of a feature is a set of other features that, together with the label information, subsumes all the information that feature carries; once such a blanket is found, the feature can be removed as redundant [19], [32] (a sketch of an approximate version of this test is given below). Without enough labeled data, supervised FS may fail to identify the features that are discriminative for the different classes [18], [32], [33], [34]. Unlike supervised FS, unsupervised FS works only with unlabeled data and ignores label information, and may therefore fail to identify the discriminative or important features [35], [36], [37], [38], [39]. Moreover, labeling is time-consuming and expensive in engineering practice, while large amounts of unlabeled data can be readily obtained in the real world. It is therefore worth noting that research on semi-supervised methods [40], [41], [42], [43], [44] for partially labeled data appears to be of greater practical importance than purely unsupervised or supervised FS. Semi-supervised feature selection examines how to better identify features that are discriminative for the different classes by effectively exploiting the information underlying both the limited labeled data and the large amount of unlabeled data in real-world applications. Zhao and Liu [41] proposed a filter-based semi-supervised FS method, which ranks features via spectral analysis and mutual information. Zhao et al. [42] also introduced the idea that the labeled data are used to maximize the margin between data from different classes, while the unlabeled data preserve the geometrical structure of the data space. However, feature-score filter approaches can discard important features that are less informative by themselves but more informative when combined with other features. Xu et al. [43] proposed a wrapper-based semi-supervised FS method that maximizes the classification margin between different classes while simultaneously exploiting the geometry of the probability distribution generating both the labeled and unlabeled data. Such approaches use the correlations among features collectively during the learning process, and may return feature subsets that are more discriminative or important than those of the filter approaches. Unfortunately, they are less general and require more computational resources in the learning process.
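
To illustrate the redundancy test mentioned above, the sketch below implements the approximate Markov blanket criterion popularized by FCBF [32] for discrete features: F_j is judged redundant given F_i when the symmetrical uncertainty SU(F_i, F_j) is at least as large as SU(F_j, C), the correlation of F_j with the class. The entropy helpers are our own illustrative code, not code from the paper.

```python
# Approximate Markov blanket test in the spirit of FCBF [32], for discrete
# features: Fj is redundant given Fi when SU(Fi, Fj) >= SU(Fj, C).
import numpy as np

def entropy(x):
    """Shannon entropy H(X) of a discrete sequence, in bits."""
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def joint_entropy(x, y):
    """Joint entropy H(X, Y) of two discrete sequences."""
    xy = np.stack([np.asarray(x), np.asarray(y)], axis=1)
    _, counts = np.unique(xy, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), normalized to [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    mi = hx + hy - joint_entropy(x, y)   # mutual information I(X; Y)
    return 2.0 * mi / (hx + hy)

def is_redundant(fi, fj, c):
    """True if fi forms an approximate Markov blanket for fj w.r.t. class c."""
    return symmetrical_uncertainty(fi, fj) >= symmetrical_uncertainty(fj, c)
```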

Inspired by recent work on the Markov blanket [19], [32], [45], [46], [47], [48], [49] and information theory [8], [50], [51], [52], [53] for selecting an optimal feature subset, we propose an efficient Semi-supervised Representatives Feature Selection algorithm based on information theory (SRFS). SRFS belongs to the filter approaches and performs exceptionally well at removing irrelevant features and clustering redundant features. This work's contributions are as follows: (1) We develop a relevance gain framework through which the relevance of features can be measured on the unlabeled data. (2) We introduce the partition of a directed acyclic graph to cluster the redundant features. (3) We extend the existing Markov blanket algorithms to exploit the additional information entropy contained in the unlabeled data.

The rest of this paper is organized as follows. Section 2 briefly reviews related work. Section 3 explains the proposed semi-supervised feature selection algorithm and discusses the relevance gain. Experimental studies are presented in Section 4. Finally, Section 5 draws conclusions and outlines future work.

Section snippets

Related works

In this section, several basic concepts of information theory are given, after which the related feature selection algorithms based on the Markov blanket are discussed. For the above, some notation used throughout this paper is given as follows. Let F = {F_1, F_2, ..., F_D} denote the full set of features, where F_i (i = 1, 2, ..., D) is the i-th feature of the dataset, and let S ⊆ F be a feature subset. The dataset is a matrix X = {X_L, X_U}^T of size (N+M) × D, where X_L = {x_1, ..., x_N}^T are the N labeled samples and X_U = {x_{N+1}, ..., x_{N+M}}^T are the M unlabeled samples.
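
A concrete reading of this notation, with placeholder sizes: the dataset stacks the N labeled rows X_L on top of the M unlabeled rows X_U to form the (N+M) × D matrix X.

```python
# Concrete illustration of the notation above; N, M, D are placeholders.
import numpy as np

N, M, D = 100, 400, 30        # labeled rows, unlabeled rows, features
X_L = np.random.rand(N, D)    # labeled samples x_1, ..., x_N
X_U = np.random.rand(M, D)    # unlabeled samples x_{N+1}, ..., x_{N+M}
X = np.vstack([X_L, X_U])     # full dataset X = {X_L, X_U}^T
assert X.shape == (N + M, D)
```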

Semi-supervised representatives feature selection

In this section, we introduce the framework of the Semi-supervised Representatives Feature Selection algorithm based on information theory (SRFS). The objective of SRFS is to find a feature subset S of size d that contains the representative features, exploiting both the labeled and the unlabeled data; a generic skeleton of such a subset-growing procedure is sketched below.
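
The snippet ends before the algorithm itself, so the following is only a generic greedy subset-growing skeleton consistent with the stated objective; the score function is a placeholder and does not reproduce SRFS's relevance-gain or Markov blanket criteria.

```python
# Generic greedy subset-growing skeleton, NOT the SRFS algorithm itself:
# score(f, selected) is a placeholder for a criterion such as relevance
# minus redundancy with the already-selected features.
def greedy_select(n_features, d, score):
    selected, remaining = [], set(range(n_features))
    while len(selected) < d and remaining:
        best = max(remaining, key=lambda f: score(f, selected))
        selected.append(best)
        remaining.discard(best)
    return selected

# usage: greedy_select(D, d, score=lambda f, S: my_criterion(f, S))
```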

Experimental preparation

To evaluate the effectiveness of our method, we considered different types of feature selection methods on several benchmark datasets, available from the UCI Machine Learning Repository [62] and the ASU Feature Selection Repository [63]. In all of the experiments, we compared SRFS against three well-known feature selection methods: the locality sensitive semi-supervised feature selection (LS3) [42], mRMR [18] and FCBF [32]. All of these methods have shown their advantages in
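
A hedged sketch of the kind of comparison described here: select d features, train a simple classifier on them, and compare cross-validated accuracy. Because implementations of SRFS, LS3, mRMR and FCBF are not included with this page, a generic mutual-information filter stands in for the selectors; the dataset and classifier choices are likewise illustrative.

```python
# Selector-vs-accuracy comparison on a benchmark dataset; the MI filter
# is a stand-in for SRFS / LS3 / mRMR / FCBF implementations.
from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)
for d in (10, 20, 30):
    pipe = make_pipeline(SelectKBest(mutual_info_classif, k=d),
                         KNeighborsClassifier(n_neighbors=3))
    acc = cross_val_score(pipe, X, y, cv=5).mean()
    print(f"d={d:2d}  mean CV accuracy = {acc:.3f}")
```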

Conclusion

In this paper, we propose the SRFS method for feature selection based on the Markov blanket and information theory, which proves a remarkable improvement over the LS3, mRMR and FCBF methods. Our work highlights the relevance gain as a measure of redundancy, combined with mutual information so as to utilize all the information contained in the unlabeled and labeled data. Meanwhile, the partition of the directed acyclic graph method clusters the redundant features and distributes features into

Acknowledgments

This work is sponsored by the National Natural Science Foundation of China (61139002, 61501229, 11547040), the Guangdong Province Natural Science Foundation (2016A030310051, 2015KONCX143), the Shenzhen Fundamental Research Foundation (JCYJ20150625101524056), Project 2016047 supported by the SZU R/D Fund, and the Open Fund supported by SKL-MCCS.

References (70)

  • Y. Yang et al.

    A hybrid feature selection scheme for unsupervised learning and its application in bearing fault diagnosis

    Expert Syst. Appl.

    (2011)
  • P. Zhu et al.

    Unsupervised feature selection by regularized self-representation

    Pattern Recognit.

    (2015)
  • S. Wang et al.

    Subspace learning for unsupervised feature selection via matrix factorization

    Pattern Recognit.

    (2015)
  • S. Bandyopadhyay et al.

    Integration of dense subgraph finding with feature clustering for unsupervised feature selection

    Pattern Recognit. Lett.

    (2014)
  • R. Cai et al.

    BASSUM: a Bayesian semi-supervised method for classification feature selection

    Pattern Recognit.

    (2011)
  • J. Zhao et al.

    Locality sensitive semi-supervised feature selection

    Neurocomputing

    (2008)
  • Z. Zhu et al.

    Markov blanket-embedded genetic algorithm for gene selection

    Pattern Recognit.

    (2007)
  • F. Amiri et al.

    Mutual information-based feature selection for intrusion detection systems

    J. Netw. Comput. Appl.

    (2011)
  • N. Hoque et al.

    MIFS-ND: a mutual information-based feature selection method

    Expert Syst. Appl.

    (2014)
  • J. Huang et al.

    A hybrid genetic algorithm for feature selection wrapper based on mutual information

    Pattern Recognit. Lett.

    (2007)
  • M. Dash et al.

    Consistency-based search in feature selection

    Artif. Intell.

    (2003)
  • Z. Zhu et al.

    Towards a memetic feature selection paradigm

    IEEE Comput. Intell. Mag.

    (2010)
  • H. Liao et al.

    Predicting missing links via correlation between nodes

    Physica A

    (2014)
  • A.K. Jain et al.

    Statistical pattern recognition: a review

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2000)
  • I. Guyon et al.

    An introduction to variable and feature selection

    J. Mach. Learn. Res.

    (2003)
  • B. Mwangi et al.

    A review of feature reduction techniques in neuroimaging

    Neuroinformatics

    (2014)
  • J.B. Tenenbaum et al.

    A global geometric framework for nonlinear dimensionality reduction

    Science

    (2000)
  • S.T. Roweis et al.

    Nonlinear dimensionality reduction by locally linear embedding

    Science

    (2000)
  • M. Sugiyama

    Dimensionality reduction of multimodal labeled data by local fisher discriminant analysis

    J. Mach. Learn. Res.

    (2007)
  • Y. Wang et al.

    Semi-supervised local fisher discriminant analysis based on reconstruction probability class

    Int. J. Pattern Recognit. Artif. Intell.

    (2015)
  • M. Robnik Šikonja et al.

    Theoretical and empirical analysis of ReliefF and RReliefF

    Mach. Learn.

    (2003)
  • H. Peng et al.

    Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2005)
  • Q. Song et al.

    A fast clustering-based feature subset selection algorithm for high-dimensional data

    IEEE Trans. Knowl. Data Eng.

    (2013)
  • P. Mitra et al.

    Unsupervised feature selection using feature similarity

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2002)
  • Y. Kim et al.

    Evolutionary model selection in unsupervised learning

    Intell. Data Anal.

    (2002)

    Yintong Wang was born in Yancheng, PR China, in February 1987. He received the B.S. degree in computer application from Southeast University Chengxian College, Jiangsu, China in 2009 and the M.S. degree in computer software and theory from Changchun University of Technology, Jilin, China in 2012. He is currently a doctoral candidate at the College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, PR China. His research interests include data mining and artificial intelligence.

    Jiandong Wang was born in November 1945 and received the B.S. degree in Electrical Engineering from Shanghai Jiao Tong University, China. He is currently a Professor and doctoral supervisor in the College of Computer Science and Technology at Nanjing University of Aeronautics and Astronautics. His main research interests include artificial intelligence, data mining and information security.

    Hao Liao was born in Huanggang, PR China, in April 1987. He received the Ph.D. degree in information physics from the University of Fribourg, Switzerland. He is currently a Lecturer in the College of Computer Science and Software Engineering at Shenzhen University. His main research interests include artificial intelligence, data mining, information networks, and the information economy.

    Haiyan Chen was born in Changzhou, PR China, in 1979. She received the B.S., M.S. and Ph.D. degrees in Computer Science and Technology from the Nanjing University of Aeronautics and Astronautics (NUAA), Nanjing, PR China, in 2002, 2005 and 2014, respectively. She is currently an Associate Professor in the College of Computer Science and Technology at Nanjing University of Aeronautics and Astronautics. Her research concerns data mining, modeling and simulation.
