Pattern Recognition

Volume 61, January 2017, Pages 511-523

An efficient semi-supervised representatives feature selection algorithm based on information theory

https://doi.org/10.1016/j.patcog.2016.08.011

Highlights

  • A relevance gain framework through which the relevance of features can be measured on the unlabeled data.

  • A partition of the directed acyclic graph that clusters the redundant features.

  • An extension of existing Markov blanket algorithms to exploit the information in the unlabeled data.

Abstract

Feature selection (FS) plays an important role in data mining and recognition, especially for large-scale text, image and biological data. In supervised feature selection, the Markov blanket provides a sound and complete solution for selecting optimal features, capturing both the relevance of features to the class and the conditional independence relationships among features. However, incomplete label information makes it particularly difficult to acquire the optimal feature subset. In this paper, we propose a novel algorithm called the Semi-supervised Representatives Feature Selection algorithm based on information theory (SRFS), which is independent of any classification learning algorithm and can rapidly and effectively identify and remove irrelevant and redundant features. More importantly, through the relevance gain, the unlabeled data are utilized in the Markov blanket analysis in the same way as the labeled data. Our results on several benchmark datasets demonstrate that SRFS can significantly improve upon state-of-the-art supervised and semi-supervised algorithms.

Introduction

Data mining is capable of finding meaningful information and patterns in datasets, and has been widely applied in various fields, such as text/image retrieval engines, machinery signal processing, medical diagnosis support and the financial market [1], [2], [3], [4], [5]. An increasingly serious problem is that traditional data mining and learning algorithms cannot handle these ever-growing datasets rapidly and effectively. Although more features should, in theory, carry more discriminative power, this is not always true: when the dataset's intrinsic manifold is relatively fixed, the additional features are irrelevant or redundant [6], [7], [8]. One of the key solutions to this problem is to reduce the dimensionality of the dataset by identifying and removing such meaningless features before any further analysis.

After dimensionality reduction, the dataset's computational complexity is reduced while its discriminative capability is preserved or even improved [9], [10], [11]. It has been shown, in both theory and practice, that dimensionality reduction can enhance learning efficiency, increase predictive accuracy, and reduce computational complexity. There are two major ways to achieve dimensionality reduction: feature extraction and feature selection. The former projects data into a new reduced subspace, where the original meanings of the features are changed [12], [13], [14], [15]. The latter selects the optimal feature subset while removing the irrelevant and redundant features; more importantly, the selected features preserve their original meanings, which may explain why researchers are particularly interested in feature selection [16], [17], [18], [19]. Feature selection can be divided into four categories: wrapper, embedded, filter, and hybrid. The wrapper approach selects the feature subset with the highest prediction performance under a specified learning algorithm [20], [21], [22]. Similarly, the embedded approach selects the best feature subset during the learning process of a specified learning algorithm [23], [24]. The filter approach chooses a feature subset from the original feature space according to pre-specified evaluation criteria or the dataset's intrinsic properties, evaluating the relevance of each feature or subset using only the data themselves [25], [26], [27], [28]. The hybrid approach uses both an independent criterion and a learning algorithm to evaluate candidate feature subsets, combining the advantages of the wrapper and filter approaches [29], [30], [31]. In this paper, we are particularly interested in filter feature selection, which is more general and more computationally efficient; a minimal sketch of a filter-style selector is given below.
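
To make the filter idea concrete, the following minimal sketch ranks features by their mutual information with the class label and keeps the top d, independently of any downstream classifier. The function name and parameters are illustrative, not from the paper.

```python
# A minimal filter-style selector: score each feature by its mutual
# information with the class label, then keep the d highest-scoring ones.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def filter_select(X, y, d):
    """Return indices of the d features most relevant to the labels y."""
    scores = mutual_info_classif(X, y)   # per-feature relevance to the class
    return np.argsort(scores)[::-1][:d]  # indices of the top-d scores
```

Because the score is computed one feature at a time, such a ranking can miss features that are informative only jointly, a weakness of feature-score filters noted again in the next paragraph.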

Depending on whether label information is used, feature selection methods can be classified into supervised and unsupervised approaches. Supervised FS requires a large amount of labeled data. For example, the Markov blanket of a feature is a set of other features that, together with the label information, subsumes all the information that feature carries; once such a blanket is found, the feature can be removed as redundant [19], [32] (a sketch of an approximate version of this test is given below). Without enough labeled data, supervised FS may fail to identify the features that are discriminative for the different classes [18], [32], [33], [34]. Unlike supervised FS, unsupervised FS works only with unlabeled data and ignores label information, and may therefore fail to identify the discriminative or important features [35], [36], [37], [38], [39]. Moreover, labeling is time-consuming and expensive in engineering practice, while large amounts of unlabeled data can be readily obtained in the real world. It is therefore worth noting that research on semi-supervised methods [40], [41], [42], [43], [44] for partially labeled data appears to be of greater practical importance than purely unsupervised or supervised FS. Semi-supervised feature selection examines how to better identify features that are discriminative for the different classes by effectively exploiting the information underlying both the limited labeled data and the large amount of unlabeled data in real-world applications. Zhao and Liu [41] proposed a filter-based semi-supervised FS method, which ranks features via spectral analysis and mutual information. Zhao et al. [42] also introduced the idea that the labeled data are used to maximize the margin between data from different classes, while the unlabeled data preserve the geometrical structure of the data space. However, feature-score filter approaches can discard important features that are less informative by themselves but more informative when combined with other features. Xu et al. [43] proposed a wrapper-based semi-supervised FS method that maximizes the classification margin between different classes while simultaneously exploiting the geometry of the probability distribution generating both the labeled and unlabeled data. Such approaches use the correlations among features collectively during the learning process, and may return feature subsets that are more discriminative or important than those of the filter approaches. Unfortunately, they are less general and require more computational resources in the learning process.
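
To illustrate the redundancy test mentioned above, the sketch below implements the approximate Markov blanket criterion popularized by FCBF [32] for discrete features: F_j is judged redundant given F_i when the symmetrical uncertainty SU(F_i, F_j) is at least as large as SU(F_j, C), the correlation of F_j with the class. The entropy helpers are our own illustrative code, not code from the paper.

```python
# Approximate Markov blanket test in the spirit of FCBF [32], for discrete
# features: Fj is redundant given Fi when SU(Fi, Fj) >= SU(Fj, C).
import numpy as np

def entropy(x):
    """Shannon entropy H(X) of a discrete sequence, in bits."""
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def joint_entropy(x, y):
    """Joint entropy H(X, Y) of two discrete sequences."""
    xy = np.stack([np.asarray(x), np.asarray(y)], axis=1)
    _, counts = np.unique(xy, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), normalized to [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    mi = hx + hy - joint_entropy(x, y)   # mutual information I(X; Y)
    return 2.0 * mi / (hx + hy)

def is_redundant(fi, fj, c):
    """True if fi forms an approximate Markov blanket for fj w.r.t. class c."""
    return symmetrical_uncertainty(fi, fj) >= symmetrical_uncertainty(fj, c)
```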

Inspired by recent work on the Markov blanket [19], [32], [45], [46], [47], [48], [49] and information theory [8], [50], [51], [52], [53] for selecting an optimal feature subset, we propose an efficient Semi-supervised Representatives Feature Selection algorithm based on information theory (SRFS). SRFS belongs to the filter approaches and performs exceptionally well at removing irrelevant features and clustering redundant features. This work's contributions are as follows: (1) We develop a relevance gain framework through which the relevance of features can be measured on the unlabeled data. (2) We introduce the partition of a directed acyclic graph to cluster the redundant features. (3) We extend the existing Markov blanket algorithms to exploit the additional information entropy contained in the unlabeled data.

The rest of this paper is organized as follows. Section 2 briefly reviews related work. Section 3 explains the proposed semi-supervised feature selection algorithm and discusses the relevance gain. Experimental studies are presented in Section 4. Finally, Section 5 draws conclusions and outlines future work.

Section snippets

Related works

In this section, several basic concepts of information theory are given, after which the related feature selection algorithms based on the Markov blanket are discussed. For the above, some notation used throughout this paper is given as follows. Let F = {F_1, F_2, ..., F_D} denote the full set of features, where F_i (i = 1, 2, ..., D) is the i-th feature of the dataset, and let S ⊆ F be a feature subset. The dataset is a matrix X = {X_L, X_U}^T of size (N+M) × D, where X_L = {x_1, ..., x_N}^T are the N labeled samples and X_U = {x_{N+1}, ..., x_{N+M}}^T are the M unlabeled samples.
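
A concrete reading of this notation, with placeholder sizes: the dataset stacks the N labeled rows X_L on top of the M unlabeled rows X_U to form the (N+M) × D matrix X.

```python
# Concrete illustration of the notation above; N, M, D are placeholders.
import numpy as np

N, M, D = 100, 400, 30        # labeled rows, unlabeled rows, features
X_L = np.random.rand(N, D)    # labeled samples x_1, ..., x_N
X_U = np.random.rand(M, D)    # unlabeled samples x_{N+1}, ..., x_{N+M}
X = np.vstack([X_L, X_U])     # full dataset X = {X_L, X_U}^T
assert X.shape == (N + M, D)
```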

Semi-supervised representatives feature selection

In this section, we introduce the framework of the Semi-supervised Representatives Feature Selection algorithm based on information theory (SRFS). The objective of SRFS is to find a feature subset S of size d that contains the representative features, exploiting both the labeled and the unlabeled data; a generic skeleton of such a subset-growing procedure is sketched below.
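
The snippet ends before the algorithm itself, so the following is only a generic greedy subset-growing skeleton consistent with the stated objective; the score function is a placeholder and does not reproduce SRFS's relevance-gain or Markov blanket criteria.

```python
# Generic greedy subset-growing skeleton, NOT the SRFS algorithm itself:
# score(f, selected) is a placeholder for a criterion such as relevance
# minus redundancy with the already-selected features.
def greedy_select(n_features, d, score):
    selected, remaining = [], set(range(n_features))
    while len(selected) < d and remaining:
        best = max(remaining, key=lambda f: score(f, selected))
        selected.append(best)
        remaining.discard(best)
    return selected

# usage: greedy_select(D, d, score=lambda f, S: my_criterion(f, S))
```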

Experimental preparation

To evaluate the effectiveness of our method, we considered different types of feature selection methods on several benchmark datasets, available from the UCI Machine Learning Repository [62] and the ASU Feature Selection Repository [63]. In all of the experiments, we compared SRFS against three well-known feature selection methods: the locality sensitive semi-supervised feature selection (LS3) [42], mRMR [18] and FCBF [32]. All of these methods have shown their advantages in
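
A hedged sketch of the kind of comparison described here: select d features, train a simple classifier on them, and compare cross-validated accuracy. Because implementations of SRFS, LS3, mRMR and FCBF are not included with this page, a generic mutual-information filter stands in for the selectors; the dataset and classifier choices are likewise illustrative.

```python
# Selector-vs-accuracy comparison on a benchmark dataset; the MI filter
# is a stand-in for SRFS / LS3 / mRMR / FCBF implementations.
from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)
for d in (10, 20, 30):
    pipe = make_pipeline(SelectKBest(mutual_info_classif, k=d),
                         KNeighborsClassifier(n_neighbors=3))
    acc = cross_val_score(pipe, X, y, cv=5).mean()
    print(f"d={d:2d}  mean CV accuracy = {acc:.3f}")
```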

Conclusion

In this paper, we propose the SRFS method for feature selection based on the Markov blanket and information theory, which proves a remarkable improvement over the LS3, mRMR and FCBF methods. Our work highlights the relevance gain as a measure of redundancy, combined with mutual information so as to utilize all the information contained in the unlabeled and labeled data. Meanwhile, the partition of the directed acyclic graph method clusters the redundant features and distributes features into

Acknowledgments

This work is sponsored by the National Natural Science Foundation of China (61139002, 61501229, 11547040), the Guangdong Province Natural Science Foundation (2016A030310051, 2015KONCX143), the Shenzhen Fundamental Research Foundation (JCYJ20150625101524056), Project 2016047 supported by the SZU R/D Fund, and the Open Fund supported by SKL-MCCS.

References (70)

  • Y. Yang et al.

    A hybrid feature selection scheme for unsupervised learning and its application in bearing fault diagnosis

    Expert Syst. Appl.

    (2011)
  • P. Zhu et al.

    Unsupervised feature selection by regularized self-representation

    Pattern Recognit.

    (2015)
  • S. Wang et al.

    Subspace learning for unsupervised feature selection via matrix factorization

    Pattern Recognit.

    (2015)
  • S. Bandyopadhyay et al.

    Integration of dense subgraph finding with feature clustering for unsupervised feature selection

    Pattern Recognit. Lett.

    (2014)
  • R. Cai et al.

    BASSUM: a Bayesian semi-supervised method for classification feature selection

    Pattern Recognit.

    (2011)
  • J. Zhao et al.

    Locality sensitive semi-supervised feature selection

    Neurocomputing

    (2008)
  • Z. Zhu et al.

    Markov blanket-embedded genetic algorithm for gene selection

    Pattern Recognit.

    (2007)
  • F. Amiri et al.

    Mutual information-based feature selection for intrusion detection systems

    J. Netw. Comput. Appl.

    (2011)
  • N. Hoque et al.

    MIFS-ND: a mutual information-based feature selection method

    Expert Syst. Appl.

    (2014)
  • J. Huang et al.

    A hybrid genetic algorithm for feature selection wrapper based on mutual information

    Pattern Recognit. Lett.

    (2007)
  • M. Dash et al.

    Consistency-based search in feature selection

    Artif. Intell.

    (2003)
  • Z. Zhu et al.

    Towards a memetic feature selection paradigm

    IEEE Comput. Intell. Mag.

    (2010)
  • H. Liao et al.

    Predicting missing links via correlation between nodes

    Physica A

    (2014)
  • A.K. Jain et al.

    Statistical pattern recognition: a review

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2000)
  • I. Guyon et al.

    An introduction to variable and feature selection

    J. Mach. Learn. Res.

    (2003)
  • B. Mwangi et al.

    A review of feature reduction techniques in neuroimaging

    Neuroinformatics

    (2014)
  • J.B. Tenenbaum et al.

    A global geometric framework for nonlinear dimensionality reduction

    Science

    (2000)
  • S.T. Roweis et al.

    Nonlinear dimensionality reduction by locally linear embedding

    Science

    (2000)
  • M. Sugiyama

    Dimensionality reduction of multimodal labeled data by local fisher discriminant analysis

    J. Mach. Learn. Res.

    (2007)
  • Y. Wang et al.

    Semi-supervised local fisher discriminant analysis based on reconstruction probability class

    Int. J. Pattern Recognit. Artif. Intell.

    (2015)
  • M. Robnik Šikonja et al.

    Theoretical and empirical analysis of ReliefF and RReliefF

    Mach. Learn.

    (2003)
  • H. Peng et al.

    Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2005)
  • Q. Song et al.

    A fast clustering-based feature subset selection algorithm for high-dimensional data

    IEEE Trans. Knowl. Data Eng.

    (2013)
  • P. Mitra et al.

    Unsupervised feature selection using feature similarity

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2002)
  • Y. Kim et al.

    Evolutionary model selection in unsupervised learning

    Intell. Data Anal.

    (2002)

    Yintong Wang was born in Yancheng, PR China, in February 1987. He received the B.S. degree in computer application from Southeast University Chengxian College, Jiangsu, China in 2009 and the M.S. degree in computer software and theory from Changchun University of Technology, Jilin, China in 2012. He is currently a doctoral candidate at the College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, PR China. His research interests include data mining and artificial intelligence.

    Jiandong Wang was born in November 1945 and received the B.S. degree in Electrical Engineering from Shanghai Jiao Tong University, China. He is currently a Professor and doctoral supervisor in the College of Computer Science and Technology at Nanjing University of Aeronautics and Astronautics. His main research interests include artificial intelligence, data mining and information security.

    Hao Liao was born in Huanggang, PR China, in April 1987. He received the Ph.D. degree in information physics from the University of Fribourg, Switzerland. He is currently a Lecturer in the College of Computer Science and Software Engineering at Shenzhen University. His main research interests include artificial intelligence, data mining, information networks, and the information economy.

    Haiyan Chen was born in Changzhou, PR China, in 1979. She received the B.S., M.S. and Ph.D. degrees in Computer Science and Technology from the Nanjing University of Aeronautics and Astronautics (NUAA), Nanjing, PR China, in 2002, 2005 and 2014, respectively. She is currently an Associate Professor in the College of Computer Science and Technology at Nanjing University of Aeronautics and Astronautics. Her research concerns data mining, modeling and simulation.
