OSFSMI: Online stream feature selection method based on mutual information
Graphical abstract
In this paper, two novel online streaming feature selection methods, called OSFSMI and OSFSMI-k, are proposed. Both methods follow the same strategy; the only difference is that OSFSMI-k requires a predefined feature subset size, while OSFSMI identifies the number of features automatically. The general framework of our methods is shown in the following diagram:
Introduction
Rapid improvement of storage and information processing technologies has led to the appearance of large-scale datasets with large numbers of patterns and features [1]. The presence of high-dimensional data − known as the curse of dimensionality − reduces the performance of many machine learning methods [2]. A popular approach to tackle this problem is to reduce the dimensionality of the feature space [3]. Feature selection is a well-known and effective dimensionality reduction approach that aims at selecting a parsimonious feature subset by identifying and eliminating redundant and irrelevant features.
Up to now, many feature selection methods have been proposed to improve the interpretability, efficiency and accuracy of learning models. Most of these methods require access to the entire feature set to perform their search process [4], [5], [6], [7], [8], [9], [10], [11]. However, in many real-world applications it is either impossible to acquire the entire data or impractical to wait for the complete data before feature selection starts [12], [13], [14], [15]. In other words, in these applications data arrives sequentially, and novel features or instances may appear incrementally. For example, in online social networks such as Twitter, when a new hot topic emerges, a set of new keywords appears, which increases the dimensionality of the data over time. Traditional feature selection methods need to load the entire training dataset into memory, which exceeds the memory capacity in many real-world applications. These limitations make traditional batch feature selection techniques impractical for emerging big data applications. To overcome these problems, online streaming feature selection (OSF) methods have recently been proposed as a complementary algorithmic methodology that addresses high dimensionality in big data analytics by choosing the most informative features [15], [16], [17], [18].
Considering the fact that the whole data is unavailable, a successful OSF method needs an efficient incremental update rule in its search process. To this end, several methods have recently been proposed to select the best feature subset from online data streams. These methods can be classified into two categories: instance-based and feature-based OSF methods. In instance-based OSF methods, the number of instances increases over time, while the number of features is assumed to be fixed [16], [19], [20], [21]. This type of method can be employed in applications such as traffic network monitoring, financial analysis of stock data streams and Internet query monitoring, where the whole feature space is available from the beginning but the number of instances increases over time. For example, the method proposed in [22] uses an incremental learning algorithm to select prominent features as new instances arrive. The scope of these methods is therefore limited to problems where all features are given before the learning process. On the other hand, feature-based OSF methods assume that the feature space is unavailable or infinite before the feature selection process starts [17], [23], [24], [25], [26], [27]. In some real-world applications, features are expensive to generate (e.g., via a lab experiment), and thus may appear in a streaming manner. Generally, feature-based OSF methods define a criterion to decide whether or not a newly arrived feature should be added to the model. For example, in [23] a statistical analysis is performed to evaluate the importance of a so-far-seen feature. This method requires prior knowledge about the entire feature space, and it has no strategy to remove features that are later found to be redundant. In [26] a conditional independence test is used to evaluate both the relevancy and the redundancy of a newly arrived feature.
Although this method considers both relevancy and redundancy analysis in its online process, in each iteration it compares a new feature with all subsets of previously selected features, and thus it incurs a high computational cost to process a so-far-seen feature. Recently, the authors of [24] proposed a method called SAOLA that uses pairwise mutual information to evaluate the relevance and redundancy of features in an online manner. In this method, the decision to include or ignore a feature depends on some predefined parameters whose precise values can only be set with knowledge of the whole feature space. Also, the authors of [17] proposed an online feature selection method based on rough set theory to assess the importance of features in an online manner. Although their method does not require any domain knowledge, it only operates effectively on datasets containing discrete values, and therefore a discretization step is necessary for real-valued attributes. This method also suffers from high computational complexity when dealing with large-scale datasets.
In this paper, two novel feature-based OSF methods are proposed that aim to achieve high classification accuracy with a reasonable running time. We suppose that features appear incrementally over time while the number of instances is fixed. The first method selects a prominent feature subset of minimal size, while the second chooses a set of prominent features of constant size k. These methods employ the mutual information concept to evaluate the relevancy and redundancy of features without using any learning model; thus, the proposed methods can be classified as filter-based feature selection methods. A number of mutual-information-based methods have been successfully applied to feature selection tasks; see the review in [4]. However, most of these algorithms consider the batch feature selection problem and cannot be applied to online feature selection scenarios. For example, in [28] mutual information is used to choose relevant features and eliminate redundant ones in an offline manner, and one needs to know the whole feature space at the beginning of the search process. The framework of the proposed methods consists of two steps. In the first step, the relevancy of each newly arrived feature with respect to the target class is evaluated; relevant features are included in the selected subset and the others are ignored. In the second step, the goal is to identify and eliminate ineffective features over several iterations, using a specific strategy to eliminate redundant features: if a previously selected feature is identified as redundant, it is removed from the selected subset. To this end, a measure is introduced that evaluates the effectiveness of a newly arrived feature considering both relevancy and redundancy.
Using this measure, features that are ineffective compared to a so-far-seen feature are eliminated from the feature set, and the process continues iteratively until no ineffective feature remains. In this way, a newly arrived feature is given another chance to be processed in further steps. Moreover, when a feature is removed, it is guaranteed that some other effective features remain that have higher relevancy and lower redundancy than it, which decreases the risk of discarding a useful feature. The proposed methods have several novelties compared to the previous feature-based OSF methods [17], [23], [24], [25], [26], [27]:
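The two-step framework described above can be sketched in a few lines of Python. The mutual-information estimator below is the standard plug-in estimator for discrete data; the elimination rule (drop an already selected feature when the newcomer is at least as relevant and shares more information with that feature than the feature shares with the class) is an illustrative stand-in for the paper's effectiveness measure, not its exact formula:

```python
from collections import Counter
from math import log2

def mi(x, y):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(x)
    pxy, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    return sum(c / n * log2(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def process_feature(name, f, target, selected):
    """Process one newly arrived feature (two-step online filter sketch)."""
    rel_f = mi(f, target)
    if rel_f <= 1e-12:                 # step 1: discard irrelevant features
        return selected
    # step 2: eliminate selected features rendered ineffective by the newcomer
    survivors = {g_name: (g, rel_g) for g_name, (g, rel_g) in selected.items()
                 if not (rel_g < rel_f and mi(f, g) >= rel_g)}
    survivors[name] = (f, rel_f)
    return survivors

# Toy stream: a binary class label and three features arriving one at a time.
y      = [0, 0, 1, 1, 0, 0, 1, 1]
stream = {"f1": [0, 1, 0, 1, 0, 1, 0, 1],   # independent of y -> rejected
          "f2": [0, 0, 1, 1, 0, 0, 1, 0],   # noisy copy of y  -> kept at first
          "f3": [0, 0, 1, 1, 0, 0, 1, 1]}   # exact copy of y  -> supersedes f2

selected = {}
for name, f in stream.items():
    selected = process_feature(name, f, y, selected)
print(sorted(selected))                     # -> ['f3']
```

Note how the second step gives the stream a self-correcting quality: f2 is accepted when it arrives, but is later removed once the strictly more relevant and redundant-with-it feature f3 appears.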
- 1. The proposed methods employ mutual information to analyze the relevancy and redundancy of features. Compared to [29], [30], which use rough set theory in their processes, mutual information has several advantages: the rough-set-based methods require more time steps (here n and m denote the number of instances and the number of features, respectively) than the proposed methods, and mutual information can be used for both discrete and continuous features [31], while rough sets can only be applied to data with discrete variables.
- 2. The proposed methods do not employ any adjustable user-defined parameters. Thus, compared to [23], [26], they generate more robust results over various information sources.
- 3. The proposed methods employ an elimination strategy that removes redundant features in later steps, even if they were previously selected. Compared to [23], this strategy returns a feature set with minimal redundancy.
- 4. The redundancy analysis step of the proposed methods is performed only on relevant features, while in [26], [27] the redundancy of the so-far-seen feature is computed in each step against all other previously selected features, which incurs a high computational cost. Moreover, [26], [27] use a greedy search strategy that eliminates redundant features by checking all subsets of the selected features, so their complexity for evaluating the redundancy and relevancy of each feature grows with the number of subsets of St, where St denotes the set of features selected up to time t. Our algorithms take far fewer time steps to identify and eliminate redundant features.
- 5. Compared with the algorithm proposed in [24], the proposed methods calculate the redundancy of features with respect to the target class, whereas [24] uses pairwise mutual information to discover redundant features without considering the target class. Two features may be dependent on each other while each shares different information about the target class, and thus they cannot be considered redundant.
- 6. The proposed methods are filter, feature-based OSF methods and do not use any learning model in their processes. In [22], by contrast, data samples arrive one by one and a learning method is used to evaluate features, making it a wrapper, instance-based OSF method; it is therefore much slower than the proposed methods.
- 7. Although the method proposed in [28] uses the mutual information concept for relevancy and redundancy analysis, it is an offline feature selection method: it needs access to the whole feature space and cannot be used for online streaming feature selection.
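Point 5 can be made concrete with a small synthetic example: two features that a pairwise mutual-information test would flag as mutually redundant, yet that carry complementary information about the class (here the class is simply the XOR of the two features). The `mi` helper is a standard plug-in estimator for discrete data; the dataset is invented for illustration:

```python
from collections import Counter
from math import log2

def mi(x, y):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(x)
    pxy, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    return sum(c / n * log2(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

# Two mutually dependent features whose class information is complementary.
f1 = [0, 0, 0, 0, 1, 1, 1, 1]
f2 = [0, 0, 0, 1, 1, 1, 1, 0]          # agrees with f1 on 6 of 8 samples
y  = [a ^ b for a, b in zip(f1, f2)]   # class = f1 XOR f2

print(mi(f1, f2))                 # positive: a pairwise test calls them redundant
print(mi(f1, y), mi(f2, y))       # each feature alone carries 0 bits about y
print(mi(list(zip(f1, f2)), y))   # jointly they are informative about y
```

A target-aware redundancy analysis keeps both features, because neither can be removed without losing information about the class, while a purely pairwise test would wrongly discard one of them.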
The efficiency of the proposed methods is evaluated in two complex scenarios over 29 datasets of different categories. Our experiments show the superiority of the proposed methods over the others.
Section snippets
Related work
The aim of feature selection is to select a set of prominent features to improve interpretability and efficiency of the learning model without degrading model accuracy. Considering the type of data arrival, feature selection approaches can be either offline or online. Offline methods, also known as traditional feature selection methods, need to access the entire feature space to perform their global search [32]. These methods can be categorized into filter, wrapper, embedded and hybrid methods.
Proposed methods
In this section we provide the details of the proposed methods for the online stream feature selection problem, called OSFSMI and OSFSMI-k. These methods use mutual information to take relevancy and redundancy into account in their processes. Both methods assume that the entire dataset is not accessible at the beginning of the process and that features appear incrementally in an online manner.
Experimental results
In this section the effectiveness of the proposed algorithms is assessed and compared with two different types of feature selection methods. First, we compare the first proposed method (OSFSMI) with five state-of-the-art online feature selection methods: SAOLA [24], group-SAOLA [43], OSFS [27], fast-OSFS [26] and Alpha-investing [23]. To provide a streaming feature selection scenario, it is supposed that the features are not available and they are presented
Conclusion
Feature selection aims to select informative features by removing redundant and irrelevant ones. In this paper two novel feature selection methods, called OSFSMI and OSFSMI-k, were proposed for feature selection on online data streams. The main idea is to use mutual information to compute both the relevancy and the redundancy of so-far-seen features in an online manner, in order to select a most informative, non-redundant feature set that is highly relevant to the target class. The search process
References (57)
- et al., A survey on feature selection methods, Comput. Electr. Eng. (2014)
- et al., Gene selection for microarray data classification using a novel ant colony optimization, Neurocomputing (2015)
- et al., A graph theoretic approach for unsupervised feature selection, Eng. Appl. Artif. Intell. (2015)
- et al., Relevance-redundancy feature selection based on ant colony optimization, Pattern Recogn. (2015)
- et al., Weighted bee colony algorithm for discrete optimization problems with application to feature selection, Eng. Appl. Artif. Intell. (2015)
- et al., Knowledge reduction of dynamic covering decision information systems when varying covering cardinalities, Inf. Sci. (2016)
- et al., Online streaming feature selection using rough sets, Int. J. Approx. Reasoning (2016)
- et al., Attribute reduction: a dimension incremental strategy, Knowl.-Based Syst. (2013)
- et al., Selecting feature subset for high dimensional data via the propositional FOIL rules, Pattern Recogn. (2013)
- On-line incremental feature weighting in evolving fuzzy classifiers, Fuzzy Sets Syst.
- An incremental meta-cognitive-based scaffolding fuzzy neural network, Neurocomputing
- Constraint Score: A new filter method for feature selection with pairwise constraints, Pattern Recogn.
- Feature selection using joint mutual information maximisation, Expert Syst. Appl.
- Library of online streaming feature selection, Knowl.-Based Syst.
- The key data mining models for high dimensional data
- Computational Methods of Feature Selection (Chapman & Hall/CRC Data Mining and Knowledge Discovery Series)
- Nonlinear dimensionality reduction by locally linear embedding, Science
- A review of feature selection methods based on mutual information, Neural Comput. Appl.
- A review of feature selection methods on synthetic data, Knowl. Inf. Syst.
- Feature Extraction: Foundations and Applications (Studies in Fuzziness and Soft Computing)
- Streamwise feature selection, J. Mach. Learn. Res.
- Learning in Non-Stationary Environments: Methods and Applications
- Knowledge Discovery from Data Streams
- Data Mining, and Machine Learning: Value Creation for Business Leaders and Practitioners
- Online feature selection for mining big data
- Incremental attribute reduction based on elementary sets
- Online feature selection and its applications, IEEE Trans. Knowl. Data Eng.