Online feature selection for multi-source streaming features
Introduction
In the era of big data, feature spaces are usually high dimensional and continuously varying [1], [2], [3], [4], [5]. Candidate features arrive one by one over time in a streaming manner and continuously accumulate. Examples include the timely updating of hot topics on Twitter [6], dynamic analysis of clinical and physiological data [7], feature mining of brain cognitive processes [8], [9], real-time filtering of spam [10], and dynamic recommendation of Web services [1], [2]. The traditional offline batch mode is unsuitable for these streaming feature scenarios and selects many redundant features. Hence, features must be selected in a timely manner with respect to the class attribute. Recently, Online Feature Selection with Streaming Features (OSFSF) has attracted considerable attention as an effective approach to the curse of dimensionality [11], [12], [13], [14]. OSFSF selects a relevant feature subset from the features that have arrived so far and keeps that subset optimal for the current phase. Representative algorithms of online streaming feature selection include Grafting [15], Alpha-investing [16], OSFS [17], OGFS [18], and SAOLA [19]. However, these algorithms are designed for feature selection from a single dataset only. They can be applied to multi-source data either by pooling the multiple datasets together or by using the union (∪) or intersection (∩) of the features selected in each data source. The former often leads to unreliable results because of the inconsistent distribution of multi-source data [20]. Although MCFS [20] can search for an invariant set of selected features, it is restricted to fixed feature spaces and requires the same features across multiple interventional datasets. The abovementioned algorithms ignore a common scenario: in many learning tasks, data sources may have both different and overlapping features. Furthermore, certain instances may span multiple data sources [21].
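The union and intersection baselines mentioned above can be illustrated with a minimal sketch. It assumes a single-source selector has already been run on each data source; the function name `combine_selected`, the source names, and the feature names are hypothetical and do not come from the paper.

```python
from typing import Dict, Set


def combine_selected(per_source: Dict[str, Set[str]], mode: str = "union") -> Set[str]:
    """Merge the feature sets selected independently in each data source."""
    sets = list(per_source.values())
    if not sets:
        return set()
    if mode == "union":
        # Keep a feature if any source selected it.
        return set().union(*sets)
    if mode == "intersection":
        # Keep a feature only if every source selected it.
        return set.intersection(*sets)
    raise ValueError(f"unknown mode: {mode!r}")


# Hypothetical per-source results of a single-source online selector.
selected = {
    "hospital_a": {"blood_wbc", "urine_ph", "ecg_qt"},
    "hospital_b": {"blood_wbc", "mri_lesion"},
}
union_result = combine_selected(selected, "union")
intersection_result = combine_selected(selected, "intersection")
```

Neither strategy looks at the relationship between sources: the union can keep features that are redundant across sources, and the intersection discards every feature that is unique to one source.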
In the following scenarios, the selected features often have multiple sources with overlapping instances and features [22], [23].
Scenario 1: In disease diagnosis and pathological analysis, fixed patients are often selected as observation samples, and each patient is considered an instance. As shown in Fig. 1, the white spaces represent nonexistent instances. Examination items, such as blood routine, urine routine, and stool routine, are used as features, where the urine and stool routines come from the same data source. The examination data of patients may come from multiple hospitals or from different departments of one hospital. For any examination, such as B-ultrasound, X-ray, MRI, or ECG, a patient's examination items are consistent, and no features are missing. Different patient instances have both overlapping and unique items. Furthermore, features continue to increase as the patient undergoes further examinations. Therefore, this is a typical multi-source feature stream with overlapping instances and features.
Scenario 2: A similar situation occurs in an ecosystem with several sensors collecting data, where the data returned from each sensor corresponds to a feature. An object monitored by multiple sensors, such as plankton or an oil slick in seawater, is called an instance. Because each sensor has a limited lifespan, worn-out sensors must be replaced by new ones. Therefore, features corresponding to previous sensors vanish, and features corresponding to current sensors emerge [24].
Scenario 3: Important international news, such as the US presidential election, may be reported by different media, including CNN, FOX, Reuters, Facebook, and Twitter. News about the candidates is treated as a fixed set of instances over a specific period. The focuses (i.e., features) of the news about the candidates change in real time as the election progresses. Moreover, although the focuses of the various information sources overlap, they also differ, indicating that the features stream from multiple sources.
Motivated by the aforementioned observations, we propose a more effective online feature selection framework for multi-source streaming features. Based on this framework, we developed a novel algorithm for multi-source streaming feature selection (MSFS). We address the following challenges: (1) mining features from multi-source streaming features with overlapping instances and features; (2) exploring the effects of varying data source scales, instance overlap ratios, and different arrival orders of streaming features; (3) outperforming rival methods, e.g., achieving higher prediction accuracy with fewer selected features.
The contributions and innovations of this study are summarized as follows: (1) We propose a novel online feature selection method, MSFS, for multi-source feature streaming. Unlike previous methods, the MSFS algorithm supports real-time dynamic changes of the feature space and allows each data source to have different features and instances. In addition, MSFS belongs to causal feature selection, which aims to mine the Markov blanket of the class attribute. The selected causal features reflect the causal mechanism related to the class attribute [20]. According to causal invariance in causal inference [20], MSFS can theoretically maintain strong prediction performance on multi-source data with different distributions. (2) We develop a mechanism to mine potential relationships among different data sources rather than considering each data source independently. In particular, we mine redundant features among data sources using overlapping instances. The overlapping instances span data sources, and on them the relationship between the features and the class attribute is the closest and most consistent. Therefore, using overlapping instances to filter redundant features can theoretically ensure the accuracy and efficiency of feature selection. (3) Compared with existing multi-source streaming feature selection methods, MSFS is suitable for various multi-source data scenarios, such as: (a) multi-source streaming data and multi-source fixed data, e.g., Sections 5.3 and 5.4; (b) multi-source streaming data with overlapping instances and missing features, e.g., Section 5.5.1; (c) multi-source streaming data with overlapping instances and overlapping features, e.g., Section 5.5.2.
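To make the role of overlapping instances concrete, the following minimal sketch aligns two features from different sources on the instances they share and scores their dependence with mutual information. It is an illustration under stated assumptions, not the MSFS redundancy test itself: the helper names `overlap_ids` and `mutual_information`, the discrete toy values, and the use of plain (unconditional) mutual information are all assumptions made for this example.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def overlap_ids(src_a: Dict[int, Hashable], src_b: Dict[int, Hashable]) -> List[int]:
    """Return the instance IDs present in both sources, in sorted order."""
    return sorted(set(src_a) & set(src_b))


def mutual_information(x: Sequence[Hashable], y: Sequence[Hashable]) -> float:
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("x and y must be non-empty and of equal length")
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    mi = 0.0
    for (a, b), count in pxy.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts.
        mi += (count / n) * math.log2(count * n / (px[a] * py[b]))
    return mi


# Hypothetical features, one per source, keyed by instance (patient) ID.
feature_a = {1: 0, 2: 1, 3: 0, 4: 1, 5: 1}   # from source A
feature_b = {3: 0, 4: 1, 5: 1, 6: 0, 7: 1}   # from source B

shared = overlap_ids(feature_a, feature_b)
score = mutual_information(
    [feature_a[i] for i in shared],
    [feature_b[i] for i in shared],
)
```

On the shared instances the two toy features are identical, so the score equals the entropy of the feature; a score near zero would instead suggest that the two features carry unrelated information.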
Extensive comparative experiments against state-of-the-art online algorithms and a multi-source offline algorithm, on benchmark datasets and in two real-world application scenarios, were conducted to evaluate the effectiveness and efficiency of the MSFS algorithm. Moreover, we discuss the impact of data source size and instance overlap ratio on the performance of the algorithm. The experiments demonstrated that the MSFS algorithm acquires an approximate Markov blanket with higher prediction accuracy and fewer features than existing algorithms.
The remainder of the paper is organized as follows. Section 2 reviews related work; Section 3 introduces the MSFS framework for online feature selection of multi-source streaming features; Section 4 presents our proposed algorithm; Section 5 reports the experimental results and discussion; and Section 6 presents the conclusions.
Related work
From the perspective of selection strategy, feature selection methods include wrapper methods, filter methods, and embedded methods [12]. Wrapper methods evaluate the selected features by searching through subsets of features. Filter methods are independent of any learning algorithm and rely on intrinsic properties of the features, measured via univariate statistics. Embedded methods are a trade-off between wrapper and filter methods by embedding feature selection into the model
Notations and definitions
In this section, we formally define multi-source streaming features and discuss their relevant properties. Table 1 summarizes the notations used in this paper and their mathematical meanings. Definition 1 (Multi-Source Streaming Features) Multi-source streaming features are feature vectors that come from multiple data sources and arrive one by one over time. Here, “multi-source” indicates that the streaming features come from multiple sources, each with a fixed number of training instances.
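Definition 1 can be pictured as a small data structure: each source holds a fixed list of instances, and feature vectors are appended to a source one at a time as they arrive. The sketch below is only an illustration of that definition; the class `DataSource`, the function `stream`, and the sample values are hypothetical names and data, not part of the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Iterator, List, Tuple


@dataclass
class DataSource:
    name: str
    instance_ids: List[int]  # fixed for the lifetime of the source
    features: Dict[str, List[float]] = field(default_factory=dict)

    def add_feature(self, feature_name: str, values: List[float]) -> None:
        """Store a newly arrived feature vector, one value per fixed instance."""
        if len(values) != len(self.instance_ids):
            raise ValueError("a feature needs exactly one value per instance")
        self.features[feature_name] = list(values)


def stream(
    arrivals: Iterable[Tuple[str, str, List[float]]],
    sources: Dict[str, DataSource],
) -> Iterator[Tuple[str, str]]:
    """Deliver features one by one, each to the source it comes from."""
    for source_name, feature_name, values in arrivals:
        sources[source_name].add_feature(feature_name, values)
        yield source_name, feature_name


sources = {
    "hospital_a": DataSource("hospital_a", [101, 102, 103]),
    "hospital_b": DataSource("hospital_b", [102, 104]),
}
arrivals = [
    ("hospital_a", "blood_wbc", [6.1, 7.4, 5.9]),
    ("hospital_b", "mri_lesion", [0.0, 1.0]),
    ("hospital_a", "urine_ph", [6.0, 6.5, 7.0]),
]
arrival_log = list(stream(arrivals, sources))
shared_instances = set(sources["hospital_a"].instance_ids) & set(sources["hospital_b"].instance_ids)
```

In this toy setup the instance with ID 102 appears in both sources, which is the kind of overlapping instance the paper relies on.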
The following operations
The MSFS algorithm
In the previous section, we presented the MSFS framework. Based on this framework, we developed our MSFS algorithm, whose pseudo-code is given below, where S is the multi-source, C is the class attribute, f is a new feature from a random data source, CFS is the candidate feature set at the current time, and SF is the final result of the feature selection (see Table 1 for details). The MSFS algorithm comprises three phases: (1) relevance analysis, (2) intra-source redundancy analysis, and (3) inter-source
Experiments
The experiments comprise five parts. First, we introduce the experimental setup. Then, we discuss the effect of three parameters on MSFS in Section 5.2. Furthermore, in Section 5.3 we compare MSFS with three state-of-the-art online algorithms combined via union (∪) and intersection (∩), which were originally applicable only to a single dataset. Meanwhile, we compare MSFS with a multi-source causal feature selection (MCFS) algorithm in Section 5.4. Finally, we present applications to real-world scenarios in Section 5.5.
Conclusion
We proposed a novel online algorithm, MSFS, to address multi-source streaming features with overlapping instances and features. It employs conditional independence and mutual information to filter irrelevant and redundant features in a timely manner through three phases, i.e., relevance, intra-source redundancy, and inter-source redundancy analyses. We discussed the influence of S and RO on the performance of the MSFS algorithm. Our empirical study demonstrated that: (1) As the
CRediT authorship contribution statement
Dianlong You: Conceptualization, Methodology, Formal analysis, Data curation, Writing – original draft, Writing – review & editing, Validation, Investigation, Funding acquisition. Miaomiao Sun: Conceptualization, Methodology, Formal analysis, Data curation, Writing – original draft, Writing – review & editing. Shunpan Liang: Funding acquisition. Ruiqi Li: Software, Formal analysis. Yang Wang: Formal analysis, Data curation. Jiawei Xiao: Resources, Formal analysis. Fuyong Yuan: Investigation.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
This work was supported by the National Natural Science Foundation of China under Grant No. 61772450 and the Natural Science Foundation of Hebei Province under Grants No. F2021203038 and No. G2021203010.
References (47)
- et al. Heterogeneous ensemble selection for evolving data streams. Pattern Recogn. (2021)
- et al. A review of machine learning in hypertension detection and blood pressure estimation based on clinical and physiological data. Biomed. Signal Process. Control (2021)
- et al. Feature selection in machine learning: A new perspective. Neurocomputing (2018)
- et al. Feature selection with multi-view data: A survey. Inf. Fusion (2019)
- et al. Ensemble prediction-based dynamic robust multi-objective optimization methods. Swarm Evol. Comput. (2019)
- et al. Feature selection with kernelized multi-class support vector machine. Pattern Recogn. (2021)
- et al. Towards efficient and effective discovery of Markov blankets for feature selection. Inf. Sci. (2020)
- et al. Streaming feature-based causal structure learning algorithm with symmetrical uncertainty. Inf. Sci. (2018)
- et al. LOFS: a library of online streaming feature selection. Knowl.-Based Syst. (2016)
- et al. Toward mining capricious data streams: A generative approach. IEEE Trans. Neural Networks Learn. Syst. (2020)
- Online feature selection with capricious streaming features: A general framework, in: IEEE International Conference on Big Data (Big Data)
- Topicsketch: Real-time bursty topic detection from Twitter. IEEE Trans. Knowl. Data Eng.
- Nonparametric Bayesian prior inducing deep network for automatic detection of cognitive status. IEEE Trans. Cybern.
- Scalable gamma-driven multilayer network for brain workload detection through functional near-infrared spectroscopy. IEEE Trans. Cybern.
- Evolutionary study of web spam: Webb spam corpus 2011 versus Webb spam corpus 2006
- A survey on online feature selection with streaming features. Front. Comput. Sci.
- Online feature selection and its applications. IEEE Trans. Knowl. Data Eng.
- A latent factor analysis-based approach to online sparse streaming feature selection. IEEE Trans. Syst. Man Cybern.: Syst.
- Online feature selection using grafting
- Online feature selection with streaming features. IEEE Trans. Pattern Anal. Mach. Intell.
Dianlong You received the Ph.D. degree in computer application technology from Yanshan University, Qinhuangdao, Hebei, China, in 2014. From August 2017 to August 2018, he was a visiting scholar with the School of Computing and Informatics, University of Louisiana at Lafayette, Lafayette, LA, USA. His current research interests include machine learning, streaming feature selection, and causal discovery.
Miaomiao Sun is currently a master's student in the School of Information Science and Engineering, Yanshan University, Qinhuangdao, Hebei. Her current research interests are focused on streaming feature selection and causal discovery.
Shunpan Liang received the Ph.D. degree in mechanical and electronic engineering from Yanshan University, Qinhuangdao, Hebei, China, in 2013. His main research interests include recommendation systems and machine learning.
Ruiqi Li is currently a master's student in the School of Information Science and Engineering, Yanshan University, Qinhuangdao, Hebei. Her current research interests include streaming feature selection and causal discovery.
Yang Wang is currently a master's student in the School of Information Science and Engineering, Yanshan University, Qinhuangdao, Hebei. Her current research interests include streaming feature selection and causal discovery.
Jiawei Xiao is currently a master's student in the School of Information Science and Engineering, Yanshan University, Qinhuangdao, Hebei. His current research interests are focused on streaming feature selection and causal discovery.
Fuyong Yuan is a master's supervisor. His current research interests include recommendation systems, machine learning, and data mining.
Limin Shen received his B.S. and Ph.D. degrees in Computer Science and Technology from Yanshan University, China. He is a professor and Ph.D. supervisor in the College of Computer Science and Engineering, Yanshan University, China. His main research interests include service computing, collaborative computing, and cooperative defense. Dr. Shen is a member of IEEE.
Xindong Wu received his Ph.D. degree in artificial intelligence from the University of Edinburgh, Edinburgh, U.K. He is Chief Scientist at Mininglamp Technology, China, and a Yangtze River Scholar with the Hefei University of Technology, Hefei, China. His current research interests include data mining, knowledge-based systems, and web information exploration. He is the Steering Committee Chair of the IEEE International Conference on Data Mining (ICDM). He is the editor-in-chief of Knowledge and Information Systems (KAIS) and ACM Transactions on Knowledge Discovery from Data (TKDD). He is a Fellow of IEEE and the AAAS.