Elsevier

Information Sciences

Volume 590, April 2022, Pages 267-295
Online feature selection for multi-source streaming features

https://doi.org/10.1016/j.ins.2022.01.008

Abstract

Multi-source streaming feature selection in an online manner has attracted considerable attention from researchers because it can reduce the dimensionality of heterogeneous big data. However, traditional online algorithms such as Alpha-investing, Online Streaming Feature Selection (OSFS), Online Group Feature Selection (OGFS), and Scalable and Accurate OnLine Approach (SAOLA) consider only a single data source with fixed instances and are not directly applicable to multi-source data. Multi-source Causal Feature Selection (MCFS) can search for an invariant set in multiple interventional datasets. However, it is restricted to fixed feature spaces and requires exactly the same features in every data source. To overcome these limitations, we propose a novel method, Multi-source Streaming Feature Selection (MSFS), to tackle the feature selection problem for multi-source streaming features. The MSFS algorithm processes each new feature from a random source in three phases: relevance, intra-source redundancy, and inter-source redundancy analyses. That is, MSFS attempts to mine the potential relationships among different data sources rather than considering each data source independently. In particular, each new feature is analyzed online using the overlapping instances from all data sources, and the Markov blanket (MB) of the target variable is dynamically adjusted. To evaluate the performance of the MSFS algorithm, we compare it with the abovementioned algorithms on 14 datasets and two real-world scenarios. The results demonstrate that MSFS outperforms the existing algorithms in classification accuracy and number of selected features.

Introduction

In the era of big data, feature spaces are usually high dimensional and continuously varying [1], [2], [3], [4], [5]. Candidate features arrive one by one over time in a streaming manner and continuously accumulate. Examples include the timely updates of hot topics on Twitter [6], dynamic analysis of clinical and physiological data [7], feature mining of brain cognitive processes [8], [9], real-time filtering of spam [10], and dynamic recommendation of Web Services [1], [2]. The traditional off-line batch mode is unsuitable for these streaming feature scenarios and selects many redundant features. Hence, features must be selected in a timely manner with respect to the class attribute. Recently, Online Feature Selection with Streaming Features (OSFSF) has attracted considerable attention. It is an effective approach for solving the curse of dimensionality [11], [12], [13], [14]. OSFSF selects a relevant feature subset from arrived features and maintains an optimal mode of the current phase. Representative algorithms of online streaming feature selection include Grafting [15], Alpha-investing [16], OSFS [17], OGFS [18], and SAOLA [19]. However, these algorithms are designed for feature selection from only a single dataset, and they can be applied to multi-source data only through pooling the multiple datasets together and using the union (∪) or intersection (∩) of the features selected in each data source. The former often leads to unreliable results because of the inconsistent distribution of multi-source data [20]. Although MCFS [20] can search for an invariant set as selected features, it is restricted to fixed feature spaces and requires the same features in all of the interventional datasets. The abovementioned algorithms ignore a common scenario: in many learning tasks, data sources may contain both different features and overlapping features. Furthermore, certain instances may span multiple data sources [21].
In the following scenarios, the selected features often have multiple sources with overlapping instances and features [22], [23].

Scenario 1: In disease diagnosis and pathological analysis, fixed patients are often selected as observation samples, and each patient is considered an instance. As shown in Fig. 1, the white spaces represent nonexistent instances. Examination items, such as blood routine, urine routine, and stool routine, are used as features, where urine and stool routines come from the same data source. The examination data of patients may come from multiple hospitals or from different departments of one hospital. For any examination, such as B-ultrasound, X-ray, MRI, or ECG, a patient’s examination items are consistent, and no features are missing. Different patient instances have both overlapping and unique items. Furthermore, features continue to accumulate as the patient undergoes further examinations. Therefore, this is a typical multi-source feature stream with overlapping instances and features.

Scenario 2: A similar situation occurs in an ecosystem monitored by several sensors, where the data returned from each sensor corresponds to a feature. An object observed by multiple sensors, such as plankton or an oil slick in seawater, is called an instance. Because each sensor has a limited lifespan, worn-out sensors must be replaced by new ones. Therefore, features corresponding to previous sensors vanish, and features corresponding to the current sensors appear [24].

Scenario 3: Important international news, such as the US presidential election, may be reported by different media, including CNN, FOX, Reuters, Facebook, and Twitter. The candidates’ news is fixed as instances in a specific period. The focal points (i.e., features) of the news about the candidates change in real time as the election progresses. Moreover, although the focal points of the various information sources overlap, they also differ, which indicates that the features are streamed from multiple sources.

Motivated by the aforementioned observations, we propose a more effective online feature selection framework for multi-source streaming features. Based on this framework, we developed a novel algorithm for multi-source streaming feature selection (MSFS). We address the following challenges: (1) mining features from multi-source streaming features with overlapping instances and features; (2) exploring the effects of varying data source scales, overlap ratios of instances, and different orders of streaming features; (3) obtaining better performance than rival methods, i.e., higher prediction accuracy and fewer selected features.

The contributions and innovations of this study are summarized as follows:

We propose a novel online feature selection method, MSFS, for multi-source streaming features. Unlike previous methods, the MSFS algorithm supports real-time dynamic changes of the feature space and allows each data source to have different features and instances. In addition, MSFS belongs to causal feature selection, which aims to mine the Markov blanket of the class attribute. The selected causal features imply the causal mechanism related to the class attribute [20]. According to causal invariance in causal inference [20], MSFS can theoretically maintain strong prediction performance on multi-source data with different distributions.

We develop a mechanism to mine potential relationships among different data sources rather than considering each data source independently. In particular, we mine redundant features among data sources using overlapping instances. The overlapping instances span data sources, and on them the relationship between features and the class attribute is the closest and most consistent. Therefore, using overlapping instances to filter redundant features can theoretically ensure the accuracy and efficiency of feature selection.

Compared with existing multi-source streaming feature selection methods, MSFS is suitable for various multi-source data scenarios, such as: a) multi-source streaming data and multi-source fixed data, e.g., Sections 5.3 and 5.4; b) multi-source streaming data with overlapping instances and missing features, e.g., Section 5.5.1; c) multi-source streaming data with overlapping instances and overlapping features, e.g., Section 5.5.2.

Extensive comparative experiments with state-of-the-art online algorithms and a multi-source offline algorithm on benchmark datasets and two real-world application scenarios were conducted to evaluate the effectiveness and efficiency of the MSFS algorithm. Moreover, we discuss the impact of data source size and instance overlap ratio on the performance of the algorithm. The experiments demonstrate that the MSFS algorithm acquires an approximate Markov blanket with higher prediction accuracy and fewer features than existing algorithms.

The remainder of the paper is organized as follows. Section 2 reviews related work. Section 3 introduces the MSFS framework for online feature selection of multi-source streaming features, and Section 4 presents our proposed algorithm. Section 5 reports the experimental results and discussion, and Section 6 presents the conclusions.

Section snippets

Related work

From the perspective of selection strategy, feature selection includes wrapper methods, filter methods, and embedded methods [12]. Wrapper methods evaluate the selected features by searching over subsets of features. Filter methods are independent of any learning algorithm and rely on intrinsic properties of the features measured via univariate statistics. Embedded methods are a trade-off between wrapper and filter methods by embedding feature selection into the model

Notations and definitions

In this section, we formally define multi-source streaming features and discuss their relevant properties. Table 1 summarizes the notations used in this paper and their mathematical meanings.

Definition 1

(Multi-Source Streaming Features) Multi-source streaming features are feature vectors that come from multiple data sources and flow in one by one over time. Here, “multi-source” indicates that the streaming features come from multiple sources, each with a fixed number of training instances.
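
As an illustration of Definition 1, the following is a minimal sketch of one possible in-memory representation. It assumes that instances carry identifiers that can be matched across sources; the class `DataSource`, the helper `overlapping_instances`, and the example hospital data are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class DataSource:
    """One data source: a fixed set of instance IDs plus the features seen so far."""
    name: str
    instance_ids: list
    features: dict = field(default_factory=dict)  # feature name -> {instance id: value}

    def add_feature(self, feature_name, values):
        """Receive one streaming feature; values are aligned with instance_ids."""
        if len(values) != len(self.instance_ids):
            raise ValueError("a feature must cover every instance of its source")
        self.features[feature_name] = dict(zip(self.instance_ids, values))


def overlapping_instances(sources):
    """Return the sorted instance IDs that appear in every source."""
    if not sources:
        return []
    shared = set(sources[0].instance_ids)
    for source in sources[1:]:
        shared &= set(source.instance_ids)
    return sorted(shared)


# Hypothetical example: two hospitals that share patients p3 and p4.
hospital_a = DataSource("hospital_a", ["p1", "p2", "p3", "p4"])
hospital_b = DataSource("hospital_b", ["p3", "p4", "p5"])

# Features arrive one at a time, each from one of the sources.
stream = [
    (hospital_a, "blood_routine", [1, 0, 1, 1]),
    (hospital_b, "ecg", [0, 1, 1]),
    (hospital_a, "urine_routine", [0, 0, 1, 0]),
]
for source, feature_name, values in stream:
    source.add_feature(feature_name, values)

shared_ids = overlapping_instances([hospital_a, hospital_b])
print(shared_ids)  # ['p3', 'p4']
```

The number of instances per source stays fixed while the feature dictionary grows, which mirrors the definition: the instance space is static and the feature space is streamed.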

The following operations

The MSFS algorithm

In the previous section, we presented the MSFS framework. Based on this framework, we developed the MSFS algorithm, whose pseudo-code is given below, where S denotes the multiple data sources, C is the class attribute, f is a new feature from a random data source, CFS is the candidate feature set at the current time, and SF is the final result of the feature selection (see Table 1 for details). The MSFS algorithm comprises three phases: (1) relevance analysis, (2) intra-source redundancy analysis, and (3) inter-source
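
The three phases named in this section (relevance, intra-source redundancy, and inter-source redundancy) can be illustrated with a simplified sketch. This is not the authors' pseudo-code: it assumes discrete features, replaces the paper's conditional-independence tests with a pairwise mutual-information rule, and never re-examines features that were already selected. All names (`mutual_information`, `is_redundant`, `process_feature`, `threshold`) and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) of two discrete sequences."""
    n = len(xs)
    if n == 0:
        return 0.0
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * log2(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())


def column(values, ids):
    """Read a feature (dict: instance id -> value) on the given instances."""
    return [values[i] for i in ids]


def is_redundant(new, old, labels, ids):
    """Assumed pairwise rule: `old` is at least as informative about the class
    as `new`, and shares at least that much information with `new`."""
    y = column(labels, ids)
    mi_new = mutual_information(column(new, ids), y)
    mi_old = mutual_information(column(old, ids), y)
    mi_pair = mutual_information(column(new, ids), column(old, ids))
    return mi_old >= mi_new and mi_pair >= mi_new


def process_feature(source, name, values, labels, selected, shared_ids,
                    threshold=0.05):
    """Run the three phases on one arriving feature; return True if kept."""
    own_ids = sorted(values)
    # Phase 1: relevance to the class attribute on the source's own instances.
    if mutual_information(column(values, own_ids),
                          column(labels, own_ids)) <= threshold:
        return False
    # Phase 2: redundancy against selected features of the same source.
    for old_source, _, old_values in selected:
        if old_source == source and is_redundant(values, old_values, labels, own_ids):
            return False
    # Phase 3: redundancy against other sources, on overlapping instances only.
    for old_source, _, old_values in selected:
        if old_source != source and is_redundant(values, old_values, labels, shared_ids):
            return False
    selected.append((source, name, values))
    return True


# Toy data: source A covers instances 1-6, source B covers 3-8.
labels = {1: 0, 2: 1, 3: 0, 4: 1, 5: 0, 6: 1, 7: 0, 8: 1}
ids_a, ids_b = [1, 2, 3, 4, 5, 6], [3, 4, 5, 6, 7, 8]
shared_ids = sorted(set(ids_a) & set(ids_b))

stream = [
    ("A", "a1", dict(zip(ids_a, [0, 1, 0, 1, 1, 1]))),  # relevant
    ("A", "a2", dict(zip(ids_a, [0, 0, 0, 0, 0, 0]))),  # constant, irrelevant
    ("A", "a3", dict(zip(ids_a, [0, 1, 0, 1, 1, 1]))),  # duplicate of a1
    ("B", "b1", dict(zip(ids_b, [0, 1, 1, 1, 0, 1]))),  # equals a1 on shared ids
    ("B", "b2", dict(zip(ids_b, [0, 1, 0, 1, 0, 1]))),  # more informative than a1
]
selected = []
decisions = [process_feature(s, n, v, labels, selected, shared_ids)
             for s, n, v in stream]
print(decisions)                          # [True, False, False, False, True]
print([(s, n) for s, n, _ in selected])   # [('A', 'a1'), ('B', 'b2')]
```

In this toy run, `a2` fails the relevance phase, `a3` is rejected within its own source, and `b1` is rejected across sources because it carries the same information as `a1` on the overlapping instances. A fuller treatment would also revisit `a1` once `b2` arrives, which this sketch deliberately omits.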

Experiments

The experiments include five parts. First, we introduce the experimental setup. Then, we discuss the effect of three parameters on MSFS in Section 5.2. Furthermore, in Section 5.3 we compare MSFS with three state-of-the-art online algorithms, which were originally applicable only to a single dataset, combined with ∪ and ∩. Meanwhile, we compare MSFS with a multi-source causal feature selection (MCFS) algorithm in Section 5.4. Finally, we present the application to real-world scenarios in Section 5.5.
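
The Introduction notes that single-dataset algorithms can be carried over to multi-source data by taking the union or intersection of the features selected in each source. A minimal sketch of that combination step follows; the function name `combine_selections` and the per-source feature sets are hypothetical, and the per-source selector itself is assumed to have already run.

```python
def combine_selections(per_source_selected, strategy):
    """Merge per-source feature selections by set union or intersection."""
    sets = [set(features) for features in per_source_selected]
    if not sets:
        return set()
    if strategy == "union":
        return set.union(*sets)
    if strategy == "intersection":
        return set.intersection(*sets)
    raise ValueError(f"unknown strategy: {strategy!r}")


# Hypothetical output of a single-source selector run on three sources.
per_source_selected = [
    {"f1", "f2", "f3"},
    {"f2", "f3", "f5"},
    {"f3", "f4"},
]

union_result = sorted(combine_selections(per_source_selected, "union"))
intersection_result = sorted(combine_selections(per_source_selected, "intersection"))
print(union_result)         # ['f1', 'f2', 'f3', 'f4', 'f5']
print(intersection_result)  # ['f3']
```

The union tends to keep many features, while the intersection can shrink to very few when sources share little, which is one reason a dedicated multi-source method is of interest.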

Conclusion

We proposed a novel online algorithm, MSFS, to address multi-source streaming features with overlapping instances and features. It employs conditional independence and mutual information to filter irrelevant and redundant features in a timely manner through three phases, i.e., relevance, intra-source redundancy, and inter-source redundancy analyses. We discussed the influence of OVRI, |S|, and RO on the performance of the MSFS algorithm. Our empirical study demonstrated that: (1) As the

CRediT authorship contribution statement

Dianlong You: Conceptualization, Methodology, Formal analysis, Data curation, Writing – original draft, Writing – review & editing, Validation, Investigation, Funding acquisition. Miaomiao Sun: Conceptualization, Methodology, Formal analysis, Data curation, Writing – original draft, Writing – review & editing. Shunpan Liang: Funding acquisition. Ruiqi Li: Software, Formal analysis. Yang Wang: Formal analysis, Data curation. Jiawei Xiao: Resources, Formal analysis. Fuyong Yuan: Investigation.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grant No. 61772450 and by the Natural Science Foundation of Hebei Province under Grants No. F2021203038 and No. G2021203010.

Dianlong You received the Ph.D. degree in computer application technology from Yanshan University, Qinhuangdao, Hebei, China, in 2014. From August 2017 to August 2018, he was a visiting scholar with the School of Computing and Informatics, University of Louisiana at Lafayette, Lafayette, LA, USA. His current research interests include machine learning, streaming feature selection, and causal discovery.

References (47)

  • Y. He, X. Yuan, S. Chen, X. Wu, Online learning in variable feature spaces under incomplete supervision, in:...
  • D. Wu et al., Online feature selection with capricious streaming features: A general framework, in: IEEE International Conference on Big Data (Big Data) (2019)
  • Y. He, B. Wu, D. Wu, X. Wu, On partial multi-task learning, in: the 24th European Conference on Artificial...
  • W. Xie et al., Topicsketch: Real-time bursty topic detection from twitter, IEEE Trans. Knowl. Data Eng. (2016)
  • E.Q. Wu et al., Nonparametric bayesian prior inducing deep network for automatic detection of cognitive status, IEEE Trans. Cybern. (2021)
  • E.Q. Wu et al., Scalable gamma-driven multilayer network for brain workload detection through functional near-infrared spectroscopy, IEEE Trans. Cybern. (2021)
  • D. Wang et al., Evolutionary study of web spam: Webb spam corpus 2011 versus webb spam corpus 2006
  • X. Hu et al., A survey on online feature selection with streaming features, Front. Comput. Sci. (2018)
  • J. Wang et al., Online feature selection and its applications, IEEE Trans. Knowl. Data Eng. (2013)
  • D. Wu et al., A latent factor analysis-based approach to online sparse streaming feature selection, IEEE Trans. Syst. Man Cybern.: Syst. (2021)
  • S. Perkins et al., Online feature selection using grafting
  • J. Zhou, D. Foster, R. Stine, L. Ungar, Streaming feature selection using alpha-investing, in: Proceedings of the...
  • X. Wu et al., Online feature selection with streaming features, IEEE Trans. Pattern Anal. Mach. Intell. (2012)
    Miaomiao Sun is currently a master's student in the School of Information Science and Engineering, Yanshan University, Qinhuangdao, Hebei. Her current research interests are focused on streaming feature selection and causal discovery.

    Shunpan Liang received the Ph.D. degree in mechanical and electronic engineering from Yanshan University, Qinhuangdao, Hebei, China, in 2013. His main research interests include recommendation systems and machine learning.

    Ruiqi Li is currently a master's student in the School of Information Science and Engineering, Yanshan University, Qinhuangdao, Hebei. Her current research interests include streaming feature selection and causal discovery.

    Yang Wang is currently a master's student in the School of Information Science and Engineering, Yanshan University, Qinhuangdao, Hebei. Her current research interests include streaming feature selection and causal discovery.

    Jiawei Xiao is currently a master's student in the School of Information Science and Engineering, Yanshan University, Qinhuangdao, Hebei. His current research interests are focused on streaming feature selection and causal discovery.

    Fuyong Yuan is a master's supervisor. His current research interests include recommendation systems, machine learning, and data mining.

    Limin Shen received his B.S. and Ph.D. degrees in Computer Science and Technology from Yanshan University, China. He is a professor and Ph.D. supervisor in the College of Computer Science and Engineering, Yanshan University, China. His main research interests include service computing, collaborative computing, and cooperative defense. Dr. Shen is a member of IEEE.

    Xindong Wu received his Ph.D. degree in artificial intelligence from the University of Edinburgh, Edinburgh, U.K. He is Chief Scientist at Mininglamp Technology, China, and a Yangtze River Scholar with the Hefei University of Technology, Hefei, China. His current research interests include data mining, knowledge-based systems, and web information exploration. He is the Steering Committee Chair of the IEEE International Conference on Data Mining (ICDM). He is the editor-in-chief of Knowledge and Information Systems (KAIS) and ACM Transactions on Knowledge Discovery from Data (TKDD). He is a Fellow of IEEE and the AAAS.
