Online early terminated streaming feature selection based on Rough Set theory

https://doi.org/10.1016/j.asoc.2021.107993

Highlights

  • We are the first to raise the issue of how to terminate online streaming feature selection early.

  • Propose a novel early terminated online streaming feature selection framework.

  • Extensive experiments demonstrate the effectiveness of our new framework.

Abstract

Feature selection is a vital dimensionality reduction technique for machine learning and data mining that aims to select a minimal subset from the original feature space. Traditional feature selection methods assume that all features can be acquired before learning, whereas in some real-world applications features arrive as a stream. Online streaming feature selection was therefore proposed to handle streaming features on the fly. When the feature dimension is extraordinarily high or even infinite, waiting for all the streaming features to arrive is time-consuming or impractical. Motivated by this, we study, for the first time, whether online streaming feature selection can be terminated early for efficiency while maintaining satisfactory performance. Specifically, we first formally define the problem of online early terminated streaming feature selection and summarize two properties that the early terminated mapping function should satisfy. We then choose the dependency degree function in Rough Set theory as our early terminated mapping function and demonstrate that it satisfies the two properties. Based on this, we propose a novel Early Terminated Online Streaming Feature Selection framework, named OSFS-ET, which can terminate the selection before the feature stream ends while guaranteeing competitive performance with the currently selected features. Extensive experiments on twelve real-world datasets demonstrate that OSFS-ET can be far faster than state-of-the-art streaming feature selection methods while maintaining excellent predictive accuracy.

Section snippets

Code metadata

Permanent link to reproducible Capsule: https://codeocean.com/capsule/8154265/tree/v1.

Related work

Feature selection aims to select a minimal subset from the original feature space and is essential to speed up learning and improve concept quality [1]. According to different data types, we can divide feature selection into two categories: traditional feature selection for static data and online feature selection for stream data [5].

The proposed framework

This section first defines online streaming feature selection and early terminated online streaming feature selection. After an in-depth analysis of the reasons for early termination, we point out two properties that the mapping function should satisfy in order to terminate the selection before the stream ends while maintaining competitive performance. Then, we introduce the dependency degree in Rough Set theory and demonstrate that it satisfies these two early terminated properties. After that, we propose our
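To make the dependency degree concrete, the following is a minimal sketch of the classical (Pawlak) definition for discrete data, γ_B(D) = |POS_B(D)| / |U|, where the positive region collects the samples whose equivalence class under the feature subset B carries a single decision label. The function name `dependency_degree`, its arguments, and the toy data are illustrative assumptions; this excerpt does not state which rough set variant or implementation OSFS-ET uses.

```python
from collections import defaultdict


def dependency_degree(rows, labels, feature_idx):
    """Classical Pawlak dependency degree: |POS_B(D)| / |U|.

    rows: list of tuples of discrete feature values (hypothetical layout).
    labels: decision label of each row.
    feature_idx: indices of the features forming the subset B.
    """
    if not rows:
        return 0.0
    # Group samples into equivalence classes by their values on B.
    classes = defaultdict(list)
    for row, label in zip(rows, labels):
        key = tuple(row[i] for i in feature_idx)
        classes[key].append(label)
    # A class lies in the positive region when all of its labels agree.
    positive = sum(len(ls) for ls in classes.values() if len(set(ls)) == 1)
    return positive / len(rows)


# Toy data: four samples, two binary features.
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = ["a", "b", "b", "b"]
g0 = dependency_degree(rows, labels, [0])      # only samples 3 and 4 are consistent
g01 = dependency_degree(rows, labels, [0, 1])  # every sample is its own class
```

In this classical form the value always lies in [0, 1] and cannot decrease when a feature is added to B, because adding a feature only refines the equivalence classes.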

Datasets

In this section, we apply the proposed OSFS-ET and its competing algorithms to twelve real-world high-dimensional datasets [40], [41], as shown in Table 2.

Evaluation metrics

We use two basic classifiers, KNN (k = 9) and SVM (with a linear kernel) in Matlab R2017a, to evaluate a selected feature subset in our experiments. We perform 5-fold cross-validation
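The evaluation protocol can be illustrated with a small pure-Python stand-in, assuming Euclidean distance and folds assigned by sample index modulo the number of folds. The experiments themselves were run in Matlab R2017a, so the helper names `knn_predict` and `cross_val_accuracy`, the fold assignment, and the toy data below are assumptions for illustration only; the SVM classifier is not covered by this sketch.

```python
from collections import Counter


def knn_predict(train_x, train_y, query, k=9):
    """Predict the majority label among the k nearest training points."""
    order = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)),
    )
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def cross_val_accuracy(x, y, selected, k=9, folds=5):
    """Mean KNN accuracy over the folds, using only the selected columns."""
    xs = [[row[j] for j in selected] for row in x]
    accuracies = []
    for f in range(folds):
        # Sample i is held out in fold i % folds (an assumed split rule).
        test_idx = [i for i in range(len(xs)) if i % folds == f]
        train_idx = [i for i in range(len(xs)) if i % folds != f]
        tx = [xs[i] for i in train_idx]
        ty = [y[i] for i in train_idx]
        hits = sum(knn_predict(tx, ty, xs[i], k) == y[i] for i in test_idx)
        accuracies.append(hits / len(test_idx))
    return sum(accuracies) / folds


# Toy data: column 0 separates the two classes, column 1 is unrelated noise.
x = [[(i % 2) * 10 + (i % 5) * 0.1, (i * 7) % 11] for i in range(50)]
y = [i % 2 for i in range(50)]
acc_informative = cross_val_accuracy(x, y, [0])
acc_noise = cross_val_accuracy(x, y, [1])
```

Comparing `acc_informative` with `acc_noise` shows how the accuracy of a fixed classifier serves as a proxy for the quality of a selected feature subset.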

Conclusion

In this paper, we study, for the first time, how to terminate online streaming feature selection early while maintaining satisfactory performance. We assume that online streaming feature selection can be terminated early if the expected increase of the mapping function is much lower than the time cost of the features that are yet to arrive. Based on this, we first present a formal definition of this issue and summarize two properties that the
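The early termination idea can be sketched with a simplified stopping rule: if the mapping function is non-decreasing and bounded above, no further gain is possible once it reaches its upper bound, so the remaining stream need not be read. This is only an illustration of the assumption stated above, not the OSFS-ET criterion, whose details are not given in this excerpt. The function `select_with_early_stop`, the `tol` parameter, and the toy coverage score (a stand-in for the dependency degree) are all hypothetical.

```python
def select_with_early_stop(stream, score, upper_bound=1.0, tol=1e-9):
    """Greedy streaming selection that stops once no gain remains.

    stream: iterable yielding candidate features one at a time.
    score: non-decreasing mapping function over a list of features,
           bounded above by upper_bound (assumed properties).
    Returns (selected features, final score, number of features read).
    """
    selected = []
    current = score(selected)
    seen = 0
    for feature in stream:
        seen += 1
        candidate = score(selected + [feature])
        # Keep the feature only if it strictly improves the score.
        if candidate > current:
            selected.append(feature)
            current = candidate
        # Terminate early: the remaining possible increase is negligible.
        if upper_bound - current <= tol:
            break
    return selected, current, seen


# Toy mapping function: fraction of 4 samples covered by the chosen features.
def coverage(features, n_samples=4):
    return len(set().union(*features)) / n_samples


stream = iter([{0, 1}, {1}, {2, 3}, {0, 3}, {2}])
selected, final_score, seen = select_with_early_stop(stream, coverage)
```

Here the loop stops after the third feature because the score has already reached its maximum; the last two features are never drawn from the iterator.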

CRediT authorship contribution statement

Peng Zhou: Conceptualization, Methodology, Software, Writing – original draft, Funding acquisition. Peipei Li: Validation, Investigation, Writing – review & editing, Funding acquisition. Shu Zhao: Formal analysis, Project administration, Funding acquisition. Yanping Zhang: Project administration.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work is supported in part by the National Natural Science Foundation of China under grants (61906056, 61976077, 61876001).

References (43)

  • Zhang Q. et al. A survey on rough set theory and its applications. CAAI Trans. Intell. Technol. (2016)
  • Rahmaninia M. et al. OSFSMI: Online stream feature selection method based on mutual information. Appl. Soft Comput. (2018)
  • An S. et al. Probability granular distance-based fuzzy rough set model. Appl. Soft Comput. (2021)
  • Yu K. et al. LOFS: Library of online streaming feature selection. Knowl.-Based Syst. (2016)
  • Liu H. et al. Computational Methods of Feature Selection. (2007)
  • Guyon I. et al. An introduction to variable and feature selection. J. Mach. Learn. Res. (2003)
  • Li J. et al. Feature selection: A data perspective. ACM Comput. Surv. (2017)
  • Li Y. et al. Recent advances in feature selection and its applications. Knowl. Inf. Syst. (2017)
  • Ding W. et al. Subkilometer crater discovery with boosting and transfer learning. ACM Trans. Intell. Syst. Technol. (TIST) (2011)
  • Wang M. et al. Multimodal graph-based reranking for web image search. IEEE Trans. Image Process. (2012)
  • Wu X. et al. Data mining with big data. IEEE Trans. Knowl. Data Eng. (2014)

The code (and data) in this article has been certified as Reproducible by Code Ocean (https://codeocean.com/). More information on the Reproducibility Badge Initiative is available at https://www.elsevier.com/physical-sciences-and-engineering/computer-science/journals.
