Information Sciences (Elsevier)

Volume 610, September 2022, Pages 1102-1121

Parallel incremental efficient attribute reduction algorithm based on attribute tree

https://doi.org/10.1016/j.ins.2022.08.044

Highlights

  • We introduce the mechanism of a binary tree and propose a parallel incremental acceleration strategy based on the attribute tree.

  • A branch threshold coefficient is incorporated into the calculation to let the algorithm exit the loop early, avoiding redundant calculations and reducing the number of attribute evaluations.

  • When multiple incremental objects are added to the decision system, the incremental mechanism can be used to update the reduction.

  • We combine IARAT and Spark parallel technology to parallelize data processing to accelerate the calculation process.

Abstract

Attribute reduction is an important application of rough sets. Efficiently reducing massive dynamic datasets has long been a major goal of researchers. Traditional incremental methods update reductions through updated approximations; however, they must evaluate all attributes and repeatedly recalculate their importance, so applying them to large decision systems is inefficient due to their high time complexity. We propose an incremental acceleration strategy based on attribute trees to solve this problem. The key step is to cluster all attributes into multiple trees for incremental attribute evaluation. Specifically, we first select the appropriate attribute tree for attribute evaluation according to the attribute-tree correlation measure, reducing the time complexity. Next, a branch coefficient is added to the stopping criterion; it increases with the branch depth and causes the search to exit the loop once it reaches the maximum threshold, which avoids redundant calculation and improves efficiency. Building on these improvements, we propose an incremental attribute reduction algorithm based on attribute trees (IARAT). Finally, a Spark parallel mechanism is added to parallelize data processing, yielding a parallel incremental efficient attribute reduction algorithm based on the attribute tree (PIARAT). Experimental results on the Shuttle dataset show that our algorithm consumes more than 40% less time than the classical IARC algorithm while maintaining comparable classification performance. In addition, adding the Spark parallelization mechanism shortens the time by more than 87% from the benchmark.

Introduction

With the development of information technology, science and industry are generating an ever-growing amount of data that will soon exceed the capacity of our applications, and the demand for computing resources keeps rising. Modern information technology has made records and data collection fine-grained and multi-dimensional [1], [41]. Processing massive data and discovering knowledge in it have long been of great interest in data mining. "Big data" refers to a volume of data beyond the storage and processing capabilities of traditional systems; it typically holds huge potential value. However, huge, complex datasets often carry significant redundancy [6], [38] and challenge the limits of data storage, computational efficiency, and accuracy.

Rough set theory (RST) [24] is commonly used to analyze decision-making uncertainty. Rough sets partition uncertain knowledge concepts through indiscernibility relations and obtain equivalence-class sets under different equivalence relations, thereby establishing an approximation space. RST has been applied in image processing [7], cluster analysis [8], pattern recognition [40], machine learning [4], [15], feature selection [23], [39], decision support [10], [13], [43], and data mining [14], [44]. Attribute reduction is an important concept in rough set theory that has become a data preprocessing tool for improving accuracy and discovering potentially useful knowledge.
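The indiscernibility partition and the lower/upper approximations it induces can be sketched in a few lines of Python. This is a toy illustration under a simple value-keyed representation, not the paper's implementation:

```python
from collections import defaultdict

def partition(U, attrs):
    """Group objects into equivalence classes under IND(attrs):
    two objects are equivalent iff they agree on every attribute in attrs."""
    classes = defaultdict(list)
    for i, x in enumerate(U):
        classes[tuple(x[a] for a in attrs)].append(i)
    return list(classes.values())

def lower_upper(U, attrs, target):
    """Lower/upper approximation of a target set of object indices."""
    lower, upper = set(), set()
    for block in partition(U, attrs):
        b = set(block)
        if b <= target:       # block entirely inside the concept
            lower |= b
        if b & target:        # block overlaps the concept
            upper |= b
    return lower, upper

# Toy decision table: each row maps attribute name -> value.
U = [
    {"a1": 0, "a2": 1},
    {"a1": 0, "a2": 1},
    {"a1": 1, "a2": 0},
    {"a1": 1, "a2": 1},
]
X = {0, 2}  # target concept (object indices)
low, up = lower_upper(U, ["a1", "a2"], X)
# low = {2}: only the block {2} lies wholly in X;
# up = {0, 1, 2}: blocks {0, 1} and {2} both meet X.
```

Objects 0 and 1 are indiscernible on {a1, a2}, so they enter the upper approximation together even though only one of them belongs to X; this gap between the two approximations is exactly the uncertainty rough sets quantify.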

Classic attribute reduction algorithms, such as those based on positive regions [47], [30], information entropy [18], [19], and discernibility matrices [36], [28], load a small dataset into the main memory of a single machine at one time and therefore cannot process massive amounts of data. Much research has consequently focused on parallel attribute reduction using big data technologies such as Hadoop [20] and MapReduce [17]. MapReduce is an offline computing framework that abstracts an algorithm into the two stages of Map and Reduce; it suits data-intensive computing but is inefficient because of frequent disk I/O and many sorting operations. Streaming data is, in theory, an infinite sequence of records; Storm is a streaming data processing system that processes one record at a time and must reach millisecond-level latency. Tez is a computing framework that runs on YARN to support DAG jobs and generalizes MapReduce-style data processing; however, Tez cannot reuse intermediate results, so iterative computation incurs considerable redundant calculation. Spark is an improved distributed computing framework based on MapReduce that keeps data in memory as much as possible to reduce disk I/O, improving the computational efficiency of iterative and interactive applications.
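To make the two-stage abstraction concrete, here is word count written in the Map/Reduce style in plain Python — an illustration of the programming model only, not of Hadoop or Spark themselves:

```python
from itertools import groupby

def map_phase(records):
    """Map: emit a (key, value) pair per word occurrence."""
    for line in records:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Shuffle/sort brings equal keys together; Reduce folds each group."""
    ordered = sorted(pairs)  # stands in for the shuffle/sort stage
    return {k: sum(v for _, v in grp)
            for k, grp in groupby(ordered, key=lambda p: p[0])}

counts = reduce_phase(map_phase(["a b a", "b c"]))
# counts == {"a": 2, "b": 2, "c": 1}
```

In MapReduce every such round materializes its output to disk; Spark's advantage for iterative algorithms like attribute reduction is that the intermediate pairs can stay cached in memory between rounds.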

To address the limitations of serial algorithms, Zhang et al. [46] proposed a heuristic parallel attribute reduction algorithm (PLAR) that obtains the same attribute reduction as traditional methods on the Spark platform. Muhammad et al. [31] proposed a parallel attribute reduction algorithm that searches all positive regions in parallel, with computational efficiency 63% higher than that of classical algorithms; however, because the positive regions must be calculated repeatedly, its time consumption is still high. Qian et al. [27], [29] studied the attribute reduction process in the MapReduce framework and proposed a parallel knowledge reduction algorithm, showing that attribute reduction of massive data is feasible, although the frequent I/O operations of the MapReduce framework keep its efficiency low. Sowkuntla et al. [37] proposed MapReduce-based parallel/distributed approaches for attribute reduction in massive incomplete decision systems (IDS), which handle datasets that are large in both objects and attributes through different data-partitioning strategies; in practice, this approach works only for datasets with many objects and a moderate number of attributes. The shortcomings that stem from the inherent limitations of MapReduce and of traditional reduction algorithms remain to be addressed.

Many methods have been developed to solve the attribute reduction problem of static decision systems, i.e., non-incremental reduction algorithms [9], [5], [11]. However, decision systems may change over time: when the object set changes dynamically, obtaining a new reduction requires recomputing over the whole decision system, which is costly. Such algorithms are therefore inefficient on dynamic decision systems, and updating the reduction is the key issue in improving efficiency. Incremental learning can effectively improve the efficiency of knowledge discovery by reusing the valuable results already obtained on the legacy decision system, and many incremental algorithms have been proposed for dynamic data. Liu et al. [22] proposed an incremental approach that obtains reductions by constructing support, precision, and coverage matrices; however, matrix construction is unsuitable for large-scale attribute reduction. Jin et al. [12] proposed an incremental attribute reduction algorithm based on updated knowledge granularity, but the multiple equivalence-class divisions required on large datasets significantly increase its computation time. For dynamic ordered data, Sang et al. [33] studied incremental attribute reduction in the DRSA framework and proposed a matrix-based dominance-degree calculation to update the reduction, but updating the matrix on large-scale datasets takes substantial time. Shu et al. [35] proposed an incremental feature selection framework based on neighborhood entropy for dynamic mixed data with mixed-type features, yet processing large increments remains time-consuming. All of the above methods focus on updating reductions via updated approximations and are inefficient for reducing large-scale decision systems.
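The core idea these incremental methods share — when a batch of objects arrives, update only the affected equivalence classes instead of repartitioning the whole universe — can be sketched as follows. The value-keyed class map is an assumption for illustration; the paper's own update formulas operate on knowledge granularity:

```python
from collections import defaultdict

def initial_partition(U, attrs):
    """Map each attribute-value tuple to the indices of its equivalence class."""
    classes = defaultdict(list)
    for i, x in enumerate(U):
        classes[tuple(x[a] for a in attrs)].append(i)
    return classes

def add_batch(classes, U, new_objects, attrs):
    """Merge a batch of new objects into the existing classes.
    Only the classes the newcomers fall into are touched, so the cost
    is proportional to the batch size, not to |U|."""
    start = len(U)
    U.extend(new_objects)
    for offset, x in enumerate(new_objects):
        classes[tuple(x[a] for a in attrs)].append(start + offset)
    return classes

U = [{"a": 0}, {"a": 1}]
cls = initial_partition(U, ["a"])
add_batch(cls, U, [{"a": 0}, {"a": 2}], ["a"])
# (0,) -> [0, 2]   object 2 joins an existing class
# (2,) -> [3]      object 3 opens a new class
```

A new object either joins an existing class (coarsening nothing) or opens a new singleton class; an importance measure such as knowledge granularity can then be updated from these local changes alone.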

We examined the Spark parallelization method and analyzed current incremental reduction algorithms. The concept of the attribute group proposed by Chen et al. [2] is integrated with the mechanism of a binary tree, and the resulting novel incremental attribute reduction algorithm is combined with the Spark parallel framework. To demonstrate the effectiveness of the proposed algorithm, we evaluated its performance on multiple UCI datasets.

The main contributions of this paper are as follows:

  • (1)

    Starting from the concept of an attribute group, we introduce the mechanism of a binary tree and propose a parallel incremental acceleration strategy based on the attribute tree. All conditional attributes are clustered into multiple attribute trees, and a core-attribute search is performed on the trees with higher correlations through multiple rounds of branching. A branch threshold coefficient is incorporated into the calculation to let the algorithm exit the loop early, avoiding redundant calculations, reducing the number of attribute evaluations, and improving the efficiency and accuracy of attribute reduction.

  • (2)

    When multiple incremental objects are added to the decision system, the incremental mechanism can be used to update the reduction. Based on the above improvements, we analyze the case of multiple incremental objects and propose an incremental attribute reduction algorithm based on the attribute tree (IARAT).

  • (3)

    Considering the performance advantages of the Spark framework, we combine IARAT with Spark parallel technology to parallelize data processing and accelerate computation, yielding PIARAT, a parallel incremental efficient attribute reduction algorithm based on the attribute tree.
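The tree search and branch threshold of contribution (1) can be sketched as below. The round-robin grouping, the `gain` function, and the depth threshold are hypothetical placeholders for the paper's correlation measure and attribute-importance evaluation:

```python
def build_attribute_trees(attributes, relevance, k):
    """Cluster attributes into k groups ('trees'): rank by a relevance
    score and deal them out round-robin, so each tree starts with a
    strong attribute. (Hypothetical rule -- the paper's measure differs.)"""
    ranked = sorted(attributes, key=relevance.get, reverse=True)
    return [ranked[i::k] for i in range(k)]

def search_with_branch_threshold(trees, gain, max_depth):
    """Walk the most relevant tree first. The branch coefficient grows
    with each branch visited, and the loop is abandoned once it passes
    max_depth, so low-ranked branches are never evaluated at all."""
    selected, depth = [], 0
    for tree in trees:
        for a in tree:
            depth += 1                 # branch coefficient grows per branch
            if depth > max_depth:      # threshold reached: jump out of the loop
                return selected
            if gain(a) > 0:            # keep attributes that add information
                selected.append(a)
    return selected

rel = {"a1": 0.9, "a2": 0.1, "a3": 0.7, "a4": 0.3}
trees = build_attribute_trees(list(rel), rel, k=2)
picked = search_with_branch_threshold(trees, gain=lambda a: rel[a] - 0.5,
                                      max_depth=3)
# picked == ["a1", "a3"]: the search stops before ever evaluating a2
```

The saving comes from the early exit: with n attributes and threshold t, at most t attribute evaluations are performed per round instead of n.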

The rest of this paper is organized as follows. Section 2 reviews the basic concepts of rough sets, knowledge granularity, and the Apache Spark computing engine. Section 3 details PIARAT, and Section 4 evaluates the proposed algorithm on multiple UCI datasets. Conclusions are presented in Section 5.

Section snippets

Preliminaries

We review the concepts used in this article. Section 2.1 introduces rough sets, Section 2.2 reviews the representation of knowledge granularity and related incremental mechanisms, and Section 2.3 introduces the Spark parallel framework.

2.1. Rough set theory

An information system is a quadruple S = ⟨U, C ∪ D, V, f⟩, where U = {x1, x2, x3, ..., xN} is a finite nonempty object set; N is the number of samples in the universe; C = {a1, a2, a3, ..., an} is the nonempty finite set of all condition attributes; n is the number of

PIARAT

We integrate the concept of the attribute group with binary trees and combine the Spark parallel framework to design PIARAT. The original data must be preprocessed to meet the input requirements of the algorithm; redundant attributes are removed to obtain the core attributes. The resulting algorithm accelerates dynamic attribute reduction.
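As a rough sketch of this preprocessing step — exposing core attributes by removing redundant ones — the snippet below uses the standard knowledge-granularity criterion as an illustrative stand-in for the paper's Algorithm 1 (the definitions reviewed in Section 2.2):

```python
from collections import defaultdict

def gk(U, attrs):
    """Knowledge granularity GK(attrs) = sum(|X_i|^2) / |U|^2 over the
    partition U/attrs; smaller values mean a finer partition."""
    sizes = defaultdict(int)
    for x in U:
        sizes[tuple(x[a] for a in attrs)] += 1
    n = len(U)
    return sum(c * c for c in sizes.values()) / (n * n)

def core_attributes(U, C):
    """An attribute is core if dropping it strictly coarsens the partition,
    i.e. increases the granularity of the full condition set C.
    (Illustrative criterion; the paper evaluates C relative to D.)"""
    full = gk(U, C)
    return [a for a in C if gk(U, [c for c in C if c != a]) > full]

# 'c' is constant, hence redundant; 'a' and 'b' are both needed.
U = [
    {"a": 0, "b": 0, "c": 0},
    {"a": 0, "b": 1, "c": 0},
    {"a": 1, "b": 1, "c": 0},
]
core = core_attributes(U, ["a", "b", "c"])
# core == ["a", "b"]
```

Dropping the constant attribute c leaves the partition unchanged, so it is discarded during preprocessing, while a and b each discern objects no other attribute can.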

This section has three parts. Section 3.1 introduces the parallel preprocessing algorithm of raw data (Algorithm 1). IARAT is introduced in Section 3.2,

Experimental analysis

We conducted experiments to evaluate the effectiveness and efficiency of the proposed incremental algorithm in the case of object changes. The experimental platform was an Intel Core i5-10400H CPU at 2.30 GHz, 16 GB RAM, Windows 10 Home Chinese version, JetBrains PyCharm development tool, and the Python language. The simulation environment platform of MapReduce Hadoop-2.7.1 and Spark-3.0.0-preview was built on a Windows 10 system, and the cluster operation adopted the local mode. In this

Conclusions

We started with the concept of an attribute group, introduced the mechanism of a binary tree to find a reduction, and proposed IARAT. The branch coefficient was added to improve the efficiency and precision of attribute reduction. With Spark parallel technology to parallelize data processing and to consider the performance advantages of the Spark framework, a new method based on attribute trees was proposed: PIARAT. Experiments showed that IARAT is effective and efficient. This article focuses

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors would like to express their sincere appreciation to the editor and anonymous reviewers for their insightful comments, which greatly improved the quality of this paper.

This work is supported in part by the National Natural Science Foundation of China (61976120, 62006128, 62102199), the Natural Science Foundation of Jiangsu Province (BK20191445), the Natural Science Key Foundation of Jiangsu Education Department (21KJA510004), the General Program of the Natural Science Foundation of

References (47)

  • J.Y. Liang et al., An efficient rough feature selection algorithm with a multi-granulation view, Int. J. Approximate Reasoning (2012)
  • D. Liu et al., A rough set-based incremental approach for learning knowledge in dynamic incomplete information systems, Int. J. Approximate Reasoning (2014)
  • K.Y. Liu et al., Rough set based semi-supervised feature selection via ensemble selector, Knowl.-Based Syst. (2019)
  • Z. Pawlak et al., Rudiments of rough sets, Inf. Sci. (2007)
  • J. Qian et al., Hierarchical attribute reduction algorithms for big data using MapReduce, Knowl.-Based Syst. (2015)
  • J. Qian et al., Hybrid approaches to attribute reduction based on indiscernibility and discernibility relation, Int. J. Approximate Reasoning (2011)
  • Y.H. Qian et al., Positive approximation: an accelerator for attribute reduction in rough set theory, Artif. Intell. (2010)
  • M.S. Raza et al., A parallel rough set-based dependency calculation method for efficient feature selection, Appl. Soft Comput. (2018)
  • B.B. Sang et al., Incremental attribute reduction approaches for ordered data with time-evolving objects, Knowl.-Based Syst. (2021)
  • W.H. Shu et al., An incremental approach to attribute reduction from dynamic incomplete decision systems in rough set theory, Data Knowl. Eng. (2015)
  • W.H. Shu et al., Incremental feature selection for dynamic hybrid data using neighborhood rough set, Knowl.-Based Syst. (2020)
  • P. Sowkuntla et al., MapReduce based parallel attribute reduction in Incomplete Decision Systems, Knowl.-Based Syst. (2021)
  • L. Sun et al., Multilabel feature selection using ML-ReliefF and neighborhood mutual information for multilabel neighborhood decision systems, Inf. Sci. (2020)