Information Sciences (Elsevier)

Volume 610, September 2022, Pages 1102-1121

Parallel incremental efficient attribute reduction algorithm based on attribute tree

https://doi.org/10.1016/j.ins.2022.08.044

Highlights

  • We introduce the mechanism of a binary tree and propose a parallel incremental acceleration strategy based on the attribute tree.

  • A branch threshold coefficient is incorporated into the calculation to let the algorithm exit the loop early, avoiding redundant calculations and reducing the number of attribute evaluations.

  • When multiple incremental objects are added to the decision system, the incremental mechanism can be used to update the reduction.

  • We combine IARAT and Spark parallel technology to parallelize data processing to accelerate the calculation process.

Abstract

Attribute reduction is an important application of rough sets. Efficiently reducing massive dynamic datasets has long been a major goal of researchers. Traditional incremental methods update reductions through updated approximations; however, they must evaluate all attributes and repeatedly recalculate their importance, so applying them to large decision systems is inefficient due to their high time complexity. We propose an incremental acceleration strategy based on attribute trees to solve this problem. The key step is to cluster all attributes into multiple trees for incremental attribute evaluation. Specifically, we first select the appropriate attribute tree for attribute evaluation according to the attribute-tree correlation measure, reducing the time complexity. Next, a branch coefficient is added to the stopping criterion; it increases with the branch depth and causes the search to exit the loop once it reaches the maximum threshold, which avoids redundant calculation and improves efficiency. Building on these improvements, we propose an incremental attribute reduction algorithm based on attribute trees (IARAT). Finally, a Spark parallel mechanism is added to parallelize data processing, yielding a parallel incremental efficient attribute reduction algorithm based on the attribute tree (PIARAT). Experimental results on the Shuttle dataset show that our algorithm consumes more than 40% less time than the classical IARC algorithm while maintaining comparable classification performance. In addition, adding the Spark parallelization mechanism shortens the time by more than 87% from the benchmark.

Introduction

With the development of information technology, science and industry are generating an ever-growing amount of data that will soon exceed the capacity of our applications, and the demand for computing resources keeps rising. Modern information technology has made records and data collection fine-grained and multi-dimensional [1], [41]. Processing massive data and discovering knowledge in it have long been of great interest in data mining. "Big data" refers to a volume of data beyond the storage and processing capabilities of traditional systems; it typically holds huge potential value. However, huge, complex datasets often carry significant redundancy [6], [38] and challenge the limits of data storage, computational efficiency, and accuracy.

Rough set theory (RST) [24] is commonly used to analyze decision-making uncertainty. Rough sets partition uncertain knowledge concepts through indiscernibility relations and obtain equivalence-class sets under different equivalence relations, thereby establishing an approximation space. RST has been applied in image processing [7], cluster analysis [8], pattern recognition [40], machine learning [4], [15], feature selection [23], [39], decision support [10], [13], [43], and data mining [14], [44]. Attribute reduction is an important concept in rough set theory that has become a data preprocessing tool for improving accuracy and discovering potentially useful knowledge.
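The indiscernibility partition and the lower/upper approximations it induces can be sketched in a few lines of Python. This is a toy illustration under a simple value-keyed representation, not the paper's implementation:

```python
from collections import defaultdict

def partition(U, attrs):
    """Group objects into equivalence classes under IND(attrs):
    two objects are equivalent iff they agree on every attribute in attrs."""
    classes = defaultdict(list)
    for i, x in enumerate(U):
        classes[tuple(x[a] for a in attrs)].append(i)
    return list(classes.values())

def lower_upper(U, attrs, target):
    """Lower/upper approximation of a target set of object indices."""
    lower, upper = set(), set()
    for block in partition(U, attrs):
        b = set(block)
        if b <= target:       # block entirely inside the concept
            lower |= b
        if b & target:        # block overlaps the concept
            upper |= b
    return lower, upper

# Toy decision table: each row maps attribute name -> value.
U = [
    {"a1": 0, "a2": 1},
    {"a1": 0, "a2": 1},
    {"a1": 1, "a2": 0},
    {"a1": 1, "a2": 1},
]
X = {0, 2}  # target concept (object indices)
low, up = lower_upper(U, ["a1", "a2"], X)
# low = {2}: only the block {2} lies wholly in X;
# up = {0, 1, 2}: blocks {0, 1} and {2} both meet X.
```

Objects 0 and 1 are indiscernible on {a1, a2}, so they enter the upper approximation together even though only one of them belongs to X; this gap between the two approximations is exactly the uncertainty rough sets quantify.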

Classic attribute reduction algorithms, such as those based on positive regions [47], [30], information entropy [18], [19], and discernibility matrices [36], [28], load a small dataset into the main memory of a single machine at one time and therefore cannot process massive amounts of data. Much research has consequently focused on parallel attribute reduction using big data technologies such as Hadoop [20] and MapReduce [17]. MapReduce is an offline computing framework that abstracts an algorithm into the two stages of Map and Reduce; it suits data-intensive computing but is inefficient because of frequent disk I/O and many sorting operations. Streaming data is, in theory, an infinite sequence of records; Storm is a streaming data processing system that processes one record at a time and must reach millisecond-level latency. Tez is a computing framework that runs on YARN to support DAG jobs and generalizes MapReduce-style data processing; however, Tez cannot reuse intermediate results, so iterative computation incurs considerable redundant calculation. Spark is an improved distributed computing framework based on MapReduce that keeps data in memory as much as possible to reduce disk I/O, improving the computational efficiency of iterative and interactive applications.
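To make the two-stage abstraction concrete, here is word count written in the Map/Reduce style in plain Python — an illustration of the programming model only, not of Hadoop or Spark themselves:

```python
from itertools import groupby

def map_phase(records):
    """Map: emit a (key, value) pair per word occurrence."""
    for line in records:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Shuffle/sort brings equal keys together; Reduce folds each group."""
    ordered = sorted(pairs)  # stands in for the shuffle/sort stage
    return {k: sum(v for _, v in grp)
            for k, grp in groupby(ordered, key=lambda p: p[0])}

counts = reduce_phase(map_phase(["a b a", "b c"]))
# counts == {"a": 2, "b": 2, "c": 1}
```

In MapReduce every such round materializes its output to disk; Spark's advantage for iterative algorithms like attribute reduction is that the intermediate pairs can stay cached in memory between rounds.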

To address the limitations of serial algorithms, Zhang et al. [46] proposed a heuristic parallel attribute reduction algorithm (PLAR) that obtains the same attribute reduction as traditional methods on the Spark platform. Muhammad et al. [31] proposed a parallel attribute reduction algorithm that searches all positive regions in parallel, with computational efficiency 63% higher than that of classical algorithms; however, because the positive regions must be calculated repeatedly, its time consumption is still high. Qian et al. [27], [29] studied the attribute reduction process in the MapReduce framework and proposed a parallel knowledge reduction algorithm, showing that attribute reduction of massive data is feasible, although the frequent I/O operations of the MapReduce framework keep its efficiency low. Sowkuntla et al. [37] proposed MapReduce-based parallel/distributed approaches for attribute reduction in massive incomplete decision systems (IDS), which handle datasets that are large in both objects and attributes through different data-partitioning strategies; in practice, this approach works only for datasets with many objects and a moderate number of attributes. The shortcomings that stem from the inherent limitations of MapReduce and of traditional reduction algorithms remain to be addressed.

Many methods have been developed to solve the attribute reduction problem of static decision systems, i.e., non-incremental reduction algorithms [9], [5], [11]. However, decision systems may change over time: when the object set changes dynamically, obtaining a new reduction requires recomputing over the whole decision system, which is costly. Such algorithms are therefore inefficient on dynamic decision systems, and updating the reduction is the key issue in improving efficiency. Incremental learning can effectively improve the efficiency of knowledge discovery by reusing the valuable results already obtained on the legacy decision system, and many incremental algorithms have been proposed for dynamic data. Liu et al. [22] proposed an incremental approach that obtains reductions by constructing support, precision, and coverage matrices; however, matrix construction is unsuitable for large-scale attribute reduction. Jin et al. [12] proposed an incremental attribute reduction algorithm based on updated knowledge granularity, but the multiple equivalence-class divisions required on large datasets significantly increase its computation time. For dynamic ordered data, Sang et al. [33] studied incremental attribute reduction in the DRSA framework and proposed a matrix-based dominance-degree calculation to update the reduction, but updating the matrix on large-scale datasets takes substantial time. Shu et al. [35] proposed an incremental feature selection framework based on neighborhood entropy for dynamic mixed data with mixed-type features, yet processing large increments remains time-consuming. All of the above methods focus on updating reductions via updated approximations and are inefficient for reducing large-scale decision systems.
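The core idea these incremental methods share — when a batch of objects arrives, update only the affected equivalence classes instead of repartitioning the whole universe — can be sketched as follows. The value-keyed class map is an assumption for illustration; the paper's own update formulas operate on knowledge granularity:

```python
from collections import defaultdict

def initial_partition(U, attrs):
    """Map each attribute-value tuple to the indices of its equivalence class."""
    classes = defaultdict(list)
    for i, x in enumerate(U):
        classes[tuple(x[a] for a in attrs)].append(i)
    return classes

def add_batch(classes, U, new_objects, attrs):
    """Merge a batch of new objects into the existing classes.
    Only the classes the newcomers fall into are touched, so the cost
    is proportional to the batch size, not to |U|."""
    start = len(U)
    U.extend(new_objects)
    for offset, x in enumerate(new_objects):
        classes[tuple(x[a] for a in attrs)].append(start + offset)
    return classes

U = [{"a": 0}, {"a": 1}]
cls = initial_partition(U, ["a"])
add_batch(cls, U, [{"a": 0}, {"a": 2}], ["a"])
# (0,) -> [0, 2]   object 2 joins an existing class
# (2,) -> [3]      object 3 opens a new class
```

A new object either joins an existing class (coarsening nothing) or opens a new singleton class; an importance measure such as knowledge granularity can then be updated from these local changes alone.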

We examined the Spark parallelization method and analyzed current incremental reduction algorithms. The concept of the attribute group proposed by Chen et al. [2] is integrated with the mechanism of a binary tree, and the resulting novel incremental attribute reduction algorithm is combined with the Spark parallel framework. To demonstrate the effectiveness of the proposed algorithm, we evaluated its performance on multiple UCI datasets.

The main contributions of this paper are as follows:

  • (1)

    Starting from the concept of an attribute group, we introduce the mechanism of a binary tree and propose a parallel incremental acceleration strategy based on the attribute tree. All conditional attributes are clustered into multiple attribute trees, and a core-attribute search is performed on the trees with higher correlations through multiple rounds of branching. A branch threshold coefficient is incorporated into the calculation to let the algorithm exit the loop early, avoiding redundant calculations, reducing the number of attribute evaluations, and improving the efficiency and accuracy of attribute reduction.

  • (2)

    When multiple incremental objects are added to the decision system, the incremental mechanism can be used to update the reduction. Based on the above improvements, we analyze the case of multiple incremental objects and propose an incremental attribute reduction algorithm based on the attribute tree (IARAT).

  • (3)

    Considering the performance advantages of the Spark framework, we combine IARAT with Spark parallel technology to parallelize data processing and accelerate computation, yielding PIARAT, a parallel incremental efficient attribute reduction algorithm based on the attribute tree.
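The tree search and branch threshold of contribution (1) can be sketched as below. The round-robin grouping, the `gain` function, and the depth threshold are hypothetical placeholders for the paper's correlation measure and attribute-importance evaluation:

```python
def build_attribute_trees(attributes, relevance, k):
    """Cluster attributes into k groups ('trees'): rank by a relevance
    score and deal them out round-robin, so each tree starts with a
    strong attribute. (Hypothetical rule -- the paper's measure differs.)"""
    ranked = sorted(attributes, key=relevance.get, reverse=True)
    return [ranked[i::k] for i in range(k)]

def search_with_branch_threshold(trees, gain, max_depth):
    """Walk the most relevant tree first. The branch coefficient grows
    with each branch visited, and the loop is abandoned once it passes
    max_depth, so low-ranked branches are never evaluated at all."""
    selected, depth = [], 0
    for tree in trees:
        for a in tree:
            depth += 1                 # branch coefficient grows per branch
            if depth > max_depth:      # threshold reached: jump out of the loop
                return selected
            if gain(a) > 0:            # keep attributes that add information
                selected.append(a)
    return selected

rel = {"a1": 0.9, "a2": 0.1, "a3": 0.7, "a4": 0.3}
trees = build_attribute_trees(list(rel), rel, k=2)
picked = search_with_branch_threshold(trees, gain=lambda a: rel[a] - 0.5,
                                      max_depth=3)
# picked == ["a1", "a3"]: the search stops before ever evaluating a2
```

The saving comes from the early exit: with n attributes and threshold t, at most t attribute evaluations are performed per round instead of n.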

The rest of this paper is organized as follows. Section 2 reviews the basic concepts of rough sets, knowledge granularity, and the Apache Spark computing engine. Section 3 details PIARAT, and Section 4 evaluates the proposed algorithm on multiple UCI datasets. Conclusions are presented in Section 5.

Section snippets

Preliminaries

We review the concepts used in this article. Section 2.1 introduces rough sets, Section 2.2 reviews the representation of knowledge granularity and related incremental mechanisms, and Section 2.3 introduces the Spark parallel framework.

2.1. Rough set theory

An information system is a quadruple S = ⟨U, C ∪ D, V, f⟩, where U = {x1, x2, x3, ..., xN} is a finite nonempty object set; N is the number of samples in the universe; C = {a1, a2, a3, ..., an} is the nonempty finite set of all condition attributes; n is the number of

PIARAT

We integrate the concept of the attribute group with binary trees and combine the Spark parallel framework to design PIARAT. The original data must be preprocessed to meet the input requirements of the algorithm; redundant attributes are removed to obtain the core attributes. The resulting algorithm accelerates dynamic attribute reduction.
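As a rough sketch of this preprocessing step — exposing core attributes by removing redundant ones — the snippet below uses the standard knowledge-granularity criterion as an illustrative stand-in for the paper's Algorithm 1 (the definitions reviewed in Section 2.2):

```python
from collections import defaultdict

def gk(U, attrs):
    """Knowledge granularity GK(attrs) = sum(|X_i|^2) / |U|^2 over the
    partition U/attrs; smaller values mean a finer partition."""
    sizes = defaultdict(int)
    for x in U:
        sizes[tuple(x[a] for a in attrs)] += 1
    n = len(U)
    return sum(c * c for c in sizes.values()) / (n * n)

def core_attributes(U, C):
    """An attribute is core if dropping it strictly coarsens the partition,
    i.e. increases the granularity of the full condition set C.
    (Illustrative criterion; the paper evaluates C relative to D.)"""
    full = gk(U, C)
    return [a for a in C if gk(U, [c for c in C if c != a]) > full]

# 'c' is constant, hence redundant; 'a' and 'b' are both needed.
U = [
    {"a": 0, "b": 0, "c": 0},
    {"a": 0, "b": 1, "c": 0},
    {"a": 1, "b": 1, "c": 0},
]
core = core_attributes(U, ["a", "b", "c"])
# core == ["a", "b"]
```

Dropping the constant attribute c leaves the partition unchanged, so it is discarded during preprocessing, while a and b each discern objects no other attribute can.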

This section has three parts. Section 3.1 introduces the parallel preprocessing algorithm of raw data (Algorithm 1). IARAT is introduced in Section 3.2,

Experimental analysis

We conducted experiments to evaluate the effectiveness and efficiency of the proposed incremental algorithm in the case of object changes. The experimental platform was an Intel Core i5-10400H CPU at 2.30 GHz, 16 GB RAM, Windows 10 Home Chinese version, JetBrains PyCharm development tool, and the Python language. The simulation environment platform of MapReduce Hadoop-2.7.1 and Spark-3.0.0-preview was built on a Windows 10 system, and the cluster operation adopted the local mode. In this

Conclusions

We started with the concept of an attribute group, introduced the mechanism of a binary tree to find a reduction, and proposed IARAT. The branch coefficient was added to improve the efficiency and precision of attribute reduction. With Spark parallel technology to parallelize data processing and to consider the performance advantages of the Spark framework, a new method based on attribute trees was proposed: PIARAT. Experiments showed that IARAT is effective and efficient. This article focuses

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors would like to express their sincere appreciation to the editor and anonymous reviewers for their insightful comments, which greatly improved the quality of this paper.

This work is supported in part by the National Natural Science Foundation of China (61976120, 62006128, 62102199), the Natural Science Foundation of Jiangsu Province (BK20191445), the Natural Science Key Foundation of Jiangsu Education Department (21KJA510004), the General Program of the Natural Science Foundation of

References (47)

  • J.Y. Liang et al., An efficient rough feature selection algorithm with a multi-granulation view, Int. J. Approximate Reasoning (2012)
  • D. Liu et al., A rough set-based incremental approach for learning knowledge in dynamic incomplete information systems, Int. J. Approximate Reasoning (2014)
  • K.Y. Liu et al., Rough set based semi-supervised feature selection via ensemble selector, Knowl.-Based Syst. (2019)
  • Z. Pawlak et al., Rudiments of rough sets, Inf. Sci. (2007)
  • J. Qian et al., Hierarchical attribute reduction algorithms for big data using MapReduce, Knowl.-Based Syst. (2015)
  • J. Qian et al., Hybrid approaches to attribute reduction based on indiscernibility and discernibility relation, Int. J. Approximate Reasoning (2011)
  • Y.H. Qian et al., Positive approximation: an accelerator for attribute reduction in rough set theory, Artif. Intell. (2010)
  • M.S. Raza et al., A parallel rough set-based dependency calculation method for efficient feature selection, Appl. Soft Comput. (2018)
  • B.B. Sang et al., Incremental attribute reduction approaches for ordered data with time-evolving objects, Knowl.-Based Syst. (2021)
  • W.H. Shu et al., An incremental approach to attribute reduction from dynamic incomplete decision systems in rough set theory, Data Knowl. Eng. (2015)
  • W.H. Shu et al., Incremental feature selection for dynamic hybrid data using neighborhood rough set, Knowl.-Based Syst. (2020)
  • P. Sowkuntla et al., MapReduce based parallel attribute reduction in Incomplete Decision Systems, Knowl.-Based Syst. (2021)
  • L. Sun et al., Multilabel feature selection using ML-ReliefF and neighborhood mutual information for multilabel neighborhood decision systems, Inf. Sci. (2020)