Abstract:
With the rapid rise of big data, real-time processing of stream data has become very important. Stream data is large, fast-arriving, and unreliable, and traditional algorithms cannot handle it well. Designing an algorithm that can process streaming data efficiently is a challenging task. This paper shows how important it is to develop a real-time clustering algorithm for data streams with high concept drift that can also adapt to different dimensions. We propose the Scalable Random Sampling Online Optimization Weighted Fuzzy c-Means (SRSOO-WFCM) algorithm for handling Big Data in a High-Performance Computing (HPC) environment using an Apache Spark cluster. To compare SRSOO-WFCM with the traditional Online Fuzzy c-Means (OFCM) algorithm, we built a scalable version of OFCM named SOFCM. Both SRSOO-WFCM and SOFCM are incremental algorithms that make a single sequential pass through the data subsets. We run extensive experiments on both loadable and very large datasets to compare the two algorithms. The proposed SRSOO-WFCM algorithm performs better than SOFCM in terms of Normalized Mutual Information (NMI), Adjusted Rand Index (ARI), and F-score.
Date of Conference: 04-07 December 2022
Date Added to IEEE Xplore: 30 January 2023