Future Generation Computer Systems, Volume 86, September 2018, Pages 1395-1412

Clustering big IoT data by metaheuristic optimized mini-batch and parallel partition-based DGC in Hadoop

https://doi.org/10.1016/j.future.2018.03.006

Highlights

  • Compared to mini-batch k-means (MBK), the cluster centers of our proposed method (PK) do not shift drastically; PK updates the centers by taking the streaming average of the samples.

  • Because small batches are used, the overall computational resources required are reduced.

  • Parallel computing speeds up the convergence rate.

  • DGO is used to enhance the clustering ability, avoiding local optima and increasing the chance of finding a globally optimal arrangement of clusters.

  • The proposed algorithm is applied to real IoT data to test its performance on real-world clustering problems.

  • Data-level fusion combined with pre-processing preserves the characteristics of the original data and increases the efficiency of data processing.

Abstract

Clustering algorithms are an important branch of the data mining family and have been applied widely in IoT applications such as finding similar sensing patterns, detecting outliers, and segmenting large behavioral groups in real time. Traditional full-batch k-means for clustering IoT big data is confronted by large-scale storage and high computational complexity. To overcome the latency inherent in full-batch k-means, two big data processing methods are often used. The first is to feed small batches of data to multiple computers to reduce the computational effort. However, depending on the sensed data, which may be heterogeneously fused from different sources in an IoT network, the size of each mini batch may vary in each iteration of the clustering process; when such input data are clustered, their centers can shift drastically, which affects the final clustering results. The second method is parallel computing, which decreases the runtime while the overall computational effort remains the same. Furthermore, centroid-based clustering algorithms such as k-means easily converge to local optima. In light of this, a new partitioned clustering method optimized by a metaheuristic is proposed in this paper for the IoT big data environment. The method has three main activities. First, a sample of the dataset is partitioned into mini batches. This is followed by adjusting the centroids of the mini batches of data. The third step is collating the mini batches to form clusters so that the quality of the clusters is maximized. How the positions of the centroids are optimally attuned over the mini batches is governed by a metaheuristic called Dynamic Group Optimization (DGO). The data are processed in parallel in Hadoop. Extensive experiments are conducted to investigate the performance. The results show that our proposed method is a promising tool for clustering fused IoT data efficiently.

Introduction

Along with the rapid growth of technology, big data [[1], [2]] has been widely used in research and applications. Among these various scenarios, the Internet of Things (IoT) [[3], [4], [5]] requires large-scale data collection but very low-latency, real-time analysis. IoT applications include, but are not limited to, transportation, healthcare, smart cities and sensor monitoring. IoT data are known to be unstructured in format, huge in volume, fast in speed and irregular in synchronization. Effective data mining algorithms are hence in demand to meet the requirements of big data analytics supporting real-time IoT applications. In most IoT scenarios, a number of information systems are built by different departments for various purposes, and the design logic, data model and database of each information system may not be the same. For example, in an air pollution monitoring IoT application, multiple time series of weather data, air quality data, automobile traffic data and geographical reference data are fused into a unified big data set for analysis. Each type of data has its own format and is probably collected at a different sampling rate. Hence some effort is required to carefully fuse them into a composite format while ensuring the data are synchronized and consistent. It is crucial that the data from different sources are converted into a useful form so that the data analytics part of the IoT system can make decisions upon the data efficiently. There are several challenges pertaining to data fusion [6] in IoT that demand a smart computing solution. The main challenges are as follows:

  • IoT data defects arise because sensor data are subject to varying degrees of interference and to the uncertainty of the sensors' own measurements. Data fusion techniques should exploit redundancy to reduce this interference.

  • Abnormal and false IoT data arise from the ambiguity of the environment and the inability to distinguish between noise and real data. Data fusion techniques should improve the recognition rate to reduce their impact.

  • Different IoT data models arise from the different data sources. Data fusion techniques should align the different data to a unified form.

  • High IoT dimensionality arises because IoT applications always produce big data with high dimension. Data fusion techniques should therefore compress the data to reduce the computational effort and communication cost.

According to the abstraction level of the data, data fusion can be divided into three levels, namely data-level fusion, feature-level fusion and decision-level fusion. Fusion at the data level directly associates the raw data from IoT sensors, and then extracts features and makes decisions. The advantage is that little data is lost, so highly accurate results can be provided; however, the cost is high and the processing time is long. Feature-level fusion is intermediate-level fusion: the IoT sensors first compute feature vectors (eigenvectors), which are then passed to the fusion center responsible for their centralized fusion. The advantage is lower cost, but the disadvantages are greater data loss and lower precision. Decision-level fusion is high-level fusion: initially, all IoT sensors make preliminary decisions based on their own data, and then all decisions are sent to the fusion center, which finally fuses the local decisions together. The data loss is very large and the precision of the prediction is very low, but it has strong anti-interference ability and low cost.

In order to cope with increasingly complex and huge IoT data fusion needs, an efficient and accurate data fusion method has become the key requirement. Since the accuracy of the data is the basis for making the right decisions, data-level fusion is the most desirable solution. However, the drawbacks of data-level fusion, such as high cost, affect the efficiency of fusion. One solution, in view of these drawbacks, is for the IoT sensors to perform preliminary data processing to reduce redundancy; the pre-processed data are then sent to the fusion center for clustering in order to further remove redundant information. It is known that IoT sensors merely collect data and at best pre-process them, rather than making any high-level decisions. In general, good data fusion is related to and depends on the efficiency of the processing, and more has to be done to keep up with high data analytic accuracy. Therefore, efficacious data analytic algorithms at the decision level are in high demand, assuming that data fusion has been done sufficiently well at the data level and feature level. For data clustering and control of the data feature dimension at the IoT decision level, the following algorithms are traditionally popular:

Principal component analysis (PCA) [7] is the most commonly used linear dimensionality reduction method. Its goal is to map high-dimensional data to a low-dimensional space by linear projection. PCA seeks to make the variance of the data on the projected dimensions as large as possible, thus using fewer data dimensions while preserving more of the characteristics of the original data. It is a suitable method to deploy on IoT sensors for data pre-processing because most of the information can be preserved.
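To make such sensor-side pre-processing concrete, the following minimal sketch (our illustration, not the authors' code) compresses a batch of synthetic sensor readings with scikit-learn's PCA, keeping only enough components to preserve most of the variance; the synthetic data and the 95% threshold are illustrative assumptions.

```python
# Minimal sketch of sensor-side PCA pre-processing with scikit-learn.
# The synthetic readings and the 95% retained-variance threshold are
# illustrative assumptions, not values taken from the paper.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
sensor_readings = rng.normal(size=(1000, 100))   # 1000 samples, 100 raw features

pca = PCA(n_components=0.95)                     # keep 95% of the variance
compressed = pca.fit_transform(sensor_readings)

print(compressed.shape)                          # reduced dimensionality
print(pca.explained_variance_ratio_.sum())       # variance actually preserved
```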

Clustering algorithms are an important branch of data mining algorithms and have been applied widely in many IoT big data applications. Clustering reveals the intrinsic relationships among data. Data that share common characteristics are grouped into the same cluster, such that intra-cluster similarity is high and inter-cluster similarity is low.

Generally, there are four categories of clustering methods. Hierarchical methods are based on the distance between objects and clusters; the idea is that objects are more related to nearby objects than to farther ones. Balanced iterative reducing and clustering using hierarchies (BIRCH) [8] is the well-known algorithm in this category. The second category is partitioning methods, whose main idea is to construct k (k < n) partitions and then evaluate them by some criterion, for example minimizing the sum of squared errors. Typical algorithms include k-means and affinity propagation clustering (AP) [9]. Density-based methods are the third category: clusters are dense regions in the data space, separated by regions of lower object density, and a cluster is defined as a maximal set of density-connected points. The fourth category is model-based methods, in which a model is hypothesized for each of the clusters and the algorithm tries to find the best fit of that model to the data; the well-known algorithm is expectation maximization clustering (EM) [10].
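As a small illustration of the partitioning criterion mentioned above, the toy sketch below (our example, not taken from the paper) fits k-means on random data and reports the sum of squared errors that the partition tries to minimize; the data and the value of k are arbitrary.

```python
# Toy illustration of the partitioning criterion: k-means minimizes the sum of
# squared distances of samples to their nearest centroid (the "inertia").
# The data and the value of k are arbitrary choices for illustration only.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 8))

km = KMeans(n_clusters=5, n_init=10, random_state=1).fit(X)
print(km.inertia_)   # sum of squared errors of the resulting partition
```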

However, conventional data analytics is becoming less suitable for current big data processing. To meet the challenge of IoT big data applications, researchers have proposed new methods and extended standard clustering algorithms to tackle big data that is too large to load entirely into memory for deriving clusters. k-means [11] was extended by Tang in 2012 to hybridize with the particle swarm algorithm, giving CPSO k-means clustering [[12], [13]], in order to enhance the search ability over high-dimensional data. The wave cluster algorithm (WC) [14] was proposed in 2012; it employs the multi-resolution characteristics of the wavelet transformation to recognize data sets at various scales. Segundo proposed an improved clique algorithm [15] in 2013, which combines the advantages of density- and grid-based methods. It divides data not only based on the grid but also takes density into account, partitioning data into 'dense' and 'sparse' sets and focusing on dense grid-cell data. However, the clique algorithm divides each dimension equally according to the user's setting, which may lead to a natural cluster being split into several artificial clusters; in addition, the number of connections grows exponentially and the computational complexity becomes very high for high-dimensional data sets. CURE [16] was proposed to solve the big data problem; it uses a random sampling technique to reduce the computational cost. Fast spectral clustering [17] and fast computation of Gaussian likelihoods [18] also use random sampling to accelerate clustering. Many partition-based clustering algorithms tend to perform well on large data problems. The most well-known clustering algorithm is k-means, which uses the center/mean of a data group, known as the centroid, to represent the entire group. Memberships are assigned between data and their most similar groups iteratively until no further membership relocation is necessary. k-means can quickly converge to a group of clusters with a certain quality of similarity in each cluster within reasonable run time. Although partition-based algorithms perform well on big data problems compared to other clustering algorithms, the computational cost rises steeply as the scale of the data grows. To solve this problem, Sculley, a researcher at Google Inc., proposed the mini-batch k-means (MBK) [19] algorithm, which randomly polls data in small chunks from the big data source as input in each iteration of the clustering process. This way, clusters are progressively and iteratively built up without loading the full big data archive into memory at once. MBK maintains the fast convergence of k-means, but the quality of the clustering results is not maximized. In practice, MBK is a promising tool for solving the big data clustering problem. There is another method for accelerating the convergence of k-means: introducing distributed computation into clustering. Although it partially solves the computational complexity problem, it brings new problems; for example, Kantabutra proposed a parallel k-means [20] based distributed system that spends a lot of resources on information transfer among multiple nodes. As a result, the overall computation cost is not reduced.
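For reference, mini-batch clustering in the spirit of MBK can be exercised with scikit-learn's MiniBatchKMeans, as in the hedged sketch below; the batch size, number of clusters and synthetic data are illustrative assumptions rather than settings from [19].

```python
# Illustrative mini-batch k-means: the data are fed in small chunks instead of
# loading the full archive at once. Parameters are assumptions, not the paper's.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(2)
big_data = rng.normal(size=(100_000, 50))

mbk = MiniBatchKMeans(n_clusters=20, batch_size=1000, random_state=2)

for start in range(0, big_data.shape[0], 1000):
    mbk.partial_fit(big_data[start:start + 1000])   # one mini batch per call

print(mbk.cluster_centers_.shape)
```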

Nevertheless, classical clustering algorithms such as k-means can easily fall into local optima that do not produce the best clustering result. Achieving a globally optimal clustering result requires an exhaustive process in which all partitioning possibilities are tried out, which is computationally prohibitive [12]. The problem of finding the absolute best clusters by k-means, k-centers and FCM is computationally difficult (NP-hard). Therefore, efficient optimization tools have to be used in lieu of brute force to improve the clustering performance. The dynamic group optimization algorithm (DGO), one of the latest optimization algorithms, is used here to enhance the mini-batch partitioned clustering. DGO is inspired by the intra-society and inter-society communications and interactions of animals, including humans [[21], [22]]. Owing to its efficiency, DGO has been successfully applied to engineering problems as well as financial portfolio optimization and hyperparameter optimization. There are three phases of operation in this algorithm: (1) intragroup cooperation, during which the members of a group are guided by their head animal using random-walk mutation operators; (2) intergroup communication, during which the head animal (the group's best solution) communicates with the other heads to obtain global information; and (3) group variation, during which members transfer to other groups based on the performance of each group's search. DGO is half swarming and half evolutionary; it is applicable to the global search phase while also taking local exploitation into account. Until now, the particle swarm optimization algorithm (PSO) [23] and the genetic algorithm (GA) [24] have been adopted to enhance k-means [[12], [13]]. However, GA and PSO mainly focus on global exploration and lack local exploitation capability; their search performance is relatively inefficient compared to DGO, which specializes in both global exploration and local exploitation. Therefore, it is interesting to apply DGO to improve the clustering performance of the new big data clustering algorithm that partitions the input big data into mini batches. DGO assists the partitioned clustering algorithm in avoiding local optima and accelerating convergence, even though the input data arrive in a distributed manner.
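Since this passage only describes DGO's three phases in words, the following schematic sketch shows how they could be organized on a toy minimization problem; the update rules, group sizes and migration policy are our own illustrative assumptions and not the exact formulation of [21], [22].

```python
# Schematic sketch of the three DGO phases on a toy objective. All numeric
# choices and update rules are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(3)

def objective(x):
    return float(np.sum(x ** 2))        # toy minimization target

n_groups, group_size, dim, iters = 4, 10, 5, 100
groups = [rng.uniform(-5, 5, size=(group_size, dim)) for _ in range(n_groups)]

for _ in range(iters):
    heads = []
    # (1) Intragroup cooperation: members take a random-walk step toward
    #     their group head and keep the move only if it improves them.
    for g in groups:
        head = g[int(np.argmin([objective(m) for m in g]))].copy()
        heads.append(head)
        for i in range(len(g)):
            cand = g[i] + rng.normal(scale=0.1, size=dim) + 0.5 * (head - g[i])
            if objective(cand) < objective(g[i]):
                g[i] = cand
    # (2) Intergroup communication: each head moves toward the globally best head.
    best_head = heads[int(np.argmin([objective(h) for h in heads]))]
    for g, head in zip(groups, heads):
        cand = head + 0.5 * (best_head - head)
        if objective(cand) < objective(head):
            g[int(np.argmax([objective(m) for m in g]))] = cand
    # (3) Group variation: a member of the worst group migrates to the best group.
    scores = [np.mean([objective(m) for m in g]) for g in groups]
    worst, best = int(np.argmax(scores)), int(np.argmin(scores))
    if worst != best and len(groups[worst]) > 1:
        j = int(rng.integers(len(groups[worst])))
        groups[best] = np.vstack([groups[best], groups[worst][j]])
        groups[worst] = np.delete(groups[worst], j, axis=0)

best_member = min((m for g in groups for m in g), key=objective)
print(objective(best_member))
```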

An efficient and accurate data fusion method is crucial for an IoT data system. Low-level data fusion can preserve the integrity of the original data, but it needs high computational effort and incurs a high cost for perfect fusion. Higher-level data fusion incurs relatively lower cost, but the amount of data loss may be large. How to balance computational cost and data accuracy is the key to good IoT data analytics performance. Data-level fusion combined with low-level data pre-processing can be an optimal solution: it not only retains the accuracy of the IoT data but also improves the efficiency of the processing. Moreover, the provision of efficient clustering can further enhance data analysis capabilities. Therefore, in this work, we propose a data-level IoT fusion method coupled with high-level precise clustering. The method has two main components: the first is the IoT sensor equipped with PCA capability; the other is the fusion center, which is responsible for associating all the data and performing further clustering. The structure of our proposed method is shown in Fig. 1.

IoT big data is quite challenging for real-time analysis due to its sheer volume and velocity. Instead of loading the full big data into memory in a single shot, new clustering algorithms are needed that assign streaming data to clusters during their formation. As the data continue to arrive, the clusters evolve, iteratively improving their quality by dynamically changing the memberships of the clustered data. The data feeds are fragmented into multiple incoming mini batches from the big data source. This solution is called 'partition-based' clustering, where the input dataset to be clustered is partitioned into small fragments, namely 'mini-batches'. Partition-based clustering operates like k-means: the distances between the mini batches of data and the centers of the current clusters are measured, and the algorithm iteratively decides from these distances which clusters the new data belong to and whether the current memberships of the existing data need to be updated. Partition-based algorithms usually perform well on big data since the clustering process executes in a multi-processing distributed environment. In Hadoop such MapReduce processing is already supported by the framework: files are fragmented and processed in parallel, relieving memory/IO bottlenecks.
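To illustrate how such MapReduce-style partition-based clustering can be expressed, the sketch below shows one assignment-and-update iteration as Hadoop Streaming-style mapper and reducer functions; the centers file, the comma-separated record format and the function names are our own assumptions, not the authors' actual Hadoop job.

```python
# Simplified sketch of one k-means-style MapReduce iteration in the Hadoop
# Streaming idiom: the mapper assigns each point to its nearest current center,
# the reducer averages the points of each cluster into an updated center.
# File name and comma-separated record format are illustrative assumptions.
import sys
import numpy as np

def load_centers(path="current_centers.txt"):
    """Load the assumed k x d matrix of centers from the previous iteration."""
    return np.loadtxt(path, delimiter=",")

def mapper(lines=sys.stdin, centers=None):
    """Emit 'cluster_id<TAB>point' for the nearest current center."""
    centers = load_centers() if centers is None else centers
    for line in lines:
        point = np.array(line.strip().split(","), dtype=float)
        cid = int(np.argmin(np.linalg.norm(centers - point, axis=1)))
        print(f"{cid}\t{','.join(map(str, point))}")

def reducer(lines=sys.stdin):
    """Average the points of each cluster id to produce the updated centers."""
    sums, counts = {}, {}
    for line in lines:
        cid, value = line.rstrip("\n").split("\t")
        point = np.array(value.split(","), dtype=float)
        sums[cid] = sums.get(cid, 0) + point
        counts[cid] = counts.get(cid, 0) + 1
    for cid, total in sums.items():
        print(f"{cid}\t{','.join(map(str, total / counts[cid]))}")
```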

However, if the amount of data is huge, reaching the maximum capacity, very large computational effort is required even when the data are processed in parallel, since parallelism alone does not reduce the total work. Two useful ways to reduce the computational effort have been proposed in the literature. The first is to use small batches of data as input in each iteration, as in the environment described above; this type of partition-based clustering algorithm, modified from k-means, is called mini-batch k-means (MBK). The second is to run the clustering algorithms directly on a distributed system such as Hadoop [25], where the big data are automatically fragmented into default mini batches. In our observation, both approaches have disadvantages. With the first method, the centers of the data in the mini batches may shift without control in each k-means iteration; if the selected data in the batches differ, the final clustering results are affected. The second method simply assigns the whole clustering job to different nodes: although it reduces the runtime, there is hardly any true speed-up, since the overall computational effort remains the same. Moreover, many clustering algorithms fall easily into local optima, so leaving them to run on different Hadoop nodes without control may not be a good idea.

To alleviate the disadvantages of these two methods, we propose a partition-based k-means clustering algorithm optimized by a metaheuristic (PK for short). This algorithm not only adopts the concept of mini-batch clustering operating in a parallel computing environment, it also incorporates metaheuristic optimization that prevents k-means from ending up in a local optimum. Specifically, the partition-based clustering algorithm is first integrated with DGO, empowering the clustering process with optimization ability. The DGO-optimized clustering is called Dynamic Group Clustering (DGC); it adds a layer of logic above the clustering that guides the formation of clusters in such a way that locally optimal solutions are avoided.

DGC works in cooperation with partitioning of the dataset. There are three major steps in the PK data partitioning part. In the first step, PK chooses samples randomly from the input dataset and forms them into small batches. Unlike MBK, each small batch of data is treated in PK as a complete dataset over which clustering is performed. Each mini batch is clustered in parallel until the clustering converges to a terminal condition or an upper limit of iterations is reached. In the second step, all the computed center points from the individual small batches form a new small dataset; this file of center points is loaded into DGO clustering (DGC) to determine the final and optimal k centers. In the third step, each sample of the whole dataset is assigned to the nearest cluster center; from there, the membership of each data point is updated if its distance to the new cluster center is found to be shorter than the distance to its existing cluster center. The boundaries of the clusters, which are represented by the memberships of the data points, are updated accordingly. As PK assigns samples, the centers approach their optimal positions. PK uses small batches to reduce the computation time; meanwhile, the multiple batches can run in parallel in a distributed environment. Using mini batches in clustering and executing them in parallel reduces the computational effort significantly. An overall framework of the PK process flow is shown in Fig. 2.
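The three steps above can be summarized in the following hedged sketch. Per-batch k-means stands in for the parallel per-batch clustering, and plain k-means over the collected batch centers stands in for the DGC refinement, which is not reproduced here; all sizes and parameters are illustrative.

```python
# Hedged sketch of the three PK steps: per-batch clustering, collation of the
# batch centers, and final assignment. Plain k-means over the collected centers
# stands in for the DGC refinement; all sizes/parameters are illustrative.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
data = rng.normal(size=(20_000, 10))
k, batch_size, n_batches = 8, 2_000, 10

# Step 1: cluster each randomly drawn mini batch as if it were a complete dataset.
batch_centers = []
for _ in range(n_batches):
    batch = data[rng.choice(data.shape[0], size=batch_size, replace=False)]
    batch_centers.append(KMeans(n_clusters=k, n_init=5).fit(batch).cluster_centers_)
batch_centers = np.vstack(batch_centers)

# Step 2: the collected batch centers form a small dataset; here ordinary k-means
# plays the role of the DGO-driven DGC step that picks the final k centers.
final_centers = KMeans(n_clusters=k, n_init=10).fit(batch_centers).cluster_centers_

# Step 3: assign every sample of the whole dataset to its nearest final center.
dists = np.linalg.norm(data[:, None, :] - final_centers[None, :, :], axis=2)
labels = np.argmin(dists, axis=1)
print(np.bincount(labels))
```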

We propose a data fusion method and the PK algorithm for solving the IoT big data problem. This is done by associating small data batches with optimized partition-based clustering, all operating in a distributed computing environment. The unique features, as well as the main differences between our proposed model and others, are listed as follows:

(1) Data-level fusion combined with pre-processing preserves the characteristics of the original data and increases the efficiency of data processing.

(2) Compared to MBK, the centers of data clusters using PK will not change tremendously. PK updates the cluster centers by taking the average of the streaming samples over time.

(3) Since small batches are used, the overall computational resources required are reduced.

(4) Computing the clustering tasks in parallel increases the convergence rate.

(5) DGO is used to enhance the clustering ability, giving rise to a new optimized clustering method called DGC.

(6) Real IoT big data is used to test the performance of our proposed model.

This article is structured as follows. In Section 2, a brief explanation of PCA, DGO and clustering is presented. The details of our methodology are given in Section 3. The results of the experimentation and an analysis of the results are presented in Section 4. Finally, conclusions and further research are presented in Section 5.

Background

Principal component analysis (PCA) [7] is a useful compression tool that has been successfully applied in fields such as image recognition and statistics. It is a powerful technique for finding patterns in high-dimensional data. The general principle is this: if all the points are mapped onto the same location, then almost all of the information (such as the distances among points) is lost; but if the variance of the mapping is as large as possible, the data points remain dispersed, so we can retain more
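Written out, the variance-maximization principle described in this (truncated) snippet corresponds to the textbook PCA objective below, a standard formulation not quoted from the paper: find a unit-length projection direction that maximizes the variance of the projected data, where the covariance matrix is estimated from the samples.

```latex
% Standard PCA objective implied by the snippet above (not quoted from the paper).
\max_{\mathbf{w}} \ \mathbf{w}^{\top}\boldsymbol{\Sigma}\,\mathbf{w}
\quad \text{s.t.} \quad \lVert \mathbf{w} \rVert_2 = 1,
\qquad
\boldsymbol{\Sigma} = \frac{1}{n}\sum_{i=1}^{n}(\mathbf{x}_i-\bar{\mathbf{x}})(\mathbf{x}_i-\bar{\mathbf{x}})^{\top}
```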

Methodology

In this section, the idea and details of our proposed method are presented.

Experiment setup

In order to fully evaluate the performance of the proposed data fusion method and PK, two test suites are used. The first test suite consists of four artificially generated big datasets, each containing more than one thousand objects and 50 attributes. These artificial data are generated using the Scikit-learn [32] software package, along the lines sketched below.
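As a rough illustration of how such an artificial clustering dataset can be produced with Scikit-learn, the snippet below generates Gaussian blobs with sizes echoing the first artificial dataset described next; the generator and its parameters are our assumption, not the authors' exact procedure (which mentions uniform sampling).

```python
# Rough illustration of generating an artificial clustering dataset with
# Scikit-learn; sizes echo the first artificial dataset, but the blob generator
# is an assumption and not the authors' exact procedure.
from sklearn.datasets import make_blobs

X, y = make_blobs(
    n_samples=10_000,   # objects
    n_features=100,     # attributes
    centers=20,         # clusters
    random_state=0,
)
print(X.shape, len(set(y)))
```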

Artificial Data Set 1: This dataset contains 10,000 objects with 100 attributes and 20 clusters. Samples were drawn from uniform

Conclusion

The analytics of IoT big data faces a series of problems such as large-scale storage and high computational complexity. Most IoT data are unstructured, large-scale and dynamic. Effective data mining for big data has a profound impact on the progress of IoT technology, especially in real-time applications. In this study, a new IoT data fusion method is proposed. We used data-level fusion with PCA to reduce the cost and keep the original information. It improves the

Acknowledgments

The authors are thankful for the financial support from the research grants: (1) MYRG2016-00069, titled 'Nature-Inspired Computing and Metaheuristics Algorithms for Optimizing Data Mining Performance', offered by RDAO/FST, University of Macau and the Macau SAR government, Macau; (2) FDCT/126/2014/A3, titled 'A Scalable Data Stream Mining Methodology: Stream-based Holistic Analytics and Reasoning in Parallel', offered by FDCT of the Macau SAR government, Macau.

References (39)

  • Zhang, T., et al., BIRCH: A new data clustering algorithm and its applications, Data Min. Knowl. Discov. (1997)

  • Frey, B.J., et al., Clustering by passing messages between data points, Science (2007)

  • Dempster, A.P., et al., Maximum likelihood from incomplete data via the EM algorithm, J. R. Stat. Soc. Ser. B Methodol. (1977)

  • MacQueen, J., Some methods for classification and analysis of multivariate observations, in: Proceedings of the Fifth...

  • Tang, R., et al., Integrating nature-inspired optimization algorithms to K-means clustering

  • Rui, T., et al., Nature-inspired clustering algorithms for web intelligence data

  • Deng, W., Wang, L., Qi, J., An improved support vector machine model based on wave cluster, in: Proc of the 11th...

  • San Segundo, P., et al., An improved bit parallel exact maximum clique algorithm, Optim. Lett. (2013)

  • Guha, S., et al., CURE: an efficient clustering algorithm for large databases

Rui Tang received the B.S. degree in software engineering from Nanchang University, JiangXi, China, in 2010 and the M.S. degree in software engineering from the University of Macau, Taipa, Macau SAR, in 2013. Currently, he is working toward the Ph.D. degree at the University of Macau. His research interests are in metaheuristic algorithms and data mining.

Simon Fong graduated from La Trobe University, Australia, with a first class honors B.Eng. degree in computer systems in 1993 and received the Ph.D. degree in computer science in 1998. He is currently an associate professor in the Computer and Information Science Department, University of Macau. He is also one of the founding members of the Data Analytics and Collaborative Computing Research Group in the Faculty of Science and Technology. Before joining the University of Macau, he was an assistant professor in the School of Computer Engineering, Nanyang Technological University, Singapore. Prior to his academic career, he held various managerial and technical posts, such as systems engineer, IT consultant, and e-commerce director, in Melbourne, Hong Kong and Singapore. Companies he previously worked for include Hong Kong Telecom, Singapore Network Services, AES Pro-Data and United Overseas Bank, Singapore. He has published more than 286 international conference and peer-reviewed journal papers, mostly in the areas of e-commerce technology, business intelligence, and data mining.
