Future Generation Computer Systems, Volume 86, September 2018, Pages 1395-1412

Clustering big IoT data by metaheuristic optimized mini-batch and parallel partition-based DGC in Hadoop

https://doi.org/10.1016/j.future.2018.03.006

Highlights

  • Compared to mini-batch k-means (MBK), the cluster centers of our proposed method (PK) do not shift drastically; PK updates the centers by taking the streaming average of the samples.

  • Because small batches are used, the overall computational resources required are reduced.

  • Parallel computing speeds up the convergence rate.

  • DGO is used to enhance the clustering ability, avoiding local optima and increasing the chance of finding a globally optimal arrangement of clusters.

  • The proposed algorithm is applied to real IoT data to test its performance on real-world clustering problems.

  • Data-level fusion combined with pre-processing preserves the characteristics of the original data and increases the efficiency of data processing.

Abstract

Clustering algorithms are an important branch of the data mining family and have been applied widely in IoT applications such as finding similar sensing patterns, detecting outliers, and segmenting large behavioral groups in real time. Traditional full-batch k-means for clustering IoT big data is confronted by large-scale storage and high computational complexity. To overcome the latency inherent in full-batch k-means, two big data processing methods are often used. The first is to feed small batches of data to multiple computers to reduce the computational effort. However, depending on the sensed data, which may be heterogeneously fused from different sources in an IoT network, the size of each mini batch may vary in each iteration of the clustering process; when such input data are clustered, their centers can shift drastically, which affects the final clustering results. The second method is parallel computing, which decreases the runtime while the overall computational effort remains the same. Furthermore, centroid-based clustering algorithms such as k-means easily converge to local optima. In light of this, a new partitioned clustering method optimized by a metaheuristic is proposed in this paper for the IoT big data environment. The method has three main activities. First, a sample of the dataset is partitioned into mini batches. This is followed by adjusting the centroids of the mini batches of data. The third step is collating the mini batches to form clusters so that the quality of the clusters is maximized. How the positions of the centroids are optimally attuned over the mini batches is governed by a metaheuristic called Dynamic Group Optimization (DGO). The data are processed in parallel in Hadoop. Extensive experiments are conducted to investigate the performance. The results show that our proposed method is a promising tool for clustering fused IoT data efficiently.

Introduction

Along with the rapid growth of technology, big data [[1], [2]] has been widely used in research and applications. Among these various scenarios, the Internet of Things (IoT) [[3], [4], [5]] requires large-scale data collection but very low-latency, real-time analysis. IoT applications include, but are not limited to, transportation, healthcare, smart cities and sensor monitoring. IoT data are known to be unstructured in format, huge in volume, fast in speed and irregular in synchronization. Effective data mining algorithms are hence in demand to meet the requirements of big data analytics supporting real-time IoT applications. In most IoT scenarios, a number of information systems are built by different departments for various purposes, and the design logic, data model and database of each information system may not be the same. For example, in an air pollution monitoring IoT application, multiple time series of weather data, air quality data, automobile traffic data and geographical reference data are fused into a unified big data set for analysis. Each type of data has its own format and is probably collected at a different sampling rate. Hence some effort is required to carefully fuse them into a composite format while ensuring the data are synchronized and consistent. It is crucial that the data from different sources are converted into a useful form so that the data analytics part of the IoT system can make decisions upon the data efficiently. There are several challenges pertaining to data fusion [6] in IoT that demand a smart computing solution. The main challenges are as follows:

  • IoT data defects arise because sensor data are subject to varying degrees of interference and to the uncertainty of the sensors' own measurements. Data fusion techniques should exploit redundancy to reduce this interference.

  • Abnormal and false IoT data arise from the ambiguity of the environment and the inability to distinguish between noise and real data. Data fusion techniques should improve the recognition rate to reduce their impact.

  • Different IoT data models arise from the different data sources. Data fusion techniques should align the different data to a unified form.

  • High IoT dimensionality arises because IoT applications always produce big data with high dimension. Data fusion techniques should therefore compress the data to reduce the computational effort and communication cost.

According to the abstraction level of the data, data fusion can be divided into three levels, namely data-level fusion, feature-level fusion and decision-level fusion. Fusion at the data level directly associates the raw data from IoT sensors, and then extracts features and makes decisions. The advantage is that little data is lost, so highly accurate results can be provided; however, the cost is high and the processing time is long. Feature-level fusion is intermediate-level fusion: the IoT sensors first compute feature vectors (eigenvectors), which are then passed to the fusion center responsible for their centralized fusion. The advantage is lower cost, but the disadvantages are greater data loss and lower precision. Decision-level fusion is high-level fusion: initially, all IoT sensors make preliminary decisions based on their own data, and then all decisions are sent to the fusion center, which finally fuses the local decisions together. The data loss is very large and the precision of the prediction is very low, but it has strong anti-interference ability and low cost.

In order to cope with increasingly complex and huge IoT data fusion needs, an efficient and accurate data fusion method has become the key requirement. Since the accuracy of the data is the basis for making the right decisions, data-level fusion is the most desirable solution. However, the drawbacks of data-level fusion, such as high cost, affect the efficiency of fusion. One solution, in view of these drawbacks, is for the IoT sensors to perform preliminary data processing to reduce redundancy; the pre-processed data are then sent to the fusion center for clustering in order to further remove redundant information. It is known that IoT sensors merely collect data and at best pre-process them, rather than making any high-level decisions. In general, good data fusion is related to and depends on the efficiency of the processing, and more has to be done to keep up with high data analytic accuracy. Therefore, efficacious data analytic algorithms at the decision level are in high demand, assuming that data fusion has been done sufficiently well at the data level and feature level. For data clustering and control of the data feature dimension at the IoT decision level, the following algorithms are traditionally popular:

Principal component analysis (PCA) [7] is the most commonly used linear dimensionality reduction method. Its goal is to map high-dimensional data to a low-dimensional space by linear projection. PCA seeks to make the variance of the data on the projected dimensions as large as possible, thus using fewer data dimensions while preserving more of the characteristics of the original data. It is a suitable method to deploy on IoT sensors for data pre-processing because most of the information can be preserved.
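To make such sensor-side pre-processing concrete, the following minimal sketch (our illustration, not the authors' code) compresses a batch of synthetic sensor readings with scikit-learn's PCA, keeping only enough components to preserve most of the variance; the synthetic data and the 95% threshold are illustrative assumptions.

```python
# Minimal sketch of sensor-side PCA pre-processing with scikit-learn.
# The synthetic readings and the 95% retained-variance threshold are
# illustrative assumptions, not values taken from the paper.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
sensor_readings = rng.normal(size=(1000, 100))   # 1000 samples, 100 raw features

pca = PCA(n_components=0.95)                     # keep 95% of the variance
compressed = pca.fit_transform(sensor_readings)

print(compressed.shape)                          # reduced dimensionality
print(pca.explained_variance_ratio_.sum())       # variance actually preserved
```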

Clustering algorithms are an important branch of data mining algorithms and have been applied widely in many IoT big data applications. Clustering reveals the intrinsic relationships among data. Data that share common characteristics are grouped into the same cluster, such that intra-cluster similarity is high and inter-cluster similarity is low.

Generally, there are four categories of clustering methods. Hierarchical methods are based on the distance between objects and clusters; the idea is that objects are more related to nearby objects than to farther ones. Balanced iterative reducing and clustering using hierarchies (BIRCH) [8] is the well-known algorithm in this category. The second category is partitioning methods, whose main idea is to construct k (k < n) partitions and then evaluate them by some criterion, for example minimizing the sum of squared errors. Typical algorithms include k-means and affinity propagation clustering (AP) [9]. Density-based methods are the third category: clusters are dense regions in the data space, separated by regions of lower object density, and a cluster is defined as a maximal set of density-connected points. The fourth category is model-based methods, in which a model is hypothesized for each of the clusters and the algorithm tries to find the best fit of that model to the data; the well-known algorithm is expectation maximization clustering (EM) [10].
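As a small illustration of the partitioning criterion mentioned above, the toy sketch below (our example, not taken from the paper) fits k-means on random data and reports the sum of squared errors that the partition tries to minimize; the data and the value of k are arbitrary.

```python
# Toy illustration of the partitioning criterion: k-means minimizes the sum of
# squared distances of samples to their nearest centroid (the "inertia").
# The data and the value of k are arbitrary choices for illustration only.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 8))

km = KMeans(n_clusters=5, n_init=10, random_state=1).fit(X)
print(km.inertia_)   # sum of squared errors of the resulting partition
```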

However, conventional data analytics is becoming less suitable for current big data processing. To meet the challenge of IoT big data applications, researchers have proposed new methods and extended standard clustering algorithms to tackle big data that is too large to load entirely into memory for deriving clusters. k-means [11] was extended by Tang in 2012 to hybridize with the particle swarm algorithm, giving CPSO k-means clustering [[12], [13]], in order to enhance the search ability over high-dimensional data. The wave cluster algorithm (WC) [14] was proposed in 2012; it employs the multi-resolution characteristics of the wavelet transformation to recognize data sets at various scales. Segundo proposed an improved clique algorithm [15] in 2013, which combines the advantages of density- and grid-based methods. It divides data not only based on the grid but also takes density into account, partitioning data into 'dense' and 'sparse' sets and focusing on dense grid-cell data. However, the clique algorithm divides each dimension equally according to the user's setting, which may lead to a natural cluster being split into several artificial clusters; in addition, the number of connections grows exponentially and the computational complexity becomes very high for high-dimensional data sets. CURE [16] was proposed to solve the big data problem; it uses a random sampling technique to reduce the computational cost. Fast spectral clustering [17] and fast computation of Gaussian likelihoods [18] also use random sampling to accelerate clustering. Many partition-based clustering algorithms tend to perform well on large data problems. The most well-known clustering algorithm is k-means, which uses the center/mean of a data group, known as the centroid, to represent the entire group. Memberships are assigned between data and their most similar groups iteratively until no further membership relocation is necessary. k-means can quickly converge to a group of clusters with a certain quality of similarity in each cluster within reasonable run time. Although partition-based algorithms perform well on big data problems compared to other clustering algorithms, the computational cost rises steeply as the scale of the data grows. To solve this problem, Sculley, a researcher at Google Inc., proposed the mini-batch k-means (MBK) [19] algorithm, which randomly polls data in small chunks from the big data source as input in each iteration of the clustering process. This way, clusters are progressively and iteratively built up without loading the full big data archive into memory at once. MBK maintains the fast convergence of k-means, but the quality of the clustering results is not maximized. In practice, MBK is a promising tool for solving the big data clustering problem. There is another method for accelerating the convergence of k-means: introducing distributed computation into clustering. Although it partially solves the computational complexity problem, it brings new problems; for example, Kantabutra proposed a parallel k-means [20] based distributed system that spends a lot of resources on information transfer among multiple nodes. As a result, the overall computation cost is not reduced.
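For reference, mini-batch clustering in the spirit of MBK can be exercised with scikit-learn's MiniBatchKMeans, as in the hedged sketch below; the batch size, number of clusters and synthetic data are illustrative assumptions rather than settings from [19].

```python
# Illustrative mini-batch k-means: the data are fed in small chunks instead of
# loading the full archive at once. Parameters are assumptions, not the paper's.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(2)
big_data = rng.normal(size=(100_000, 50))

mbk = MiniBatchKMeans(n_clusters=20, batch_size=1000, random_state=2)

for start in range(0, big_data.shape[0], 1000):
    mbk.partial_fit(big_data[start:start + 1000])   # one mini batch per call

print(mbk.cluster_centers_.shape)
```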

Nevertheless, classical clustering algorithms such as k-means can easily fall into local optima that do not produce the best clustering result. Achieving a globally optimal clustering result requires an exhaustive process in which all partitioning possibilities are tried out, which is computationally prohibitive [12]. The problem of finding the absolute best clusters by k-means, k-centers and FCM is computationally difficult (NP-hard). Therefore, efficient optimization tools have to be used in lieu of brute force to improve the clustering performance. The dynamic group optimization algorithm (DGO), one of the latest optimization algorithms, is used here to enhance the mini-batch partitioned clustering. DGO is inspired by the intra-society and inter-society communications and interactions of animals, including humans [[21], [22]]. Owing to its efficiency, DGO has been successfully applied to engineering problems as well as financial portfolio optimization and hyperparameter optimization. There are three phases of operation in this algorithm: (1) intragroup cooperation, during which the members of a group are guided by their head animal using random-walk mutation operators; (2) intergroup communication, during which the head animal (the group's best solution) communicates with the other heads to obtain global information; and (3) group variation, during which members transfer to other groups based on the performance of each group's search. DGO is half swarming and half evolutionary; it is applicable to the global search phase while also taking local exploitation into account. Until now, the particle swarm optimization algorithm (PSO) [23] and the genetic algorithm (GA) [24] have been adopted to enhance k-means [[12], [13]]. However, GA and PSO mainly focus on global exploration and lack local exploitation capability; their search performance is relatively inefficient compared to DGO, which specializes in both global exploration and local exploitation. Therefore, it is interesting to apply DGO to improve the clustering performance of the new big data clustering algorithm that partitions the input big data into mini batches. DGO assists the partitioned clustering algorithm in avoiding local optima and accelerating convergence, even though the input data arrive in a distributed manner.
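Since this passage only describes DGO's three phases in words, the following schematic sketch shows how they could be organized on a toy minimization problem; the update rules, group sizes and migration policy are our own illustrative assumptions and not the exact formulation of [21], [22].

```python
# Schematic sketch of the three DGO phases on a toy objective. All numeric
# choices and update rules are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(3)

def objective(x):
    return float(np.sum(x ** 2))        # toy minimization target

n_groups, group_size, dim, iters = 4, 10, 5, 100
groups = [rng.uniform(-5, 5, size=(group_size, dim)) for _ in range(n_groups)]

for _ in range(iters):
    heads = []
    # (1) Intragroup cooperation: members take a random-walk step toward
    #     their group head and keep the move only if it improves them.
    for g in groups:
        head = g[int(np.argmin([objective(m) for m in g]))].copy()
        heads.append(head)
        for i in range(len(g)):
            cand = g[i] + rng.normal(scale=0.1, size=dim) + 0.5 * (head - g[i])
            if objective(cand) < objective(g[i]):
                g[i] = cand
    # (2) Intergroup communication: each head moves toward the globally best head.
    best_head = heads[int(np.argmin([objective(h) for h in heads]))]
    for g, head in zip(groups, heads):
        cand = head + 0.5 * (best_head - head)
        if objective(cand) < objective(head):
            g[int(np.argmax([objective(m) for m in g]))] = cand
    # (3) Group variation: a member of the worst group migrates to the best group.
    scores = [np.mean([objective(m) for m in g]) for g in groups]
    worst, best = int(np.argmax(scores)), int(np.argmin(scores))
    if worst != best and len(groups[worst]) > 1:
        j = int(rng.integers(len(groups[worst])))
        groups[best] = np.vstack([groups[best], groups[worst][j]])
        groups[worst] = np.delete(groups[worst], j, axis=0)

best_member = min((m for g in groups for m in g), key=objective)
print(objective(best_member))
```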

An efficient and accurate data fusion method is crucial for an IoT data system. Low-level data fusion can preserve the integrity of the original data, but it needs high computational effort and incurs a high cost for perfect fusion. Higher-level data fusion incurs relatively lower cost, but the amount of data loss may be large. How to balance computational cost and data accuracy is the key to good IoT data analytics performance. Data-level fusion combined with low-level data pre-processing can be an optimal solution: it not only retains the accuracy of the IoT data but also improves the efficiency of the processing. Moreover, the provision of efficient clustering can further enhance data analysis capabilities. Therefore, in this work, we propose a data-level IoT fusion method coupled with high-level precise clustering. The method has two main components: the first is the IoT sensor equipped with PCA capability; the other is the fusion center, which is responsible for associating all the data and performing further clustering. The structure of our proposed method is shown in Fig. 1.

IoT big data is quite challenging for real-time analysis due to its sheer volume and velocity. Instead of loading the full big data into memory in a single shot, new clustering algorithms are needed that assign streaming data to clusters during their formation. As the data continue to arrive, the clusters evolve, iteratively improving their quality by dynamically changing the memberships of the clustered data. The data feeds are fragmented into multiple incoming mini batches from the big data source. This solution is called 'partition-based' clustering, where the input dataset to be clustered is partitioned into small fragments, namely 'mini-batches'. Partition-based clustering operates like k-means: the distances between the mini batches of data and the centers of the current clusters are measured, and the algorithm iteratively decides from these distances which clusters the new data belong to and whether the current memberships of the existing data need to be updated. Partition-based algorithms usually perform well on big data since the clustering process executes in a multi-processing distributed environment. In Hadoop such MapReduce processing is already supported by the framework: files are fragmented and processed in parallel, relieving memory/IO bottlenecks.
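To illustrate how such MapReduce-style partition-based clustering can be expressed, the sketch below shows one assignment-and-update iteration as Hadoop Streaming-style mapper and reducer functions; the centers file, the comma-separated record format and the function names are our own assumptions, not the authors' actual Hadoop job.

```python
# Simplified sketch of one k-means-style MapReduce iteration in the Hadoop
# Streaming idiom: the mapper assigns each point to its nearest current center,
# the reducer averages the points of each cluster into an updated center.
# File name and comma-separated record format are illustrative assumptions.
import sys
import numpy as np

def load_centers(path="current_centers.txt"):
    """Load the assumed k x d matrix of centers from the previous iteration."""
    return np.loadtxt(path, delimiter=",")

def mapper(lines=sys.stdin, centers=None):
    """Emit 'cluster_id<TAB>point' for the nearest current center."""
    centers = load_centers() if centers is None else centers
    for line in lines:
        point = np.array(line.strip().split(","), dtype=float)
        cid = int(np.argmin(np.linalg.norm(centers - point, axis=1)))
        print(f"{cid}\t{','.join(map(str, point))}")

def reducer(lines=sys.stdin):
    """Average the points of each cluster id to produce the updated centers."""
    sums, counts = {}, {}
    for line in lines:
        cid, value = line.rstrip("\n").split("\t")
        point = np.array(value.split(","), dtype=float)
        sums[cid] = sums.get(cid, 0) + point
        counts[cid] = counts.get(cid, 0) + 1
    for cid, total in sums.items():
        print(f"{cid}\t{','.join(map(str, total / counts[cid]))}")
```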

However, if the amount of data is huge, reaching the maximum capacity, very large computational effort is required even when the data are processed in parallel, since parallelism alone does not reduce the total work. Two useful ways to reduce the computational effort have been proposed in the literature. The first is to use small batches of data as input in each iteration, as in the environment described above; this type of partition-based clustering algorithm, modified from k-means, is called mini-batch k-means (MBK). The second is to run the clustering algorithms directly on a distributed system such as Hadoop [25], where the big data are automatically fragmented into default mini batches. In our observation, both approaches have disadvantages. With the first method, the centers of the data in the mini batches may shift without control in each k-means iteration; if the selected data in the batches differ, the final clustering results are affected. The second method simply assigns the whole clustering job to different nodes: although it reduces the runtime, there is hardly any true speed-up, since the overall computational effort remains the same. Moreover, many clustering algorithms fall easily into local optima, so leaving them to run on different Hadoop nodes without control may not be a good idea.

To alleviate the disadvantages of these two methods, we propose a partition-based k-means clustering algorithm optimized by a metaheuristic (PK for short). This algorithm not only adopts the concept of mini-batch clustering operating in a parallel computing environment, it also incorporates metaheuristic optimization that prevents k-means from ending up in a local optimum. Specifically, the partition-based clustering algorithm is first integrated with DGO, empowering the clustering process with optimization ability. The DGO-optimized clustering is called Dynamic Group Clustering (DGC); it adds a layer of logic above the clustering that guides the formation of clusters in such a way that locally optimal solutions are avoided.

DGC works in cooperation with partitioning of the dataset. There are three major steps in the PK data partitioning part. In the first step, PK chooses samples randomly from the input dataset and forms them into small batches. Unlike MBK, each small batch of data is treated in PK as a complete dataset over which clustering is performed. Each mini batch is clustered in parallel until the clustering converges to a terminal condition or an upper limit of iterations is reached. In the second step, all the computed center points from the individual small batches form a new small dataset; this file of center points is loaded into DGO clustering (DGC) to determine the final and optimal k centers. In the third step, each sample of the whole dataset is assigned to the nearest cluster center; from there, the membership of each data point is updated if its distance to the new cluster center is found to be shorter than the distance to its existing cluster center. The boundaries of the clusters, which are represented by the memberships of the data points, are updated accordingly. As PK assigns samples, the centers approach their optimal positions. PK uses small batches to reduce the computation time; meanwhile, the multiple batches can run in parallel in a distributed environment. Using mini batches in clustering and executing them in parallel reduces the computational effort significantly. An overall framework of the PK process flow is shown in Fig. 2.
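The three steps above can be summarized in the following hedged sketch. Per-batch k-means stands in for the parallel per-batch clustering, and plain k-means over the collected batch centers stands in for the DGC refinement, which is not reproduced here; all sizes and parameters are illustrative.

```python
# Hedged sketch of the three PK steps: per-batch clustering, collation of the
# batch centers, and final assignment. Plain k-means over the collected centers
# stands in for the DGC refinement; all sizes/parameters are illustrative.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
data = rng.normal(size=(20_000, 10))
k, batch_size, n_batches = 8, 2_000, 10

# Step 1: cluster each randomly drawn mini batch as if it were a complete dataset.
batch_centers = []
for _ in range(n_batches):
    batch = data[rng.choice(data.shape[0], size=batch_size, replace=False)]
    batch_centers.append(KMeans(n_clusters=k, n_init=5).fit(batch).cluster_centers_)
batch_centers = np.vstack(batch_centers)

# Step 2: the collected batch centers form a small dataset; here ordinary k-means
# plays the role of the DGO-driven DGC step that picks the final k centers.
final_centers = KMeans(n_clusters=k, n_init=10).fit(batch_centers).cluster_centers_

# Step 3: assign every sample of the whole dataset to its nearest final center.
dists = np.linalg.norm(data[:, None, :] - final_centers[None, :, :], axis=2)
labels = np.argmin(dists, axis=1)
print(np.bincount(labels))
```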

We propose a data fusion method and the PK algorithm for solving the IoT big data problem. This is done by associating small data batches with optimized partition-based clustering, all operating in a distributed computing environment. The unique features, as well as the main differences between our proposed model and others, are listed as follows:

(1) Data-level fusion combined with pre-processing preserves the characteristics of the original data and increases the efficiency of data processing.

(2) Compared to MBK, the centers of data clusters using PK will not change tremendously. PK updates the cluster centers by taking the average of the streaming samples over time.

(3) Since small batches are used, the overall computational resources required are reduced.

(4) Computing the clustering tasks in parallel increases the convergence rate.

(5) DGO is used to enhance the clustering ability, giving rise to a new optimized clustering method called DGC.

(6) Real IoT big data is used to test the performance of our proposed model.

This article is structured as follows. In Section 2, a brief explanation of PCA, DGO and clustering is presented. The details of our methodology are given in Section 3. The results of the experimentation and an analysis of the results are presented in Section 4. Finally, conclusions and further research are presented in Section 5.

Background

Principal component analysis (PCA) [7] is a useful compression tool that has been successfully applied in fields such as image recognition and statistics. It is a powerful technique for finding patterns in high-dimensional data. The general principle is this: if all the points are mapped onto the same location, then almost all of the information (such as the distances among points) is lost; but if the variance of the mapping is as large as possible, the data points remain dispersed, so we can retain more
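Written out, the variance-maximization principle described in this (truncated) snippet corresponds to the textbook PCA objective below, a standard formulation not quoted from the paper: find a unit-length projection direction that maximizes the variance of the projected data, where the covariance matrix is estimated from the samples.

```latex
% Standard PCA objective implied by the snippet above (not quoted from the paper).
\max_{\mathbf{w}} \ \mathbf{w}^{\top}\boldsymbol{\Sigma}\,\mathbf{w}
\quad \text{s.t.} \quad \lVert \mathbf{w} \rVert_2 = 1,
\qquad
\boldsymbol{\Sigma} = \frac{1}{n}\sum_{i=1}^{n}(\mathbf{x}_i-\bar{\mathbf{x}})(\mathbf{x}_i-\bar{\mathbf{x}})^{\top}
```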

Methodology

In this section, the idea and details of our proposed method are presented.

Experiment setup

In order to fully evaluate the performance of the proposed data fusion method and PK, two test suites are used. The first test suite consists of four artificially generated big datasets, each containing more than one thousand objects and 50 attributes. These artificial data are generated using the Scikit-learn [32] software package, along the lines sketched below.
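As a rough illustration of how such an artificial clustering dataset can be produced with Scikit-learn, the snippet below generates Gaussian blobs with sizes echoing the first artificial dataset described next; the generator and its parameters are our assumption, not the authors' exact procedure (which mentions uniform sampling).

```python
# Rough illustration of generating an artificial clustering dataset with
# Scikit-learn; sizes echo the first artificial dataset, but the blob generator
# is an assumption and not the authors' exact procedure.
from sklearn.datasets import make_blobs

X, y = make_blobs(
    n_samples=10_000,   # objects
    n_features=100,     # attributes
    centers=20,         # clusters
    random_state=0,
)
print(X.shape, len(set(y)))
```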

Artificial Data Set 1: This dataset contains 10,000 objects with 100 attributes and 20 clusters. Samples were drawn from uniform

Conclusion

The analytics of IoT big data faces a series of problems such as large-scale storage and high computational complexity. Most IoT data are unstructured, large-scale and dynamic. Effective data mining for big data has a profound impact on the progress of IoT technology, especially in real-time applications. In this study, a new IoT data fusion method is proposed. We used data-level fusion with PCA to reduce the cost and keep the original information. It improves the

Acknowledgments

The authors are thankful for the financial support from the research grants: (1) MYRG2016-00069, titled 'Nature-Inspired Computing and Metaheuristics Algorithms for Optimizing Data Mining Performance', offered by RDAO/FST, University of Macau and the Macau SAR government, Macau; (2) FDCT/126/2014/A3, titled 'A Scalable Data Stream Mining Methodology: Stream-based Holistic Analytics and Reasoning in Parallel', offered by FDCT of the Macau SAR government, Macau.

References (39)

  • Zhang, T., et al., BIRCH: A new data clustering algorithm and its applications, Data Min. Knowl. Discov. (1997)

  • Frey, B.J., et al., Clustering by passing messages between data points, Science (2007)

  • Dempster, A.P., et al., Maximum likelihood from incomplete data via the EM algorithm, J. R. Stat. Soc. Ser. B Methodol. (1977)

  • MacQueen, J., Some methods for classification and analysis of multivariate observations, in: Proceedings of the Fifth...

  • Tang, R., et al., Integrating nature-inspired optimization algorithms to K-means clustering

  • Rui, T., et al., Nature-inspired clustering algorithms for web intelligence data

  • Deng, W., Wang, L., Qi, J., An improved support vector machine model based on wave cluster, in: Proc of the 11th...

  • San Segundo, P., et al., An improved bit parallel exact maximum clique algorithm, Optim. Lett. (2013)

  • Guha, S., et al., CURE: an efficient clustering algorithm for large databases

Rui Tang received the B.S. degree in software engineering from Nanchang University, JiangXi, China, in 2010 and the M.S. degree in software engineering from the University of Macau, Taipa, Macau SAR, in 2013. Currently, he is working toward the Ph.D. degree at the University of Macau. His research interests are in metaheuristic algorithms and data mining.

Simon Fong graduated from La Trobe University, Australia, with a first class honors B.Eng. degree in computer systems in 1993 and received the Ph.D. degree in computer science in 1998. He is currently an associate professor in the Computer and Information Science Department, University of Macau. He is also one of the founding members of the Data Analytics and Collaborative Computing Research Group in the Faculty of Science and Technology. Before joining the University of Macau, he was an assistant professor in the School of Computer Engineering, Nanyang Technological University, Singapore. Prior to his academic career, he held various managerial and technical posts, such as systems engineer, IT consultant, and e-commerce director, in Melbourne, Hong Kong and Singapore. Companies he previously worked for include Hong Kong Telecom, Singapore Network Services, AES Pro-Data and United Overseas Bank, Singapore. He has published more than 286 international conference and peer-reviewed journal papers, mostly in the areas of e-commerce technology, business intelligence, and data mining.
