Elsevier

Information Sciences

Volume 479, April 2019, Pages 515-525
Secure weighted possibilistic c-means algorithm on cloud for clustering big data

https://doi.org/10.1016/j.ins.2018.02.013

Abstract

The weighted possibilistic c-means algorithm is an important soft clustering technique for big data analytics with cloud computing. However, private data is disclosed when the raw data is uploaded directly to the cloud for efficient clustering. In this paper, a secure weighted possibilistic c-means algorithm based on the BGV encryption scheme is proposed for big data clustering on the cloud. Specifically, BGV is used to encrypt the raw data to preserve privacy on the cloud. Furthermore, Taylor's theorem is used to approximate the functions for calculating the weight of each object and for updating the membership matrix and the cluster centers with polynomial functions that include only addition and multiplication operations, so that the weighted possibilistic c-means algorithm can be performed securely and correctly on the encrypted data in the cloud. Finally, the presented scheme is evaluated on two big datasets, i.e., eGSAD and sWSN, by comparing it with the traditional weighted possibilistic c-means method in terms of effectiveness, efficiency and scalability. The results show that the presented scheme performs more efficiently than the traditional weighted possibilistic c-means algorithm and achieves good scalability on the cloud for big data clustering.

Introduction

Recent years have witnessed considerable development in the Internet of Things with the rapid proliferation of mobile devices and sensing techniques [1], [6], [26]. Specifically, the Internet of Things is widely used in smart cities, intelligent transportation and industrial manufacturing [7]. A typical Internet of Things system usually consists of three layers from bottom to top, i.e., the physical layer, the network layer and the application layer, as presented in Fig. 1.

In Internet of Things systems, the physical layer uses sensing devices such as sensors, RFID and two-dimensional codes to collect data, and the collected data is then transmitted to the application layer through the network layer. A dedicated network is usually combined with the Internet to achieve real-time and dependable transmission of the collected data. In the application layer, the collected data is typically analyzed using cloud computing techniques to provide predictive services and intelligent decisions [4]. Therefore, data analytics plays an important role in enabling the Internet of Things to offer various services [10], [20], [21].

Clustering, as a crucial and challenging technique for data analytics, partitions objects into different groups based on some similarity metric so that the objects in the same group are more similar to each other than to those in different groups [25]. In the past few decades, many clustering algorithms have been developed, which can be roughly grouped into two categories, i.e., hard clustering and soft clustering [9], [12]. Typical hard clustering algorithms include k-means and affinity propagation, while representative soft clustering algorithms include the fuzzy c-means algorithm and the possibilistic c-means algorithm. In hard clustering, each object is assigned to only one cluster, while in soft clustering each object is assigned to multiple clusters with different memberships. As a representative soft clustering algorithm, possibilistic c-means (PCM) has been successfully utilized in fault diagnosis, nonlinear system identification and incomplete data clustering [11]. In particular, PCM is viewed as a potential technique for big data analytics. However, the traditional PCM algorithm cannot obtain desirable clustering results for datasets that include noisy objects. To tackle this problem, Schneider proposed a weighted possibilistic c-means algorithm (WPCM) to minimize the negative effect of noisy objects by assigning a small weight to each noisy object [16]. Generally, WPCM can yield significantly more accurate clustering results than PCM for datasets that include noisy objects.

Currently, big data collected from the Internet of Things is posing a novel challenge to WPCM [17]. Big data is typically defined by four characteristics, i.e., large volume, large variety, large value and large velocity. Large volume is the dominant characteristic of big data, implying that there are a large number of objects in a big data set. Large variety indicates the different types of data, including structured, semi-structured and unstructured data, while the third characteristic refers to the hidden valuable information in big data. Moreover, big data is generated continually and quickly, and it needs to be processed in real time. It is difficult for WPCM to cluster big data with a large number of objects efficiently since WPCM has a high computational complexity [22]. Furthermore, WPCM needs to load all the objects into memory for big data clustering. In some cases, the memory space of the computing devices is limited, causing WPCM to fail on big data clustering. Although some improved WPCM algorithms such as online WPCM and incremental WPCM have been proposed for big data clustering [9], they always produce lower clustering accuracy than the conventional WPCM algorithm. Cloud computing, as an emerging computing paradigm, offers a scalable and cost-efficient solution for big data analytics by providing tremendous memory space and strong computing power [2]. Cloud computing has been applied successfully in mobile crowdsourcing, scientific computing and machine learning [13], [14], [23]. Based on cloud computing, a distributed weighted possibilistic c-means algorithm was developed to cluster big data efficiently by uploading objects to the cloud [22]. However, private data is disclosed when the raw data is uploaded to the cloud directly, posing a serious threat to personal security. In particular, big data usually includes private information such as personal medical records and bank accounts. Once such data is leaked, personal life and property will be threatened.

In this paper, a secure weighted possibilistic c-means algorithm (SWPCM) on cloud is presented for efficient big data clustering. To prevent the disclosure of private data, BGV is utilized to encrypt the raw objects before they are uploaded to the cloud. BGV is one of the most efficient fully homomorphic encryption schemes and it has been applied successfully in cloud computing and deep computation models [5]. However, BGV does not support the exponential and division operations that appear in the functions of the WPCM algorithm for calculating the weight values and updating the membership matrix and the clustering centers. Therefore, Taylor's theorem is employed to approximate these functions with polynomial functions that include only addition and multiplication operations, so that the proposed SWPCM algorithm can yield the correct clustering result on the data encrypted with BGV. Finally, the proposed algorithm is evaluated on two representative datasets, i.e., eGSAD and sWSN [22], by comparing it with the traditional weighted possibilistic c-means algorithm in terms of effectiveness, efficiency and scalability.
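The idea of replacing exponential and division operations with polynomials can be illustrated with a minimal sketch. The exact expansions used by the paper are not shown in this excerpt, so the two functions below (`exp_neg_taylor` and `reciprocal_taylor`, both hypothetical names) are only generic Taylor polynomials about 0, evaluated in plain floating point rather than on BGV ciphertexts. The coefficients are plaintext constants computed ahead of time, so the data-dependent work consists of additions and multiplications only.

```python
import math


def exp_neg_taylor(t, order=8):
    """Approximate exp(-t) with a degree-`order` Taylor polynomial about 0."""
    # Plaintext coefficients (-1)^k / k!, precomputed in the clear.
    coeffs = [(-1) ** k / math.factorial(k) for k in range(order + 1)]
    # Horner evaluation: only multiplications and additions involve t.
    acc = coeffs[-1]
    for c in reversed(coeffs[:-1]):
        acc = acc * t + c
    return acc


def reciprocal_taylor(t, order=8):
    """Approximate 1 / (1 + t) by sum_{k=0}^{order} (-t)^k, valid for |t| < 1."""
    # Horner form of the alternating geometric series: p <- 1 - t * p.
    acc = 1.0
    for _ in range(order):
        acc = 1.0 - t * acc
    return acc


if __name__ == "__main__":
    t = 0.5
    print(exp_neg_taylor(t), math.exp(-t))
    print(reciprocal_taylor(t), 1.0 / (1.0 + t))
```

The reciprocal series converges only for |t| < 1, so inputs would have to be scaled into that range, and a higher order trades more multiplications for a smaller truncation error. A real deployment would also need an integer or fixed-point encoding of the values, which this sketch ignores.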

In summary, the presented scheme makes three contributions, as listed below.

  • A secure weighted possibilistic c-means algorithm based on BGV is presented for big data clustering on cloud. To protect the private data, the BGV encryption technique is employed to encrypt the data, and the weighted possibilistic c-means algorithm is then run on the encrypted data to prevent the disclosure of the raw data on the cloud.

  • BGV cannot directly support division and exponential operations on encrypted data. To obtain correct clustering results on the encrypted data, Taylor's theorem is employed to approximate the functions for computing the weight values and updating the membership matrix and the clustering centers with polynomial functions, which removes the division and exponential operations.

  • Extensive experiments are conducted to evaluate the presented SWPCM algorithm by comparing it with the traditional WPCM algorithm in terms of efficiency, effectiveness and scalability. The results show that the presented scheme performs more efficiently than the traditional weighted possibilistic c-means algorithm and achieves good scalability on the cloud for big data clustering.

The paper is organized as follows. The possibilistic c-means algorithm and the related work are reviewed in Section 2 and the proposed algorithm is illustrated in Section 3. The experimental results are shown in Section 4 and the paper is concluded in the last section.

Section snippets

Possibilistic c-means clustering algorithm

The possibilistic c-means algorithm (PCM) is a typical soft clustering technique that was developed by Krishnapuram and Keller [11]. Given a dataset $X=\{x_1, x_2, \ldots, x_n\}$ with $n$ objects, each object with $m$ attributes, PCM is defined by a $c \times n$ membership matrix $U=\{u_{ij} \mid 1 \le i \le c;\ 1 \le j \le n\}$ in which $u_{ij}$ denotes the membership of $x_j$ towards the $i$th clustering center and $c$ denotes the number of clustering centers denoted by $V=\{v_1, v_2, \ldots, v_c\}$. Therefore, PCM aims to calculate the membership matrix $U$ and the
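The snippet above breaks off before the update equations. As a hedged illustration, the sketch below implements one iteration of the standard PCM updates in the form given by Krishnapuram and Keller, which may differ in notation from the equations in the full paper. The names `eta` (one positive scale parameter per cluster) and `fuzzifier` (the fuzzification exponent, named this way because the text already uses m for the number of attributes) are assumptions made for this example.

```python
def squared_distance(x, v):
    """Squared Euclidean distance between an object x and a center v."""
    return sum((a - b) ** 2 for a, b in zip(x, v))


def pcm_memberships(X, V, eta, fuzzifier=2.0):
    """Return the c x n matrix U with
    u_ij = 1 / (1 + (d_ij^2 / eta_i) ** (1 / (fuzzifier - 1)))."""
    power = 1.0 / (fuzzifier - 1.0)
    return [
        [1.0 / (1.0 + (squared_distance(x, v) / eta_i) ** power) for x in X]
        for v, eta_i in zip(V, eta)
    ]


def pcm_centers(X, U, fuzzifier=2.0):
    """Return centers v_i = sum_j u_ij^fuzzifier * x_j / sum_j u_ij^fuzzifier."""
    dims = len(X[0])
    centers = []
    for row in U:
        weights = [u ** fuzzifier for u in row]
        total = sum(weights)
        centers.append(
            [sum(w * x[d] for w, x in zip(weights, X)) / total for d in range(dims)]
        )
    return centers


if __name__ == "__main__":
    X = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
    V = [[0.0, 0.0], [5.0, 5.0]]
    U = pcm_memberships(X, V, eta=[1.0, 1.0])
    print(U)
    print(pcm_centers(X, U))
```

Both updates contain a division, and the membership update contains a power, which is why a scheme that supports only addition and multiplication on ciphertexts cannot evaluate them directly.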

Secure weighted possibilistic c-means algorithm

This section describes the proposed secure weighted possibilistic c-means algorithm based on BGV for big data clustering on cloud. BGV is a fully homomorphic encryption scheme that can perform addition and multiplication operations on encrypted data securely and correctly. As one of the most efficient fully homomorphic encryption schemes, BGV has been successfully employed in cloud computing and deep computation models. However, BGV does not directly support division and exponential

Experiments

In the experiments, the performance of the presented secure weighted possibilistic c-means scheme (SWPCM) based on BGV encryption is evaluated on a cloud platform that includes 20 personal computers, each with a 3.2 GHz Core i7 CPU and 4 GB of memory. Specifically, the SWPCM scheme is compared with the traditional weighted possibilistic c-means method (WPCM) in three respects, i.e., efficiency, effectiveness and scalability. In this paper, C* and ARI(U,U*) are used to compare the clustering
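The snippet is cut off before the metrics are defined. Assuming ARI(U,U*) denotes the adjusted Rand index between a computed partition and a reference partition, the sketch below shows how such a score could be computed. The hardening step (assigning each object to its highest-membership cluster) and the function names `harden` and `adjusted_rand_index` are assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter
from math import comb


def harden(U):
    """Turn a c x n membership matrix into n hard labels by taking,
    for each object (column), the cluster with the largest membership."""
    n = len(U[0])
    return [max(range(len(U)), key=lambda i: U[i][j]) for j in range(n)]


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two hard labelings of the same objects."""
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to disagree on
    # Pair counts from the contingency table and from its margins.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings use a single cluster
    return (sum_cells - expected) / (max_index - expected)


if __name__ == "__main__":
    U = [[0.9, 0.8, 0.1, 0.2], [0.1, 0.3, 0.7, 0.9]]
    reference = [0, 0, 1, 1]
    print(adjusted_rand_index(harden(U), reference))
```

The score is 1 when the two partitions agree up to a relabeling of clusters and is close to 0 for partitions that agree only by chance.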

Conclusion

In this paper, a secure weighted possibilistic c-means algorithm based on the BGV encryption scheme is presented for big data clustering on cloud. One property of the presented scheme is that it combines fully homomorphic encryption with cloud computing to improve the clustering efficiency for big data without disclosing the private data. The key idea is to approximate the functions for calculating the weight values and updating the membership matrix and the clustering

References (26)

  • H. Li et al. Mobile crowdsensing in software defined opportunistic networks. IEEE Commun. Mag. (2017)
  • Q. Zhang et al. An improved deep computation model based on canonical polyadic decomposition. IEEE Trans. Syst. Man Cybern. (2017)
  • A. Al-Fuqaha et al. Internet of Things: a survey on enabling technologies, protocols, and applications. IEEE Commun. Surv. Tutorials (2015)
  • M. Armbrust et al. A view of cloud computing. Commun. ACM (2010)
  • M. Barni et al. Comments on a possibilistic approach to clustering. IEEE Trans. Fuzzy Syst. (1996)
  • M.Z.A. Bhuiyan et al. E-sampling: event-sensitive autonomous adaptive sensing and low-cost monitoring in networked sensing systems. ACM Trans. Auton. Adapt. Syst. (2017)
  • Z. Brakerski et al. (Leveled) fully homomorphic encryption without bootstrapping. Proceedings of ACM Innovations in Theoretical Computer Science Conference (2012)
  • X. Deng et al. Confident information coverage hole healing in hybrid industrial wireless sensor networks. IEEE Trans. Ind. Inf. (2017)
  • X. Deng et al. Healing multi-modal confident information coverage holes in NB-IoT-enabled networks. IEEE IoT J. (2017)
  • M. Filippone et al. Applying the possibilistic c-means algorithm in kernel-induced spaces. IEEE Trans. Fuzzy Syst. (2010)
  • T. Havens et al. Fuzzy c-means algorithms for very large data. IEEE Trans. Fuzzy Syst. (2012)
  • H. Hui et al. Hybrid DVFS scheduling for real-time systems based on reinforcement learning. Proceedings of IEEE International Conference on Green Computing and Communications (2017)
  • R. Krishnapuram et al. A possibilistic approach to clustering. IEEE Trans. Fuzzy Syst. (1993)