Elsevier

Information Sciences

Volume 546, 6 February 2021, Pages 1-24

DI-Mondrian: Distributed improved Mondrian for satisfaction of the L-diversity privacy model using Apache Spark

https://doi.org/10.1016/j.ins.2020.07.066

Abstract

To extract useful patterns, collected data must be distributed to and shared with analysts. Sharing, however, puts the privacy and identity of the individuals described by the data at risk. In this paper, the Mondrian multidimensional anonymization method is extended and improved to satisfy the l-diversity privacy model, and it is implemented in a distributed fashion on the Apache Spark framework. Because a major challenge in data privacy is the tradeoff between privacy and data utility, the presented method focuses on information loss and classifier evaluation criteria. The cut dimension is selected using the coefficient of variation and information gain criteria, and the cut points are chosen dynamically, which decreases information loss and improves classifier performance measures such as accuracy and F-measure compared with previous algorithms in the literature. Because Spark can process data up to 100 times faster than the Hadoop framework, the proposed method is implemented with RDD programming on Apache Spark. This addresses the speed problem that large-scale data anonymization faces in earlier Hadoop-based algorithms. Experiments on numerical datasets demonstrate the improvements made by the proposed method.

Introduction

Preserving privacy is a challenge that people face in their everyday lives, and it has motivated a large body of computer science research on information privacy preservation [1]. Although privacy can be pursued by eliminating some attributes or altering the data, records in the dataset can still be re-identified, and an individual’s identity recognized, through further analysis of the dataset [2], [3].

Organizations such as hospitals, universities, telecommunications centers, and social networks share their data on the web with the research community, or make them available to other organizations so that the data can be analyzed and subtle patterns extracted. Data sharing and publishing, however, may breach an individual’s privacy. Methods are therefore needed that publish and analyze data without revealing an individual’s sensitive information. In general, access restriction, encryption, anonymization, and noise-based methods are employed for privacy preservation. With access restriction, data owners can prevent organizations and web pages from collecting their sensitive information, and data administrators block unauthorized access to sensitive information by setting access control policies in databases. Encryption methods are applied to store confidential data in environments such as clouds. Anonymization disassociates each data record from its owner, preventing the owner from being re-identified. Noise-based methods, one of the most common of which is differential privacy, provide privacy by adding noise to the data and otherwise modifying them [4], [5].

Anonymization, the approach adopted in the present paper, is generally carried out with a generalization, suppression, permutation, anatomization, or perturbation operator, each of which changes the data values so that the published data satisfy a required privacy model such as k-anonymity [6], l-diversity [7], or t-closeness [8]. The generalization and suppression operators replace the values of the quasi-identifier (QID) attributes with more general values. The permutation and anatomization operators disassociate the QID attributes from the sensitive attributes (SA) by shuffling and grouping the data in each equivalence class. The perturbation operator changes the data values through noise addition, swapping, and aggregation [9], [10].

Generalization is the most common and most powerful operator used in anonymization. Various generalization schemes are available, including full-domain [11], local recoding [12], sub-tree [13], and multidimensional [14] schemes, and different algorithms have been developed on the basis of each. This paper uses multidimensional generalization, which involves lower data distortion than the other schemes, together with the l-diversity privacy model, which offers a higher level of privacy than the k-anonymity model and resists both attribute linkage and record linkage attacks. The k-anonymity model requires that records with the same QID values be grouped together, with a minimum of k and a maximum of 2k-1 records per group. The l-diversity model additionally requires that each group formed under k-anonymity contain at least l distinct values of the sensitive attribute [10], [15].
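The two group-level conditions described above can be checked with a few lines of code. The following is a minimal, single-machine sketch and not the paper's implementation: the record layout (dictionaries), the attribute names, and the helper names `group_by_qid`, `is_k_anonymous`, and `is_l_diverse` are all assumptions made for illustration. It checks only the lower bounds, that is, at least k records and at least l distinct sensitive values per group.

```python
from collections import defaultdict


def group_by_qid(records, qid_keys):
    """Group records (dicts) into equivalence classes by their QID values."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[key] for key in qid_keys)].append(rec)
    return groups


def is_k_anonymous(records, qid_keys, k):
    """True if every equivalence class holds at least k records."""
    groups = group_by_qid(records, qid_keys)
    return all(len(group) >= k for group in groups.values())


def is_l_diverse(records, qid_keys, sensitive_key, l_value):
    """True if every equivalence class has at least l_value distinct sensitive values."""
    groups = group_by_qid(records, qid_keys)
    return all(
        len({rec[sensitive_key] for rec in group}) >= l_value
        for group in groups.values()
    )


# Hypothetical, already-generalized records (illustrative values only).
records = [
    {"age": "20-29", "zip": "476**", "disease": "flu"},
    {"age": "20-29", "zip": "476**", "disease": "cancer"},
    {"age": "30-39", "zip": "479**", "disease": "flu"},
    {"age": "30-39", "zip": "479**", "disease": "flu"},
]

print(is_k_anonymous(records, ["age", "zip"], 2))           # True
print(is_l_diverse(records, ["age", "zip"], "disease", 2))  # False
```

In this toy table both groups contain two records, so it is 2-anonymous, but the second group contains a single disease value, so it is not 2-diverse. This is the attribute linkage weakness that l-diversity is meant to close.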

Data volumes are growing every day and have reached the gigabyte, terabyte, and even zettabyte scales, so data are stored in distributed environments such as clouds [16] and NoSQL databases [17]. Because time is the most important factor in sharing and publishing information in network environments, centralized methods are not applicable to large-scale data: they perform poorly in terms of time complexity, scalability, and performance [1], [5]. Algorithms are therefore required that publish data in a privacy-preserving manner on distributed platforms. Apache Spark [18] is a fast, parallel, scalable engine for processing large-scale data. Its main characteristic is in-memory storage and computation based on resilient distributed datasets (RDDs), which makes distributed processing faster than on other platforms [18], [19], [20]. Apache Spark can therefore be considered a suitable platform for anonymizing large-scale data and providing different privacy models.

The Mondrian multidimensional anonymization method [14] was first presented in 2006 as a centralized approach for satisfying the k-anonymity privacy model. Zakerzadeh et al. [21] later reduced the processing time of the Mondrian approach by implementing it on the distributed MapReduce framework. Since then, a considerable amount of work, reviewed in Section 2, has addressed the k-anonymity privacy model on the MapReduce processing platform. Most previous works have focused on preserving k-anonymity [6] using the MapReduce processing model. To the best of our knowledge, however, no previous work has implemented the Mondrian method on the more efficient distributed platform of Apache Spark. The current study fills this gap by proposing DI-Mondrian, an RDD-based implementation of the Mondrian method on Apache Spark. The goal is to reduce the information loss and execution time of the Mondrian algorithm.
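For orientation, the baseline idea behind Mondrian-style partitioning can be sketched as a recursive median split. This is a simplified, centralized illustration under stated assumptions and not DI-Mondrian itself: records are tuples of numeric QID values, the cut dimension is the one with the widest raw value range (a simplification of the range-based heuristic in [14]), and the function name `mondrian_partition` is hypothetical.

```python
def mondrian_partition(records, qid_indices, k):
    """Recursively split records into partitions that each hold at least k records.

    records: non-empty list of tuples of numeric values
    qid_indices: positions of the quasi-identifier attributes in each tuple
    k: minimum partition size (k >= 1)
    """
    # Try candidate dimensions from the widest to the narrowest value range.
    def value_range(dim):
        values = [rec[dim] for rec in records]
        return max(values) - min(values)

    for dim in sorted(qid_indices, key=value_range, reverse=True):
        ordered = sorted(records, key=lambda rec: rec[dim])
        median = ordered[len(ordered) // 2][dim]
        left = [rec for rec in ordered if rec[dim] < median]
        right = [rec for rec in ordered if rec[dim] >= median]
        # A cut is allowed only if both halves still satisfy the size bound.
        if len(left) >= k and len(right) >= k:
            return (mondrian_partition(left, qid_indices, k)
                    + mondrian_partition(right, qid_indices, k))
    # No allowable cut on any dimension: this partition is final.
    return [records]


# Hypothetical one-attribute dataset (ages), k = 2.
ages = [(25,), (27,), (31,), (35,), (42,), (48,), (51,), (60,)]
partitions = mondrian_partition(ages, [0], 2)
for part in partitions:
    print(part)
```

Each final partition would then be generalized, for example by replacing every age in it with the partition's minimum-to-maximum range. The contributions listed below change how the cut dimension and cut points are chosen and add the l-diversity requirement.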

In summary, we will make the following contributions:

  1. We propose “DI-Mondrian”, a novel implementation of the Mondrian multidimensional anonymization method on the Apache Spark platform. The implementation uses RDD programming and satisfies the l-diversity privacy model. Our results show that DI-Mondrian outperforms state-of-the-art MapReduce-based approaches in terms of scalability and speed of the data anonymization process.

  2. We show that DI-Mondrian incurs less information loss than state-of-the-art methods by selecting better cut points and creating balanced partitions. It does so by using the coefficient of variation and information gain metrics to select the cut dimension, thereby taking into account the distribution of all the data for an attribute.

  3. We propose a novel dynamic method for selecting cut points that leads to less information loss and therefore better data utility (such as accuracy and F-measure in data classification). This in turn leads to a better tradeoff between privacy and utility in the anonymized data.
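To make the coefficient-of-variation criterion concrete, the sketch below computes CV = σ/μ for each numeric QID attribute and picks the attribute with the largest value. The selection rule shown here (largest CV wins) is an assumption for illustration; this excerpt does not spell out how the paper combines CV with information gain, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Population standard deviation divided by the absolute mean (0.0 if the mean is 0)."""
    mu = mean(values)
    if mu == 0:
        return 0.0
    return pstdev(values) / abs(mu)


def choose_cut_dimension(records, qid_indices):
    """Return the QID index whose values have the largest coefficient of variation."""
    return max(
        qid_indices,
        key=lambda dim: coefficient_of_variation([rec[dim] for rec in records]),
    )


# Hypothetical records with two numeric QID attributes.
records = [(20, 100), (40, 101), (60, 102), (80, 103)]

print(round(coefficient_of_variation([rec[0] for rec in records]), 4))  # 0.4472
print(choose_cut_dimension(records, [0, 1]))                            # 0
```

Unlike a plain max-minus-min range, the CV uses every value of the attribute and is scale-free, which matches the stated aim of taking the whole distribution of an attribute into account.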

The remainder of the paper is organized as follows. Section 2 reviews related work on privacy preservation in data publishing. Section 3 introduces the relevant concepts of the Apache Spark framework. Section 4 details the proposed method. Section 5 presents the experimental results. Section 6 concludes and suggests future work.

Related works

Privacy preservation in data publishing is an area that has recently received particular attention in the research community [10], [22]. Before data are published, an anonymization method must be applied so that they satisfy one of the privacy models, such as k-anonymity [6], l-diversity [7], t-closeness [8], m-invariance [23], or δ-presence [24]. Each privacy model resists certain attacks and prevents information from being disclosed. The k-anonymity and l

Basics of Spark

Because of its characteristics, big data requires new algorithms, architectures, models, platforms, and analytic solutions to manage it. A large number of platforms have therefore been released for big data management. A leading organization in this regard is the Apache Software Foundation, which has released platforms such as Hadoop and Spark [38]. Apache Spark is a fast, open-source, massively parallel, scalable, general-purpose engine for large-scale data processing. One of the capabilities of Apache Spark is in-memory

Steps of running the proposed approach

Fig. 4 shows the steps of the proposed method for achieving the l-diversity privacy model. Each step indicated with a blue rectangle creates an RDD, using the transformations and actions of the Apache Spark library.

  • RDD1 creation. The dataset is read from HDFS and distributed across the worker nodes of the computational cluster as RDD1.

  • Preprocessing. This step includes removing the duplicate records from RDD1, establishing an RDD for the generic

Evaluation of the proposed method

To evaluate the proposed method, two well-known datasets from the UCI machine learning repository [41] are used.

  • Poker hand

This dataset is used for classification, and contains 1,025,010 data records with 11 numerical attributes. In the experiments, the first 10 attributes are used as quasi-identifier attributes (QID), and the 11th attribute (the class label) is used as the sensitive attribute. For each pair of quasi-identifier attributes, the first attribute is the suit of the

Conclusion and future works

In this research, a distributed method called “DI-Mondrian” was presented, based on the Apache Spark framework and using RDDs, for achieving the l-diversity privacy model on datasets. The method is an improved version of Mondrian in which the CV and InfoGain criteria are used to select the cut dimension and r dynamically chosen points serve as the cut points. Several experiments were conducted with various values of the parameters k, l, and r over the Poker Hand and

CRediT authorship contribution statement

Farough Ashkouti: Conceptualization, Methodology, Software, Data curation, Validation, Writing - original draft. Keyhan Khamforoosh: Conceptualization, Resources, Supervision, Project administration, Validation, Writing - review & editing. Amir Sheikhahmadi: Conceptualization, Validation, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (41)

  • L. Sweeney, k-anonymity: a model for protecting privacy, Int. J. Uncertainty, Fuzziness Knowledge-Based Syst. (2002)
  • A. Machanavajjhala, D. Kifer, J. Gehrke, M. Venkitasubramaniam, l-diversity: Privacy beyond k-anonymity, ACM...
  • L. Ninghui et al., t-Closeness: privacy beyond k-anonymity and ℓ-diversity
  • B. Fung et al., Privacy-preserving data publishing: a survey of recent developments, ACM Comput. Surv. (2010)
  • K. LeFevre, D. J. DeWitt, R. Ramakrishnan, Incognito: efficient full-domain K-anonymity, SIGMOD ’05 Proc....
  • J. Xu et al., Utility-based anonymization for privacy preservation with less information loss, ACM SIGKDD Explor. Newsl. (2006)
  • B.C.M. Fung et al., Anonymizing classification data for privacy preservation, IEEE Trans. Knowl. Data Eng. (2007)
  • K. LeFevre et al., Mondrian multidimensional K-anonymity, Proc. – Int. Conf. Data Eng. (2006)
  • A. Meier et al., “Nosql databases”, in SQL & (2019)
  • [Online]. Available:...