DI-Mondrian: Distributed improved Mondrian for satisfaction of the L-diversity privacy model using Apache Spark
Introduction
Preserving privacy is a challenge that people face in their everyday lives, and it has motivated a large number of studies in the area of information privacy preservation [1]. Although some privacy can be established by eliminating certain attributes or altering the data, the records in a dataset can still be re-identified, and an individual’s identity can be recognized through further analysis of the dataset [2], [3].
Organizations such as hospitals, universities, telecommunications centers, and social networks share their data on the web with the research community, or make them available to other organizations so that the data can be analyzed and subtle patterns extracted. Data sharing and publishing, however, may breach an individual’s privacy. Methods are therefore needed that publish and analyze data without revealing an individual’s sensitive information. In general, access restriction, encryption, anonymization, and noise-based methods are employed for privacy preservation. With access restriction, data owners can prevent organizations and web pages from collecting their sensitive information, and data administrators restrict unauthorized access to sensitive information by setting access control policies in databases. Encryption methods are applied to store confidential data in environments such as clouds. Anonymization disassociates each data record from its owner, preventing the owner from being re-identified. Noise-based methods, among the most common of which is Differential Privacy, satisfy privacy by adding noise to the data and thereby changing them [4], [5].
Anonymization, the approach adopted in the present paper, is generally carried out with a generalization, suppression, permutation, anatomization, or perturbation operator. The idea is to change the data values so that the published data satisfy a required privacy model such as k-anonymity [6], l-diversity [7], or t-closeness [8]. In the generalization and suppression operators, the values of the quasi-identifier (QID) attributes are replaced by more general values. In the permutation and anatomization operators, the QID attributes and the sensitive attributes (SA) are disassociated as the data in the equivalence class are shuffled and grouped. The perturbation operator changes the data values through noise addition, swapping, and aggregation operations [9], [10].
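To make the generalization and suppression operators concrete, the following is a minimal sketch in Python. The helper names (`generalize_age`, `suppress_zip`), the interval width, and the toy records are assumptions made for illustration only; they do not come from the paper.

```python
def generalize_age(age, width=10):
    """Generalization: replace an exact age with the interval that contains it."""
    lo = (age // width) * width
    return f"[{lo}-{lo + width - 1}]"


def suppress_zip(zip_code, keep=3):
    """Suppression: keep the first `keep` characters and mask the rest with '*'."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)


# Hypothetical records: (age, zip code, sensitive attribute).
records = [(34, "47677", "flu"), (37, "47602", "cancer")]

# After the operators are applied, both records share the same QID values.
anonymized = [(generalize_age(a), suppress_zip(z), s) for a, z, s in records]
```

In this toy case the two records become indistinguishable on their quasi-identifiers while the sensitive attribute is left unchanged.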
The generalization operator is the most common and most powerful operator used in anonymization. Various methods are available for data generalization, including the full-domain [11], local recoding [12], sub-tree schema [13], and multidimensional [14] methods, and different algorithms have been developed on the basis of each. This paper uses the multidimensional method, which involves lower data distortion than the other methods, together with the l-diversity privacy model, which offers a higher level of privacy than the k-anonymity model and is resistant to attribute linkage and record linkage attacks. Under the k-anonymity model, a minimum of k and a maximum of 2k-1 records with the same QID attributes are grouped together. l-diversity requires that at least l distinct values of the sensitive attribute be present in each group formed under the k-anonymity model [10], [15].
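The two privacy models described above can be checked on an already generalized table with a few lines of Python. This is a minimal sketch: the function names and the toy table are hypothetical, it tests only the lower bound k (not the upper bound of 2k-1), and it uses the distinct-values form of l-diversity stated in the text.

```python
from collections import defaultdict


def equivalence_classes(records, qid_idx, sa_idx):
    """Group sensitive values by the tuple of quasi-identifier values."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[i] for i in qid_idx)
        groups[key].append(rec[sa_idx])
    return groups


def is_k_anonymous(records, qid_idx, sa_idx, k):
    """True if every equivalence class holds at least k records."""
    groups = equivalence_classes(records, qid_idx, sa_idx)
    return all(len(values) >= k for values in groups.values())


def is_l_diverse(records, qid_idx, sa_idx, l):
    """True if every equivalence class holds at least l distinct sensitive values."""
    groups = equivalence_classes(records, qid_idx, sa_idx)
    return all(len(set(values)) >= l for values in groups.values())


# Hypothetical generalized table: (age range, zip prefix, disease).
table = [
    ("[30-39]", "476**", "flu"),
    ("[30-39]", "476**", "cancer"),
    ("[40-49]", "479**", "flu"),
    ("[40-49]", "479**", "flu"),
]
```

The toy table is 2-anonymous, but its second group contains a single disease value, so it is not 2-diverse. This is the kind of attribute disclosure that l-diversity is meant to prevent.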
Data size is increasing every day and has reached the gigabyte, terabyte, and zettabyte scales, so data are stored in distributed environments such as clouds [16] and NoSQL databases [17]. Since time is the most important factor in sharing and publishing information in network environments, centralized methods are not applicable to large-scale data: they perform poorly in terms of time complexity, scalability, and performance [1], [5]. Algorithms are therefore required that publish data confidentially over a distributed platform. Apache Spark [18] is a fast, parallel, scalable engine for processing large-scale data. It is characterized mainly by in-memory storage and computation based on resilient distributed datasets (RDDs), which has resulted in faster distributed processing than on other platforms [18], [19], [20]. Apache Spark can therefore be considered a beneficial platform for anonymizing large-scale data and providing different privacy models.
Mondrian's multidimensional anonymization method [14] was first presented in 2006 as a centralized approach that satisfies the k-anonymity privacy model. Recently, Zakerzadeh et al. [21] attempted to reduce the processing time of Mondrian’s approach by implementing it on the distributed MapReduce framework. Since then, a considerable amount of work has addressed the k-anonymity privacy model on the MapReduce processing platform; this work is reviewed in Section 2. Most previous works have focused on preserving k-anonymity [6] with the MapReduce processing model. To the best of our knowledge, however, no previous work has implemented Mondrian’s method on the more efficient distributed platform of Apache Spark. The current study fills this gap by proposing DI-Mondrian, an RDD-based implementation of Mondrian’s method on Apache Spark. The goal is to reduce the information loss and execution time of the Mondrian algorithm.
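For readers unfamiliar with multidimensional partitioning, the sketch below illustrates a centralized median-cut recursion in the spirit of Mondrian. It is not the DI-Mondrian algorithm and not a faithful copy of the original method: the choice of the widest-range attribute, the function names, and the toy data are assumptions of this sketch, and only numerical quasi-identifiers are handled.

```python
def value_range(records, dim):
    """Width of the value range of attribute `dim` within a partition."""
    values = [r[dim] for r in records]
    return max(values) - min(values)


def mondrian(records, qid_idx, k):
    """Recursively split `records` at the median until no allowable cut remains."""
    # Try the attributes from the widest to the narrowest value range.
    dims = sorted(qid_idx, key=lambda d: value_range(records, d), reverse=True)
    for d in dims:
        values = sorted(r[d] for r in records)
        median = values[len(values) // 2]
        left = [r for r in records if r[d] < median]
        right = [r for r in records if r[d] >= median]
        # A cut is allowable only if both sides keep at least k records.
        if len(left) >= k and len(right) >= k:
            return mondrian(left, qid_idx, k) + mondrian(right, qid_idx, k)
    return [records]


def summarize(partition, qid_idx):
    """Generalize a partition to one (min, max) interval per QID attribute."""
    return tuple(
        (min(r[d] for r in partition), max(r[d] for r in partition))
        for d in qid_idx
    )


# Hypothetical records with two numerical quasi-identifiers.
data = [(25, 50), (27, 60), (31, 55), (45, 52), (47, 58), (52, 51)]
parts = mondrian(data, [0, 1], k=2)
```

Each resulting partition would be published with its interval summary in place of the exact QID values, so all records in a partition become indistinguishable on their quasi-identifiers.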
In summary, we will make the following contributions:
1. We propose a novel implementation of Mondrian’s multidimensional anonymization method on the Apache Spark platform, called “DI-Mondrian”. The implementation uses RDD programming and also provides the l-diversity privacy model. Our results show that DI-Mondrian improves on state-of-the-art MapReduce-based approaches in terms of scalability and speed of the data anonymization process.
2. We show that the proposed DI-Mondrian approach offers less information loss than the state-of-the-art methods through a better selection of cutting points and the creation of balanced partitions. This is done by using the coefficient of variation and information gain metrics to select the cut dimension, thus taking into account the distribution of all of the data related to an attribute.
3. We propose a novel dynamic method for selecting the cut points that leads to less information loss and therefore better data utility (such as accuracy and F-measure in data classification). This in turn leads to a better tradeoff between privacy and utility in the anonymized data.
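The two metrics named in the second contribution can be sketched as follows. This excerpt does not specify how DI-Mondrian combines them or how the candidate cut points are generated, so the code only computes each metric in its textbook form; the function names and the toy values are assumptions.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (0.0 if the mean is 0)."""
    m = mean(values)
    return pstdev(values) / m if m else 0.0


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels, cut):
    """Entropy reduction obtained by splitting `values` at `cut`."""
    left = [lab for v, lab in zip(values, labels) if v < cut]
    right = [lab for v, lab in zip(values, labels) if v >= cut]
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part) for part in (left, right) if part)
    return entropy(labels) - remainder


# Hypothetical attribute values and sensitive labels.
ages = [20, 22, 40, 42]
labels = ["a", "a", "b", "b"]
gain = information_gain(ages, labels, cut=30)
```

An attribute whose values are widely spread relative to their mean has a high coefficient of variation, and a cut that separates the labels cleanly has a high information gain, which is why both are plausible signals for choosing where to split.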
The remainder of the paper is organized as follows. Section 2 reviews related work on privacy preservation in data publishing. Section 3 introduces the relevant concepts of the Apache Spark framework. Section 4 details the proposed method. Section 5 presents the results of the experiments. Conclusions and suggestions for future work are given in Section 6.
Section snippets
Related works
Privacy preservation in data publishing is an area that has recently received particular attention in the research community [10], [22]. Data needs to be supported by one of the privacy models, including k-anonymity [6], l-diversity [7], t-closeness [8], m-invariance [23], δ-presence [24] and so on using a method of anonymization before they are published. Each of the privacy models is resistant against certain attacks, and prevents information from being disclosed. The k-anonymity and l
Basics of Spark
The characteristics of big data call for new algorithms, architectures, models, platforms, and analytic solutions to manage them. A large number of platforms have therefore been released for big data management. A leading organization in this regard is Apache, which has released platforms such as Hadoop and Spark [38]. Apache Spark is a fast, open-source, massively parallel, scalable, general engine for large-scale data processing. One of the capabilities of Apache Spark is in-memory
Steps of running the proposed approach
Fig. 4 shows the steps of the proposed method for achieving the l-diversity privacy model. Each step indicated with a blue rectangle creates an RDD. The RDDs are created with the transformations and actions of the Apache Spark library.
- RDD1 establishment. The dataset is read from the HDFS and distributed to the worker nodes in the computational cluster in RDD1.
- Preprocessing. This step includes removing the duplicate records from RDD1, establishing an RDD for the generic
Evaluation of the proposed method
For evaluation of the designed method, two well-known datasets available in the UCI machine learning repository [41] are used.
- Poker hand. This dataset is used for classification, and contains 1,025,010 data records with 11 numerical attributes. In the experiments, the first 10 attributes are used as quasi-identifier attributes (QID), and the 11th attribute (the class label) is used as the sensitive attribute. For each pair of quasi-identifier attributes, the first attribute is the suit of the
Conclusion and future works
In this research, a distributed method called “DI-Mondrian” was presented, based on the Apache Spark framework and using RDDs, to achieve the l-diversity privacy model on datasets. The presented idea is an improved version of the Mondrian method in which the CV and InfoGain methods are used to select the cut dimension and r dynamic points are selected as the cut points. Several different experiments were conducted with various values for the parameters k, l, and r over the Poker Hand and
CRediT authorship contribution statement
Farough Ashkouti: Conceptualization, Methodology, Software, Data curation, Validation, Writing - original draft. Keyhan Khamforoosh: Conceptualization, Resources, Supervision, Project administration, Validation, Writing - review & editing. Amir Sheikhahmadi: Conceptualization, Validation, Writing - review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (41)
- et al. Privacy-preserving tabular data publishing: A comprehensive evaluation from web to cloud. Comput. Secur. (Jan. 2018)
- et al. Privacy preserving publication of relational and transaction data: survey on the anonymization of patient data. Comput. Sci. Rev. (2019)
- et al. Security in cloud computing: opportunities and challenges. Inf. Sci. (Ny) (2015)
- et al. A hybrid approach for scalable sub-tree anonymization over big data using MapReduce on cloud. J. Comput. Syst. Sci. (2014)
- et al. Privacy and utility preserving data clustering for data anonymization and distribution on Hadoop. Futur. Gener. Comput. Syst. (2017)
- Big privacy: challenges and opportunities of privacy study in the age of big data. IEEE Access (2016)
- et al. Unique in the Crowd: the privacy bounds of human mobility. Sci. Rep. (2013)
- et al. Protection of big data privacy. IEEE Access (2016)
- et al. On syntactic anonymity and differential privacy. Trans. Data Privacy (2013)
- et al. Information security in big data: privacy and data mining. IEEE Access (2014)