Hierarchical anonymization algorithms against background knowledge attack in data releasing
Introduction
Advances in the Internet and data processing technologies have accelerated data collection and dissemination. Since collected data may contain private information, disclosing it, together with identifiers, to unauthorized parties can breach privacy. Removing identifying attributes such as name and social security number is not sufficient to protect privacy when quasi-identifiers (QI) exist. Privacy preservation has therefore attracted significant attention in the context of data collection and dissemination.
Anonymization preserves individuals’ privacy by removing their identifiers from the data to be published while retaining as much of the original information as possible. Each anonymization framework comprises a privacy model and an anonymization algorithm. Privacy models can be divided into syntactic and semantic models. Syntactic privacy models partition the data into groups (called equivalence classes) such that all records within an equivalence class are indistinguishable from one another with respect to their QI values. In the k-anonymity model, the first syntactic privacy model, each equivalence class contains at least k records [1], [2]. This model prevents identity disclosure, but it does not protect against attribute disclosure. To address this issue, several variants of k-anonymity have been proposed [3], [4], [5]. Semantic privacy models add noise to the data in order to preserve privacy. Differential privacy is a semantic privacy model that guarantees that the addition or deletion of any individual’s record does not significantly affect the result of data analysis [6].
Each privacy model provides a defense against a particular adversary model. A common assumption is that the adversary knows two things: (I) whether or not his/her targets exist in the microdata table and (II) the QI values of those targets [1], [2], [3], [4]. None of the models mentioned above, including differential privacy, can preserve privacy if the adversary has additional information, called background knowledge [7]. Hence, researchers have proposed enhanced models that assume the adversary has some background knowledge [8], [9], [10], [11], [12], [13]. Background knowledge is any known fact that by itself is not a privacy disclosure, but that the adversary combines with other information to make a more precise inference about a target’s sensitive information; exploiting it in this way is called a background knowledge attack. Examples of background knowledge in a medical dataset context are “male breast cancer is rare” and “the prevalence of chronic bronchitis is higher among the 65+ age group than among other groups; and, across all age groups, females have higher rates than males for both black and white races” [14].
In this work, we develop a syntactic anonymization framework in which we assume the adversary has background knowledge about the correlations among dataset attributes. In a syntactic privacy model, when an equivalence class is published, the adversary can estimate the probabilities of the possible associations of sensitive values with his/her target (i.e., the record respondent) without exploiting any background knowledge. When different sensitive values exist in an equivalence class, every association of a record respondent with a sensitive value in that class is equally probable. By exploiting background knowledge, the adversary may be able to discriminate one association from the others, resulting in a privacy breach. Modeling background knowledge is an open problem in data anonymization [8]. We model the adversary’s background knowledge as a probability distribution that associates sensitive values with a record respondent based on QI values, called the background knowledge distribution. The goal of our privacy model is to maximize the uncertainty in identifying record respondents and their sensitive attribute values in a given equivalence class. To achieve this goal in the presence of background knowledge, we build equivalence classes whose record respondents have similar background knowledge distributions. Therefore, when adversaries examine the different associations of sensitive values (within each equivalence class) with their targets, they cannot discriminate any association with a high degree of certainty. The constraint of similar background knowledge distributions alone cannot prevent identity disclosure or attribute disclosure, so we also apply k-anonymity [1] and β-likeness [5]. To keep the anonymized data useful, high similarity among the QI values in each equivalence class is also required.
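As an illustration, a background knowledge distribution can be represented as a mapping from sensitive values to probabilities per record respondent. The sketch below is our own illustration, not the paper's exact formulation; in particular, the choice of total variation distance as the similarity measure and all names are assumptions:

```python
# Hypothetical sketch: a background-knowledge distribution maps each
# sensitive value to the probability the adversary assigns to it for a
# given record respondent. Respondents with similar distributions are
# candidates for the same equivalence class.

def tv_distance(p, q):
    """Total variation distance between two distributions over the same
    sensitive-value domain, given as value -> probability dicts."""
    values = set(p) | set(q)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in values)

# Illustrative respondents (names and values are made up).
alice = {"flu": 0.7, "bronchitis": 0.2, "cancer": 0.1}
bob   = {"flu": 0.6, "bronchitis": 0.3, "cancer": 0.1}
carol = {"flu": 0.1, "bronchitis": 0.1, "cancer": 0.8}

# Alice and Bob have similar distributions; grouping them leaves the
# adversary's background knowledge little to discriminate on.
assert tv_distance(alice, bob) < tv_distance(alice, carol)
```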
Hence, we propose to create equivalence classes that meet the following privacy requirements: (1) background knowledge similarity: the background knowledge distributions within any equivalence class should be similar, in order to prevent the background knowledge attack; (2) k-anonymity: each class contains at least k records; (3) β-likeness: the relative difference between the frequency of any sensitive value within an equivalence class and its frequency in the overall microdata table does not exceed a given threshold β.
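Requirements (2) and (3) can be checked directly on a candidate equivalence class. The sketch below uses hypothetical names and a simplified reading of basic β-likeness (the paper may use a refined variant):

```python
from collections import Counter

def satisfies_k(ec, k):
    """Requirement (2): the class holds at least k records."""
    return len(ec) >= k

def satisfies_beta_likeness(ec_sensitive, table_sensitive, beta):
    """Requirement (3), simplified: for every sensitive value appearing
    in the class, the relative increase of its in-class frequency q over
    its overall frequency p must not exceed beta, i.e. (q - p) / p <= beta."""
    overall, n = Counter(table_sensitive), len(table_sensitive)
    local, m = Counter(ec_sensitive), len(ec_sensitive)
    for value, cnt in local.items():
        p = overall[value] / n   # frequency in the whole table
        q = cnt / m              # frequency inside the class
        if (q - p) / p > beta:
            return False
    return True

# Toy table: "cancer" has overall frequency 0.2; a class where it
# reaches 0.5 gives a relative difference of 1.5.
table = ["flu"] * 8 + ["cancer"] * 2
assert satisfies_beta_likeness(["flu", "cancer"], table, beta=2.0)
assert not satisfies_beta_likeness(["flu", "cancer"], table, beta=1.0)
```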
We present two syntactic anonymization algorithms based on value generalization. To satisfy our privacy requirements, we suggest a hierarchical procedure. First, we apply agglomerative clustering to prevent the background knowledge attack: the clustering generates clusters in which the difference between the background knowledge distributions of each pair of records is below a given threshold. Each cluster is then partitioned into a number of equivalence classes. We propose two algorithms to produce the equivalence classes: k-anonymity-primacy and β-likeness-primacy. The former prioritizes the QI attributes and generates equivalence classes in a β-likeness-aware manner: a clustering-based algorithm selects records that are homogeneous in their QI values and then checks whether β-likeness is satisfied. The latter focuses on the sensitive attribute values: it generates large equivalence classes in which β-likeness is satisfied and then splits them to satisfy k-anonymity.
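The first phase can be sketched as greedy complete-linkage agglomeration, which guarantees that every pair of records in a final cluster is within the distance threshold. This is a simplified illustration under our own assumptions, not the paper's actual algorithm:

```python
def agglomerate(records, dist, threshold):
    """Greedy complete-linkage agglomerative clustering: repeatedly merge
    the closest pair of clusters whose *maximum* pairwise record distance
    stays below `threshold`, so every pair inside a final cluster is
    within the threshold. `dist` would compare two records'
    background-knowledge distributions (e.g. a distribution distance)."""
    clusters = [[r] for r in records]

    def complete_link(a, b):
        return max(dist(x, y) for x in a for y in b)

    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = complete_link(clusters[i], clusters[j])
                if d < threshold and (best is None or d < best[0]):
                    best = (d, i, j)
        if best is None:
            return clusters
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]

# Toy example with scalar "distributions" and absolute difference:
# 0.0 and 0.1 merge; 1.0 stays apart at threshold 0.5.
parts = agglomerate([0.0, 0.1, 1.0], lambda x, y: abs(x - y), 0.5)
```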
The closest work to ours is that of Riboni et al. [8]. They proposed a privacy model based on the adversary’s background knowledge and t-closeness [4]; their anonymization algorithm applies a Hilbert index transformation to create an ordered list of record respondents based on the similarity of their QI values. We enforce a stronger model than t-closeness and propose two new anonymization algorithms to satisfy our privacy requirements. Soria-Comas et al. [15] also proposed two clustering-based anonymization algorithms that attain k-anonymity and t-closeness and are suitable for anonymizing numerical values; they do not consider the adversary’s background knowledge.
We verify the effectiveness of our anonymization algorithms through extensive experiments on two datasets: the Adult dataset [16] and the BKseq dataset [8]. We study the performance of our algorithms under different parameters of our privacy model. The experimental results show that k-anonymity-primacy generates anonymized microdata with low information loss, while β-likeness-primacy incurs low privacy loss. The k-anonymity-primacy algorithm also generates more balanced equivalence classes than β-likeness-primacy. We further compare our algorithms with state-of-the-art anonymization algorithms such as the Hilbert index-based algorithm [8]; our algorithms outperform it in terms of both data utility and privacy.
Furthermore, we extend an information loss measure to capture the data inaccuracy caused by generalization. In any anonymization algorithm, some records may not fit into any equivalence class; to protect the privacy of the other record respondents, a simple solution is to remove such records from the published data. We introduce an extension of the Global Certainty Penalty (GCP) metric [17] that also accounts for this kind of information loss, named the Removed Global Certainty Penalty (RGCP). The RGCP metric charges a penalty for each removed record, proportional to the range of QI values in the record’s nearest equivalence class. We evaluate our algorithms using both GCP and RGCP; when an algorithm removes a large number of records, the advantage of RGCP over GCP becomes more apparent.
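To make the penalty concrete, the following sketch computes an RGCP-style score for numeric QI attributes, using a centroid-based notion of "nearest equivalence class". Both the numeric restriction and the centroid choice are our assumptions for illustration, not the paper's definition:

```python
def ncp(ec, domains):
    """Class-level Normalized Certainty Penalty for numeric QIs: per
    attribute, the class range over the domain range, summed over
    attributes and weighted by class size."""
    total = 0.0
    for a, (lo, hi) in enumerate(domains):
        vals = [r[a] for r in ec]
        total += (max(vals) - min(vals)) / (hi - lo)
    return len(ec) * total

def rgcp(classes, removed, domains, n_total):
    """RGCP-style score: GCP over the published classes plus, for each
    removed record, a per-record penalty proportional to the QI range of
    its nearest class (nearest by centroid distance, an assumption)."""
    d = len(domains)
    loss = sum(ncp(ec, domains) for ec in classes)
    for r in removed:
        def centroid_dist(ec):
            return sum(abs(sum(x[a] for x in ec) / len(ec) - r[a])
                       for a in range(d))
        nearest = min(classes, key=centroid_dist)
        loss += ncp(nearest, domains) / len(nearest)  # penalty per record
    return loss / (d * n_total)

# One class spanning [0, 2] of a [0, 10] domain, one removed record.
score = rgcp([[(0,), (2,)]], [(9,)], [(0, 10)], n_total=3)
```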
Contributions. In summary, our contributions are as follows:
- We propose two syntactic anonymization algorithms that simultaneously satisfy two privacy models (k-anonymity and β-likeness) against adversaries who have background knowledge about correlations of attributes. We conduct extensive experiments on different aspects of the algorithms, namely data utility, privacy, equivalence class size, and runtime, and compare our algorithms with microaggregation approaches. We also perform an experimental comparison with the closest work, the Hilbert index-based algorithm [8], and demonstrate that our algorithms outperform it in terms of data utility and privacy.
- We extend GCP to measure the information loss of equivalence classes under generalization. Our metric, called RGCP, also accounts for unpublished records.
Paper organization. Section 2 reviews research related to anonymization algorithms and information loss metrics. Section 3 defines the problem and proposes two solutions. Section 4 introduces a new information loss measure. Experiments and results are presented in Section 5, and Section 6 discusses the results and contributions in more detail. Finally, Section 7 concludes the paper and outlines future work.
Related work
This research builds on syntactic anonymization algorithms, adversary models with background knowledge, and information loss measures. In this section, we review the literature in these areas.
Problem definition
Assume that a data publisher is going to publish the original microdata table T, where each record ri corresponds to an individual vi, the so-called record respondent. Each record ri of T contains d QI attributes and a single sensitive attribute S. D[Ai], 1 ≤ i ≤ d, denotes the attribute domain of Ai, and D[S] denotes the attribute domain of S. Let ri[Aj] denote the Aj value of record ri in T, and ri[QI] denote the QI values of record ri.
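For concreteness, this notation can be encoded as a simple record type; the names and field choices here are purely illustrative:

```python
from dataclasses import dataclass

@dataclass
class Record:
    """One record ri of the microdata table T: a tuple of d QI values
    (ri[A1], ..., ri[Ad]) and a single sensitive value ri[S]."""
    qi: tuple
    sensitive: str

# A record with d = 3 QI attributes (e.g. age, occupation, zip code).
r1 = Record(qi=(34, "engineer", "13053"), sensitive="flu")
assert len(r1.qi) == 3
```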
An anonymization
The proposed information loss measure
Information loss measures compare the data quality of the anonymized data against that of the original data. In this work, we consider the information loss measures applied in generalization algorithms. DM [35] and AEC [19] assign penalties per equivalence class rather than per record and cannot account for the distribution of attribute values. GH [36] and PM [37] take into account the height of the generalization hierarchy. These measures are suitable for
Experiments and results
In this section, we empirically evaluate our anonymization algorithms using two data sets and according to a number of measures.
Discussion
We have analyzed our algorithms in terms of GCP as the information loss metric and RL as the privacy loss metric over two datasets, for different values of the privacy parameters.
The proposed privacy model has different goals including prevention of background knowledge attack and identity and attribute disclosures. The anonymization algorithms satisfy these goals in a hierarchical manner. The algorithms prioritize the prevention of background knowledge attack, which deals with exposed
Conclusion
In this paper, we propose anonymization algorithms that incorporate the adversary’s background knowledge into the privacy model. We define a strong privacy model based on k-anonymity and β-likeness to protect against the background knowledge attack.
Two anonymization algorithms are proposed to achieve our privacy model: k-anonymity-primacy and β-likeness-primacy. In the first phase of these algorithms, we use agglomerative clustering to prevent the background knowledge attack. In the next phase, our
Acknowledgments
The authors would like to thank the reviewers for their comments, which helped improve the paper significantly.
References (43)
- Efficient k-anonymization using clustering techniques, in: Proceedings of the 12th International Conference on Database Systems for Advanced Applications (DASFAA 2007), 2007.
- Anatomy: simple and effective privacy preservation, in: Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB), 2006.
- k-anonymization revisited, in: Proceedings of the International Conference on Data Engineering (ICDE), 2008.
- Efficient sanitization of unsafe data correlations, in: Proceedings of the Workshops of the EDBT/ICDT 2015 Joint Conference, 2015.
- Transforming data to satisfy privacy constraints, in: Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD), 2002.
- P. Samarati, L. Sweeney, Protecting Privacy When Disclosing Information: k-Anonymity and its Enforcement Through...
- Protecting respondents' identities in microdata release, IEEE Trans. Knowl. Data Eng., 2001.
- L-diversity: privacy beyond k-anonymity, ACM Trans. Knowl. Discov. Data, 2007.
- t-closeness: privacy beyond k-anonymity and l-diversity, in: Proceedings of the 23rd IEEE International Conference on Data Engineering (ICDE), 2007.
- Publishing microdata with a robust privacy guarantee, Proc. VLDB Endow., 2012.
- Differential privacy, in: Proceedings of the 33rd International Colloquium on Automata, Languages and Programming (ICALP).
- No free lunch in data privacy, in: Proceedings of the 2011 International Conference on Management of Data.
- JS-Reduce: defending your data from sequential background knowledge attacks, IEEE Trans. Dependable Secure Comput.
- Modeling and integrating background knowledge in data anonymization, in: Proceedings of the 25th IEEE International Conference on Data Engineering (ICDE).
- Worst-case background knowledge in privacy-preserving data publishing, in: Proceedings of the 23rd International Conference on Data Engineering (ICDE).
- Privacy skyline: privacy with multidimensional adversarial knowledge, in: Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB).
- Injector: mining background knowledge for data anonymization, in: Proceedings of the International Conference on Data Engineering (ICDE).
- Privacy-maxEnt: integrating background knowledge in privacy quantification, in: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data.
- t-closeness through microaggregation: strict privacy with enhanced utility preservation, IEEE Trans. Knowl. Data Eng.