Elsevier

Knowledge-Based Systems

Volume 101, 1 June 2016, Pages 71-89

Hierarchical anonymization algorithms against background knowledge attack in data releasing

https://doi.org/10.1016/j.knosys.2016.03.004

Highlights

  • We define a privacy model based on k-anonymity and one of its strong refinements to prevent the background knowledge attack.

  • We propose two hierarchical anonymization algorithms to satisfy our privacy model.

  • Our algorithms outperform state-of-the-art anonymization algorithms in terms of utility and privacy.

  • We extend an information loss measure to capture the data inaccuracies caused by records that do not fit in any equivalence class.

Abstract

Preserving privacy in the presence of an adversary’s background knowledge is very important in data publishing. The k-anonymity model, while protecting identity, does not protect against attribute disclosure. β-likeness, one of the strong refinements of k-anonymity, does not protect against identity disclosure. Neither model protects against attacks based on background knowledge. This research proposes two approaches for generating k-anonymous β-likeness datasets that protect against identity and attribute disclosures and prevent attacks that exploit, as the adversary’s background knowledge, any data correlations between QI and sensitive attribute values. In particular, two hierarchical anonymization algorithms are proposed. Both algorithms apply agglomerative clustering techniques in their first stage in order to generate clusters of records whose probability distributions extracted from background knowledge are similar. In the next phase, k-anonymity and β-likeness are enforced in order to prevent identity and attribute disclosures. Our extensive experiments demonstrate that the proposed algorithms outperform other state-of-the-art anonymization algorithms in terms of privacy and data utility, and the number of unpublished records in our algorithms is smaller than that of the others. Since well-known information loss metrics fail to precisely measure the data inaccuracies stemming from the removal of records that cannot be published in any equivalence class, this research also introduces an extension of the Global Certainty Penalty metric that accounts for unpublished records.

Introduction

Advances in the Internet and data processing technologies have accelerated data collection and dissemination. As collected data may contain private information, a breach of privacy is possible if the data is disclosed, together with identifiers, to unauthorized parties. Removing identifier attributes, such as name and social security number, is not sufficient to protect privacy when quasi-identifiers (QI) exist. Hence, proposing promising approaches for privacy preservation has gained significant attention in the context of data collection and dissemination.

Anonymization is an approach to preserve individuals’ privacy by removing their identifiers from the data that is going to be published, while maintaining as much of the original information as possible. Each anonymization framework includes a privacy model and an anonymization algorithm. Privacy models can be divided into syntactic and semantic models. Syntactic privacy models partition data into a set of groups (called equivalence classes) such that all records within each equivalence class are indistinguishable from one another from the QI point of view. In the k-anonymity model, the first syntactic privacy model, each equivalence class contains at least k records [1], [2]. This model prevents identity disclosure, but it does not preserve privacy against attribute disclosure. To address this issue, other variants of k-anonymity have been proposed [3], [4], [5]. Semantic privacy models add some noise to the data in order to preserve privacy. The differential privacy model is a semantic privacy model which guarantees that the deletion or addition of any individual’s record does not significantly affect the result of data analysis [6].

Each privacy model provides a defense against a particular adversary model. A common assumption is that the adversary has two pieces of information: (I) whether or not his/her targets exist in the microdata table and (II) the QI values of his/her targets [1], [2], [3], [4]. None of the models mentioned above, including the differential privacy model, can preserve privacy if the adversary has additional information (called background knowledge) [7]. Hence, researchers have proposed enhanced models that assume the adversary has some background knowledge [8], [9], [10], [11], [12], [13]. Background knowledge is any known fact that by itself is not a privacy disclosure, but that the adversary can combine with other information to draw a more precise inference about the target’s sensitive information. This is called a background knowledge attack. Examples of background knowledge in a particular medical dataset context are “male breast cancer is rare”, “the prevalence of chronic bronchitis is higher among the 65+ age group compared to other groups; and, across all age groups, females have higher rates than males for both black and white races”, etc. [14].

In this work, we develop a syntactic anonymization framework in which we assume the adversary has background knowledge about the correlations among dataset attributes. In a syntactic privacy model, when an equivalence class is published, the adversary can estimate the probabilities of possible associations of sensitive values to his/her target (i.e., record respondent) without exploiting any background knowledge. When different sensitive attribute values exist in an equivalence class, the probability of associating a record respondent with each sensitive value in the equivalence class is the same. By exploiting background knowledge, the adversary may be able to discriminate one association from the others, resulting in privacy breaches. Modeling background knowledge is an open problem in data anonymization [8]. We model the adversary’s background knowledge as a probability distribution that associates the sensitive values with a record respondent based on QI values, called the background knowledge distribution. The goal in our privacy model is to maximize uncertainty in identifying record respondents and their respective values for sensitive attributes in a given equivalence class. In the presence of the adversary’s background knowledge, we attempt to create equivalence classes such that record respondents have similar background knowledge distributions in each class in order to achieve our goal. Therefore, when adversaries examine different associations of sensitive values (within each equivalence class) to their targets, they will not be able to discriminate any association with a high degree of certainty. The constraint of similar background knowledge distributions alone cannot prevent identity disclosure or attribute disclosure. Therefore, we also apply k-anonymity [1] and β-likeness [5]. For the anonymized data to remain useful, a high similarity among QI values in each equivalence class is also required.
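
To make the notion of “similar background knowledge distributions” concrete, the following minimal sketch compares two records’ background knowledge distributions over the sensitive-value domain. The total variation distance used here is an illustrative assumption, not a measure prescribed by the paper.

```python
import numpy as np

def bk_distance(p, q):
    """Distance between two background-knowledge distributions defined over the
    same ordered set of sensitive values. Total variation distance is used here
    purely as an illustrative choice; the paper's actual measure may differ."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()       # normalize to proper distributions
    return 0.5 * np.abs(p - q).sum()

# Example: two record respondents whose distributions over the sensitive values
# {flu, HIV, cancer} are close enough to be grouped in the same equivalence class.
p_alice = [0.60, 0.10, 0.30]
p_bob   = [0.55, 0.15, 0.30]
print(bk_distance(p_alice, p_bob))        # 0.05
```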

Hence, we propose to create equivalence classes with the following privacy requirements: (1) similarity of background knowledge distributions: the background knowledge distributions within any equivalence class should be similar, in order to prevent the so-called background knowledge attack; (2) k-anonymity: the size of each class is at least k; (3) β-likeness: the maximum relative difference between the frequency of each sensitive value within any equivalence class and its frequency in the overall microdata table does not exceed a given threshold β.
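
As an illustration of how requirements (2) and (3) could be checked on a candidate equivalence class, the sketch below verifies the class size against k and the relative over-representation of each sensitive value against β. The data layout (plain lists of sensitive values) and function names are assumptions made for the example, not the paper’s implementation.

```python
from collections import Counter

def satisfies_k_anonymity(ec_records, k):
    """Requirement (2): the equivalence class must contain at least k records."""
    return len(ec_records) >= k

def satisfies_beta_likeness(ec_sensitive, table_sensitive, beta):
    """Requirement (3), sketched after the basic beta-likeness idea: for every
    sensitive value over-represented in the class, the relative increase of its
    in-class frequency over its table-wide frequency must not exceed beta."""
    n_ec, n_t = len(ec_sensitive), len(table_sensitive)
    table_freq = {s: c / n_t for s, c in Counter(table_sensitive).items()}
    ec_freq = {s: c / n_ec for s, c in Counter(ec_sensitive).items()}
    for s, p in ec_freq.items():
        q = table_freq[s]
        if p > q and (p - q) / q > beta:
            return False
    return True

# Toy usage: a 4-record class drawn from a 12-record table.
table_sa = ["flu"] * 6 + ["HIV"] * 3 + ["cancer"] * 3
ec_sa = ["flu", "flu", "HIV", "cancer"]
print(satisfies_k_anonymity(ec_sa, k=4))                   # True
print(satisfies_beta_likeness(ec_sa, table_sa, beta=0.5))  # True: in-class frequencies match the table
```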

We present two syntactic anonymization algorithms based on the value generalization approach. We suggest a hierarchical procedure to satisfy our privacy requirements. First, we apply agglomerative clustering to prevent the background knowledge attack: the clustering algorithm generates clusters in which the difference between the background knowledge distributions of each pair of records is below a certain threshold. Then, each cluster is partitioned into a number of equivalence classes. We propose two algorithms to produce the equivalence classes: k-anonymity-primacy and β-likeness-primacy. The former prioritizes the QI attributes and generates equivalence classes in a β-likeness-aware manner. For this purpose, we propose a clustering-based algorithm to select records that are homogeneous in terms of QI values, and then check whether β-likeness is satisfied. The latter focuses on the sensitive attribute values. It generates large equivalence classes in which β-likeness is satisfied. Then, the large equivalence classes are split in order to satisfy k-anonymity.
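
A minimal sketch of the first phase follows, assuming total variation distance between background knowledge distributions and complete-linkage agglomerative clustering (SciPy), so that cutting the dendrogram at the threshold bounds the pairwise difference within every cluster. The distance and linkage choices are assumptions of this sketch, not necessarily the paper’s exact configuration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def cluster_by_background_knowledge(bk_dists, threshold):
    """Phase 1 sketch: agglomerative (complete-linkage) clustering of records by
    their background-knowledge distributions, cut so that every pair of records
    in a cluster differs by at most `threshold`."""
    # Condensed pairwise distance matrix; 'cityblock'/2 equals total variation distance.
    condensed = pdist(bk_dists, metric="cityblock") / 2.0
    z = linkage(condensed, method="complete")
    # With complete linkage, cutting at `threshold` bounds the maximum
    # pairwise distance inside every resulting cluster.
    return fcluster(z, t=threshold, criterion="distance")

# Toy usage: 5 records, each a distribution over 3 sensitive values.
bk = np.array([
    [0.60, 0.10, 0.30],
    [0.55, 0.15, 0.30],
    [0.10, 0.80, 0.10],
    [0.15, 0.75, 0.10],
    [0.58, 0.12, 0.30],
])
print(cluster_by_background_knowledge(bk, threshold=0.1))  # e.g. [1 1 2 2 1]
```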

The work closest to ours is that of Riboni et al. [8]. They propose a privacy model based on the adversary’s background knowledge and t-closeness [4]. Their anonymization algorithm applies a Hilbert index transformation to create an ordered list of record respondents based on the similarity of QI values. We enforce a stronger model than t-closeness and propose two new anonymization algorithms to satisfy our privacy requirements. Soria-Comas et al. [15] also proposed two clustering-based anonymization algorithms attaining k-anonymity and t-closeness, which are suitable for anonymizing numerical values. They do not consider the adversary’s background knowledge.

We verify the effectiveness of our anonymization algorithms by running extensive experiments on two datasets: the Adult dataset [16] and the BKseq dataset [8]. We study the performance of our anonymization algorithms under different parameters of our privacy model. The experimental results show that k-anonymity-primacy generates anonymized microdata with low information loss, while β-likeness-primacy incurs low privacy loss. The k-anonymity-primacy algorithm also generates more balanced equivalence classes compared to β-likeness-primacy. We further compare the performance of our proposed algorithms with state-of-the-art anonymization algorithms such as the Hilbert index-based algorithm [8]. Our algorithms outperform the Hilbert index-based algorithm in terms of both data utility and privacy.

Furthermore, we extend an information loss measure to capture data inaccuracies caused by generalization. In any anonymization algorithm, some records may not fit into any equivalence class. To protect the privacy of the other record respondents, a simple solution is to remove such records from the published data. We introduce an extension to the Global Certainty Penalty (GCP) metric [17] to account for this kind of information loss as well, and we name it Removed Global Certainty Penalty (RGCP). The RGCP metric charges a penalty for each removed record, proportional to the range of QI values in the nearest equivalence class. We evaluate our algorithms using both GCP and RGCP. When a large number of records is removed by the algorithm, the advantage of RGCP over GCP becomes more apparent.
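
The following sketch illustrates the idea behind RGCP for numerical QI attributes: the usual GCP-style penalty per equivalence class, plus a penalty for each removed record proportional to the QI ranges of a nearby class. The way the “nearest” class is chosen here (centroid distance) and the data layout are assumptions of this sketch, not the paper’s exact formulation.

```python
import numpy as np

def ncp(ec, table_ranges):
    """Normalized Certainty Penalty of one equivalence class over numerical QIs:
    sum over attributes of (in-class range / table-wide range)."""
    ec = np.asarray(ec, dtype=float)
    return ((ec.max(axis=0) - ec.min(axis=0)) / table_ranges).sum()

def rgcp(eq_classes, removed, table):
    """Sketch of a GCP-style information-loss score extended with a penalty for
    removed (unpublishable) records; each removed record is charged the NCP of
    its nearest equivalence class (nearest by centroid distance)."""
    table = np.asarray(table, dtype=float)
    n, d = table.shape
    table_ranges = table.max(axis=0) - table.min(axis=0)

    # Standard GCP part: per-class penalty weighted by class size.
    loss = sum(len(ec) * ncp(ec, table_ranges) for ec in eq_classes)

    # Removal part: penalty proportional to the QI ranges of the closest class.
    centroids = [np.asarray(ec, dtype=float).mean(axis=0) for ec in eq_classes]
    for r in removed:
        r = np.asarray(r, dtype=float)
        nearest = int(np.argmin([np.linalg.norm(r - c) for c in centroids]))
        loss += ncp(eq_classes[nearest], table_ranges)

    # Normalize by the numbers of records and QI attributes, as GCP does.
    return loss / (d * n)

# Toy usage: two 2-attribute equivalence classes and one removed record.
ec1 = [[25, 50], [30, 55]]
ec2 = [[60, 80], [65, 90]]
removed = [[28, 52]]
table = ec1 + ec2 + removed
print(rgcp([ec1, ec2], removed, table))   # 0.15
```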

Contributions. In summary, our contributions are as follows:

  • We propose two syntactic anonymization algorithms which simultaneously satisfy two privacy models (k-anonymity and β-likeness) against adversaries who have background knowledge about correlations of attributes. We conduct extensive experiments on different aspects of the algorithms, namely data utility, privacy, size of equivalence classes, and runtime, and we compare our algorithms with microaggregation approaches. We also perform an experimental comparison between the closest work, the Hilbert index-based algorithm [8], and the proposed algorithms, and demonstrate that the proposed algorithms outperform the Hilbert index-based algorithm in terms of data utility and privacy.

  • We extend GCP to measure the information loss of equivalence classes when the generalization operation is performed. Our metric, called RGCP, also accounts for unpublished records.

Paper organization. In Section 2, we review research related to anonymization algorithms and information loss metrics. In Section 3, we define the problem and propose two solutions. In Section 4, we introduce a new information loss measure. Experiments and results are presented in Section 5. In Section 6, we discuss the results and contributions in more detail. Finally, we conclude the paper and outline future work in Section 7.

Section snippets

Related work

This research builds on the concepts of syntactic anonymization algorithms that consider adversaries with background knowledge, and on information loss measures. In this section, we review the literature of these domains from various aspects.

Problem definition

Assume that a data publisher is going to publish the original microdata table T = {r1, r2, …, rn}, where each record ri corresponds to an individual vi, the so-called record respondent. Each record ri of T contains d QI attributes A1, A2, …, Ad and a single sensitive attribute S. D[Ai], 1 ≤ i ≤ d, denotes the attribute domain of Ai, and D[S] = {s1, s2, …, sm} denotes the attribute domain of S. Let us assume that ri[Aj] denotes the Aj value of record ri in T and ri[QI] denotes the QI values of record ri.
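
As an illustration only, the microdata table T and the notation above could be represented as follows; pandas and the attribute names are assumptions made for the example.

```python
import pandas as pd

# Illustrative microdata table T: each row r_i is one record respondent,
# with QI attributes A_1..A_d and a single sensitive attribute S.
QI_ATTRIBUTES = ["Age", "Zipcode", "Gender"]   # A_1, ..., A_d (here d = 3)
SENSITIVE_ATTRIBUTE = "Disease"                # S, with domain D[S]

T = pd.DataFrame(
    [
        [34, "47677", "M", "flu"],
        [36, "47602", "F", "HIV"],
        [51, "47905", "F", "cancer"],
    ],
    columns=QI_ATTRIBUTES + [SENSITIVE_ATTRIBUTE],
)

r1 = T.iloc[0]
print(r1[QI_ATTRIBUTES].tolist())              # r_1[QI]: the QI values of record r_1
print(r1[SENSITIVE_ATTRIBUTE])                 # r_1[S]:  its sensitive value
print(sorted(T[SENSITIVE_ATTRIBUTE].unique())) # D[S] = {s_1, ..., s_m}
```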

An anonymization

The proposed information loss measure

Information loss measures are used to compare the data quality of the anonymized data with that of the original data. In this work, we consider the information loss measures applied in generalization algorithms. DM [35] and AEC [19] consider penalties per equivalence class rather than per record, and they cannot take the distribution of attribute values into account. GH [36] and PM [37] take into account the height of the generalization hierarchy. These measures are suitable for

Experiments and results

In this section, we empirically evaluate our anonymization algorithms using two datasets and a number of measures.

Discussion

We have analyzed our algorithms in terms of GCP as the information loss metric and RL as the privacy loss metric over two datasets, for different values of the privacy parameters.

The proposed privacy model has different goals including prevention of background knowledge attack and identity and attribute disclosures. The anonymization algorithms satisfy these goals in a hierarchical manner. The algorithms prioritize the prevention of background knowledge attack, which deals with exposed

Conclusion

In this paper, we propose anonymization algorithms which consider the adversary’s background knowledge in the privacy model. We define a strong privacy model based on k-anonymity and β-likeness to protect against the background knowledge attack.

Two anonymization algorithms are proposed to achieve our privacy model: k-anonymity-primacy and β-likeness-primacy. In the first phase of these algorithms, we use agglomerative clustering to prevent the background knowledge attack. In the next phase, our

Acknowledgments

The authors would like to thank the reviewers for their comments, which helped improve the paper significantly.

References (43)

  • C. Dwork

    Differential privacy

    Proceedings of the 33rd International Colloquium on Automata, Languages and Programming (ICALP)

    (2006)
  • D. Kifer et al.

    No free lunch in data privacy

    Proceedings of the 2011 International Conference on Management of Data

    (2011)
  • D. Riboni et al.

    JS-Reduce: defending your data from sequential background knowledge attacks

    IEEE Trans. Dependable Secure Comput.

    (2012)
  • T. Li et al.

    Modeling and integrating background knowledge in data anonymization

    Proceedings of the 25th IEEE International Conference on Data Engineering (ICDE)

    (2009)
  • D. Martin et al.

    Worst-case background knowledge in privacy-preserving data publishing

    Proceedings of the 23rd International Conference on Data Engineering (ICDE)

    (2007)
  • B.C. Chen et al.

    Privacy skyline: privacy with multidimensional adversarial knowledge

    Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB)

    (2007)
  • T. Li et al.

    Injector: mining background knowledge for data anonymization

    Proceedings of the International Conference on Data Engineering (ICDE)

    (2008)
  • W. Du et al.

    Privacy-maxEnt: integrating background knowledge in privacy quantification

    Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data

    (2008)
  • National Heart, Lung, and Blood Institute, Data Fact Sheet, November 2015.
  • J. Soria-Comas et al.

    t-closeness through microaggregation: strict privacy with enhanced utility preservation

    IEEE Trans. Knowl. Data Eng.

    (2015)
  • Adult Dataset, https://archive.ics.uci.edu/ml/datasets/Adult, last access: April 2015.