Elsevier

Knowledge-Based Systems

Volume 101, 1 June 2016, Pages 71-89

Hierarchical anonymization algorithms against background knowledge attack in data releasing

https://doi.org/10.1016/j.knosys.2016.03.004

Highlights

  • We define a privacy model based on k-anonymity and one of its strong refinements to prevent the background knowledge attack.

  • We propose two hierarchical anonymization algorithms to satisfy our privacy model.

  • Our algorithms outperform state-of-the-art anonymization algorithms in terms of utility and privacy.

  • We extend an information loss measure to capture the data inaccuracies caused by records that do not fit in any equivalence class.

Abstract

Preserving privacy in the presence of an adversary’s background knowledge is very important in data publishing. The k-anonymity model, while protecting identity, does not protect against attribute disclosure. β-likeness, one of the strong refinements of k-anonymity, does not protect against identity disclosure. Neither model protects against attacks based on background knowledge. This research proposes two approaches for generating k-anonymous β-likeness datasets that protect against identity and attribute disclosures and prevent attacks that exploit, as the adversary’s background knowledge, any data correlations between QI and sensitive attribute values. In particular, two hierarchical anonymization algorithms are proposed. Both algorithms apply agglomerative clustering techniques in their first stage in order to generate clusters of records whose probability distributions extracted from background knowledge are similar. In the next phase, k-anonymity and β-likeness are enforced in order to prevent identity and attribute disclosures. Our extensive experiments demonstrate that the proposed algorithms outperform other state-of-the-art anonymization algorithms in terms of privacy and data utility, and the number of unpublished records in our algorithms is smaller than that of the others. Since well-known information loss metrics fail to precisely measure the data inaccuracies stemming from the removal of records that cannot be published in any equivalence class, this research also introduces an extension of the Global Certainty Penalty metric that accounts for unpublished records.

Introduction

Advances in the Internet and data processing technologies have accelerated data collection and dissemination. As collected data may contain private information, a breach of privacy is possible if the data is disclosed, together with identifiers, to unauthorized parties. Removing identifier attributes, such as name and social security number, is not sufficient to protect privacy when quasi-identifiers (QI) exist. Hence, proposing promising approaches for privacy preservation has gained significant attention in the context of data collection and dissemination.

Anonymization is an approach to preserve individuals’ privacy by removing their identifiers from the data that is going to be published, while maintaining as much of the original information as possible. Each anonymization framework includes a privacy model and an anonymization algorithm. Privacy models can be divided into syntactic and semantic models. Syntactic privacy models partition data into a set of groups (called equivalence classes) such that all records within each equivalence class are indistinguishable from one another from the QI point of view. In the k-anonymity model, the first syntactic privacy model, each equivalence class contains at least k records [1], [2]. This model prevents identity disclosure, but it does not preserve privacy against attribute disclosure. To address this issue, other variants of k-anonymity have been proposed [3], [4], [5]. Semantic privacy models add some noise to the data in order to preserve privacy. The differential privacy model is a semantic privacy model which guarantees that the deletion or addition of any individual’s record does not significantly affect the result of data analysis [6].

Each privacy model provides a defense against a particular adversary model. A common assumption is that the adversary has two pieces of information: (I) whether or not his/her targets exist in the microdata table and (II) the QI values of his/her targets [1], [2], [3], [4]. None of the models mentioned above, including the differential privacy model, can preserve privacy if the adversary has additional information (called background knowledge) [7]. Hence, researchers have proposed enhanced models that assume the adversary has some background knowledge [8], [9], [10], [11], [12], [13]. Background knowledge is any known fact that by itself is not a privacy disclosure, but that the adversary can combine with other information to draw a more precise inference about the target’s sensitive information. This is called a background knowledge attack. Examples of background knowledge in a particular medical dataset context are “male breast cancer is rare”, “the prevalence of chronic bronchitis is higher among the 65+ age group compared to other groups; and, across all age groups, females have higher rates than males for both black and white races”, etc. [14].

In this work, we develop a syntactic anonymization framework in which we assume the adversary has background knowledge about the correlations among dataset attributes. In a syntactic privacy model, when an equivalence class is published, the adversary can estimate the probabilities of possible associations of sensitive values to his/her target (i.e., record respondent) without exploiting any background knowledge. When different sensitive attribute values exist in an equivalence class, the probability of associating a record respondent with each sensitive value in the equivalence class is the same. By exploiting background knowledge, the adversary may be able to discriminate one association from the others, resulting in privacy breaches. Modeling background knowledge is an open problem in data anonymization [8]. We model the adversary’s background knowledge as a probability distribution that associates the sensitive values with a record respondent based on QI values, called the background knowledge distribution. The goal in our privacy model is to maximize uncertainty in identifying record respondents and their respective values for sensitive attributes in a given equivalence class. In the presence of the adversary’s background knowledge, we attempt to create equivalence classes such that record respondents have similar background knowledge distributions in each class in order to achieve our goal. Therefore, when adversaries examine different associations of sensitive values (within each equivalence class) to their targets, they will not be able to discriminate any association with a high degree of certainty. The constraint of similar background knowledge distributions alone cannot prevent identity disclosure or attribute disclosure. Therefore, we also apply k-anonymity [1] and β-likeness [5]. For the anonymized data to remain useful, a high similarity among QI values in each equivalence class is also required.
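
To make the notion of “similar background knowledge distributions” concrete, the following minimal sketch compares two records’ background knowledge distributions over the sensitive-value domain. The total variation distance used here is an illustrative assumption, not a measure prescribed by the paper.

```python
import numpy as np

def bk_distance(p, q):
    """Distance between two background-knowledge distributions defined over the
    same ordered set of sensitive values. Total variation distance is used here
    purely as an illustrative choice; the paper's actual measure may differ."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()       # normalize to proper distributions
    return 0.5 * np.abs(p - q).sum()

# Example: two record respondents whose distributions over the sensitive values
# {flu, HIV, cancer} are close enough to be grouped in the same equivalence class.
p_alice = [0.60, 0.10, 0.30]
p_bob   = [0.55, 0.15, 0.30]
print(bk_distance(p_alice, p_bob))        # 0.05
```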

Hence, we propose to create equivalence classes with the following privacy requirements: (1) similarity of background knowledge distributions: the background knowledge distributions within any equivalence class should be similar, in order to prevent the so-called background knowledge attack; (2) k-anonymity: the size of each class is at least k; (3) β-likeness: the maximum relative difference between the frequency of each sensitive value within any equivalence class and its frequency in the overall microdata table does not exceed a given threshold β.
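
As an illustration of how requirements (2) and (3) could be checked on a candidate equivalence class, the sketch below verifies the class size against k and the relative over-representation of each sensitive value against β. The data layout (plain lists of sensitive values) and function names are assumptions made for the example, not the paper’s implementation.

```python
from collections import Counter

def satisfies_k_anonymity(ec_records, k):
    """Requirement (2): the equivalence class must contain at least k records."""
    return len(ec_records) >= k

def satisfies_beta_likeness(ec_sensitive, table_sensitive, beta):
    """Requirement (3), sketched after the basic beta-likeness idea: for every
    sensitive value over-represented in the class, the relative increase of its
    in-class frequency over its table-wide frequency must not exceed beta."""
    n_ec, n_t = len(ec_sensitive), len(table_sensitive)
    table_freq = {s: c / n_t for s, c in Counter(table_sensitive).items()}
    ec_freq = {s: c / n_ec for s, c in Counter(ec_sensitive).items()}
    for s, p in ec_freq.items():
        q = table_freq[s]
        if p > q and (p - q) / q > beta:
            return False
    return True

# Toy usage: a 4-record class drawn from a 12-record table.
table_sa = ["flu"] * 6 + ["HIV"] * 3 + ["cancer"] * 3
ec_sa = ["flu", "flu", "HIV", "cancer"]
print(satisfies_k_anonymity(ec_sa, k=4))                   # True
print(satisfies_beta_likeness(ec_sa, table_sa, beta=0.5))  # True: in-class frequencies match the table
```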

We present two syntactic anonymization algorithms based on the value generalization approach. We suggest a hierarchical procedure to satisfy our privacy requirements. First, we apply agglomerative clustering to prevent the background knowledge attack: the clustering algorithm generates clusters in which the difference between the background knowledge distributions of each pair of records is below a certain threshold. Then, each cluster is partitioned into a number of equivalence classes. We propose two algorithms to produce the equivalence classes: k-anonymity-primacy and β-likeness-primacy. The former prioritizes the QI attributes and generates equivalence classes in a β-likeness-aware manner. For this purpose, we propose a clustering-based algorithm to select records that are homogeneous in terms of QI values, and then check whether β-likeness is satisfied. The latter focuses on the sensitive attribute values. It generates large equivalence classes in which β-likeness is satisfied. Then, the large equivalence classes are split in order to satisfy k-anonymity.
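
A minimal sketch of the first phase follows, assuming total variation distance between background knowledge distributions and complete-linkage agglomerative clustering (SciPy), so that cutting the dendrogram at the threshold bounds the pairwise difference within every cluster. The distance and linkage choices are assumptions of this sketch, not necessarily the paper’s exact configuration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def cluster_by_background_knowledge(bk_dists, threshold):
    """Phase 1 sketch: agglomerative (complete-linkage) clustering of records by
    their background-knowledge distributions, cut so that every pair of records
    in a cluster differs by at most `threshold`."""
    # Condensed pairwise distance matrix; 'cityblock'/2 equals total variation distance.
    condensed = pdist(bk_dists, metric="cityblock") / 2.0
    z = linkage(condensed, method="complete")
    # With complete linkage, cutting at `threshold` bounds the maximum
    # pairwise distance inside every resulting cluster.
    return fcluster(z, t=threshold, criterion="distance")

# Toy usage: 5 records, each a distribution over 3 sensitive values.
bk = np.array([
    [0.60, 0.10, 0.30],
    [0.55, 0.15, 0.30],
    [0.10, 0.80, 0.10],
    [0.15, 0.75, 0.10],
    [0.58, 0.12, 0.30],
])
print(cluster_by_background_knowledge(bk, threshold=0.1))  # e.g. [1 1 2 2 1]
```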

The work closest to ours is that of Riboni et al. [8]. They propose a privacy model based on the adversary’s background knowledge and t-closeness [4]. Their anonymization algorithm applies a Hilbert index transformation to create an ordered list of record respondents based on the similarity of QI values. We enforce a stronger model than t-closeness and propose two new anonymization algorithms to satisfy our privacy requirements. Soria-Comas et al. [15] also proposed two clustering-based anonymization algorithms attaining k-anonymity and t-closeness, which are suitable for anonymizing numerical values. They do not consider the adversary’s background knowledge.

We verify the effectiveness of our anonymization algorithms by running extensive experiments on two datasets: the Adult dataset [16] and the BKseq dataset [8]. We study the performance of our anonymization algorithms under different parameters of our privacy model. The experimental results show that k-anonymity-primacy generates anonymized microdata with low information loss, while β-likeness-primacy incurs low privacy loss. The k-anonymity-primacy algorithm also generates more balanced equivalence classes compared to β-likeness-primacy. We further compare the performance of our proposed algorithms with state-of-the-art anonymization algorithms such as the Hilbert index-based algorithm [8]. Our algorithms outperform the Hilbert index-based algorithm in terms of both data utility and privacy.

Furthermore, we extend an information loss measure to capture data inaccuracies caused by generalization. In any anonymization algorithm, some records may not fit into any equivalence class. To protect the privacy of the other record respondents, a simple solution is to remove such records from the published data. We introduce an extension to the Global Certainty Penalty (GCP) metric [17] to account for this kind of information loss as well, and we name it Removed Global Certainty Penalty (RGCP). The RGCP metric charges a penalty for each removed record, proportional to the range of QI values in the nearest equivalence class. We evaluate our algorithms using both GCP and RGCP. When a large number of records is removed by the algorithm, the advantage of RGCP over GCP becomes more apparent.
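
The following sketch illustrates the idea behind RGCP for numerical QI attributes: the usual GCP-style penalty per equivalence class, plus a penalty for each removed record proportional to the QI ranges of a nearby class. The way the “nearest” class is chosen here (centroid distance) and the data layout are assumptions of this sketch, not the paper’s exact formulation.

```python
import numpy as np

def ncp(ec, table_ranges):
    """Normalized Certainty Penalty of one equivalence class over numerical QIs:
    sum over attributes of (in-class range / table-wide range)."""
    ec = np.asarray(ec, dtype=float)
    return ((ec.max(axis=0) - ec.min(axis=0)) / table_ranges).sum()

def rgcp(eq_classes, removed, table):
    """Sketch of a GCP-style information-loss score extended with a penalty for
    removed (unpublishable) records; each removed record is charged the NCP of
    its nearest equivalence class (nearest by centroid distance)."""
    table = np.asarray(table, dtype=float)
    n, d = table.shape
    table_ranges = table.max(axis=0) - table.min(axis=0)

    # Standard GCP part: per-class penalty weighted by class size.
    loss = sum(len(ec) * ncp(ec, table_ranges) for ec in eq_classes)

    # Removal part: penalty proportional to the QI ranges of the closest class.
    centroids = [np.asarray(ec, dtype=float).mean(axis=0) for ec in eq_classes]
    for r in removed:
        r = np.asarray(r, dtype=float)
        nearest = int(np.argmin([np.linalg.norm(r - c) for c in centroids]))
        loss += ncp(eq_classes[nearest], table_ranges)

    # Normalize by the numbers of records and QI attributes, as GCP does.
    return loss / (d * n)

# Toy usage: two 2-attribute equivalence classes and one removed record.
ec1 = [[25, 50], [30, 55]]
ec2 = [[60, 80], [65, 90]]
removed = [[28, 52]]
table = ec1 + ec2 + removed
print(rgcp([ec1, ec2], removed, table))   # 0.15
```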

Contributions. In summary, our contributions are as follows:

  • We propose two syntactic anonymization algorithms which simultaneously satisfy two privacy models (k-anonymity and β-likeness) against adversaries who have background knowledge about correlations of attributes. We conduct extensive experiments on different aspects of the algorithms, namely data utility, privacy, size of equivalence classes, and runtime, and we compare our algorithms with microaggregation approaches. We also perform an experimental comparison between the closest work, the Hilbert index-based algorithm [8], and the proposed algorithms, and demonstrate that the proposed algorithms outperform the Hilbert index-based algorithm in terms of data utility and privacy.

  • We extend GCP to measure the information loss of equivalence classes when the generalization operation is performed. Our metric, called RGCP, also accounts for unpublished records.

Paper organization. In Section 2, we review research related to anonymization algorithms and information loss metrics. In Section 3, we define the problem and propose two solutions. In Section 4, we introduce a new information loss measure. Experiments and results are presented in Section 5. In Section 6, we discuss the results and contributions in more detail. Finally, we conclude the paper and outline future work in Section 7.

Section snippets

Related work

This research builds on the concepts of syntactic anonymization algorithms that consider adversaries with background knowledge, and on information loss measures. In this section, we review the literature of these domains from various aspects.

Problem definition

Assume that a data publisher is going to publish the original microdata table T = {r1, r2, …, rn}, where each record ri corresponds to an individual vi, the so-called record respondent. Each record ri of T contains d QI attributes A1, A2, …, Ad and a single sensitive attribute S. D[Ai], 1 ≤ i ≤ d, denotes the attribute domain of Ai, and D[S] = {s1, s2, …, sm} denotes the attribute domain of S. Let us assume that ri[Aj] denotes the Aj value of record ri in T and ri[QI] denotes the QI values of record ri.
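
As an illustration only, the microdata table T and the notation above could be represented as follows; pandas and the attribute names are assumptions made for the example.

```python
import pandas as pd

# Illustrative microdata table T: each row r_i is one record respondent,
# with QI attributes A_1..A_d and a single sensitive attribute S.
QI_ATTRIBUTES = ["Age", "Zipcode", "Gender"]   # A_1, ..., A_d (here d = 3)
SENSITIVE_ATTRIBUTE = "Disease"                # S, with domain D[S]

T = pd.DataFrame(
    [
        [34, "47677", "M", "flu"],
        [36, "47602", "F", "HIV"],
        [51, "47905", "F", "cancer"],
    ],
    columns=QI_ATTRIBUTES + [SENSITIVE_ATTRIBUTE],
)

r1 = T.iloc[0]
print(r1[QI_ATTRIBUTES].tolist())              # r_1[QI]: the QI values of record r_1
print(r1[SENSITIVE_ATTRIBUTE])                 # r_1[S]:  its sensitive value
print(sorted(T[SENSITIVE_ATTRIBUTE].unique())) # D[S] = {s_1, ..., s_m}
```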

An anonymization

The proposed information loss measure

Information loss measures are used to compare the data quality of the anonymized data with that of the original data. In this work, we consider the information loss measures applied in generalization algorithms. DM [35] and AEC [19] consider penalties per equivalence class rather than per record, and they cannot take the distribution of attribute values into account. GH [36] and PM [37] take into account the height of the generalization hierarchy. These measures are suitable for

Experiments and results

In this section, we empirically evaluate our anonymization algorithms using two datasets and a number of measures.

Discussion

We have analyzed our algorithms in terms of GCP as the information loss metric and RL as the privacy loss metric over two datasets, for different values of the privacy parameters.

The proposed privacy model has different goals including prevention of background knowledge attack and identity and attribute disclosures. The anonymization algorithms satisfy these goals in a hierarchical manner. The algorithms prioritize the prevention of background knowledge attack, which deals with exposed

Conclusion

In this paper, we propose anonymization algorithms which consider the adversary’s background knowledge in the privacy model. We define a strong privacy model based on k-anonymity and β-likeness to protect against the background knowledge attack.

Two anonymization algorithms are proposed to achieve our privacy model: k-anonymity-primacy and β-likeness-primacy. In the first phase of these algorithms, we use agglomerative clustering to prevent the background knowledge attack. In the next phase, our

Acknowledgments

The authors would like to thank the reviewers for their comments, which helped improve the paper significantly.

References (43)

  • C. Dwork

    Differential privacy

    Proceedings of the 33rd International Colloquium on Automata, Languages and Programming (ICALP)

    (2006)
  • D. Kifer et al.

    No free lunch in data privacy

    Proceedings of the 2011 International Conference on Management of Data

    (2011)
  • D. Riboni et al.

    JS-Reduce: defending your data from sequential background knowledge attacks

    IEEE Trans. Dependable Secure Comput.

    (2012)
  • T. Li et al.

    Modeling and integrating background knowledge in data anonymization

    Proceedings of the 25th IEEE International Conference on Data Engineering (ICDE)

    (2009)
  • D. Martin et al.

    Worst-case background knowledge in privacy-preserving data publishing

    Proceedings of the 23rd International Conference on Data Engineering (ICDE)

    (2007)
  • B.C. Chen et al.

    Privacy skyline: privacy with multidimensional adversarial knowledge

    Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB)

    (2007)
  • T. Li et al.

    Injector: mining background knowledge for data anonymization

    Proceedings of the International Conference on Data Engineering (ICDE)

    (2008)
  • W. Du et al.

    Privacy-maxEnt: integrating background knowledge in privacy quantification

    Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data

    (2008)
  • National Heart, Lung, and Blood Institute, Data Fact Sheet, November 2015.
  • J. Soria-Comas et al.

    t-closeness through microaggregation: strict privacy with enhanced utility preservation

    IEEE Trans. Knowl. Data Eng.

    (2015)
  • Adult Dataset, https://archive.ics.uci.edu/ml/datasets/Adult, last access: April 2015.