Information Sciences, Volume 450, June 2018, Pages 316-335

Bottom-up sequential anonymization in the presence of adversary knowledge

https://doi.org/10.1016/j.ins.2018.03.027

Abstract

In many real-world situations, data are updated and released over time. In sequential data publishing, the number of attributes and records may vary, and attribute values may be modified across releases. This can compromise privacy when different releases of the same data are combined. Preventing information disclosure becomes even harder when the adversary has background knowledge about correlations among sensitive attribute values over time. In this paper, we propose an anonymization framework to protect against the background knowledge attack in a sequential data publishing setting. Our framework recreates the adversary's inference process to estimate her posterior beliefs, and extends a privacy model to take those posterior beliefs into account. We propose a bottom-up sequential algorithm that uses local generalization to reduce information loss compared to other sequential anonymization algorithms that use global generalization. We verify the theoretical study by experiments on two datasets. Experimental results show that our proposed algorithm outperforms state-of-the-art sequential approaches such as CELL(FMJ) and TDS4ASR in terms of information loss, the adversary's information gain, and average adversary confidence.

Introduction

The growing demand for access to data about individuals has made individual privacy a major concern, and privacy-preserving data publishing targets this concern. Anonymization is one method to conceal the association between individuals and records in a microdata table while maintaining the usability of the data. Privacy models such as k-anonymity [1], β-likeness [33], and differential privacy [26] address scenarios where only a single static dataset is published. In practice, data sources are dynamic, and datasets may be updated and re-published periodically. For instance, data on disease progression, on the behaviour of individuals over time, or on the safety and effectiveness of a drug in slowing the progression of some disease might be published periodically. Thus, a privacy breach may emerge from any one of the releases or from the combination of information across releases over time.

There are several scenarios of data re-publishing [5], [6], [19], [20], [23], [24], [34], namely multiple data publishing, sequential data publishing, and continuous data publishing. Our focus in this paper is on sequential data publishing, in which the number of released attributes may vary, and attribute values may change due to updates. The main goal is to anonymize the next release such that the combination of information from all releases does not violate individuals' privacy. Moreover, the adversary's additional knowledge makes it harder to preserve the privacy of individuals. Researchers often assume that the adversary knows I) whether her targets exist in the dataset (e.g., a microdata table) and II) the quasi-identifier (QI) attribute values of her targets [1], [26], [33]. If the adversary has additional knowledge (usually called background knowledge), most privacy models cannot preserve individuals' privacy in either single-publishing or sequential-publishing scenarios. In particular, correlations among attribute values serve as an important piece of adversary knowledge leading to privacy breaches [7], [8], [29], [31]. In re-publishing, the adversary can also exploit correlations among sensitive attribute values across releases.

Most previous work on this problem has focused either on re-publishing techniques [5], [6], [19], [20], [23], [24], [34] or on integrating background knowledge into the privacy model for a single data release [7], [8], [29]. k-linkability and k-diversity are the state-of-the-art privacy models for sequential data publishing [9], [23], [34]. In this paper, we propose an anonymization framework for sequential data publishing when the adversary has background knowledge about correlations among attribute values.

We extend a privacy model based on k-anonymity and β-likeness so that it can be applied to sequential releases of data. Our privacy model takes the adversary's background knowledge into account and guarantees stronger privacy than k-linkability and k-diversity. It constrains the adversary's beliefs so that her beliefs about the associations within each QI group are similar, while k-anonymity and β-likeness remain satisfied.
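To make the two base conditions concrete, the following sketch checks k-anonymity and basic β-likeness for a single QI group. It is a minimal illustration rather than the paper's implementation: records are assumed to be dictionaries with a "sensitive" key, and table_dist is an assumed mapping from each sensitive value to its table-wide relative frequency.

    from collections import Counter

    def satisfies_k_anonymity(group, k):
        """A QI group is k-anonymous if it contains at least k records."""
        return len(group) >= k

    def satisfies_beta_likeness(group, table_dist, beta):
        """Basic beta-likeness: for every sensitive value s, the relative
        gain of its in-group frequency q_s over its table-wide frequency
        p_s must satisfy (q_s - p_s) / p_s <= beta whenever q_s > p_s."""
        counts = Counter(record["sensitive"] for record in group)
        size = len(group)
        for s, count in counts.items():
            p, q = table_dist[s], count / size
            if q > p and (q - p) / p > beta:
                return False
        return True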

Our main contribution in this paper is the knowledge-based sequential anonymization algorithm (KSAA), which satisfies the privacy model as data is re-published. KSAA is a bottom-up anonymization algorithm that uses local generalization, which leads to lower information loss. To protect against the background knowledge attack, KSAA first generates primary QI groups satisfying the privacy model in the current release. Then, KSAA checks whether joining the current release with the previous ones may violate the privacy model. In case of violation, KSAA merges the QI group with the closest one in terms of QI values so as to satisfy the privacy model. We model the possible joins between different releases as a multipartite graph whose nodes are the records observed in different releases. An edge connects two records in two different views if those records have consistent QI values. Every perfect matching in the graph represents a possible assignment of records to the same respondents across sequential releases. We protect against information disclosure across releases by hiding the true perfect matching among a number of other perfect matchings in the graph. Our method utilizes the correlations between sensitive values to narrow down the cliques included in a perfect matching.
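The join graph itself is straightforward to build. The sketch below is a simplified illustration rather than KSAA's actual code: it represents each generalized QI value as a set of leaf values and connects records of any two views whose QI tuples overlap on every attribute. Any perfect matching over the resulting multipartite graph is then one candidate linkage of records across releases.

    from itertools import combinations

    def consistent(qi_a, qi_b):
        """Two generalized QI tuples are consistent if, on every attribute,
        their value sets intersect (values are modelled as Python sets)."""
        return all(a & b for a, b in zip(qi_a, qi_b))

    def build_join_graph(releases):
        """releases: list of views, each a list of QI tuples (one per record).
        Returns edges ((view_i, rec_a), (view_j, rec_b)) between records of
        different views whose generalized QI values are consistent."""
        edges = []
        for (i, view_i), (j, view_j) in combinations(enumerate(releases), 2):
            for a, rec_a in enumerate(view_i):
                for b, rec_b in enumerate(view_j):
                    if consistent(rec_a, rec_b):
                        edges.append(((i, a), (j, b)))
        return edges

Intuitively, hiding the true matching requires that each record in one view remain consistent with several records in every other view.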

The work in [34], denoted CELL(FMJ) herein, is a closely related study, since it also models the possible joins between releases as a multipartite graph. CELL(FMJ) tries to satisfy either k-linkability or k-diversity using a top-down anonymization algorithm with global generalization. TDS4ASR [23] is another top-down anonymization algorithm aimed at satisfying k-linkability; it focuses on the case where only two releases of the same table are published. Neither CELL(FMJ) nor TDS4ASR considers the adversary's background knowledge.

Two datasets, the Census dataset [4] and the BKseq dataset [19], are used to evaluate the effectiveness of our anonymization framework. Our results cover different values of the privacy parameters and several measures. The empirical results indicate that our proposed algorithm outperforms state-of-the-art sequential anonymization algorithms such as TDS4ASR and CELL(FMJ) in terms of global certainty penalty, the adversary's gain of knowledge, and average adversary confidence.
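For reference, the global certainty penalty used as the information-loss measure is commonly computed from the normalized certainty penalty (NCP) of each generalized value. The sketch below is a minimal version for numeric QI attributes, assuming each anonymized record is a tuple of (low, high) generalization intervals; categorical attributes would use a taxonomy-based NCP instead.

    def ncp_numeric(interval, domain):
        """NCP of one generalized numeric value: the width of its
        generalization interval divided by the full domain range."""
        lo, hi = interval
        dom_lo, dom_hi = domain
        return (hi - lo) / (dom_hi - dom_lo) if dom_hi > dom_lo else 0.0

    def global_certainty_penalty(records, domains):
        """GCP: the average NCP over all records and all QI attributes.
        0 means no generalization; 1 means every value was generalized
        to its entire domain."""
        d = len(domains)
        total = sum(ncp_numeric(record[j], domains[j])
                    for record in records for j in range(d))
        return total / (d * len(records))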

Contributions: Our contributions are summarized as follows:

  • We propose an anonymization framework for sequential releases of data that takes the adversary's background knowledge into account. We extend a privacy model based on k-anonymity and β-likeness to sequential data publishing.

  • We propose a bottom-up sequential anonymization algorithm that uses local generalization to decrease information loss compared to other sequential anonymization methods with global generalization (see the sketch after this list). We perform extensive experiments on different aspects of the anonymization algorithms.
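To illustrate why local generalization loses less information, the following sketch recodes each QI group independently to the tightest intervals covering its own numeric QI values, instead of applying one recoding to the whole table. This is a simplified illustration under assumed numeric attributes; how KSAA actually forms and merges the groups is described in Section 4.

    def local_generalize(groups):
        """Local generalization: each QI group is recoded independently to
        the minimal intervals covering its own records, so the certainty
        penalty of one group does not inflate the others, unlike a single
        global recoding applied to the entire table."""
        generalized = []
        for group in groups:
            d = len(group[0])
            lows = [min(rec[j] for rec in group) for j in range(d)]
            highs = [max(rec[j] for rec in group) for j in range(d)]
            interval_tuple = tuple(zip(lows, highs))
            generalized.append([interval_tuple] * len(group))
        return generalized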

The rest of the paper is organized as follows. Related work is presented in Section 2. Section 3 formally defines the problem, including the privacy attack, the graph representation of records in a sequential release, the adversary's background knowledge, and how she infers the actual associations between individuals and sensitive values. Section 4 presents our proposed anonymization framework, including the privacy model and the anonymization algorithm. Experimental evaluation follows in Section 5. Finally, Section 6 concludes the paper and outlines future work.

Section snippets

Related work

To protect the privacy of individuals represented in databases, a large number of privacy models have been proposed. k-anonymity [1] and its refinements (e.g., β-likeness [33] and t-closeness [14]) are syntactic privacy models which partition records into groups, called QI groups. It is worth noting that the refinements do not replace k-anonymity; rather, we require them in addition to k-anonymity. Few works proposed a combination of privacy models to prevent identity and attribute disclosures [10],

Problem definition

Let $T=\{r_1, r_2, \ldots, r_N\}$ be a table of $N$ records. Each record $r_i$ corresponds to an individual $v$, called the record respondent. Each $r_i$ contains an identifier attribute $ID$, $D$ QI attributes $A_j$, $1 \leq j \leq D$, and a single sensitive attribute $A_{D+1}$. Let $[A_j]$ denote the attribute domain of $A_j$, $1 \leq j \leq D+1$. In particular, the domain of the sensitive attribute contains $M$ different values, namely $[A_{D+1}]=\{s_1, s_2, \ldots, s_M\}$. $r_n(A_j)$ denotes the $A_j$ value of record $r_n$, $1 \leq n \leq N$.

A sequential release is a sequence of $t$ views
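The following sketch renders this data model in code, purely as an illustration of the definitions above: each record carries an identifier, $D$ QI values, and one sensitive value, and a view of a sequential release projects a subset of the attributes. The project helper and its parameters are hypothetical names, not the paper's notation.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Record:
        """One row r_n of table T: identifier ID, QI values A_1..A_D,
        and a sensitive value drawn from [A_{D+1}] = {s_1, ..., s_M}."""
        rid: str
        qi: List[str]
        sensitive: str

    def project(records, qi_indices, keep_sensitive=True):
        """Build one view of a sequential release by keeping only the QI
        attributes at qi_indices (and optionally the sensitive attribute),
        mirroring how the released attributes may vary across views."""
        return [([r.qi[j] for j in qi_indices],
                 r.sensitive if keep_sensitive else None)
                for r in records]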

Sequential anonymization framework

In this section, we elaborate on our anonymization framework against the background knowledge attack in sequential data publishing. Section 4.1 describes the privacy model and a strategy for limiting the adversary's confidence. Our anonymization algorithm is described and discussed in Sections 4.2 and 4.3, respectively.

Experimental results

We report the evaluation results of our anonymization framework in sequential data publishing on two datasets and with respect to several measures. In Section 5.1, we describe the configuration and datasets. We then define the evaluation measures in Section 5.2 and present the results on our anonymization framework in Section 5.3.

Conclusion

We proposed an anonymization framework for sequential publishing that incorporates the adversary's background knowledge into the privacy model. We extended a single-publishing privacy model so that it can be applied to a sequential release of data. The privacy model, J-similarity, protects against both identity and attribute disclosures as well as the background knowledge attack. We proposed the knowledge-based sequential anonymization algorithm (KSAA), which guarantees to achieve the privacy model for

Acknowledgement

This research was in part supported by a grant from the School of Computer Science, Institute for Research in Fundamental Sciences (IPM).

References (34)

  • J. Domingo-Ferrer et al., Efficient multivariate data-oriented microaggregation, VLDB J. (2006)
  • C. Dwork, Differential privacy, in Proc. of the 33rd Int. Colloquium on Automata, Languages and Programming (ICALP),...
  • E.G. Komishani et al., PPTD: preserving personalized privacy in trajectory data publishing by sensitive attribute generalization and trajectory local suppression, Knowl.-Based Syst. (2015)
  • A. Gionis et al., k-concealment: an alternative model of k-type anonymity, Trans. Data Privacy (2012)
  • Q. Gong et al., Anonymizing 1:m microdata with high utility, Knowl.-Based Syst. (2016)
  • D. Kifer et al., No free lunch in data privacy, Proceedings of the International Conference on Management of Data (SIGMOD '11) (2011)
  • N. Li et al., t-closeness: privacy beyond k-anonymity and l-diversity, Proceedings of the 23rd IEEE International Conference on Data Engineering (ICDE '07) (2007)