Bottom-up sequential anonymization in the presence of adversary knowledge
Introduction
The growing demand for access to data about individuals has made individual privacy a major concern, and privacy-preserving data publishing addresses this concern. Anonymization is one method to conceal the association between individuals and records in a microdata table while maintaining the usability of the data. Privacy models such as k-anonymity [1], β-likeness [33], and differential privacy [26] address scenarios where only a single static dataset is published. In practice, data sources are dynamic, and datasets may be updated and re-published periodically. For instance, data on disease progression, behaviour analysis of individuals over time, or investigations of the safety and effectiveness of a drug in slowing the progression of some disease might be published periodically. Thus, a privacy breach may emerge from any one of the releases or from the combination of information across releases over time.
There are several scenarios of data re-publishing [5], [6], [19], [20], [23], [24], [34], namely multiple data publishing, sequential data publishing, and continuous data publishing. Our focus in this paper is on sequential data publishing, in which the set of released attributes may vary between releases; attribute values may also change due to updates. The main goal is to anonymize the next release such that the combination of information from all releases does not violate individuals’ privacy. Moreover, the adversary’s additional knowledge makes it difficult to preserve the privacy of individuals. Researchers often assume that the adversary knows (I) whether its targets exist in the dataset (e.g., a microdata table) and (II) the quasi-identifier (QI) attribute values of those targets [1], [26], [33]. If an adversary has additional knowledge (usually called background knowledge), most privacy models cannot preserve individuals’ privacy in either single-publishing or sequential-publishing scenarios. In particular, correlations among attribute values serve as an important piece of adversary knowledge leading to privacy breaches [7], [8], [29], [31]. In re-publishing, the adversary can also exploit the correlation among sensitive attribute values over different releases.
Most previous work on this task has focused either on re-publishing [5], [6], [19], [20], [23], [24], [34] or on integrating background knowledge into the privacy model for a single data release [7], [8], [29]. k-linkability and k-diversity are the state-of-the-art privacy models in sequential data publishing [9], [23], [34]. In this paper, we propose an anonymization framework for sequential data publishing when the adversary has background knowledge about correlations among attribute values.
We extend our privacy model, based on k-anonymity and β-likeness, for use across sequential releases of data. The model accounts for the adversary’s background knowledge and guarantees stronger privacy than k-linkability and k-diversity. It constrains the adversary’s beliefs so that the beliefs about associations within each QI group are similar, while k-anonymity and β-likeness are satisfied.
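As a concrete illustration of the two underlying constraints, consider the following minimal sketch. It is not the paper’s privacy model; the function names, the record encoding, and the simplified relative-distance form of the β-likeness test are our own assumptions for illustration.

```python
from collections import Counter

def is_k_anonymous(groups, k):
    """k-anonymity: every QI group must contain at least k records."""
    return all(len(g) >= k for g in groups)

def is_beta_like(groups, sensitive_of, beta):
    """Simplified beta-likeness: within every QI group, the relative gain in
    the frequency of any sensitive value over its table-wide frequency must
    not exceed beta."""
    records = [r for g in groups for r in g]
    overall = Counter(sensitive_of(r) for r in records)
    n = len(records)
    for g in groups:
        local = Counter(sensitive_of(r) for r in g)
        for s, c in local.items():
            p = overall[s] / n      # table-wide frequency of s
            q = c / len(g)          # frequency of s inside this QI group
            if q > p and (q - p) / p > beta:
                return False
    return True

# Toy partition: two QI groups of (id, sensitive-value) records.
groups = [[("a", "flu"), ("b", "flu")], [("c", "hiv"), ("d", "flu")]]
```

Here the second group fails β-likeness for small β: "hiv" occurs with frequency 0.5 inside the group but only 0.25 table-wide, a relative gain of 1.0.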
Our main contribution in this paper is the knowledge-based sequential anonymization algorithm (KSAA), which satisfies the privacy model as data is re-published. KSAA is a bottom-up anonymization algorithm that uses local generalization, leading to lower information loss. To protect against background knowledge attacks, KSAA generates primary QI groups satisfying the privacy model in the current release. Then, KSAA checks whether the join of the current release with the previous ones may violate the privacy model. In case of violation, KSAA merges the QI group with the closest one in terms of QI values so as to achieve the privacy model. We model possible joins between different releases as a multipartite graph whose nodes are the records observed in different releases. An edge connects two records in two different views if those records contain consistent QI values. Every perfect matching in the graph represents a possible assignment of record correspondences between sequential releases. We protect against information disclosure across releases by hiding the target perfect matching among a number of other perfect matchings in the graph. Our method utilizes the correlations between sensitive values to narrow down the cliques included in a perfect matching.
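The join model above can be sketched for the two-release case: records of the two releases form a bipartite consistency graph, and each perfect matching is one assignment the adversary cannot rule out. This is an illustrative brute-force sketch, not KSAA itself; the interval encoding of generalized QI values and all names are our assumptions.

```python
from itertools import permutations

def consistent(qi1, qi2):
    """Two QI tuples are consistent if, per attribute, their generalized
    intervals overlap (a scalar value v is treated as [v, v])."""
    for a, b in zip(qi1, qi2):
        lo1, hi1 = a if isinstance(a, tuple) else (a, a)
        lo2, hi2 = b if isinstance(b, tuple) else (b, b)
        if hi1 < lo2 or hi2 < lo1:
            return False
    return True

def perfect_matchings(release1, release2):
    """Enumerate all perfect matchings of the bipartite consistency graph
    (brute force; only practical for small QI groups)."""
    n = len(release1)
    for perm in permutations(range(n)):
        if all(consistent(release1[i], release2[perm[i]]) for i in range(n)):
            yield perm

# Two releases of the same 3 individuals; one QI attribute (Age),
# generalized to intervals.
r1 = [((20, 29),), ((20, 29),), ((30, 39),)]
r2 = [((25, 29),), ((20, 24),), ((30, 39),)]
matchings = list(perfect_matchings(r1, r2))
```

With these values the true assignment is hidden among two perfect matchings: the third records can only match each other, while the first two are interchangeable.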
The work in [34], denoted CELL(FMJ) herein, is closely related, since it also models possible joins between releases as a multipartite graph. CELL(FMJ) tries to satisfy either k-linkability or k-diversity using a top-down anonymization algorithm with global generalization. TDS4ASR [23] is another top-down anonymization algorithm that satisfies k-linkability; it focuses on cases in which only two releases of the same table are published. Neither CELL(FMJ) nor TDS4ASR considers the adversary’s background knowledge.
Two datasets, the Census dataset [4] and the BKseq dataset [19], are used to evaluate the effectiveness of our anonymization framework. Our results cover different values of the privacy parameters and several measures. The empirical results indicate that our proposed algorithm outperforms state-of-the-art sequential anonymization algorithms such as TDS4ASR and CELL(FMJ) in terms of global certainty penalty, the adversary’s gain of knowledge, and average adversary confidence.
Contributions: Our contributions are summarized as follows:
- We propose an anonymization framework for sequential releases of data, which takes into account the adversary’s background knowledge. We extend a privacy model based on k-anonymity and β-likeness to sequential data publishing.
- We propose a bottom-up sequential anonymization algorithm using local generalization, which decreases information loss compared to sequential anonymization methods that use global generalization. We perform extensive experiments covering different aspects of the anonymization algorithms.
The rest of the paper is organized as follows. Related work is presented in Section 2. Section 3 formally defines the problem, including the privacy attack, the graph representation of records in a sequential release, the adversary’s background knowledge, and how the adversary infers the actual associations between individuals and sensitive values. Section 4 presents our proposed anonymization framework, including the privacy model and the anonymization algorithm. Experimental evaluation follows in Section 5. Finally, Section 6 concludes the paper and outlines future work.
Related work
To protect the privacy of individuals represented in databases, a large number of privacy models have been proposed. k-anonymity [1] and its refinements (e.g., β-likeness [33] and t-closeness [14]) are syntactic privacy models which partition records into groups, called QI groups. It is worth noting that the refinements do not replace k-anonymity; we require them in addition to k-anonymity. A few works proposed combinations of privacy models to prevent identity and attribute disclosures [10],
Problem definition
Let T be a table of N records. Each record r_i corresponds to an individual v, called the record respondent. Each r_i contains an identifier attribute ID, D QI attributes A_j, 1 ≤ j ≤ D, and a single sensitive attribute S. Let Ω[A_j] denote the attribute domain of A_j, 1 ≤ j ≤ D. In particular, the domain of the sensitive attribute contains M different values, namely s_1, …, s_M. r_n(A_j) denotes the A_j value of record r_n, 1 ≤ n ≤ N.
A sequential release is a sequence of t views
Sequential anonymization framework
In this section, we elaborate our anonymization framework against the background knowledge attack in sequential data publishing. Section 4.1 describes the privacy model and a strategy for limiting the adversary’s confidence. Our anonymization algorithm is described and discussed in Sections 4.2 and 4.3, respectively.
Experimental results
We report our evaluation results of the anonymization framework in sequential data publishing using two datasets regarding different measures. In Section 5.1, we describe the configuration and datasets. Then, we define evaluation measures in Section 5.2. In Section 5.3, we present results on our anonymization framework.
Conclusion
We proposed an anonymization framework in sequential publishing that considers the adversary’s background knowledge in the privacy model. We extended a privacy model in single publishing in order to be applied in a sequential release of data. The privacy model, J-similarity, protects against both identity and attribute disclosures as well as background knowledge attack. We proposed the knowledge based sequential anonymization algorithm (KSAA) which guarantees to achieve the privacy model for
Acknowledgement
This research was in part supported by a grant from the School of Computer Science, Institute for Research in Fundamental Sciences (IPM).
References (34)
- et al., Hierarchical anonymization algorithms against background knowledge attack in data releasing, Knowl.-Based Syst., 2016.
- et al., Enhancing data utility in differential privacy via microaggregation-based k-anonymity, VLDB J., 2014.
- et al., t-closeness through microaggregation: strict privacy with enhanced utility preservation, IEEE TKDE, 2015.
- 2014 Alzheimer’s Disease Facts and Figures, Alzheimer’s Association, https://www.alz.org/, September...
- et al., Efficient sanitization of unsafe data correlations, Proceedings of the Workshops of the EDBT/ICDT 2015 Joint Conference, 2015.
- et al., Anonymizing sequential releases under arbitrary updates, Proceedings of the Joint EDBT/ICDT 2013 Workshops, 2013.
- et al., Privacy, accuracy, and consistency too: a holistic solution to contingency table release, Proceedings of the 26th ACM Symposium on Principles of Database Systems (PODS), 2007.
- et al., Privacy preserving serial data publishing by role composition, Proc. VLDB Endowment, 2008.
- et al., Publishing microdata with a robust privacy guarantee, PVLDB, 2012.
- Census dataset, http://www.ipums.org, June...
- Efficient multivariate data-oriented microaggregation, VLDB J.
- PPTD: preserving personalized privacy in trajectory data publishing by sensitive attribute generalization and trajectory local suppression, Knowl.-Based Syst.
- k-concealment: an alternative model of k-type anonymity, Trans. Data Privacy.
- Anonymizing 1:m microdata with high utility, Knowl.-Based Syst.
- No free lunch in data privacy, Proceedings of the International Conference on Management of Data (SIGMOD ’11).
- t-closeness: privacy beyond k-anonymity and l-diversity, Proceedings of the 23rd IEEE International Conference on Data Engineering (ICDE ’07).