Abstract
From the perspective of responsible data release, simulation is a useful tool for estimating risk from adversaries with an unknown amount of identified auxiliary information. We present a simple approach to simulating attacks on sampled datasets, along with an implementation, and demonstrate how a data steward might use it to evaluate the privacy risk of releasing data gathered about students in the University of California system.
Notes
1. The project repository is https://codeberg.org/bavajadas.de.benadam/PrivacySim.
2. We use the terms ‘attacker’ and ‘intruder’ interchangeably throughout.
3. In communications anonymity, a similar point is made by pointing out that “usability is a security property;” that is, that increased adoption of an anonymity system increases the total set of individuals that an attacker must individuate (as well as those individuals’ diversity, but that is a separate point). See Serjantov et al. (2003), and discussion of “degree of anonymity” in Berthold et al. (2001).
4. The insights in this paper can also be extended to attacks that seek attribute disclosure without unique reidentification of a single research subject.
5. We could add to this model the harm that could arise from incorrect matches that the attacker and others presume to be accurate, but with increased attention to the likelihood of false matches, this problem should diminish. (Put differently, every form of deidentification runs the risk that a careless or fraudulent intruder might claim to have reidentified a research subject when the chance that they have actually done so is very low.)
6. This interpretation might be generalized further to include cases in which groups of one kind or another share low plausible deniability.
7. The assumptions need not be uniform, either. A data steward can assign a greater likelihood of accessible auxiliary information to a data subject who is likely to be targeted, such as a Governor, or to the members of a vulnerable group who face special harms from a successful attack.
8. Rocher et al. (2019) estimated that there was a 23% chance the match was wrong. Ironically, this was the same study described in the New York Times article with the misleading headline “Your Data Were ‘Anonymized’? These Scientists Can Still Identify You.”
9. Nayak et al. (2016) assess the problem with formal privacy measures, like “differential privacy,” concluding “… for developing practical disclosure control goals, it is essential for the agency to consider intruders with limited prior information about their target units.” Elliot and Domingo-Ferrer (2018) note that “many authors have commented that this environment is inherently difficult—if not impossible—to understand and therefore directly assessing risk is itself impossible. This in turn has led to bad decision-making about data sharing (a strange mixture of over-caution and imprudence which is driven more often than not by the personality of the decision-maker rather than by rational processes).”
10. More technically, sampling alone could never meet Differential Privacy standards because any microdata release that does not involve perturbation or the creation of synthetic data will violate the Differential Privacy guarantee.
11. We allow simulation of noise levels in both the released data and the auxiliary data, though this is not explicit in the simulation steps below. See the code repository, above at note 1.
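A minimal sketch of what simulating noise in both datasets might look like, assuming pandas/NumPy data frames; the function and parameter names here are hypothetical, not the PrivacySim API:

```python
import numpy as np
import pandas as pd

def add_noise(df, columns, level, rng):
    """Replace a `level` fraction of cells in each of `columns` with values
    resampled uniformly from that column's observed values. A hypothetical
    helper illustrating record-level noise, not the paper's actual code."""
    noisy = df.copy()
    for col in columns:
        # Select roughly `level` of the rows at random for perturbation.
        mask = rng.random(len(noisy)) < level
        noisy.loc[mask, col] = rng.choice(df[col].to_numpy(), size=int(mask.sum()))
    return noisy

rng = np.random.default_rng(0)
released = pd.DataFrame({"age": [23, 45, 31, 62],
                         "zip": ["85701", "85701", "85702", "85702"]})
# Noise can be applied independently to the released and the auxiliary data.
aux = add_noise(released, ["age"], level=0.5, rng=rng)
```

The same helper can be applied to the released data, the auxiliary data, or both, which is what lets the simulation vary the two noise levels independently.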
12. This decision would be similar to the judgments that must be made when differentiating between quasi-identifiers and non-identifiers when implementing k-anonymity.
13. A slightly more sophisticated version of our methodology would include all matches, and sample uniformly to decide which records from the released data to match with which from the auxiliary data.
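The “all matches, sampled uniformly” variant can be sketched as follows; the function and variable names are illustrative, not taken from the paper's code:

```python
import random
from collections import defaultdict

def match_with_uniform_ties(released, auxiliary, keys, rng=random):
    """For each auxiliary record, collect every released record that agrees
    on the quasi-identifiers in `keys`, then choose one uniformly at random.
    A hypothetical sketch of uniform tie-breaking among candidate matches."""
    index = defaultdict(list)
    for i, rec in enumerate(released):
        index[tuple(rec[k] for k in keys)].append(i)
    matches = {}
    for j, rec in enumerate(auxiliary):
        candidates = index.get(tuple(rec[k] for k in keys), [])
        if candidates:
            matches[j] = rng.choice(candidates)
    return matches

released = [
    {"age": 30, "zip": "85701"},  # two released records tie on the keys
    {"age": 30, "zip": "85701"},
    {"age": 41, "zip": "85702"},
]
auxiliary = [{"age": 30, "zip": "85701"}]
matches = match_with_uniform_ties(released, auxiliary, keys=("age", "zip"))
```

Here the auxiliary record ties with two released records, so either is chosen with probability one half rather than, say, always taking the first.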
14. The data dictionary can be found in the data directory of our repository. See above at note 1.
15. See https://www.census.gov/programs-surveys/acs/data/pums.html for the PUMS data, and https://mimic.physionet.org/ for the MIMIC III data.
16. There were 1,620 parameter settings. For each iteration at a given setting, steps 1–9 mentioned above are performed. The full set of simulation runs is computationally intensive, so there are two implementations of the simulation code. One is designed to run serially and is suitable for small, slow runs on a single laptop; the other is designed to run in parallel on a high-performance computing (HPC) cluster. The cluster we used had some specific features, such as use of the PBS scheduler, but minor modifications should allow the code to be used on a variety of HPC setups. See the experimental_actors branch of the repository referenced above at note 1.
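The structure of such a parameter sweep can be sketched with a Cartesian product over parameter lists; the grid below is illustrative (27 settings rather than the paper's 1,620), and the parameter names are our assumptions:

```python
from itertools import product

# Illustrative grid of 27 settings; the actual experiment used 1,620.
sample_fractions = [0.01, 0.05, 0.10]
noise_levels = [0.0, 0.1, 0.2]
quasi_identifier_counts = [2, 3, 4]

def run_once(sample_fraction, noise_level, n_quasi_identifiers):
    """Placeholder for one iteration of simulation steps 1-9."""
    ...

settings = list(product(sample_fractions, noise_levels, quasi_identifier_counts))

# Serial driver; a cluster version would instead submit one job per setting
# (e.g. via a scheduler array job) and map run_once over the same grid.
for setting in settings:
    run_once(*setting)
```

Because each setting is independent, the same grid can be partitioned across cluster jobs without changing the per-run logic.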
References
Abowd, J.: Formal Privacy Methods for the 2020 Census. 2020 Census Program Memorandum Series: 2020.07 (2020)
Barth-Jones, D.: The Debate Over ‘Re-identification’ of Health Information: What Do We Risk? Health Affairs (2012a)
Barth-Jones, D.: The ‘Re-identification’ of Governor William Weld’s Medical Information: A Critical Re-examination of Health Data Identification Risks and Privacy Protections, Then and Now. Draft (2012b). https://fpf.org/wp-content/uploads/The-Re-identification-of-Governor-Welds-Medical-Information-Daniel-Barth-Jones.pdf
Berthold, O., Pfitzmann, A., Standtke, R.: The disadvantages of free MIX routes and how to overcome them. In: Federrath, H. (ed.) Designing Privacy Enhancing Technologies. LNCS, vol. 2009, pp. 30–45. Springer, Heidelberg (2001). https://doi.org/10.1007/3-540-44702-4_3
Bhaskar, R., Bhowmick, A., Goyal, V., Laxman, S., Thakurta, A.: Noiseless database privacy. In: Lee, D.H., Wang, X. (eds.) ASIACRYPT 2011. LNCS, vol. 7073, pp. 215–232. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-25385-0_12
Christensen, G., Miguel, E.: Transparency, Reproducibility, and the Credibility of Economics Research. NBER Working Paper No. 22989 (2016)
de Montjoye, Y.-A., Radaelli, L., Singh, V.K.: Unique in the shopping mall: on the reidentifiability of credit card metadata. Science 347, 536–539 (2015)
Domingo-Ferrer, J., Muralidhar, K.: New directions in anonymization: permutation paradigm, verifiability by subjects and intruders, transparency to users. Inf. Sci. 337, 11–24 (2016)
Dwork, C., Smith, A.: Differential privacy for statistics: what we know and what we want to learn. J. Priv. Confidentiality 1, 135–139 (2009)
Elliot, M., Domingo-Ferrer, J.: The future of statistical disclosure control. arXiv preprint arXiv:1812.09204 (2018)
Federal Committee on Statistical Methodology: Statistical Policy Working Paper 22: Report on Statistical Disclosure Limitation Methodology (2nd version). Office of Management and Budget, Executive Office of the President (2005)
Li, N., Qardaji, W., Su, D.: On sampling, anonymization, and differential privacy or, k-anonymization meets differential privacy. In: Proceedings of the 7th ACM Symposium on Information, Computer and Communications Security (2012)
Nayak, T., Zhang, C., You, J.: Measuring Identification Risk in Microdata Release and Its Control by Post-Randomization. Center for Disclosure Avoidance Research, U.S. Census Bureau Research Report Series #2016-02 (2016)
Ohm, P.: Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization. 57 UCLA L. Rev. 1701, 1719 (2010)
Ramachandran, A., Singh, L., Porter, E., Nagle, F.: Exploring re-identification risks in public domains. In: Tenth Annual International Conference on Privacy, Security and Trust, pp. 35–42 (2012)
Rocher, L., Hendrickx, J.M., De Montjoye, Y.-A.: Estimating the success of re-identifications in incomplete datasets using generative models. Nat. Commun. 10, 1–9 (2019)
Samarati, P.: Protecting respondents’ identities in microdata release. IEEE Trans. Knowl. Data Eng. 13, 1010–1027 (2001)
Serjantov, A., Dingledine, R., Syverson, P.: From a trickle to a flood: active attacks on several mix types. In: Petitcolas, F.A.P. (ed.) IH 2002. LNCS, vol. 2578, pp. 36–52. Springer, Heidelberg (2003). https://doi.org/10.1007/3-540-36415-3_3
Sweeney, L.: Achieving k-anonymity privacy protection using generalization and suppression. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 10(5), 571–588 (2002)
Copyright information
© 2020 Springer Nature Switzerland AG
Cite this paper
Sidi, D., Bambauer, J. (2020). Plausible Deniability. In: Domingo-Ferrer, J., Muralidhar, K. (eds) Privacy in Statistical Databases. PSD 2020. Lecture Notes in Computer Science(), vol 12276. Springer, Cham. https://doi.org/10.1007/978-3-030-57521-2_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-57520-5
Online ISBN: 978-3-030-57521-2