Abstract
From the perspective of responsible data release, simulation is a useful tool for estimating risk from adversaries with an unknown amount of identified auxiliary information. We present a simple approach to simulating attacks on sampled datasets, along with an implementation, and demonstrate how a data steward might use it to evaluate the privacy risk of releasing data gathered about students in the University of California system.
Notes
1. The project repository is https://codeberg.org/bavajadas.de.benadam/PrivacySim.
2. We use the terms ‘attacker’ and ‘intruder’ interchangeably throughout.
3. In communications anonymity, a similar point is made by pointing out that “usability is a security property;” that is, that increased adoption of an anonymity system increases the total set of individuals that an attacker must individuate (as well as those individuals’ diversity, but that is a separate point). See Serjantov et al. (2003), and discussion of “degree of anonymity” in Berthold et al. (2001).
4. The insights in this paper can also be extended to attacks that seek attribute disclosure without unique reidentification of a single research subject.
5. We could add to this model the harm that could arise from incorrect matches that the attacker and others presume to be accurate, but with increased attention to the likelihood of false matches, this problem should diminish. (Put differently, every form of deidentification runs the risk that a careless or fraudulent intruder might claim to have reidentified a research subject when the chance that they have actually done so is very low.)
6. This interpretation might be generalized further to include cases in which groups of one kind or another share low plausible deniability.
7. The assumptions need not be uniform, either. A data steward can assign a greater likelihood of accessible auxiliary information to a data subject who is likely to be targeted, such as a Governor, or to the members of a vulnerable group who face special harms from a successful attack.
8. Rocher et al. (2019) estimated that there was a 23% chance the match was wrong. Ironically, this was the same study described in the New York Times article with the misleading headline “Your Data Were ‘Anonymized’? These Scientists Can Still Identify You.”
9. Nayak et al. (2016) assess the problem with formal privacy measures, like “differential privacy,” concluding “… for developing practical disclosure control goals, it is essential for the agency to consider intruders with limited prior information about their target units.” Elliot and Domingo-Ferrer (2018) note that “many authors have commented that this environment is inherently difficult—if not impossible—to understand and therefore directly assessing risk is itself impossible. This in turn has led to bad decision-making about data sharing (a strange mixture of over-caution and imprudence which is driven more often than not by the personality of the decision-maker rather than by rational processes).”
10. More technically, sampling alone could never meet Differential Privacy standards because any microdata release that does not involve perturbation or the creation of synthetic data will violate the Differential Privacy guarantee.
11. We allow simulation of noise levels in both the released data and the auxiliary data, though this is not explicit in the simulation steps below. See the code repository, above at note 1.
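A minimal sketch of what simulating noise in both datasets might look like, assuming pandas/NumPy data frames; the function and parameter names here are hypothetical, not the PrivacySim API:

```python
import numpy as np
import pandas as pd

def add_noise(df, columns, level, rng):
    """Replace a `level` fraction of cells in each of `columns` with values
    resampled uniformly from that column's observed values. A hypothetical
    helper illustrating record-level noise, not the paper's actual code."""
    noisy = df.copy()
    for col in columns:
        # Select roughly `level` of the rows at random for perturbation.
        mask = rng.random(len(noisy)) < level
        noisy.loc[mask, col] = rng.choice(df[col].to_numpy(), size=int(mask.sum()))
    return noisy

rng = np.random.default_rng(0)
released = pd.DataFrame({"age": [23, 45, 31, 62],
                         "zip": ["85701", "85701", "85702", "85702"]})
# Noise can be applied independently to the released and the auxiliary data.
aux = add_noise(released, ["age"], level=0.5, rng=rng)
```

The same helper can be applied to the released data, the auxiliary data, or both, which is what lets the simulation vary the two noise levels independently.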
12. This decision would be similar to the judgments that must be made when differentiating between quasi-identifiers and non-identifiers when implementing k-anonymity.
13. A slightly more sophisticated version of our methodology would include all matches, and sample uniformly to decide which records from the released data to match with which from the auxiliary data.
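The “all matches, sampled uniformly” variant can be sketched as follows; the function and variable names are illustrative, not taken from the paper's code:

```python
import random
from collections import defaultdict

def match_with_uniform_ties(released, auxiliary, keys, rng=random):
    """For each auxiliary record, collect every released record that agrees
    on the quasi-identifiers in `keys`, then choose one uniformly at random.
    A hypothetical sketch of uniform tie-breaking among candidate matches."""
    index = defaultdict(list)
    for i, rec in enumerate(released):
        index[tuple(rec[k] for k in keys)].append(i)
    matches = {}
    for j, rec in enumerate(auxiliary):
        candidates = index.get(tuple(rec[k] for k in keys), [])
        if candidates:
            matches[j] = rng.choice(candidates)
    return matches

released = [
    {"age": 30, "zip": "85701"},  # two released records tie on the keys
    {"age": 30, "zip": "85701"},
    {"age": 41, "zip": "85702"},
]
auxiliary = [{"age": 30, "zip": "85701"}]
matches = match_with_uniform_ties(released, auxiliary, keys=("age", "zip"))
```

Here the auxiliary record ties with two released records, so either is chosen with probability one half rather than, say, always taking the first.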
14. The data dictionary can be found in the data directory of our repository. See above at note 1.
15. See https://www.census.gov/programs-surveys/acs/data/pums.html for the PUMS data, and https://mimic.physionet.org/ for the MIMIC III data.
16. There were 1,620 parameter settings. For each iteration at a given setting, steps 1–9 mentioned above are performed. The full set of simulation runs is computationally intensive, so there are two implementations of the simulation code. One is designed to run serially and is suitable for small, slow runs on a single laptop; the other is designed to run in parallel on a high-performance computing (HPC) cluster. The cluster we used had some specific features, such as use of the PBS scheduler, but minor modifications should allow the code to be used on a variety of HPC setups. See the experimental_actors branch of the repository referenced above at note 1.
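The structure of such a parameter sweep can be sketched with a Cartesian product over parameter lists; the grid below is illustrative (27 settings rather than the paper's 1,620), and the parameter names are our assumptions:

```python
from itertools import product

# Illustrative grid of 27 settings; the actual experiment used 1,620.
sample_fractions = [0.01, 0.05, 0.10]
noise_levels = [0.0, 0.1, 0.2]
quasi_identifier_counts = [2, 3, 4]

def run_once(sample_fraction, noise_level, n_quasi_identifiers):
    """Placeholder for one iteration of simulation steps 1-9."""
    ...

settings = list(product(sample_fractions, noise_levels, quasi_identifier_counts))

# Serial driver; a cluster version would instead submit one job per setting
# (e.g. via a scheduler array job) and map run_once over the same grid.
for setting in settings:
    run_once(*setting)
```

Because each setting is independent, the same grid can be partitioned across cluster jobs without changing the per-run logic.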
References
Abowd, J.: Formal Privacy Methods for the 2020 Census. 2020 Census Program Memorandum Series: 2020.07 (2020)
Barth-Jones, D.: The Debate Over ‘Re-identification’ of Health Information: What Do We Risk? Health Affairs (2012a)
Barth-Jones, D.: The ‘Re-identification’ of Governor William Weld’s Medical Information: A Critical Re-examination of Health Data Identification Risks and Privacy Protections, Then and Now. Draft (2012b). https://fpf.org/wp-content/uploads/The-Re-identification-of-Governor-Welds-Medical-Information-Daniel-Barth-Jones.pdf
Berthold, O., Pfitzmann, A., Standtke, R.: The disadvantages of free MIX routes and how to overcome them. In: Federrath, H. (ed.) Designing Privacy Enhancing Technologies. LNCS, vol. 2009, pp. 30–45. Springer, Heidelberg (2001). https://doi.org/10.1007/3-540-44702-4_3
Bhaskar, R., Bhowmick, A., Goyal, V., Laxman, S., Thakurta, A.: Noiseless database privacy. In: Lee, D.H., Wang, X. (eds.) ASIACRYPT 2011. LNCS, vol. 7073, pp. 215–232. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-25385-0_12
Christensen, G., Miguel, E.: Transparency, Reproducibility, and the Credibility of Economics Research. NBER Working Paper No. 22989 (2016)
de Montjoye, Y.-A., Radaelli, L., Singh, V.K.: Unique in the shopping mall: on the reidentifiability of credit card metadata. Science 347, 536–539 (2015)
Domingo-Ferrer, J., Muralidhar, K.: New directions in anonymization: permutation paradigm, verifiability by subjects and intruders, transparency to users. Inf. Sci. 337, 11–24 (2016)
Dwork, C., Smith, A.: Differential privacy for statistics: what we know and what we want to learn. J. Priv. Confidentiality 1, 135–139 (2009)
Elliot, M., Domingo-Ferrer, J.: The future of statistical disclosure control. arXiv preprint arXiv:1812.09204 (2018)
Federal Committee on Statistical Methodology: Statistical Policy Working Paper 22: Report on Statistical Disclosure Limitation Methodology (2nd version). Office of Management and Budget, Executive Office of the President (2005)
Li, N., Qardaji, W., Su, D.: On sampling, anonymization, and differential privacy or, k-anonymization meets differential privacy. In: Proceedings of the 7th ACM Symposium on Information, Computer and Communications Security (2012)
Nayak, T., Zhang, C., You, J.: Measuring Identification Risk in Microdata Release and Its Control by Post-Randomization. Center for Disclosure Avoidance Research, U.S. Census Bureau Research Report Series #2016-02 (2016)
Ohm, P.: Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization. 57 UCLA L. Rev. 1701, 1719 (2010)
Ramachandran, A., Singh, L., Porter, E., Nagle, F.: Exploring re-identification risks in public domains. In: Tenth Annual International Conference on Privacy, Security and Trust, pp. 35–42 (2012)
Rocher, L., Hendrickx, J.M., De Montjoye, Y.-A.: Estimating the success of re-identifications in incomplete datasets using generative models. Nat. Commun. 10, 1–9 (2019)
Samarati, P.: Protecting respondents’ identities in microdata release. IEEE Trans. Knowl. Data Eng. 13, 1010–1027 (2001)
Serjantov, A., Dingledine, R., Syverson, P.: From a trickle to a flood: active attacks on several mix types. In: Petitcolas, F.A.P. (ed.) IH 2002. LNCS, vol. 2578, pp. 36–52. Springer, Heidelberg (2003). https://doi.org/10.1007/3-540-36415-3_3
Sweeney, L.: Achieving k-anonymity privacy protection using generalization and suppression. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 10(5), 571–588 (2002)
Copyright information
© 2020 Springer Nature Switzerland AG
Cite this paper
Sidi, D., Bambauer, J. (2020). Plausible Deniability. In: Domingo-Ferrer, J., Muralidhar, K. (eds) Privacy in Statistical Databases. PSD 2020. Lecture Notes in Computer Science(), vol 12276. Springer, Cham. https://doi.org/10.1007/978-3-030-57521-2_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-57520-5
Online ISBN: 978-3-030-57521-2