A Partitioned Recoding Scheme for Privacy Preserving Data Publishing

Clifton, Chris; Hanson, Eric J.; Merrill, Keith; Merrill, Shawn; Zahraa, Amjad

doi:10.1007/978-3-030-57521-2_4

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12276)

Included in the following conference series:

International Conference on Privacy in Statistical Databases

Abstract

There is growing interest in Differential Privacy as a disclosure limitation mechanism for statistical data. The increased attention has brought to light a number of subtleties in the definition and mechanisms. We explore an interesting dichotomy in parallel composition, where a subtle difference in the definition of a “neighboring database” leads to significantly different results. We show that by “pre-partitioning” the data randomly into disjoint subsets, then applying well-known anonymization schemes to those pieces, we can eliminate this dichotomy. This provides potential operational benefits, with some interesting implications that give further insight into existing privacy schemes. We explore the theoretical limits of the privacy impacts of pre-partitioning, in the process illuminating some subtle distinctions in privacy definitions. We also discuss the resulting utility, including empirical evaluation of the impact on released privatized statistics.

Notes

1. There is some subtlety here, as k-anonymity under global recoding is not assured, even if each partition element satisfies it.

Acknowledgments

This work was supported by the United States Census Bureau under CRADA CB16ADR0160002. The views and opinions expressed in this writing are those of the authors and not the U.S. Census Bureau.

Author information

Authors and Affiliations

Department of Computer Science and CERIAS, Purdue University, West Lafayette, IN, 47907, USA
Chris Clifton, Shawn Merrill & Amjad Zahraa
Department of Mathematics, Brandeis University, Waltham, MA, 02453, USA
Eric J. Hanson & Keith Merrill

Corresponding author

Correspondence to Shawn Merrill.

Editor information

Editors and Affiliations

Rovira i Virgili University, Tarragona, Catalonia, Spain
Josep Domingo-Ferrer
University of Oklahoma, Norman, OK, USA
Krishnamurty Muralidhar

Appendix: Proof of Theorem 2

We first fix some notation. Let $\mathcal {A}$ be an $\varepsilon $-differentially private mechanism. Let D be a dataset with $|D| = n$ and choose an integer $n' < n$. Fix some tuple $t \in D$. We denote by $Y_t$ the set of all subdatasets $D_s \subset D$ with $|D_s| = n'$ and $t \in D_s$, and by $N_t$ the set of all subdatasets $D_s \subset D$ with $|D_s| = n'$ and $t \notin D_s$. We observe that

$$|Y_t| = \binom{n-1}{n'-1}, \qquad |N_t| = \binom{n-1}{n'}.$$

For $D' = D\setminus \{t\}\cup \{t'\}$ a neighbor of D, we define $Y'_t$ and $N'_t$ analogously. We observe that $N_t = N'_t$.
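The counts above, and the observation that $N_t = N'_t$, can be checked by brute force on a small example. The sketch below is only an illustration: the helper name `split_subsets`, the integer labels used as tuples, and the sizes $n = 7$, $n' = 3$ are assumptions made here and do not come from the paper.

```python
from itertools import combinations
from math import comb


def split_subsets(dataset, size, t):
    """Enumerate all subsets of the given size, split by whether they contain t."""
    with_t, without_t = set(), set()
    for subset in combinations(sorted(dataset), size):
        target = with_t if t in subset else without_t
        target.add(frozenset(subset))
    return with_t, without_t


n, n_prime = 7, 3              # illustrative sizes (assumption)
D = set(range(n))              # tuples are represented by the labels 0..n-1
t, t_new = 0, 99               # t is replaced by t_new to form the neighbor D'
D_neighbor = (D - {t}) | {t_new}

Y_t, N_t = split_subsets(D, n_prime, t)
Y_t_nb, N_t_nb = split_subsets(D_neighbor, n_prime, t_new)

print(len(Y_t) == comb(n - 1, n_prime - 1))   # |Y_t| matches the binomial count
print(len(N_t) == comb(n - 1, n_prime))       # |N_t| matches the binomial count
print(N_t == N_t_nb)                          # subsets avoiding t coincide for D and D'
```

Here $N'_t$ is read as the set of size-$n'$ subsets of $D'$ that avoid the replacement tuple $t'$, which is how the "analogous" definition is interpreted in this sketch.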

We will need the following lemma.

Lemma 3

Let $t \in D$ and $S \subset {\text {range}}(\mathcal {A})$. Then

$$ \sum _{D_s \in Y_t}\frac{P(\mathcal {A}(D_s)\in S)}{|Y_t|} \le e^\varepsilon \sum _{D_s \in N_t}\frac{P(\mathcal {A}(D_s) \in S)}{|N_t|}. $$

Proof

For each $D_s \in Y_t$, we can replace the tuple t by any of the $n-n'$ tuples in $D\setminus D_s$ to create a dataset in $N_t$ that is a neighbor of $D_s$. Similarly, given any $D_s \in N_t$, we can replace any of the $n'$ tuples in $D_s$ with t to create a dataset in $Y_t$ that is a neighbor of $D_s$.

Now consider

$$(n-n')\sum _{D_s \in Y_t} P(\mathcal {A}(D_s) \in S)$$

as counting each $D_s \in Y_t$ with multiplicity $n-n'$. Thus we replace the $n-n'$ copies of $D_s \in Y_t$ in this sum with its $n-n'$ neighbors in $N_t$. By differential privacy, each such change causes the probability to grow by no more than $e^\varepsilon $. Moreover, each dataset in $N_t$ will occur $n'$ times in the new sum. Thus

$$ (n-n') \sum _{D_s \in Y_t}P(\mathcal {A}(D_s) \in S) \le e^\varepsilon n' \sum _{D_s \in N_t}P(\mathcal {A}(D_s) \in S). $$

The result now follows from the observation that $\frac{n'}{n-n'} = \frac{|Y_t|}{|N_t|}$.
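Lemma 3 can be sanity-checked numerically for one concrete mechanism. The sketch below assumes, purely for illustration, that $\mathcal{A}$ is a counting query perturbed with two-sided geometric noise, which is $\varepsilon$-DP for a sensitivity-1 count; the paper does not name a specific mechanism, and the function names and the sample dataset are hypothetical. It checks the inequality for singleton events $S = \{k\}$; a general $S$ follows by summing over its elements.

```python
from itertools import combinations
from math import exp


def geometric_pmf(k, true_count, eps):
    """P(output = k) for a count plus two-sided geometric noise (assumed mechanism)."""
    alpha = exp(-eps)
    return (1 - alpha) / (1 + alpha) * alpha ** abs(k - true_count)


def lemma3_sides(D, t, n_prime, eps, k):
    """Return (average over Y_t, e^eps * average over N_t) of P(A(D_s) = k).

    D is a list of 0/1 values; tuples are identified by their index, and t is
    the index of the fixed tuple.
    """
    in_Y, in_N = [], []
    for idx in combinations(range(len(D)), n_prime):
        p = geometric_pmf(k, sum(D[i] for i in idx), eps)
        (in_Y if t in idx else in_N).append(p)
    return sum(in_Y) / len(in_Y), exp(eps) * sum(in_N) / len(in_N)


D = [1, 0, 1, 1, 0, 0, 1]      # illustrative dataset (assumption)
eps, n_prime, t = 0.5, 3, 0

# The left side should never exceed the right side, for every output k.
holds = all(
    lhs <= rhs + 1e-12
    for lhs, rhs in (lemma3_sides(D, t, n_prime, eps, k) for k in range(-5, n_prime + 6))
)
print(holds)
```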

This lemma captures the reason we have assumed the size of the subdataset to be fixed. In the unbounded case, if we delete a tuple t to pass from dataset D to $D'$, then for each $D_s \subseteq D$ with $t \in D_s$, there is a unique $D'_s \subseteq D'$ with $d(D_s,D'_s) = 1$. Lemma 3 is our generalization of this fact to the bounded case, in which neighboring datasets differ by replacing one tuple.

We are now ready to prove Theorem 2, which we restate here for convenience.

Theorem 2

Let $\mathcal {A}$ satisfy $\varepsilon $-DP. Let D be a dataset with $|D| = n$ and choose an integer $n' < n$. We denote $\beta = n'/n$. Choose a subdataset $D_s \subset D$ with $|D_s| = n'$ uniformly at random. Then the mechanism which returns $\mathcal {A}(D_s)$ satisfies $\varepsilon '$-DP, where

$$\varepsilon ' = \ln \left( \frac{e^\varepsilon \beta + 1 - \beta }{1 - \beta }\right) .$$
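To get a feel for the bound, the formula for $\varepsilon'$ can be evaluated directly. The following is a minimal sketch; the function name `subsampled_epsilon` and the sample values of $\varepsilon$ and $\beta$ are assumptions chosen for illustration.

```python
from math import exp, log


def subsampled_epsilon(eps, beta):
    """Evaluate eps' = ln((e^eps * beta + 1 - beta) / (1 - beta)) from Theorem 2."""
    if eps < 0:
        raise ValueError("eps must be non-negative")
    if not 0 < beta < 1:
        raise ValueError("beta = n'/n must lie strictly between 0 and 1")
    return log((exp(eps) * beta + 1 - beta) / (1 - beta))


# Illustrative values only: the bound for a fixed eps and several sampling rates.
for beta in (0.01, 0.1, 0.25, 0.5):
    print(beta, round(subsampled_epsilon(1.0, beta), 4))
```

For a fixed $\varepsilon$ the expression increases with $\beta$, and it grows without bound as $\beta$ approaches 1 because of the $1 - \beta$ in the denominator.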

Proof

Let $S \subset {\text {range}}(\mathcal {A})$ and let $D' = D\setminus \{t\}\cup \{t'\}$ be a neighbor of D. We will use the law of total probability twice, conditioning first on whether $t \in D_s$ (i.e. on whether $D_s \in Y_t$ or $D_s \in N_t$), then on the specific subdataset chosen as $D_s$. This gives

$$\begin{aligned} P(\mathcal{A}(D_s) \in S) &= \beta \sum_{D_t \in Y_t}\frac{P(\mathcal{A}(D_t) \in S)}{|Y_t|} + (1-\beta)\sum_{D_t \in N_t}\frac{P(\mathcal{A}(D_t) \in S)}{|N_t|}\\ &\le \beta e^\varepsilon \sum_{D_t \in N_t}\frac{P(\mathcal{A}(D_t)\in S)}{|N_t|} + (1-\beta) \sum_{D_t \in N_t}\frac{P(\mathcal{A}(D_t)\in S)}{|N_t|}\\ &= (\beta e^\varepsilon + 1-\beta) \sum_{D_t \in N_t}\frac{P(\mathcal{A}(D_t) \in S)}{|N_t|}\\ &= (\beta e^\varepsilon + 1-\beta) \sum_{D'_s \in N'_{t}}\frac{P(\mathcal{A}(D'_s) \in S)}{|N'_{t}|} \end{aligned}$$

by Lemma 3 and the fact that $N_t = N'_t$. By analogous reasoning, we have

$$\begin{aligned} P(\mathcal{A}(D'_s) \in S) &= \beta \sum_{D'_t \in Y'_t} \frac{P(\mathcal{A}(D'_t) \in S)}{|Y'_t|} + (1-\beta)\sum_{D'_t \in N'_t}\frac{P(\mathcal{A}(D'_t) \in S)}{|N'_t|}\\ &\ge (1-\beta) \sum_{D'_t \in N'_t}\frac{P(\mathcal{A}(D'_t) \in S)}{|N'_t|}. \end{aligned}$$

Combining these two inequalities yields

$$ P(\mathcal {A}(D_s)\in S) \le \frac{\beta e^\varepsilon + 1-\beta }{1-\beta }P(\mathcal {A}(D'_s)\in S). $$
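The final inequality can be checked exhaustively on a small instance. As before, this is only a sketch under stated assumptions: $\mathcal{A}$ is taken to be a count with two-sided geometric noise (an $\varepsilon$-DP mechanism chosen here for illustration, not one specified in the paper), the datasets are hypothetical, and only singleton events $S = \{k\}$ are examined.

```python
from itertools import combinations
from math import exp, log


def geometric_pmf(k, true_count, eps):
    """P(output = k) for a count plus two-sided geometric noise (assumed mechanism)."""
    alpha = exp(-eps)
    return (1 - alpha) / (1 + alpha) * alpha ** abs(k - true_count)


def sampled_pmf(D, n_prime, eps, k):
    """P(A(D_s) = k) when D_s is a uniformly random size-n' subset of D."""
    subsets = list(combinations(range(len(D)), n_prime))
    total = sum(geometric_pmf(k, sum(D[i] for i in idx), eps) for idx in subsets)
    return total / len(subsets)


D = [1, 0, 1, 1, 0, 0, 1, 0]   # illustrative dataset (assumption)
D_neighbor = [0] + D[1:]       # neighbor: the tuple at index 0 is replaced
eps, n_prime = 1.0, 2
beta = n_prime / len(D)
eps_prime = log((exp(eps) * beta + 1 - beta) / (1 - beta))

# Largest absolute log-ratio of output probabilities over a window of outputs.
worst = max(
    abs(log(sampled_pmf(D, n_prime, eps, k) / sampled_pmf(D_neighbor, n_prime, eps, k)))
    for k in range(-5, n_prime + 6)
)
print(worst <= eps_prime)
```

Taking the absolute value of the log-ratio covers both directions of the guarantee, since the theorem applies equally with the roles of $D$ and $D'$ exchanged.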

About this paper

Cite this paper

Clifton, C., Hanson, E.J., Merrill, K., Merrill, S., Zahraa, A. (2020). A Partitioned Recoding Scheme for Privacy Preserving Data Publishing. In: Domingo-Ferrer, J., Muralidhar, K. (eds) Privacy in Statistical Databases. PSD 2020. Lecture Notes in Computer Science, vol 12276. Springer, Cham. https://doi.org/10.1007/978-3-030-57521-2_4

DOI: https://doi.org/10.1007/978-3-030-57521-2_4
Published: 16 September 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-57520-5
Online ISBN: 978-3-030-57521-2
eBook Packages: Computer Science, Computer Science (R0)
