A Partitioned Recoding Scheme for Privacy Preserving Data Publishing

  • Conference paper
Privacy in Statistical Databases (PSD 2020)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12276)

Abstract

There is growing interest in Differential Privacy as a disclosure limitation mechanism for statistical data. The increased attention has brought to light a number of subtleties in the definition and mechanisms. We explore an interesting dichotomy in parallel composition, where a subtle difference in the definition of a “neighboring database” leads to significantly different results. We show that by “pre-partitioning” the data randomly into disjoint subsets, then applying well-known anonymization schemes to those pieces, we can eliminate this dichotomy. This provides potential operational benefits, with some interesting implications that give further insight into existing privacy schemes. We explore the theoretical limits of the privacy impacts of pre-partitioning, in the process illuminating some subtle distinctions in privacy definitions. We also discuss the resulting utility, including empirical evaluation of the impact on released privatized statistics.

Notes

  1. There is some subtlety here, as k-anonymity under global recoding is not assured, even if each partition element satisfies it.

Acknowledgments

This work was supported by the United States Census Bureau under CRADA CB16ADR0160002. The views and opinions expressed in this writing are those of the authors and not the U.S. Census Bureau.

Corresponding author

Correspondence to Shawn Merrill.

Appendix: Proof of Theorem 2

We first fix some notation. Let \(\mathcal {A}\) be an \(\varepsilon \)-differentially private mechanism. Let D be a dataset with \(|D| = n\) and choose an integer \(n' < n\). Fix some tuple \(t \in D\). We denote by \(Y_t\) the set of all subdatasets \(D_s \subset D\) with \(|D_s| = n'\) and \(t \in D_s\), and by \(N_t\) the set of all subdatasets \(D_s \subset D\) with \(|D_s| = n'\) and \(t \notin D_s\). We observe that

$$|Y_t| = \binom{n-1}{n'-1} \qquad |N_t| = \binom{n-1}{n'}.$$

For \(D' = D\setminus \{t\}\cup \{t'\}\) a neighbor of D, we define \(Y'_t\) and \(N'_t\) analogously. We observe that \(N_t = N'_t\).
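The two counts and the identity \(N_t = N'_t\) can be checked by brute-force enumeration on a toy dataset. The following is a minimal sketch, assuming a dataset of five distinct labels; the helper name `split_by_membership` and the sample values are illustrative and do not come from the paper.

```python
from itertools import combinations
from math import comb


def split_by_membership(dataset, t, n_sub):
    """Return (Y_t, N_t): the size-n_sub subsets that contain t and those that do not."""
    subsets = [frozenset(c) for c in combinations(dataset, n_sub)]
    with_t = {s for s in subsets if t in s}
    without_t = {s for s in subsets if t not in s}
    return with_t, without_t


# Toy dataset D with n = 5 and subdataset size n' = 3 (assumed values).
D = ["a", "b", "c", "d", "e"]
n_sub = 3
Y_t, N_t = split_by_membership(D, "a", n_sub)

# Neighbor D' = D \ {t} ∪ {t'} with t = "a" replaced by t' = "z".
D_prime = ["z" if x == "a" else x for x in D]
Y_t_prime, N_t_prime = split_by_membership(D_prime, "z", n_sub)

print(len(Y_t), comb(len(D) - 1, n_sub - 1))
print(len(N_t), comb(len(D) - 1, n_sub))
print(N_t == N_t_prime)
```

Subdatasets that avoid t never see the replaced tuple, which is why the families \(N_t\) and \(N'_t\) coincide in the enumeration.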

We will need the following lemma.

Lemma 3

Let \(t \in D\) and \(S \subset {\text {range}}(\mathcal {A})\). Then

$$ \sum _{D_s \in Y_t}\frac{P(\mathcal {A}(D_s)\in S)}{|Y_t|} \le e^\varepsilon \sum _{D_s \in N_t}\frac{P(\mathcal {A}(D_s) \in S)}{|N_t|}. $$

Proof

For each \(D_s \in Y_t\), we can replace the tuple t by any of the \(n-n'\) tuples in \(D\setminus D_s\) to create a dataset in \(N_t\) that is a neighbor of \(D_s\). Similarly, given any \(D_s \in N_t\), we can replace any of the \(n'\) tuples in \(D_s\) with t to create a dataset in \(Y_t\) that is a neighbor of \(D_s\).

Now consider

$$(n-n')\sum _{D_s \in Y_t} P(\mathcal {A}(D_s) \in S)$$

as counting each \(D_s \in Y_t\) with multiplicity \(n-n'\). We replace the \(n-n'\) copies of each \(D_s \in Y_t\) in this sum with its \(n-n'\) neighbors in \(N_t\). By differential privacy, each such replacement increases the probability by a factor of at most \(e^\varepsilon \). Moreover, each dataset in \(N_t\) occurs \(n'\) times in the new sum. Thus

$$ (n-n') \sum _{D_s \in Y_t}P(\mathcal {A}(D_s) \in S) \le e^\varepsilon n' \sum _{D_s \in N_t}P(\mathcal {A}(D_s) \in S). $$

The result now follows by dividing both sides by \((n-n')|Y_t|\) and using the observation that \(\frac{n'}{n-n'} = \frac{|Y_t|}{|N_t|}\).

This lemma captures the reason we have assumed the size of the subdataset to be fixed. In the unbounded case, if we delete a tuple t to pass from dataset D to \(D'\), then for each \(D_s \subseteq D\) with \(t \in D_s\), there is a unique \(D'_s \subseteq D'\) with \(d(D_s,D'_s) = 1\). Lemma 3 is our generalization of this fact to the bounded case.
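Lemma 3 can be checked exactly on a small example. The sketch below is only an illustration: it assumes the records are bits and uses the sum of per-record randomized responses as a stand-in for \(\mathcal {A}\), which is \(\varepsilon \)-DP under replacement of one record. The function names and the sample data are hypothetical. Since both sides of the lemma are additive in S, it is enough to test single outcomes.

```python
from itertools import combinations
from math import exp


def noisy_sum_dist(bits, eps):
    """Exact distribution of the sum of per-record randomized responses.

    Each bit is reported truthfully with probability e^eps / (1 + e^eps).
    This is an assumed stand-in for the mechanism A, not the paper's mechanism.
    """
    p_true = exp(eps) / (1.0 + exp(eps))
    dist = [1.0]  # dist[k] = P(noisy sum == k) over the records seen so far
    for b in bits:
        p_one = p_true if b == 1 else 1.0 - p_true  # P(reported bit == 1)
        nxt = [0.0] * (len(dist) + 1)
        for k, pr in enumerate(dist):
            nxt[k] += pr * (1.0 - p_one)
            nxt[k + 1] += pr * p_one
        dist = nxt
    return dist


def average_dist(bits, subsets, eps):
    """Average the output distribution over a family of index subsets."""
    avg = [0.0] * (len(subsets[0]) + 1)
    for s in subsets:
        for k, pr in enumerate(noisy_sum_dist([bits[i] for i in s], eps)):
            avg[k] += pr / len(subsets)
    return avg


def lemma3_holds(bits, t, n_sub, eps, tol=1e-12):
    """Check the Lemma 3 inequality for every single-outcome event S = {k}."""
    idx = range(len(bits))
    y_t = [s for s in combinations(idx, n_sub) if t in s]
    n_t = [s for s in combinations(idx, n_sub) if t not in s]
    lhs = average_dist(bits, y_t, eps)
    rhs = average_dist(bits, n_t, eps)
    return all(l <= exp(eps) * r + tol for l, r in zip(lhs, rhs))


bits = [1, 0, 0, 1, 0, 0]  # toy dataset, n = 6
print(all(lemma3_holds(bits, t, 3, 0.5) for t in range(len(bits))))
```

The tolerance `tol` only absorbs floating-point rounding; it is not part of the lemma.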

We are now ready to prove Theorem 2, which we restate here for convenience.

Theorem 2

Let \(\mathcal {A}\) satisfy \(\varepsilon \)-DP. Let D be a dataset with \(|D| = n\) and choose an integer \(n' < n\). We denote \(\beta = n'/n\). Choose a subdataset \(D' \subset D\) with \(|D'| = n'\) uniformly at random. Then the mechanism which returns \(\mathcal {A}(D')\) satisfies \(\varepsilon '\)-DP, where

$$\varepsilon ' = \ln \left( \frac{e^\varepsilon \beta + 1 - \beta }{1 - \beta }\right) .$$
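To get a feel for the bound, the formula for \(\varepsilon '\) can be evaluated directly. This is a small sketch; the function name `subsampled_epsilon` is our own and not from the paper.

```python
import math


def subsampled_epsilon(eps, beta):
    """Evaluate eps' = ln((e^eps * beta + 1 - beta) / (1 - beta)) from Theorem 2."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta = n'/n must lie strictly between 0 and 1")
    return math.log((math.exp(eps) * beta + 1.0 - beta) / (1.0 - beta))


for beta in (0.1, 0.5):
    print(beta, subsampled_epsilon(1.0, beta))
```

With \(\varepsilon = 1\), a sampling fraction of \(\beta = 0.1\) gives \(\varepsilon ' \approx 0.264\), while \(\beta = 0.5\) gives \(\varepsilon ' = \ln (1+e) \approx 1.313\). The bound therefore falls below \(\varepsilon \) only when \(\beta \) is small enough, and it grows without limit as \(\beta \rightarrow 1\).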

Proof

Let \(S \subset {\text {range}}(\mathcal {A})\) and let \(D' = D\setminus \{t\}\cup \{t'\}\) be a neighbor of D. We will use the law of total probability twice, conditioning first on whether \(t \in D_s\) (i.e. on whether \(D_s \in Y_t\) or \(D_s \in N_t\)), then on the specific subdataset chosen as \(D_s\). This gives

$$\begin{aligned} P(\mathcal {A}(D_s) \in S) &= \beta \sum _{D_t \in Y_t}\frac{P(\mathcal {A}(D_t) \in S)}{|Y_t|} + (1-\beta )\sum _{D_t \in N_t}\frac{P(\mathcal {A}(D_t) \in S)}{|N_t|}\\ &\le \beta e^\varepsilon \sum _{D_t \in N_t}\frac{P(\mathcal {A}(D_t)\in S)}{|N_t|} + (1-\beta ) \sum _{D_t \in N_t}\frac{P(\mathcal {A}(D_t)\in S)}{|N_t|}\\ &= (\beta e^\varepsilon + 1-\beta ) \sum _{D_t \in N_t}\frac{P(\mathcal {A}(D_t) \in S)}{|N_t|}\\ &= (\beta e^\varepsilon + 1-\beta ) \sum _{D'_s \in N'_{t}}\frac{P(\mathcal {A}(D'_s) \in S)}{|N'_{t}|} \end{aligned}$$

by the lemma and the fact that \(N_t = N'_t\). By analogous reasoning, we have

$$\begin{aligned} P(\mathcal {A}(D'_s) \in S) &= \beta \sum _{D'_t \in Y'_t} \frac{P(\mathcal {A}(D'_t) \in S)}{|Y'_t|} + (1-\beta )\sum _{D'_t \in N'_t}\frac{P(\mathcal {A}(D'_t) \in S)}{|N'_t|}\\ &\ge (1-\beta ) \sum _{D'_t \in N'_t}\frac{P(\mathcal {A}(D'_t) \in S)}{|N'_t|}. \end{aligned}$$

Combining these two inequalities yields

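$$ P(\mathcal {A}(D_s)\in S) \le \frac{\beta e^\varepsilon + 1-\beta }{1-\beta }P(\mathcal {A}(D'_s)\in S). $$

Since S and the neighboring datasets D and \(D'\) were arbitrary, and the same argument applies with the roles of D and \(D'\) exchanged, the mechanism satisfies \(\varepsilon '\)-DP with \(\varepsilon ' = \ln \left( \frac{e^\varepsilon \beta + 1 - \beta }{1 - \beta }\right) \), as claimed.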
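The final inequality can also be checked end to end on a toy instance. The sketch below rests on illustrative assumptions: records are bits, \(\mathcal {A}\) is the sum of per-record randomized responses (an \(\varepsilon \)-DP mechanism chosen only for this example), and neighbors are obtained by flipping one bit. All function names are hypothetical. The code computes the exact output distribution of the subsampled mechanism on D and on each neighbor, then compares the worst log-ratio with the bound \(\varepsilon '\).

```python
from itertools import combinations
from math import exp, log


def noisy_sum_dist(bits, eps):
    """Exact distribution of the sum of per-record randomized responses (assumed A)."""
    p_true = exp(eps) / (1.0 + exp(eps))
    dist = [1.0]
    for b in bits:
        p_one = p_true if b == 1 else 1.0 - p_true  # P(reported bit == 1)
        nxt = [0.0] * (len(dist) + 1)
        for k, pr in enumerate(dist):
            nxt[k] += pr * (1.0 - p_one)
            nxt[k + 1] += pr * p_one
        dist = nxt
    return dist


def subsampled_dist(bits, n_sub, eps):
    """Output distribution of A run on a uniformly random size-n_sub subdataset."""
    subsets = list(combinations(range(len(bits)), n_sub))
    out = [0.0] * (n_sub + 1)
    for s in subsets:
        for k, pr in enumerate(noisy_sum_dist([bits[i] for i in s], eps)):
            out[k] += pr / len(subsets)
    return out


def worst_log_ratio(bits, n_sub, eps):
    """Largest |log ratio| of outcome probabilities over all one-bit replacements."""
    base = subsampled_dist(bits, n_sub, eps)
    worst = 0.0
    for i in range(len(bits)):
        neighbor = list(bits)
        neighbor[i] = 1 - neighbor[i]  # replace tuple t by t'
        other = subsampled_dist(neighbor, n_sub, eps)
        for a, b in zip(base, other):
            worst = max(worst, abs(log(a / b)))
    return worst


def theorem2_bound(eps, beta):
    """eps' from Theorem 2."""
    return log((exp(eps) * beta + 1.0 - beta) / (1.0 - beta))


bits = [1, 0, 0, 1, 0, 0]  # toy dataset, n = 6, n' = 3, so beta = 0.5
print(worst_log_ratio(bits, 3, 0.5), theorem2_bound(0.5, 0.5))
```

Testing single outcomes suffices because the probability of any event S is a sum over its outcomes. The check covers only this one mechanism and dataset, so it illustrates the theorem rather than proving it.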

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Clifton, C., Hanson, E.J., Merrill, K., Merrill, S., Zahraa, A. (2020). A Partitioned Recoding Scheme for Privacy Preserving Data Publishing. In: Domingo-Ferrer, J., Muralidhar, K. (eds) Privacy in Statistical Databases. PSD 2020. Lecture Notes in Computer Science, vol 12276. Springer, Cham. https://doi.org/10.1007/978-3-030-57521-2_4

  • DOI: https://doi.org/10.1007/978-3-030-57521-2_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-57520-5

  • Online ISBN: 978-3-030-57521-2

  • eBook Packages: Computer Science, Computer Science (R0)
