
Bistochastic Privacy

Conference paper in: Modeling Decisions for Artificial Intelligence (MDAI 2022)
Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13408)

Abstract

We introduce a new privacy model relying on bistochastic matrices, that is, matrices whose entries are nonnegative and whose rows and columns each sum to 1. This class of matrices serves both to define privacy guarantees and as a tool to apply protection to a data set. The bistochasticity assumption connects several fields of the privacy literature, including the two most popular models, k-anonymity and differential privacy. Moreover, it establishes a bridge with information theory, which simplifies the thorny issue of evaluating the utility of a protected data set. Bistochastic privacy also clarifies the trade-off between protection and utility by using bits, which can be viewed as a natural currency to comprehend and operationalize this trade-off, in the same way that bits are used in information theory to capture uncertainty. A discussion of the suitable parameterization of bistochastic matrices to achieve the privacy guarantees of this new model is also provided.


References

1. Chaudhuri, A., Mukerjee, R.: Randomized Response: Theory and Techniques. Marcel Dekker (1988)
2. Clifton, C., Tassa, T.: On syntactic anonymity and differential privacy. Trans. Data Priv. 6, 147–159 (2013)
3. Cover, T., Thomas, J.: Elements of Information Theory. Wiley (2012)
4. Domingo-Ferrer, J., Muralidhar, K.: New directions in anonymization: permutation paradigm, verifiability by subjects and intruders, transparency to users. Inf. Sci. 337–338, 11–24 (2016)
5. Domingo-Ferrer, J., Soria-Comas, J.: Connecting randomized response, post-randomization, differential privacy and t-closeness via deniability and permutation (2018). https://arxiv.org/abs/1803.02139
6. Domingo-Ferrer, J., Soria-Comas, J.: Multi-dimensional randomized response. IEEE Trans. Knowl. Data Eng. (to appear)
7. Dwork, C.: Differential privacy. In: Bugliesi, M., Preneel, B., Sassone, V., Wegener, I. (eds.) ICALP 2006. LNCS, vol. 4052, pp. 1–12. Springer, Heidelberg (2006). https://doi.org/10.1007/11787006_1
8. General Data Protection Regulation. European Union Regulation 2016/679 (2016)
9. Hundepool, A., et al.: Statistical Disclosure Control. Wiley (2012)
10. Jacobs, K.: Quantum Measurement Theory and its Applications. Cambridge University Press (2014)
11. Kooiman, P.L., Willenborg, L., Gouweleeuw, J.: PRAM: A Method for Disclosure Limitation of Microdata. Research Report 9705, Statistics Netherlands, Voorburg, NL (1998)
12. Marshall, A.W., Olkin, I., Arnold, B.C.: Inequalities: Theory of Majorization and its Applications. Springer Series in Statistics (2011)
13. Muralidhar, K., Domingo-Ferrer, J., Martinez, S.: ε-differential privacy for microdata releases does not guarantee confidentiality (let alone utility). In: Privacy in Statistical Databases (PSD 2020). LNCS, vol. 12276, pp. 21–31 (2020)
14. Samarati, P., Sweeney, L.: Protecting Privacy when Disclosing Information: k-Anonymity and its Enforcement through Generalization and Suppression. SRI International Report (1998)
15. Shannon, C.E.: Communication theory of secrecy systems. Bell Syst. Tech. J. 28(4), 656–715 (1949)
16. Wang, Y., Wu, X., Hu, D.: Using randomized response for differential privacy preserving data collection. In: Proceedings of the EDBT/ICDT 2016 Joint Conference (2016)
17. Wooldridge, J.: Econometric Analysis of Cross Section and Panel Data. MIT Press (2010)
18. Xiao, X., Tao, Y.: Anatomy: simple and effective privacy preservation. In: Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB 2006), pp. 139–150. VLDB Endowment (2006)


Acknowledgements

Partial funding from the European Commission under project H2020–871042 “SoBigData++” is acknowledged. The second author is also partially funded by an ICREA Acadèmia Prize.

Author information

Correspondence to Nicolas Ruiz.

Appendix

A.1 Randomized Response

Let X denote an original categorical attribute with \(1,\dots ,r\) categories, and let Y denote its anonymized version. Given a value X = u, randomized response (RR, [1]) computes a value Y = v by using an r × r Markov transition matrix:

$$P=\left(\begin{array}{ccc}{p}_{11}& \cdots & {p}_{1r}\\ \vdots & \ddots & \vdots \\ {p}_{r1}& \cdots & {p}_{rr}\end{array}\right)\qquad \mathrm{(A.1)}$$

where \({p}_{uv}=\mathrm{Pr}\left(Y=v|X=u\right)\) denotes the probability that the original response u in X is reported as v in Y, for \(u,v\in \left\{1,\dots ,r\right\}\). To be a proper Markov transition matrix, it must hold that \(\sum_{v=1}^{r}{p}_{uv}=1\) for all \(u=1,\dots ,r\). P is thus right stochastic, meaning that each original category is spread across the anonymized categories according to a probability distribution.
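
A minimal sketch in Python of such a matrix, assuming a simple symmetric parameterization (not the paper's own) in which each true category is kept with probability p_keep and otherwise replaced uniformly by one of the remaining r - 1 categories:

```python
import numpy as np

def rr_transition_matrix(r: int, p_keep: float) -> np.ndarray:
    """Hypothetical RR matrix: keep the true category with probability p_keep,
    otherwise report one of the other r-1 categories uniformly at random."""
    off_diag = (1.0 - p_keep) / (r - 1)
    P = np.full((r, r), off_diag)
    np.fill_diagonal(P, p_keep)
    return P

P = rr_transition_matrix(r=4, p_keep=0.7)
assert np.allclose(P.sum(axis=1), 1.0)   # right stochastic: every row sums to 1
```

Incidentally, this symmetric choice also makes each column sum to p_keep + (r - 1)·(1 - p_keep)/(r - 1) = 1, so the matrix is bistochastic, which is precisely the class of matrices studied in the paper.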

The usual setting in RR is that each subject computes her randomized response Y to be reported instead of her true response X. This is called the ex-ante or local anonymization mode. Nevertheless, it is also possible for a (trusted) data collector to gather the original responses from the subjects and randomize them in a centralized way. This ex-post mode corresponds to the Post-Randomization method (PRAM, [11]). Apart from who performs the anonymization, RR and PRAM operate the same way and make use of the same matrix P.
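
In either mode, the anonymization step amounts to drawing, for each record, the reported category from the row of P indexed by the true category. A minimal sketch, reusing the hypothetical matrix above:

```python
import numpy as np

def randomize(x: np.ndarray, P: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """RR/PRAM step: for each true category u (coded 0..r-1), draw the reported
    category from the u-th row of the transition matrix P."""
    r = P.shape[0]
    return np.array([rng.choice(r, p=P[u]) for u in x])

rng = np.random.default_rng(0)
r, p_keep = 4, 0.7
P = np.full((r, r), (1 - p_keep) / (r - 1))   # hypothetical matrix as above
np.fill_diagonal(P, p_keep)

x = rng.integers(0, r, size=1000)   # illustrative true responses (local or centralized)
y = randomize(x, P, rng)            # reported (anonymized) responses
```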

Let \({\pi }_{1},\dots ,{\pi }_{r}\) be the proportions of respondents whose true values fall in each of the r categories of X, and let \({\lambda }_{v}={\sum }_{u=1}^{r}{p}_{uv}{\pi }_{u}\) for \(v=1,\dots ,r\) be the probability of the reported value Y being v. If we define \(\lambda ={\left({\lambda }_{1},\dots ,{\lambda }_{r}\right)}^{T}\) and \(\pi ={\left({\pi }_{1},\dots ,{\pi }_{r}\right)}^{T}\), then \(\lambda ={P}^{T}\pi\). Furthermore, if P is nonsingular, it is proven in [1] that an unbiased estimator \(\widehat{\pi }\) of π can be obtained as \(\widehat{\pi }={\left({P}^{T}\right)}^{-1}\lambda\). Thus, univariate frequencies can easily be retrieved from the protected data set. Note that this procedure does not entail any privacy risk, as only estimates of the frequencies are retrieved, not specific responses that could be traced back to any individual.
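
In code, this retrieval reduces to solving the linear system \({P}^{T}\pi =\lambda\). A minimal sketch with illustrative proportions (in practice λ would be replaced by the observed proportions of the reported values, so the recovery is only approximate):

```python
import numpy as np

def estimate_pi(lam: np.ndarray, P: np.ndarray) -> np.ndarray:
    """Unbiased retrieval of the original proportions: solve P^T pi = lambda."""
    return np.linalg.solve(P.T, lam)   # requires P to be nonsingular

r, p_keep = 4, 0.7
P = np.full((r, r), (1 - p_keep) / (r - 1))   # hypothetical matrix as above
np.fill_diagonal(P, p_keep)

pi = np.array([0.4, 0.3, 0.2, 0.1])   # illustrative true proportions
lam = P.T @ pi                        # distribution of the reported values
print(estimate_pi(lam, P))            # recovers [0.4, 0.3, 0.2, 0.1]
```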

RR is based on an implicit privacy guarantee called plausible deniability [5]. It equips individuals with the ability to deny, with variable strength according to the parameterization of P, that they have reported a specific value. In fact, the more similar the probabilities in P, the higher the deniability. In the case where the probabilities within each column of P are identical, it can be proved that perfect secrecy in the Shannon sense is reached [15]: observing the anonymized attribute Y gives no information at all on the real value X. Under such a configuration, a privacy breach cannot originate from the release of an anonymized data set, as the release does not bring any information that could be used for an attack. However, as discussed in the paper, the price to pay in terms of data utility is high.
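
To make the perfect-secrecy claim explicit, suppose the probabilities within each column of P are identical, say \({p}_{uv}={c}_{v}\) for all u. A standard Bayes computation (spelled out here for convenience) gives

$$\mathrm{Pr}\left(X=u|Y=v\right)=\frac{{p}_{uv}{\pi }_{u}}{\sum_{w=1}^{r}{p}_{wv}{\pi }_{w}}=\frac{{c}_{v}{\pi }_{u}}{{c}_{v}\sum_{w=1}^{r}{\pi }_{w}}={\pi }_{u},$$

so the posterior on X after observing Y coincides with the prior: the anonymized value carries no information about the original one.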

A.2 The Permutation Model of SDC

The permutation model of statistical disclosure control conceptually unifies SDC methods by viewing them basically as permutation [4]. Consider an original attribute \(X=\left\{{x}_{1},\dots ,{x}_{n}\right\}\) observed on n individuals and its anonymized version \(Y=\left\{{y}_{1},\dots ,{y}_{n}\right\}\). Assume these attributes can be ranked; even categorical nominal attributes can be, using a semantic distance. For i = 1 to n: compute \(j=\mathrm{Rank}({y}_{i})\) and let \({z}_{i}={x}_{(j)}\), where \({x}_{(j)}\) is the value of X of rank j. Then call the attribute \(Z=\left\{{z}_{1},\dots ,{z}_{n}\right\}\) the reverse-mapped version of X. For example, if an original value \({x}_{1}\in X\) is anonymized as \({y}_{1}\in Y\), and \({y}_{1}\) is, say, the 3rd smallest value in Y, then take \({z}_{1}\) to be the 3rd smallest value in X. If there are several attributes in the original data set X and the anonymized data set Y, the previous reverse-mapping procedure is conducted for each attribute; call Z the data set formed by the reverse-mapped attributes.

Note that: i) a reverse-mapped attribute Z is a permutation of the corresponding original attribute X; ii) the rank order of Z is the same as the rank order of Y. Therefore, any SDC method for microdata—individual records—is functionally equivalent to permutation—transforming data set X into Z—followed by residual noise—transforming Z into the anonymized data set Y. The noise added is residual because by construction the ranks of Z and Y are the same.
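
A minimal sketch of the reverse-mapping step for a single numerical attribute (illustrative values, no ties assumed), followed by checks of properties i) and ii):

```python
import numpy as np

def reverse_map(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Reverse-map one attribute: z_i is the value of X whose rank equals the
    rank of y_i, so Z is a permutation of X sharing the rank order of Y."""
    x_sorted = np.sort(x)                  # x_(1) <= ... <= x_(n)
    ranks_y = np.argsort(np.argsort(y))    # 0-based rank of each y_i within Y
    return x_sorted[ranks_y]

x = np.array([10.0, 42.0, 7.0, 23.0])      # illustrative original attribute
y = np.array([12.0, 9.0, 35.0, 30.0])      # some anonymized version of x
z = reverse_map(x, y)                      # -> [10., 7., 42., 23.]

assert np.array_equal(np.sort(z), np.sort(x))        # i) Z is a permutation of X
assert np.array_equal(np.argsort(z), np.argsort(y))  # ii) Z has the rank order of Y
```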

A.3 Information Theory

Classically, information theory approaches the notion of information contained in a message as capturing how much the message reduces uncertainty about something [10]. As a result, in this theory information shares the same definition as entropy and choosing which term to use depends on whether it is given or taken away. For example, a high entropy attribute will convey a high initial uncertainty about its actual value. If we then learn the value, we have acquired an amount of information equal to the initial uncertainty, i.e. the entropy we had originally about the value. Thus, information and entropy are two sides of the same coin. In this paper, we propose to apply entropy to a data set in a controlled way. This operation will take away data utility from the user but will in exchange generate protection. As such, data utility and protection also become two sides of the same coin, albeit in that case they are inversely related.
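
As a concrete illustration with hypothetical distributions, entropy in bits quantifies this initial uncertainty about an attribute's value:

```python
import numpy as np

def entropy_bits(p) -> float:
    """Shannon entropy of a discrete distribution p, in bits (zero terms ignored)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

print(entropy_bits([0.25, 0.25, 0.25, 0.25]))   # 2.0 bits: maximal uncertainty over 4 values
print(entropy_bits([0.85, 0.05, 0.05, 0.05]))   # ~0.85 bits: a far more predictable attribute
```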

In information theory, a basic way to capture uncertainty is majorization [12]. Consider two vectors \(x={\left({x}_{1},\dots ,{x}_{N}\right)}^{T}\) and \(y={\left({y}_{1},\dots ,{y}_{N}\right)}^{T}\) representing probability distributions, with the elements of each vector sorted in decreasing order. The vector x is said to majorize y, usually denoted \(x\succ y\), if and only if the largest element of x is at least the largest element of y, the sum of the two largest elements of x is at least the sum of the two largest elements of y, and so on for every partial sum [10]. Equivalently, the probability distribution represented by x is more narrowly peaked than that of y, which implies that x conveys less uncertainty than y and thus has lower entropy than y.
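
The partial-sum criterion translates directly into code; a minimal sketch with two illustrative distributions:

```python
import numpy as np

def majorizes(x: np.ndarray, y: np.ndarray, tol: float = 1e-12) -> bool:
    """True if x majorizes y: sort both in decreasing order and require every
    partial sum of x to be at least the corresponding partial sum of y."""
    xs = np.sort(x)[::-1].cumsum()
    ys = np.sort(y)[::-1].cumsum()
    return bool(np.all(xs >= ys - tol))

x = np.array([0.7, 0.2, 0.1])      # more narrowly peaked distribution
y = np.array([0.4, 0.35, 0.25])    # more spread out distribution
print(majorizes(x, y))             # True: x conveys less uncertainty (lower entropy) than y
```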

In the privacy literature there is no such well-defined notion of information and no associated concepts such as majorization. What counts as information for the meaningful exploitation of a data set lies in the eye of the user. For example, one user may be interested in performing simple statistical requests such as cross-tabulations, and will therefore equate information with the analytical validity of such requests on the anonymized data, that is, their closeness to the same requests performed on the original data set. Another user may only be interested in performing econometric analyses, and will instead judge an anonymized data set informative based on, for example, the validity of some OLS outputs computed on it. Because the needs of users can be almost infinitely varied, one is left with a severe diversity problem when evaluating the information content of an anonymized data set. In the paper, we reasonably assume that the original data set always provides the highest utility and analytical value to the user, and thus that an anonymized data set always entails a loss of utility.


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Ruiz, N., Domingo-Ferrer, J. (2022). Bistochastic Privacy. In: Torra, V., Narukawa, Y. (eds.) Modeling Decisions for Artificial Intelligence. MDAI 2022. Lecture Notes in Computer Science, vol. 13408. Springer, Cham. https://doi.org/10.1007/978-3-031-13448-7_5


  • DOI: https://doi.org/10.1007/978-3-031-13448-7_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-13447-0

  • Online ISBN: 978-3-031-13448-7
