
Bistochastic Privacy

Conference paper in: Modeling Decisions for Artificial Intelligence (MDAI 2022)
Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13408)

Abstract

We introduce a new privacy model relying on bistochastic matrices, that is, matrices whose entries are nonnegative and whose rows and columns each sum to 1. This class of matrices serves both to define privacy guarantees and as a tool to apply protection to a data set. The bistochasticity assumption connects several fields of the privacy literature, including the two most popular models, k-anonymity and differential privacy. Moreover, it establishes a bridge with information theory, which simplifies the thorny issue of evaluating the utility of a protected data set. Bistochastic privacy also clarifies the trade-off between protection and utility by using bits, which can be viewed as a natural currency to comprehend and operationalize this trade-off, in the same way that bits are used in information theory to capture uncertainty. A discussion of the suitable parameterization of bistochastic matrices to achieve the privacy guarantees of this new model is also provided.


References

1. Chaudhuri, A., Mukerjee, R.: Randomized Response: Theory and Techniques. Marcel Dekker (1988)
2. Clifton, C., Tassa, T.: On syntactic anonymity and differential privacy. Trans. Data Priv. 6, 147–159 (2013)
3. Cover, T., Thomas, J.: Elements of Information Theory. Wiley (2012)
4. Domingo-Ferrer, J., Muralidhar, K.: New directions in anonymization: permutation paradigm, verifiability by subjects and intruders, transparency to users. Inf. Sci. 337–338, 11–24 (2016)
5. Domingo-Ferrer, J., Soria-Comas, J.: Connecting randomized response, post-randomization, differential privacy and t-closeness via deniability and permutation (2018). https://arxiv.org/abs/1803.02139
6. Domingo-Ferrer, J., Soria-Comas, J.: Multi-dimensional randomized response. IEEE Trans. Knowl. Data Eng. (to appear)
7. Dwork, C.: Differential privacy. In: Bugliesi, M., Preneel, B., Sassone, V., Wegener, I. (eds.) ICALP 2006. LNCS, vol. 4052, pp. 1–12. Springer, Heidelberg (2006). https://doi.org/10.1007/11787006_1
8. General Data Protection Regulation. European Union Regulation 2016/679 (2016)
9. Hundepool, A., et al.: Statistical Disclosure Control. Wiley (2012)
10. Jacobs, K.: Quantum Measurement Theory and its Applications. Cambridge University Press (2014)
11. Kooiman, P.L., Willenborg, L., Gouweleeuw, J.: PRAM: A Method for Disclosure Limitation of Microdata. Research Report 9705, Statistics Netherlands, Voorburg, NL (1998)
12. Marshall, A.W., Olkin, I., Arnold, B.C.: Inequalities: Theory of Majorization and its Applications. Springer Series in Statistics (2011)
13. Muralidhar, K., Domingo-Ferrer, J., Martinez, S.: ε-differential privacy for microdata releases does not guarantee confidentiality (let alone utility). In: Privacy in Statistical Databases (PSD 2020). LNCS, vol. 12276, pp. 21–31 (2020)
14. Samarati, P., Sweeney, L.: Protecting Privacy when Disclosing Information: k-Anonymity and its Enforcement through Generalization and Suppression. SRI International Report (1998)
15. Shannon, C.E.: Communication theory of secrecy systems. Bell Syst. Tech. J. 28(4), 656–715 (1949)
16. Wang, Y., Wu, X., Hu, D.: Using randomized response for differential privacy preserving data collection. In: Proceedings of the EDBT/ICDT 2016 Joint Conference (2016)
17. Wooldridge, J.: Econometric Analysis of Cross Section and Panel Data. MIT Press (2010)
18. Xiao, X., Tao, Y.: Anatomy: simple and effective privacy preservation. In: Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB 2006), pp. 139–150. VLDB Endowment (2006)


Acknowledgements

Partial funding from the European Commission under project H2020–871042 “SoBigData++” is acknowledged. The second author is also partially funded by an ICREA Acadèmia Prize.

Author information

Correspondence to Nicolas Ruiz.

Appendix

A.1 Randomized Response

Let X denote an original categorical attribute with \(1,\dots ,r\) categories, and let Y denote its anonymized version. Given a value X = u, randomized response (RR, [1]) computes a value Y = v by using an r × r Markov transition matrix:

$$P=\left(\begin{array}{ccc}{p}_{11}& \cdots & {p}_{1r}\\ \vdots & \ddots & \vdots \\ {p}_{r1}& \cdots & {p}_{rr}\end{array}\right)\qquad \mathrm{(A.1)}$$

where \({p}_{uv}=\mathrm{Pr}\left(Y=v|X=u\right)\) denotes the probability that the original response u in X is reported as v in Y, for \(u,v\in \left\{1,\dots ,r\right\}\). To be a proper Markov transition matrix, it must hold that \(\sum_{v=1}^{r}{p}_{uv}=1\) for all \(u=1,\dots ,r\). P is thus right stochastic, meaning that each original category is spread across the anonymized categories according to a probability distribution.
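
A minimal sketch in Python of such a matrix, assuming a simple symmetric parameterization (not the paper's own) in which each true category is kept with probability p_keep and otherwise replaced uniformly by one of the remaining r - 1 categories:

```python
import numpy as np

def rr_transition_matrix(r: int, p_keep: float) -> np.ndarray:
    """Hypothetical RR matrix: keep the true category with probability p_keep,
    otherwise report one of the other r-1 categories uniformly at random."""
    off_diag = (1.0 - p_keep) / (r - 1)
    P = np.full((r, r), off_diag)
    np.fill_diagonal(P, p_keep)
    return P

P = rr_transition_matrix(r=4, p_keep=0.7)
assert np.allclose(P.sum(axis=1), 1.0)   # right stochastic: every row sums to 1
```

Incidentally, this symmetric choice also makes each column sum to p_keep + (r - 1)·(1 - p_keep)/(r - 1) = 1, so the matrix is bistochastic, which is precisely the class of matrices studied in the paper.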

The usual setting in RR is that each subject computes her randomized response Y to be reported instead of her true response X. This is called the ex-ante or local anonymization mode. Nevertheless, it is also possible for a (trusted) data collector to gather the original responses from the subjects and randomize them in a centralized way. This ex-post mode corresponds to the Post-Randomization method (PRAM, [11]). Apart from who performs the anonymization, RR and PRAM operate the same way and make use of the same matrix P.
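
In either mode, the anonymization step amounts to drawing, for each record, the reported category from the row of P indexed by the true category. A minimal sketch, reusing the hypothetical matrix above:

```python
import numpy as np

def randomize(x: np.ndarray, P: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """RR/PRAM step: for each true category u (coded 0..r-1), draw the reported
    category from the u-th row of the transition matrix P."""
    r = P.shape[0]
    return np.array([rng.choice(r, p=P[u]) for u in x])

rng = np.random.default_rng(0)
r, p_keep = 4, 0.7
P = np.full((r, r), (1 - p_keep) / (r - 1))   # hypothetical matrix as above
np.fill_diagonal(P, p_keep)

x = rng.integers(0, r, size=1000)   # illustrative true responses (local or centralized)
y = randomize(x, P, rng)            # reported (anonymized) responses
```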

Let \({\pi }_{1},\dots ,{\pi }_{r}\) be the proportions of respondents whose true values fall in each of the r categories of X, and let \({\lambda }_{v}={\sum }_{u=1}^{r}{p}_{uv}{\pi }_{u}\) for \(v=1,\dots ,r\) be the probability of the reported value Y being v. If we define \(\lambda ={\left({\lambda }_{1},\dots ,{\lambda }_{r}\right)}^{T}\) and \(\pi ={\left({\pi }_{1},\dots ,{\pi }_{r}\right)}^{T}\), then \(\lambda ={P}^{T}\pi\). Furthermore, if P is nonsingular, it is proven in [1] that an unbiased estimator \(\widehat{\pi }\) of π can be obtained as \(\widehat{\pi }={\left({P}^{T}\right)}^{-1}\lambda\). Thus, univariate frequencies can easily be retrieved from the protected data set. Note that this procedure does not entail any privacy risk, as only estimates of the frequencies are retrieved, not specific responses that could be traced back to any individual.
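
In code, this retrieval reduces to solving the linear system \({P}^{T}\pi =\lambda\). A minimal sketch with illustrative proportions (in practice λ would be replaced by the observed proportions of the reported values, so the recovery is only approximate):

```python
import numpy as np

def estimate_pi(lam: np.ndarray, P: np.ndarray) -> np.ndarray:
    """Unbiased retrieval of the original proportions: solve P^T pi = lambda."""
    return np.linalg.solve(P.T, lam)   # requires P to be nonsingular

r, p_keep = 4, 0.7
P = np.full((r, r), (1 - p_keep) / (r - 1))   # hypothetical matrix as above
np.fill_diagonal(P, p_keep)

pi = np.array([0.4, 0.3, 0.2, 0.1])   # illustrative true proportions
lam = P.T @ pi                        # distribution of the reported values
print(estimate_pi(lam, P))            # recovers [0.4, 0.3, 0.2, 0.1]
```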

RR is based on an implicit privacy guarantee called plausible deniability [5]. It equips individuals with the ability to deny, with variable strength according to the parameterization of P, that they have reported a specific value. In fact, the more similar the probabilities in P, the higher the deniability. In the case where the probabilities within each column of P are identical, it can be proved that perfect secrecy in the Shannon sense is reached [15]: observing the anonymized attribute Y gives no information at all on the real value X. Under such a configuration, a privacy breach cannot originate from the release of an anonymized data set, as the release does not bring any information that could be used for an attack. However, as discussed in the paper, the price to pay in terms of data utility is high.
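
To make the perfect-secrecy claim explicit, suppose the probabilities within each column of P are identical, say \({p}_{uv}={c}_{v}\) for all u. A standard Bayes computation (spelled out here for convenience) gives

$$\mathrm{Pr}\left(X=u|Y=v\right)=\frac{{p}_{uv}{\pi }_{u}}{\sum_{w=1}^{r}{p}_{wv}{\pi }_{w}}=\frac{{c}_{v}{\pi }_{u}}{{c}_{v}\sum_{w=1}^{r}{\pi }_{w}}={\pi }_{u},$$

so the posterior on X after observing Y coincides with the prior: the anonymized value carries no information about the original one.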

A.2 The Permutation Model of SDC

The permutation model of statistical disclosure control conceptually unifies SDC methods by viewing them basically as permutation [4]. Consider an original attribute \(X=\left\{{x}_{1},\dots ,{x}_{n}\right\}\) observed on n individuals and its anonymized version \(Y=\left\{{y}_{1},\dots ,{y}_{n}\right\}\). Assume these attributes can be ranked; even categorical nominal attributes can be, using a semantic distance. For i = 1 to n: compute \(j=\mathrm{Rank}({y}_{i})\) and let \({z}_{i}={x}_{(j)}\), where \({x}_{(j)}\) is the value of X of rank j. Then call the attribute \(Z=\left\{{z}_{1},\dots ,{z}_{n}\right\}\) the reverse-mapped version of X. For example, if an original value \({x}_{1}\in X\) is anonymized as \({y}_{1}\in Y\), and \({y}_{1}\) is, say, the 3rd smallest value in Y, then take \({z}_{1}\) to be the 3rd smallest value in X. If there are several attributes in the original data set X and the anonymized data set Y, the previous reverse-mapping procedure is conducted for each attribute; call Z the data set formed by the reverse-mapped attributes.

Note that: i) a reverse-mapped attribute Z is a permutation of the corresponding original attribute X; ii) the rank order of Z is the same as the rank order of Y. Therefore, any SDC method for microdata—individual records—is functionally equivalent to permutation—transforming data set X into Z—followed by residual noise—transforming Z into the anonymized data set Y. The noise added is residual because by construction the ranks of Z and Y are the same.
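
A minimal sketch of the reverse-mapping step for a single numerical attribute (illustrative values, no ties assumed), followed by checks of properties i) and ii):

```python
import numpy as np

def reverse_map(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Reverse-map one attribute: z_i is the value of X whose rank equals the
    rank of y_i, so Z is a permutation of X sharing the rank order of Y."""
    x_sorted = np.sort(x)                  # x_(1) <= ... <= x_(n)
    ranks_y = np.argsort(np.argsort(y))    # 0-based rank of each y_i within Y
    return x_sorted[ranks_y]

x = np.array([10.0, 42.0, 7.0, 23.0])      # illustrative original attribute
y = np.array([12.0, 9.0, 35.0, 30.0])      # some anonymized version of x
z = reverse_map(x, y)                      # -> [10., 7., 42., 23.]

assert np.array_equal(np.sort(z), np.sort(x))        # i) Z is a permutation of X
assert np.array_equal(np.argsort(z), np.argsort(y))  # ii) Z has the rank order of Y
```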

A.3 Information Theory

Classically, information theory approaches the notion of information contained in a message as capturing how much the message reduces uncertainty about something [10]. As a result, in this theory information shares the same definition as entropy and choosing which term to use depends on whether it is given or taken away. For example, a high entropy attribute will convey a high initial uncertainty about its actual value. If we then learn the value, we have acquired an amount of information equal to the initial uncertainty, i.e. the entropy we had originally about the value. Thus, information and entropy are two sides of the same coin. In this paper, we propose to apply entropy to a data set in a controlled way. This operation will take away data utility from the user but will in exchange generate protection. As such, data utility and protection also become two sides of the same coin, albeit in that case they are inversely related.
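
As a concrete illustration with hypothetical distributions, entropy in bits quantifies this initial uncertainty about an attribute's value:

```python
import numpy as np

def entropy_bits(p) -> float:
    """Shannon entropy of a discrete distribution p, in bits (zero terms ignored)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

print(entropy_bits([0.25, 0.25, 0.25, 0.25]))   # 2.0 bits: maximal uncertainty over 4 values
print(entropy_bits([0.85, 0.05, 0.05, 0.05]))   # ~0.85 bits: a far more predictable attribute
```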

In information theory, a basic way to capture uncertainty is majorization [12]. Consider two vectors \(x={\left({x}_{1},\dots ,{x}_{N}\right)}^{T}\) and \(y={\left({y}_{1},\dots ,{y}_{N}\right)}^{T}\) representing probability distributions, with the elements of each vector sorted in decreasing order. The vector x is said to majorize y, usually denoted \(x\succ y\), if and only if the largest element of x is at least the largest element of y, the sum of the two largest elements of x is at least the sum of the two largest elements of y, and so on for every partial sum [10]. Equivalently, the probability distribution represented by x is more narrowly peaked than that of y, which implies that x conveys less uncertainty than y and thus has lower entropy than y.
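
The partial-sum criterion translates directly into code; a minimal sketch with two illustrative distributions:

```python
import numpy as np

def majorizes(x: np.ndarray, y: np.ndarray, tol: float = 1e-12) -> bool:
    """True if x majorizes y: sort both in decreasing order and require every
    partial sum of x to be at least the corresponding partial sum of y."""
    xs = np.sort(x)[::-1].cumsum()
    ys = np.sort(y)[::-1].cumsum()
    return bool(np.all(xs >= ys - tol))

x = np.array([0.7, 0.2, 0.1])      # more narrowly peaked distribution
y = np.array([0.4, 0.35, 0.25])    # more spread out distribution
print(majorizes(x, y))             # True: x conveys less uncertainty (lower entropy) than y
```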

In the privacy literature there is no such well-defined notion of information and no associated concepts such as majorization. What counts as information for the meaningful exploitation of a data set lies in the eye of the user. For example, one user may be interested in performing simple statistical requests such as cross-tabulations, and will therefore equate information with the analytical validity of such requests on the anonymized data, that is, their closeness to the same requests performed on the original data set. Another user may only be interested in performing econometric analyses, and will instead judge an anonymized data set informative based on, for example, the validity of some OLS outputs computed on it. Because the needs of users can be almost infinitely varied, one is left with a severe diversity problem when evaluating the information content of an anonymized data set. In the paper, we reasonably assume that the original data set always provides the highest utility and analytical value to the user, and thus that an anonymized data set always entails a loss of utility.


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Ruiz, N., Domingo-Ferrer, J. (2022). Bistochastic Privacy. In: Torra, V., Narukawa, Y. (eds.) Modeling Decisions for Artificial Intelligence. MDAI 2022. Lecture Notes in Computer Science, vol. 13408. Springer, Cham. https://doi.org/10.1007/978-3-031-13448-7_5


  • DOI: https://doi.org/10.1007/978-3-031-13448-7_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-13447-0

  • Online ISBN: 978-3-031-13448-7
