Abstract
This paper presents a unified framework for regression-based statistical disclosure control for microdata. A basic method, known as information preserving statistical obfuscation (IPSO), produces synthetic data that preserve variances, covariances and fitted values. The data are then generated conditionally according to the multivariate normal distribution. Generalizations of the IPSO method are described in the literature, and these methods aim to generate data more similar to the original data. This paper describes these methods in a concise and interpretable way that stays close to an efficient implementation. Decomposing the residual data into orthogonal scores and corresponding loadings is an essential part of the framework. Both QR decomposition (Gram–Schmidt orthogonalization) and singular value decomposition (principal components) may be used. Within this framework, new and generalized methods are presented. In particular, a method is described by means of which the correlations to the original principal component scores can be controlled exactly. It is shown that a suggested method of random orthogonal matrix masking can be implemented without generating an orthogonal matrix. Generalized methodology for hierarchical categories is presented within the context of microaggregation. Some information can then be preserved at the lowest level and more information at higher levels. The presented methodology is also applicable to tabular data. One possibility is to replace the content of primary and secondary suppressed cells with generated values. It is proposed to replace suppressed cell frequencies with decimal numbers, and it is argued that this can be a useful method.
References
Benedetto, G., Stinson, M.H., Abowd, J.M.: The Creation and Use of the SIPP Synthetic Beta. Technical Report, United States Census Bureau (2013)
Burridge, J.: Information preserving statistical obfuscation. Stat. Comput. 13(4), 321–327 (2003). https://doi.org/10.1023/A:1025658621216
Calvino, A.: A simple method for limiting disclosure in continuous microdata based on principal component analysis. J. Off. Stat. 33(1), 15–41 (2017). https://doi.org/10.1515/JOS-2017-0002
Chan, T.F.: Rank revealing QR factorizations. Linear Algebra Appl. 88–89, 67–82 (1987). https://doi.org/10.1016/0024-3795(87)90103-0
de Wolf, P.P., Giessing, S.: Adjusting the tau-ARGUS modular approach to deal with linked tables. Data Knowl. Eng. 68(11), 1160–1174 (2009). https://doi.org/10.1016/j.datak.2009.06.005
Demmel, J., Gu, M., Eisenstat, S., Slapnicar, I., Veselic, K., Drmac, Z.: Computing the singular value decomposition with high relative accuracy. Linear Algebra Appl. 299(1–3), 21–80 (1999). https://doi.org/10.1016/S0024-3795(99)00134-2
Domingo-Ferrer, J., Gonzalez-Nicolas, U.: Hybrid microdata using microaggregation. Inf. Sci. 180(15), 2834–2844 (2010). https://doi.org/10.1016/j.ins.2010.04.005
Domingo-Ferrer, J., Mateo-Sanz, J.M.: Practical data-oriented microaggregation for statistical disclosure control. IEEE Trans. Knowl. Data Eng. 14(1), 189–201 (2002)
Drechsler, J.: Synthetic Datasets for Statistical Disclosure Control. Springer, New York (2011)
Duncan, G.T., Pearson, R.W.: Enhancing access to microdata while protecting confidentiality: prospects for the future. Stat. Sci. 6(3), 219–239 (1991)
Hundepool, A., Domingo-Ferrer, J., Franconi, L., Giessing, S., Nordholt, E.S., Spicer, K., de Wolf, P.P.: Statistical Disclosure Control. Wiley, Hoboken (2012). https://doi.org/10.1002/9781118348239.ch1
Hundepool, A., de Wolf, P.P., Bakker, J., Reedijk, A., Franconi, L., Polettini, S., Capobianchi, A., Domingo, J.: mu-ARGUS User’s Manual, Version 5.1. Technical Report, Statistics Netherlands (2014)
Jarmin, R.S., Louis, T.A., Miranda, J.: Expanding the role of synthetic data at the U.S. Census Bureau. Stat. J. IAOS 30(1–3), 117–121 (2014)
Jolliffe, I.: Principal Component Analysis, 2nd edn. Springer, New York (2002)
Klein, M.D., Datta, G.S.: Statistical disclosure control via sufficiency under the multiple linear regression model. J. Stat. Theory Pract. 12(1), 100–110 (2018)
Langsrud, Ø.: Rotation tests. Stat. Comput. 15(1), 53–60 (2005). https://doi.org/10.1007/s11222-005-4789-5
Loong, B., Rubin, D.B.: Multiply-imputed synthetic data: advice to the imputer. J. Off. Stat. 33(4), 1005–1019 (2017). https://doi.org/10.1515/JOS-2017-0047
Mateo-Sanz, J., Martinez-Balleste, A., Domingo-Ferrer, J.: Fast generation of accurate synthetic microdata. In: Domingo-Ferrer, J., Torra, V. (eds.) Privacy in Statistical Databases, Proceedings, Conference on Privacy in Statistical DataBases (PSD 2004), Barcelona, Spain, 09–11 June 2004, vol. 3050, pp. 298–306 (2004)
Muralidhar, K., Sarathy, R.: Generating sufficiency-based non-synthetic perturbed data. Trans. Data Priv. 1(1), 17–33 (2008)
Reiter, J.P., Raghunathan, T.E.: The multiple adaptations of multiple imputation. J. Am. Stat. Assoc. 102(480), 1462–1471 (2007). https://doi.org/10.1198/016214507000000932
Salazar-Gonzalez, J.J.: Statistical confidentiality: optimization techniques to protect tables. Comput. Oper. Res. 35(5), 1638–1651 (2008). https://doi.org/10.1016/j.cor.2006.09.007
Strang, G.: Linear Algebra and Its Applications, 3rd edn. Harcourt Brace Jovanovich, San Diego (1988)
Templ, M., Meindl, B.: Robustification of microdata masking methods and the comparison with existing methods. In: Domingo-Ferrer, J., Saygın, Y. (eds.) Privacy in Statistical Databases, Proceedings, UNESCO Chair in Data Privacy International Conference (PSD 2008), Istanbul, Turkey, 24–26 Sept 2008, pp. 113–126. Springer, Berlin (2008)
Templ, M., Kowarik, A., Meindl, B.: Statistical disclosure control for micro-data using the R Package sdcMicro. J. Stat. Softw. 67(4), 1–37 (2015)
Ting, D., Fienberg, S.E., Trottini, M.: Random orthogonal matrix masking methodology for microdata release. Int. J. Inf. Comput. Secur. 2(1), 86–105 (2008). https://doi.org/10.1504/IJICS.2008.016823
Wedderburn, R.W.M.: Random Rotations and Multivariate Normal Simulation. Research Report, Rothamsted Experimental Station (1975)
Appendices
Appendix 1: Generalized QR decomposition
The QR decomposition of an \(n \times m\) matrix \(\varvec{A}\) with rank \(r\) can be written as
\[
\varvec{A} = \varvec{Q}\varvec{R}, \tag{24}
\]
where \(\varvec{Q}\) is an \(n \times r\) matrix whose columns form an orthonormal basis for the column space of \(\varvec{A}\). This decomposition can be viewed as the matrix formulation of the Gram–Schmidt orthogonalization process. The Cholesky decomposition of \(\varvec{A}^T\!\varvec{A}\) can be read from the QR decomposition of \(\varvec{A}\) as \(\varvec{R}^T\!\varvec{R}\).
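The two properties stated above can be checked numerically. The following is a minimal sketch, assuming NumPy and a small random matrix of full column rank chosen purely for illustration; the sign flip is needed because `numpy.linalg.qr` does not guarantee a positive diagonal in the triangular factor.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))      # illustrative matrix with r = m = 3

Q, R = np.linalg.qr(A)               # reduced QR: Q is 6 x 3, R is 3 x 3

# Flip signs so that diag(R) is positive; (Q S)(S R) = Q R because S S = I.
signs = np.sign(np.diag(R))
Q, R = Q * signs, signs[:, None] * R

print(np.allclose(Q @ R, A))               # A = QR
print(np.allclose(Q.T @ Q, np.eye(3)))     # orthonormal columns
print(np.allclose(R.T @ R, A.T @ A))       # R^T R reproduces A^T A

# With a positive diagonal, R^T coincides with the lower Cholesky factor.
L = np.linalg.cholesky(A.T @ A)
print(np.allclose(L.T, R))
```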
In this paper, in order to allow linearly dependent columns of \(\varvec{A}\) (\(r < m\)), we refer to a generalized variant of QR decomposition. In such cases, a usual decomposition (Chan 1987) is
\[
\varvec{A}\varvec{\tilde{P}} = \varvec{Q}\varvec{\tilde{R}},
\]
where \(\varvec{\tilde{P}}\) is a permutation matrix that reorders the columns (pivoting) so that \(\varvec{\tilde{R}}\) is upper triangular.
To make the decomposition unique, we require the diagonal entries of \(\varvec{\tilde{R}}\) to be positive. Furthermore, we require \(\varvec{\tilde{P}}\) to keep the order of the columns as close to the original order as possible (minimal pivoting). We now have \(\varvec{A} = \varvec{Q} \varvec{\tilde{R}} \varvec{\tilde{P}}^{T}\) and in generalized QR decomposition (24) we use
\[
\varvec{R} = \varvec{\tilde{R}}\varvec{\tilde{P}}^{T}.
\]
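One way to obtain a decomposition of this kind is to run Gram–Schmidt over the columns in their original order and skip any column that is already in the span of the earlier ones. The sketch below is an illustration under that assumption, not necessarily how an efficient implementation proceeds; the function name `generalized_qr` and the tolerance `tol` are hypothetical choices.

```python
import numpy as np

def generalized_qr(A, tol=1e-10):
    """Hypothetical helper: Q (n x r) and R (r x m) with A = Q R,
    allowing linearly dependent columns in A."""
    A = np.asarray(A, dtype=float)
    n, m = A.shape
    basis = []
    for j in range(m):
        v = A[:, j].copy()
        for q in basis:                      # modified Gram-Schmidt step
            v -= (q @ v) * q
        norm = np.linalg.norm(v)
        # Keep the column only if it adds a new direction.
        if norm > tol * max(1.0, np.linalg.norm(A[:, j])):
            basis.append(v / norm)
    Q = np.column_stack(basis) if basis else np.zeros((n, 0))
    R = Q.T @ A                              # coefficients of all m columns
    return Q, R

rng = np.random.default_rng(5)
a = rng.standard_normal((5, 3))
# Third column is the sum of the first two, so r = 3 while m = 4.
A = np.column_stack([a[:, 0], a[:, 1], a[:, 0] + a[:, 1], a[:, 2]])

Q, R = generalized_qr(A)
print(Q.shape, R.shape)
print(np.allclose(Q @ R, A))
```

In this example the dependent third column produces no new column in `Q`, and the leading nonzero entry of each row of `R` is positive, so `R` has a staircase shape rather than a strictly triangular one.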
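The QR decomposition of a composite matrix can be written as
\[
\begin{bmatrix} \varvec{A}_{1} & \varvec{A}_{2} \end{bmatrix}
=
\begin{bmatrix} \varvec{Q}_{1} & \varvec{Q}_{2} \end{bmatrix}
\begin{bmatrix} \varvec{R}_{11} & \varvec{R}_{12} \\ \varvec{0} & \varvec{R}_{22} \end{bmatrix}.
\]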
Now \(\varvec{Q}_{1}\) can be computed by QR decomposition of \(\varvec{A}_{1}\). The matrix \(\varvec{Q}_{2}\) can be computed by QR decomposition of \(\varvec{A}_{2} - \varvec{Q}_{1}\varvec{Q}_{1}^{T}\varvec{A}_{2}\), which is the residual part after regressing \(\varvec{A}_{2}\) onto \(\varvec{A}_{1}\).
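The two-step computation can be sketched as follows, assuming NumPy and random full-rank blocks chosen for illustration. The block names `R11`, `R12` and `R22` are notation assumed here for the pieces of the triangular factor.

```python
import numpy as np

rng = np.random.default_rng(1)
A1 = rng.standard_normal((8, 2))
A2 = rng.standard_normal((8, 3))

# Step 1: QR of the first block.
Q1, R11 = np.linalg.qr(A1)

# Step 2: QR of the residual after regressing A2 onto A1.
R12 = Q1.T @ A2
residual = A2 - Q1 @ R12
Q2, R22 = np.linalg.qr(residual)

# Assemble the factors of the composite matrix [A1 A2].
Q = np.hstack([Q1, Q2])
R = np.block([[R11, R12],
              [np.zeros((3, 2)), R22]])

print(np.allclose(Q @ R, np.hstack([A1, A2])))   # composite matrix recovered
print(np.allclose(Q1.T @ Q2, 0.0))               # the two blocks are orthogonal
```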
Appendix 2: The singular value decomposition
The singular value decomposition (SVD) of an \(n \times m\) matrix \(\varvec{A}\) with rank \(r\) can be written as
\[
\varvec{A} = \varvec{U}\varvec{\varLambda }\varvec{V}^{T},
\]
where \(\varvec{\varLambda }\) is an \(r \times r\) diagonal matrix of strictly positive singular values in descending order. This is the rank-revealing version of the decomposition (Demmel et al. 1999). Other variants of SVD allow some singular values to be zero, but these can be omitted. The columns of \(\varvec{U}\) form an orthonormal basis for the column space of \(\varvec{A}\) and the columns of \(\varvec{V}\) form an orthonormal basis for the row space.
The singular values are the square roots of the nonzero eigenvalues of \(\varvec{A}^T\!\varvec{A}\) and \(\varvec{A}\varvec{A}^T\). The eigendecompositions of these two symmetric matrices can be read directly from the SVD of \(\varvec{A}\) as \(\varvec{V} \varvec{\varLambda }^2 \varvec{V}^{T}\) and \(\varvec{U} \varvec{\varLambda }^2 \varvec{U}^{T}\). It is also worth mentioning that an alternative to the ordinary Cholesky decomposition, \(\varvec{A}^T\!\varvec{A}=\varvec{R}^T\!\varvec{R}\), is to let \(\varvec{\varLambda } \varvec{V}^{T}\) play the role of \(\varvec{R}\).
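As a minimal numerical sketch of the rank-revealing form and the identities above (assuming NumPy, a rank-2 example matrix built for illustration, and a relative tolerance of `1e-10` chosen arbitrarily to decide which singular values count as zero):

```python
import numpy as np

rng = np.random.default_rng(2)
# 7 x 4 matrix of rank 2, built as a product of thin factors.
A = rng.standard_normal((7, 2)) @ rng.standard_normal((2, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the strictly positive singular values (rank-revealing form).
r = int(np.sum(s > 1e-10 * s[0]))
U, s, Vt = U[:, :r], s[:r], Vt[:r, :]
V, Lam = Vt.T, np.diag(s)

print(r)
print(np.allclose(U @ Lam @ V.T, A))             # A = U Lambda V^T
print(np.allclose(V @ Lam**2 @ V.T, A.T @ A))    # eigendecomposition of A^T A
print(np.allclose(U @ Lam**2 @ U.T, A @ A.T))    # eigendecomposition of A A^T

# Lambda V^T can take the role of R in A^T A = R^T R.
R_svd = Lam @ V.T
print(np.allclose(R_svd.T @ R_svd, A.T @ A))
```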
To make the SVD unique, we can require all column sums of \(\varvec{V}\) to be positive. In cases with equal singular values, the decomposition is not unique regardless.
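The sign convention can be imposed after a standard SVD routine has been called, since flipping the sign of a column of \(\varvec{V}\) together with the matching column of \(\varvec{U}\) leaves the product unchanged. A small sketch, assuming NumPy; the function name `svd_positive_colsums` and the tolerance are hypothetical:

```python
import numpy as np

def svd_positive_colsums(A, tol=1e-10):
    """Hypothetical helper: rank-revealing SVD with the signs fixed so
    that every column of V has a nonnegative sum."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    r = int(np.sum(s > tol * s[0]))
    U, s, V = U[:, :r], s[:r], Vt[:r, :].T
    # Flip the k-th column of both U and V when the column sum of V is negative.
    signs = np.where(V.sum(axis=0) < 0, -1.0, 1.0)
    return U * signs, s, V * signs

rng = np.random.default_rng(6)
A = rng.standard_normal((5, 3))
U, s, V = svd_positive_colsums(A)

print(V.sum(axis=0))                       # all column sums are positive here
print(np.allclose((U * s) @ V.T, A))       # the product is unchanged
```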
There is a close relationship between SVD and PCA. In PCA, the variables are usually centered to zero means and in many cases standardized to equal variances prior to decomposition. If \(\varvec{A}\) is such a centered/standardized matrix, then \(\varvec{U} \varvec{\varLambda }\) is the matrix of PCA scores and \(\varvec{V}\) is the matrix of PCA loadings.
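The correspondence between SVD and PCA can be illustrated with a short sketch, assuming NumPy and simulated data with three variables; only centering is applied here, and the mixing matrix is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
mix = np.array([[2.0, 0.5, 0.0],
                [0.0, 1.0, 0.3],
                [0.0, 0.0, 0.2]])
X = rng.standard_normal((n, 3)) @ mix      # simulated correlated variables

A = X - X.mean(axis=0)                     # center each variable to zero mean
U, s, Vt = np.linalg.svd(A, full_matrices=False)

scores = U * s                             # U Lambda: PCA scores
loadings = Vt.T                            # V: PCA loadings

# Scores are the centered data projected onto the loadings.
print(np.allclose(scores, A @ loadings))

# Score variances equal the eigenvalues of the sample covariance matrix.
eigvals = np.linalg.eigvalsh(np.cov(A, rowvar=False))[::-1]
print(np.allclose(s**2 / (n - 1), eigvals))
```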
The Moore–Penrose generalized inverse of \(\varvec{A}\) can be written as
\[
\varvec{A}^{\dagger } = \varvec{V}\varvec{\varLambda }^{-1}\varvec{U}^{T}.
\]
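We have
\[
\varvec{A}^{\dagger } = (\varvec{A}^{T}\varvec{A})^{\dagger }\varvec{A}^{T} = \varvec{A}^{T}(\varvec{A}\varvec{A}^{T})^{\dagger }.
\]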
When \(\varvec{A}\) is invertible, \(\varvec{A}^{\dagger }= \varvec{A}^{-1}\). When \(\varvec{A}^{T}\varvec{A}\) or \(\varvec{A}\varvec{A}^{T}\) is invertible, this means, respectively, that \(\varvec{A}^{\dagger } = (\varvec{A}^{T}\varvec{A})^{-1}\varvec{A}^{T}\) or \(\varvec{A}^{\dagger } = \varvec{A}^{T} (\varvec{A}\varvec{A}^{T})^{-1}\).
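The generalized inverse can be computed from the rank-revealing SVD as \(\varvec{V}\varvec{\varLambda }^{-1}\varvec{U}^{T}\), which is the standard construction. The sketch below assumes NumPy; the function name `pinv_svd` and the tolerance are hypothetical, and the result is compared with `numpy.linalg.pinv` and with the full-column-rank formula stated above.

```python
import numpy as np

def pinv_svd(A, tol=1e-10):
    """Hypothetical helper: Moore-Penrose inverse via V Lambda^{-1} U^T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    r = int(np.sum(s > tol * s[0]))        # drop zero singular values
    return Vt[:r, :].T @ np.diag(1.0 / s[:r]) @ U[:, :r].T

rng = np.random.default_rng(4)
A = rng.standard_normal((6, 3))            # full column rank: A^T A invertible
A_pinv = pinv_svd(A)

print(np.allclose(A_pinv, np.linalg.pinv(A)))
print(np.allclose(A_pinv, np.linalg.solve(A.T @ A, A.T)))   # (A^T A)^{-1} A^T

# Rank-deficient case: the same function still satisfies A A^+ A = A.
B = A @ np.array([[1.0, 0.0, 1.0],
                  [0.0, 1.0, 1.0],
                  [0.0, 0.0, 0.0]])        # rank 2
print(np.allclose(B @ pinv_svd(B) @ B, B))
```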
Cite this article
Langsrud, Ø. Information preserving regression-based tools for statistical disclosure control. Stat Comput 29, 965–976 (2019). https://doi.org/10.1007/s11222-018-9848-9
DOI: https://doi.org/10.1007/s11222-018-9848-9