Information preserving regression-based tools for statistical disclosure control

Langsrud, Øyvind

doi:10.1007/s11222-018-9848-9

Published: 02 January 2019

Statistics and Computing, Volume 29, pages 965–976 (2019)

Øyvind Langsrud ORCID: orcid.org/0000-0002-1380-4396¹

Abstract

This paper presents a unified framework for regression-based statistical disclosure control for microdata. A basic method, known as information preserving statistical obfuscation (IPSO), produces synthetic data that preserve variances, covariances and fitted values. The data are then generated conditionally according to the multivariate normal distribution. Generalizations of the IPSO method described in the literature aim to generate data that are more similar to the original data. This paper describes these methods in a concise and interpretable way that is close to an efficient implementation. Decomposing the residual data into orthogonal scores and corresponding loadings is an essential part of the framework. Both QR decomposition (Gram–Schmidt orthogonalization) and singular value decomposition (principal components) may be used. Within this framework, new and generalized methods are presented. In particular, a method is described by means of which the correlations to the original principal component scores can be controlled exactly. It is shown that a suggested method of random orthogonal matrix masking can be implemented without generating an orthogonal matrix. Generalized methodology for hierarchical categories is presented within the context of microaggregation. Some information can then be preserved at the lowest level and more information at higher levels. The presented methodology is also applicable to tabular data. One possibility is to replace the content of primary and secondary suppressed cells with generated values. It is proposed to replace suppressed cell frequencies with decimal numbers, and it is argued that this can be a useful method.

References

Benedetto, G., Stinson, M.H., Abowd, J.M.: The Creation and Use of the SIPP Synthetic Beta. Technical Report, United States Census Bureau (2013)
Burridge, J.: Information preserving statistical obfuscation. Stat. Comput. 13(4), 321–327 (2003). https://doi.org/10.1023/A:1025658621216
Calvino, A.: A simple method for limiting disclosure in continuous microdata based on principal component analysis. J. Off. Stat. 33(1), 15–41 (2017). https://doi.org/10.1515/JOS-2017-0002
Chan, T.F.: Rank revealing QR factorizations. Linear Algebra Appl. 88–9, 67–82 (1987). https://doi.org/10.1016/0024-3795(87)90103-0
de Wolf, P.P., Giessing, S.: Adjusting the tau-ARGUS modular approach to deal with linked tables. Data Knowl. Eng. 68(11), 1160–1174 (2009). https://doi.org/10.1016/j.datak.2009.06.005
Demmel, J., Gu, M., Eisenstat, S., Slapnicar, I., Veselic, K., Drmac, Z.: Computing the singular value decomposition with high relative accuracy. Linear Algebra Appl. 299(1–3), 21–80 (1999). https://doi.org/10.1016/S0024-3795(99)00134-2
Domingo-Ferrer, J., Gonzalez-Nicolas, U.: Hybrid microdata using microaggregation. Inf. Sci. 180(15), 2834–2844 (2010). https://doi.org/10.1016/j.ins.2010.04.005
Domingo-Ferrer, J., Mateo-Sanz, J.M.: Practical data-oriented microaggregation for statistical disclosure control. IEEE Trans. Knowl. Data Eng. 14(1), 189–201 (2002)
Drechsler, J.: Synthetic Datasets for Statistical Disclosure Control. Springer, New York (2011)
Duncan, G.T., Pearson, R.W.: Enhancing access to microdata while protecting confidentiality: prospects for the future. Stat. Sci. 6(3), 219–239 (1991)
Hundepool, A., Domingo-Ferrer, J., Franconi, L., Giessing, S., Nordholt, E.S., Spicer, K., de Wolf, P.P.: Statistical Disclosure Control. Wiley, Hoboken (2012). https://doi.org/10.1002/9781118348239.ch1
Hundepool, A., de Wolf, P.P., Bakker, J., Reedijk, A., Franconi, L., Polettini, S., Capobianchi, A., Domingo, J.: mu-ARGUS User’s Manual, Version 5.1. Technical Report, Statistics Netherlands (2014)
Jarmin, R.S., Louis, T.A., Miranda, J.: Expanding the role of synthetic data at the U.S. Census Bureau. Stat. J. IAOS 30(1–3), 117–121 (2014)
Jolliffe, I.: Principal Component Analysis, 2nd edn. Springer, New York (2002)
Klein, M.D., Datta, G.S.: Statistical disclosure control via sufficiency under the multiple linear regression model. J. Stat. Theory Pract. 12(1), 100–110 (2018)
Langsrud, Ø.: Rotation tests. Stat. Comput. 15(1), 53–60 (2005). https://doi.org/10.1007/s11222-005-4789-5
Loong, B., Rubin, D.B.: Multiply-imputed synthetic data: advice to the imputer. J. Off. Stat. 33(4), 1005–1019 (2017). https://doi.org/10.1515/JOS-2017-0047
Mateo-Sanz, J., Martinez-Balleste, A., Domingo-Ferrer, J.: Fast generation of accurate synthetic microdata. In: Domingo-Ferrer, J., Torra, V. (eds.) Privacy in Statistical Databases, Proceedings, Conference on Privacy in Statistical DataBases (PSD 2004), Barcelona, Spain, 09–11 June 2004, vol. 3050, pp. 298–306 (2004)
Muralidhar, K., Sarathy, R.: Generating sufficiency-based non-synthetic perturbed data. Trans. Data Priv. 1(1), 17–33 (2008)
Reiter, J.P., Raghunathan, T.E.: The multiple adaptations of multiple imputation. J. Am. Stat. Assoc. 102(480), 1462–1471 (2007). https://doi.org/10.1198/016214507000000932
Salazar-Gonzalez, J.J.: Statistical confidentiality: optimization techniques to protect tables. Comput. Oper. Res. 35(5), 1638–1651 (2008). https://doi.org/10.1016/j.cor.2006.09.007
Strang, G.: Linear Algebra and Its Applications, 3rd edn. Harcourt Brace Jovanovich, San Diego (1988)
Templ, M., Meindl, B.: Robustification of microdata masking methods and the comparison with existing methods. In: Domingo-Ferrer, J., Saygın, Y. (eds.) Privacy in Statistical Databases, Proceedings, UNESCO Chair in Data Privacy International Conference (PSD 2008), Istanbul, Turkey, 24–26 Sept 2008, pp. 113–126. Springer, Berlin (2008)
Templ, M., Kowarik, A., Meindl, B.: Statistical disclosure control for micro-data using the R Package sdcMicro. J. Stat. Softw. 67(4), 1–37 (2015)
Ting, D., Fienberg, S.E., Trottini, M.: Random orthogonal matrix masking methodology for microdata release. Int. J. Inf. Comput. Secur. 2(1), 86–105 (2008). https://doi.org/10.1504/IJICS.2008.016823
Wedderburn, R.W.M.: Random Rotations and Multivariate Normal Simulation. Research Report, Rothamsted Experimental Station (1975)

Author information

Authors and Affiliations

Statistics Norway, P.O. Box 2633, St. Hanshaugen, 0131, Oslo, Norway
Øyvind Langsrud

Corresponding author

Correspondence to Øyvind Langsrud.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1: Generalized QR decomposition

The QR decomposition of an $n \times m$ matrix $\varvec{A}$ with rank $r$ can be written as

$$\begin{aligned} \varvec{A} = \varvec{Q} \varvec{R} \end{aligned}$$

(24)

where $\varvec{Q}$ is an $n \times r$ matrix whose columns form an orthonormal basis for the column space of $\varvec{A}$ and $\varvec{R}$ is an $r \times m$ matrix. This decomposition can be viewed as the matrix formulation of the Gram–Schmidt orthogonalization process. The Cholesky decomposition of $\varvec{A}^T\!\varvec{A}$ can be read from the QR decomposition of $\varvec{A}$ as $\varvec{R}^T\!\varvec{R}$.
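As a minimal numerical sketch of (24) in the full column rank case, the relation to the Cholesky decomposition can be checked with NumPy. The random test matrix, the seed and the sign adjustment are assumptions made for illustration; `np.linalg.qr` does not by itself guarantee a positive diagonal in $\varvec{R}$, so the signs are flipped explicitly before comparing with the Cholesky factor.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))  # full column rank with probability one

# Reduced QR: Q is 6 x 3 with orthonormal columns, R is 3 x 3 upper triangular.
Q, R = np.linalg.qr(A, mode="reduced")

# Flip signs so that diag(R) is positive; the product Q @ R is unchanged.
signs = np.sign(np.diag(R))
Q, R = Q * signs, signs[:, None] * R

# The lower-triangular Cholesky factor of A^T A then equals R^T.
L = np.linalg.cholesky(A.T @ A)
```

With the positive-diagonal convention, `R.T` and `L` agree up to rounding error, which illustrates $\varvec{A}^T\!\varvec{A} = \varvec{R}^T\!\varvec{R}$.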

To allow linearly dependent columns of $\varvec{A}$ ($r < m$), this paper uses a generalized variant of the QR decomposition. In such cases, a common decomposition (Chan 1987) is

$$\begin{aligned} \varvec{A} \varvec{\tilde{P}} = \varvec{Q} \varvec{\tilde{R}} \end{aligned}$$

(25)

where $\varvec{\tilde{P}}$ is a permutation matrix that reorders the columns (pivoting) so that $\varvec{\tilde{R}}$ is upper triangular.

To make the decomposition unique, we require the diagonal entries of $\varvec{\tilde{R}}$ to be positive. Furthermore, we require $\varvec{\tilde{P}}$ to keep the order of the columns as close to the original order as possible (minimal pivoting). We now have $\varvec{A} = \varvec{Q} \varvec{\tilde{R}} \varvec{\tilde{P}}^{T}$, and in the generalized QR decomposition (24) we use

$$\begin{aligned} \varvec{R} = \varvec{\tilde{R}} \varvec{\tilde{P}}^{T} \end{aligned}$$

(26)
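One way to read (24)–(26) computationally is a Gram–Schmidt pass over the columns in their original order, in which a column that depends linearly on earlier ones contributes coefficients to $\varvec{R}$ but no new column to $\varvec{Q}$. The sketch below follows that reading. The function name `generalized_qr`, the tolerance `tol` and the second orthogonalization pass are illustrative assumptions, not the implementation used in the paper.

```python
import numpy as np

def generalized_qr(A, tol=1e-10):
    """Return Q (n x r) and R (r x m) with A = Q @ R (hypothetical helper)."""
    A = np.asarray(A, dtype=float)
    n, m = A.shape
    Q = np.zeros((n, 0))
    R = np.zeros((0, m))
    scale = max(np.linalg.norm(A), 1.0)  # assumed scale for the rank decision
    for j in range(m):
        coef = Q.T @ A[:, j]             # projection onto the basis so far
        resid = A[:, j] - Q @ coef
        extra = Q.T @ resid              # second pass for numerical stability
        resid -= Q @ extra
        coef += extra
        R[:, j] = coef
        norm = np.linalg.norm(resid)
        if norm > tol * scale:           # column j is linearly independent
            Q = np.column_stack([Q, resid / norm])
            new_row = np.zeros(m)
            new_row[j] = norm            # positive "diagonal" entry
            R = np.vstack([R, new_row])
    return Q, R

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 2))
A = np.column_stack([B[:, 0], 2.0 * B[:, 0], B[:, 1]])  # column 1 is dependent
Q, R = generalized_qr(A)

# Moving the dependent column last gives the pivoted factor R-tilde of (25).
perm = [0, 2, 1]
R_tilde = R[:, perm]
```

In this example the pivot columns keep their original order, `R` is in row echelon form, and `R_tilde` is upper triangular, which matches the minimal-pivoting description under the stated assumptions.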

The QR decomposition of a composite matrix can be written as

$$\begin{aligned} \left[ \varvec{A}_{1}\; \varvec{A}_{2} \right] = \left[ \varvec{Q}_{1}\; \varvec{Q}_{2} \right] \left[ \varvec{R}_{1}^{T}\; \varvec{R}_{2}^{T} \right] ^{T} \end{aligned}$$

(27)

Now $\varvec{Q}_{1}$ can be computed by QR decomposition of $\varvec{A}_{1}$. The matrix $\varvec{Q}_{2}$ can be computed by QR decomposition of $\varvec{A}_{2} - \varvec{Q}_{1}\varvec{Q}_{1}^{T}\varvec{A}_{2}$, which is the residual part after regressing $\varvec{A}_{2}$ onto $\varvec{A}_{1}$.
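The two-step construction of (27) can be sketched as follows for full-rank blocks. The block sizes, the seed and the names `R11` and `R22` are assumptions for illustration; in the notation of (27), $\varvec{R}_{1}$ corresponds to `[R11, Q1.T @ A2]` and $\varvec{R}_{2}$ to `[0, R22]`. The sign convention of `np.linalg.qr` is left as is.

```python
import numpy as np

rng = np.random.default_rng(1)
A1 = rng.standard_normal((8, 2))
A2 = rng.standard_normal((8, 3))

# Step 1: orthonormal basis for the column space of A1.
Q1, R11 = np.linalg.qr(A1)

# Step 2: QR of the residual of A2 after regression onto A1.
resid = A2 - Q1 @ (Q1.T @ A2)
Q2, R22 = np.linalg.qr(resid)

# Assemble the factors of the composite matrix [A1 A2].
Q = np.hstack([Q1, Q2])
R = np.block([[R11, Q1.T @ A2],
              [np.zeros((3, 2)), R22]])
```

Because the residual is orthogonal to the columns of `Q1`, the stacked matrix `Q` has orthonormal columns and `Q @ R` reproduces `[A1 A2]`.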

Appendix 2: The singular value decomposition

The singular value decomposition (SVD) of an $n \times m$ matrix $\varvec{A}$ with rank $r$ can be written as

$$\begin{aligned} \varvec{A} = \varvec{U} \varvec{\varLambda } \varvec{V}^{T} \end{aligned}$$

(28)

where $\varvec{\varLambda }$ is an $r \times r$ diagonal matrix of strictly positive singular values in descending order. This is the rank-revealing version of the decomposition (Demmel et al. 1999). Other variants of SVD allow some singular values to be zero, but these can be omitted. The columns of the $n \times r$ matrix $\varvec{U}$ form an orthonormal basis for the column space of $\varvec{A}$, and the columns of the $m \times r$ matrix $\varvec{V}$ form an orthonormal basis for the row space.

The singular values are the square roots of the nonzero eigenvalues of $\varvec{A}^T\!\varvec{A}$ and $\varvec{A}\varvec{A}^T$. The eigendecompositions of these two symmetric matrices can be read directly from the SVD of $\varvec{A}$ as $\varvec{V} \varvec{\varLambda }^2 \varvec{V}^{T}$ and $\varvec{U} \varvec{\varLambda }^2 \varvec{U}^{T}$. It is also worth mentioning that an alternative to the ordinary Cholesky decomposition, $\varvec{A}^T\!\varvec{A}=\varvec{R}^T\!\varvec{R}$, is to let $\varvec{\varLambda } \varvec{V}^{T}$ play the role of $\varvec{R}$.
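A small numerical check of (28) and of the eigenvalue relation, assuming a rank-2 test matrix and a relative rank tolerance of `1e-10` (both chosen for illustration only), might look like this:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 4))  # 6 x 4, rank 2

# Thin SVD, then drop (numerically) zero singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = int(np.sum(s > 1e-10 * s.max()))   # assumed relative tolerance
U, s, V = U[:, :r], s[:r], Vt[:r].T
Lam = np.diag(s)

# Eigenvalues of A^T A in descending order: s**2 followed by zeros.
eig = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]

# Lam @ V.T can play the role of R in A^T A = R^T R.
R_alt = Lam @ V.T
```

Here `U` is 6 x 2, `V` is 4 x 2, and the first `r` entries of `eig` agree with `s**2` up to rounding error.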

To make the SVD unique, we can require all column sums of $\varvec{V}$ to be positive. Even so, the decomposition is not unique when some singular values are equal.
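The sign convention can be imposed after the decomposition by flipping matching columns of $\varvec{U}$ and $\varvec{V}$. The helper name `fix_svd_signs` is hypothetical, and treating a zero column sum as positive is an assumption added to keep the sketch well defined.

```python
import numpy as np

def fix_svd_signs(U, V):
    """Flip paired columns of U and V so that column sums of V are positive."""
    signs = np.sign(V.sum(axis=0))
    signs[signs == 0] = 1.0        # assumed handling of a zero column sum
    return U * signs, V * signs

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 3))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_fixed, V_fixed = fix_svd_signs(U, Vt.T)
```

Flipping the sign of a column in both factors leaves the product $\varvec{U} \varvec{\varLambda } \varvec{V}^{T}$ unchanged.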

There is a close relationship between SVD and PCA. In PCA, the variables are usually centered to zero means and in many cases standardized to equal variances prior to decomposition. If $\varvec{A}$ is such a centered/standardized matrix, then $\varvec{U} \varvec{\varLambda }$ is the matrix of PCA scores and $\varvec{V}$ is the matrix of PCA loadings.
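The correspondence between SVD and PCA can be sketched on centered (not standardized) data. The simulated data matrix `X` and the seed are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
mix = np.array([[2.0, 0.5, 0.0],
                [0.0, 1.0, 0.3],
                [0.0, 0.0, 0.2]])
X = rng.standard_normal((50, 3)) @ mix

A = X - X.mean(axis=0)                  # center each variable to zero mean
U, s, Vt = np.linalg.svd(A, full_matrices=False)

scores = U * s                          # U @ diag(s): PCA scores
loadings = Vt.T                         # V: PCA loadings

# Score variances equal the eigenvalues of the sample covariance matrix.
score_var = scores.var(axis=0, ddof=1)
cov_eig = np.sort(np.linalg.eigvalsh(np.cov(A, rowvar=False)))[::-1]
```

The scores are mutually orthogonal, and `scores @ loadings.T` reconstructs the centered data.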

The Moore–Penrose generalized inverse of $\varvec{A}$ can be written as

$$\begin{aligned} \varvec{A}^{\dagger } = \varvec{V} \varvec{\varLambda }^{-1} \varvec{U}^{T} \end{aligned}$$

(29)

We have

$$\begin{aligned} \varvec{A}^{\dagger } = (\varvec{A}^{T}\varvec{A})^{\dagger }\varvec{A}^{T} = \varvec{A}^{T} (\varvec{A}\varvec{A}^{T})^{\dagger } \end{aligned}$$

(30)

When $\varvec{A}$ is invertible, $\varvec{A}^{\dagger }= \varvec{A}^{-1}$. When $\varvec{A}^{T}\varvec{A}$ or $\varvec{A}\varvec{A}^{T}$ is invertible, this means, respectively, that $\varvec{A}^{\dagger } = (\varvec{A}^{T}\varvec{A})^{-1}\varvec{A}^{T}$ or $\varvec{A}^{\dagger } = \varvec{A}^{T} (\varvec{A}\varvec{A}^{T})^{-1}$.
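Equations (29) and (30) can be verified numerically. The sketch below assumes a rank-deficient test matrix and a cutoff of `1e-10` for small singular values; `np.linalg.pinv` is NumPy's own SVD-based pseudoinverse and is used here only as a cross-check.

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 4))  # rank 2

# Equation (29): pseudoinverse from the rank-revealing SVD.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = int(np.sum(s > 1e-10 * s.max()))   # assumed relative tolerance
A_dag = Vt[:r].T @ np.diag(1.0 / s[:r]) @ U[:, :r].T

# Equation (30): the two equivalent expressions.
via_AtA = np.linalg.pinv(A.T @ A, rcond=1e-10) @ A.T
via_AAt = A.T @ np.linalg.pinv(A @ A.T, rcond=1e-10)

# Full column rank: the pseudoinverse reduces to (B^T B)^{-1} B^T.
B = rng.standard_normal((6, 3))
B_dag = np.linalg.inv(B.T @ B) @ B.T
```

All three expressions for the pseudoinverse of `A` agree up to rounding error, and `B_dag` matches `np.linalg.pinv(B)`.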

About this article

Cite this article

Langsrud, Ø. Information preserving regression-based tools for statistical disclosure control. Stat Comput 29, 965–976 (2019). https://doi.org/10.1007/s11222-018-9848-9

Received: 20 October 2017
Accepted: 05 December 2018
Published: 02 January 2019
Issue Date: 11 September 2019
DOI: https://doi.org/10.1007/s11222-018-9848-9
