Recent advances in cyclic perturbation of frequency tables

Lenz, Rainer

doi:10.1007/s11943-016-0180-6

Neue Entwicklungen in der zyklischen Überlagerung von Fallzahltabellen

Original publication
Published: 26 February 2016

Volume 10, pages 37–62, (2016)

AStA Wirtschafts- und Sozialstatistisches Archiv

Rainer Lenz^1,2

Abstract

As part of statistical disclosure control, data holders may distribute confidential data only if they are sufficiently protected in accordance with national and institutional legislation. When releasing frequency tables to users, data holders usually apply so-called pre- and post-tabular methods, taking into account both disclosure limitation and the users’ requirements. The main characteristic of post-tabular methods is that tables are computed from the original underlying microdata and perturbed just before transmission to users. The method of cyclic perturbation, proposed in (Duncan and Roehrig, Database Technologies: concepts, methodologies, tools and applications, pp 1823–1843, 2007), is very promising. Here, a sequence of perturbation patterns is added consecutively to an original table following a certain stochastic procedure. The paper presents different ways of defining that sequence and discusses appropriate parameter settings for balancing the conventional trade-off between data utility and disclosure risk.

Zusammenfassung (German abstract, in English translation)

When providing confidential microdata and tabular data, data holders must comply with existing legislation on the one hand and meet the needs of their data users on the other. For frequency tables, so-called pre- and post-tabular disclosure control strategies are traditionally pursued in order to satisfy these two opposing objectives as well as possible. With the latter, a table computed from the original microdata is modified before release by special methods of suppression, rounding or perturbation of cell values. The family of cyclic perturbation methods proposed in (Duncan and Roehrig, Database Technologies: concepts, methodologies, tools and applications, pp 1823–1843, 2007) has proved promising. Based on random numbers, a sequence of perturbation patterns, called basic cycles, is successively added to an original table. This paper presents several variants of cyclic perturbation and evaluates them against the classical trade-off between minimising information loss and ensuring sufficient data protection.

References

Baglivo J, Oliver D, Pagano M (1988) Methods for the analysis of contingency tables with large and small cell counts. J Amer Statistical Assoc 83:1006–1013
Brandt M, Zwick M (2011) Improvement of data access—the long way to remote data access in Germany, discussion paper 39, Research Data Centres of the Federal Statistical Office and the Land Statistical Offices of Germany
Diaconis P, Sturmfels B (1998) Algebraic algorithms for sampling from conditional distributions. Annals of Statistics 26:363–397
Cox LH (2007) Contingency tables of network type: models, markov basis and applications. Statistica Sinica 17:1371–1393
Duncan GT, Roehrig SF (2007) Reconciling information privacy and information access in a globalized technology society. In: Erickson J (ed) Database technologies: concepts, methodologies, tools and applications. IGI Global, Hershey, pp 1823–1843
Duncan GT, Elliott M, Salazar-Gonzales JJ (2011) Statistical Confidentiality—Principles and Practice, Statistics for Social and Behavioral Sciences, Springer, Berlin
Fellegi IP (1972) On the Question of Statistical Confidentiality. J Amer Statistical Assoc 67:7–18
Fraser B, Wooton J (2005) A proposed method for confidentialising tabular output to protect against differencing, https://www.unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.46/2005/wp.35.e.pdf, UNECE/Eurostat work session on statistical data confidentiality, Geneva
Gießing S (2004) Survey on methods of tabular data protection, PSD 2004. In: Domingo-Ferrer J, Torra V (eds), Lecture notes in computer science 3050, Springer, Berlin, pp 1–13
Gießing S (2011) Post-tabular stochastic noise to protect skewed business data, UNECE Conference of European Statisticians, Tarragona
Hafner H-P, Ritchie F, Lenz R (2014) User-focused threat identification for anonymised microdata, Conference of European stakeholders, Rome
Höhne J, Höninger J (2012) Morpheus—Remote access to micro data with a quality measure, Working Paper Series of the German Data Forum 203
Hundepool A, Van de Wetering A, Ramaswamy R, de Wolf P-P, Gießing S, Fischetti M, Salazar JJ, Castro J, Lowthian P (2014), τ-Argus user’s manual, version 4.1, see http://neon.vb.cbs.nl/casc/tau.htm
Hundepool A, Domingo-Ferrer J, Franconi L, Gießing S, Lenz R, Longhurst J, Schulte Nordholt E, Seri G, De Wolf P Handbook on statistical disclosure control, see http://neon.vb.cbs.nl/casc/SDC_Handbook.pdf
Lenz R (2011) On the way to remote access to German official microdata—a glimpse of work in progress, Statistique et nouvelles technologies de l’information, Revue des Nouvelles Technologies de l’information (RNTI), 125–138
Ronning G, Sturm R, Höhne J, Lenz R, Rosemann M, Scheffler M, Vorgrimler D (2005) Handbuch zur Anonymisierung wirtschaftsstatistischer Mikrodaten, Statistik und Wissenschaft, Band 4, Wiesbaden
Salazar JJ (2006) Controlled rounding and cell perturbation: Statistical disclosure limitation methods for tabular data. Math Program 105(2–3):251–274
Shlomo N, Young C (2008) Invariant post-tabular protection of census frequency counts. In: Domingo-Ferrer J, Saygin Y (eds) PSD 2008, Lecture Notes in Computer Science, pp 77–89
Smith D, Elliot M (2008) A measure of disclosure risk for tables of counts. Trans Data Priv 1:34–52
Statistische Ämter des Bundes und der Länder. Access to official German micro data. http://www.forschungsdatenzentrum.de/bestand/

Acknowledgement

This work was partially supported by the German Federal Ministry of Research and Education. The author also acknowledges the anonymous referees whose suggestions improved the article.

Author information

Authors and Affiliations

Saarland State University of Applied Sciences, Faculty of Engineering, 66117, Saarbrücken, Germany
Rainer Lenz
Technical University of Dortmund, Department of Statistics, 44221, Dortmund, Germany
Rainer Lenz

Authors

Rainer Lenz

Corresponding author

Correspondence to Rainer Lenz.

Appendices

Appendix A: Proofs

Proof of lemma 1: Since $Z$ takes the values $1$, $-1$ and $0$ with probabilities $\alpha$, $\beta$ and $1-\alpha-\beta$, we have $E(Z)=\alpha - \beta$, so that we obtain

$$\begin{aligned} E(S_{k}) = E\Bigl(\sum_{i=0}^{k-1} X_{i}\Bigr) = \sum_{i=0}^{k-1} E(X_{i}) &= \sum_{i=0}^{k-1} E(Z_{i} \,M_{i \bmod N}) \\ &= \sum_{i=0}^{k-1} E(Z_{i}) \,M_{i \bmod N} \\ &= E(Z) \,\sum_{i=0}^{k-1} M_{i \bmod N} \\ &= (\alpha - \beta) \,\sum_{i=0}^{k-1} M_{i \bmod N} \,.\end{aligned}$$

♦
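The identity of lemma 1 can be checked by brute force on a toy example. The following Python sketch is only an illustration: it assumes independent coefficients $Z_i$ with values $1$, $-1$, $0$ and probabilities $\alpha$, $\beta$, $1-\alpha-\beta$, and the two $2 \times 2$ patterns as well as the function names are hypothetical, not taken from the paper.

```python
from fractions import Fraction
from itertools import product


def expected_sum(patterns, k, alpha, beta):
    """Exact E(S_k) for S_k = sum_{i<k} Z_i * M_{i mod N}, by enumerating all 3^k outcomes."""
    prob = {1: alpha, -1: beta, 0: 1 - alpha - beta}
    N = len(patterns)
    rows, cols = len(patterns[0]), len(patterns[0][0])
    total = [[Fraction(0)] * cols for _ in range(rows)]
    for zs in product((1, -1, 0), repeat=k):
        p = Fraction(1)
        for z in zs:
            p *= prob[z]  # independence of the Z_i is an assumption
        for i, z in enumerate(zs):
            M = patterns[i % N]
            for r in range(rows):
                for c in range(cols):
                    total[r][c] += p * z * M[r][c]
    return total


def closed_form(patterns, k, alpha, beta):
    """Right-hand side of lemma 1: (alpha - beta) * sum_{i<k} M_{i mod N}."""
    N = len(patterns)
    rows, cols = len(patterns[0]), len(patterns[0][0])
    return [[(alpha - beta) * sum(patterns[i % N][r][c] for i in range(k))
             for c in range(cols)] for r in range(rows)]


# Hypothetical toy patterns with zero row and column sums
M0 = [[1, -1], [-1, 1]]
M1 = [[-1, 1], [1, -1]]
alpha, beta = Fraction(3, 10), Fraction(1, 10)
lhs = expected_sum([M0, M1], 3, alpha, beta)
rhs = closed_form([M0, M1], 3, alpha, beta)
```

Exact rational arithmetic is used so that both sides can be compared with `==` instead of a floating-point tolerance.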

Proof of theorem 2: We define an undirected bipartite graph ${\cal G}$ with bipartition $(V, W)$ as follows. Let $V=\{v_{1}, \ldots, v_{m}\}$ and $W=\{w_{1}, \ldots, w_{n}\}$ be the vertex sets corresponding to the rows and columns of $T$, respectively. In ${\cal G}$, two vertices $v_{i}$ and $w_{j}$ are connected by an edge if and only if the value of cell $(i, j)$ satisfies $t(i, j)> 0$. Since each row and each column of $T$ contains at least two non-zero entries, every vertex of ${\cal G}$ has degree at least two, so there exists a connected 2-regular subgraph ${\cal G}'$ of ${\cal G}$, whose vertices have degree $d(x)=2$ for all $x \in{\cal G}'$. Hence, ${\cal G}'$ possesses an Eulerian tour. Let $v_{i_1}w_{j_1}v_{i_2}w_{j_2} \ldots v_{i_k}w_{j_k}$ be that tour. We generate the non-zero entries of the associated cycle with alternating signs along the tour: $t_{i_1,j_1}=1, \, t_{i_2,j_1}=-1, \, t_{i_2,j_2}=1, \, t_{i_3,j_2}=-1, \, \ldots, t_{i_k,j_k}=1, \, t_{i_1,j_k}=-1$, so that every row and every column of the cycle sums to zero.

♦
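The construction in the proof, alternating $+1$ and $-1$ along a closed tour through row and column vertices, can be sketched in a few lines of Python. This is a minimal illustration under the assumption that the tour is a simple cycle (no row or column visited twice); the function name and the example tour are hypothetical.

```python
def cycle_from_tour(row_idx, col_idx, m, n):
    """Build an m x n pattern from the closed tour
    v_{i_1} w_{j_1} v_{i_2} w_{j_2} ... v_{i_k} w_{j_k} (back to v_{i_1}).

    row_idx = [i_1, ..., i_k] and col_idx = [j_1, ..., j_k] are 0-based.
    """
    k = len(row_idx)
    pattern = [[0] * n for _ in range(m)]
    for s in range(k):
        # edge v_{i_s} - w_{j_s} receives +1
        pattern[row_idx[s]][col_idx[s]] = 1
        # edge w_{j_s} - v_{i_{s+1}} receives -1; the last one closes the tour at v_{i_1}
        pattern[row_idx[(s + 1) % k]][col_idx[s]] = -1
    return pattern


# Hypothetical tour in a 3 x 4 table: rows 0, 1, 2 and columns 0, 2, 3
pattern = cycle_from_tour([0, 1, 2], [0, 2, 3], m=3, n=4)
row_sums = [sum(row) for row in pattern]
col_sums = [sum(col) for col in zip(*pattern)]
```

Because every visited row and column receives exactly one $+1$ and one $-1$, adding such a pattern to a table leaves all marginal totals unchanged.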

Proof of theorem 4: Considering again $({\cal S}_{r})$ as a sequence of matrices, we may w.l.o.g. focus on an arbitrarily chosen matrix element, since for each cell there exist two basic data cycles containing non-zero entries at that position. We choose the element $({\cal S}_{r})_{11}$ in the upper left corner and obtain symmetric distributions for each $r$, particularly for $r=1$ and $r=2$:

$$ ({\cal S}_{1})_{11}=\left\{\begin{array}{cl} 2 & \mbox{with probability} \, \alpha \beta \\ 1 & \mbox{with probability} \, (1-\gamma) \gamma \\ 0 & \mbox{with probability} \, \alpha^2 + \beta^2 + \gamma^2 \\ -1 & \mbox{with probability} \, (1-\gamma) \gamma \\ -2 & \mbox{with probability} \, \alpha \beta\end{array}\right.$$

and

$$ ({\cal S}_{2})_{11}=\left\{\begin{array}{cl} 4 & \alpha^2 \beta^2 \\ 3 & 2 \gamma (\alpha \beta^2 + \alpha^2 \beta) \\ 2 & 2 (\alpha \beta^3 + \alpha^3 \beta) + \gamma^2 (4 \alpha \beta + \alpha^2 + \beta^2) \\ 1 & 2 \gamma (\alpha^3 + \beta^3 + 2 \alpha^2 \beta + 2 \alpha \beta^2) + 2 \gamma^3 (\alpha + \beta) \\ 0 & \alpha^4 + \beta^4 + \gamma^4 + 4 \alpha^2 \beta^2 + 4 \gamma^2 (\alpha^2 + \beta^2 + \alpha \beta) \\ \vdots & \vdots \\ -4 & \alpha^2 \beta^2.\end{array}\right.$$

The calculation of the corresponding variances $Var[({\cal S}_{1})_{11}]$ and $Var[({\cal S}_{2})_{11}]$ is now straightforward. Expressed in terms of the random variables $Z_{i}$ introduced in section 2.1, it follows that

$$ ({\cal S}_{1})_{11}=(S_{n})_{11}=\sum_{k=0}^{n-1} Z_{k} \,M_{k \,mod \,n} = Z_{0} \,m_{11}^{(0)} + Z_{n-1} \,m_{11}^{(n-1)} = Z_{0} - Z_{n-1}, $$

since there remain just two data cycles perturbing the top left cell. Thus for the expectation of $({\cal S}_{1})_{11}$ we get

$$ E[({\cal S}_{1})_{11}] = E(Z_{0}-Z_{n-1}) = E(Z_{0})-E(Z_{n-1}) = 0 \,. $$

Analogously, we obtain the same result for arbitrary $r \in \mathbb{N}$:

$$\begin{aligned} E[({\cal S}_{r})_{11}] & = & E[\sum_{k=0}^{r-1} (Z_{kn}-Z_{(k+1)n-1})] \\ & = & \sum_{k=0}^{r-1} E(Z_{kn}-Z_{(k+1)n-1}) \\ & = & \sum_{k=0}^{r-1} [E(Z_{kn})-E(Z_{(k+1)n-1})] = 0\end{aligned}$$

♦
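The symmetry and the zero expectation of $({\cal S}_{r})_{11}$ can also be verified numerically. The sketch below is an illustration under the assumption that the $Z_{i}$ are independent with $P(Z=1)=\alpha$, $P(Z=-1)=\beta$ and $P(Z=0)=\gamma=1-\alpha-\beta$; the helper names are hypothetical. It builds the distribution of $Z_{0}-Z_{n-1}$ and convolves it with itself to obtain the case $r=2$.

```python
from collections import defaultdict
from fractions import Fraction


def diff_distribution(alpha, beta):
    """Distribution of Z_0 - Z_{n-1} for two independent coefficients."""
    pz = {1: alpha, -1: beta, 0: 1 - alpha - beta}
    dist = defaultdict(Fraction)
    for a, pa in pz.items():
        for b, pb in pz.items():
            dist[a - b] += pa * pb
    return dict(dist)


def convolve(d1, d2):
    """Distribution of the sum of two independent variables."""
    out = defaultdict(Fraction)
    for x, px in d1.items():
        for y, py in d2.items():
            out[x + y] += px * py
    return dict(out)


def mean(dist):
    return sum(x * p for x, p in dist.items())


def variance(dist):
    mu = mean(dist)
    return sum((x - mu) ** 2 * p for x, p in dist.items())


alpha, beta = Fraction(3, 10), Fraction(1, 10)  # deliberately alpha != beta
s1 = diff_distribution(alpha, beta)  # distribution of (S_1)_11
s2 = convolve(s1, s1)                # distribution of (S_2)_11
```

Choosing $\alpha \neq \beta$ makes the check more demanding: the single coefficient $Z$ is then biased, yet the cell perturbation remains symmetric about zero because the two cycles touching the cell enter with opposite signs.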

Proof of corollary 1: The sufficient condition is obvious. The necessary condition is immediately derived from the definition of bidiagonal data cycles: Let w.l.o.g. the first column of $T^{O}$, which we denote by $t_{1} = (t_{11}, t_{12}, \ldots, t_{1m})$, coincide with the first one of $T^P$.

1. Since each $t_{1i}, \,i = 1, \ldots, m$, is unperturbed, the same holds for its associated diagonal elements
$$ \{t_{(1+j) mod\, m, (i+j) mod\,m} \, | \, j=1, \ldots, m-1\}, \,\,i = 1, \ldots, m. $$
2. We consider the unperturbed cells of the second column. Together with the column’s marginal total being unperturbed by nature of the method, it is possible to find the original value of the only (potentially) perturbed element $t_{21}$ of the second column $t_{2}$.

Now repeat steps 1. and 2. successively for $t_{3}, \ldots, t_{n-m}$ in order to build up the entire original table.

♦

Appendix B: Core program code

(* Load the package that provides ZeroMatrix *)
<< LinearAlgebra`MatrixManipulation`

(* Input of parameters and original table *)
\[Alpha] = 0.25 (* Input of α and β, default is 0.25 *)
\[Beta] = 0.25
TabOrig = Import["OrigTable.txt", "Table"] (* Import of original table *)
Dim = Dimensions[TabOrig]
m = Dim[[1]] (* Number of rows *)
n = Dim[[2]] (* Number of columns *)
Ncover = 2 n (* Number of basic cycles to be applied, default 2 n *)

(* Initialization of the perturbation matrix *)
Init[x_, y_, z_] = 0
M = Array[Init, {m, n, n}] // MatrixForm (* part 1 of M is the m x n x n array *)
For[i = 0, i < n, i++,
 For[k = 0, k < m, k++, M[[1, 1 + k, 1 + Mod[k + i, n], i + 1]] = 1];
 For[k = 0, k < m - 1, k++, M[[1, 1 + k, 1 + Mod[k + 1 + i, n], i + 1]] = -1];
 M[[1, m, 1 + Mod[i, n], i + 1]] = -1]

(* Drawing the random numbers and perturbation coefficients *)
RandVec = Range[Ncover]
NCoeff = Range[Ncover]
For[i = 1, i < Ncover + 1, i++,
 RandVec[[i]] = Random[];
 NCoeff[[i]] = If[RandVec[[i]] < \[Alpha], 1, If[RandVec[[i]] < \[Alpha] + \[Beta], -1, 0]]]

(* Calculation of the perturbation tensor *)
Cycles = Array[Init, {m, n, Ncover}] // MatrixForm
For[i = 0, i < Ncover, i++,
 For[j = 1, j < m + 1, j++,
  For[k = 1, k < n + 1, k++,
   Cycles[[1, j, k, i + 1]] = NCoeff[[i + 1]] M[[1, j, k, 1 + Mod[i, n]]]]]]
Print["Perturbation tensor:"]
Cycles

(* Calculation of the perturbed table *)
TabPubl = TabOrig
Print["Vector of coefficients:"]
NCoeff
For[i = 1, i < Ncover + 1, i++,
 (* Suppress cycle i if it touches a cell whose current value is too small *)
 For[j = 1, j < m + 1, j++,
  For[k = 1, k < n + 1, k++,
   If[TabPubl[[j, k]] < Abs[Cycles[[1, j, k, i]]], (NCoeff[[i]] = 0; Break[])]]];
 (* Apply cycle i only if it has not been suppressed *)
 If[NCoeff[[i]] != 0,
  For[j = 1, j < m + 1, j++,
   For[k = 1, k < n + 1, k++,
    TabPubl[[j, k]] = TabPubl[[j, k]] + Cycles[[1, j, k, i]]]]]]
Print["Perturbed table:"]
TabPubl // TableForm
Export["PertTable.html", TabPubl]
Print["Reduced coefficient vector due to suppressed cycles:"]
NCoeff

(* Perturbation without suppressed cycles *)
TabNoSup = TabOrig
Cyclesum = ZeroMatrix[m, n]
For[j = 1, j < m + 1, j++,
 For[k = 1, k < n + 1, k++,
  Cyclesum[[j, k]] = Sum[Cycles[[1, j, k, i]], {i, 1, Ncover}]]]
For[j = 1, j < m + 1, j++,
 For[k = 1, k < n + 1, k++,
  TabNoSup[[j, k]] = TabNoSup[[j, k]] + Cyclesum[[j, k]]]]
Print["As compared to the unbiased perturbed one without suppression:"]
TabNoSup // TableForm
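
For readers without access to Mathematica, the main steps of the listing above (building the bidiagonal basic cycles, drawing the coefficients, and applying the cycles with suppression) might be paraphrased in Python roughly as follows. This is a sketch and not the author's code: the function names are hypothetical, the table is passed as a list of lists instead of being imported from a file, and it assumes $n \ge m \ge 2$.

```python
import random


def bidiagonal_cycles(m, n):
    """Return the n basic cycles as m x n patterns (0-based analogue of the array M)."""
    cycles = []
    for i in range(n):
        pattern = [[0] * n for _ in range(m)]
        for k in range(m):
            pattern[k][(k + i) % n] = 1        # shifted diagonal
        for k in range(m - 1):
            pattern[k][(k + 1 + i) % n] = -1   # shifted superdiagonal
        pattern[m - 1][i % n] = -1             # entry closing the cycle in the last row
        cycles.append(pattern)
    return cycles


def draw_coefficients(count, alpha=0.25, beta=0.25, rng=random):
    """Draw coefficients: +1 with prob. alpha, -1 with prob. beta, 0 otherwise."""
    coeffs = []
    for _ in range(count):
        u = rng.random()
        coeffs.append(1 if u < alpha else (-1 if u < alpha + beta else 0))
    return coeffs


def perturb(table, coeffs):
    """Apply the cycles one after another; a cycle is suppressed (coefficient
    set to 0) if it touches a cell whose current value is below 1."""
    m, n = len(table), len(table[0])
    basic = bidiagonal_cycles(m, n)
    out = [list(row) for row in table]
    applied = list(coeffs)
    for i, z in enumerate(coeffs):
        if z == 0:
            continue
        pattern = basic[i % n]
        touched = [(j, k) for j in range(m) for k in range(n) if pattern[j][k] != 0]
        if any(out[j][k] < 1 for j, k in touched):
            applied[i] = 0
            continue
        for j, k in touched:
            out[j][k] += z * pattern[j][k]
    return out, applied


if __name__ == "__main__":
    original = [[3, 0, 2, 5], [1, 4, 0, 2], [2, 2, 6, 1]]  # hypothetical table
    coeffs = draw_coefficients(2 * len(original[0]))        # default Ncover = 2 n
    perturbed, applied = perturb(original, coeffs)
    print(perturbed, applied)
```

Since every basic cycle has zero row and column sums, the perturbed table keeps the original marginal totals, and the suppression rule keeps all cells non-negative and leaves original zero cells at zero.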

About this article

Cite this article

Lenz, R. Recent advances in cyclic perturbation of frequency tables. AStA Wirtsch Sozialstat Arch 10, 37–62 (2016). https://doi.org/10.1007/s11943-016-0180-6

Received: 04 March 2015
Accepted: 19 January 2016
Published: 26 February 2016
Issue Date: February 2016
DOI: https://doi.org/10.1007/s11943-016-0180-6
