Recent advances in cyclic perturbation of frequency tables

New developments in the cyclic perturbation of frequency tables

  • Original publication
AStA Wirtschafts- und Sozialstatistisches Archiv

Abstract

As part of statistical disclosure control, data holders may release confidential data only if they are sufficiently protected in accordance with national and institutional legislation. When releasing frequency tables to users, data holders usually apply so-called pre- and post-tabular methods, weighing disclosure limitation against the users’ requirements. The main characteristic of post-tabular methods is that tables are computed from the original underlying microdata and perturbed just before transmission to users. The method of cyclic perturbation, proposed in (Duncan and Roehrig, Database Technologies: concepts, methodologies, tools and applications, pp 1823–1843, 2007), is very promising. Here, a sequence of perturbation patterns is added consecutively to an original table following a certain stochastic procedure. The paper presents different variants of defining that sequence and discusses appropriate parameter settings to balance the conventional trade-off between data utility and disclosure risk.

Summary

When providing confidential microdata and tabular data, data holders must comply with existing legislation on the one hand and meet the needs of their data users on the other. For frequency tables, so-called pre- and post-tabular confidentiality strategies are traditionally pursued to satisfy both opposing objectives as well as possible. With the latter, a table generated from the original microdata is modified before release by special methods of suppression, rounding or perturbation of cell values. The family of cyclic perturbation methods proposed in (Duncan and Roehrig, Database Technologies: concepts, methodologies, tools and applications, pp 1823–1843, 2007) has proven promising. Based on random numbers, a sequence of perturbation patterns, called basic cycles, is successively added to an original table. This article presents several variants of cyclic perturbation and evaluates them in terms of the classic trade-off between minimising information loss and ensuring sufficient data security.

References

  • Baglivo J, Oliver D, Pagano M (1988) Methods for the analysis of contingency tables with large and small cell counts. J Amer Statistical Assoc 83:1006–1013

  • Brandt M, Zwick M (2011) Improvement of data access—the long way to remote data access in Germany, discussion paper 39, Research Data Centres of the Federal Statistical Office and the Land Statistical Offices of Germany

  • Diaconis P, Sturmfels B (1998) Algebraic algorithms for sampling from conditional distributions. Annals of Statistics 26:363–397

  • Cox LH (2007) Contingency tables of network type: models, Markov basis and applications. Statistica Sinica 17:1371–1393

  • Duncan GT, Roehrig SF (2007) Reconciling information privacy and information access in a globalized technology society. In: Erickson J (ed) Database technologies: concepts, methodologies, tools and applications. IGI Global, Hershey, pp 1823–1843

  • Duncan GT, Elliott M, Salazar-Gonzales JJ (2011) Statistical Confidentiality—Principles and Practice, Statistics for Social and Behavioral Sciences, Springer, Berlin

  • Fellegi IP (1972) On the Question of Statistical Confidentiality. J Amer Statistical Assoc 67:7–18

  • Fraser B, Wooton J (2005) A proposed method for confidentialising tabular output to protect against differencing, https://www.unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.46/2005/wp.35.e.pdf, UNECE/Eurostat work session on statistical data confidentiality, Geneva

  • Gießing S (2004) Survey on methods of tabular data protection, PSD 2004. In: Domingo-Ferrer J, Torra V (eds), Lecture notes in computer science 3050, Springer, Berlin, pp 1–13

  • Gießing S (2011) Post-tabular stochastic noise to protect skewed business data, UNECE Conference of European Statisticians, Tarragona

  • Hafner H-P, Ritchie F, Lenz R (2014) User-focused threat identification for anonymised microdata, Conference of European stakeholders, Rome

  • Höhne J, Höninger J (2012) Morpheus—Remote access to micro data with a quality measure, Working Paper Series of the German Data Forum 203

  • Hundepool A, Van de Wetering A, Ramaswamy R, de Wolf P-P, Gießing S, Fischetti M, Salazar JJ, Castro J, Lowthian P (2014), τ-Argus user’s manual, version 4.1, see http://neon.vb.cbs.nl/casc/tau.htm

  • Hundepool A, Domingo-Ferrer J, Franconi L, Gießing S, Lenz R, Longhurst J, Schulte Nordholt E, Seri G, De Wolf P Handbook on statistical disclosure control, see http://neon.vb.cbs.nl/casc/SDC_Handbook.pdf

  • Lenz R (2011) On the way to remote access to German official microdata—a glimpse of work in progress, Statistique et nouvelles technologies de l’information, Revue des Nouvelles Technologies de l’information (RNTI), 125–138

  • Ronning G, Sturm R, Höhne J, Lenz R, Rosemann M, Scheffler M, Vorgrimler D (2005) Handbuch zur Anonymisierung wirtschaftsstatistischer Mikrodaten, Statistik und Wissenschaft, Band 4, Wiesbaden

  • Salazar JJ (2006) Controlled rounding and cell perturbation: Statistical disclosure limitation methods for tabular data. Math Program 105(2–3):251–274

  • Shlomo N, Young C (2008) Invariant post-tabular protection of census frequency counts. In: Domingo-Ferrer J, Saygin Y (eds) PSD 2008, Lecture Notes in Computer Science, pp 77–89

  • Smith D, Elliot M (2008) A measure of disclosure risk for tables of counts. Trans Data Priv 1:34–52

  • Statistische Ämter des Bundes und der Länder. Access to official German micro data. http://www.forschungsdatenzentrum.de/bestand/

Acknowledgement

This work was partially supported by the German Federal Ministry of Research and Education. The author also acknowledges the anonymous referees whose suggestions improved the article.

Author information

Corresponding author

Correspondence to Rainer Lenz.

Appendices

Appendix A: Proofs

Proof of lemma 1: Since \(E(Z)=\alpha - \beta\) and the matrices \(M_{i \,mod \,N}\) are deterministic, linearity of expectation gives

$$\begin{aligned} E(S_{k}) = E\Big(\sum_{i=0}^{k-1} X_{i}\Big) &= \sum_{i=0}^{k-1} E(X_{i}) = \sum_{i=0}^{k-1} E(Z_{i} \,M_{i \,mod \,N}) \\ &= \sum_{i=0}^{k-1} E(Z_{i}) \,M_{i \,mod \,N} \\ &= E(Z) \,\sum_{i=0}^{k-1} M_{i \,mod \,N} \\ &= (\alpha - \beta) \,\sum_{i=0}^{k-1} M_{i \,mod \,N} \,.\end{aligned}$$
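
The first step of this proof can be made explicit. The following sketch assumes that each \(Z\) takes the values \(1\), \(-1\) and \(0\) with probabilities \(\alpha\), \(\beta\) and \(1-\alpha-\beta\), which is how the coefficients are drawn in the program of Appendix B; this distribution is not restated in this appendix.

$$ E(Z) = 1 \cdot \alpha + (-1) \cdot \beta + 0 \cdot (1-\alpha-\beta) = \alpha - \beta \,. $$

With the program's default setting \(\alpha = \beta = 0.25\) this gives \(E(Z)=0\), so that \(E(S_{k})\) is the zero matrix for every k.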

Proof of theorem 2: We define an undirected bipartite graph \({\cal G}\) with bipartition \((V, W)\) as follows. Let \(V=\{v_{1}, \ldots, v_{m}\}\) and \(W=\{w_{1}, \ldots, w_{n}\}\) be the vertices referring to the rows and columns of T, respectively. In \({\cal G}\), two vertices \(v_{i}\) and \(w_{j}\) are connected by an edge if and only if \(t(i, j)> 0\) holds for the value of cell \((i, j)\). Since each row and each column of T contains at least two non-zero entries, every vertex of \({\cal G}\) has degree at least two, so there exists a connected 2-regular subgraph \({\cal G}'\) of \({\cal G}\), i.e. a cycle whose vertices have degree \(d(x)=2\) for all \(x \in{\cal G}'\). Hence, \({\cal G}'\) possesses an Eulerian tour. Let \(v_{i1}w_{j1}v_{i2}w_{j2} \ldots v_{ik}w_{jk}\) be that tour, which closes by returning from \(w_{jk}\) to \(v_{i1}\). We generate the non-zero entries of the associated cycle by alternating signs along the tour: \(t_{i1,j1}=1, \, t_{i2,j1}=-1, \, t_{i2,j2}=1, \, t_{i3,j2}=-1, \, \ldots, t_{ik,jk}=1, \, t_{i1,jk}=-1\). Each row and each column touched by the tour thus receives exactly one \(+1\) and one \(-1\).
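
As an illustration of this construction, consider a hypothetical \(3 \times 3\) table whose graph contains the closed tour \(v_{1}w_{1}v_{2}w_{2}v_{3}w_{3}\) (the table size and the tour are chosen here purely as an example). Alternating the signs \(+1\) and \(-1\) along the six edges of the tour, i.e. over the cells \((1,1), (2,1), (2,2), (3,2), (3,3), (1,3)\), yields the pattern

$$ \left(\begin{array}{rrr} 1 & 0 & -1 \\ -1 & 1 & 0 \\ 0 & -1 & 1 \end{array}\right), $$

in which every row and every column sums to zero, so adding it to a table leaves all marginal totals unchanged.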

Proof of theorem 4: Considering again \(({\cal S}_{r})\) as a sequence of matrices, we may w.l.o.g. focus on an arbitrarily chosen matrix element, since for each cell there exist two basic data cycles containing non-zero entries at that position. We choose the element \(({\cal S}_{r})_{11}\) in the upper left corner and obtain symmetric distributions for each r, in particular for \(r=1\) and \(r=2\):

$$ ({\cal S}_{1})_{11}=\left\{\begin{array}{cl} 2 & \mbox{with probability} \, \alpha \beta \\ 1 & \mbox{with probability} \, (1-\gamma) \gamma \\ 0 & \mbox{with probability} \, \alpha^2 + \beta^2 + \gamma^2 \\ -1 & \mbox{with probability} \, (1-\gamma) \gamma \\ -2 & \mbox{with probability} \, \alpha \beta\end{array}\right.$$

and

$$ ({\cal S}_{2})_{11}=\left\{\begin{array}{cl} 4 & \mbox{with probability} \, \alpha^2 \beta^2 \\ 3 & \mbox{with probability} \, 2 \gamma (\alpha \beta^2 + \alpha^2 \beta) \\ 2 & \mbox{with probability} \, 2 (\alpha \beta^3 + \alpha^3 \beta) + \gamma^2 (4 \alpha \beta + \alpha^2 + \beta^2) \\ 1 & \mbox{with probability} \, 2 \gamma (\alpha^3 + \beta^3 + 2 \alpha^2 \beta + 2 \alpha \beta^2) + 2 \gamma^3 (\alpha + \beta) \\ 0 & \mbox{with probability} \, \alpha^4 + \beta^4 + \gamma^4 + 4 \alpha^2 \beta^2 + 4 \gamma^2 (\alpha^2 + \beta^2 + \alpha \beta) \\ \vdots & \vdots \\ -4 & \mbox{with probability} \, \alpha^2 \beta^2.\end{array}\right.$$

The calculation of the corresponding variances \(Var[({\cal S}_{1})_{11}]\) and \(Var[({\cal S}_{2})_{11}]\) is now straightforward. Expressed in terms of the random variables \(Z_{i}\) introduced in section 2.1, we obtain

$$ ({\cal S}_{1})_{11}=(S_{n})_{11}=\sum_{k=0}^{n-1} Z_{k} \,M_{k \,mod \,n} = Z_{0} \,m_{11}^{(0)} + Z_{n-1} \,m_{11}^{(n-1)} = Z_{0} - Z_{n-1}, $$

since just two data cycles perturb the top left cell. Thus for the expectation of \(({\cal S}_{1})_{11}\) we get

$$ E[({\cal S}_{1})_{11}] = E(Z_{0}-Z_{n-1}) = E(Z_{0})-E(Z_{n-1}) = 0 \,. $$

Analogously, we obtain the same result for arbitrary \(r \in \textbf{N}\):

$$\begin{aligned} E[({\cal S}_{r})_{11}] &= E\Big[\sum_{k=0}^{r-1} (Z_{kn}-Z_{(k+1)n-1})\Big] \\ &= \sum_{k=0}^{r-1} E(Z_{kn}-Z_{(k+1)n-1}) \\ &= \sum_{k=0}^{r-1} [E(Z_{kn})-E(Z_{(k+1)n-1})] = 0 \,.\end{aligned}$$
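
The variances mentioned in this proof can be written out as follows. This is a sketch under two assumptions that are not restated in this appendix: the \(Z_{i}\) are independent, and each takes the values \(1\), \(-1\) and \(0\) with probabilities \(\alpha\), \(\beta\) and \(\gamma = 1-\alpha-\beta\), as in the coefficient draw of Appendix B.

$$\begin{aligned} Var(Z) &= E(Z^2) - E(Z)^2 = (\alpha+\beta) - (\alpha-\beta)^2, \\ Var[({\cal S}_{1})_{11}] &= Var(Z_{0}) + Var(Z_{n-1}) = 2 \,[(\alpha+\beta) - (\alpha-\beta)^2], \\ Var[({\cal S}_{r})_{11}] &= 2r \,[(\alpha+\beta) - (\alpha-\beta)^2] \,.\end{aligned}$$

For \(\alpha = \beta = 0.25\) this gives \(Var[({\cal S}_{1})_{11}] = 1\) and \(Var[({\cal S}_{2})_{11}] = 2\). The first value agrees with the probability table for \(({\cal S}_{1})_{11}\): since the mean is zero, the variance equals \(4 \cdot 2\alpha\beta + 1 \cdot 2(1-\gamma)\gamma = 0.5 + 0.5 = 1\).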

Proof of corollary 1: The sufficient condition is obvious. The necessary condition is immediately derived from the definition of bidiagonal data cycles: Let w.l.o.g. the first column of \(T^{O}\), which we denote by \(t_{1} = (t_{11}, t_{12}, \ldots, t_{1m})\), coincide with the first one of \(T^P\).

  1. Since each \(t_{1i}, \,i = 1, \ldots, m\), is unperturbed, the same holds for its associated diagonal elements

    $$ \{t_{(1+j) mod\, m, (i+j) mod\,m} \, | \, j=1, \ldots, m-1\}, \,\,i = 1, \ldots, m. $$
  2. We consider the unperturbed cells of the second column. Together with the column’s marginal total, which is unperturbed by the nature of the method, it is possible to find the original value of the only (potentially) perturbed element \(t_{21}\) of the second column \(t_{2}\).

Now repeat steps 1 and 2 successively for \(t_{3}, \ldots, t_{n-m}\) in order to build up the entire original table.
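
Step 2 amounts to a single subtraction. As an illustration, write \(t_{2\cdot}\) for the published marginal total of the second column; this symbol is introduced here for convenience and is not taken from the article. Then

$$ t_{21} = t_{2\cdot} - \sum_{i=2}^{m} t_{2i} \,. $$

For a hypothetical column with total 10 and unperturbed cells 3 and 5, the remaining cell must equal \(10 - 3 - 5 = 2\).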

Appendix B: Core program code

<< LinearAlgebra`MatrixManipulation`
(* Input of parameters and original table *)
\[Alpha] = 0.25  (* Input of α and β, default is 0.25 *)
\[Beta] = 0.25
TabOrig = Import["OrigTable.txt", "Table"]  (* Import of original table *)
Dim = Dimensions[TabOrig]
m = Dim[[1]]  (* Number of rows *)
n = Dim[[2]]  (* Number of columns *)
Ncover = 2 n  (* Number of basic cycles to be applied, default 2 n *)
(* Initialization of the perturbation matrix *)

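Init[x_, y_, z_] = 0
M = Array[Init, {m, n, n}] // MatrixForm
(* M[[1]] is the m x n x n array wrapped by MatrixForm; M[[1, j, k, c]] is cell (j, k) of basic cycle c *)
For[i = 0, i < n,
  (* +1 on the shifted diagonal, -1 on the next diagonal, and a closing -1 in the last row *)
  For[k = 0, k < m, M[[1, 1 + k, 1 + Mod[k + i, n], i + 1]] = 1; k++];
  For[k = 0, k < m - 1, M[[1, 1 + k, 1 + Mod[k + 1 + i, n], i + 1]] = -1; k++];
  M[[1, m, 1 + Mod[i, n], i + 1]] = -1;
  i++]
(* Drawing the random numbers and perturbation coefficients *)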

RandVec = Range[Ncover]
NCoeff = Range[Ncover]
(* Coefficient is 1 with probability α, -1 with probability β, and 0 otherwise *)
For[i = 1, i < Ncover + 1,
  RandVec[[i]] = Random[];
  NCoeff[[i]] = If[RandVec[[i]] < \[Alpha], 1, If[RandVec[[i]] < \[Alpha] + \[Beta], -1, 0]];
  i++]
(* Calculation of the perturbation tensor *)

Cycles = Array[Init, {m, n, Ncover}] // MatrixForm
(* Cycle i+1 is the basic cycle (i mod n)+1 scaled by its random coefficient *)
For[i = 0, i < Ncover,
  For[j = 1, j < m + 1,
    For[k = 1, k < n + 1,
      Cycles[[1, j, k, i + 1]] = NCoeff[[i + 1]] M[[1, j, k, 1 + Mod[i, n]]];
      k++];
    j++];
  i++]
Print["Perturbation tensor:"]
Cycles

(* Calculation of the perturbed table *)
TabPubl = TabOrig
Print["Vector of coefficients:"]
NCoeff
For[i = 1, i < Ncover + 1,
  (* Suppress cycle i if it touches a cell whose current value is smaller than the absolute change *)
  For[j = 1, j < m + 1,
    For[k = 1, k < n + 1,
      If[TabPubl[[j, k]] < Abs[Cycles[[1, j, k, i]]], {NCoeff[[i]] = 0; Break[]}];
      k++];
    j++];
  (* Apply cycle i only if it was not suppressed *)
  If[NCoeff[[i]] != 0,
    For[j = 1, j < m + 1,
      For[k = 1, k < n + 1,
        TabPubl[[j, k]] = TabPubl[[j, k]] + Cycles[[1, j, k, i]];
        k++];
      j++]];
  i++]
Print["Perturbed table:"]
TabPubl // TableForm
Export["PertTable.html", TabPubl]
Print["Reduced coefficient vector due to suppressed cycles:"]
NCoeff
(* Perturbation without suppressed cycles *)

TabNoSup = TabOrig
Cyclesum = ZeroMatrix[m, n]
(* Cell-wise sum of all cycles, including those suppressed above *)
For[j = 1, j < m + 1,
  For[k = 1, k < n + 1,
    Cyclesum[[j, k]] = Sum[Cycles[[1, j, k, i]], {i, 1, Ncover}];
    k++];
  j++]
For[j = 1, j < m + 1,
  For[k = 1, k < n + 1,
    TabNoSup[[j, k]] = TabNoSup[[j, k]] + Cyclesum[[j, k]];
    k++];
  j++]
Print["As compared to the unbiased perturbed one without suppression:"]
TabNoSup // TableForm

Cite this article

Lenz, R. Recent advances in cyclic perturbation of frequency tables. AStA Wirtsch Sozialstat Arch 10, 37–62 (2016). https://doi.org/10.1007/s11943-016-0180-6
