Abstract
In statistical disclosure control, data holders may release confidential data only if the data are sufficiently protected in accordance with national and institutional legislation. When releasing frequency tables to users, data holders usually apply so-called pre- and post-tabular methods, weighing disclosure limitation against the users’ requirements. The main characteristic of post-tabular methods is that tables are computed from the original underlying microdata and perturbed just before transmission to users. The method of cyclic perturbation, proposed in (Duncan and Roehrig, Database Technologies: concepts, methodologies, tools and applications, pp 1823–1843, 2007), is a promising approach. Here, a sequence of perturbation patterns is added consecutively to an original table following a certain stochastic procedure. The paper presents different variations for defining that sequence and discusses appropriate parameter settings for balancing the conventional trade-off between data utility and disclosure risk.
Summary
When providing confidential microdata and tabular data, data holders must take into account both the existing legislation and the needs of their data users. For frequency tables, so-called pre- and post-tabular confidentiality strategies are traditionally pursued in order to satisfy these two opposing objectives as well as possible. In the latter, a table generated from the original microdata is modified before release by special methods of suppression, rounding, or perturbation of cell values. The family of cyclic perturbation methods proposed in (Duncan and Roehrig, Database Technologies: concepts, methodologies, tools and applications, pp 1823–1843, 2007) has proven promising. Based on random numbers, a sequence of perturbation patterns, called basic cycles, is added successively to an original table. This paper presents several variants of cyclic perturbation and evaluates them with respect to the classical trade-off between minimizing information loss and guaranteeing sufficient data protection.
References
Baglivo J, Oliver D, Pagano M (1988) Methods for the analysis of contingency tables with large and small cell counts. J Amer Statistical Assoc 83:1006–1013
Brandt M, Zwick M (2011) Improvement of data access—the long way to remote data access in Germany, discussion paper 39, Research Data Centres of the Federal Statistical Office and the Land Statistical Offices of Germany
Diaconis P, Sturmfels B (1998) Algebraic algorithms for sampling from conditional distributions. Annals of Statistics 26:363–397
Cox LH (2007) Contingency tables of network type: models, Markov basis and applications. Statistica Sinica 17:1371–1393
Duncan GT, Roehrig SF (2007) Reconciling information privacy and information access in a globalized technology society. In: Erickson J (ed) Database technologies: concepts, methodologies, tools and applications. IGI Global, Hershey, pp 1823–1843
Duncan GT, Elliott M, Salazar-Gonzales JJ (2011) Statistical Confidentiality—Principles and Practice, Statistics for Social and Behavioral Sciences, Springer, Berlin
Fellegi IP (1972) On the Question of Statistical Confidentiality. J Amer Statistical Assoc 67:7–18
Fraser B, Wooton J (2005) A proposed method for confidentialising tabular output to protect against differencing, https://www.unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.46/2005/wp.35.e.pdf, UNECE/Eurostat work session on statistical data confidentiality, Geneva
Gießing S (2004) Survey on methods of tabular data protection, PSD 2004. In: Domingo-Ferrer J, Torra V (eds), Lecture notes in computer science 3050, Springer, Berlin, pp 1–13
Gießing S (2011) Post-tabular stochastic noise to protect skewed business data, UNECE Conference of European Statisticians, Tarragona
Hafner H-P, Ritchie F, Lenz R (2014) User-focused threat identification for anonymised microdata, Conference of European stakeholders, Rome
Höhne J, Höninger J (2012) Morpheus—Remote access to micro data with a quality measure, Working Paper Series of the German Data Forum 203
Hundepool A, Van de Wetering A, Ramaswamy R, de Wolf P-P, Gießing S, Fischetti M, Salazar JJ, Castro J, Lowthian P (2014), τ-Argus user’s manual, version 4.1, see http://neon.vb.cbs.nl/casc/tau.htm
Hundepool A, Domingo-Ferrer J, Franconi L, Gießing S, Lenz R, Longhurst J, Schulte Nordholt E, Seri G, De Wolf P Handbook on statistical disclosure control, see http://neon.vb.cbs.nl/casc/SDC_Handbook.pdf
Lenz R (2011) On the way to remote access to German official microdata—a glimpse of work in progress, Statistique et nouvelles technologies de l’information, Revue des Nouvelles Technologies de l’information (RNTI), 125–138
Ronning G, Sturm R, Höhne J, Lenz R, Rosemann M, Scheffler M, Vorgrimler D (2005) Handbuch zur Anonymisierung wirtschaftsstatistischer Mikrodaten, Statistik und Wissenschaft, Band 4, Wiesbaden
Salazar JJ (2006) Controlled rounding and cell perturbation: Statistical disclosure limitation methods for tabular data. Math Program 105(2–3):251–274
Shlomo N, Young C (2008) Invariant post-tabular protection of census frequency counts. In: Domingo-Ferrer J, Saygin Y (eds) PSD 2008, Lecture Notes in Computer Science, pp 77–89
Smith D, Elliot M (2008) A measure of disclosure risk for tables of counts. Trans Data Priv 1:34–52
Statistische Ämter des Bundes und der Länder. Access to official German micro data. http://www.forschungsdatenzentrum.de/bestand/
Acknowledgement
This work was partially supported by the German Federal Ministry of Research and Education. The author also acknowledges the anonymous referees whose suggestions improved the article.
Appendices
Appendix A: Proofs
Proof of lemma 1: Obviously it holds \(E(Z)=\alpha - \beta\), so that we obtain
♦
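The expectation used in this proof can be checked by hand. The following is a minimal sketch, assuming that Z is the perturbation coefficient drawn as in Appendix B, i.e. Z takes the value +1 with probability α, −1 with probability β, and 0 otherwise. The function name `moments` is hypothetical and not part of the original program.

```python
from fractions import Fraction


def moments(alpha, beta):
    """Exact mean and variance of Z in {+1, -1, 0}.

    Assumption: P(Z=+1)=alpha, P(Z=-1)=beta, P(Z=0)=1-alpha-beta,
    as in the coefficient draw of Appendix B.
    """
    dist = {1: alpha, -1: beta, 0: 1 - alpha - beta}
    mean = sum(z * p for z, p in dist.items())          # equals alpha - beta
    second = sum(z * z * p for z, p in dist.items())    # equals alpha + beta
    return mean, second - mean ** 2


# Default setting of Appendix B: alpha = beta = 1/4
mean, var = moments(Fraction(1, 4), Fraction(1, 4))
```

For the symmetric default α = β the coefficient is unbiased, and its variance reduces to α + β.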
Proof of theorem 2: We define an undirected bipartite graph \({\cal G}\) with bipartition \((V, W)\) as follows. Let \(V=\{v_{1}, \ldots, v_{m}\}\) and \(W=\{w_{1}, \ldots, w_{n}\}\) be the vertices referring to the rows and columns of T, respectively. In \({\cal G}\), two vertices \(v_{i}\) and \(w_{j}\) are connected by an edge if and only if \(t(i, j)> 0\) holds for the value of cell \((i, j)\). Since each row and each column of T contains at least two non-zero entries, every vertex of \({\cal G}\) has degree at least 2, so there exists a connected 2-regular subgraph \({\cal G}'\) of \({\cal G}\), that is, a cycle whose vertices have degree \(d(x)=2\) for all \(x \in{\cal G}'\). Hence, \({\cal G}'\) possesses an Eulerian tour. Let \(v_{i_1}w_{j_1}v_{i_2}w_{j_2} \ldots v_{i_k}w_{j_k}\) be that tour. We generate the non-zero entries of the associated cycle by alternating signs along the tour: \(t_{i_1,j_1}=1, \, t_{i_2,j_1}=-1, \, t_{i_2,j_2}=1, \, t_{i_3,j_2}=-1, \, \ldots, t_{i_k,j_k}=1, \, t_{i_1,j_k}=-1\), so that every row and every column of the cycle sums to zero.
♦
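The construction in this proof can be illustrated with a short sketch. It assumes that the closed tour is given as two index lists of equal length (0-based row indices and column indices, visited alternately) and that no edge is used twice. The entries alternate between +1 and −1 along the tour, with the final −1 placed on the edge that returns to the starting row vertex. The function name `cycle_from_tour` is hypothetical.

```python
def cycle_from_tour(rows, cols, m, n):
    """Build an m x n perturbation cycle from a closed tour.

    rows = [i_1, ..., i_k] and cols = [j_1, ..., j_k] (0-based) describe the
    tour v_{i_1} w_{j_1} v_{i_2} w_{j_2} ... v_{i_k} w_{j_k} back to v_{i_1}.
    Assumption: the tour is a simple cycle (no edge is repeated).
    """
    k = len(rows)
    cycle = [[0] * n for _ in range(m)]
    for s in range(k):
        cycle[rows[s]][cols[s]] = 1              # edge v_{i_s} -- w_{j_s}
        cycle[rows[(s + 1) % k]][cols[s]] = -1   # edge w_{j_s} -- v_{i_{s+1}}
    return cycle


# Tour v_1 w_1 v_2 w_2 v_3 w_3 in a 3 x 3 table
example = cycle_from_tour([0, 1, 2], [0, 1, 2], 3, 3)
# example == [[1, 0, -1], [-1, 1, 0], [0, -1, 1]]
```

Each row and each column of the result contains exactly one +1 and one −1, so adding the pattern to a table leaves all marginal totals unchanged.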
Proof of theorem 4: Considering again \(({\cal S}_{r})\) as a sequence of matrices, we may w.l.o.g. focus on an arbitrarily chosen matrix element, since for each cell there exist two basic data cycles containing non-zero entries at that position. We choose the element \(({\cal S}_{r})_{11}\) in the upper left corner and obtain symmetric distributions for each r, particularly for \(r=1\) and \(r=2\):
and
The calculation of the corresponding variances \(Var[({\cal S}_{1})_{11}]\) and \(Var[({\cal S}_{2})_{11}]\) is now straightforward. Expressed by the random variables \(Z_{i}\) introduced in Section 2.1, it follows
since there remain just two data cycles perturbing the top left cell. Thus for the expectation of \(({\cal S}_{1})_{11}\) we get
Dually, we obtain the same result for arbitrary \(r \in \textbf{N}\):
♦
Proof of corollary 1: The sufficient condition is obvious. The necessary condition is immediately derived from the definition of bidiagonal data cycles: Let w.l.o.g. the first column of \(T^{O}\), which we denote by \(t_{1} = (t_{11}, t_{12}, \ldots, t_{1m})\), coincide with the first one of \(T^P\).
1. Since each \(t_{1i}, \,i = 1, \ldots, m\), is unperturbed, the same holds for its associated diagonal elements
$$ \{t_{(1+j) \bmod m,\, (i+j) \bmod m} \mid j=1, \ldots, m-1\}, \quad i = 1, \ldots, m. $$
2. We consider the unperturbed cells of the second column. Together with the column’s marginal total being unperturbed by nature of the method, it is possible to find the original value of the only (potentially) perturbed element \(t_{21}\) of the second column \(t_{2}\).
Now repeat steps 1. and 2. successively for \(t_{3}, \ldots, t_{n-m}\) in order to build up the entire original table.
♦
Appendix B: Core program code
<< LinearAlgebra`MatrixManipulation`
(* Input of parameters and original table *)
\[Alpha] = 0.25 (* Input of α and β, default is 0.25 *)
\[Beta] = 0.25
TabOrig = Import["OrigTable.txt", "Table"] (* Import of original table *)
Dim = Dimensions[TabOrig]
m = Dim[[1]] (* Number of rows *)
n = Dim[[2]] (* Number of columns *)
Ncover = 2 n (* Number of basic cycles to be applied, default 2 n *)
(* Initialization of the perturbation matrix *)
Init[x_, y_, z_] = 0
M = Array[Init, {m, n, n}] // MatrixForm
(* M is wrapped in MatrixForm, hence the leading index 1 in M[[1, ...]] *)
For[i = 0, i < n,
 For[k = 0, k < m, M[[1, 1 + k, 1 + Mod[k + i, n], i + 1]] = 1; k++];
 For[k = 0, k < m - 1, M[[1, 1 + k, 1 + Mod[k + 1 + i, n], i + 1]] = -1; k++];
 M[[1, m, 1 + Mod[i, n], i + 1]] = -1;
 i++]
(* Drawing the random numbers and perturbation coefficients *)
RandVec = Range[Ncover]
NCoeff = Range[Ncover]
For[i = 1, i < Ncover + 1,
 RandVec[[i]] = Random[];
 NCoeff[[i]] =
  If[RandVec[[i]] < \[Alpha], 1, If[RandVec[[i]] < \[Alpha] + \[Beta], -1, 0]];
 i++]
(* Calculation of the perturbation tensor *)
Cycles = Array[Init, {m, n, Ncover}] // MatrixForm
For[i = 0, i < Ncover,
 For[j = 1, j < m + 1,
  For[k = 1, k < n + 1,
   Cycles[[1, j, k, i + 1]] = NCoeff[[i + 1]] M[[1, j, k, 1 + Mod[i, n]]];
   k++];
  j++];
 i++]
Print["Perturbation tensor:"]
Cycles
(* Calculation of the perturbed table *)
TabPubl = TabOrig
Print["Vector of coefficients:"]
NCoeff
(* Cycle i is suppressed (coefficient set to 0) if some cell value is smaller than the absolute cycle entry at that position *)
For[i = 1, i < Ncover + 1,
 For[j = 1, j < m + 1,
  For[k = 1, k < n + 1,
   If[TabPubl[[j, k]] < Abs[Cycles[[1, j, k, i]]], {NCoeff[[i]] = 0; Break[]}];
   k++];
  j++];
 If[NCoeff[[i]] != 0,
  For[j = 1, j < m + 1,
   For[k = 1, k < n + 1,
    TabPubl[[j, k]] = TabPubl[[j, k]] + Cycles[[1, j, k, i]];
    k++];
   j++]];
 i++]
Print["Perturbed table:"]
TabPubl // TableForm
Export["PertTable.html", TabPubl]
Print["Reduced coefficient vector due to suppressed cycles:"]
NCoeff
(* Perturbation without suppressed cycles *)
TabNoSup = TabOrig
Cyclesum = ZeroMatrix[m, n]
For[j = 1, j < m + 1,
 For[k = 1, k < n + 1,
  Cyclesum[[j, k]] = Sum[Cycles[[1, j, k, i]], {i, 1, Ncover}];
  k++];
 j++]
For[j = 1, j < m + 1,
 For[k = 1, k < n + 1,
  TabNoSup[[j, k]] = TabNoSup[[j, k]] + Cyclesum[[j, k]];
  k++];
 j++]
Print["As compared to the unbiased perturbed one without suppression:"]
TabNoSup // TableForm
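For readers without access to the legacy Mathematica package, the core of the program above can be sketched in Python. This is a minimal re-implementation under stated assumptions, not the author's code: it assumes 2 ≤ m ≤ n so that the bidiagonal patterns have zero row and column sums, it draws each coefficient immediately before its cycle is applied rather than in advance, and the names `bidiagonal_cycles`, `draw_coefficient`, and `perturb` are hypothetical.

```python
import random


def bidiagonal_cycles(m, n):
    """The n basic cycles of the program above (assumes 2 <= m <= n)."""
    cycles = []
    for i in range(n):
        c = [[0] * n for _ in range(m)]
        for k in range(m):
            c[k][(k + i) % n] = 1        # +1 on the shifted diagonal
        for k in range(m - 1):
            c[k][(k + 1 + i) % n] = -1   # -1 one column to the right
        c[m - 1][i % n] = -1             # closing entry in the last row
        cycles.append(c)
    return cycles


def draw_coefficient(alpha, beta, rng):
    """Return +1 with probability alpha, -1 with probability beta, else 0."""
    u = rng.random()
    if u < alpha:
        return 1
    if u < alpha + beta:
        return -1
    return 0


def perturb(table, alpha=0.25, beta=0.25, n_cover=None, seed=None):
    """Apply n_cover basic cycles; suppress a cycle that touches a cell below 1."""
    m, n = len(table), len(table[0])
    rng = random.Random(seed)
    basic = bidiagonal_cycles(m, n)
    n_cover = 2 * n if n_cover is None else n_cover
    out = [row[:] for row in table]
    coeffs = []
    for i in range(n_cover):
        z = draw_coefficient(alpha, beta, rng)
        cyc = basic[i % n]
        # Same test as in the program above: cell value < |cycle entry|
        if z != 0 and any(out[j][k] < abs(cyc[j][k])
                          for j in range(m) for k in range(n)):
            z = 0
        for j in range(m):
            for k in range(n):
                out[j][k] += z * cyc[j][k]
        coeffs.append(z)
    return out, coeffs


original = [[3, 1, 4], [1, 5, 9], [2, 6, 5]]
published, used = perturb(original, seed=1)
```

Whatever the random draws, the published table keeps the row and column totals of the original and contains no negative counts, because a cycle is applied only if every cell it touches currently holds at least 1.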
Cite this article
Lenz, R. Recent advances in cyclic perturbation of frequency tables. AStA Wirtsch Sozialstat Arch 10, 37–62 (2016). https://doi.org/10.1007/s11943-016-0180-6