A recursive search algorithm for statistical disclosure assessment

Manning, Anna M.; Haglin, David J.; Keane, John A.

doi:10.1007/s10618-007-0078-6

Published: 10 July 2007

Volume 16, pages 165–196 (2008)

Data Mining and Knowledge Discovery

Anna M. Manning¹,
David J. Haglin² &
John A. Keane¹

Abstract

A new algorithm, SUDA2, is presented which finds minimally unique itemsets, i.e., minimal itemsets of frequency one. These itemsets, referred to as Minimal Sample Uniques (MSUs), are important for statistical agencies that wish to estimate the risk of disclosure of their datasets. SUDA2 is a recursive algorithm which uses new observations about the properties of MSUs to prune and traverse the search space. Experimental comparisons with previous work demonstrate that SUDA2 is several orders of magnitude faster, enabling datasets with significantly more columns to be addressed. The ability of SUDA2 to identify the boundaries of the search space for MSUs is clearly demonstrated.
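To make the definition of an MSU concrete, the following is a minimal brute-force sketch in Python. It is not SUDA2 and uses none of its pruning or recursion; it simply enumerates column subsets in order of increasing size and keeps, for each record, the itemsets that occur exactly once and contain no smaller unique itemset. The function name `minimal_sample_uniques`, the `max_size` parameter, and the toy records are assumptions introduced for illustration only.

```python
from itertools import combinations


def minimal_sample_uniques(rows, max_size=None):
    """Brute-force MSU search (illustrative only, exponential in the column count).

    rows: list of equal-length tuples, one tuple per record.
    Returns {row_index: [itemset, ...]}, where each itemset is a frozenset
    of (column_index, value) pairs that is unique to that row and minimal.
    """
    n_cols = len(rows[0]) if rows else 0
    max_size = n_cols if max_size is None else min(max_size, n_cols)
    found = {i: [] for i in range(len(rows))}

    # Smaller itemsets are visited first, so every unique proper subset of a
    # candidate is already covered by some MSU recorded for the same row.
    for size in range(1, max_size + 1):
        for cols in combinations(range(n_cols), size):
            # Group row indices by their values on the chosen columns.
            groups = {}
            for i, row in enumerate(rows):
                key = tuple(row[c] for c in cols)
                groups.setdefault(key, []).append(i)

            for key, idxs in groups.items():
                if len(idxs) != 1:
                    continue  # frequency above one: not unique
                i = idxs[0]
                itemset = frozenset(zip(cols, key))
                # Minimal only if no recorded MSU of this row is a proper subset.
                if not any(m < itemset for m in found[i]):
                    found[i].append(itemset)
    return found


# Hypothetical toy records: (sex, age band, area).
sample = [
    ("M", "30-39", "urban"),
    ("M", "30-39", "rural"),
    ("F", "30-39", "urban"),
    ("F", "40-49", "urban"),
]

if __name__ == "__main__":
    for idx, msus in minimal_sample_uniques(sample).items():
        print(idx, [sorted(m) for m in msus])
```

In this toy sample, the second record is unique on area alone, whereas the first record only becomes unique when sex and area are combined. The cost of enumerating every column subset in this way grows exponentially with the number of columns, which is the search space that the abstract says SUDA2 prunes.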

Author information

Authors and Affiliations

School of Computer Science, University of Manchester, Oxford Rd., Manchester, M13 9PL, UK
Anna M. Manning & John A. Keane
Department of Computer and Information Sciences, Minnesota State University, 273 Wissink Hall, Mankato, MN, 56001, USA
David J. Haglin

Corresponding author

Correspondence to Anna M. Manning.

Additional information

Responsible editor: Hannu Toivonen.

About this article

Cite this article

Manning, A.M., Haglin, D.J. & Keane, J.A. A recursive search algorithm for statistical disclosure assessment. Data Min Knowl Disc 16, 165–196 (2008). https://doi.org/10.1007/s10618-007-0078-6

Received: 27 March 2006
Accepted: 06 June 2007
Published: 10 July 2007
Issue Date: April 2008
DOI: https://doi.org/10.1007/s10618-007-0078-6
