Abstract
A new algorithm, SUDA2, is presented that finds minimally unique itemsets, i.e., minimal itemsets of frequency one. These itemsets, referred to as Minimal Sample Uniques (MSUs), are important for statistical agencies that wish to estimate the risk of disclosure of their datasets. SUDA2 is a recursive algorithm that uses new observations about the properties of MSUs to prune and traverse the search space. Experimental comparisons with previous work demonstrate that SUDA2 is several orders of magnitude faster, enabling datasets with significantly more columns to be addressed. The ability of SUDA2 to identify the boundaries of the search space for MSUs is clearly demonstrated.
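To make the definition of an MSU concrete, the following is a minimal brute-force sketch in Python. It is not SUDA2 and uses none of its pruning or recursion; it simply enumerates every itemset of each record and keeps those that occur in exactly one record while having no unique proper subset. The function names, the `(column index, value)` encoding of items, and the toy table are all assumptions made for illustration.

```python
from itertools import combinations


def support(rows, itemset):
    """Count the rows that contain every (column, value) pair in itemset."""
    return sum(all(row[col] == val for col, val in itemset) for row in rows)


def minimal_sample_uniques(rows, max_size=None):
    """Brute-force MSU search (illustrative only, exponential in columns).

    Returns a list of (row_index, itemset) pairs, where itemset is a tuple
    of (column_index, value) items that occurs in exactly one row and has
    no proper subset that is also unique.
    """
    max_size = max_size or len(rows[0])
    found = []
    for idx, row in enumerate(rows):
        items = list(enumerate(row))  # (column_index, value) pairs
        for k in range(1, max_size + 1):
            for itemset in combinations(items, k):
                if support(rows, itemset) != 1:
                    continue  # not unique in the sample
                # Support never increases as items are added, so if any
                # proper subset were unique, some subset of size k-1 would
                # be too. Checking those is enough to establish minimality.
                subsets = (s for s in combinations(itemset, k - 1) if s)
                if all(support(rows, s) > 1 for s in subsets):
                    found.append((idx, itemset))
    return found


# Hypothetical toy table: columns are sex, age band, area type.
rows = [
    ("F", "30-39", "urban"),
    ("F", "30-39", "rural"),
    ("M", "30-39", "urban"),
    ("M", "40-49", "urban"),
]

msus = minimal_sample_uniques(rows)
for idx, itemset in msus:
    print(idx, itemset)
```

In this toy table the second record is unique on the single value `rural`, whereas the first record only becomes unique on the pair `F` and `urban`. An exhaustive search like this grows exponentially with the number of columns, which is the cost that pruning of the search space is meant to avoid.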
Additional information
Responsible editor: Hannu Toivonen.
About this article
Cite this article
Manning, A.M., Haglin, D.J. & Keane, J.A. A recursive search algorithm for statistical disclosure assessment. Data Min Knowl Disc 16, 165–196 (2008). https://doi.org/10.1007/s10618-007-0078-6
DOI: https://doi.org/10.1007/s10618-007-0078-6