
A recursive search algorithm for statistical disclosure assessment

Data Mining and Knowledge Discovery

Abstract

A new algorithm, SUDA2, is presented that finds minimally unique itemsets, i.e., minimal itemsets of frequency one. These itemsets, referred to as Minimal Sample Uniques (MSUs), are important for statistical agencies that wish to estimate the risk of disclosure of their datasets. SUDA2 is a recursive algorithm that uses new observations about the properties of MSUs to prune and traverse the search space. Experimental comparisons with previous work demonstrate that SUDA2 is several orders of magnitude faster, enabling datasets with significantly more columns to be addressed. The ability of SUDA2 to identify the boundaries of the search space for MSUs is clearly demonstrated.
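To make the definition of an MSU concrete, the following is a minimal brute-force sketch in Python. It is not SUDA2 and does not use the recursive pruning the abstract describes. It enumerates column subsets in order of increasing size and keeps an itemset only if it occurs in exactly one row and no smaller unique itemset of that row is contained in it. The function name `minimal_sample_uniques`, the `max_size` parameter, and the representation of an item as a `(column_index, value)` pair are assumptions made for illustration.

```python
from itertools import combinations


def minimal_sample_uniques(rows, max_size=None):
    """Naive level-wise search for minimal sample uniques (illustrative only).

    rows: list of equal-length tuples of categorical values.
    Returns {row_index: [frozenset of (column_index, value) pairs, ...]}.
    """
    n_cols = len(rows[0]) if rows else 0
    max_size = n_cols if max_size is None else max_size
    msus = {i: [] for i in range(len(rows))}

    # Visit smaller itemsets first so minimality can be checked against
    # the uniques already recorded for the same row.
    for size in range(1, max_size + 1):
        for cols in combinations(range(n_cols), size):
            # Group row indices by their values on the chosen columns.
            groups = {}
            for i, row in enumerate(rows):
                key = tuple(row[c] for c in cols)
                groups.setdefault(key, []).append(i)

            for key, members in groups.items():
                if len(members) != 1:
                    continue  # frequency is not one, so not unique
                i = members[0]
                itemset = frozenset(zip(cols, key))
                # Keep only if no recorded MSU of this row is a proper subset.
                if not any(found < itemset for found in msus[i]):
                    msus[i].append(itemset)
    return msus


if __name__ == "__main__":
    sample = [
        ("a", "x", "1"),
        ("a", "y", "1"),
        ("b", "y", "2"),
        ("b", "y", "1"),
    ]
    for row_index, itemsets in minimal_sample_uniques(sample).items():
        print(row_index, [sorted(s) for s in itemsets])
```

This exhaustive search examines every column subset, so its cost grows exponentially with the number of columns. That growth is the reason a pruned search matters for datasets with many columns.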

Author information

Corresponding author

Correspondence to Anna M. Manning.

Additional information

Responsible editor: Hannu Toivonen.

About this article

Cite this article

Manning, A.M., Haglin, D.J. & Keane, J.A. A recursive search algorithm for statistical disclosure assessment. Data Min Knowl Disc 16, 165–196 (2008). https://doi.org/10.1007/s10618-007-0078-6

  • DOI: https://doi.org/10.1007/s10618-007-0078-6
