Abstract
We tackle the challenging problem of mining the simplest Boolean patterns from categorical datasets. Instead of complete enumeration, which is typically infeasible for this class of patterns, we develop effective sampling methods to extract a representative subset of the minimal Boolean patterns in disjunctive normal form (DNF). We propose a novel theoretical characterization of the minimal DNF expressions, which allows us to prune the pattern search space effectively. Our approach can provide a near-uniform sample of the minimal DNF patterns. We perform an extensive set of experiments to demonstrate the effectiveness of our sampling method. We also show that minimal DNF patterns make effective features for classification.
Similar content being viewed by others
References
Agrawal R, Mannila H, Srikant R, Toivonen H, Verkamo AI (1996) Fast discovery of association rules. In: Fayyad U et al (eds) Advances in knowledge discovery and data mining. AAAI Press, Menlo Park, pp 307–328
Akutsu T, Kuhara S, Maruyama O, Miyano S (1998) Identification of gene regulatory networks by strategic gene disruptions and gene overexpressions. In: ACM-SIAM symposium on discrete algorithms
Antonie M-L, Zaiane O (2004) Mining positive and negative association rules: an approach for confined rules. In: European conference on principles and practice of knowledge discovery in databases
Bastide Y, Taouil R, Pasquier N, Stumme G, Lakhal L (2000) Mining frequent patterns with counting inference. SIGKDD Explor 2(2):66–75
Bayardo RJ, Agrawal R (1999) Mining the most interesting rules. In: ACM SIGKDD international conference on knowledge discovery and data mining
Boley M, Gärtner T, Grosskreutz H, Fraunhofer I (2010) Formal concept sampling for counting and threshold-free local pattern mining. In: SIAM data mining conference
Boley M, Grosskreutz H (2009) Approximating the number of frequent sets in dense data. Knowl Inform Syst 21(1):65–89
Boley M, Lucchese C, Paurat D, Gärtner T (2011) Direct local pattern sampling by efficient two-step random procedures. In: ACM SIGKDD international conference on knowledge discovery and data mining
Bshouty N (1995) Exact learning boolean functions via the monotone theory. Inform Comput 123(1):146–153
Calders T, Goethals B (2003) Minimal k-free representations of frequent sets. In: European conference on principles and practice of knowledge discovery in databases
Calders T, Goethals B (2005) Quick inclusion-exclusion. In: Proceedings ECML-PKDD workshop on knowledge discovery in inductive databases
Chang C, Lin C (2011) Libsvm: a library for support vector machines. ACM Trans Intell Syst Technol 2(3):1–39
Chaoji V, Hasan MA, Salem S, Besson J, Zaki MJ (2008) ORIGAMI: a novel and effective approach for mining representative orthogonal graph patterns. Stat Anal Data Min 1(2):67–84
Cowles M, Carlin B (1996) Markov chain monte carlo convergence diagnostics: a comparative review. J Am Stat 91(434):883–904
Curk T, Demsar J, Xu Q, Leban G, Petrovic U, Bratko I, Shaulsky G, Zupan B (2005) Microarray data mining with visual programming. Bioinformatics 21(3):396–398
Dong G, Jiang C, Pei J, Li J, Wong L (2005) Mining succinct systems of minimal generators of formal concepts. In: International conference database systems for advanced applications
Fayyad U, Irani K (1993) Multi-interval discretization of continuous-valued attributes for classification learning. In: Proceedings of the 13th international joint conference on articial intelligence
Frank A, Asuncion A (2010) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences, (http://archive.ics.uci.edu/ml)
Ganter B, Wille R (1999) Formal concept analysis: mathematical foundations. Springer, Berlin
Goethals B, Zaki MJ (2004) Advances in frequent itemset mining implementations: report on FIMI’03. SIGKDD Explor 6(1):109–117
Gunopulos D, Khardon R, Mannila H, Saluja S, Toivonen H, Sharma R (2003) Discovering all most specific sentences. ACM Trans Database Syst 28(2):140–174
Gunopulos D, Mannila H, Saluja S (1997) Discovering all most specific sentences by randomized algorithm. In: 6th international conference on database theory
Hamrouni T, Yahia S Ben, Mephu Nguifo E (2009) Sweeping the disjunctive search space towards mining new exact concise representations of frequent itemsets. Data & Knowl Eng 68(10):1091–1111
Hasan MA, Zaki MJ (2009) Musk: uniform sampling of k maximal patterns. In: 9th SIAM international conference on data mining
Hasan MA, Zaki MJ (2009) Output space sampling for graph patterns. Proc VLDB Endow 2(1):730–741
Holte RC, Acker LE, Porter BW (1989) Concept learning and the problem of small disjuncts. In: Proceedings of the eleventh international joint conference on artificial intelligence
Jaroszewicz S, Simovici DA (2002) Support approximations using bonferroni-type inequalities. In: 6th European conference on principles of data mining and knowledge discovery
Kryszkiewicz M (2001) Concise representation of frequent patterns based on disjunction-free generators. In: IEEE International conference on data mining
Kryszkiewicz M (2005) Generalized disjunction-free representation of frequent patterns with negation. J Exp Theor Artif Intell 17(1/2):63–82
Li G, Zaki MJ (2012) Sampling minimal frequent boolean (DNF) patterns. In: 18th ACM SIGKDD international conference on knowledge discovery and data mining
Loekito E, Bailey J (2006) Fast mining of high dimensional expressive contrast patterns using zero-suppressed binary decision diagrams. In: ACM SIGKDD international conference on knowledge discovery and data mining
Mannila H, Toivonen H (1996) Multiple uses of frequent sets and condensed representations. In: International conference on knowledge discovery and data mining
Mitchell T (1982) Generalization as search. Artif Intell 18:203–226
Nanavati A, Chitrapura K, Joshi S, Krishnapuram R (2001) Association rule mining: Mining generalised disjunctive association rules. In: ACM international conference on information and knowledge management
Ramakrishnan N, Kumar D, Mishra B, Potts M, Helm R (Aug. 2004) Turning CARTwheels: an alternating algorithm for mining redescriptions. In: ACM SIGKDD international conference on knowledge discovery and data mining
Rubinstein RY, Kroese DK (2008) Simulation and the Monte Carlo method, 2nd edn. Wiley, New York
Savasere A, Omiecinski E, Navathe S (1998) Mining for strong negative associations in a large database of customer transactions. In: IEEE International conference on data engineeging
Shima Y, Mitsuishi S, Hirata K, Harao M (2004) Extracting minimal and closed monotone dnf formulas. In: International conference on discovery science
Stumme G, Taouil R, Bastide Y, Pasquier N, Lakhal L (2002) Computing iceberg concept lattices with titanic. Data Knowl Eng 42(2):189–222
Veloso A, Meira W, Zaki MJ (2006) Lazy associative classification. In: IEEE International conference on data mining
Vimieiro R, Moscato P (2012) Mining disjunctive minimal generators with titanicor. Expert Syst Appl 39(9):8228–8238
Vimieiro R, Moscato P (2014) Disclosed: an efficient depth-first, top-down algorithm for mining disjunctive closed itemsets in high-dimensional data. Inform Sci 280:171–187
Vreeken J, Van Leeuwen M, Siebes A (2011) Krimp: mining itemsets that compress. Data Min Knowl Discov 23(1):169–214
Wu X, Zhang C, Zhang S (2004) Efficient mining of both positive and negative association rules. ACM Trans Inform Syst 22(3):381–405
Yuan X, Buckles BP, Yuan Z, Zhang J (2002) Mining negative association rules. In: 7th international symposium on computers and communications
Zaki M, Ramakrishnan N (2005) Reasoning about sets using redescription mining. In: ACM SIGKDD international conference on knowledge discovery and data mining
Zaki MJ (Aug. 2000) Generating non-redundant association rules. In: 6th ACM SIGKDD international conference on knowledge discovery and data mining
Zaki MJ, Hsiao C-J (2005) Efficient algorithms for mining closed itemsets and their lattice structure. IEEE Trans Knowl Data Eng 17(4):462–478
Zaki MJ, Ramakrishnan N, Zhao L (2010) Mining frequent boolean expressions: application to gene expression and regulatory modeling. Int J Knowl Discov Bioinform 1(3):68–96 Special issue on mining complex structures in biology
Zhao L, Zaki MJ, Ramakrishnan N (2006) Blosom: a framework for mining arbitrary boolean expressions. In: 12th ACM SIGKDD international conference on knowledge discovery and data mining
Acknowledgments
This work was supported in part by NSF Awards CCF-1240646 and IIS-1302231.
Author information
Authors and Affiliations
Corresponding author
Additional information
Responsible editor: Bart Goethals.
Rights and permissions
About this article
Cite this article
Li, G., Zaki, M.J. Sampling frequent and minimal boolean patterns: theory and application in classification. Data Min Knowl Disc 30, 181–225 (2016). https://doi.org/10.1007/s10618-015-0409-y
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10618-015-0409-y