Abstract
Feature construction has been studied extensively, including for 0/1 data samples. Given the recent breakthroughs in closedness-related constraint-based mining, we are considering its impact on feature construction for classification tasks. We investigate the use of condensed representations of frequent itemsets based on closedness properties as new features. These itemset types have been proposed to avoid set counting in difficult association rule mining tasks, i.e. when data are noisy and/or highly correlated. However, our guess is that their intrinsic properties (say the maximality for the closed itemsets and the minimality for the δ-free itemsets) should have an impact on feature quality. Understanding this remains fairly open, and we discuss these issues thanks to itemset properties on the one hand and an experimental validation on various data sets (possibly noisy) on the other hand.
Similar content being viewed by others
References
Agrawal R, Imielinski T, Swami AN (1993) Mining association rules between sets of items in large databases. In: Proceedings ACM SIGMOD’93, pp 207–216
Antonie M-L, Zaïane OR (2004) An associative classifier based on positive and negative rules. In: Proceedings of the 9th ACM SIGMOD workshop on research issues in data mining and knowledge discovery, DMKD’04. ACM Press, pp 64–69
Baralis E, Chiusano S (2004) Essential classification rule sets. ACM Trans Database Syst 29(4): 635–674
Bastide Y, Taouil R, Pasquier N, Stumme G, Lakhal L (2000) Mining frequent patterns with counting inference. SIGKDD Explor 2(2): 66–75
Besson J, Pensa RG, Robardet C, Boulicaut J-F (2006) Constraint-based mining of fault-tolerant patterns from boolean data. In: KDID’05 selected and invited revised papers, vol. 3933 of LNCS, Springer, pp 55–71
Boley M, Grosskreutz H (2009) Approximating the number of frequent sets in dense data. Knowl Inf Syst 21(1): 65–89
Bonchi F, Lucchese C (2006) On condensed representations of constrained frequent patterns. Knowl Inf Syst 9(2): 180–201
Boulicaut J-F, Bykowski A, Rigotti C (2000) Approximation of frequency queries by means of free-sets. In: Proceedings PKDD’00, vol. 1910 of LNCS, Springer, pp 75–85
Boulicaut J-F, Bykowski A, Rigotti C (2003) Free-sets: a condensed representation of boolean data for the approximation of frequency queries. Data Mining Knowl Discov 7(1): 5–22
Brin S, Motwani R, Silverstein C (1997) Beyond market baskets: generalizing association rules to correlations. In: SIGMOD’97. ACM Press, New york, pp 265–276
Bringmann B, Nijssen S, Zimmermann A (2009) Pattern based classification: a unifying perspective. In: LeGo’09 worskhop colocated with ECML/PKDD’09
Bringmann B, Zimmermann A (2009) One in a million: picking the right patterns. Knowl Inf Syst 18(1): 61–81
Brodley CE, Utgoff PE (1995) Multivariate decision trees. Mach Learn 19(1): 45–77
Calders T, Rigotti C, Boulicaut J-F (2005) A survey on condensed representations for frequent sets. In: Constraint-based mining and inductive databases, vol 3848 of LNCS. Springer, Berlin, pp 64–80
Cerf L, Gay D, Selmaoui N, Boulicaut J-F (2008) A parameter free associative classifier. In: Proceedings DaWaK’08, vol 5182 of LNCS. Springer, Berlin, pp 238–247
Chang C-C, Lin C-J (2001) LIBSVM: a library for support vector machines’. http://www.csie.ntu.edu.tw/~cjlin/libsvm/
Cheng H, Yan X, Han J, Hsu C-W (2007) Discriminative frequent pattern analysis for effective classification. In: Proceedings ICDE’07. IEEE Computer Society, Silver Spring, pp 716–725
Cheng H, Yu PS, Han J (2006) AC-close: efficiently mining approximate closed itemsets by core pattern recovery. In: ICDM’06. pp 839–844
Cheng J, Ke Y, Ng W (2006) δ-tolerance closed frequent itemsets. In: ICDM’06, pp 139–148
Crémilleux B, Boulicaut J-F (2002) Simplest rules characterizing classes generated by delta-free sets. In: Proceedings ES’02. Springer, Berlin, pp 33–46
Dong G, Li J (1999) Efficient mining of emerging patterns: discovering trends and differences. In: Proceedings KDD’99. ACM Press, New york, pp 43–52
Dong G, Zhang X, Wong L, Li J (1999) CAEP: classification by aggregating emerging patterns. In: Proceedings DS’99, vol 1721 of LNCS, Springer, Berlin, pp 30–42
El-Manzalawy Y (2005) WLSVM: integrating libsvm into weka environment. http://www.cs.iastate.edu/~yasser/wlsvm/
Fayyad UM, Irani KB (1993) Multi-interval discretization of continous-valued attributes for classification learning. In: Proceedings IJCAI’93. Morgan Kaufmann, Los Altos, pp 1022–1027
Fürnkranz J (2002) Round robin classification. J Mach Learn Res 2: 721–747
Ganter B, Stumme G, Wille R (eds) (2005) Formal concept analysis, foundations and applications, vol 3626 of lecture notes in computer science. Springer, Berlin
Garriga GC, Kralj P, Lavrac N (2006) Closed sets for labeled data. In: Proceedings PKDD’06. Springer, Berlin, pp 163–174
Garriga GC, Kralj P, Lavrac N (2008) Closed sets for labeled data. J Mach Learn Res 9: 559–580
Gay D, Selmaoui N, Boulicaut J.-F (2007) Pattern-based decision tree construction. In: Proceedings of IEEE ICDIM’07. IEEE Press, New York, pp 291–296
Gay D, Selmaoui N, Boulicaut J-F (2008) Feature construction based on closedness properties is not that simple. In: Proceedings PAKDD’08, vol 5012 of LNCS. Springer, Berlin, pp 112–123
Gay D, Selmaoui N, Boulicaut J-F (2009) Application-independent feature construction from noisy samples In: Proceedings PAKDD’09, vol 5476 of LNCS. Springer, Berlin, pp 965–972
Hébert C, Crémilleux B (2005) Mining delta-strong characterization rules in large SAGE data. In: PKDD’05 discovery challenge on gene expression data
Hébert C, Crémilleux B (2006) Optimized rule mining through a unified framework for interestingness measures. In: Proceedings DaWaK’06, vol 4081 of LNCS. Springer, Berlin, pp 238–247
John GH, Langley P (1995) Estimating continuous distributions in bayesian classifiers. In: Proceedings UAI’95. Morgan Kaufmann, Los Altos, pp 338–345
Kubica J, Moore AW (2003) Probabilistic noise identification and data cleaning. In: Proceedings ICDM’03. IEEE Computer Society, Silver Spring, pp 131–138
Li J, Dong G, Ramamohanarao K (2000) Instance-based classification by emerging patterns. In: Proceedings the 4th European conference on principles and practice of knowledge discovery in databases. Springer, Berlin, pp 191–200
Li J, Dong G, Ramamohanarao K (2001) ‘Making use of the most expressive jumping emerging patterns for classification. Knowl Inf Syst 3(2): 131–145
Li J, Liu G, Wong L (2007) Mining statistically important equivalence classes and delta-discriminative emerging patterns. In: Proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining KDD’07. ACM Press, New York
Li W, Han J, Pei J (2001) CMAR: accurate and efficient classification based on multiple class-association rules. In: Proceedings ICDM’01. IEEE Computer Society, New York, pp 369–376
Liu B, Hsu W, Ma Y (1998) Integrating classification and association rule mining. In: Proceedings KDD’98. AAAI Press, pp 80–86
Liu G, Li J, Wong L (2007) A new concise representation of frequent itemsets using generators and a positive border. Knowl Inf Syst
Miettinen P, Mielikäinen T, Gionis A, Das G, Mannila H (2008) The discrete basis problem. IEEE Trans Knowl Data Eng 20(10): 1348–1362
Park S-H, Fürnkranz J. (2007) Efficient pairwise classification. In: ECML’07, pp 658–665
Pensa RG, Robardet C, Boulicaut J-F (2006) Supporting bi-cluster interpretation in 0/1 data by means of local patterns. Intell Data Anal 10(5): 457–472
Quinlan JR (1993) C4.5: programs for machine learning. Morgan Kaufmann, Los Altos
Ramamohanarao K, Fan H (2007) Patterns based classifiers. World Wide Web 10(1): 71–83
Rebbapragada U, Brodley CE (2007) Class noise mitigation through instance weighting. In: Proceedings ECML’07, vol 4701 of LNCS. Springer, Berlin, pp 708–715
Selmaoui N, Leschi C, Gay D, Boulicaut J-F (2006) Feature construction and delta-free sets in 0/1 samples. In: Proceedings DS’06, vol 4265 of LNCS. Springer, Berlin, pp 363–367
Soulet A, Crémilleux B, Rioult F (2004) Condensed representation of emerging patterns. In: Proceedings of the 8th Pacific-Asia conference on knowledge discovery in databases, vol 3056 of LNCS, pp 127–132
Tan P-N, Steinbach M, Kumar V (2005) Introduction to data mining. Addison-Wesley, Reading
Utgoff PE, Brodley CE (1990) An incremental method for finding multivariate splits for decision trees. In: ICML’90, pp 58–65
Van Hulse J, Khoshgoftaar TM, Huang H (2007) The pairwise attribute noise detection algorithm. Knowl Inf Syst 11(2): 171–190
Wang J, Karypis G (2005) HARMONY: efficiently mining the best rules for classification. In: Proceedings SIAM SDM’05, pp 34–43
Wang J, Karypis G (2006) On mining instance-centric classification rules. IEEE Trans Knowl Data Eng 18(11): 1497–1511
Witten IH, Frank E (2005) Data mining: practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann, Los Altos
Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng AFM, Liu B, Yu PS, Zhou Z-H, Steinbach M, Hand DJ, Steinberg D (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1): 1–37
Yang C, Fayyad UM, Bradley PS (2001) Efficient discovery of error-tolerant frequent itemsets in high dimensions. In: Proceedings KDD’01. ACM Press, New York, pp 194–203
Yang Y, Wu X, Zhu X (2004) Dealing with predictive-but-unpredictable attributes in noisy data sources. In: Proceedings PKDD’04, vol 3202 of LNCS. Springer, Berlin, pp 471–483
Zhang S, Wu X, Zhang C, Lu J (2008) Computing the minimum-support for mining frequent patterns. Knowl Inf Syst 15(2): 233–257
Zhang Y, Wu X (2007) Noise modeling with associative corruption rules. In: Proceedings ICDM’07. IEEE Computer Society, New York, pp 733–738
Zheng Z (1995) Constructing nominal x-of-n attributes. In: IJCAI’95, pp 1064–1070
Zhu X, Wu X (2004) Class noise vs. attribute noise: a quantitative study. Artif Intell Revue 22(3): 177–210
Author information
Authors and Affiliations
Corresponding author
Additional information
Dominique Gay was with PPME EA3325, ERIM EA3791, University of New-Caledonia when this work began.
Rights and permissions
About this article
Cite this article
Gay, D., Selmaoui-Folcher, N. & Boulicaut, JF. Application-independent feature construction based on almost-closedness properties. Knowl Inf Syst 30, 87–111 (2012). https://doi.org/10.1007/s10115-010-0369-x
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10115-010-0369-x