
Application-independent feature construction based on almost-closedness properties

  • Regular Paper
  • Published in: Knowledge and Information Systems

Abstract

Feature construction has been studied extensively, including for 0/1 data samples. Given the recent breakthroughs in closedness-related constraint-based mining, we consider its impact on feature construction for classification tasks. We investigate the use of condensed representations of frequent itemsets based on closedness properties as new features. These itemset types were originally proposed to avoid redundant set counting in difficult association rule mining tasks, i.e., when data are noisy and/or highly correlated. We conjecture, however, that their intrinsic properties (maximality for the closed itemsets, minimality for the δ-free itemsets) should affect feature quality. This question remains largely open, and we address it through an analysis of itemset properties on the one hand and an experimental validation on various (possibly noisy) data sets on the other hand.
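As a minimal illustration of the idea (a sketch, not the authors' actual algorithm), the following Python code enumerates closed frequent itemsets from a tiny 0/1 data set by brute force, using the standard definition that an itemset is closed when no strict superset has the same support, and then turns each closed itemset into a new Boolean feature. The toy data set, the `min_sup` threshold, and all function names are hypothetical choices for this example.

```python
from itertools import combinations

# Hypothetical toy 0/1 data set: each row is the set of items present
# in one sample (items are labeled 0..3).
data = [
    {0, 1, 2},
    {0, 1},
    {0, 1, 2},
    {1, 3},
]

def support(itemset, rows):
    """Number of rows that contain every item of `itemset`."""
    return sum(1 for r in rows if itemset <= r)

def closed_frequent_itemsets(rows, min_sup):
    """Naive enumeration of closed frequent itemsets.

    An itemset is closed iff no strict superset has the same support,
    i.e., it is the maximal element of its support equivalence class.
    """
    items = sorted(set().union(*rows))
    frequent = []
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            s = frozenset(combo)
            if support(s, rows) >= min_sup:
                frequent.append(s)
    closed = []
    for s in frequent:
        sup = support(s, rows)
        # Keep s only if no frequent strict superset has equal support.
        if not any(s < t and support(t, rows) == sup for t in frequent):
            closed.append(s)
    return closed

def itemset_features(rows, itemsets):
    """New 0/1 features: one column per itemset, 1 iff the row contains it."""
    return [[int(s <= r) for s in itemsets] for r in rows]

closed = closed_frequent_itemsets(data, min_sup=2)
X = itemset_features(data, closed)
```

With this data and `min_sup=2`, the seven frequent itemsets collapse to three closed ones ({1}, {0, 1}, {0, 1, 2}), so each sample is re-described by three Boolean features instead of seven. The brute-force enumeration is exponential and only suitable for illustration; dedicated closed-itemset miners would be used in practice.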



Author information


Correspondence to Dominique Gay.

Additional information

Dominique Gay was with PPME EA3325, ERIM EA3791, University of New-Caledonia when this work began.


About this article

Cite this article

Gay, D., Selmaoui-Folcher, N. & Boulicaut, JF. Application-independent feature construction based on almost-closedness properties. Knowl Inf Syst 30, 87–111 (2012). https://doi.org/10.1007/s10115-010-0369-x


