Efficient algorithms for finding optimal binary features in numeric and nominal labeled data

  • Regular Paper
  • Published in Knowledge and Information Systems

Abstract

An important subproblem in supervised tasks such as decision tree induction and subgroup discovery is finding an interesting binary feature (such as a node split or a subgroup refinement) based on a numeric or nominal attribute, with respect to some discrete or continuous target variable. Often one faces a trade-off between the expressiveness of such features and the ability to efficiently traverse the feature search space. In this article, we present efficient algorithms to mine binary features that optimize a given convex quality measure. For numeric attributes, we propose an algorithm that finds an optimal interval, whereas for nominal attributes, we give an algorithm that finds an optimal value set. By restricting the search to features that lie on a convex hull in a coverage space, we can significantly reduce computation time. We present general theoretical results on the cardinality of convex hulls in coverage spaces of arbitrary dimensions and perform a complexity analysis of our algorithms. In the important case of a binary target, we show that these algorithms have runtime linear in the number of examples. We further provide algorithms for additive quality measures, which have linear runtime regardless of the target type. Additive measures are particularly relevant to feature discovery in subgroup discovery. Experiments show that our algorithms perform well and that their additional expressive power leads to higher-quality results.
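
To make the coverage-space idea concrete, the following is a minimal Python sketch (an illustration under stated assumptions, not the paper's implementation) of the well-known special case of a nominal attribute with a binary target: every value set covers some number of positives \(p\) and negatives \(n\), i.e., a point in coverage space, and a convex quality measure attains its maximum at a vertex of the convex hull of these points. Sorting the attribute values by their fraction of positives and scanning prefixes visits exactly those vertices, so only \(k\) of the \(2^k\) value sets need to be evaluated. The function names and the WRAcc measure used as an example are illustrative choices.

    # Minimal sketch, not the authors' implementation: optimal value set
    # for a nominal attribute with a binary target under a convex quality
    # measure. Only the k prefixes of the sorted value list (the hull
    # vertices in coverage space) are evaluated, instead of all 2^k sets.
    from collections import Counter

    def optimal_value_set(values, labels, quality):
        P = sum(1 for y in labels if y)   # total positives
        N = len(labels) - P               # total negatives
        pos = Counter(v for v, y in zip(values, labels) if y)
        neg = Counter(v for v, y in zip(values, labels) if not y)
        # Sort attribute values by decreasing fraction of positives.
        order = sorted(set(values),
                       key=lambda v: pos[v] / (pos[v] + neg[v]),
                       reverse=True)
        best_set, best_q, p, n = None, float("-inf"), 0, 0
        for i, v in enumerate(order):     # scan the hull prefixes
            p, n = p + pos[v], n + neg[v]
            q = quality(p, n, P, N)
            if q > best_q:
                best_set, best_q = set(order[:i + 1]), q
        return best_set, best_q

    # Example measure: weighted relative accuracy (linear, hence convex).
    def wracc(p, n, P, N):
        return (p + n) / (P + N) * (p / (p + n) - P / (P + N))

Counting coverage is linear in the number of examples; after that, only the \(k\) distinct attribute values are sorted and scanned, which is where the efficiency of restricting attention to hull vertices comes from.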



Notes

  1. Note that it might not be possible to construct a Farey set for every \(R\); here we are only interested in the asymptotics (see the sketch after these notes).

  2. The Cortana tool can be downloaded from http://datamining.liacs.nl/cortana.html.
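
As a self-contained illustration of the asymptotics in footnote 1 (not code from the paper): the number of Farey fractions of order \(R\), i.e., reduced fractions \(p/q\) with \(0 \le p \le q \le R\), grows as \(3R^2/\pi^2\), a standard fact about coprime pairs.

    # Illustration only: count the Farey fractions of order R and compare
    # with the classical asymptotic 3*R^2/pi^2.
    from math import gcd, pi

    def farey_count(R):
        # |F_R| = 1 + sum_{q=1..R} phi(q), counted here directly.
        return 1 + sum(1 for q in range(1, R + 1)
                       for p in range(1, q + 1) if gcd(p, q) == 1)

    for R in (10, 100, 1000):
        print(R, farey_count(R), round(3 * R * R / pi ** 2))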


Acknowledgments

This research is partially supported by the Netherlands Organization for Scientific Research (NWO) under Project Nr. 612.065.822 (Exceptional Model Mining), and by a Postdoc grant from the Research Foundation—Flanders (FWO).

Author information

Corresponding author

Correspondence to Michael Mampaey.

About this article

Cite this article

Mampaey, M., Nijssen, S., Feelders, A. et al. Efficient algorithms for finding optimal binary features in numeric and nominal labeled data. Knowl Inf Syst 42, 465–492 (2015). https://doi.org/10.1007/s10115-013-0714-y
