Abstract
An important subproblem in supervised tasks such as decision tree induction and subgroup discovery is finding an interesting binary feature (such as a node split or a subgroup refinement) based on a numeric or nominal attribute, with respect to some discrete or continuous target variable. Often one faces a trade-off between the expressiveness of such features and the ability to traverse the feature search space efficiently. In this article, we present efficient algorithms to mine binary features that optimize a given convex quality measure. For numeric attributes, we propose an algorithm that finds an optimal interval, whereas for nominal attributes, we give an algorithm that finds an optimal value set. By restricting the search to features that lie on a convex hull in a coverage space, we can significantly reduce computation time. We present general theoretical results on the cardinality of convex hulls in coverage spaces of arbitrary dimension and analyze the complexity of our algorithms. In the important case of a binary target, we show that these algorithms run in time linear in the number of examples. We further provide algorithms for additive quality measures, which run in linear time regardless of the target type. Additive measures are particularly relevant to feature discovery in subgroup discovery. Experiments show that our algorithms perform well and that their additional expressive power leads to higher-quality results.
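The core hull-restriction idea from the abstract can be sketched in a few lines: since the maximum of a convex function over a finite point set is attained at a vertex of that set's convex hull, a convex quality measure over candidate features in a two-dimensional coverage space (covered examples n, covered positives p) need only be evaluated on hull vertices. The point set, the dataset totals N and P, and the quality measure (p − nP/N)² below are illustrative assumptions, not the paper's exact measures or data.

```python
def convex_hull(points):
    """Andrew's monotone chain; returns the hull vertices of a 2-D point set."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # z-component of (a - o) x (b - o); <= 0 means clockwise or collinear
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:                      # build lower hull left to right
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):            # build upper hull right to left
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]     # endpoints shared, drop duplicates

# Hypothetical coverage-space points: (n, p) = (covered examples, covered positives)
points = [(3, 1), (5, 4), (8, 2), (10, 7), (6, 6), (12, 5), (4, 0), (9, 9)]
N, P = 20, 10  # hypothetical dataset totals

def quality(np_pair):
    n, p = np_pair
    return (p - n * P / N) ** 2  # an illustrative convex quality measure

hull = convex_hull(points)
best_all = max(points, key=quality)
best_hull = max(hull, key=quality)
# Evaluating only hull vertices finds the same optimum as scanning all points
assert quality(best_all) == quality(best_hull)
```

Here the hull has 6 vertices instead of 8 candidates; on realistic data, the paper's theoretical results on hull cardinality are what make this pruning pay off.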





Notes
Note that it may not be possible to construct a Farey set for every \(R\); we are only interested in asymptotics here.
The Cortana tool can be downloaded from http://datamining.liacs.nl/cortana.html.
Acknowledgments
This research is partially supported by the Netherlands Organization for Scientific Research (NWO) under Project Nr. 612.065.822 (Exceptional Model Mining), and by a Postdoc grant from the Research Foundation—Flanders (FWO).
Cite this article
Mampaey, M., Nijssen, S., Feelders, A. et al. Efficient algorithms for finding optimal binary features in numeric and nominal labeled data. Knowl Inf Syst 42, 465–492 (2015). https://doi.org/10.1007/s10115-013-0714-y