Efficient algorithms for finding optimal binary features in numeric and nominal labeled data

  • Regular Paper
  • Published in Knowledge and Information Systems

Abstract

An important subproblem in supervised tasks such as decision tree induction and subgroup discovery is finding an interesting binary feature (such as a node split or a subgroup refinement) based on a numeric or nominal attribute, with respect to some discrete or continuous target variable. Often one faces a trade-off between the expressiveness of such features and the ability to efficiently traverse the feature search space. In this article, we present efficient algorithms to mine binary features that optimize a given convex quality measure. For numeric attributes, we propose an algorithm that finds an optimal interval, whereas for nominal attributes, we give an algorithm that finds an optimal value set. By restricting the search to features that lie on a convex hull in a coverage space, we can significantly reduce computation time. We present general theoretical results on the cardinality of convex hulls in coverage spaces of arbitrary dimensions and perform a complexity analysis of our algorithms. In the important case of a binary target, we show that these algorithms have runtime linear in the number of examples. We further provide algorithms for additive quality measures, which have linear runtime regardless of the target type. Additive measures are particularly relevant to feature discovery in subgroup discovery. Experiments show that our algorithms perform well and that their additional expressive power leads to higher-quality results.
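
To make the coverage-space idea concrete, the following is a minimal Python sketch (an illustration under stated assumptions, not the paper's implementation) of the well-known special case of a nominal attribute with a binary target: every value set covers some number of positives \(p\) and negatives \(n\), i.e., a point in coverage space, and a convex quality measure attains its maximum at a vertex of the convex hull of these points. Sorting the attribute values by their fraction of positives and scanning prefixes visits exactly those vertices, so only \(k\) of the \(2^k\) value sets need to be evaluated. The function names and the WRAcc measure used as an example are illustrative choices.

    # Minimal sketch, not the authors' implementation: optimal value set
    # for a nominal attribute with a binary target under a convex quality
    # measure. Only the k prefixes of the sorted value list (the hull
    # vertices in coverage space) are evaluated, instead of all 2^k sets.
    from collections import Counter

    def optimal_value_set(values, labels, quality):
        P = sum(1 for y in labels if y)   # total positives
        N = len(labels) - P               # total negatives
        pos = Counter(v for v, y in zip(values, labels) if y)
        neg = Counter(v for v, y in zip(values, labels) if not y)
        # Sort attribute values by decreasing fraction of positives.
        order = sorted(set(values),
                       key=lambda v: pos[v] / (pos[v] + neg[v]),
                       reverse=True)
        best_set, best_q, p, n = None, float("-inf"), 0, 0
        for i, v in enumerate(order):     # scan the hull prefixes
            p, n = p + pos[v], n + neg[v]
            q = quality(p, n, P, N)
            if q > best_q:
                best_set, best_q = set(order[:i + 1]), q
        return best_set, best_q

    # Example measure: weighted relative accuracy (linear, hence convex).
    def wracc(p, n, P, N):
        return (p + n) / (P + N) * (p / (p + n) - P / (P + N))

Counting coverage is linear in the number of examples; after that, only the \(k\) distinct attribute values are sorted and scanned, which is where the efficiency of restricting attention to hull vertices comes from.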



Notes

  1. Note that it might not be possible to construct a Farey set for every \(R\); here we are only interested in the asymptotics (see the sketch after these notes).

  2. The Cortana tool can be downloaded from http://datamining.liacs.nl/cortana.html.
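
As a self-contained illustration of the asymptotics in footnote 1 (not code from the paper): the number of Farey fractions of order \(R\), i.e., reduced fractions \(p/q\) with \(0 \le p \le q \le R\), grows as \(3R^2/\pi^2\), a standard fact about coprime pairs.

    # Illustration only: count the Farey fractions of order R and compare
    # with the classical asymptotic 3*R^2/pi^2.
    from math import gcd, pi

    def farey_count(R):
        # |F_R| = 1 + sum_{q=1..R} phi(q), counted here directly.
        return 1 + sum(1 for q in range(1, R + 1)
                       for p in range(1, q + 1) if gcd(p, q) == 1)

    for R in (10, 100, 1000):
        print(R, farey_count(R), round(3 * R * R / pi ** 2))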


Acknowledgments

This research is partially supported by the Netherlands Organization for Scientific Research (NWO) under Project Nr. 612.065.822 (Exceptional Model Mining), and by a Postdoc grant from the Research Foundation—Flanders (FWO).

Author information

Corresponding author

Correspondence to Michael Mampaey.

About this article

Cite this article

Mampaey, M., Nijssen, S., Feelders, A. et al. Efficient algorithms for finding optimal binary features in numeric and nominal labeled data. Knowl Inf Syst 42, 465–492 (2015). https://doi.org/10.1007/s10115-013-0714-y
