
Optimum simultaneous discretization with data grid models in supervised classification: a Bayesian model selection approach

  • Regular Article
  • Published in Advances in Data Analysis and Classification

Abstract

In the domain of data preparation for supervised classification, filter methods for variable ranking are time efficient. However, their intrinsic univariate limitation prevents them from detecting redundancies or constructive interactions between variables. This paper introduces a new method to automatically, rapidly and reliably extract the classificatory information of a pair of input variables. It is based on a simultaneous partitioning of the domains of the two input variables, into intervals in the numerical case and into groups of categories in the categorical case. The resulting input data grid makes it possible to quantify the joint information between the two input variables and the output variable. The best joint partitioning is found by maximizing a Bayesian model selection criterion. Intensive experiments demonstrate the benefits of the approach, especially the significant improvement of accuracy for classification tasks.
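The core idea above can be illustrated with a small sketch. This is not the paper's method: the actual approach searches over all joint partitions and scores them with a Bayesian model selection criterion, whereas this example fixes an equal-width grid and simply quantifies the joint information between the grid cells and the class, using mutual information as a stand-in measure. The `grid_mutual_information` function and its parameters are illustrative names, not from the paper.

```python
# Illustrative sketch (not the paper's Bayesian criterion): quantify how much
# information a fixed 2D grid over two numeric inputs carries about the class.
from collections import Counter
import math

def grid_mutual_information(x1, x2, y, bins=3):
    """Discretize x1 and x2 into equal-width bins, then compute the
    mutual information I(cell; class) over the resulting data grid, in bits."""
    def bin_index(values, v):
        lo, hi = min(values), max(values)
        if hi == lo:
            return 0
        k = int((v - lo) / (hi - lo) * bins)
        return min(k, bins - 1)  # place the maximum value in the last bin

    n = len(y)
    cells = [(bin_index(x1, a), bin_index(x2, b)) for a, b in zip(x1, x2)]
    cell_counts = Counter(cells)
    class_counts = Counter(y)
    joint_counts = Counter(zip(cells, y))
    mi = 0.0
    for (cell, cls), cnt in joint_counts.items():
        # p(cell, cls) * log2( p(cell, cls) / (p(cell) * p(cls)) )
        mi += (cnt / n) * math.log2(cnt * n / (cell_counts[cell] * class_counts[cls]))
    return mi

# XOR-like pattern: each variable alone is uninformative about the class,
# but jointly they determine it -- exactly the kind of constructive
# interaction that a univariate filter misses and a bivariate grid detects.
x1 = [0, 0, 1, 1] * 25
x2 = [0, 1, 0, 1] * 25
y = [a ^ b for a, b in zip(x1, x2)]
print(grid_mutual_information(x1, x2, y, bins=2))  # prints 1.0 (one full bit)
```

With a 2x2 grid on the XOR data, each cell maps to a single class, so the grid captures one full bit of joint information, while either variable in isolation carries none. The paper's contribution is precisely to find such a grid automatically and to penalize overly fine grids through a Bayesian prior, rather than fixing the bin count in advance.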



Author information


Corresponding author

Correspondence to Marc Boullé.


About this article

Cite this article

Boullé, M. Optimum simultaneous discretization with data grid models in supervised classification: a Bayesian model selection approach. Adv Data Anal Classif 3, 39–61 (2009). https://doi.org/10.1007/s11634-009-0038-7

