
A non-parametric semi-supervised discretization method

  • Regular Paper
  • Published in: Knowledge and Information Systems

Abstract

Semi-supervised classification methods aim to exploit both labeled and unlabeled examples to train a predictive model. Most of these approaches make assumptions about the distribution of the classes. This article first proposes a new semi-supervised discretization method that adopts a very weakly informative prior on the data. The method discretizes the numerical domain of a continuous input variable while preserving the information relevant to class prediction. We then present an in-depth comparison of this semi-supervised method with the original supervised MODL approach, and demonstrate that the semi-supervised approach is asymptotically equivalent to the supervised approach augmented with a post-optimization of the interval bound locations.
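
The criterion studied in the paper is the Bayesian, parameter-free MODL model extended to unlabeled data; the sketch below is only a rough illustration of what supervised discretization of a continuous variable into class-informative intervals looks like in general. It uses a simple greedy information-gain splitter, not the MODL or semi-supervised criterion of the article, and the function names, the `max_intervals` cap, the `min_gain` threshold, and the toy data are all hypothetical choices made for the example.

```python
import numpy as np

def class_entropy(labels):
    """Shannon entropy of the class labels within an interval."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split(x, y):
    """Return (gain, threshold) for the single cut that most reduces
    the weighted class entropy, or (0.0, None) if no cut helps."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    n = len(x)
    base = class_entropy(y)
    best_gain, best_t = 0.0, None
    # Candidate cuts lie between consecutive distinct values.
    for i in range(1, n):
        if x[i] == x[i - 1]:
            continue
        w = i / n
        gain = base - (w * class_entropy(y[:i]) + (1 - w) * class_entropy(y[i:]))
        if gain > best_gain:
            best_gain, best_t = gain, (x[i - 1] + x[i]) / 2.0
    return best_gain, best_t

def discretize(x, y, max_intervals=4, min_gain=1e-3):
    """Greedy top-down supervised discretization: repeatedly split the
    interval whose best cut yields the largest information gain."""
    bounds = [(-np.inf, np.inf)]
    while len(bounds) < max_intervals:
        candidates = []
        for lo, hi in bounds:
            mask = (x > lo) & (x <= hi)
            if mask.sum() < 2:
                continue
            gain, t = best_split(x[mask], y[mask])
            if t is not None and gain > min_gain:
                candidates.append((gain, (lo, hi), t))
        if not candidates:
            break
        _, (lo, hi), t = max(candidates)
        bounds.remove((lo, hi))
        bounds += [(lo, t), (t, hi)]
    return sorted(bounds)

# Hypothetical toy data: a continuous feature whose low/high ranges
# favour different classes, with 10% label noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = (x > 6.5).astype(int) ^ (rng.uniform(size=200) < 0.1)
print(discretize(x, np.asarray(y, dtype=int)))
```

In this toy setting the splitter recovers a cut near 6.5; the paper's contribution lies in how the number and location of such cuts are chosen without user parameters, using both labeled and unlabeled points.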

References

  1. Berger J (2006) The case for objective Bayesian analysis. Bayesian Anal 1(3): 385–402

  2. Blum A, Mitchell T (1998) Combining labeled and unlabeled data with co-training. In: COLT ’98: Proceedings of the eleventh annual conference on Computational learning theory. ACM Press, New York, pp 92–100

  3. Boullé M (2005) A Bayes optimal approach for partitioning the values of categorical attributes. J Mach Learn Res 6: 1431–1452

  4. Boullé M (2006) MODL: a Bayes optimal discretization method for continuous attributes. Mach Learn 65(1): 131–165

  5. Catlett J (1991) On changing continuous attributes into ordered discrete attributes. In: EWSL-91: Proceedings of the European working session on learning on machine learning. Springer, New York, pp 164–178

  6. Chapelle O, Schölkopf B, Zien A (2007) Semi-supervised learning. MIT Press, Cambridge

  7. Dougherty J, Kohavi R, Sahami M (1995) Supervised and unsupervised discretization of continuous features. In: International conference on machine learning, pp 194–202

  8. Fawcett T (2003) ROC graphs: notes and practical considerations for data mining researchers. Technical Report HPL-2003-4, HP Labs. http://citeseer.ist.psu.edu/fawcett03roc.html

  9. Fayyad U, Irani K (1992) On the handling of continuous-valued attributes in decision tree generation. Mach Learn 8: 87–102

  10. Fayyad U, Piatetsky-Shapiro G, Smyth P (1996) From data mining to knowledge discovery: an overview. Adv Knowl Discov Data Min 1–34

  11. Fujino A, Ueda N, Saito K (2007) A hybrid generative/discriminative approach to text classification with additional information. Inf Process Manage 43: 379–392

  12. Holte R (1993) Very simple classification rules perform well on most commonly used datasets. Mach Learn 11: 63–91

  13. Jin R, Breitbart Y, Muoh C (2009) Data discretization unification. Knowl Inf Syst 19: 1–29

  14. Kohavi R, Sahami M (1996) Error-based and entropy-based discretization of continuous features. In: Proceedings of the second international conference on knowledge discovery and data mining, pp 114–119

  15. Langley P, Iba W, Thomas K (1992) An analysis of Bayesian classifiers. In: Proceedings of the tenth national conference on artificial intelligence. AAAI Press, pp 223–228

  16. Liu H, Hussain F, Tan C, Dash M (2002) Discretization: an enabling technique. Data Min Knowl Discov 6(4): 393–423

  17. Maeireizo B, Litman D, Hwa R (2004) Analyzing the effectiveness and applicability of co-training. In: ACL ’04: the companion proceedings of the 42nd annual meeting of the association for computational linguistics

  18. Newman DJ, Hettich S, Blake CL, Merz CJ (1998) UCI repository of machine learning databases. Department of Information and Computer Sciences, University of California, Irvine. http://www.ics.uci.edu/~mlearn/MLRepository.html

  19. Pyle D (1999) Data preparation for data mining. Morgan Kaufmann, San Francisco, p 19

  20. Rissanen J (1978) Modeling by shortest data description. Automatica 14: 465–471

  21. Rosenberg C, Hebert M, Schneiderman H (2005) Semi-supervised self-training of object detection models. In: Seventh IEEE workshop on applications of computer vision

  22. Settles B (2009) Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison

  23. Shannon C (1948) A mathematical theory of communication. Key papers in the development of information theory

  24. Sugiyama M, Krauledat M, Müller K (2007) Covariate shift adaptation by importance weighted cross validation. J Mach Learn Res 8: 985–1005

  25. Sugiyama M, Müller K (2005) Model selection under covariate shift. In: ICANN, international conference on artificial neural networks: formal models and their applications

  26. Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Yu PS, Zhou Z, Steinbach M, Hand DJ, Steinberg D (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1)

  27. Zhou ZH, Li M (2009) Semi-supervised learning by disagreement. Knowl Inf Syst doi:10.1007/s10115-009-0209-z

  28. Zighed D, Rakotomalala R (2000) Graphes d’induction. Hermes, France

Author information

Corresponding author

Correspondence to Alexis Bondu.

Cite this article

Bondu, A., Boullé, M. & Lemaire, V. A non-parametric semi-supervised discretization method. Knowl Inf Syst 24, 35–57 (2010). https://doi.org/10.1007/s10115-009-0230-2
