Discretisation of Continuous Commercial Database Features for a Simulated Annealing Data Mining Algorithm


Abstract

An introduction to the approaches used to discretise continuous database features is given, together with a discussion of the potential benefits of such techniques. These benefits are investigated by applying discretisation algorithms to two large commercial databases; the resulting discretisations are then evaluated using a simulated annealing based data mining algorithm. The results suggest that dramatic reductions in problem size may be achieved, yielding improvements in the speed of the data mining algorithm. However, it is also demonstrated that, under certain circumstances, the discretisation produced may increase the problem size or allow overfitting by the data mining algorithm. Such cases, in which often only a small proportion of the database belongs to the class of interest, highlight the need both for caution when producing discretisations and for the development of more robust discretisation algorithms.
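
To make the idea concrete, the sketch below shows one simple, unsupervised form of discretisation: replacing a continuous feature with a small number of equal-frequency intervals, which is what reduces the number of distinct values the data mining algorithm must search over. This is an illustrative assumption only; the function names and the equal-frequency scheme are not taken from the paper, whose own supervised and unsupervised discretisation algorithms are not reproduced here.

```python
# Minimal sketch of unsupervised equal-frequency discretisation (illustrative only;
# not the algorithms evaluated in the paper).
from bisect import bisect_right

def equal_frequency_cut_points(values, k):
    """Return k-1 cut points splitting `values` into k roughly equal-sized bins."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // k] for i in range(1, k)]

def discretise(value, cut_points):
    """Map a continuous value to the index of the interval it falls into."""
    return bisect_right(cut_points, value)

# Example: a continuous feature with many distinct values is reduced to 4 intervals.
feature = [0.5, 1.2, 1.9, 2.3, 3.3, 4.8, 5.1, 7.7, 8.2, 9.9]
cuts = equal_frequency_cut_points(feature, 4)      # three cut points
codes = [discretise(v, cuts) for v in feature]     # small integer bin labels
print(cuts, codes)
```

Supervised discretisation methods (e.g. the entropy-based approach of reference 19) instead choose cut points using the class labels; as the abstract notes, in some cases the discretisation produced can increase problem size or permit overfitting.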

References

  1. J.C.W. Debuse, “Exploitation of modern heuristic techniques within a commercial data mining environment,” Ph.D. Thesis, University of East Anglia, 1997.

  2. J.C.W. Debuse and V.J. Rayward-Smith, “Feature subset selection within a simulated annealing data mining algorithm,” Journal of Intelligent Information Systems, vol. 9, pp. 57-81, 1997.

  3. J. Dougherty, R. Kohavi, and M. Sahami, “Supervised and unsupervised discretization of continuous features,” in Prieditis and Russell [30], pp. 194-202.

  4. J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.

  5. J.R. Quinlan, “Successor to C4.5,” Knowledge Discovery Nuggets, vol. 97:09, 1997.

  6. R. Kerber, “ChiMerge: Discretization of numeric attributes,” in Proc. of the Tenth National Conf. on Artificial Intelligence, MIT Press, 1992, pp. 123-128.

  7. R. Kohavi and M. Sahami, “Error-based and entropy-based discretization of continuous features,” in Simoudis et al. [31], pp. 114-119.

  8. J. Catlett, “Megainduction: machine learning on very large databases,” Ph.D. Thesis, University of Sydney, 1991.

  9. A.K.C. Wong and D.K.Y. Chiu, “Synthesizing statistical knowledge from incomplete mixed-mode data,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. PAMI-9, no. 6, pp. 796-805, 1987.

  10. R.S. Garfinkel and G.L. Nemhauser, Integer Programming, Wiley: New York, 1972.

  11. J.C.W. Debuse and V.J. Rayward-Smith, “One and a half dimensional clustering,” in Proc. of the Conf. on Applied Decision Technologies, UNICOM, Brunel, 1995, pp. 377-389.

  12. W.D. Fisher, “On grouping for maximum homogeneity,” J. Am. Stat. Assoc., vol. 53, pp. 789-798, 1958.

  13. J.A. Hartigan, Clustering Algorithms, Wiley: New York, 1975.

  14. J.W. Carmichael, J.A. George, and R.S. Julius, “Finding natural clusters,” Syst. Zool., vol. 17, pp. 144-150, 1968.

  15. J.W. Carmichael and P.H. Sneath, “Taxonometric maps,” Syst. Zool., vol. 18, pp. 402-415, 1969.

  16. B. Everitt, Cluster Analysis, Wiley: New York, 1980.

  17. R.C. Holte, “Very simple classification rules perform well on most commonly used datasets,” Machine Learning, vol. 11, pp. 63-91, 1993.

  18. J. Catlett, “On changing continuous attributes into ordered discrete attributes,” in Proc. of the European Working Session on Learning, edited by Y. Kodratoff, Springer-Verlag, 1991, pp. 164-178.

  19. U.M. Fayyad and K.B. Irani, “Multi-interval discretization of continuous-valued attributes for classification learning,” in Proc. of the Thirteenth Int. Joint Conf. on Artificial Intelligence, Morgan Kaufmann, 1993, pp. 1022-1027.

  20. J. Rissanen, “A universal prior for integers and estimation by minimum description length,” Annals of Statistics, vol. 11, pp. 416-431, 1983.

  21. J.R. Quinlan, “Improved use of continuous attributes in C4.5,” Journal of Artificial Intelligence Research, vol. 4, pp. 77-90, 1996.

  22. J.R. Quinlan and R.L. Rivest, “Inferring decision trees using the minimum description length principle,” Information and Computation, vol. 80, pp. 227-248, 1989.

  23. W. Maass, “Efficient agnostic PAC-learning with simple hypotheses,” in Proc. of the Seventh Annual ACM Conf. on Computational Learning Theory, 1994, pp. 67-75.

  24. P. Auer, R. Holte, and W. Maass, “Theory and applications of agnostic PAC-learning with small decision trees,” in Prieditis and Russell [30], pp. 21-29.

  25. B. de la Iglesia, J.C.W. Debuse, and V.J. Rayward-Smith, “Discovering knowledge in commercial databases using modern heuristic techniques,” in Simoudis et al. [31], pp. 44-49.

  26. N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, and E. Teller, “Equation of state calculations by fast computing machines,” Journal of Chemical Physics, vol. 21, pp. 1087-1091, 1953.

  27. K.A. Dowsland, “Simulated annealing,” in Modern Heuristic Techniques for Combinatorial Problems, edited by C.R. Reeves, Blackwell Scientific, pp. 20-69, 1993.

  28. J.T. Alander, “An indexed bibliography of genetic algorithms and simulated annealing: Hybrids and comparisons,” Tech. Rep., Department of Information Technology and Production Economics, University of Vaasa, Finland, 1995.

  29. M. Lundy and A. Mees, “Convergence of an annealing algorithm,” Mathematical Programming, vol. 34, pp. 111-124, 1986.

  30. A. Prieditis and S. Russell (Eds.), Proc. of the Twelfth Int. Conf. on Machine Learning, Morgan Kaufmann, 1995.

  31. E. Simoudis, J.W. Han, and U. Fayyad (Eds.), Proc. of the Second Int. Conf. on Knowledge Discovery and Data Mining (KDD-96), 1996.

Cite this article

Debuse, J.C., Rayward-Smith, V.J. Discretisation of Continuous Commercial Database Features for a Simulated Annealing Data Mining Algorithm. Applied Intelligence 11, 285–295 (1999). https://doi.org/10.1023/A:1008339026836
