Abstract
An introduction to the approaches used to discretise continuous database features is given, together with a discussion of the potential benefits of such techniques. These benefits are investigated by applying discretisation algorithms to two large commercial databases; the resulting discretisations are then evaluated using a simulated annealing based data mining algorithm. The results suggest that dramatic reductions in problem size may be achieved, yielding improvements in the speed of the data mining algorithm. However, it is also demonstrated that, under certain circumstances, the discretisation produced may increase problem size or allow the data mining algorithm to overfit. Such cases, in which often only a small proportion of the database belongs to the class of interest, highlight the need both for caution when producing discretisations and for the development of more robust discretisation algorithms.
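The abstract does not specify which discretisation algorithms were applied; as a point of reference, the sketch below shows unsupervised equal-width binning, one of the simplest and most common baselines for converting a continuous feature into a small set of intervals. The function name, bin count, and sample data are illustrative assumptions, not details taken from the paper.

```python
def equal_width_bins(values, k):
    """Map each continuous value to one of k equal-width intervals (0..k-1)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0  # guard against a constant feature
    # Clamp so the maximum value falls in the last bin rather than bin k.
    return [min(int((v - lo) / width), k - 1) for v in values]

# Hypothetical continuous feature (e.g. customer age) reduced to 3 intervals.
ages = [23.0, 31.5, 38.2, 44.9, 51.0, 67.3]
print(equal_width_bins(ages, 3))  # → [0, 0, 1, 1, 1, 2]
```

Replacing a continuous attribute with a handful of interval labels like this is what shrinks the search space for the data mining algorithm, at the risk, noted in the abstract, of choosing interval boundaries that enlarge the problem or invite overfitting when the class of interest is rare.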
Cite this article
Debuse, J.C., Rayward-Smith, V.J. Discretisation of Continuous Commercial Database Features for a Simulated Annealing Data Mining Algorithm. Applied Intelligence 11, 285–295 (1999). https://doi.org/10.1023/A:1008339026836