Abstract
This paper addresses problems that arise when categorical-valued data are used in rule induction. Using categorical values naively risks reducing the chance of finding a good rule, in terms of both confidence (accuracy) and support or coverage. We introduce a technique, called the arcsin transformation, in which categorical values are replaced with numeric values. Our results show that the technique is highly effective on relatively large databases containing many unordered categorical attributes, on larger databases that combine unordered categorical and numeric data, and especially on small databases containing rare cases.
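The abstract does not spell out how the transformation is computed. As a rough illustration only, and not necessarily the authors' procedure, the sketch below assumes the classic angular (arcsine square-root) transform: each category is replaced by arcsin(sqrt(p)), where p is the observed proportion of a binary target within that category. The function name `arcsin_encode`, the choice of p, and the toy data are all hypothetical.

```python
import math
from collections import defaultdict


def arcsin_encode(categories, targets):
    """Map each category to arcsin(sqrt(p)).

    Assumption for illustration: p is the share of positive (truthy)
    targets observed among the records carrying that category.
    """
    counts = defaultdict(lambda: [0, 0])  # category -> [positives, total]
    for cat, y in zip(categories, targets):
        counts[cat][0] += 1 if y else 0
        counts[cat][1] += 1
    return {
        cat: math.asin(math.sqrt(pos / total))
        for cat, (pos, total) in counts.items()
    }


# Toy data (hypothetical): an unordered categorical attribute and a binary class.
colours = ["red", "red", "blue", "blue", "blue", "green"]
bought = [1, 0, 1, 1, 1, 0]

encoding = arcsin_encode(colours, bought)
# The categorical column can now be treated as numeric by a rule inducer.
numeric_column = [encoding[c] for c in colours]
```

Under this assumption, categories with similar class proportions receive nearby numeric values, so a rule such as "value above some threshold" can cover several categories at once rather than testing each category separately.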
Copyright information
© 2003 Springer-Verlag Wien
About this paper
Cite this paper
Burgess, M., Janacek, G.J., Rayward-Smith, V.J. (2003). Handling categorical data in rule induction. In: Pearson, D.W., Steele, N.C., Albrecht, R.F. (eds) Artificial Neural Nets and Genetic Algorithms. Springer, Vienna. https://doi.org/10.1007/978-3-7091-0646-4_45
DOI: https://doi.org/10.1007/978-3-7091-0646-4_45
Publisher Name: Springer, Vienna
Print ISBN: 978-3-211-00743-3
Online ISBN: 978-3-7091-0646-4
eBook Packages: Springer Book Archive