Forming categories in exploratory data analysis and data mining

Scott, P. D.; Williams, R. J.; Ho, K. M.

doi:10.1007/BFb0052844

Conference paper
First Online: 01 January 2006

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1280)

Abstract

This paper describes the techniques used for categorizing variables in Snout, an intelligent assistant for exploratory data analysis of survey and similar data sets that is currently under development. We begin by reviewing existing work on category formation in data mining, which has mainly been concerned with enabling decision tree programs to handle numeric variables. We argue that there are other important but neglected aspects of category formation, notably the formation of new categorizations of nominal variables. We report the limited success achieved in categorizing variables from survey data using either endogenous methods or exogenous methods that maximise the association with only one dependent variable. We then describe the categorization technique used in Snout: a procedure that selects a partition that maximises both the number of variables associated with the partitioned variable and the strength of those associations. We report on the success achieved using this procedure in exploring real survey data.
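
The selection criterion in the last part of the abstract can be illustrated with a minimal sketch. This is not Snout's actual procedure, which the abstract does not specify. The sketch assumes that association strength is measured with Cramér's V, that a variable counts as associated when V reaches a fixed threshold of 0.3, that only two-group partitions of a nominal variable are considered, and that the two goals are combined by ranking first on the number of associated variables and then on their total strength. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def cramers_v(xs, ys):
    """Cramér's V between two equal-length sequences of nominal values."""
    n = len(xs)
    row, col = Counter(xs), Counter(ys)
    if len(row) < 2 or len(col) < 2:
        return 0.0  # a constant variable carries no association
    joint = Counter(zip(xs, ys))
    chi2 = 0.0
    for r, r_count in row.items():
        for c, c_count in col.items():
            expected = r_count * c_count / n
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected
    return sqrt(chi2 / (n * (min(len(row), len(col)) - 1)))


def two_block_partitions(values):
    """Yield each split of the distinct values into two non-empty groups."""
    values = sorted(values)
    first, rest = values[0], values[1:]
    # Keeping `first` in the left group avoids yielding mirror-image splits.
    for k in range(len(rest)):
        for extra in combinations(rest, k):
            left = frozenset((first,) + extra)
            yield left, frozenset(values) - left


def score_partition(column, partition, others, threshold=0.3):
    """Return (number of associated variables, sum of their strengths)."""
    left, _ = partition
    recoded = ["A" if v in left else "B" for v in column]
    strengths = [cramers_v(recoded, other) for other in others.values()]
    associated = [s for s in strengths if s >= threshold]  # assumed cut-off
    return len(associated), sum(associated)


def best_partition(column, others, threshold=0.3):
    """Pick the split that ranks highest on count, then on total strength."""
    return max(
        two_block_partitions(set(column)),
        key=lambda p: score_partition(column, p, others, threshold),
    )


# Toy survey data (invented for illustration only).
region = ["north", "north", "south", "south", "east", "east", "west", "west"]
others = {
    "income": ["hi", "hi", "hi", "hi", "lo", "lo", "lo", "lo"],
    "vote": ["x", "x", "x", "x", "y", "y", "y", "y"],
}
left, right = best_partition(region, others)
print(sorted(left), sorted(right))
```

On this toy data the chosen split groups north with south and east with west, because that recoding is perfectly associated with both other variables. Exhaustive enumeration grows exponentially with the number of distinct values, so a practical system would need a search heuristic; the sketch ignores that issue, as well as significance testing.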

Author information

Authors and Affiliations

Dept of Computer Science, University of Essex, CO4 3SQ, Colchester, UK
P. D. Scott, R. J. Williams & K. M. Ho

Editor information

Xiaohui Liu, Paul Cohen, Michael Berthold

About this paper

Cite this paper

Scott, P.D., Williams, R.J., Ho, K.M. (1997). Forming categories in exploratory data analysis and data mining. In: Liu, X., Cohen, P., Berthold, M. (eds) Advances in Intelligent Data Analysis Reasoning about Data. IDA 1997. Lecture Notes in Computer Science, vol 1280. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0052844

DOI: https://doi.org/10.1007/BFb0052844
Published: 19 May 2006
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-63346-4
Online ISBN: 978-3-540-69520-2
eBook Packages: Springer Book Archive
