Forming categories in exploratory data analysis and data mining

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1280)

Abstract

This paper describes the techniques used for categorizing variables in Snout, an intelligent assistant for exploratory data analysis of survey and similar data sets that is currently under development. We begin by reviewing existing work on category formation in data mining, which has been mainly concerned with enabling decision tree programs to handle numeric variables. We argue that there are other important but neglected aspects of category formation, notably the formation of new categorizations of nominal variables. We report the limited success achieved in categorizing variables from survey data using either endogenous methods or exogenous methods that maximise the association with only one dependent variable. We then describe the categorization technique used in Snout: a procedure that selects a partition that both maximises the number of variables associated with the partitioned variable and maximises the strength of those associations. Finally, we report on the success achieved using this procedure in exploring real survey data.
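The selection criterion described in the abstract can be illustrated with a small sketch. This is not the procedure implemented in Snout, whose details are not given here; it is a minimal illustration under stated assumptions: association strength is measured with Cramér's V, a variable counts as "associated" when V reaches a hypothetical `threshold`, only two-group partitions of a nominal variable are searched (exhaustively), and the two goals are combined lexicographically (count first, mean strength as tie-break). All function and parameter names are made up for this example.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def cramers_v(xs, ys):
    """Cramér's V between two equal-length sequences of category labels."""
    n = len(xs)
    x_counts, y_counts = Counter(xs), Counter(ys)
    if len(x_counts) < 2 or len(y_counts) < 2:
        return 0.0  # a constant variable carries no association
    joint = Counter(zip(xs, ys))
    chi2 = 0.0
    for x, nx in x_counts.items():
        for y, ny in y_counts.items():
            expected = nx * ny / n
            chi2 += (joint[(x, y)] - expected) ** 2 / expected
    return sqrt(chi2 / (n * (min(len(x_counts), len(y_counts)) - 1)))


def two_block_partitions(values):
    """Yield every split of the distinct values into two non-empty groups.

    Each partition is returned as a dict mapping original value -> "A" or "B".
    Assumption: only two-group partitions are considered, to keep the search small.
    """
    values = sorted(values)
    first, rest = values[0], values[1:]
    for size in range(len(rest)):  # never take all of `rest`, so "B" is non-empty
        for extra in combinations(rest, size):
            block = {first, *extra}
            yield {v: ("A" if v in block else "B") for v in values}


def score(mapping, target, others, threshold=0.3):
    """Return (number of associated variables, mean association strength).

    Assumption: 'associated' means Cramér's V >= threshold (hypothetical cutoff).
    """
    recoded = [mapping[v] for v in target]
    strengths = [cramers_v(recoded, column) for column in others.values()]
    n_associated = sum(s >= threshold for s in strengths)
    return n_associated, sum(strengths) / len(strengths)


def best_partition(target, others, threshold=0.3):
    """Pick the partition with the most associated variables, then the highest mean V."""
    return max(
        two_block_partitions(set(target)),
        key=lambda mapping: score(mapping, target, others, threshold),
    )


if __name__ == "__main__":
    # Toy data: both other variables separate {a, b} from {c}.
    region = ["a", "b", "c"] * 4
    others = {
        "income": ["lo" if r in "ab" else "hi" for r in region],
        "tenure": ["rent" if r in "ab" else "own" for r in region],
    }
    print(best_partition(region, others))
```

Exhaustive enumeration grows exponentially with the number of distinct values, so a sketch like this is only practical for nominal variables with a handful of categories; a real system would need a heuristic search and a significance test rather than a fixed cutoff.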

Editor information

Xiaohui Liu, Paul Cohen, Michael Berthold

Copyright information

© 1997 Springer-Verlag

About this paper

Cite this paper

Scott, P.D., Williams, R.J., Ho, K.M. (1997). Forming categories in exploratory data analysis and data mining. In: Liu, X., Cohen, P., Berthold, M. (eds) Advances in Intelligent Data Analysis Reasoning about Data. IDA 1997. Lecture Notes in Computer Science, vol 1280. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0052844

  • DOI: https://doi.org/10.1007/BFb0052844

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-63346-4

  • Online ISBN: 978-3-540-69520-2

  • eBook Packages: Springer Book Archive
