Abstract
Automated tools for knowledge discovery are frequently invoked in databases where objects are already grouped under some known (i.e., external) classification scheme. In the context of unsupervised learning or clustering, such tools search large databases for alternative classification schemes that are meaningful and novel. The information gained with new clusters can be assessed by looking at the degree of separation between each new cluster and its most similar class. Our approach models each cluster and class as a multivariate Gaussian distribution and estimates their degree of separation through an information-theoretic measure (i.e., through relative entropy or Kullback–Leibler distance). The inherently large computational cost of this step is alleviated by first projecting all data onto the single dimension that best separates the two distributions (using Fisher’s Linear Discriminant). We test our algorithm on a dataset of Martian surfaces using the traditional division into geological units as external classes and the new, hydrology-inspired, automatically performed division as novel clusters. We find the new partitioning constitutes a formally meaningful classification that deviates substantially from the traditional classification.
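The pipeline outlined in the abstract can be illustrated with a minimal sketch; it is not the authors' implementation. It assumes NumPy, at least two features per object, sample covariances as the within-group scatter, and the direction KL(cluster || class) for the relative entropy, none of which the abstract specifies. The function names (`fisher_direction`, `kl_gaussian_1d`, `separation`, `assess_cluster`) are hypothetical.

```python
import numpy as np

def fisher_direction(a, b):
    """Unit vector w proportional to (S_a + S_b)^-1 (mean_a - mean_b)."""
    mean_diff = a.mean(axis=0) - b.mean(axis=0)
    # Assumption: within-group scatter taken as the sum of sample covariances.
    scatter = np.cov(a, rowvar=False) + np.cov(b, rowvar=False)
    w = np.linalg.solve(scatter, mean_diff)
    return w / np.linalg.norm(w)

def kl_gaussian_1d(mean_p, var_p, mean_q, var_q):
    """Closed-form KL(P || Q) for two univariate Gaussians."""
    return 0.5 * (np.log(var_q / var_p)
                  + (var_p + (mean_p - mean_q) ** 2) / var_q
                  - 1.0)

def separation(cluster, cls):
    """Project both groups onto Fisher's direction, then compare 1-D Gaussians."""
    w = fisher_direction(cluster, cls)
    p = cluster @ w  # projected cluster members
    q = cls @ w      # projected class members
    # Assumption: KL is taken from the cluster to the class, not symmetrized.
    return kl_gaussian_1d(p.mean(), p.var(ddof=1), q.mean(), q.var(ddof=1))

def assess_cluster(cluster, classes):
    """Separation between a cluster and its most similar (closest) class."""
    return min(separation(cluster, c) for c in classes)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cluster = rng.normal(0.0, 1.0, size=(500, 3))
    classes = [rng.normal(0.1, 1.0, size=(500, 3)),   # nearly identical class
               rng.normal(5.0, 1.0, size=(500, 3))]   # well-separated class
    print(assess_cluster(cluster, classes))
```

Projecting first means only two means and two variances are estimated per cluster-class pair, rather than full covariance matrices in the original feature space. A small value from `assess_cluster` suggests the cluster largely reproduces an existing class; a large value suggests it departs from all of them.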
Cite this article
Vilalta, R., Stepinski, T. & Achari, M. An efficient approach to external cluster assessment with an application to martian topography. Data Min Knowl Disc 14, 1–23 (2007). https://doi.org/10.1007/s10618-006-0045-7
DOI: https://doi.org/10.1007/s10618-006-0045-7