An efficient approach to external cluster assessment with an application to martian topography

Data Mining and Knowledge Discovery

Abstract

Automated tools for knowledge discovery are frequently invoked in databases where objects already group into some known (i.e., external) classification scheme. In the context of unsupervised learning or clustering, such tools delve inside large databases looking for alternative classification schemes that are meaningful and novel. An assessment of the information gained with new clusters can be effected by looking at the degree of separation between each new cluster and its most similar class. Our approach models each cluster and class as a multivariate Gaussian distribution and estimates their degree of separation through an information theoretic measure (i.e., through relative entropy or Kullback–Leibler distance). The inherently large computational cost of this step is alleviated by first projecting all data over the single dimension that best separates both distributions (using Fisher’s Linear Discriminant). We test our algorithm on a dataset of Martian surfaces using the traditional division into geological units as external classes and the new, hydrology-inspired, automatically performed division as novel clusters. We find the new partitioning constitutes a formally meaningful classification that deviates substantially from the traditional classification.
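
The projection-then-divergence step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`fisher_direction`, `kl_gaussian_1d`, `separation`) are hypothetical, variances are maximum-likelihood estimates, and the divergence is taken in the direction cluster to class, which is an assumption. The sketch also assumes the two groups have distinct means and a non-singular pooled scatter matrix.

```python
import math

def mean_vector(rows):
    # Component-wise mean of a list of equal-length points.
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def scatter_matrix(rows, mu):
    # Sum of outer products of deviations from the mean.
    d = len(mu)
    s = [[0.0] * d for _ in range(d)]
    for r in rows:
        diff = [x - m for x, m in zip(r, mu)]
        for i in range(d):
            for j in range(d):
                s[i][j] += diff[i] * diff[j]
    return s

def solve(a, b):
    # Solve a x = b by Gaussian elimination with partial pivoting.
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fisher_direction(cluster, klass):
    # Fisher's Linear Discriminant: w = Sw^{-1} (mu1 - mu2),
    # where Sw is the pooled within-group scatter matrix.
    mu1, mu2 = mean_vector(cluster), mean_vector(klass)
    s1, s2 = scatter_matrix(cluster, mu1), scatter_matrix(klass, mu2)
    sw = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(s1, s2)]
    return solve(sw, [a - b for a, b in zip(mu1, mu2)])

def gaussian_1d(values):
    # Maximum-likelihood mean and variance of a 1-D sample.
    n = len(values)
    mu = sum(values) / n
    var = sum((v - mu) ** 2 for v in values) / n
    return mu, var

def kl_gaussian_1d(mu_p, var_p, mu_q, var_q):
    # Relative entropy KL(p || q) between two 1-D Gaussians, in nats.
    return 0.5 * (math.log(var_q / var_p)
                  + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)

def separation(cluster, klass):
    # Project both groups onto the Fisher direction, fit 1-D Gaussians,
    # and return KL(cluster || class) along that single dimension.
    w = fisher_direction(cluster, klass)
    def project(rows):
        return [sum(wi * xi for wi, xi in zip(w, r)) for r in rows]
    p = gaussian_1d(project(cluster))
    q = gaussian_1d(project(klass))
    return kl_gaussian_1d(*p, *q)

if __name__ == "__main__":
    cluster = [(0, 0), (1, 0), (0, 1), (1, 1)]
    klass = [(5, 5), (6, 5), (5, 6), (6, 6)]
    print(separation(cluster, klass))
```

Because relative entropy is asymmetric, swapping the arguments of `separation` generally gives a different value. Working in one dimension replaces the determinant and inverse of a full covariance matrix with two scalar variances, which is the cost saving the abstract refers to.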

Author information

Correspondence to R. Vilalta.

About this article

Cite this article

Vilalta, R., Stepinski, T. & Achari, M. An efficient approach to external cluster assessment with an application to martian topography. Data Min Knowl Disc 14, 1–23 (2007). https://doi.org/10.1007/s10618-006-0045-7

  • DOI: https://doi.org/10.1007/s10618-006-0045-7
