Abstract
Statistical principles suggest minimization of the total within-group distance (TWGD) as a robust criterion for clustering point data associated with a Geographical Information System [17]. Although this NP-hard problem admits a linear programming formulation, in practice it must be solved using heuristic methods. The heuristics proposed so far require quadratic time, which is prohibitively expensive for data mining applications. This paper introduces data structures for managing large two-dimensional point data sets and for fast clustering via interchange heuristics. These structures avoid quadratic running time by approximating proximity information. Our scheme is illustrated with two-dimensional quadtrees, but it can be extended to other structures suited to three-dimensional data or to spatial data with time-stamps. The result is a fast and robust clustering method.
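To make the criterion concrete, the following is a minimal Python sketch, not the paper's method. It assumes that TWGD is the sum of Euclidean distances over all pairs of points assigned to the same group (the abstract does not state the formula), and it uses a naive single-point reassignment pass as a stand-in for an interchange heuristic. The function names `twgd` and `interchange_pass` are hypothetical. Each pass costs quadratic time in the number of points, which is the expense the abstract says the proposed data structures avoid; the quadtree-based approximation itself is not shown.

```python
import math
from itertools import combinations


def twgd(points, labels):
    """Total within-group distance under the assumed definition:
    the sum of Euclidean distances over all pairs sharing a label."""
    total = 0.0
    for (i, p), (j, q) in combinations(enumerate(points), 2):
        if labels[i] == labels[j]:
            total += math.dist(p, q)
    return total


def interchange_pass(points, labels, k):
    """One naive pass over the points (illustrative only).

    Each point is reassigned to the group whose current members have the
    smallest summed distance to it. Such a move never increases TWGD,
    because the point's contribution is exactly that summed distance.
    Computing the sums exactly takes O(n) per point, O(n^2) per pass.
    """
    labels = list(labels)
    for i, p in enumerate(points):
        cost = [0.0] * k  # cost[g] = summed distance from p to group g
        for j, q in enumerate(points):
            if j != i:
                cost[labels[j]] += math.dist(p, q)
        labels[i] = min(range(k), key=cost.__getitem__)
    return labels


# Two well-separated groups of three points; point 3 starts mislabeled.
points = [(0, 0), (0, 1), (1, 0), (10, 0), (10, 1), (11, 0)]
start = [0, 0, 0, 0, 1, 1]
improved = interchange_pass(points, start, k=2)
print(improved, twgd(points, start), twgd(points, improved))
```

In this toy input the mislabeled point moves to the nearer group and TWGD drops. On less clear-cut data a greedy pass of this kind can stall in a poor local optimum, so it should be read only as an illustration of the objective and of where the quadratic cost arises.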
References
J. Barnes and P. Hut. A hierarchical O(n log n) force-calculation algorithm. Nature, 324:446–449, 1986.
L. Belbin. The use of non-hierarchical allocation methods for clustering large sets of data. Australian Comp. J., 19:32–41, 1987.
R. L. Bowerman, P. H. Calamai, and G. B. Hall. The demand partitioning method for reducing aggregation errors in p-median problems. Computers & Operations Research, 26:1097–1111, 1999.
P. S. Bradley, O. L. Mangasarian, and W. N. Street. Clustering via concave minimization. Advances in neural information processing systems, 9:368-, 1997.
J. Carrier, L. Greengard, and V. Rokhlin. A fast adaptive multipole algorithm for particle simulation. SIAM J. Scientific and Statistical Computing, 9:669–686, 1988.
A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statistical Soc. B, 39:1–38, 1977.
R. Duda and P. Hart. Pattern Classification and Scene Analysis. Wiley, US, 1973.
E. Erkut and B. Bozkaya. Analysis of aggregation error for the p-median problem. Computers & Operations Research, 26:1075–1096, 1999.
M. Ester, H. P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases: Focusing techniques for efficient class identification. SSD-95, 70–82, 1995. Springer-Verlag LNCS 951.
V. Estivill-Castro and M. E. Houle. Robust clustering of large geo-referenced data sets. PAKDD-99, 327–337. Springer-Verlag LNAI 1574, 1999.
V. Estivill-Castro and M.E. Houle. Fast randomized algorithms for robust estimation of location. Proc. Int. Workshop on Mining Spatial and Temporal Data (with PAKDD-2001), Hong Kong, 2001.
V. Estivill-Castro and M. E. Houle. Spatio-temporal data structures for minimization of total within-group distance. T. Rep. 2001-05, Dep. of CS & SE, U. of Newcastle. http://www.cs.newcastle.edu.au/Dept/techrep.html
V. Estivill-Castro and J. Yang. A fast and robust general purpose clustering algorithm. PRICAI 2000, 208–218, 2000. Springer-Verlag LNAI 1886.
U. Fayyad, C. Reina, and P. S. Bradley. Initialization of iterative refinement clustering algorithms. 4th KDD 194–198. AAAI Press, 1998.
T. M. J. Fruchterman and E. M. Reingold. Graph drawing by force-directed placement. Software Practice and Experience, 21:1129–1164, 1991.
M. Horn. Analysis and computation schemes for p-median heuristics. Environment and Planning A, 28:1699–1708, 1996.
A. T. Murray. Spatial characteristics and comparisons of interaction and median clustering models. Geographical Analysis, 32:1-, 2000.
A. T. Murray and R. L. Church. Applying simulated annealing to location-planning models. J. of Heuristics, 2:31–53, 1996.
R. T. Ng and J. Han. Efficient and effective clustering methods for spatial data mining. 20th VLDB, 144–155, 1994. Morgan Kaufmann.
J. J. Oliver, R. A. Baxter, and C. S. Wallace. Unsupervised learning using MML. 13th ML Conf., 364–372, 1996. Morgan Kaufmann.
S. Openshaw. Two exploratory space-time-attribute pattern analysers relevant to GIS. In Spatial Analysis and GIS, 83–104, UK, 1994. Taylor and Francis.
A. J. Quigley and P. Eades. FADE: Graph drawing, clustering and visual abstraction. 8th Symp. on Graph Drawing, 2000. Springer-Verlag LNCS 1984.
M. Rao. Cluster analysis and mathematical programming. J. Amer. Statistical Assoc., 66:622–626, 1971.
K. Rosing and C. ReVelle. Optimal clustering. Environment and Planning A, 18:1463–1476, 1986.
P. J. Rousseeuw and A. M. Leroy. Robust regression and outlier detection. Wiley, USA, 1987.
H. Samet. The Design and Analysis of Spatial Data Structures. Addison-Wesley, MA, 1989.
E. Schikuta and M. Erhart. The BANG-clustering system: Grid-based data analysis. IDA-97. Springer-Verlag LNCS 1280, 1997.
M. B. Teitz and P. Bart. Heuristic methods for estimating the generalized vertex median of a weighted graph. Operations Research, 16:955–961, 1968.
H. Vinod. Integer programming and the theory of grouping. J. Am. Statistical Assoc., 64:506–517, 1969.
W. Wang, J. Yang, and R. Muntz. STING: A statistical information grid approach to spatial data mining. 23rd VLDB, 186–195, 1997. Morgan Kaufmann.
T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. SIGMOD Record, 25:103–114, 1996.
Copyright information
© 2001 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Estivill-Castro, V., Houle, M.E. (2001). Data Structures for Minimization of Total Within-Group Distance for Spatio-temporal Clustering. In: De Raedt, L., Siebes, A. (eds) Principles of Data Mining and Knowledge Discovery. PKDD 2001. Lecture Notes in Computer Science, vol 2168. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44794-6_8
DOI: https://doi.org/10.1007/3-540-44794-6_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42534-2
Online ISBN: 978-3-540-44794-8
eBook Packages: Springer Book Archive