Abstract
When clusters with different densities and noise lie in a spatial point set, the major obstacle to classifying these data is determining the classification thresholds, which form a series of bins for allocating each point to a cluster. Much of the previous work has adopted a model-based approach, but is either incapable of estimating the thresholds automatically or limited to only two point processes, i.e. noise and clusters with the same density. In this paper, we present a new density-based clustering method (DECODE), in which a spatial data set is presumed to consist of different point processes, and clusters with different densities belong to different point processes. DECODE is based upon a reversible jump Markov Chain Monte Carlo (MCMC) strategy and proceeds in three steps. The first step maps each point in the data to its mth nearest distance, that is, the distance between the point and its mth nearest neighbor. In the second step, classification thresholds are determined via a reversible jump MCMC strategy. In the third step, clusters are formed by spatially connecting the points whose mth nearest distances fall into a particular bin defined by the thresholds. Four experiments, two on simulated data sets and two on seismic data sets, are used to evaluate the algorithm. Results on simulated data show that our approach is capable of discovering the clusters automatically. Results on seismic data suggest that the clustered earthquakes identified by DECODE either imply the epicenters of forthcoming strong earthquakes or indicate the areas with the most intensive seismicity. This is consistent with the tectonic states and estimated stress distribution in the associated areas.
The comparison between DECODE and other state-of-the-art methods, such as DBSCAN, OPTICS and Wavelet Cluster, illustrates the contribution of our approach: although DECODE can be computationally expensive, it is capable of identifying the number of point processes and simultaneously estimating the classification thresholds with little prior knowledge.
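The first step described above, and the binning that precedes the third step, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `mth_nearest_distance` and `assign_bins` are hypothetical, the distance is assumed to be Euclidean, and the thresholds are supplied by hand rather than estimated by reversible jump MCMC. The spatial connection of points within a bin is also left out.

```python
import numpy as np


def mth_nearest_distance(points, m):
    """Distance from each point to its m-th nearest neighbour (Euclidean, assumed)."""
    pts = np.asarray(points, dtype=float)
    if not 1 <= m < len(pts):
        raise ValueError("m must satisfy 1 <= m < number of points")
    # Brute-force pairwise distance matrix; fine for small illustrative data.
    diff = pts[:, None, :] - pts[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # After sorting each row, column 0 is the zero self-distance,
    # so column m holds the distance to the m-th nearest neighbour.
    return np.sort(dist, axis=1)[:, m]


def assign_bins(distances, thresholds):
    """Bin index per point; bin 0 holds the smallest distances (densest process)."""
    # Thresholds are an assumption here: DECODE estimates them, this sketch does not.
    return np.digitize(distances, np.sort(np.asarray(thresholds, dtype=float)))


# Four collinear points with growing gaps between them.
example_points = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [7.0, 0.0]]
example_distances = mth_nearest_distance(example_points, m=1)
example_bins = assign_bins(example_distances, thresholds=[1.5, 3.0])
```

With hand-picked thresholds of 1.5 and 3.0, the two closely spaced points share the densest bin and the isolated points fall into sparser bins. In DECODE itself, both the number of bins and the threshold values come from the MCMC step.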
Additional information
Responsible editor: Charu Aggarwal.
Cite this article
Pei, T., Jasra, A., Hand, D.J. et al. DECODE: a new method for discovering clusters of different densities in spatial data. Data Min Knowl Disc 18, 337–369 (2009). https://doi.org/10.1007/s10618-008-0120-3
DOI: https://doi.org/10.1007/s10618-008-0120-3