DECODE: a new method for discovering clusters of different densities in spatial data

Pei, Tao; Jasra, Ajay; Hand, David J.; Zhu, A.-Xing; Zhou, Chenghu

doi:10.1007/s10618-008-0120-3

Published: 20 November 2008

Volume 18, pages 337–369, (2009)

Data Mining and Knowledge Discovery

Abstract

When clusters with different densities and noise lie in a spatial point set, the major obstacle to classifying these data is the determination of the thresholds for classification, which may form a series of bins for allocating each point to different clusters. Much of the previous work has adopted a model-based approach, but is either incapable of estimating the thresholds in an automatic way, or limited to only two point processes, i.e. noise and clusters with the same density. In this paper, we present a new density-based clustering method (DECODE), in which a spatial data set is presumed to consist of different point processes, and clusters with different densities belong to different point processes. DECODE is based upon a reversible jump Markov chain Monte Carlo (MCMC) strategy and is divided into three steps. The first step is to map each point in the data to its mth nearest distance, that is, the distance between a point and its mth nearest neighbor. In the second step, classification thresholds are determined via a reversible jump MCMC strategy. In the third step, clusters are formed by spatially connecting the points whose mth nearest distances fall into a particular bin defined by the thresholds. Four experiments, on two simulated data sets and two seismic data sets, are used to evaluate the algorithm. Results on simulated data show that our approach is capable of discovering the clusters automatically. Results on seismic data suggest that the clustered earthquakes identified by DECODE either imply the epicenters of forthcoming strong earthquakes or indicate the areas with the most intensive seismicity, which is consistent with the tectonic states and estimated stress distribution in the associated areas. The comparison between DECODE and other state-of-the-art methods, such as DBSCAN, OPTICS and Wavelet Cluster, illustrates the contribution of our approach: although DECODE can be computationally expensive, it is capable of identifying the number of point processes and simultaneously estimating the classification thresholds with little prior knowledge.
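
The first and third steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the reversible jump MCMC estimation of the second step is omitted, so the thresholds are supplied by hand, and the function names, the `link_radius` connection rule and the toy coordinates are all assumptions made for illustration only.

```python
import math
from collections import deque


def mth_nearest_distances(points, m):
    """Step 1: map each point to the distance to its m-th nearest neighbour.

    Brute force O(n^2 log n); adequate for a small illustration only.
    """
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[m - 1])
    return result


def assign_bins(distances, thresholds):
    """Allocate each point to a bin; bin 0 holds the smallest distances.

    In DECODE the thresholds come from reversible jump MCMC (step 2);
    here they are assumed to be given.
    """
    cuts = sorted(thresholds)
    return [sum(d > t for t in cuts) for d in distances]


def connect_clusters(points, bins, target_bin, link_radius):
    """Step 3 (simplified): spatially connect the points of one bin.

    Assumption: two points are linked when they lie within link_radius
    of each other; clusters are the connected components of that graph.
    """
    unvisited = {i for i, b in enumerate(bins) if b == target_bin}
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        queue, cluster = deque([seed]), [seed]
        while queue:
            i = queue.popleft()
            near = [j for j in unvisited
                    if math.dist(points[i], points[j]) <= link_radius]
            for j in near:
                unvisited.remove(j)
                queue.append(j)
                cluster.append(j)
        clusters.append(sorted(cluster))
    return sorted(clusters)


# Toy data (hypothetical): two dense squares plus two isolated noise points.
pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11),
       (5, 20), (20, 5)]
d = mth_nearest_distances(pts, m=2)
bins = assign_bins(d, thresholds=[3.0])  # hand-picked threshold, not estimated
clusters = connect_clusters(pts, bins, target_bin=0, link_radius=1.5)
print(bins)
print(clusters)
```

With a single hand-picked threshold the points fall into two bins, a dense one and a noise one, and the dense bin splits into two spatially separate clusters. Estimating how many bins exist and where their boundaries lie is the part that the paper assigns to reversible jump MCMC.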

Author information

Authors and Affiliations

Institute of Geographical Sciences and Natural Resources Research, 11A, Datun Road Anwai, Beijing, 100101, China
Tao Pei, A.-Xing Zhu & Chenghu Zhou
Institute for Mathematical Sciences, Imperial College, London, SW7 2PG, UK
Tao Pei
Department of Mathematics, Imperial College, London, UK
Ajay Jasra
Department of Mathematics and Institute for Mathematical Sciences, Imperial College, London, UK
David J. Hand
Department of Geography, University of Wisconsin Madison, 550N, Park Street, Madison, WI, 53706-1491, USA
A.-Xing Zhu

Corresponding author

Correspondence to Chenghu Zhou.

Additional information

Responsible editor: Charu Aggarwal.

About this article

Cite this article

Pei, T., Jasra, A., Hand, D.J. et al. DECODE: a new method for discovering clusters of different densities in spatial data. Data Min Knowl Disc 18, 337–369 (2009). https://doi.org/10.1007/s10618-008-0120-3

Received: 05 November 2007
Accepted: 21 October 2008
Published: 20 November 2008
Issue Date: June 2009
DOI: https://doi.org/10.1007/s10618-008-0120-3
