Extended Fuzzy C-Means hotspot detection method for large and very large event datasets
Introduction
A hotspot is generally defined as an area of localized events (for example, criminal incidents or fire events) geo-referenced as points on a geographical map. Hotspots are usually detected on such a map by means of clustering techniques. The most widely used techniques in this kind of spatial analysis are the Fuzzy C-Means (FCM) algorithm [2] and density-based estimation algorithms [23]. The FCM algorithm detects cluster prototypes as points on the map and calculates the membership degree of an event to a hotspot. For example, in [3], [6], [7], [11], [17] the FCM algorithm is used for detecting hotspots in crime analysis and in [1] for hotspots in rare disease analysis. The density estimation techniques identify clusters by searching for dense concentrations of events. More precisely, kernel density estimation is performed via an interpolation process which transforms the event points into a continuous surface as, e.g., in [4], [5], [16], [20], [23] (resp., [21]), where hotspots in crime analysis (resp., crashes in a road network) are detected. In [18], [19] an extension of the FCM, called the Extended Fuzzy C-Means (EFCM) algorithm, is presented: the cluster prototypes are hyper-volumes whose number is determined automatically from the data, thus avoiding the serious drawback of the FCM algorithm, in which the number of clusters must be assigned a priori (see [19] for many examples).
In [8], [9] the EFCM algorithm detects hotspots as cluster prototypes, each identified as a circle on the map (Fig. 1). The advantage of the EFCM algorithm in hotspot detection is its low computational complexity with respect to other clustering methods.
In [8] the authors applied this method for studying the spatial distribution and the spatio-temporal evolution of forest fires. In [9] the EFCM is applied to hotspot detection in disease analysis by considering as point events the residences of patients who underwent surgical interventions concerning the ear-laryngeal-pharyngeal apparatus between 2008 and 2012 in the district of Naples.
Today, access to large amounts of heterogeneous data has led to the development of clustering approaches for large (L) and very large (VL) datasets. Many variations of the FCM algorithm are present in the literature. For instance, in [13] the authors presented a variation of the FCM algorithm, the so-called random sampling extension of FCM (rseFCM); in [24] a fast FCM algorithm was developed for managing L datasets; in [22] an extended fast Fuzzy C-Means clustering algorithm was used for the segmentation of a VL dataset of digital images; in [25] a Particle Swarm Optimization algorithm is proposed for dealing with big data; and in [26] three FCM algorithms based on cloud computing are proposed for dealing with big data.
In [14] a variation of the FCM algorithm for L and VL datasets is presented, namely the single-pass FCM (spFCM), an iterative method in which a certain percentage of the dataset is loaded into memory at each step. The dataset is partitioned randomly into subsets of patterns and, in turn, each subset is partitioned into clusters by using a weighted FCM (wFCM) algorithm.
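This single-pass scheme can be sketched as follows. The snippet is our illustrative reading of spFCM, not the authors' pseudocode: `wfcm` and `sp_fcm` are hypothetical helper names, and the chunk handling and weight bookkeeping are assumptions; centers carried over from one chunk are weighted by the fuzzy mass they have already absorbed.

```python
import numpy as np

def wfcm(X, w, C, m=2.0, iters=60):
    """Minimal weighted FCM used inside the single-pass loop (sketch)."""
    # deterministic spread-out initialization of the C centers
    V = X[np.linspace(0, len(X) - 1, C).astype(int)].copy()
    for _ in range(iters):
        d = np.fmax(np.linalg.norm(X[:, None] - V[None], axis=2), 1e-12)
        U = 1.0 / d ** (2.0 / (m - 1.0))
        U /= U.sum(axis=1, keepdims=True)   # memberships sum to 1 per pattern
        um = w[:, None] * U ** m            # weights scale each contribution
        V = (um.T @ X) / um.sum(axis=0)[:, None]
    return V, U

def sp_fcm(X, C, chunk, m=2.0):
    """Single pass over the data: cluster each chunk together with the
    weighted centers carried over from the previous chunk (sketch)."""
    V = np.empty((0, X.shape[1]))
    wV = np.empty(0)
    for start in range(0, len(X), chunk):
        B = np.vstack([V, X[start:start + chunk]])
        wB = np.concatenate([wV, np.ones(len(B) - len(V))])
        V, U = wfcm(B, wB, C, m)
        wV = (wB[:, None] * U).sum(axis=0)  # total weight absorbed per center
    return V
```

Only one chunk (plus the C carried-over centers) is ever held in memory, which is what makes the scheme applicable to VL datasets.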
In [15] another variation of the FCM algorithm is presented, called online FCM (oFCM), where the dataset is partitioned into subsets and each subset is divided into clusters. Afterwards, the weighted FCM algorithm is applied to a dataset composed of the resulting clusters in order to obtain the final clusters.
In [10] a further variation of the FCM algorithm for L and VL datasets, called bit-reduced FCM (brFCM), is given. The brFCM was created for clustering L image datasets: the dataset is "binned" into a reduced dataset which is clustered via the wFCM algorithm, where the weight of each bin is the number of patterns aggregated into it.
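The binning step can be sketched as follows. This is an illustrative reconstruction, not the brFCM code of [10]: `bin_reduce` is a hypothetical helper for 1-D intensity values, and the uniform bin edges are an assumption.

```python
import numpy as np

def bin_reduce(values, n_bins):
    """Quantize 1-D values into bins; the reduced dataset is the set of
    occupied bin centers, each weighted by its pattern count (sketch)."""
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    idx = np.clip(np.digitize(values, edges) - 1, 0, n_bins - 1)
    occupied, counts = np.unique(idx, return_counts=True)
    centers = 0.5 * (edges[occupied] + edges[occupied + 1])
    return centers, counts  # feed (centers, weights=counts) to the wFCM
```

The reduced dataset has at most `n_bins` rows regardless of how many raw patterns were binned, which is where the memory saving comes from.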
In [13] kernel extensions of the rseFCM, spFCM, oFCM and brFCM algorithms are given. In this work we consider variations for L and VL datasets of the above four algorithms. We avoid kernelization of these algorithms because a serious drawback of this technique is the O(n²) computational complexity for storing the partition matrix [13]. There are two main differences between the VL-EFCM and VL-FCM variations:
- the number of clusters is not set a priori in the EFCM algorithm: it is obtained at the end of the iterative process. Hence the number of clusters detected for each subset of patterns can vary from subset to subset. For example, in the EFCM extension of the oFCM algorithm, the dimension of the final dataset is C1 + C2 + … + Cs, where s is the number of subsets and Cl (l = 1,…, s) is the number of clusters detected by applying the EFCM algorithm to the lth subset of the dataset;
- the weights assigned to each object are affected by the partition matrix, which in the EFCM algorithm also depends on the radii of the clusters.
Since an L dataset is still loadable into memory, in order to make tests and comparisons between the different algorithms we consider an L dataset consisting of the epicenters of earthquakes that have occurred in Italy since 1970. In order to evaluate the error of the results, we propose two indices, both based on the difference with respect to the results obtained by applying the EFCM algorithm to the whole L dataset. These indices, called I1 (recall) and I2 (precision), are based on the spatial intersection between the hotspots obtained by using the two methods. For instance, we show I1 and I2 by considering a single hotspot in Fig. 2.
Generally speaking, the index I1 (resp., I2) is given by the percentage of the area of the intersection zones with respect to the total area of the hotspots detected by using the EFCM (resp., VL-EFCM) algorithm. In Section 2 the FCM and VL-FCM algorithms are discussed. Section 3 concerns the EFCM and VL-EFCM algorithms. In Section 4 the VL-EFCM hotspot detection method is given. In Section 5 we show the results obtained in our case study, and final considerations are reported in Section 6. For reasons of clarity, in Table 1 we show the symbols representing the parameters used as inputs in the pseudocodes of the algorithms discussed in the subsequent sections.
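Since the detected hotspots are circles on the map, the two indices can be sketched for a single pair of hotspots as follows. This is an illustrative reading: `hotspot_indices` is a hypothetical helper, and the paper's indices aggregate the intersection areas over all hotspots rather than over one pair.

```python
import math

def circle_intersection_area(c1, r1, c2, r2):
    """Area of the intersection of two discs (standard lens formula)."""
    d = math.dist(c1, c2)
    if d >= r1 + r2:                       # disjoint hotspots
        return 0.0
    if d <= abs(r1 - r2):                  # one hotspot inside the other
        return math.pi * min(r1, r2) ** 2
    a1 = math.acos((d * d + r1 * r1 - r2 * r2) / (2 * d * r1))
    a2 = math.acos((d * d + r2 * r2 - r1 * r1) / (2 * d * r2))
    lens = 0.5 * math.sqrt((-d + r1 + r2) * (d + r1 - r2)
                           * (d - r1 + r2) * (d + r1 + r2))
    return r1 * r1 * a1 + r2 * r2 * a2 - lens

def hotspot_indices(efcm_hotspot, vl_hotspot):
    """I1 (recall-like) and I2 (precision-like) for one pair of circular
    hotspots, each given as ((x, y), radius)."""
    (c1, r1), (c2, r2) = efcm_hotspot, vl_hotspot
    inter = circle_intersection_area(c1, r1, c2, r2)
    i1 = inter / (math.pi * r1 * r1)   # fraction of the EFCM hotspot covered
    i2 = inter / (math.pi * r2 * r2)   # fraction of the VL-EFCM hotspot covered
    return i1, i2
```

For example, a VL-EFCM hotspot lying entirely inside an EFCM hotspot of twice the radius gives I2 = 1 (full precision) but I1 = 0.25 (only a quarter of the reference hotspot is recovered).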
FCM and weighted FCM algorithms
Let X = {x_1,…,x_N} ⊂ R^n be the dataset composed of N patterns in the space R^n. The FCM algorithm is based on the minimization of the following objective function [2]:

$J_m(U, V) = \sum_{i=1}^{C} \sum_{j=1}^{N} u_{ij}^{m}\, d_{ij}^{2}$, subject to $\sum_{i=1}^{C} u_{ij} = 1$ for each j,

where u_{ij} is the membership degree of the jth pattern x_j to the ith cluster with center v_i (i = 1,…, C), U is the C × N partition matrix, V = {v_1,…,v_C} ⊂ R^n is the set of the centers of the C clusters (point prototypes), and $d_{ij} = \lVert x_j - v_i \rVert$ is the (usually Euclidean) distance between v_i and the jth pattern x_j.
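The alternating updates that minimize this objective can be sketched as follows. This is the standard FCM iteration, not the paper's pseudocode; the deterministic initialization is an assumption for reproducibility.

```python
import numpy as np

def fcm(X, C, m=2.0, iters=100, tol=1e-6):
    """Standard FCM: alternate the membership update u_ij and the center
    update v_i until the centers stop moving (sketch)."""
    V = X[np.linspace(0, len(X) - 1, C).astype(int)].copy()
    for _ in range(iters):
        d = np.fmax(np.linalg.norm(X[:, None] - V[None], axis=2), 1e-12)
        U = 1.0 / d ** (2.0 / (m - 1.0))
        U /= U.sum(axis=1, keepdims=True)   # enforce sum_i u_ij = 1
        um = U ** m
        V_new = (um.T @ X) / um.sum(axis=0)[:, None]
        if np.abs(V_new - V).max() < tol:
            V = V_new
            break
        V = V_new
    J = (um * d ** 2).sum()                 # objective at the last iteration
    return V, U, J
```

Setting every pattern weight to 1 in a weighted FCM recovers exactly these updates; the wFCM used by the VL variations only multiplies each `um` row by the pattern's weight.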
EFCM and weighted EFCM algorithms
The EFCM was first proposed in [18], [19] and is briefly recalled here. In general, the prototypes are hyper-ellipsoids, which become hyper-spheres in the case of the Euclidean metric. If d_{ij} is the distance between the pattern x_j and the center v_i of the ith prototype V_i, and if r_i is the radius of V_i, we say that x_j belongs completely to V_i if d_{ij} ≤ r_i. The covariance matrix P_i, associated to the ith cluster V_i, is given by

$P_i = \dfrac{\sum_{j=1}^{N} u_{ij}^{m} (x_j - v_i)(x_j - v_i)^{T}}{\sum_{j=1}^{N} u_{ij}^{m}}$,

whose determinant gives the volume of the ith cluster.
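A sketch of the extended membership and of the fuzzy covariance follows. It reflects our reading of the EFCM of [18], [19]: the distance to a volume prototype is taken as max(d_ij − r_i, 0), so patterns inside the radius get (near-)full membership; the exact update rules and radius-growing step of the paper are not reproduced here.

```python
import numpy as np

def efcm_memberships(X, V, r, m=2.0):
    """Memberships w.r.t. volume prototypes of centers V and radii r:
    a pattern with d_ij <= r_i gets (near-)full membership (sketch)."""
    d = np.linalg.norm(X[:, None] - V[None], axis=2)   # d_ij
    dv = np.fmax(np.fmax(d - r[None, :], 0.0), 1e-12)  # distance to the volume
    U = 1.0 / dv ** (2.0 / (m - 1.0))
    U /= U.sum(axis=1, keepdims=True)
    return U

def fuzzy_covariance(X, v, u, m=2.0):
    """Fuzzy covariance P_i of one cluster with center v and memberships u;
    its determinant measures the cluster volume (sketch)."""
    um = u ** m
    D = X - v
    return (D.T * um) @ D / um.sum()
```

In two dimensions, as in the hotspot setting, the prototypes are circles and the determinant of P_i shrinks or grows with the spatial spread of the events absorbed by the cluster.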
VL-EFCM hotspot detection
We consider a dataset of patterns composed of geo-referenced events. Each pattern is formed by two numerical features, the latitude and the longitude in an assigned coordinate system, which define a location on a two-dimensional map; each pattern thus corresponds to a point on the map representing an event that occurred at that location. Now we suppose that the dataset is a VL dataset in which many events have been geo-referenced as points (for example, by using geo-coding). We use a VL-EFCM algorithm for
Test results
Our tests were performed considering an L dataset composed of all the epicentres of earthquakes that have occurred in Italy since 1970, extracted from the ISIDE database (Italian Seismological Instrumental and parametric DatabasE), available at http://iside.rm.ingv.it and managed by the Italian National Institute of Geophysics and Volcanology (INGV). The dataset is formed from more than 250,000 event points, each geo-referenced on the geographical map as shown in Fig. 4. We use a Pentium Intel Core I7
Conclusions
We present a hotspot detection method for L and VL event datasets, obtained by extending to the EFCM algorithm the four FCM variations for L and VL datasets. To evaluate the performances of the four algorithms, we consider an L event dataset formed from the epicentres of earthquakes registered in Italy since 1970. Tests have been performed by varying the cardinality of the subsets of the dataset and the number of bins. The results show that the best performances are obtained by
Acknowledgement
This work was performed under the auspices of GNCS-INdAM.
References (26)
- et al., The extended fuzzy C-means algorithm for hotspots in spatio-temporal GIS, Expert Syst. Appl. (2011)
- et al., Extending fuzzy and probabilistic clustering to very large datasets, Comput. Stat. Data Anal. (2006)
- et al., The detection of clusters in rare diseases, J. R. Stat. Soc. A (1991)
- Pattern Recognition with Fuzzy Objective Function Algorithms (1981)
- Environmental Criminology (1981)
- et al., GIS and Crime Mapping (Chapter 6: Identifying Crime Hotspots) (2013)
- et al., When is a hotspot a hotspot? A procedure for creating statistically robust hotspot geographic maps of crime
- et al., Social change and crime rate trends: a routine activity approach, Am. Sociol. Rev. (1979)
- et al., A comparative evaluation of approaches to urban crime pattern analysis, Urban Stud. (2000)
- et al., WebGIS based on spatio-temporal hotspots: an application to oto-laryngo-pharyngeal diseases, Soft Comput. (2015)