Extended Fuzzy C-Means hotspot detection method for large and very large event datasets
Introduction
A hotspot is generally defined as an area of localized events (for example, criminal incidents or fire events) geo-referenced as points on a geographical map. Hotspots are usually detected on such a map by means of clustering techniques. The most widely used techniques in this kind of spatial analysis are the Fuzzy C-Means (FCM) algorithm [2] and density-based estimation algorithms [23]. The FCM algorithm detects cluster prototypes as points on the map and calculates the membership degree of an event to a hotspot. For example, in [3], [6], [7], [11], [17] the FCM algorithm is used for detecting hotspots in crime analysis and in [1] for hotspots in rare disease analysis. The density estimation techniques identify clusters by searching for dense concentrations of events. More precisely, kernel density estimation is performed via an interpolation process which transforms the event points into a continuous surface as, e.g., in [4], [5], [16], [20], [23] (resp., [21]), where hotspots in crime analysis (resp., crashes in a road network) are detected. In [18], [19] an extension of the FCM, called the Extended Fuzzy C-Means (EFCM) algorithm, is presented: the cluster prototypes are hyper-volumes whose number is determined automatically from the data, thus avoiding the serious drawback of the FCM algorithm, in which the number of clusters must be assigned a priori (see [19] for many examples).
In [8], [9] the EFCM algorithm detects hotspots as cluster prototypes, each identified as a circle on the map (Fig. 1). The advantage of the EFCM algorithm in hotspot detection is its low computational complexity with respect to other clustering methods.
In [8] the authors applied this method for studying the spatial distribution and the spatio-temporal evolution of forest fires. In [9] the EFCM is applied to hotspot detection in disease analysis by considering as point events the residences of patients who underwent surgical interventions concerning the ear-laryngeal-pharyngeal apparatus between 2008 and 2012 in the district of Naples.
Today, access to large amounts of heterogeneous data has led to the development of clustering approaches for large (L) and very large (VL) datasets. Many variations of the FCM algorithm are present in the literature. For instance, in [13] the authors presented a variation of the FCM algorithm, the so-called random sampling extension of FCM (rseFCM); in [24] a fast FCM algorithm was developed for managing L datasets; in [22] an extended fast Fuzzy C-Means clustering algorithm was used for the segmentation of a VL dataset of digital images; in [25] a Particle Swarm Optimization algorithm is proposed for dealing with big data; and in [26] three FCM algorithms based on cloud computing are proposed for dealing with big data.
In [14] a variation of the FCM algorithm for L and VL datasets is presented, namely the single-pass FCM (spFCM), an iterative method in which a certain percentage of the dataset is loaded into memory at each step. The dataset is partitioned randomly into subsets of patterns and, in turn, each subset is partitioned into clusters by using a weighted FCM (wFCM) algorithm.
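This single-pass scheme can be sketched as follows. The snippet is our illustrative reading of spFCM, not the authors' pseudocode: `wfcm` and `sp_fcm` are hypothetical helper names, and the chunk handling and weight bookkeeping are assumptions; centers carried over from one chunk are weighted by the fuzzy mass they have already absorbed.

```python
import numpy as np

def wfcm(X, w, C, m=2.0, iters=60):
    """Minimal weighted FCM used inside the single-pass loop (sketch)."""
    # deterministic spread-out initialization of the C centers
    V = X[np.linspace(0, len(X) - 1, C).astype(int)].copy()
    for _ in range(iters):
        d = np.fmax(np.linalg.norm(X[:, None] - V[None], axis=2), 1e-12)
        U = 1.0 / d ** (2.0 / (m - 1.0))
        U /= U.sum(axis=1, keepdims=True)   # memberships sum to 1 per pattern
        um = w[:, None] * U ** m            # weights scale each contribution
        V = (um.T @ X) / um.sum(axis=0)[:, None]
    return V, U

def sp_fcm(X, C, chunk, m=2.0):
    """Single pass over the data: cluster each chunk together with the
    weighted centers carried over from the previous chunk (sketch)."""
    V = np.empty((0, X.shape[1]))
    wV = np.empty(0)
    for start in range(0, len(X), chunk):
        B = np.vstack([V, X[start:start + chunk]])
        wB = np.concatenate([wV, np.ones(len(B) - len(V))])
        V, U = wfcm(B, wB, C, m)
        wV = (wB[:, None] * U).sum(axis=0)  # total weight absorbed per center
    return V
```

Only one chunk (plus the C carried-over centers) is ever held in memory, which is what makes the scheme applicable to VL datasets.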
In [15] another variation of the FCM algorithm is presented, called online FCM (oFCM), where the dataset is partitioned into subsets and each subset is divided into clusters. Afterwards, the weighted FCM algorithm is applied to a dataset composed of the resulting clusters in order to obtain the final clusters.
In [10] a further variation of the FCM algorithm for L and VL datasets, called bit-reduced FCM (brFCM), is given. The brFCM was created for clustering L image datasets: the dataset is "binned" into a reduced dataset which is clustered via the wFCM algorithm, where the weight of each bin is the number of patterns aggregated into it.
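The binning step can be sketched as follows. This is an illustrative reconstruction, not the brFCM code of [10]: `bin_reduce` is a hypothetical helper for 1-D intensity values, and the uniform bin edges are an assumption.

```python
import numpy as np

def bin_reduce(values, n_bins):
    """Quantize 1-D values into bins; the reduced dataset is the set of
    occupied bin centers, each weighted by its pattern count (sketch)."""
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    idx = np.clip(np.digitize(values, edges) - 1, 0, n_bins - 1)
    occupied, counts = np.unique(idx, return_counts=True)
    centers = 0.5 * (edges[occupied] + edges[occupied + 1])
    return centers, counts  # feed (centers, weights=counts) to the wFCM
```

The reduced dataset has at most `n_bins` rows regardless of how many raw patterns were binned, which is where the memory saving comes from.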
In [13] kernel extensions of the rseFCM, spFCM, oFCM and brFCM algorithms are given. In this work we consider variations for L and VL datasets of the above four algorithms. We avoid kernelization of these algorithms because a serious drawback of this technique is the O(n²) computational complexity for storing the partition matrix [13]. There are two main differences between the VL-EFCM and VL-FCM variations:
- the number of clusters is not set a priori in the EFCM algorithm: it is obtained at the end of the iterative process. Hence the number of clusters detected for each subset of patterns can vary from subset to subset. For example, in the EFCM extension of the oFCM algorithm, the dimension of the final dataset is C1 + C2 + … + Cs, where s is the number of subsets and Cl (l = 1,…, s) is the number of clusters detected by applying the EFCM algorithm to the lth subset of the dataset;
- the weights assigned to each object are affected by the partition matrix, which in the EFCM algorithm also depends on the radii of the clusters.
Since an L dataset is still loadable into memory, in order to make tests and comparisons between the different algorithms we consider an L dataset consisting of the epicenters of earthquakes that have occurred in Italy since 1970. In order to evaluate the error of the results, we propose two indices, both based on the difference with respect to the results obtained by applying the EFCM algorithm to the whole L dataset. These indices, called I1 (recall) and I2 (precision), are based on the spatial intersection between the hotspots obtained by using the two methods. For instance, we show I1 and I2 by considering a single hotspot in Fig. 2.
Generally speaking, the index I1 (resp., I2) is given by the percentage of the area of the intersection zones with respect to the total area of the hotspots detected by using the EFCM (resp., VL-EFCM) algorithm. In Section 2 the FCM and VL-FCM algorithms are discussed. Section 3 concerns the EFCM and VL-EFCM algorithms. In Section 4 the VL-EFCM hotspot detection method is given. In Section 5 we show the results obtained in our case study, and final considerations are reported in Section 6. For reasons of clarity, in Table 1 we show the symbols representing the parameters used as inputs in the pseudocodes of the algorithms discussed in the subsequent sections.
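Since the detected hotspots are circles on the map, the two indices can be sketched for a single pair of hotspots as follows. This is an illustrative reading: `hotspot_indices` is a hypothetical helper, and the paper's indices aggregate the intersection areas over all hotspots rather than over one pair.

```python
import math

def circle_intersection_area(c1, r1, c2, r2):
    """Area of the intersection of two discs (standard lens formula)."""
    d = math.dist(c1, c2)
    if d >= r1 + r2:                       # disjoint hotspots
        return 0.0
    if d <= abs(r1 - r2):                  # one hotspot inside the other
        return math.pi * min(r1, r2) ** 2
    a1 = math.acos((d * d + r1 * r1 - r2 * r2) / (2 * d * r1))
    a2 = math.acos((d * d + r2 * r2 - r1 * r1) / (2 * d * r2))
    lens = 0.5 * math.sqrt((-d + r1 + r2) * (d + r1 - r2)
                           * (d - r1 + r2) * (d + r1 + r2))
    return r1 * r1 * a1 + r2 * r2 * a2 - lens

def hotspot_indices(efcm_hotspot, vl_hotspot):
    """I1 (recall-like) and I2 (precision-like) for one pair of circular
    hotspots, each given as ((x, y), radius)."""
    (c1, r1), (c2, r2) = efcm_hotspot, vl_hotspot
    inter = circle_intersection_area(c1, r1, c2, r2)
    i1 = inter / (math.pi * r1 * r1)   # fraction of the EFCM hotspot covered
    i2 = inter / (math.pi * r2 * r2)   # fraction of the VL-EFCM hotspot covered
    return i1, i2
```

For example, a VL-EFCM hotspot lying entirely inside an EFCM hotspot of twice the radius gives I2 = 1 (full precision) but I1 = 0.25 (only a quarter of the reference hotspot is recovered).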
FCM and weighted FCM algorithms
Let X = {x_1,…,x_N} ⊂ R^n be the dataset composed of N patterns in the space R^n. The FCM algorithm is based on the minimization of the following objective function [2]:

$J_m(U, V) = \sum_{i=1}^{C} \sum_{j=1}^{N} u_{ij}^{m}\, d_{ij}^{2}$, subject to $\sum_{i=1}^{C} u_{ij} = 1$ for each j,

where u_{ij} is the membership degree of the jth pattern x_j to the ith cluster with center v_i (i = 1,…, C), U is the C × N partition matrix, V = {v_1,…,v_C} ⊂ R^n is the set of the centers of the C clusters (point prototypes), and $d_{ij} = \lVert x_j - v_i \rVert$ is the (usually Euclidean) distance between v_i and the jth pattern x_j.
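The alternating updates that minimize this objective can be sketched as follows. This is the standard FCM iteration, not the paper's pseudocode; the deterministic initialization is an assumption for reproducibility.

```python
import numpy as np

def fcm(X, C, m=2.0, iters=100, tol=1e-6):
    """Standard FCM: alternate the membership update u_ij and the center
    update v_i until the centers stop moving (sketch)."""
    V = X[np.linspace(0, len(X) - 1, C).astype(int)].copy()
    for _ in range(iters):
        d = np.fmax(np.linalg.norm(X[:, None] - V[None], axis=2), 1e-12)
        U = 1.0 / d ** (2.0 / (m - 1.0))
        U /= U.sum(axis=1, keepdims=True)   # enforce sum_i u_ij = 1
        um = U ** m
        V_new = (um.T @ X) / um.sum(axis=0)[:, None]
        if np.abs(V_new - V).max() < tol:
            V = V_new
            break
        V = V_new
    J = (um * d ** 2).sum()                 # objective at the last iteration
    return V, U, J
```

Setting every pattern weight to 1 in a weighted FCM recovers exactly these updates; the wFCM used by the VL variations only multiplies each `um` row by the pattern's weight.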
EFCM and weighted EFCM algorithms
The EFCM was first proposed in [18], [19] and is briefly recalled here. In general, the prototypes are hyper-ellipsoids, which become hyper-spheres in the case of the Euclidean metric. If d_{ij} is the distance between the pattern x_j and the center v_i of the ith prototype V_i, and if r_i is the radius of V_i, we say that x_j belongs completely to V_i if d_{ij} ≤ r_i. The covariance matrix P_i, associated to the ith cluster V_i, is given by

$P_i = \dfrac{\sum_{j=1}^{N} u_{ij}^{m} (x_j - v_i)(x_j - v_i)^{T}}{\sum_{j=1}^{N} u_{ij}^{m}}$,

whose determinant gives the volume of the ith cluster.
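A sketch of the extended membership and of the fuzzy covariance follows. It reflects our reading of the EFCM of [18], [19]: the distance to a volume prototype is taken as max(d_ij − r_i, 0), so patterns inside the radius get (near-)full membership; the exact update rules and radius-growing step of the paper are not reproduced here.

```python
import numpy as np

def efcm_memberships(X, V, r, m=2.0):
    """Memberships w.r.t. volume prototypes of centers V and radii r:
    a pattern with d_ij <= r_i gets (near-)full membership (sketch)."""
    d = np.linalg.norm(X[:, None] - V[None], axis=2)   # d_ij
    dv = np.fmax(np.fmax(d - r[None, :], 0.0), 1e-12)  # distance to the volume
    U = 1.0 / dv ** (2.0 / (m - 1.0))
    U /= U.sum(axis=1, keepdims=True)
    return U

def fuzzy_covariance(X, v, u, m=2.0):
    """Fuzzy covariance P_i of one cluster with center v and memberships u;
    its determinant measures the cluster volume (sketch)."""
    um = u ** m
    D = X - v
    return (D.T * um) @ D / um.sum()
```

In two dimensions, as in the hotspot setting, the prototypes are circles and the determinant of P_i shrinks or grows with the spatial spread of the events absorbed by the cluster.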
VL-EFCM hotspot detection
We consider a dataset of patterns composed of geo-referenced events. Each pattern is formed by two numerical features, the latitude and the longitude in an assigned coordinate system, which define a location on a two-dimensional map; each pattern thus corresponds to a point on the map representing an event that occurred at that location. Now we suppose that the dataset is a VL dataset in which many events have been geo-referenced as points (for example, by using geo-coding). We use a VL-EFCM algorithm for
Test results
Our tests were performed considering an L dataset composed of all the epicentres of earthquakes that have occurred in Italy since 1970, extracted from the ISIDE database (Italian Seismological Instrumental and parametric DatabasE), available at http://iside.rm.ingv.it and managed by the Italian National Institute of Geophysics and Volcanology (INGV). The dataset is formed from more than 250,000 event points, each geo-referenced on the geographical map as shown in Fig. 4. We use a Pentium Intel Core I7
Conclusions
We present a hotspot detection method for L and VL event datasets, obtained by extending to the EFCM algorithm the four FCM variations for L and VL datasets. To evaluate the performances of the four algorithms, we consider an L event dataset formed from the epicentres of earthquakes registered in Italy since 1970. Tests have been performed by varying the cardinality of the subsets of the dataset and the number of bins. The results show that the best performances are obtained by
Acknowledgement
This work was performed under the auspices of GNCS-INdAM.
References (26)
- et al., The extended fuzzy C-means algorithm for hotspots in spatio-temporal GIS, Expert Syst. Appl. (2011)
- et al., Extending fuzzy and probabilistic clustering to very large datasets, Comput. Stat. Data Anal. (2006)
- et al., The detection of clusters in rare diseases, J. R. Stat. Soc. A (1991)
- Pattern Recognition with Fuzzy Objective Function Algorithms (1981)
- Environmental Criminology (1981)
- et al., GIS and Crime Mapping (Chapter 6: Identifying Crime Hotspots) (2013)
- et al., When is a hotspot a hotspot? A procedure for creating statistically robust hotspot geographic maps of crime
- et al., Social change and crime rate trends: a routine activity approach, Am. Sociol. Rev. (1979)
- et al., A comparative evaluation of approaches to urban crime pattern analysis, Urban Stud. (2000)
- et al., WebGIS based on spatio-temporal hotspots: an application to oto-laryngo-pharyngeal diseases, Soft Comput. (2015)