Synonyms
Co-locations; Hotspots; K-primary-route summarization; Location prediction; Spatial autocorrelation; Spatial data analysis; Spatial decision trees; Spatial outliers; Spatial statistics; Ring shaped hotspots
Definition
Spatial data mining [1, 2, 3] is the process of discovering nontrivial, interesting, and useful patterns in large spatial datasets. The most common spatial pattern families are co-locations, spatial hotspots, spatial outliers, and location predictions.
Figure 1 gives an example of a spatial hotspot pattern (the green circle) detected by SaTScan [4] from 250 cholera cases (red points) that occurred near Broad Street in London in 1854. Discovering spatial hotspots here is nontrivial because of the irregular size and shape of the pattern and because not all incidents contribute to the hotspot (e.g., the red points outside the circle). Discovering such a pattern is useful for public health, since it helps detect a disease outbreak.
Historical Background
Spatial data mining research began several decades ago when practitioners and researchers noticed that spatial datasets violate critical assumptions of classical data mining and statistics. First, whereas classical methods often assume that data are discrete, spatial data reside in continuous space. For example, classical data mining and statistical methods may use market-basket datasets (e.g., the history of Walmart’s transactions), where each item type in a transaction is discrete. However, “transactions” are not natural in continuous spatial datasets, and decomposing space into transactions loses information about neighbor relationships between items across transaction boundaries. Second, spatial data often exhibit heterogeneity (i.e., no two places on the Earth are identical), whereas classical data mining techniques often focus on spatially stationary global patterns (i.e., they ignore spatial variation across locations). Finally, a common assumption in classical statistical analysis is that data samples are independently generated. This assumption is generally false for spatial data, which tend to be highly self-correlated. For example, people with similar characteristics, occupations, and backgrounds tend to cluster together in the same neighborhoods. In spatial statistics [5], this tendency is called spatial autocorrelation. Ignoring spatial autocorrelation when analyzing data with spatial characteristics may produce hypotheses or models that are inaccurate or inconsistent with the dataset. Thus, classical data mining algorithms often perform poorly when applied to spatial datasets, and methods designed for spatial data are needed to detect spatial patterns.
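Spatial autocorrelation can be measured. As a minimal sketch, the Python code below computes Moran's I, a standard global autocorrelation statistic from spatial statistics, on two small grids. The rook-adjacency weights, the 4 × 4 example grids, and the function names are assumptions made for illustration and do not come from this entry.

```python
import numpy as np

def rook_weights(rows, cols):
    """Binary weight matrix: grid cells that share an edge are neighbors."""
    n = rows * cols
    w = np.zeros((n, n))
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if r + 1 < rows:                      # neighbor below
                w[i, i + cols] = w[i + cols, i] = 1
            if c + 1 < cols:                      # neighbor to the right
                w[i, i + 1] = w[i + 1, i] = 1
    return w

def morans_i(values, w):
    """Moran's I = (n / S0) * (z' W z) / (z' z), with z the centered values."""
    z = np.asarray(values, dtype=float).ravel()
    z = z - z.mean()
    return len(z) * (z @ w @ z) / (w.sum() * (z @ z))

w = rook_weights(4, 4)
clustered = np.array([[0, 0, 1, 1]] * 4)             # similar values sit together
checkerboard = np.indices((4, 4)).sum(axis=0) % 2    # every neighbor differs

i_clustered = morans_i(clustered, w)
i_checker = morans_i(checkerboard, w)
print(round(i_clustered, 3))   # 0.667: positive spatial autocorrelation
print(round(i_checker, 3))     # -1.0: perfectly alternating values
```

Values near zero would indicate the spatial independence that classical methods assume; the clustered grid shows the positive autocorrelation typical of spatial datasets.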
Foundations
The spatial data mining literature has focused on four main types of spatial patterns: (i) spatial outliers, which are spatial locations showing a significant difference from their neighbors; (ii) spatial co-locations, or subsets of event types that tend to be found together throughout space more often than other subsets of event types; (iii) location predictions, that is, information inferred about the locations favored by an event type based on other explanatory spatial variables; and (iv) spatial hotspots, which are unusual spatial groupings of events. The remainder of this section presents a general overview of each of these pattern categories.
A spatial outlier is a spatially referenced object whose nonspatial attribute values differ significantly from those of other spatially referenced objects in its spatial neighborhood. Figure 2 gives an example of spatial outliers detected in traffic measurements from sensors on northbound highway I-35W in the Minneapolis-St. Paul area over a 24-h period. Station 9 may be considered a spatial outlier because its traffic flow is inconsistent with that of its neighboring stations. Once a spatial outlier is identified, one may proceed with diagnosis. For example, the sensor at Station 9 may be diagnosed as malfunctioning. Spatial attributes are used to characterize location, neighborhood, and distance. Nonspatial attribute dimensions are used to compare a spatially referenced object to its neighbors. The spatial statistics literature provides two kinds of bipartite multidimensional tests, namely, graphical tests and quantitative tests. Graphical tests, such as variogram clouds and Moran scatterplots, are based on the visualization of spatial data and highlight spatial outliers. Quantitative methods provide a precise test to distinguish spatial outliers from the remainder of the data.
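One common form of quantitative test compares each object's attribute value with the average over its spatial neighbors and flags objects whose difference is extreme. The Python sketch below illustrates that idea under stated assumptions: the twelve station readings are invented, neighbors are simply the adjacent stations along the road, and the threshold of 2 standard deviations is an arbitrary choice.

```python
import numpy as np

def spatial_outliers(values, neighbors, theta=2.0):
    """Flag objects whose difference from their neighborhood mean is extreme.

    values    : nonspatial attribute value of each object
    neighbors : neighbors[i] lists the indices of the spatial neighbors of i
    theta     : threshold on the standardized difference (assumed value)
    """
    values = np.asarray(values, dtype=float)
    # Difference between each value and the mean value of its neighbors.
    diffs = np.array([values[i] - values[nbrs].mean()
                      for i, nbrs in enumerate(neighbors)])
    # Standardize the differences across all objects.
    z = (diffs - diffs.mean()) / diffs.std()
    flagged = [i for i in range(len(values)) if abs(z[i]) > theta]
    return flagged, z

# Hypothetical traffic volumes at 12 consecutive stations along one highway.
volumes = [50, 52, 51, 53, 52, 54, 53, 55, 54, 10, 55, 56]
n = len(volumes)
# Each station's neighbors are the adjacent stations on the road.
chain = [[j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)]

outliers, z_scores = spatial_outliers(volumes, chain)
print(outliers)   # [9]
```

Note that the stations next to the outlier also receive inflated scores because the outlier distorts their neighborhood means; they stay below the threshold in this example.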
Spatial co-location pattern discovery finds frequently co-located subsets of spatial event types given a map of their locations. Figure 3 gives an example map with two examples of spatial co-locations. Readers are encouraged to determine for themselves the co-located pairs of spatial event types in Fig. 3. The answers provided there show that trees and fire tend to co-occur across the spatial region, as do bird and house. Spatial co-location generalizes a classical data mining pattern family called association rules. The generalization is needed because transactions are not natural in spatial datasets, and partitioning space into transactions loses information about neighbor relationships between items near transaction boundaries. Additional details about co-location interest measures (e.g., the participation index and K-functions) and mining algorithms are described in [7].
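To make the participation index concrete, the Python sketch below computes it for a pair of event types. For each type, the participation ratio is the fraction of its instances that have an instance of the other type within a neighbor distance; the participation index is the smaller of the two ratios. The coordinates, the distance threshold, and the function name are illustrative assumptions, and the sketch handles pairs only, not larger co-location sets.

```python
from math import dist

def participation_index(points_a, points_b, d):
    """Participation index of the pair (A, B) for neighbor distance d."""
    def ratio(src, dst):
        # Fraction of src instances with at least one dst instance within d.
        return sum(any(dist(p, q) <= d for q in dst) for p in src) / len(src)
    return min(ratio(points_a, points_b), ratio(points_b, points_a))

# Hypothetical event locations on a map.
trees = [(0, 0), (5, 5), (10, 0), (3, 8)]
fires = [(0.5, 0.2), (5.3, 4.8), (9.6, 0.4)]
birds = [(20, 20), (25, 1)]

pi_tree_fire = participation_index(trees, fires, d=1.0)
pi_tree_bird = participation_index(trees, birds, d=1.0)
print(pi_tree_fire)   # 0.75: three of four trees are near a fire, every fire is near a tree
print(pi_tree_bird)   # 0.0: no tree has a bird within distance 1
```

A pair would be reported as a co-location when its participation index exceeds a user-chosen threshold. Taking the minimum of the ratios keeps a rare event type from being dominated by a common one.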
Location prediction is concerned with the discovery of a model to infer preferred locations of a spatial phenomenon from the maps of other explanatory spatial features. For example, ecologists may build models to predict habitats for endangered species using maps of vegetation, water bodies, climate, and other related species. Figure 4 gives an example of a dataset used in building a location prediction model for red-winged blackbirds in the Darr and Stubble wetlands on the shores of Lake Erie in Ohio, USA. This dataset consists of nest location, distance to open water, vegetation durability, and water depth maps. Classical prediction methods may be ineffective in this problem due to the presence of spatial autocorrelation. Spatial data mining techniques that capture the spatial autocorrelation of nest location such as the Spatial Autoregression Model (SAR) [5] and Markov Random Fields based Bayesian Classifiers (MRF-BC) are used for location prediction modeling. A comparison of these methods is discussed in [8].
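The difference between classical regression and spatial autoregression can be written compactly. The notation below is the standard textbook form and is an assumption here, since the entry itself gives no formula: y is the vector of the dependent variable (e.g., nest presence at each location), X holds the explanatory features, β is the coefficient vector, ε is the error term, W is a neighborhood (contiguity) matrix, and ρ is the strength of spatial dependency.

```latex
\begin{align}
  \text{Classical linear regression:} \quad
    & \mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon} \\
  \text{Spatial autoregression (SAR):} \quad
    & \mathbf{y} = \rho W \mathbf{y} + X\boldsymbol{\beta} + \boldsymbol{\varepsilon}
\end{align}
```

The additional term ρWy lets the value at a location depend on the values at neighboring locations. Setting ρ = 0 recovers the classical model, which treats samples as independent.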
Another problem similar to the above is the discovery of a model to infer a thematic map of different classes from the maps of other explanatory spatial features. For example, natural resource management researchers may build models to map wetland distributions on the Earth's surface using remote sensing imagery. Classical prediction methods such as decision trees may be limited in this problem due to the presence of spatial autocorrelation in the target thematic class map. This may lead to salt-and-pepper noise, i.e., pixels classified differently from all neighboring pixels. To address this limitation, a focal-test-based spatial decision tree, whose tree nodes test not only local information but also focal (neighborhood) information, has been proposed [9]. Figure 5 gives an example of the inputs and outputs of a decision tree and a spatial decision tree for wetland mapping in Chanhassen, MN, USA. The input feature maps consist of high-resolution (3 × 3 m) aerial photos (including R, G, B, and near-infrared bands) collected in 2003 and 2008. The output of the decision tree (Fig. 5c) has a large amount of salt-and-pepper noise (e.g., the black pixels highlighted by an ellipse). In contrast, the prediction of the spatial decision tree (Fig. 5d) has less salt-and-pepper noise. More details on these two methods are discussed in [9].
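The definition of salt-and-pepper noise given above can be checked directly on a classified map. The Python sketch below lists the pixels whose class label differs from that of every neighbor. It only measures the noise and is not the focal-test-based spatial decision tree of [9]; the 8-neighbor window, the small label map, and the function name are assumptions made for illustration.

```python
def salt_and_pepper_pixels(labels):
    """Return (row, col) of pixels labeled differently from all their neighbors."""
    rows, cols = len(labels), len(labels[0])
    noisy = []
    for r in range(rows):
        for c in range(cols):
            # Labels in the 3 x 3 window around (r, c), clipped at the map
            # border and excluding the pixel itself.
            nbrs = [labels[rr][cc]
                    for rr in range(max(r - 1, 0), min(r + 2, rows))
                    for cc in range(max(c - 1, 0), min(c + 2, cols))
                    if (rr, cc) != (r, c)]
            if all(v != labels[r][c] for v in nbrs):
                noisy.append((r, c))
    return noisy

# Hypothetical two-class output map (0 = dry land, 1 = wetland).
predicted = [
    [0, 0, 0, 1, 1],
    [0, 1, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 1],
]

noise = salt_and_pepper_pixels(predicted)
print(noise)   # [(1, 1), (3, 4)]
```

Counting such pixels in the outputs of two classifiers gives a simple way to compare how much salt-and-pepper noise each one produces.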
Data summarization is an important topic in data mining for finding a compact representation of a dataset. In spatial network activity summarization [10], we are given a spatial network and a collection of activities (e.g., pedestrian fatality reports, crime reports), and the goal is to find k shortest paths that summarize the activities. This problem is important for applications where observations occur along linear paths such as roadways and train tracks. For example, transportation planners and engineers need tools to help them identify which frequently used road segments or stretches pose risks for pedestrians and consequently should be redesigned. Figure 6 shows a case study on pedestrian fatalities on a street network. As can be seen, the classical K-means approach, using either Euclidean distance or network distance, cannot fully capture these network activities. In contrast, the K-Main Routes (KMR) approach [10] can capture the linear patterns. For instance, the blue group and its summary path capture the activities on the arterial road that were split across three groups in K-means.
Spatial hotspots are areas inside which the concentration of events or activities is significantly higher than outside. Examples of spatial hotspots include concentrations of crime events in a city and outbreaks of a disease. Hotspot patterns share properties with both clustering and anomaly detection in classical data mining. However, hotspot discovery [11] remains a challenging area of research due to variation in the shape, size, and density of hotspots and in the underlying space (e.g., Euclidean space or spatial networks such as road maps). Additional challenges arise from spatiotemporal semantics such as emerging hotspots and displacement.
The scan statistic is a statistical test used to detect clusters in a point process. A spatial scan statistic generalizes the scan statistic to higher-dimensional spatial point processes. It scans the study area with a window of predefined shape (e.g., a circle) and varying size and computes a test statistic called the likelihood ratio. The likelihood ratio is the ratio of the likelihood of the alternative hypothesis (a higher activity level inside the window) to the likelihood of the null hypothesis (the same activity level inside and outside). A p-value indicating statistical significance is estimated through Monte Carlo simulations. One popular implementation of the spatial scan statistic is SaTScan [4], which detects significant circular hotspots. However, in network space (e.g., street networks), hotspot patterns may have linear shapes (e.g., routes) rather than circular shapes. Oliver et al. [12] introduce a significant route discovery algorithm to detect significant linear hotspots from network activities. One example is shown in Fig. 7.
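The scanning procedure can be sketched in Python. The sketch assumes a Poisson model on a regular grid of cells with known population, uses circular windows centered on cells, and estimates the p-value by redistributing the cases at random in proportion to population. It is a simplified illustration, not SaTScan itself; the grid data, the radii, the number of simulations, and all function names are assumptions.

```python
import numpy as np

def poisson_llr(c, e, total):
    """Log likelihood ratio of a window with c observed and e expected cases."""
    if e <= 0 or c <= e:          # only windows with excess cases count
        return 0.0
    llr = c * np.log(c / e)
    if total > c:
        llr += (total - c) * np.log((total - c) / (total - e))
    return float(llr)

def best_window(cases, pop, radii):
    """Scan circular windows; return (max LLR, center cell, radius)."""
    rows, cols = cases.shape
    yy, xx = np.mgrid[0:rows, 0:cols]
    total_cases, total_pop = cases.sum(), pop.sum()
    best = (0.0, None, None)
    for r0 in range(rows):
        for c0 in range(cols):
            d = np.hypot(yy - r0, xx - c0)
            for radius in radii:
                inside = d <= radius
                c = cases[inside].sum()
                # Expected cases under the null: proportional to population.
                e = total_cases * pop[inside].sum() / total_pop
                llr = poisson_llr(c, e, total_cases)
                if llr > best[0]:
                    best = (llr, (r0, c0), radius)
    return best

def monte_carlo_p(cases, pop, radii, n_sims=99, seed=0):
    """Rank the observed maximum LLR among maxima from null simulations."""
    rng = np.random.default_rng(seed)
    observed = best_window(cases, pop, radii)[0]
    probs = pop.ravel() / pop.sum()
    exceed = 0
    for _ in range(n_sims):
        sim = rng.multinomial(int(cases.sum()), probs).reshape(cases.shape)
        if best_window(sim, pop, radii)[0] >= observed:
            exceed += 1
    return (exceed + 1) / (n_sims + 1)

# Hypothetical data: uniform population, one planted 3 x 3 block of excess cases.
pop = np.full((10, 10), 100)
cases = np.ones((10, 10), dtype=int)
cases[4:7, 4:7] = 10
radii = [1.0, 1.5, 2.0]

llr, center, radius = best_window(cases, pop, radii)
p_value = monte_carlo_p(cases, pop, radii)
print(center, radius, p_value)
```

Because the maximum is taken over all windows in both the observed and the simulated data, the p-value accounts for the many overlapping windows that are tested.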
Key Applications
Spatial data mining and the discovery of spatial patterns have applications in a number of areas. Detecting spatial outliers is useful in many applications of geographic information systems and spatial databases, including the domains of public safety, public health, climatology, and location-based services. As noted earlier, spatial outlier detection may be used to identify defective or unusually behaving sensors in a transportation system (e.g., Fig. 2). Spatial co-location discovery is useful in ecology, where the analysis of animal and plant habitats can identify co-locations of predator-prey species, symbiotic species, or fire events with fuel and ignition sources. Location prediction can be applied to predicting the climatic effects of El Niño on locations around the world. Finally, identification of spatial hotspots can be used in crime prevention and reduction, as well as in epidemiological tracking of disease.
Future Directions
In this entry, we have presented several major achievements in spatial data mining research. Current research is mostly concentrated on mining spatial data in Euclidean space. Spatial statistical models and spatial data mining methods for network space (e.g., street maps, river networks) remain largely unexplored. They are important in many applications such as water monitoring in river networks and crime analysis on street maps. Networks pose new challenges due to their unique dependency structure, directionality, and distance metric. Future research in this area should be encouraged, e.g., on how to generalize scan statistics to spatial networks. In addition, ring-shaped hotspot detection [13] is an interesting and challenging problem. For example, Fig. 8 shows the arson crimes in San Diego, CA, in 2013. The K-means clustering algorithm simply partitions the crime incidents into two clusters (in red and green). A circular hotspot detection algorithm, i.e., SaTScan, detects only one hotspot pattern (in green). In contrast, a ring-shaped hotspot detection algorithm finds two ring hotspots (in green) that are statistically significant. Another future direction is mining spatiotemporal data such as disease outbreaks and moving objects. Involving the time dimension adds new challenges such as temporal autocorrelation and temporal non-stationarity.
Recommended Reading
Shekhar S, Chawla S. Spatial databases: a tour. Englewood-Cliffs: Prentice Hall; 2003.
Miller HJ, Han J. Geographic data mining and knowledge discovery. 2nd ed. Boca Raton: CRC Press; 2009.
Zhou X, Shekhar S, Ali R. Spatiotemporal change footprint pattern discovery: an inter-disciplinary survey. Wiley Interdiscip Rev: Data Min Knowl Disc. 2014;4(1):1–23.
Kulldorff M. A spatial scan statistic. Commun Stat-Theory Methods. 1997;26(6):1481–96.
Cressie NA. Statistics for spatial data. Rev ed. New York: Wiley; 1993.
Kou Y, Lu CT, Chen D. Algorithms for spatial outlier detection. In: Proceedings of the 3rd IEEE International Conference on Data Mining; 2003. p. 597–600.
Huang Y, Shekhar S, Xiong H. Discovering co-location patterns from spatial datasets: a general approach. IEEE Trans Knowl Data Eng. 2004;16(12):1472–85.
Shekhar S, Schrater P, Vatsavai R, Wu W, Chawla S. Spatial contextual classification and prediction models for mining geospatial data. IEEE Trans Multimed. (special issue on Multimedia Databases). 2002;4(2):174–88.
Jiang Z, Shekhar S, Zhou X, Knight J, Corcoran J. Focal-test-based spatial decision tree learning: a summary of results. In: Proceedings of the 13th IEEE International Conference on Data Mining; 2013. p. 320–9.
Oliver D, Shekhar S, Kang J, Laubscher R, Carlan V, Bannur A. A K-main routes approach to spatial network activity summarization. IEEE Trans Knowl Data Eng. 2014;26(6):1464–78.
US Department of Justice – Mapping and Analysis for Public Safety report. Mapping crime: understanding hot spots. 2005. http://www.ncjrs.gov/pdffiles1/nij/209393.pdf.
Oliver D, Shekhar S, Zhou X, Eftelioglu E, Evans MR, Zhuang Q, Kang JM, Laubscher R, Farah C. Significant route discovery: a summary of results. In: Proceedings of the 8th International Conference on Geographic Information Science; 2014. p. 284–300.
Eftelioglu E, Shekhar S, Kang JM, Farah CC. Ring-shaped hotspot detection. IEEE Trans Knowl Data Eng. 2016;28(12):3367–81.
Longley PA, Goodchild M, Maguire DJ, Rhind DW. Geographic information systems and science. Chichester: Wiley; 2005.
Mamoulis N, Cao H, Cheung DW. Mining frequent spatio-temporal sequential patterns. In: Proceedings of the 5th IEEE International Conference on Data Mining; 2005. p. 82–9.
Shekhar S, Lu CT, Zhang P. A unified approach to detecting spatial outliers. GeoInformatica. 2003;7(2):139–66.
Shekhar S, Zhang P, Huang Y, Vatsavai R, Kargupta H, Joshi A, Sivakumar K, Yesha Y. Trend in spatial data mining. In: Data mining: next generation challenges and future directions. AAAI/MIT Press; 2003.
Solberg AH, Taxt T, Jain AK. A Markov random field model for classification of multisource satellite imagery. IEEE Trans Geosci Remote Sens. 1996;34(1):100–13.
Shekhar S, Evans M, Kang J, Mohan P. Identifying patterns in spatial information: a survey of methods. Wiley Interdiscip Rev: Data Min Knowl Disc. 2011;1(3):193–214.
Shekhar S, Gunturi V, Evans MR, Yang K. Spatial big-data challenges intersecting mobility and cloud computing. In: Proceedings of the 11th ACM International Workshop on Data Engineering for Wireless and Mobile Access; 2012. p. 1–6.
© 2018 Springer Science+Business Media LLC (outside the USA)
Cite this entry
Shekhar, S., Jiang, Z., Kang, J., Gandhi, V. (2018). Spatial Data Mining. In: Liu, L., Özsu, M.T. (eds) Encyclopedia of Database Systems. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-8265-9_357
DOI: https://doi.org/10.1007/978-1-4614-8265-9_357
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-8266-6
Online ISBN: 978-1-4614-8265-9
eBook Packages: Computer Science; Reference Module Computer Science and Engineering