Abstract
Ubiquitous sensor stations are now deployed worldwide to measure geophysical variables (e.g. temperature, humidity, light) for a growing number of ecological and industrial processes. Although these variables are generally measured over large areas and long (potentially unbounded) periods of time, stations cannot cover every location in space. Moreover, the data produced are too voluminous to be recorded in full for future analysis. In this scenario, summarization, i.e. the computation of aggregates of data, can be used to reduce the amount of data stored on disk, while interpolation, i.e. the estimation of unknown data at each location of interest, can be used to supplement station records. We present a novel data mining solution, named interpolative clustering, that addresses both tasks in time-evolving, multivariate geophysical applications. It yields a time-evolving clustering model to summarize geophysical data and computes a weighted linear combination of cluster prototypes to predict data. Clustering accounts for the local presence of spatial autocorrelation in the geophysical data. The weights of the linear combination reflect the inverse distance of the unseen data to each cluster geometry, which is represented through shape-dependent sampling of the geographic coordinates of the clustered stations. Experiments performed on several data collections investigate the trade-off between the summarization capability and the predictive accuracy of the presented interpolative clustering algorithm.
Notes
The predictive clustering framework was originally defined in Blockeel et al. (1998) to combine clustering problems with classification/regression problems. Predictive inference is performed by distinguishing between target variables and explanatory variables. Target variables are considered when evaluating the similarity between training data, so that training examples with similar target values are grouped in the same cluster, while training examples with dissimilar target values are grouped in separate clusters. Explanatory variables are used to generate a symbolic description of the clusters. Although the algorithm presented in Blockeel et al. (1998) can, in principle, be run with the same set of variables in both the explanatory and target roles, this case is not investigated in the original study.
Inverse distance weighting is a common interpolation algorithm. It has several advantages that endorse its widespread use in geostatistics (Li and Revesz 2002; Karydas et al. 2009; Li et al. 2011): simplicity of implementation; lack of tunable parameters; ability to interpolate scattered data and work on any grid without suffering from multicollinearity.
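A minimal sketch of inverse distance weighting, assuming Euclidean distance between coordinate tuples and a distance exponent fixed at 2 (a common default, not a setting confirmed by this note). The function name `idw_predict` and the sample stations are hypothetical.

```python
import math


def idw_predict(known, query, power=2.0):
    """Estimate the value at `query` from (coords, value) pairs in `known`.

    Each known point contributes with weight 1 / distance**power, so nearby
    points dominate the estimate.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for coords, value in known:
        distance = math.dist(coords, query)
        if distance == 0.0:
            # The query coincides with a known point: return its value directly.
            return value
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    if weight_total == 0.0:
        raise ValueError("at least one known point is required")
    return weighted_sum / weight_total


# Hypothetical stations: ((x, y), measured value).
stations = [((0.0, 0.0), 10.0), ((2.0, 0.0), 20.0)]
estimate = idw_predict(stations, (1.0, 0.0))  # equidistant from both stations
```

Because the query point is equidistant from the two stations, both receive the same weight and the estimate is their plain average.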
We can extend this representation of a sensor network by considering a multi-dimensional representation of space. In the multi-dimensional case, multiple variables will be used to identify the location of a station. These multiple variables will be taken into account when computing the distance between sensors.
In the on-line learning phase, missing observations of a variable are interpolated in the data snapshot by using the inverse distance weighted sum of nearby known data in the row.
The spherical law of cosines is used to approximate the geographical distance between the geographic coordinates (e.g. latitude and longitude) of two sensors.
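The distance computation can be sketched as follows, assuming coordinates in decimal degrees and a mean Earth radius of 6371 km. Both the radius constant and the function name `spherical_cosine_distance` are assumptions of this sketch, not values taken from the paper.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def spherical_cosine_distance(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_KM):
    """Great-circle distance between two points via the spherical law of cosines."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    delta_lon = math.radians(lon2 - lon1)
    cos_angle = (math.sin(phi1) * math.sin(phi2)
                 + math.cos(phi1) * math.cos(phi2) * math.cos(delta_lon))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return radius * math.acos(cos_angle)


# Two points on the equator, a quarter of the globe apart.
quarter_circle = spherical_cosine_distance(0.0, 0.0, 0.0, 90.0)
```

For very short distances the arccosine loses numerical precision, which is why the haversine formula is sometimes preferred; for sensors that are kilometres apart the two agree closely.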
The time cost of computing the local indicators of the spatial autocorrelation property can be made subquadratic by using a spatial data structure that maintains, for each sensor in the network, the sphere of its neighbours. The structure is updated only when a sensor is switched on or switched off in the network.
The quadtree decomposition of a cluster recursively divides a cluster quadrant into four subquadrants until the final quadrants are determined. As we plan to compute \(Np^{\%}\) final quadrants, the number of levels of this quadtree decomposition is about \(\log _4(Np^{\%})\).
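The level count can be checked with a short sketch. It assumes that \(N\) is the number of stations in the cluster, that \(p^{\%}\) is a sampling fraction in (0, 1], and that every quadrant is split at every level, so that a tree with \(\ell\) levels has \(4^{\ell}\) final quadrants. The function name `quadtree_levels` is hypothetical.

```python
def quadtree_levels(n_stations, sampling_fraction):
    """Smallest number of full quadtree splits yielding the target quadrant count.

    Integer arithmetic is used instead of a floating-point log base 4 so that
    exact powers of four are not affected by rounding.
    """
    target_quadrants = max(1, round(n_stations * sampling_fraction))
    levels = 0
    quadrants = 1
    while quadrants < target_quadrants:
        quadrants *= 4  # each level splits every quadrant into four
        levels += 1
    return levels


# 640 stations sampled at 10% give 64 final quadrants, i.e. log_4(64) = 3 levels.
levels = quadtree_levels(640, 0.10)
```

When the target count is not a power of four, the loop rounds the number of levels up, which matches the "about" in the estimate above.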
We compute the RRMSE to scale the error on a target variable by the domain size of the variable.
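This note does not spell out the formula, so the sketch below uses one common definition of the relative root mean squared error as an assumption: the squared error of the predictions divided by the squared error of a predictor that always returns the mean of the observed values. The paper's exact normalization may differ, and the function name `rrmse` is hypothetical.

```python
import math


def rrmse(actual, predicted):
    """Relative RMSE: model error relative to the error of the mean predictor."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and equally long")
    mean_actual = sum(actual) / len(actual)
    squared_error = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    baseline_error = sum((a - mean_actual) ** 2 for a in actual)
    if baseline_error == 0.0:
        raise ValueError("actual values are constant; RRMSE is undefined")
    return math.sqrt(squared_error / baseline_error)


observed = [1.0, 2.0, 3.0]
score_mean_predictor = rrmse(observed, [2.0, 2.0, 2.0])  # always predicts the mean
score_perfect = rrmse(observed, observed)
```

Under this definition a value of 1 means the model is no better than predicting the mean, and values below 1 indicate an improvement that is comparable across variables with different scales.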
We note that the local indicators, which are computed from pairwise comparisons between neighbouring stations, can be precomputed before building the tree. Thus, only the variance reduction of the local indicators is evaluated at each node. On the other hand, MoranVar computes the global indicator of spatial autocorrelation at each node. This requires pairwise comparisons between the neighbouring stations that fall in the present node. As the number of neighbours may change throughout the tree, the global measure has to be recomputed at each node.
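To illustrate why local indicators can be computed once, before tree construction, the sketch below computes Anselin's local Moran statistic with binary neighbour weights. This particular indicator, the unstandardized weights, and the function name `local_moran` are assumptions chosen for illustration; the indicator used in the paper may differ.

```python
def local_moran(values, neighbours):
    """Local Moran's I for each station, using binary neighbour weights.

    `values[i]` is the measurement at station i and `neighbours[i]` lists the
    indices of its neighbours. The result depends only on the full network,
    so it can be computed once and reused at every tree node.
    """
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    second_moment = sum(d * d for d in deviations) / n
    if second_moment == 0.0:
        raise ValueError("values are constant; local Moran's I is undefined")
    return [
        deviations[i] / second_moment * sum(deviations[j] for j in neighbours[i])
        for i in range(n)
    ]


# Four stations on a line: two low readings followed by two high readings.
readings = [1.0, 1.0, 5.0, 5.0]
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
indicators = local_moran(readings, adjacency)
```

The two end stations have a like-valued neighbour and get a positive indicator, while the two middle stations sit on the boundary between low and high readings and get zero.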
References
Aggarwal CC, Han J, Wang J, Yu PS (2003) A framework for clustering evolving data streams. In: Proceedings of 29th international conference on very large data bases (VLDB 2003), pp 81–92
Aggarwal CC, Han J, Wang J, Yu PS (2007) On clustering massive data streams: a summarization paradigm. In: Advances in database systems: data streams models and algorithms (book chapter), vol 31. Springer-US, pp 9–38
Aho T, Zenko B, Dzeroski S, Elomaa T (2012) Multi-target regression with rule ensembles. J Mach Learn Res 13:2367–2407
Angin P, Neville J (2008) A shrinkage approach for modeling non-stationary relational autocorrelation. In: Proceedings of the 8th IEEE international conference on data mining, IEEE Computer Society, pp 707–712
Anselin L (1995) Local indicators of spatial association: LISA. Geogr Anal 27(2):93–115
Appice A, Ceci M, Malerba D, Lanza A (2012) Learning and transferring geographically weighted regression trees across time. In: Proceedings of MSM/MUSE 2012, LNCS, vol 7472. Springer, Berlin, pp 97–117
Appice A, Ciampi A, Malerba D (2013a) Summarizing numeric spatial data streams by trend cluster discovery. Data Mining Knowl Discov. doi:10.1007/s10618-013-0337-7
Appice A, Ciampi A, Malerba D, Guccione P (2013b) Using trend clusters for spatiotemporal interpolation of missing data in a sensor network. J Spatial Inf Sci 6(1):119–153
Appice A, Pravilovic S, Malerba D, Lanza A (2013c) Enhancing regression models with spatio-temporal indicator additions. In: Baldoni M, Baroglio C, Boella G, Micalizio R (eds) Proceedings of AI*IA 2013: Advances in Artificial Intelligence—XIIIth international conference of the Italian Association for Artificial Intelligence, Lecture Notes in Computer Science, vol 8249. Springer, Berlin, pp 433–444
Bailey T, Krzanowski W (2012) An overview of approaches to the analysis and modelling of multivariate geostatistical data. Math Geosci 44(4):381–393. doi:10.1007/s11004-011-9360-7
Blanchet FG, Legendre P, Borcard D (2008) Modelling directional spatial processes in ecological data. Ecol Model 215(4):325–336. doi:10.1016/j.ecolmodel.2008.04.001. http://www.sciencedirect.com/science/article/pii/S0304380008001798
Blockeel H, De Raedt L, Ramon J (1998) Top–down induction of clustering trees. In: Proceedings of ICML. Morgan Kaufmann, pp 55–63
Boots B (2002) Local measures of spatial association. Ecoscience 9(2):168–176
Burrough P, McDonnell R (1998) Principles of geographical information systems. Oxford University Press, Oxford
Chen Z, Yang S, Li L, Xie Z (2010) A clustering approximation mechanism based on data spatial correlation in wireless sensor networks. In: Proceedings of the 9th conference on wireless telecommunications symposium, WTS 2010. IEEE Press, pp 208–214
Chiky R, Hébrail G (2008) Summarizing distributed data streams for storage in data warehouses. In: Proceedings of the 10th international conference on data warehousing and knowledge discovery (DaWaK 2008), LNCS, vol 5182. Springer, Berlin, pp 65–74
Cressie N (1990) The origins of kriging. Math Geol 22(3):239–252. doi:10.1007/BF00889887
Cressie N (1993) Statistics for spatial data. Wiley, New York. doi:10.1111/j.1365-3121.1992.tb00605.x
Debeljak M, Trajanov A, Stojanova D, Leprince F, Džeroski S (2012) Using relational decision trees to model out-crossing rates in a multi-field setting. Ecol Model 245:75–83
Demšar D, Debeljak M, Lavigne C, Džeroski S (2005) Modelling pollen dispersal of genetically modified oilseed rape within the field. In: Abstracts of the 90th ESA annual meeting, The Ecological Society of America, p 152
Dray S, Jombart T (2011) Revisiting Guerry’s data: introducing spatial constraints in multivariate analysis. Ann Appl Stat 5(4):2278–2299
Dray S, Legendre P, Peres-Neto PR (2006) Spatial modelling: a comprehensive framework for principal coordinate analysis of neighbour matrices (PCNM). Ecol Model 196(3–4):483–493. doi:10.1016/j.ecolmodel.2006.02.015. http://www.sciencedirect.com/science/article/pii/S0304380006000925
European Environment Agency (2006) Corine land cover 2006. http://sia.eionet.europa.eu/CLC2006
Gama J (2010) Knowledge discovery from data streams, 1st edn. Chapman & Hall/CRC, Boca Raton
Getis A (2008) A history of the concept of spatial autocorrelation: a geographer’s perspective. Geogr Anal 40(3):297–309
Getis A, Ord JK (1992) The analysis of spatial association by use of distance statistics. Geogr Anal 24(3):189–206
Goodchild M (1986) Spatial autocorrelation. Geo Books
Goovaerts P (1997) Geostatistics for natural resources evaluation. Oxford University Press, Oxford
Gora G, Wojna A (2002) RIONA: a classifier combining rule induction and k-NN method with automated selection of optimal neighbourhood. In: Proceedings of ECML 2002. Springer, Berlin, pp 111–123
Holden ZA, Evans JS (2010) Using fuzzy c-means and local autocorrelation to cluster satellite-inferred burn severity classes. Int J Wildland Fire 19(7):853–860
Ikonomovska E, Gama J, Dzeroski S (2011) Incremental multi-target model trees for data streams. In: Chu WC, Wong WE, Palakal MJ, Hung CC (eds) Proceedings of the 2011 ACM symposium on applied computing (SAC). ACM, pp 988–993
Ingelrest F, Barrenetxea G, Schaefer G, Vetterli M, Couach O, Parlange M (2010) Sensorscope: application-specific sensor network for environmental monitoring. ACM Trans Sens Netw 17(1–17):32
Isaaks EH, Srivastava RM (1989) An introduction to applied geostatistics. Oxford University Press, Oxford
Karydas C, Gitas I, Koutsogiannaki E, Lydakis-Simantiris N, Silleos G (2009) Evaluation of spatial interpolation techniques for mapping agricultural topsoil properties in Crete. In: Proceedings of EARSeL 2009, vol 8, pp 26–39
Kelley P, Barry R (1999) Sparse spatial autoregressions. Stat Probab Lett 33:291–297
Kim B, Tsiotras P (2009) Image segmentation on cell-center sampled quadtree and octree grids. pp 72, 480L–72, pp. 480L–9. doi:10.1117/12.810965
Kistler R, Kalnay E, Collins W, Saha S, White G, Woollen J, Chelliah M, Ebisuzaki W, Kanamitsu M, Kousky V, van den Dool H, Jenne R, Fiorino M (2001) The NCEP/NCAR 50-year reanalysis. Bull Am Meteorol Soc 82(2):247–267
Krige DG (1951) A statistical approach to some mine valuation and allied problems on the Witwatersrand. Master’s thesis
Lam N (1983) Spatial interpolation methods: a review. Am Cartogr 10:129–149. doi:10.1559/152304083783914958
Legendre P (1993) Spatial autocorrelation: trouble or new paradigm? Ecology 74:1659–1673
LeSage JH, Pace K (2001) Spatial dependence in data mining. In: Data mining for scientific and engineering applications. Kluwer, Dordrecht, pp 439–460
Li J, Heap A (2008) A review of spatial interpolation methods for environmental scientists. Geoscience Australia, Record 2008/23
Li L, Revesz P (2002) A comparison of spatio-temporal interpolation methods. GIScience, LNCS 2478. Springer, Berlin, pp 145–160
Li L, Zhang X, Holt J, Tian J, Piltner R (2011) Spatiotemporal interpolation methods for air pollution exposure. In: Proceedings of SARA 2011, AAAI
Lin G, Chen L (2004) A spatial interpolation method based on radial basis function networks incorporating a semivariogram model. J Hydrol 288:288–298
Lu GY, Wong DW (2008) An adaptive inverse-distance weighting spatial interpolation technique. J Comput Geosci 34:1044–1055. doi:10.1016/j.cageo.2007.07.010
Michalski RS, Stepp RE (1983) Learning from observation: conceptual clustering. In: Michalski RS, Carbonell JG, Mitchell TM (eds) Machine learning: an artificial intelligence approach. Tioga, pp 331–364
Nassar S, Sander J (2007) Effective summarization of multi-dimensional data streams for historical stream mining. In: Proceedings of the 19th international conference on scientific and statistical database management, SSDBM 2007. IEEE Computer Society, p 30
NOAACoastWatch (2013a) NDBC standard meteorological buoy data. http://coastwatch.pfeg.noaa.gov/erddap/tabledap/cwwcNDBCMet.html
NOAACoastWatch (2013b) Wind diffusivity current, MetOp ASCAT, global, near real time (1 day composite). http://coastwatch.pfeg.noaa.gov/erddap/griddap/erdQAekm1day.html
NOAACoastWatch (2013c) Wind stress, MetOp ASCAT, global, near real time (1 day composite). http://coastwatch.pfeg.noaa.gov/erddap/griddap/erdQAstress1day.html
NOAANODC (2009) World ocean atlas 2009, seasonal climatology, 5 degree, temperature, salinity, oxygen. http://coastwatch.pfeg.noaa.gov/erddap/griddap/nodcWoa09sea5t.html
Ohashi O, Torgo L (2012) Spatial interpolation using multiple regression. In: Zaki MJ, Siebes A, Yu JX, Goethals B, Webb GI, Wu X (eds) 12th IEEE international conference on data mining, ICDM 2012. IEEE Computer Society, pp 1044–1049
Orkin M, Drogin R (1990) Vital statistics. McGraw Hill, New York
Pace P, Barry R (1997) Quick computation of regression with a spatially autoregressive dependent variable. Geogr Anal 29(3):232–247
Price M (2012) ArcGIS 10: importing data from Excel spreadsheets. http://www.esri.com/news/arcuser/0312/importing-data-from-excel-spreadsheets.html
Rodrigues PP, Gama J, Lopes LMB (2008) Clustering distributed sensor data streams. In: Proceedings of the European conference on machine learning and knowledge discovery in databases, LNCS 5212. Springer, Berlin, pp 282–297
Sampson PD, Guttorp P (1992) Nonparametric estimation of nonstationary spatial covariance structure. J Am Stat Assoc 87:108–119
Scrucca L (2005) Clustering multivariate spatial data based on local measures of spatial autocorrelation. Tech. Rep. 20, Quaderni del Dipartimento di Economia, Finanza e Statistica, Università di Perugia
Şen Z, Şahin AD (2001) Spatial interpolation and estimation of solar irradiation by cumulative semivariograms. Solar Energy 71(1):11–21. doi:10.1016/S0038-092X(01)00009-3
Shepard D (1968a) A two-dimensional interpolation function for irregularly-spaced data. In: Proceedings of the 1968 23rd ACM national conference, ACM ’68. ACM, New York, NY, USA, pp 517–524. doi:10.1145/800186.810616
Shepard D (1968b) A two-dimensional interpolation function for irregularly-spaced data. In: Proceedings of the 1968 ACM national conference, ACM, pp 517–524
Song YC, Meng HD (2010) The application of cluster analysis in geophysical data interpretation. Comput Geosci 14(2):263–271
Spiliopoulou M, Ntoutsi I, Theodoridis Y, Schult R (2006) MONIC: modeling and monitoring cluster transitions. In: Proceedings of KDD 2006, ACM, pp 706–711
Stein ML (1999) Interpolation of spatial data: some theory for kriging (Springer Series in Statistics), 1st edn. Springer, Berlin
Stojanova D (2009) Estimating forest properties from remotely sensed data by using machine learning. Master’s thesis, Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
Stojanova D, Ceci M, Appice A, Dzeroski S (2012) Network regression with predictive clustering trees. Data Min Knowl Discov 25(2):378–413
Stojanova D, Ceci M, Appice A, Malerba D, Dzeroski S (2013) Dealing with spatial autocorrelation when learning predictive clustering trees. Ecol Inform 13:22–39
Teegavarapu RSV, Meskele T, Pathak CS (2012) Geo-spatial grid-based transformations of precipitation estimates using spatial interpolation methods. Comput Geosci 40:28–39. doi:10.1016/j.cageo.2011.07.004
Tobler W (1979) Cellular geography. Philos Geogr 20:379–386
Umer M, Kulik L, Tanin E (2010) Spatial interpolation in wireless sensor networks: localized algorithms for variogram modeling and Kriging. Geoinformatica 14(1):101–134. doi:10.1007/s10707-009-0078-3
Wang Y, Witten I (1997) Induction of model trees for predicting continuous classes. In: Proceedings of ECML 1997. Springer, Berlin, pp 128–137
Yong J, Xiao-ling Z, Jun S (2007) Unsupervised classification of polarimetric SAR Image by quad-tree segment and SVM. In: 1st Asian and Pacific conference on synthetic aperture radar, 2007 (APSAR 2007), pp 480–483. doi:10.1109/APSAR.2007.4418655
Acknowledgments
We would like to acknowledge the support of the European Commission through the project MAESTRA (Learning from Massive, Incompletely annotated, and Structured Data; Grant number ICT-2013-612944). The authors thank Saso Dzeroski for providing the SIGMEA data and generating the IRS and SPOT data, Enric Melé and Joaquima Messeguer for providing the FOIXA data, the anonymous reviewers for their useful suggestions to improve this paper, and Lynn Rudd for her help in reading the manuscript. The FOIXA and SIGMEA data were collected in the SIGMEA project. The SPOT and IRS data were obtained from EEA Corine Image 2006.
Additional information
Responsible editors: Hendrik Blockeel, Kristian Kersting, Siegfried Nijssen and Filip Železný.
Cite this article
Appice, A., Malerba, D. Leveraging the power of local spatial autocorrelation in geophysical interpolative clustering. Data Min Knowl Disc 28, 1266–1313 (2014). https://doi.org/10.1007/s10618-014-0372-z
DOI: https://doi.org/10.1007/s10618-014-0372-z