Leveraging the power of local spatial autocorrelation in geophysical interpolative clustering

Data Mining and Knowledge Discovery

Abstract

Ubiquitous sensor stations are now deployed worldwide to measure geophysical variables (e.g. temperature, humidity, light) for a growing number of ecological and industrial processes. Although these variables are generally measured over large areas and long (potentially unbounded) periods of time, stations cannot cover every location in space. Moreover, because of its huge volume, the data produced cannot be recorded in full for future analysis. In this scenario, summarization, i.e. the computation of aggregates of data, can reduce the amount of data stored on disk, while interpolation, i.e. the estimation of unknown data at each location of interest, can supplement station records. We present a novel data mining solution, named interpolative clustering, that addresses both tasks in time-evolving, multivariate geophysical applications. It yields a time-evolving clustering model to summarize geophysical data, and it computes a weighted linear combination of cluster prototypes to predict data. Clustering accounts for the local presence of spatial autocorrelation in the geophysical data. The weights of the linear combination reflect the inverse distance of the unseen data to each cluster geometry. The cluster geometry is represented through shape-dependent sampling of the geographic coordinates of the clustered stations. Experiments performed on several data collections investigate the trade-off between the summarization capability and the predictive accuracy of the presented interpolative clustering algorithm.
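
The prediction step described in the abstract, a weighted linear combination of cluster prototypes with weights that reflect the inverse distance to each cluster geometry, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function name `idw_predict`, the `power` exponent, the use of planar Euclidean distance, and the choice of the nearest sampled point as the distance to a cluster geometry are all assumptions made for illustration.

```python
import math


def idw_predict(query, clusters, power=2.0):
    """Inverse-distance-weighted combination of cluster prototypes.

    query    -- (x, y) location where a value is wanted
    clusters -- list of (sample_points, prototype) pairs; sample_points is a
                list of (x, y) coordinates standing in for the cluster
                geometry, prototype is the cluster's aggregate value
    power    -- exponent applied to the distance (hypothetical parameter)
    """
    weights = []
    prototypes = []
    for sample_points, prototype in clusters:
        # Assumption: distance to a cluster geometry is the distance to its
        # nearest sampled point, measured in the plane.
        d = min(math.dist(query, p) for p in sample_points)
        if d == 0.0:
            # The query coincides with a sampled point: return that
            # cluster's prototype and avoid dividing by zero.
            return prototype
        weights.append(1.0 / d ** power)
        prototypes.append(prototype)
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, prototypes)) / total


# Two toy clusters with prototypes 10 and 20.
toy_clusters = [([(0.0, 0.0)], 10.0), ([(2.0, 0.0)], 20.0)]
midpoint_value = idw_predict((1.0, 0.0), toy_clusters)
```

Because the weights are normalized by their sum, the prediction always lies between the smallest and the largest prototype, and a query equidistant from both toy clusters receives their average.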

Notes

  1. The predictive clustering framework was originally defined in Blockeel et al. (1998) to combine clustering problems with classification and regression problems. Predictive inference distinguishes between target variables and explanatory variables. Target variables are used to evaluate the similarity between training data, so that training examples with similar target values are grouped in the same cluster, while training examples with dissimilar target values are placed in separate clusters. Explanatory variables are used to generate a symbolic description of the clusters. Although the algorithm presented in Blockeel et al. (1998) can, in principle, be run with the same set of variables in both the explanatory and target roles, this case is not investigated in the original study.

  2. Inverse distance weighting is a common interpolation algorithm. Several advantages account for its widespread use in geostatistics (Li and Revesz 2002; Karydas et al. 2009; Li et al. 2011): it is simple to implement, it lacks tunable parameters, and it can interpolate scattered data and work on any grid without suffering from multicollinearity.

  3. We can extend this representation of a sensor network by considering a multi-dimensional representation of space. In the multi-dimensional case, multiple variables will be used to identify the location of a station. These multiple variables will be taken into account when computing the distance between sensors.

  4. In the on-line learning phase, missing observations of a variable are interpolated in the data snapshot by using the inverse distance weighted sum of nearby known data in the row.

  5. The spherical law of cosines is used to approximate the geographical distance between two sensors from their geographic coordinates (e.g. latitude and longitude).

  6. The time cost of computing the local indicators of the spatial autocorrelation property can be made subquadratic by using a spatial data structure that maintains, for each sensor in the network, the sphere of its neighbours. The structure is updated only when a sensor is switched on or switched off in the network.

  7. The quadtree decomposition of a cluster recursively divides a cluster quadrant into four subquadrants until the final quadrants are determined. Since we plan to compute \(Np^{\%}\) final quadrants, this quadtree decomposition has about \(\log _4(Np^{\%})\) levels.

  8. http://www.di.uniba.it/~appice/software/ICT_TICT/index.htm.

  9. We compute the RRMSE to scale the error on a target variable by the domain size of that variable.

  10. We note that the local indicators, which are computed from pairwise comparisons between neighbouring stations, can be precomputed before building the tree. Thus, only the variance reduction of the local indicators is evaluated at each node. On the other hand, MoranVar computes the global indicator of spatial autocorrelation at each node. This requires pairwise comparisons between the neighbouring stations that fall in the present node. As the number of neighbours may change throughout the tree, the global measure has to be recomputed at each node.
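
Notes 5, 6 and 10 can be read together as a small pipeline: compute station distances with the spherical law of cosines, collect the neighbours of each station within a radius, and precompute one local autocorrelation indicator per station. The Python sketch below illustrates this under stated assumptions and is not the authors' code. The local indicator shown is local Moran's I (Anselin 1995) with row-standardised binary weights, which may differ from the indicator used in the paper. The Earth radius constant, the function names and the brute-force quadratic neighbour search are likewise illustrative choices; note 6 suggests a spatial data structure in place of the brute-force search.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumed constant


def great_circle_km(a, b):
    """Distance between two (lat, lon) points given in degrees,
    computed with the spherical law of cosines."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    cos_angle = (math.sin(lat1) * math.sin(lat2)
                 + math.cos(lat1) * math.cos(lat2) * math.cos(lon2 - lon1))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    return EARTH_RADIUS_KM * math.acos(max(-1.0, min(1.0, cos_angle)))


def neighbours(coords, radius_km):
    """For each station, the indices of the other stations within radius_km.
    Brute force, quadratic in the number of stations."""
    n = len(coords)
    return [[j for j in range(n)
             if j != i and great_circle_km(coords[i], coords[j]) <= radius_km]
            for i in range(n)]


def local_moran(values, neigh):
    """Local Moran's I for each station, using the mean deviation of its
    neighbours as the spatial lag (row-standardised binary weights)."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]          # deviations from the mean
    m2 = sum(d * d for d in z) / n          # second moment of the deviations
    if m2 == 0.0:
        return [0.0] * n                    # constant variable: no signal
    indicators = []
    for i in range(n):
        if not neigh[i]:
            indicators.append(0.0)          # isolated station
            continue
        lag = sum(z[j] for j in neigh[i]) / len(neigh[i])
        indicators.append(z[i] * lag / m2)
    return indicators


# Four toy stations on the equator; the last one is isolated.
stations = [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0), (0.0, 50.0)]
station_neighbours = neighbours(stations, radius_km=150.0)
indicators = local_moran([1.0, 1.0, 3.0, 3.0], station_neighbours)
```

The neighbour lists depend only on station coordinates, so they can be computed once and reused for every variable and every tree node, which is the property note 10 relies on when it says the local indicators can be precomputed.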

References

  • Aggarwal CC, Han J, Wang J, Yu PS (2003) A framework for clustering evolving data streams. In: Proceedings of 29th international conference on very large data bases (VLDB 2003), pp 81–92

  • Aggarwal CC, Han J, Wang J, Yu PS (2007) On clustering massive data streams: a summarization paradigm. In: Advances in database systems: data streams models and algorithms (book chapter), vol 31. Springer-US, pp 9–38

  • Aho T, Zenko B, Dzeroski S, Elomaa T (2012) Multi-target regression with rule ensembles. J Mach Learn Res 13:2367–2407


  • Angin P, Neville J (2008) A shrinkage approach for modeling non-stationary relational autocorrelation. In: Proceedings of the 8th IEEE international conference on data mining, IEEE Computer Society, pp 707–712

  • Anselin L (1995) Local indicators of spatial association: LISA. Geogr Anal 27(2):93–115


  • Appice A, Ceci M, Malerba D, Lanza A (2012) Learning and transferring geographically weighted regression trees across time. In: Proceedings of MSM/MUSE 2012, LNCS, vol 7472. Springer, Berlin, pp 97–117

  • Appice A, Ciampi A, Malerba D (2013a) Summarizing numeric spatial data streams by trend cluster discovery. Data Mining Knowl Discov. doi:10.1007/s10618-013-0337-7

  • Appice A, Ciampi A, Malerba D, Guccione P (2013b) Using trend clusters for spatiotemporal interpolation of missing data in a sensor network. J Spatial Inf Sci 6(1):119–153


  • Appice A, Pravilovic S, Malerba D, Lanza A (2013c) Enhancing regression models with spatio-temporal indicator additions. In: Baldoni M, Baroglio C, Boella G, Micalizio R (eds) Proceedings of AI*IA 2013: Advances in Artificial Intelligence—XIIIth international conference of the Italian Association for Artificial Intelligence, Lecture Notes in Computer Science, vol 8249. Springer, Berlin, pp 433–444

  • Bailey T, Krzanowski W (2012) An overview of approaches to the analysis and modelling of multivariate geostatistical data. Math Geosci 44(4):381–393. doi:10.1007/s11004-011-9360-7


  • Blanchet FG, Legendre P, Borcard D (2008) Modelling directional spatial processes in ecological data. Ecol Model 215(4):325–336. doi:10.1016/j.ecolmodel.2008.04.001. http://www.sciencedirect.com/science/article/pii/S0304380008001798

  • Blockeel H, De Raedt L, Ramon J (1998) Top–down induction of clustering trees. In: Proceedings of ICML. Morgan Kaufmann, pp 55–63

  • Boots B (2002) Local measures of spatial association. Ecoscience 9(2):168–176


  • Burrough P, McDonnell R (1998) Principles of geographical information systems. Oxford University Press, Oxford


  • Chen Z, Yang S, Li L, Xie Z (2010) A clustering approximation mechanism based on data spatial correlation in wireless sensor networks. In: Proceedings of the 9th conference on wireless telecommunications symposium, WTS 2010. IEEE Press, pp 208–214

  • Chiky R, Hébrail G (2008) Summarizing distributed data streams for storage in data warehouses. In: Proceedings of the 10th international conference on data warehousing and knowledge discovery (DaWaK 2008), LNCS, vol 5182. Springer, Berlin, pp 65–74

  • Cressie N (1990) The origins of kriging. Math Geol 22(3):239–252. doi:10.1007/BF00889887


  • Cressie N (1993) Statistics for spatial data. Wiley, New York. doi:10.1111/j.1365-3121.1992.tb00605.x

  • Debeljak M, Trajanov A, Stojanova D, Leprince F, Džeroski S (2012) Using relational decision trees to model out-crossing rates in a multi-field setting. Ecol Model 245:75–83

  • Demšar D, Debeljak M, Lavigne C, Džeroski S (2005) Modelling pollen dispersal of genetically modified oilseed rape within the field. In: Abstracts of the 90th ESA annual meeting, The Ecological Society of America, p 152

  • Dray S, Jombart T (2011) Revisiting Guerry’s data: introducing spatial constraints in multivariate analysis. Ann Appl Stat 5(4):2278–2299


  • Dray S, Legendre P, Peres-Neto PR (2006) Spatial modelling: a comprehensive framework for principal coordinate analysis of neighbour matrices (PCNM). Ecol Model 196(3–4):483–493. doi:10.1016/j.ecolmodel.2006.02.015. http://www.sciencedirect.com/science/article/pii/S0304380006000925

  • European Environment Agency (2006) Corine land cover 2006. http://sia.eionet.europa.eu/CLC2006

  • Gama J (2010) Knowledge discovery from data streams, 1st edn. Chapman & Hall/CRC, Boca Raton


  • Getis A (2008) A history of the concept of spatial autocorrelation: a geographer’s perspective. Geogr Anal 40(3):297–309


  • Getis A, Ord JK (1992) The analysis of spatial association by use of distance statistics. Geogr Anal 24(3):189–206


  • Goodchild M (1986) Spatial autocorrelation. Geo Books

  • Goovaerts P (1997) Geostatistics for natural resources evaluation. Oxford University Press, Oxford


  • Gora G, Wojna A (2002) RIONA: a classifier combining rule induction and k-NN method with automated selection of optimal neighbourhood. In: Proceedings of ECML 2002. Springer, Berlin, pp 111–123

  • Holden ZA, Evans JS (2010) Using fuzzy c-means and local autocorrelation to cluster satellite-inferred burn severity classes. Int J Wildland Fire 19(7):853–860


  • Ikonomovska E, Gama J, Dzeroski S (2011) Incremental multi-target model trees for data streams. In: Chu WC, Wong WE, Palakal MJ, Hung CC (eds) Proceedings of the 2011 ACM symposium on applied computing (SAC). ACM, pp 988–993

  • Ingelrest F, Barrenetxea G, Schaefer G, Vetterli M, Couach O, Parlange M (2010) SensorScope: application-specific sensor network for environmental monitoring. ACM Trans Sens Netw 6(2):17:1–17:32


  • Isaaks EH, Srivastava RM (1989) An introduction to applied geostatistics. Oxford University Press, Oxford


  • Karydas C, Gitas I, Koutsogiannaki E, Lydakis-Simantiris N, Silleos G (2009) Evaluation of spatial interpolation techniques for mapping agricultural topsoil properties in Crete. In: Proceedings of EARSeL 2009, vol 8, pp 26–39

  • Kelley P, Barry R (1999) Sparse spatial autoregressions. Stat Probab Lett 33:291–297


  • Kim B, Tsiotras P (2009) Image segmentation on cell-center sampled quadtree and octree grids. pp 72480L–72480L-9. doi:10.1117/12.810965

  • Kistler R, Kalnay E, Collins W, Saha S, White G, Woollen J, Chelliah M, Ebisuzaki W, Kanamitsu M, Kousky V, van den Dool H, Jenne R, Fiorino M (2001) The NCEP/NCAR 50-year reanalysis. Bull Am Meteorol Soc 82(2):247–267


  • Krige DG (1951) A statistical approach to some mine valuation and allied problems on the Witwatersrand. Master’s thesis

  • Lam N (1983) Spatial interpolation methods: a review. Am Cartogr 10:129–149. doi:10.1559/152304083783914958


  • Legendre P (1993) Spatial autocorrelation: trouble or new paradigm? Ecology 74:1659–1673


  • LeSage JH, Pace K (2001) Spatial dependence in data mining. In: Data mining for scientific and engineering applications. Kluwer, Dordrecht, pp 439–460

  • Li J, Heap A (2008) A review of spatial interpolation methods for environmental scientists. Geoscience Australia, Record 2008/23

  • Li L, Revesz P (2002) A comparison of spatio-temporal interpolation methods. GIScience, LNCS 2478. Springer, Berlin, pp 145–160

  • Li L, Zhang X, Holt J, Tian J, Piltner R (2011) Spatiotemporal interpolation methods for air pollution exposure. In: Proceedings of SARA 2011, AAAI

  • Lin G, Chen L (2004) A spatial interpolation method based on radial basis function networks incorporating a semivariogram model. J Hydrol 288:288–298


  • Lu GY, Wong DW (2008) An adaptive inverse-distance weighting spatial interpolation technique. J Comput Geosci 34:1044–1055. doi:10.1016/j.cageo.2007.07.010


  • Michalski RS, Stepp RE (1983) Learning from observation: conceptual clustering. In: Michalski RS, Carbonell JG, Mitchell TM (eds) Machine learning: an artificial intelligence approach. Tioga, pp 331–364

  • Nassar S, Sander J (2007) Effective summarization of multi-dimensional data streams for historical stream mining. In: Proceedings of the 19th international conference on scientific and statistical database management, SSDBM 2007. IEEE Computer Society, p 30

  • NOAACoastWatch (2013a) NDBC standard meteorological buoy data. http://coastwatch.pfeg.noaa.gov/erddap/tabledap/cwwcNDBCMet.html

  • NOAACoastWatch (2013b) Wind diffusivity current, metop ascat, global, near real time (1 day composite). http://coastwatch.pfeg.noaa.gov/erddap/griddap/erdQAekm1day.html

  • NOAACoastWatch (2013c) Wind stress, metop ascat, global, near real time (1 day composite). http://coastwatch.pfeg.noaa.gov/erddap/griddap/erdQAstress1day.html

  • NOAANODC (2009) World ocean atlas 2009, seasonal climatology, 5 degree, temperature, salinity, oxygen. http://coastwatch.pfeg.noaa.gov/erddap/griddap/nodcWoa09sea5t.html

  • Ohashi O, Torgo L (2012) Spatial interpolation using multiple regression. In: Zaki MJ, Siebes A, Yu JX, Goethals B, Webb GI, Wu X (eds) 12th IEEE international conference on data mining, ICDM 2012. IEEE Computer Society, pp 1044–1049

  • Orkin M, Drogin R (1990) Vital statistics. McGraw Hill, New York


  • Pace P, Barry R (1997) Quick computation of regression with a spatially autoregressive dependent variable. Geogr Anal 29(3):232–247


  • Price M (2012) ArcGIS 10: importing data from Excel spreadsheets. http://www.esri.com/news/arcuser/0312/importing-data-from-excel-spreadsheets.html

  • Rodrigues PP, Gama J, Lopes LMB (2008) Clustering distributed sensor data streams. In: Proceedings of the European conference on machine learning and knowledge discovery in databases, LNCS 5212. Springer, Berlin, pp 282–297

  • Sampson PD, Guttorp P (1992) Nonparametric estimation of nonstationary spatial covariance structure. J Am Stat Assoc 87:108–119


  • Scrucca L (2005) Clustering multivariate spatial data based on local measures of spatial autocorrelation. Tech. Rep. 20, Quaderni del Dipartimento di Economia, Finanza e Statistica, Università di Perugia

  • Şen Z, Şahin AD (2001) Spatial interpolation and estimation of solar irradiation by cumulative semivariograms. Solar Energy 71(1):11–21. doi:10.1016/S0038-092X(01)00009-3


  • Shepard D (1968a) A two-dimensional interpolation function for irregularly-spaced data. In: Proceedings of the 1968 23rd ACM national conference, ACM ’68. ACM, New York, NY, USA, pp 517–524. doi:10.1145/800186.810616

  • Shepard D (1968b) A two-dimensional interpolation function for irregularly-spaced data. In: Proceedings of the 1968 ACM national conference, ACM, pp 517–524

  • Song YC, Meng HD (2010) The application of cluster analysis in geophysical data interpretation. Comput Geosci 14(2):263–271


  • Spiliopoulou M, Ntoutsi I, Theodoridis Y, Schult R (2006) MONIC: modeling and monitoring cluster transitions. In: Proceedings of KDD 2006, ACM, pp 706–711

  • Stein ML (1999) Interpolation of spatial data: some theory for kriging (Springer series in statistics), 1st edn. Springer, Berlin

  • Stojanova D (2009) Estimating forest properties from remotely sensed data by using machine learning. Master’s thesis, Jožef Stefan International Postgraduate School, Ljubljana, Slovenia

  • Stojanova D, Ceci M, Appice A, Dzeroski S (2012) Network regression with predictive clustering trees. Data Min Knowl Discov 25(2):378–413


  • Stojanova D, Ceci M, Appice A, Malerba D, Dzeroski S (2013) Dealing with spatial autocorrelation when learning predictive clustering trees. Ecol Inform 13:22–39

  • Teegavarapu RSV, Meskele T, Pathak CS (2012) Geo-spatial grid-based transformations of precipitation estimates using spatial interpolation methods. Comput Geosci 40:28–39. doi:10.1016/j.cageo.2011.07.004


  • Tobler W (1979) Cellular geography. Philos Geogr 20:379–386

  • Umer M, Kulik L, Tanin E (2010) Spatial interpolation in wireless sensor networks: localized algorithms for variogram modeling and Kriging. Geoinformatica 14(1):101–134. doi:10.1007/s10707-009-0078-3


  • Wang Y, Witten I (1997) Induction of model trees for predicting continuous classes. In: Proceedings of ECML 1997. Springer, Berlin, pp 128–137

  • Yong J, Xiao-ling Z, Jun S (2007) Unsupervised classification of polarimetric SAR Image by quad-tree segment and SVM. In: 1st Asian and Pacific conference on synthetic aperture radar, 2007 (APSAR 2007), pp 480–483. doi:10.1109/APSAR.2007.4418655

Acknowledgments

We would like to acknowledge the support of the European Commission through the project MAESTRA: Learning from Massive, Incompletely annotated, and Structured Data (Grant number ICT-2013-612944). The authors thank Saso Dzeroski for providing SIGMEA data and generating IRS and SPOT data, Enric Melé and Joaquima Messeguer for providing FOIXA data, the anonymous reviewers for their useful suggestions to improve this paper, and Lynn Rudd for her help in reading the manuscript. FOIXA and SIGMEA data were collected in the SIGMEA project. SPOT and IRS data were obtained from EEA Corine Image 2006.

Author information

Corresponding author

Correspondence to Annalisa Appice.

Additional information

Responsible editors: Hendrik Blockeel, Kristian Kersting, Siegfried Nijssen and Filip Železný.

About this article

Cite this article

Appice, A., Malerba, D. Leveraging the power of local spatial autocorrelation in geophysical interpolative clustering. Data Min Knowl Disc 28, 1266–1313 (2014). https://doi.org/10.1007/s10618-014-0372-z

  • DOI: https://doi.org/10.1007/s10618-014-0372-z
