Summarizing numeric spatial data streams by trend cluster discovery

Data Mining and Knowledge Discovery

Abstract

Advances in pervasive computing and sensor technologies have made geo-physical data streams ubiquitous. Managing the massive, unbounded streams of sensor data poses several challenges, including the real-time application of summarization techniques that allow this georeferenced and timestamped data to be stored and queried on a server with limited memory. To address this issue, we have designed a summarization technique, called SUMATRA, which segments the stream into windows, computes summaries window by window and stores these summaries in a database. Trend clusters are discovered as summaries of each window: they are clusters of georeferenced data which vary according to a similar trend along the window time horizon. Several compression techniques are also investigated to derive a compact but accurate representation of these trends for storage in the database, and a learning strategy is designed to automatically choose the best trend compression technique. Finally, an in-network modality for tree-based trend cluster discovery is investigated in order to achieve an efficient aggregation schema which drastically reduces the number of bytes transmitted across the network and extends the network lifespan. This schema is mapped onto the routing structure of a tree-based WSN topology. Experiments performed with several data streams from real sensor networks assess the summarization capability, the accuracy and the efficiency of the proposed summarization schema.

Notes

  1. The discretization is entrusted to the sensors; this choice can be seen as a way to decentralize a small part of the computation. In any case, most of the computational effort (clustering) remains centralized on the server.

  2. Missing values are stored in \(H_i\) for sensors which transmit at one or more snapshots of the window, but not at all of them.

  3. \(w\gg 1\) is plausible in the count-based window model of a stream.

  4. \(V_{h}\) and \(V_{w-h}\) are complex conjugates (Proakis and Manolakis 1996).

  5. This identity expresses in some way the law of conservation of energy.

  6. It is noteworthy that a sensing device which measures a series of data items can also decide which data (or aggregates of data) have to be sent to the sink.

  7. This way of computing the median is used to take into account the fact that each trend prototype value \(v_{j_t}\) at time \(t\) aggregates data items coming from \(\sharp C_j\) sensor devices.

  8. http://www.di.uniba.it/~kdde/index.php/SUMATRA

  9. http://db.csail.mit.edu/labdata/labdata.html

  10. http://climate.geog.udel.edu/~climate/html_pages/archive.html

  11. ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/v2/

  12. The \(rmse\) is commonly used to evaluate the accuracy of predictive models in statistics. However, it has the disadvantage of heavily weighting outliers. This property, undesirable in noisy streams, motivates the analysis of the \(mae\) as an alternative error measure.
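Notes 4, 5 and 12 can be checked numerically. The sketch below is only an illustration: it assumes an unnormalized discrete Fourier transform (the normalization used in the paper is not shown in these notes), takes the identity in note 5 to be Parseval's energy identity, and uses made-up readings. The function names `dft`, `rmse` and `mae` are chosen here for illustration.

```python
import cmath
import math


def dft(x):
    """Unnormalized discrete Fourier transform of a real sequence x."""
    w = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * h * t / w) for t in range(w))
        for h in range(w)
    ]


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def mae(actual, predicted):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up window of w = 6 real-valued readings.
signal = [20.0, 20.5, 21.0, 20.0, 19.5, 19.0]
w = len(signal)
V = dft(signal)

# Note 4: for real input, V[h] and V[w - h] are complex conjugates.
conjugate_gap = max(abs(V[h] - V[w - h].conjugate()) for h in range(1, w))

# Note 5 (assumed to be Parseval's identity): energy is the same in both domains.
energy_time = sum(v * v for v in signal)
energy_freq = sum(abs(c) ** 2 for c in V) / w

# Note 12: same mae, very different rmse once a single outlier appears.
actual = [20.0] * 10
uniform = [20.5] * 10          # ten errors of 0.5
spiky = [20.0] * 9 + [25.0]    # one error of 5.0, nine exact values

print(conjugate_gap, energy_time, energy_freq)
print(mae(actual, uniform), rmse(actual, uniform))
print(mae(actual, spiky), rmse(actual, spiky))
```

Both reconstructions have the same mean absolute error, but the one with a single large deviation has a root mean squared error about three times higher, which is the outlier sensitivity that note 12 refers to.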

References

  • Acharya S, Gibbons PB, Poosala V (2000) Congressional samples for approximate answering of group-by queries. In: Proceedings of the international conference on management of data, SIGMOD 2000. ACM, New York, pp 487–498

  • Aggarwal CC, Han J, Wang J, Yu PS (2007) On clustering massive data streams: a summarization paradigm. In: Advances in database systems: data streams models and algorithms, vol 31. Springer, Heidelberg, pp 9–38

  • Ai C, Du R, Zhang M, Li Y (2009) In-network historical data storage and query processing based on distributed indexing techniques in wireless sensor networks. In: Proceedings of the 4th international conference on wireless algorithms systems, and applications, WASA 2009. Springer, Berlin, pp 264–273

  • Al Wadi S, Ismail MT, Karim SAA (2010) A comparison between Haar wavelet transform and fast Fourier transform in analyzing financial time series data. Res J Appl Sci 5(5):352–360

  • Alon N, Matias Y, Szegedy M (1996) The space complexity of approximating the frequency moments. In: Proceedings of the 28th Annual ACM symposium on theory of computing, STOC 1996. ACM, New York, pp 20–29

  • Armenakis C (1992) Estimation and organization of spatio-temporal data. In: Proceedings of the Canadian conference on GIS92, pp 900–911

  • Browdy MH (1990) Simulated annealing: an improved computer model for political redistricting. Yale Law Policy Rev 8(1):163–179

  • Buratti C, Conti A, Dardari D, Verdone R (2009) An overview on wireless sensor networks technology and evolution. Sensors 9:6869–6896

  • Chen Z, Yang S, Li L, Xie Z (2010) A clustering approximation mechanism based on data spatial correlation in wireless sensor networks. In: Proceedings of the 9th conference on wireless telecommunications symposium, WTS 2010. IEEE Press, Piscataway, pp 208–214

  • Chiky R, Hébrail G (2008) Summarizing distributed data streams for storage in data warehouses. In: Proceedings of the 10th international conference on data warehousing and knowledge discovery, DaWaK 2008. Lecture notes in computer science, vol 5182. Springer, Berlin, pp 65–74

  • Chou Y (1975) Statistical analysis, 2nd edn. Holt, Rinehart & Winston of Canada Ltd, New York

  • Ciampi A, Appice A, Malerba D (2010) Summarization for geographically distributed data streams. In: Proceedings of the 14th international conference on knowledge-based and intelligent information and engineering systems, KES 2010. Lecture notes in computer science, vol 6278. Springer, Berlin, pp 339–348

  • Ciampi A, Appice A, Malerba D (2010) Online and offline trend cluster discovery in spatially distributed data streams. In: Atzmüller M, Hotho A, Strohmaier M, Chin A (eds) International workshops on analysis of social media and ubiquitous data, MSM 2010 and MUSE 2010, Revised selected Papers. Lecture Notes in Computer Science, vol 6904. Springer, Berlin, pp 142–161

  • Ciampi A, Appice A, Malerba D, Guccione P (2011) Trend cluster based compression of geographically distributed data streams. In: Proceedings of the IEEE symposium on computational intelligence and data mining, CIDM 2011, part of the IEEE symposium series on computational intelligence 2011, pp 168–175

  • Draper NR, Smith H (1982) Applied regression analysis. Wiley, New York

  • Duque J, Ramos R, Surinach J (2007) Supervised regionalization methods: a survey. Int Reg Sci Rev 30:195–220

  • Furfaro F, Mazzeo GM, Saccà D, Sirangelo C (2008) Compressed hierarchical binary histograms for summarizing multi-dimensional data. Knowl Inf Syst 15(3):335–380

  • Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. ACM SIGMOD Rec 34(2):18–26

  • Ganesan D, Greenstein B, Estrin D, Heidemann JS, Govindan R (2005) Multiresolution storage and search in sensor networks. ACM TOS 1(3):277–315

  • Garofalakis M, Kumar A (2004) Deterministic wavelet thresholding for maximum-error metrics. In: Proceedings of the 23rd symposium on principles of database systems, PODS 2004. ACM, New York, pp 166–176

  • Gilbert AC, Guha S, Indyk P, Kotidis Y, Muthukrishnan S, Strauss MJ (2002) Fast, small-space algorithms for approximate histogram maintenance. In: Proceedings of the 24th annual ACM symposium on theory of computing, STOC 2002. ACM, New York, pp 389–398

  • Gordon AD (1996) A survey of constrained classification. Comput Stat Data Anal 21(1):17–29

  • Greenwald M, Khanna S (2001) Space-efficient online computation of quantile summaries. ACM SIGMOD Rec 30(2):58–66

  • Guo D (2008) Regionalization with dynamically constrained agglomerative clustering and partitioning (redcap). Int J Geogr Inf Sci 22(7):801–823

  • Hershberger J, Shrivastava N, Suri S, Toth CD (2006) Adaptive spatial partitioning for multidimensional data streams. Algorithmica 46(1):97–117

  • Hutson J (1983) TRIX: triple exponential smoothing oscillator? Technical Analysis of Stocks and Commodities

  • Ioannidis YE, Poosala V (1995) Balancing histogram optimality and practicality for query result size estimation. In: Proceedings of the international conference on management of data, SIGMOD 1995. ACM, New York, pp 233–244

  • Jagadish HV, Koudas N, Muthukrishnan S, Poosala V, Sevcik KC, Suel T (1998) Optimal histograms with quality guarantees. In: Proceedings of the 24th international conference on very large data bases, VLDB 1998. Morgan Kaufmann, San Francisco, pp 275–286

  • Jurcík P, Severino R, Koubaa A, Alves M, Tovar E (2008) Real-time communications over cluster-tree sensor networks with mobile sink behavior. In: Proceedings of the 14th IEEE international conference on embedded and real-time computing systems and applications, RTCSA 2008. IEEE Computer Society, pp 401–412

  • Kittler J (1976) A local sensitive method for clustering analysis. Pattern Recognition, pp 22–33

  • Kontaki M, Papadopoulos AN, Manolopoulos Y (2008) Continuous trend-based clustering in data streams. In: Proceedings of the 10th international conference on data warehousing and knowledge discovery, DaWaK 2008. Lecture notes in computer science, vol 5182. Springer, Berlin, pp 251–262

  • Legendre P (1987) Constrained clustering. In: Legendre P, Legendre L (eds) Developments in numerical ecology, Springer, Berlin, pp 289–307

  • Legendre P (1993) Spatial autocorrelation: trouble or new paradigm? Ecology 74:1659–1673

  • LeSage J, Pace K (2001) Spatial dependence in data mining. In: Data mining for scientific and engineering applications. Kluwer, Boston, pp 439–460

  • Lin J, Keogh EJ, Wei L, Lonardi S (2007) Experiencing sax: a novel symbolic representation of time series. Data Min Knowl Discov 15(2):107–144

  • Ma X, Li S, Luo Q, Yang D, Tang S (2007) Distributed, hierarchical clustering and summarization in sensor networks. In: Proceedings of the Joint 9th Asia-Pacific Web and 8th international conference on web-age information management and advances in data and web management, APWeb/WAIM 2007, Springer, Berlin, pp 168–175

  • Madden S, Franklin MJ, Hellerstein JM, Hong W (2002) Tag: a tiny aggregation service for ad-hoc sensor networks. In: Culler DE, Druschel P (eds) Proceedings of the 5th symposium on operating system design and implementation, OSDI 2002. USENIX Association

  • Malerba D, Appice A, Varlaro A, Lanza A (2005) Spatial clustering of structured objects. In: Proceedings of the 15th international conference of inductive logic programming, ILP 2005. Lecture notes in computer science, vol 3625. Springer, Berlin, pp 227–245

  • Mallat S (1998) A wavelet tour of signal processing. Academic Press, London

  • Matias Y, Vitter JS, Wang M (2000) Dynamic maintenance of wavelet-based histograms. In: Proceedings of the 26th international conference on very large data bases, VLDB 2000. Morgan Kaufmann, San Francisco, pp 101–110

  • Motwani R, Raghavan P (1995) Randomized algorithms. Cambridge University Press, Cambridge

  • Murtagh F (1985) A survey of algorithms for contiguity-constrained clustering and related problems. Comput J 28(1):82–88

  • Nassar S, Sander J (2007) Effective summarization of multi-dimensional data streams for historical stream mining. In: Proceedings of the 19th international conference on scientific and statistical database management, SSDBM 2007. IEEE Computer Society, p 30

  • Perruchet C (1983) Constrained agglomerative hierarchical classification. Pattern Recognition, pp 213–217

  • Proakis JG, Manolakis DG (1996) Digital signal processing: principles, algorithms, and applications. Prentice-Hall, Upper Saddle River

  • Recchia A (2010) Contiguity-constrained hierarchical agglomerative clustering using sas. J Stat Softw 33

  • Rodrigues PP, Gama J, Lopes LMB (2008) Clustering distributed sensor data streams. In: Proceedings of the European conference on machine learning and knowledge discovery in databases. Lecture notes in computer science, vol 5212. Springer, Berlin, pp 282–297

  • Rusu F, Dobra A (2009) Sketching sampled data streams. In: Proceedings of the 25th international conference on data engineering, ICDE 2009. IEEE Computer Society, pp 381–392

  • Sanjay C, Shashi S, Wu W (2001) Modeling spatial dependencies for mining geospatial data: an introduction. In: Geographic data mining and knowledge discovery. Taylor and Francis, London, pp 131–159

  • Shekhar S, Chawla S (2003) Spatial databases: a tour. Prentice Hall, Upper Saddle River

  • Su W, Akan O, Cayirci E (2004) Communication protocols for sensor networks. In: Raghavendra CS, Sivalingam KM, Znati T (eds) Wireless sensor networks. Springer, Berlin, pp 21–50

  • Thaper N, Guha S, Indyk P, Koudas N (2002) Dynamic multidimensional histograms. In: Proceedings of the international conference on management of data, SIGMOD 2002. ACM, New York, pp 428–439

  • Thiesson B, Kim J (2012) Fast variational mode-seeking. In: Proceedings of the 15th international conference on artificial intelligence and statistics, AISTATS 2012

  • Tobler W (1979) Cellular geography. Philosophy in geography. Kluwer, Dordrecht, pp 379–386

  • Valkanas G, Kotsifakos A, Gunopulos D, Galpin I, Gray AJG, Fernandes AAA, Paton NW (2011) Deploying in-network data analysis techniques in sensor networks. In: Zaslavsky AB, Chrysanthis PK, Lee DL, Chakraborty D, Kalogeraki V, Mokbel MF, Chow CY (eds) Proceedings of the 12th IEEE international conference on mobile data management, MDM 2011. pp 341–344

  • Watfa M, Daher W, Azar HA (2009) A sensor network data aggregation technique. Int J Comput Theor Eng 1(1):1793–82013

  • Wise SM, Haining RP, Ma J (1997) Regionalization tools for the exploratory spatial analysis of health data. In: Fischer M, Hewings G, Nagurney A, Nijkamp F, Snickars P (eds) Recent developments in spatial analysis: spatial statistics, behavioural modelling and neuro-computing, The regional science series. Springer, Berlin, pp 83–100

  • Yoon S, Shahabi C (2005) Exploiting spatial correlation towards an energy efficient clustered aggregation technique (cag). In: Proceedings of the IEEE international conference on communications

  • Yoon S, Shahabi C (2007) The clustered aggregation (cag) technique leveraging spatial and temporal correlations in wireless sensor networks. ACM Trans Sens Netw 3(1)

  • Zhu Y, Shasha D (2002) Statstream: statistical monitoring of thousands of data streams in real time. In: Proceedings of the 28th international conference on very large data bases, VLDB 2002. VLDB Endowment, pp 358–369

  • Zordan D, Martínez B, Vilajosana I, Rossi M (2012) To compress or not to compress: processing vs transmission tradeoffs for energy constrained sensor networking. CoRR abs/1206.2129

Acknowledgments

This work fulfills the research objectives of the PRIN 2009 project “Learning Techniques in Relational Domains and their Applications”, funded by the Italian Ministry of University and Research (MIUR). The authors thank the anonymous reviewers for their useful suggestions to improve this paper, Pietro Guccione (Politecnico di Bari) for comments and valuable discussions on signal processing techniques and Lynn Rudd for her help in reading the manuscript.

Author information

Corresponding author

Correspondence to Annalisa Appice.

Additional information

Responsible editor: M.J. Zaki.

Appendices

Appendix 1 [Proposition 1]

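The size of the window \(W_i\) is \(\sigma _I(n_i+1)+\sigma _F n_i w\) bytes, where \(n_i\) is the number of nodes which are active along the window and \(w\) is the number of measurements stored per node.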

Proof

This proposition is proved by considering that to store \(W_i\), it is essential to store the (integer) enumerative window code \(i\), the (integer) identifiers of nodes which are active along the window and, for each node, the series of \(w\) (float) measurements. Then,

$$\begin{aligned} \begin{array}{ccl} size(W_i)&{}=&{}size(i)+\displaystyle \sum _{p\in N_i}{size(p)}+ \displaystyle \sum _{p\in N_i}\displaystyle \sum _{t=1}^w{size(H_i[p][t])}\\ &{}=&{}{\sigma _I}+{\sigma _I n_i} + \sigma _F n_i w= \sigma _I(n_i+1) + \sigma _F n_i w \hbox { (bytes).} \end{array} \end{aligned}$$
(31)
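The byte count in Eq. 31 can be reproduced with a few lines of code. The sketch below is illustrative only: the function names are hypothetical, and the default sizes of 4 bytes for both integers (\(\sigma _I\)) and floats (\(\sigma _F\)) are an assumption, not values taken from the paper.

```python
def window_size(n_nodes, w, sigma_i=4, sigma_f=4):
    """Closed form of Proposition 1: bytes needed to store a window W_i.

    sigma_i and sigma_f are the assumed byte sizes of an integer and a float.
    """
    return sigma_i * (n_nodes + 1) + sigma_f * n_nodes * w


def window_size_itemized(node_ids, w, sigma_i=4, sigma_f=4):
    """Same quantity, accumulated item by item as in the proof."""
    total = sigma_i                 # the window code i
    for _ in node_ids:
        total += sigma_i            # the identifier of one active node
        total += sigma_f * w        # its w float measurements
    return total


# Hypothetical window: 50 active nodes, 12 snapshots each.
closed = window_size(50, 12)
itemized = window_size_itemized(range(50), 12)
print(closed, itemized)
```

With these assumed sizes, both computations give 2604 bytes for 50 nodes and 12 snapshots.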

Appendix 2 [Proposition 4]

Let \(P_i\) be the trend cluster set computed from \(W_i\), then the size of \(P_i\) is \(\sigma _{I}(n_i+p_i+1)+\sigma _F p_iw\) bytes, where \(size\)(integer) = \(\sigma _{I}\) and \(size\)(float) = \(\sigma _F\).

Proof

As \(P_i=\{[i,C_k,T_k]\ | \ k=1,2,\ldots ,p_i \}\), then:

$$\begin{aligned} \begin{array}{ccl} size(P_i)&{}=&{}size(i)+\displaystyle \sum _{k=1}^{p_i}{\left( size(k)+size([C_k,T_k])\right) }\\ &{}=&{}{\sigma _I}+\displaystyle \sum _{k=1}^{p_i}{\left( \sigma _I+size(C_k) +size( T_k )\right) .} \end{array} \end{aligned}$$
(32)

By replacing \(size(C_k)\) as it is reported in Proposition 2 and \(size(T_k)\) as it is reported in Proposition 3 in Eq. 32, \(size(P_i)\) is equal to:

$$\begin{aligned} \begin{array}{ccl} size(P_i)&{}=&{}\sigma _I+\displaystyle \sum _{k=1}^{p_i}{(\sigma _I+\sigma _I\sharp C_k+\sigma _F w)}\\ &{}=&{} \sigma _I+ \sigma _I p_i + \sigma _I n_i+ \sigma _F p_i w =\sigma _I(n_i+p_i+1)+\sigma _F p_i w \hbox { (bytes),} \end{array} \end{aligned}$$
(33)

where \(\displaystyle \sum _{k=1}^{p_i}{\sharp C_k}=n_i\), since \(C_1,\ldots ,C_{p_i}\) is a partition of \(N_i\).
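The accounting in Eqs. 32 and 33 can be sketched in code by summing the cost of each trend cluster and comparing the result with the closed form. The data layout (a list of `(node_ids, trend_values)` pairs), the function names and the 4-byte sizes are assumptions made for this illustration.

```python
def trend_cluster_set_size(clusters, w, sigma_i=4, sigma_f=4):
    """Bytes needed to store P_i, accumulated cluster by cluster.

    clusters is a list of (node_ids, trend_values) pairs; the node id lists
    are expected to partition the active nodes of the window.
    """
    total = sigma_i                          # the window code i
    for node_ids, trend_values in clusters:
        if len(trend_values) != w:
            raise ValueError("each trend must hold w values")
        total += sigma_i                     # the cluster identifier k
        total += sigma_i * len(node_ids)     # the node identifiers in C_k
        total += sigma_f * w                 # the w trend values in T_k
    return total


def trend_cluster_set_size_closed(n_nodes, p, w, sigma_i=4, sigma_f=4):
    """Closed form of Proposition 4."""
    return sigma_i * (n_nodes + p + 1) + sigma_f * p * w


# Hypothetical window with w = 4 snapshots, n_i = 5 nodes and p_i = 2 clusters.
clusters = [
    ([1, 2, 3], [20.0, 20.5, 21.0, 21.5]),
    ([4, 5], [15.0, 15.2, 15.1, 15.3]),
]
itemized = trend_cluster_set_size(clusters, w=4)
closed = trend_cluster_set_size_closed(n_nodes=5, p=2, w=4)
print(itemized, closed)
```

For this toy input both computations give 64 bytes, matching \(\sigma _I(n_i+p_i+1)+\sigma _F p_i w\) with \(n_i=5\), \(p_i=2\) and \(w=4\).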

Appendix 3 [Proposition 5]

\(P_i\) is a summarization of \(W_i\) under the condition that \({p_i}/{n_i}\le {w}/({{\sigma _I}/{\sigma _F}+w})\).

Proof

Proving Proposition 5 is equivalent to proving that, under the hypothesis that \({p_i}/{n_i}\le {w}/({{\sigma _I}/{\sigma _F}+w})\), \(size(P_i)\le size(W_i)\). Multiplying the numerator and the denominator of the bound by \(\sigma _F\) gives the equivalent form \(p_i\le {\sigma _F w n_i}/({\sigma _I+\sigma _F w})\). According to Propositions 1 and 4 we have that:

$$\begin{aligned} \begin{array}{ccl} size(P_i)&{}=&{}\sigma _I(n_i+p_i+1)+\sigma _F p_i w \quad \hbox {[Proposition~4]}\\ &{}=&{} \sigma _I(n_i+1) +\sigma _I p_i + \sigma _F p_i w = \sigma _I(n_i+1) + p_i (\sigma _I+\sigma _F w) \\ &{}\le &{} \sigma _I (n_i+1) + \frac{\sigma _F w n_i}{\sigma _I+\sigma _F w} (\sigma _I+\sigma _F w) \quad \left[ p_i\le \frac{\sigma _F w n_i}{\sigma _I+\sigma _F w}\right] \\ &{}=&{} \sigma _I(n_i+1)+ \sigma _F n_i w = {size(W_i)} \quad \hbox {[Proposition~1].} \end{array} \end{aligned}$$
(34)
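As a worked illustration of the condition, assume \(\sigma _I=\sigma _F=4\) bytes, \(n_i=50\) nodes and \(w=12\) snapshots (these values are chosen here for the example and do not come from the paper). The bound becomes \(p_i/n_i\le 12/13\), that is, \(p_i\le 46\). At the boundary and just beyond it:

$$\begin{aligned} \begin{array}{lcl} size(W_i)&{}=&{}4\cdot 51+4\cdot 50\cdot 12=2604 \hbox { bytes,}\\ size(P_i)\big |_{p_i=46}&{}=&{}4\cdot 97+4\cdot 46\cdot 12=2596\le 2604,\\ size(P_i)\big |_{p_i=47}&{}=&{}4\cdot 98+4\cdot 47\cdot 12=2648>2604. \end{array} \end{aligned}$$

So, under these assumed sizes, 46 trend clusters still yield a summary no larger than the window, while 47 do not.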

Appendix 4 [Proposition 6]

Let \(P_i\) be the set of trend clusters computed in \(W_i\). \(W_i\) can be reconstructed from \(P_i\) with an absolute error that is bounded above by \(\delta \).

Proof

Let \(N_i\) be the set of nodes partitioned into the trend clusters of \(P_i\), that is, \(N_i=\displaystyle \bigcup _{[i,C_k,T_k]\in P_i} {C_k}\). Let \(\pi _i: N_i \mapsto \mathbb R ^w\) be the stream reconstruction function which is associated to \(P_i\) and is defined such that \(\pi _i(u)= [v_{k_1}, \ldots , v_{k_w}]\), where \(u\in C_k\), \(T_k=[(1,v_{k_1}),\ldots ,(w,v_{k_w})]\) and \([i,C_k,T_k]\in P_i\). According to Definition 5, the triple \([i,C_k,T_k]\in P_i\) satisfies the property of polyline purity, hence:

$$\begin{aligned} \underbrace{\forall [i,C_k,T_k] \in P_i \ \forall u\in C_k}_{\forall u \in N_i} \ \forall t=1,\ldots ,w\ :|value(u,t)-v_{k_t}|\le \delta , \end{aligned}$$
(35)

where \(value(u,t)\) is the measurement transmitted by \(u\) at the \(t\)-th snapshot of the window. Since \(\pi _i(u)[t]=v_{k_t}\), Eq. 35 can be equivalently rewritten as:

$$\begin{aligned} \forall u\in N_i \ \forall t=1,\ldots ,w\ :\ |value(u,t)-\pi _i(u)[t]|\le \delta . \end{aligned}$$
(36)

Thus, we conclude that the similarity domain threshold \(\delta \) is an upper bound on the absolute error incurred when \(P_i\) is used to summarize \(W_i\).
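The reconstruction function \(\pi _i\) and the bound in Eq. 36 can be sketched as follows. This is an illustration under stated assumptions: the data layout and function names are hypothetical, the readings and trend values are hand-picked so that every reading lies within \(\delta \) of its cluster trend, and how SUMATRA actually discovers the clusters and computes the trend values is not reproduced here.

```python
def reconstruct(trend_clusters):
    """Build pi_i: map every node u in C_k to the values of its trend T_k.

    trend_clusters is a list of (node_ids, trend) pairs, where trend is a
    list of (t, v_kt) pairs for t = 1..w.
    """
    reconstruction = {}
    for node_ids, trend in trend_clusters:
        values = [v for _, v in trend]
        for u in node_ids:
            reconstruction[u] = values
    return reconstruction


def max_abs_error(window, reconstruction):
    """Largest |value(u, t) - pi_i(u)[t]| over all nodes and snapshots."""
    return max(
        abs(x - r)
        for u, series in window.items()
        for x, r in zip(series, reconstruction[u])
    )


# Hypothetical window with w = 3 snapshots and three nodes.
window = {
    1: [20.0, 20.4, 21.0],
    2: [20.2, 20.6, 20.8],
    3: [15.0, 15.1, 15.3],
}
# Hand-picked trend clusters that satisfy the purity property for delta = 0.2.
trend_clusters = [
    ([1, 2], [(1, 20.1), (2, 20.5), (3, 20.9)]),
    ([3], [(1, 15.0), (2, 15.1), (3, 15.3)]),
]
delta = 0.2

pi = reconstruct(trend_clusters)
error = max_abs_error(window, pi)
print(error <= delta)
```

Because each reading is within \(\delta \) of its trend value by construction, the maximum reconstruction error stays below \(\delta \), which is what the proposition states for any trend cluster set satisfying the purity property.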

About this article

Cite this article

Appice, A., Ciampi, A. & Malerba, D. Summarizing numeric spatial data streams by trend cluster discovery. Data Min Knowl Disc 29, 84–136 (2015). https://doi.org/10.1007/s10618-013-0337-7

  • DOI: https://doi.org/10.1007/s10618-013-0337-7
