Summarizing numeric spatial data streams by trend cluster discovery

Appice, Annalisa; Ciampi, Anna; Malerba, Donato

doi:10.1007/s10618-013-0337-7

Published: 23 August 2013

Data Mining and Knowledge Discovery, volume 29, pages 84–136 (2015)

Annalisa Appice¹,
Anna Ciampi¹ &
Donato Malerba¹


Abstract

Advances in pervasive computing and sensor technologies have paved the way for the explosive growth and ubiquity of geophysical data streams. Managing the massive, unbounded streams of sensor data poses several challenges, including the real-time application of summarization techniques that allow this volume of georeferenced and timestamped data to be stored and queried on a server with limited memory. To address this issue, we have designed a summarization technique, called SUMATRA, which segments the stream into windows, computes summaries window by window and stores these summaries in a database. Trend clusters are discovered as summaries of each window: they are clusters of georeferenced data which vary according to a similar trend along the window time horizon. Several compression techniques are also investigated to derive a compact but accurate representation of these trends for storage in the database, and a learning strategy is designed to automatically choose the best trend compression technique. Finally, an in-network modality for tree-based trend cluster discovery is investigated in order to achieve an effective aggregation schema, which drastically reduces the number of bytes transmitted across the network and prolongs the network lifespan. This schema is mapped onto the routing structure of a tree-based wireless sensor network (WSN) topology. Experiments performed with several data streams from real sensor networks assess the summarization capability, accuracy and efficiency of the proposed summarization schema.

Notes

The discretization is entrusted to the sensors; this choice can be seen as a way to decentralize a small part of the computation. In any case, most of the computational effort (clustering) remains centralized on the server.
Missing values are stored in $H_i$ for sensors that transmit at one or more snapshots of the window, but not at all of them.
$w \gg 1$ is plausible in the count-based window model of a stream.
$V_{h}$ and $V_{w-h}$ are complex conjugates (Proakis and Manolakis 1996).
In a sense, this identity expresses the law of conservation of energy.
It is noteworthy that a sensing device which measures a series of data items can also decide which data (or aggregates of data) have to be sent to the sink.
This way of computing the median takes into account the fact that each trend prototype value $v_{j_t}$ at time $t$ aggregates data items coming from $\sharp C_j$ sensor devices.
http://www.di.uniba.it/~kdde/index.php/SUMATRA
http://db.csail.mit.edu/labdata/labdata.html
http://climate.geog.udel.edu/~climate/html_pages/archive.html
ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/v2/
The $rmse$ is commonly used in statistics to evaluate the accuracy of predictive models. However, it has the disadvantage of weighting outliers heavily. This property, undesirable in noisy streams, motivates the analysis of the $mae$ as an alternative error measure.
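
The last note can be illustrated with a small sketch. The readings below are hypothetical and chosen only to show how a single outlying error affects the two measures; the function names are illustrative and not taken from the SUMATRA code.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical sensor readings (not from the paper).
actual = [20.0, 21.0, 22.0, 23.0]
# Reconstruction that is off by exactly 1 at every snapshot.
uniform = [21.0, 22.0, 23.0, 24.0]
# Same reconstruction, except the last snapshot is off by 9 (an outlier).
with_outlier = [21.0, 22.0, 23.0, 32.0]

# With uniform errors the two measures coincide.
print(rmse(actual, uniform), mae(actual, uniform))
# With one outlier, rmse = sqrt(84 / 4) grows faster than mae = 12 / 4.
print(rmse(actual, with_outlier), mae(actual, with_outlier))
```

With uniform errors both measures equal 1; with the single outlier the $mae$ rises to 3 while the $rmse$ rises to about 4.58, because the squared term amplifies the largest error.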

References

Acharya S, Gibbons PB, Poosala V (2000) Congressional samples for approximate answering of group-by queries. In: Proceedings of the international conference on management of data, SIGMOD 2000. ACM, New York, pp 487–498
Aggarwal CC, Han J, Wang J, Yu PS (2007) On clustering massive data streams: a summarization paradigm. In: Advances in database systems: data streams models and algorithms, vol 31. Springer, Heidelberg, pp 9–38
Ai C, Du R, Zhang M, Li Y (2009) In-network historical data storage and query processing based on distributed indexing techniques in wireless sensor networks. In: Proceedings of the 4th international conference on wireless algorithms systems, and applications, WASA 2009. Springer, Berlin, pp 264–273
Al Wadi S, Ismail MT, Karim SAA (2010) A comparison between Haar wavelet transform and fast Fourier transform in analyzing financial time series data. Res J Appl Sci 5(5):352–360
Alon N, Matias Y, Szegedy M (1996) The space complexity of approximating the frequency moments. In: Proceedings of the 28th Annual ACM symposium on theory of computing, STOC 1996. ACM, New York, pp 20–29
Armenakis C (1992) Estimation and organization of spatio-temporal data. In: Proceedings of the Canadian conference on GIS92, pp 900–911
Browdy MH (1990) Simulated annealing: an improved computer model for political redistricting. Yale Law Policy Rev 8(1):163–179
Buratti C, Conti A, Dardari D, Verdone R (2009) An overview on wireless sensor networks technology and evolution. Sensors 9:6869–6896
Chen Z, Yang S, Li L, Xie Z (2010) A clustering approximation mechanism based on data spatial correlation in wireless sensor networks. In: Proceedings of the 9th conference on wireless telecommunications symposium, WTS 2010. IEEE Press, Piscataway, pp 208–214
Chiky R, Hébrail G (2008) Summarizing distributed data streams for storage in data warehouses. In: Proceedings of the 10th international conference on data warehousing and knowledge discovery, DaWaK 2008. Lecture notes in computer science, vol 5182. Springer, Berlin, pp 65–74
Chou Y (1975) Statistical analysis, 2nd edn. Holt, Rinehart & Winston of Canada Ltd, New York
Ciampi A, Appice A, Malerba D (2010) Summarization for geographically distributed data streams. In: Proceedings of the 14th international conference on knowledge-based and intelligent information and engineering systems, KES 2010. Lecture notes in computer science, vol 6278. Springer, Berlin, pp 339–348
Ciampi A, Appice A, Malerba D (2010) Online and offline trend cluster discovery in spatially distributed data streams. In: Atzmüller M, Hotho A, Strohmaier M, Chin A (eds) International workshops on analysis of social media and ubiquitous data, MSM 2010 and MUSE 2010, Revised selected Papers. Lecture Notes in Computer Science, vol 6904. Springer, Berlin, pp 142–161
Ciampi A, Appice A, Malerba D, Guccione P (2011) Trend cluster based compression of geographically distributed data streams. In: Proceedings of the IEEE symposium on computational intelligence and data mining, CIDM 2011, part of the IEEE symposium series on computational intelligence 2011, pp 168–175
Draper NR, Smith H (1982) Applied regression analysis. Wiley, New York
Duque J, Ramos R, Surinach J (2007) Supervised regionalization methods: a survey. Int Reg Sci Rev 30:195–220
Furfaro F, Mazzeo GM, Saccà D, Sirangelo C (2008) Compressed hierarchical binary histograms for summarizing multi-dimensional data. Knowl Inf Syst 15(3):335–380
Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. ACM SIGMOD Rec 34(2):18–26
Ganesan D, Greenstein B, Estrin D, Heidemann JS, Govindan R (2005) Multiresolution storage and search in sensor networks. ACM TOS 1(3):277–315
Garofalakis M, Kumar A (2004) Deterministic wavelet thresholding for maximum-error metrics. In: Proceedings of the 23rd symposium on principles of database systems, PODS 2004. ACM, New York, pp 166–176
Gilbert AC, Guha S, Indyk P, Kotidis Y, Muthukrishnan S, Strauss MJ (2002) Fast, small-space algorithms for approximate histogram maintenance. In: Proceedings of the 24th annual ACM symposium on theory of computing, STOC 2002. ACM, New York, pp 389–398
Gordon AD (1996) A survey of constrained classification. Comput Stat Data Anal 21(1):17–29
Greenwald M, Khanna S (2001) Space-efficient online computation of quantile summaries. ACM SIGMOD Rec 30(2):58–66
Guo D (2008) Regionalization with dynamically constrained agglomerative clustering and partitioning (redcap). Int J Geogr Inf Sci 22(7):801–823
Hershberger J, Shrivastava N, Suri S, Toth CD (2006) Adaptive spatial partitioning for multidimensional data streams. Algorithmica 46(1):97–117
Hutson J (1983) TRIX: triple exponential smoothing oscillator? Technical Analysis of Stocks and Commodities
Ioannidis YE, Poosala V (1995) Balancing histogram optimality and practicality for query result size estimation. In: Proceedings of the international conference on management of data, SIGMOD 1995. ACM, New York, pp 233–244
Jagadish HV, Koudas N, Muthukrishnan S, Poosala V, Sevcik KC, Suel T (1998) Optimal histograms with quality guarantees. In: Proceedings of the 24th international conference on very large data bases, VLDB 1998. Morgan Kaufmann, San Francisco, pp 275–286
Jurcík P, Severino R, Koubaa A, Alves M, Tovar E (2008) Real-time communications over cluster-tree sensor networks with mobile sink behavior. In: Proceedings of the 14th IEEE international conference on embedded and real-time computing systems and applications, RTCSA 2008. IEEE Computer Society, pp 401–412
Kittler J (1976) A local sensitive method for clustering analysis. Pattern Recognition, pp 22–33
Kontaki M, Papadopoulos AN, Manolopoulos Y (2008) Continuous trend-based clustering in data streams. In: Proceedings of the 10th international conference on data warehousing and knowledge discovery, DaWaK 2008. Lecture notes in computer science, vol 5182. Springer, Berlin, pp 251–262
Legendre P (1987) Constrained clustering. In: Legendre P, Legendre L (eds) Developments in numerical ecology, Springer, Berlin, pp 289–307
Legendre P (1993) Spatial autocorrelation: trouble or new paradigm? Ecology 74:1659–1673
LeSage J, Pace K (2001) Spatial dependence in data mining. In: Data mining for scientific and engineering applications. Kluwer, Boston, pp 439–460
Lin J, Keogh EJ, Wei L, Lonardi S (2007) Experiencing sax: a novel symbolic representation of time series. Data Min Knowl Discov 15(2):107–144
Ma X, Li S, Luo Q, Yang D, Tang S (2007) Distributed, hierarchical clustering and summarization in sensor networks. In: Proceedings of the Joint 9th Asia-Pacific Web and 8th international conference on web-age information management and advances in data and web management, APWeb/WAIM 2007, Springer, Berlin, pp 168–175
Madden S, Franklin MJ, Hellerstein JM, Hong W (2002) Tag: a tiny aggregation service for ad-hoc sensor networks. In: Culler DE, Druschel P (eds) Proceedings of the 5th symposium on operating system design and implementation, OSDI 2002. USENIX Association
Malerba D, Appice A, Varlaro A, Lanza A (2005) Spatial clustering of structured objects. In: Proceedings of the 15th international conference of inductive logic programming, ILP 2005. Lecture notes in computer science, vol 3625. Springer, Berlin, pp 227–245
Mallat S (1998) A wavelet tour of signal processing. Academic Press, London
Matias Y, Vitter JS, Wang M (2000) Dynamic maintenance of wavelet-based histograms. In: Proceedings of the 26th international conference on very large data bases, VLDB 2000. Morgan Kaufmann, San Francisco, pp 101–110
Motwani R, Raghavan P (1995) Randomized algorithms. Cambridge University Press, Cambridge
Murtagh F (1985) A survey of algorithms for contiguity-constrained clustering and related problems. Comput J 28(1):82–88
Nassar S, Sander J (2007) Effective summarization of multi-dimensional data streams for historical stream mining. In: Proceedings of the 19th international conference on scientific and statistical database management, SSDBM 2007. IEEE Computer Society, p 30
Perruchet C (1983) Constrained agglomerative hierarchical classification. Pattern Recognition, pp 213–217
Proakis JG, Manolakis DG (1996) Digital signal processing: principles, algorithms, and applications. Prentice-Hall, Upper Saddle River
Recchia A (2010) Contiguity-constrained hierarchical agglomerative clustering using sas. J Stat Softw 33
Rodrigues PP, Gama J, Lopes LMB (2008) Clustering distributed sensor data streams. In: Proceedings of the European conference on machine learning and knowledge discovery in databases. Lecture notes in computer science, vol 5212. Springer, Berlin, pp 282–297
Rusu F, Dobra A (2009) Sketching sampled data streams. In: Proceedings of the 25th international conference on data engineering, ICDE 2009. IEEE Computer Society, pp 381–392
Sanjay C, Shashi S, Wu W (2001) Modeling spatial dependencies for mining geospatial data: an introduction. In: Geographic data mining and knowledge discovery. Taylor and Francis, London, pp 131–159
Shekhar S, Chawla S (2003) Spatial databases: a tour. Prentice Hall, Upper Saddle River
Su W, Akan O, Cayirci E (2004) Communication protocols for sensor networks. In: Raghavendra CS, Sivalingam KM, Znati T (eds) Wireless sensor networks. Springer, Berlin, pp 21–50
Thaper N, Guha S, Indyk P, Koudas N (2002) Dynamic multidimensional histograms. In: Proceedings of the international conference on management of data, SIGMOD 2002. ACM, New York, pp 428–439
Thiesson B, Kin J (2012) Fast variational mode-seeking. In: Proceedings of the 15th international conference on artificial intelligence and statistics, AISTATS 2012
Tobler W (1979) Cellular geography. Philosophy in geography. Kluwer, Dordrecht, pp 379–386
Valkanas G, Kotsifakos A, Gunopulos D, Galpin I, Gray AJG, Fernandes AAA, Paton NW (2011) Deploying in-network data analysis techniques in sensor networks. In: Zaslavsky AB, Chrysanthis PK, Lee DL, Chakraborty D, Kalogeraki V, Mokbel MF, Chow CY (eds) Proceedings of the 12th IEEE international conference on mobile data management, MDM 2011. pp 341–344
Watfa M, Daher W, Azar HA (2009) A sensor network data aggregation technique. Int J Comput Theor Eng 1(1):1793–82013
Wise SM, Haining RP, Ma J (1997) Regionalization tools for the exploratory spatial analysis of health data. In: Fischer M, Hewings G, Nagurney A, Nijkamp F, Snickars P (eds) Recent developments in spatial analysis: spatial statistics, behavioural modelling and neuro-computing, The regional science series. Springer, Berlin, pp 83–100
Yoon S, Shahabi C (2005) Exploiting spatial correlation towards an energy efficient clustered aggregation technique (cag). In: Proceedings of the IEEE international conference on communications
Yoon S, Shahabi C (2007) The clustered aggregation (cag) technique leveraging spatial and temporal correlations in wireless sensor networks. ACM Trans Sens Netw 3(1)
Zhu Y, Shasha D (2002) Statstream: statistical monitoring of thousands of data streams in real time. In: Proceedings of the 28th international conference on very large data bases, VLDB 2002. VLDB Endowment, pp 358–369
Zordan D, Martínez B, Vilajosana I, Rossi M (2012) To compress or not to compress: processing vs transmission tradeoffs for energy constrained sensor networking. CoRR abs/1206.2129

Acknowledgments

This work fulfills the research objectives of the PRIN 2009 project “Learning Techniques in Relational Domains and their Applications”, funded by the Italian Ministry of University and Research (MIUR). The authors thank the anonymous reviewers for their useful suggestions to improve this paper, Pietro Guccione (Politecnico di Bari) for comments and valuable discussions on signal processing techniques, and Lynn Rudd for her help in reading the manuscript.

Author information

Authors and Affiliations

Dipartimento di Informatica, Università degli Studi di Bari “Aldo Moro”, via Orabona 4, 70125 Bari, Italy
Annalisa Appice, Anna Ciampi & Donato Malerba

Corresponding author

Correspondence to Annalisa Appice.

Additional information

Responsible editor: M.J. Zaki.

Appendices

Appendix 1 [Proposition 1]

The size of the window $W_i$ is $\sigma_I(n_i+1)+\sigma_F n_i w$ bytes, where $size(\text{integer})=\sigma_I$ and $size(\text{float})=\sigma_F$.

Proof

To store $W_i$, it is necessary to store the (integer) enumerative window code $i$, the (integer) identifiers of the $n_i$ nodes in $N_i$ that are active along the window and, for each node, the series of $w$ (float) measurements. Then,

$$\begin{aligned} size(W_i) &= size(i)+\sum_{p\in N_i} size(p)+\sum_{p\in N_i}\sum_{t=1}^{w} size(H_i[p][t])\\ &= \sigma_I+\sigma_I n_i+\sigma_F n_i w = \sigma_I(n_i+1)+\sigma_F n_i w \hbox{ (bytes).} \end{aligned}$$

(31)
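
As a worked instance of Eq. 31, assume (purely for illustration; these values are not taken from the paper) 4-byte integers and floats, that is $\sigma_I=\sigma_F=4$, a window of $w=12$ snapshots and $n_i=50$ active nodes:

$$size(W_i)=4\,(50+1)+4\cdot 50\cdot 12=204+2400=2604 \hbox{ (bytes).}$$

Under these assumed values the measurement term dominates: the window code and the node identifiers account for less than 8% of the total.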

Appendix 2 [Proposition 4]

Let $P_i$ be the trend cluster set computed from $W_i$. Then the size of $P_i$ is $\sigma_I(n_i+p_i+1)+\sigma_F p_i w$ bytes, where $size(\text{integer})=\sigma_I$ and $size(\text{float})=\sigma_F$.

Proof

As $P_i=\{[i,C_k,T_k] \mid k=1,2,\ldots,p_i\}$, then:

$$\begin{aligned} size(P_i) &= size(i)+\sum_{k=1}^{p_i}\left(size(k)+size([C_k,T_k])\right)\\ &= \sigma_I+\sum_{k=1}^{p_i}\left(\sigma_I+size(C_k)+size(T_k)\right). \end{aligned}$$

(32)

Substituting $size(C_k)$ as reported in Proposition 2 and $size(T_k)$ as reported in Proposition 3 into Eq. 32, $size(P_i)$ is equal to:

$$\begin{aligned} size(P_i) &= \sigma_I+\sum_{k=1}^{p_i}\left(\sigma_I+\sigma_I\,\sharp C_k+\sigma_F w\right)\\ &= \sigma_I+\sigma_I p_i+\sigma_I n_i+\sigma_F p_i w = \sigma_I(n_i+p_i+1)+\sigma_F p_i w \hbox{ (bytes),} \end{aligned}$$

(33)

where $\sum_{k=1}^{p_i}\sharp C_k=n_i$, as $C_1,\ldots,C_{p_i}$ is a partition of $N_i$.
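
The derivation above can be checked numerically. The following is a minimal sketch, assuming 4-byte integers and floats and a hypothetical partition of six nodes into three clusters; the function names are illustrative and do not come from the paper. It sums the storage cost term by term, as in Eq. 32, and compares the result with the closed form of Proposition 4.

```python
SIGMA_I = 4  # assumed bytes per integer
SIGMA_F = 4  # assumed bytes per float

def size_term_by_term(clusters, w, sigma_i=SIGMA_I, sigma_f=SIGMA_F):
    """Storage cost of a trend cluster set, summed term by term (Eq. 32).

    clusters: list of node-identifier lists, one list per trend cluster,
              assumed to partition the active nodes of the window.
    w: number of snapshots in the window (length of each trend).
    """
    total = sigma_i  # window code i
    for nodes in clusters:
        total += sigma_i               # cluster index k
        total += sigma_i * len(nodes)  # node identifiers in C_k
        total += sigma_f * w           # trend values in T_k
    return total

def size_closed_form(n, p, w, sigma_i=SIGMA_I, sigma_f=SIGMA_F):
    """Closed form of Proposition 4: sigma_I (n + p + 1) + sigma_F p w."""
    return sigma_i * (n + p + 1) + sigma_f * p * w

# Hypothetical partition of n = 6 nodes into p = 3 trend clusters.
clusters = [[1, 2, 3], [4, 5], [6]]
w = 12
n = sum(len(nodes) for nodes in clusters)
p = len(clusters)

print(size_term_by_term(clusters, w), size_closed_form(n, p, w))  # 184 184
```

The two values agree because the cluster sizes sum to $n_i$ whenever the clusters partition $N_i$, which is the only property of the partition the closed form relies on.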

Appendix 3 [Proposition 5]

$P_i$ is a summarization of $W_i$ under the condition that $p_i/n_i \le w/(\sigma_I/\sigma_F+w)$.

Proof

Proving Proposition 5 is equivalent to proving that $size(P_i)\le size(W_i)$ under the hypothesis that $p_i/n_i \le w/(\sigma_I/\sigma_F+w)$, that is, $p_i \le \sigma_F w n_i/(\sigma_I+\sigma_F w)$. According to Propositions 1 and 4, we have that:

$$\begin{aligned} size(P_i) &= \sigma_I(n_i+p_i+1)+\sigma_F p_i w && \hbox{[Proposition 4]}\\ &= \sigma_I(n_i+1)+\sigma_I p_i+\sigma_F p_i w = \sigma_I(n_i+1)+p_i(\sigma_I+\sigma_F w)\\ &\le \sigma_I(n_i+1)+\frac{\sigma_F w n_i}{\sigma_I+\sigma_F w}(\sigma_I+\sigma_F w) && \left[p_i\le \frac{\sigma_F w n_i}{\sigma_I+\sigma_F w}\right]\\ &= \sigma_I(n_i+1)+\sigma_F n_i w = size(W_i) && \hbox{[Proposition 1].} \end{aligned}$$

(34)
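
The bound of Proposition 5 can also be checked exhaustively for a given window. The sketch below assumes 4-byte integers and floats, $w=12$ and $n_i=52$ (hypothetical values chosen so that the bound falls on an integer number of clusters); it uses exact rational arithmetic to avoid rounding at the boundary.

```python
from fractions import Fraction

def size_window(n, w, sigma_i=4, sigma_f=4):
    """Proposition 1: bytes needed to store the raw window."""
    return sigma_i * (n + 1) + sigma_f * n * w

def size_trend_clusters(n, p, w, sigma_i=4, sigma_f=4):
    """Proposition 4: bytes needed to store p trend clusters over n nodes."""
    return sigma_i * (n + p + 1) + sigma_f * p * w

def max_cluster_ratio(w, sigma_i=4, sigma_f=4):
    """Proposition 5: largest p/n for which the summary is not larger than the window."""
    return Fraction(w) / (Fraction(sigma_i, sigma_f) + w)

n, w = 52, 12  # hypothetical window: 52 active nodes, 12 snapshots
bound = max_cluster_ratio(w)
print(bound)  # 12/13

# The ratio test and the direct size comparison agree for every possible p.
for p in range(1, n + 1):
    within_bound = Fraction(p, n) <= bound
    is_summary = size_trend_clusters(n, p, w) <= size_window(n, w)
    assert within_bound == is_summary

# Largest number of clusters that still yields a summarization.
print(max(p for p in range(1, n + 1) if Fraction(p, n) <= bound))  # 48
```

With these assumed values the summary stops saving space once more than 48 of the 52 nodes form their own clusters, which suggests that for $w \gg 1$ the condition is violated only when nearly every node is a singleton cluster.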

Appendix 4 [Proposition 6]

Let $P_i$ be the set of trend clusters computed in $W_i$. $W_i$ can be reconstructed from $P_i$ with an absolute error that is bounded above by $\delta$.

Proof

Let $N_i$ be the set of nodes partitioned into the trend clusters of $P_i$, that is, $N_i=\displaystyle \bigcup _{[i,C_k,T_k]\in P_i} {C_k}$. Let $\pi _i: N_i \mapsto \mathbb R ^w$ be the stream reconstruction function which is associated to $P_i$ and is defined such that $\pi _i(u)= [v_{k_1}, \ldots , v_{k_w}]$, where $u\in C_k$, $T_k=[(1,v_{k_1}),\ldots ,(w,v_{k_w})]$ and $[i,C_k,T_k]\in P_i$. According to Definition 5, the triple $[i,C_k,T_k]\in P_i$ satisfies the property of polyline purity, hence:

$$\begin{aligned} \underbrace{\forall [i,C_k,T_k] \in P_i \ \forall u\in C_k}_{\forall u \in N_i} \ \forall t=1,\ldots ,w\ :|value(u,t)-v_{k_t}|\le \delta , \end{aligned}$$

(35)

where $value(u,t)$ is the measurement transmitted by $u$ at the $t$-th snapshot of the window. By considering that $\pi _i(u)[t]=v_{k_t}$, we can equivalently reformulate Eq. 35 as follows:

$$\begin{aligned} \forall u\in N_i \ \forall t=1,\ldots ,w\ :\ |value(u,t)-\pi _i(u)[t]|\le \delta . \end{aligned}$$

(36)

Thus, we conclude that the similarity domain threshold $\delta $ represents an upper bound of the absolute error incurred when $P_i$ is used to summarize $W_i$.
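
The reconstruction function $\pi_i$ and the error bound of Eq. 36 can be sketched as follows. The window, the trend prototypes and the value of $\delta$ are hypothetical; in particular, the prototypes are written by hand rather than computed with the paper's discovery procedure, so the sketch only illustrates the reconstruction step.

```python
def reconstruct(trend_clusters):
    """Stream reconstruction function pi_i: map each node u to the trend
    of the (unique) cluster that contains it."""
    return {u: trend for nodes, trend in trend_clusters for u in nodes}

def max_abs_error(window, trend_clusters):
    """Largest |value(u, t) - pi_i(u)[t]| over all nodes and snapshots."""
    rebuilt = reconstruct(trend_clusters)
    return max(
        abs(value - approx)
        for u, series in window.items()
        for value, approx in zip(series, rebuilt[u])
    )

# Hypothetical window with w = 3 snapshots and three nodes.
window = {
    "a": [20.0, 21.0, 22.0],
    "b": [20.4, 21.3, 21.8],
    "c": [15.0, 15.5, 15.0],
}
# Hypothetical trend clusters as (C_k, trend values of T_k) pairs.
trend_clusters = [
    (["a", "b"], [20.2, 21.1, 21.9]),
    (["c"], [15.0, 15.5, 15.0]),
]
delta = 0.5  # assumed similarity domain threshold

print(reconstruct(trend_clusters)["b"])  # [20.2, 21.1, 21.9]
assert max_abs_error(window, trend_clusters) <= delta
```

Every node in a cluster is rebuilt with the same series, so the stored summary loses the within-cluster variation, and Eq. 36 guarantees that this loss never exceeds $\delta$ at any snapshot.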

About this article

Cite this article

Appice, A., Ciampi, A. & Malerba, D. Summarizing numeric spatial data streams by trend cluster discovery. Data Min Knowl Disc 29, 84–136 (2015). https://doi.org/10.1007/s10618-013-0337-7

Received: 17 January 2012
Accepted: 07 August 2013
Published: 23 August 2013
Issue Date: January 2015
DOI: https://doi.org/10.1007/s10618-013-0337-7
