Abstract
Advances in pervasive computing and sensor technologies have paved the way for the explosive growth and ubiquity of geo-physical data streams. Managing the massive, unbounded streams of sensor data they produce poses several challenges, including the real-time application of summarization techniques that allow this volume of georeferenced and timestamped data to be stored and queried on a server with limited memory. To address this issue, we have designed a summarization technique, called SUMATRA, which segments the stream into windows, computes summaries window by window, and stores these summaries in a database. Trend clusters are discovered as summaries of each window: they are clusters of georeferenced data which vary according to a similar trend along the window time horizon. Several compression techniques are also investigated to derive a compact but accurate representation of these trends for storage in the database, and a learning strategy is designed to choose the best trend compression technique automatically. Finally, an in-network modality for tree-based trend cluster discovery is investigated in order to obtain an effective aggregation schema which drastically reduces the number of bytes transmitted across the network and thus extends the network lifespan. This schema is mapped onto the routing structure of a tree-based WSN topology. Experiments performed with several data streams from real sensor networks assess the summarization capability, accuracy and efficiency of the proposed summarization schema.



















Notes
The discretization is entrusted to the sensors; this choice can be considered as a way to decentralize a small piece of the computation. In any case, most of the computational effort (clustering) remains centralized on the server.
Missing values are stored in \(H_i\) for sensors which transmit at one or more snapshots of the window, but not at all of them.
\(w \gg 1\) is plausible in the count-based window model of a stream.
\(V_{h}\) and \(V_{w-h}\) are complex conjugates (Proakis and Manolakis 1996).
This identity expresses in some way the law of conservation of energy.
It is noteworthy that a sensing device which measures a series of data items can also decide which data (or aggregates of data) are to be sent to the sink.
This way of computing the median is used to take into account the fact that each trend prototype value \(v_{j_t}\) at time \(t\) aggregates data items coming from \(\sharp C_j\) sensor devices.
The \(rmse\) is commonly used in statistics to evaluate the accuracy of predictive models. However, it has the disadvantage of weighting outliers heavily. This property, undesirable in noisy streams, motivates the analysis of the \(mae\) as an alternative error measure.
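The different sensitivity of the two error measures to outliers can be illustrated with a minimal Python sketch. The function names `rmse` and `mae` and the toy values below are illustrative assumptions; they are not taken from the experiments.

```python
import math
from typing import Sequence


def rmse(actual: Sequence[float], estimated: Sequence[float]) -> float:
    """Root mean square error: squaring amplifies large deviations."""
    squared = [(a - e) ** 2 for a, e in zip(actual, estimated)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual: Sequence[float], estimated: Sequence[float]) -> float:
    """Mean absolute error: every deviation counts linearly."""
    absolute = [abs(a - e) for a, e in zip(actual, estimated)]
    return sum(absolute) / len(absolute)


# Hypothetical readings: the same stream reconstructed without and with one outlier.
actual = [10.0, 10.0, 10.0, 10.0]
clean = [11.0, 11.0, 11.0, 11.0]   # every estimate is off by 1
noisy = [11.0, 11.0, 11.0, 19.0]   # one estimate is off by 9

print(rmse(actual, clean), mae(actual, clean))
print(rmse(actual, noisy), mae(actual, noisy))
```

With uniform deviations the two measures coincide, whereas the single outlier raises the \(rmse\) more than the \(mae\).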
References
Acharya S, Gibbons PB, Poosala V (2000) Congressional samples for approximate answering of group-by queries. In: Proceedings of the international conference on management of data, SIGMOD 2000. ACM, New York, pp 487–498
Aggarwal CC, Han J, Wang J, Yu PS (2007) On clustering massive data streams: a summarization paradigm. In: Advances in database systems: data streams models and algorithms, vol 31. Springer, Heidelberg, pp 9–38
Ai C, Du R, Zhang M, Li Y (2009) In-network historical data storage and query processing based on distributed indexing techniques in wireless sensor networks. In: Proceedings of the 4th international conference on wireless algorithms systems, and applications, WASA 2009. Springer, Berlin, pp 264–273
Al Wadi S, Ismail MT, Karim SAA (2010) A comparison between Haar wavelet transform and fast Fourier transform in analyzing financial time series data. Res J Appl Sci 5(5):352–360
Alon N, Matias Y, Szegedy M (1996) The space complexity of approximating the frequency moments. In: Proceedings of the 28th Annual ACM symposium on theory of computing, STOC 1996. ACM, New York, pp 20–29
Armenakis C (1992) Estimation and organization of spatio-temporal data. In: Proceedings of the Canadian conference on GIS92, pp 900–911
Browdy MH (1990) Simulated annealing: an improved computer model for political redistricting. Yale Law Policy Rev 8(1):163–179
Buratti C, Conti A, Dardari D, Verdone R (2009) An overview on wireless sensor networks technology and evolution. Sensors 9:6869–6896
Chen Z, Yang S, Li L, Xie Z (2010) A clustering approximation mechanism based on data spatial correlation in wireless sensor networks. In: Proceedings of the 9th conference on wireless telecommunications symposium, WTS 2010. IEEE Press, Piscataway, pp 208–214
Chiky R, Hébrail G (2008) Summarizing distributed data streams for storage in data warehouses. In: Proceedings of the 10th international conference on data warehousing and knowledge discovery, DaWaK 2008. Lecture notes in computer science, vol 5182. Springer, Berlin, pp 65–74
Chou Y (1975) Statistical analysis, 2nd edn. Holt, Rinehart & Winston of Canada Ltd, New York
Ciampi A, Appice A, Malerba D (2010) Summarization for geographically distributed data streams. In: Proceedings of the 14th international conference on knowledge-based and intelligent information and engineering systems, KES 2010. Lecture notes in computer science, vol 6278. Springer, Berlin, pp 339–348
Ciampi A, Appice A, Malerba D (2010) Online and offline trend cluster discovery in spatially distributed data streams. In: Atzmüller M, Hotho A, Strohmaier M, Chin A (eds) International workshops on analysis of social media and ubiquitous data, MSM 2010 and MUSE 2010, Revised selected Papers. Lecture Notes in Computer Science, vol 6904. Springer, Berlin, pp 142–161
Ciampi A, Appice A, Malerba D, Guccione P (2011) Trend cluster based compression of geographically distributed data streams. In: Proceedings of the IEEE symposium on computational intelligence and data mining, CIDM 2011, part of the IEEE symposium series on computational intelligence 2011, pp 168–175
Draper NR, Smith H (1982) Applied regression analysis. Wiley, New York
Duque J, Ramos R, Surinach J (2007) Supervised regionalization methods: a survey. Int Reg Sci Rev 30:195–220
Furfaro F, Mazzeo GM, Saccà D, Sirangelo C (2008) Compressed hierarchical binary histograms for summarizing multi-dimensional data. Knowl Inf Syst 15(3):335–380
Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. ACM SIGMOD Rec 34(2):18–26
Ganesan D, Greenstein B, Estrin D, Heidemann JS, Govindan R (2005) Multiresolution storage and search in sensor networks. ACM TOS 1(3):277–315
Garofalakis M, Kumar A (2004) Deterministic wavelet thresholding for maximum-error metrics. In: Proceedings of the 23rd symposium on principles of database systems, PODS 2004. ACM, New York, pp 166–176
Gilbert AC, Guha S, Indyk P, Kotidis Y, Muthukrishnan S, Strauss MJ (2002) Fast, small-space algorithms for approximate histogram maintenance. In: Proceedings of the 24th annual ACM symposium on theory of computing, STOC 2002. ACM, New York, pp 389–398
Gordon AD (1996) A survey of constrained classification. Comput Stat Data Anal 21(1):17–29
Greenwald M, Khanna S (2001) Space-efficient online computation of quantile summaries. ACM SIGMOD Rec 30(2):58–66
Guo D (2008) Regionalization with dynamically constrained agglomerative clustering and partitioning (redcap). Int J Geogr Inf Sci 22(7):801–823
Hershberger J, Shrivastava N, Suri S, Toth CD (2006) Adaptive spatial partitioning for multidimensional data streams. Algorithmica 46(1):97–117
Hutson J (1983) TRIX: triple exponential smoothing oscillator? Technical Analysis of Stocks and Commodities
Ioannidis YE, Poosala V (1995) Balancing histogram optimality and practicality for query result size estimation. In: Proceedings of the international conference on management of data, SIGMOD 1995. ACM, New York, pp 233–244
Jagadish HV, Koudas N, Muthukrishnan S, Poosala V, Sevcik KC, Suel T (1998) Optimal histograms with quality guarantees. In: Proceedings of the 24th international conference on very large data bases, VLDB 1998. Morgan Kaufmann, San Francisco, pp 275–286
Jurcík P, Severino R, Koubaa A, Alves M, Tovar E (2008) Real-time communications over cluster-tree sensor networks with mobile sink behavior. In: Proceedings of the 14th IEEE international conference on embedded and real-time computing systems and applications, RTCSA 2008. IEEE Computer Society, pp 401–412
Kittler J (1976) A local sensitive method for clustering analysis. Pattern Recognition, pp 22–33
Kontaki M, Papadopoulos AN, Manolopoulos Y (2008) Continuous trend-based clustering in data streams. In: Proceedings of the 10th international conference on data warehousing and knowledge discovery, DaWaK 2008. Lecture notes in computer science, vol 5182. Springer, Berlin, pp 251–262
Legendre P (1987) Constrained clustering. In: Legendre P, Legendre L (eds) Developments in numerical ecology, Springer, Berlin, pp 289–307
Legendre P (1993) Spatial autocorrelation: trouble or new paradigm? Ecology 74:1659–1673
LeSage J, Pace K (2001) Spatial dependence in data mining. In: Data mining for scientific and engineering applications. Kluwer, Boston, pp 439–460
Lin J, Keogh EJ, Wei L, Lonardi S (2007) Experiencing sax: a novel symbolic representation of time series. Data Min Knowl Discov 15(2):107–144
Ma X, Li S, Luo Q, Yang D, Tang S (2007) Distributed, hierarchical clustering and summarization in sensor networks. In: Proceedings of the Joint 9th Asia-Pacific Web and 8th international conference on web-age information management and advances in data and web management, APWeb/WAIM 2007, Springer, Berlin, pp 168–175
Madden S, Franklin MJ, Hellerstein JM, Hong W (2002) Tag: a tiny aggregation service for ad-hoc sensor networks. In: Culler DE, Druschel P (eds) Proceedings of the 5th symposium on operating system design and implementation, OSDI 2002. USENIX Association
Malerba D, Appice A, Varlaro A, Lanza A (2005) Spatial clustering of structured objects. In: Proceedings of the 15th international conference of inductive logic programming, ILP 2005. Lecture notes in computer science, vol 3625. Springer, Berlin, pp 227–245
Mallat S (1998) A wavelet tour of signal processing. Academic Press, London
Matias Y, Vitter JS, Wang M (2000) Dynamic maintenance of wavelet-based histograms. In: Proceedings of the 26th international conference on very large data bases, VLDB 2000. Morgan Kaufmann, San Francisco, pp 101–110
Motwani R, Raghavan P (1995) Randomized algorithms. Cambridge University Press, Cambridge
Murtagh F (1985) A survey of algorithms for contiguity-constrained clustering and related problems. Comput J 28(1):82–88
Nassar S, Sander J (2007) Effective summarization of multi-dimensional data streams for historical stream mining. In: Proceedings of the 19th international conference on scientific and statistical database management, SSDBM 2007. IEEE Computer Society, p 30
Perruchet C (1983) Constrained agglomerative hierarchical classification. Pattern Recognition, pp 213–217
Proakis JG, Manolakis DG (1996) Digital signal processing: principles, algorithms, and applications. Prentice-Hall, Upper Saddle River
Recchia A (2010) Contiguity-constrained hierarchical agglomerative clustering using sas. J Stat Softw 33
Rodrigues PP, Gama J, Lopes LMB (2008) Clustering distributed sensor data streams. In: Proceedings of the European conference on machine learning and knowledge discovery in databases. Lecture notes in computer science, vol 5212. Springer, Berlin, pp 282–297
Rusu F, Dobra A (2009) Sketching sampled data streams. In: Proceedings of the 25th international conference on data engineering, ICDE 2009. IEEE Computer Society, pp 381–392
Sanjay C, Shashi S, Wu W (2001) Modeling spatial dependencies for mining geospatial data: an introduction. In: Geographic data mining and knowledge discovery. Taylor and Francis, London, pp 131–159
Shekhar S, Chawla S (2003) Spatial databases: a tour. Prentice Hall, Upper Saddle River
Su W, Akan O, Cayirci E (2004) Communication protocols for sensor networks. In: Raghavendra CS, Sivalingam KM, Znati T (eds) Wireless sensor networks. Springer, Berlin, pp 21–50
Thaper N, Guha S, Indyk P, Koudas N (2002) Dynamic multidimensional histograms. In: Proceedings of the international conference on management of data, SIGMOD 2002. ACM, New York, pp 428–439
Thiesson B, Kim J (2012) Fast variational mode-seeking. In: Proceedings of the 15th international conference on artificial intelligence and statistics, AISTATS 2012
Tobler W (1979) Cellular geography. Philosophy in geography. Kluwer, Dordrecht, pp 379–386
Valkanas G, Kotsifakos A, Gunopulos D, Galpin I, Gray AJG, Fernandes AAA, Paton NW (2011) Deploying in-network data analysis techniques in sensor networks. In: Zaslavsky AB, Chrysanthis PK, Lee DL, Chakraborty D, Kalogeraki V, Mokbel MF, Chow CY (eds) Proceedings of the 12th IEEE international conference on mobile data management, MDM 2011. pp 341–344
Watfa M, Daher W, Azar HA (2009) A sensor network data aggregation technique. Int J Comput Theor Eng 1(1):1793–82013
Wise SM, Haining RP, Ma J (1997) Regionalization tools for the exploratory spatial analysis of health data. In: Fischer M, Hewings G, Nagurney A, Nijkamp F, Snickars P (eds) Recent developments in spatial analysis: spatial statistics, behavioural modelling and neuro-computing, The regional science series. Springer, Berlin, pp 83–100
Yoon S, Shahabi C (2005) Exploiting spatial correlation towards an energy efficient clustered aggregation technique (cag). In: Proceedings of the IEEE international conference on communications
Yoon S, Shahabi C (2007) The clustered aggregation (cag) technique leveraging spatial and temporal correlations in wireless sensor networks. ACM Trans Sens Netw 3(1)
Zhu Y, Shasha D (2002) Statstream: statistical monitoring of thousands of data streams in real time. In: Proceedings of the 28th international conference on very large data bases, VLDB 2002. VLDB Endowment, pp 358–369
Zordan D, Martínez B, Vilajosana I, Rossi M (2012) To compress or not to compress: processing vs transmission tradeoffs for energy constrained sensor networking. CoRR abs/1206.2129
Acknowledgments
This work fulfills the research objectives of the PRIN 2009 project “Learning Techniques in Relational Domains and their Applications”, funded by the Italian Ministry of University and Research (MIUR). The authors thank the anonymous reviewers for their useful suggestions to improve this paper, Pietro Guccione (Politecnico di Bari) for comments and valuable discussions on signal processing techniques, and Lynn Rudd for her help in reading the manuscript.
Additional information
Responsible editor: M.J. Zaki.
Appendices
Appendix 1 [Proposition 1]
The size of the window \(W_i\) is \(\sigma _I(n_i+1)+\sigma _F n_i w\) bytes.
Proof
The proposition follows by observing that storing \(W_i\) requires storing the (integer) window code \(i\), the (integer) identifiers of the \(n_i\) nodes which are active along the window and, for each node, the series of \(w\) (float) measurements. Then,
\[ size(W_i)=\sigma _I+\sigma _I n_i+\sigma _F n_i w=\sigma _I(n_i+1)+\sigma _F n_i w. \]
Appendix 2 [Proposition 4]
Let \(P_i\) be the trend cluster set computed from \(W_i\); then the size of \(P_i\) is \(\sigma _{I}(n_i+p_i+1)+\sigma _F p_iw\) bytes, where \(size(integer)=\sigma _{I}\) and \(size(float)=\sigma _F\).
Proof
As \(P_i=\{[i,C_k,T_k]\ | \ k=1,2,\ldots ,p_i\}\), then:
By replacing \(size(C_k)\) as reported in Proposition 2 and \(size(T_k)\) as reported in Proposition 3 in Eq. 32, \(size(P_i)\) is equal to:
\[ size(P_i)=\sigma _I\left(\sum _{k=1}^{p_i}{\sharp C_k}+p_i+1\right)+\sigma _F p_i w=\sigma _I(n_i+p_i+1)+\sigma _F p_i w, \]
where \(\displaystyle \sum _{k=1}^{p_i}{\sharp C_k}=n_i\), as \(C_1,\ldots ,C_{p_i}\) is a partitioning of \(N_i\).
Appendix 3 [Proposition 5]
\(P_i\) is a summarization of \(W_i\) under the condition that \({p_i}/{n_i}\le {w}/({{\sigma _I}/{\sigma _F}+w})\).
Proof
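Proving Proposition 5 is equivalent to proving that, under the hypothesis that \({p_i}/{n_i}\le {w}/({{\sigma _I}/{\sigma _F}+w})\), \(size(P_i)\le size(W_i)\). According to Propositions 1–4 we have that:
\[ \begin{aligned} size(P_i)\le size(W_i) &\iff \sigma _I(n_i+p_i+1)+\sigma _F p_i w\le \sigma _I(n_i+1)+\sigma _F n_i w \\ &\iff p_i(\sigma _I+\sigma _F w)\le \sigma _F n_i w \\ &\iff \frac{p_i}{n_i}\le \frac{w}{\sigma _I/\sigma _F+w}. \end{aligned} \]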
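As a numerical illustration of Propositions 1, 4 and 5, the following Python sketch evaluates both sizes and the summarization condition. The function names and the choice of 4-byte integers and floats are assumptions made for the example; the condition is checked in cross-multiplied form to avoid floating-point division.

```python
def size_window(n: int, w: int, sigma_i: int = 4, sigma_f: int = 4) -> int:
    """Bytes needed for W_i (Proposition 1): window code, n node ids, n*w measurements."""
    return sigma_i * (n + 1) + sigma_f * n * w


def size_trend_clusters(n: int, p: int, w: int,
                        sigma_i: int = 4, sigma_f: int = 4) -> int:
    """Bytes needed for P_i (Proposition 4) with p trend clusters over n nodes."""
    return sigma_i * (n + p + 1) + sigma_f * p * w


def is_summarization(n: int, p: int, w: int,
                     sigma_i: int = 4, sigma_f: int = 4) -> bool:
    """Condition of Proposition 5, p/n <= w / (sigma_i/sigma_f + w),
    rewritten as p * (sigma_i + sigma_f * w) <= sigma_f * n * w."""
    return p * (sigma_i + sigma_f * w) <= sigma_f * n * w


# Hypothetical window: 50 active nodes, 6 snapshots, 10 trend clusters.
n, w, p = 50, 6, 10
print(size_window(n, w))             # bytes for the raw window
print(size_trend_clusters(n, p, w))  # bytes for the trend cluster set
print(is_summarization(n, p, w))     # whether P_i is no larger than W_i
```

Under these assumed sizes the threshold on \(p_i/n_i\) is \(6/7\), so the trend cluster set stays smaller than the window unless almost every node forms its own cluster.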
Appendix 4 [Proposition 6]
Let \(P_i\) be the set of trend clusters computed in \(W_i\). \(W_i\) can be reconstructed from \(P_i\) with an absolute error that is bounded above by \(\delta \).
Proof
Let \(N_i\) be the set of nodes partitioned into the trend clusters of \(P_i\), that is, \(N_i=\displaystyle \bigcup _{[i,C_k,T_k]\in P_i} {C_k}\). Let \(\pi _i: N_i \rightarrow \mathbb R ^w\) be the stream reconstruction function which is associated with \(P_i\) and is defined such that \(\pi _i(u)= [v_{k_1}, \ldots , v_{k_w}]\), where \(u\in C_k\), \(T_k=[(1,v_{k_1}),\ldots ,(w,v_{k_w})]\) and \([i,C_k,T_k]\in P_i\). According to Definition 5, the triple \([i,C_k,T_k]\in P_i\) satisfies the property of polyline purity, hence:
\[ \forall u\in C_k,\ \forall t\in \{1,\ldots ,w\}:\quad |value(u,t)-v_{k_t}|\le \delta , \tag{35} \]
where \(value(u,t)\) is the measurement transmitted by \(u\) at the \(t\)-th snapshot of the window. By considering that \(\pi _i(u)[t]=v_{k_t}\), we can reformulate Eq. 35 as follows:
\[ \forall u\in N_i,\ \forall t\in \{1,\ldots ,w\}:\quad |value(u,t)-\pi _i(u)[t]|\le \delta . \]
Thus, we conclude that the similarity domain threshold \(\delta \) represents an upper bound of the absolute error incurred when \(P_i\) is used to summarize \(W_i\).
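The reconstruction function \(\pi _i\) and the error bound can be sketched in Python as follows. The representation of a trend cluster as a tuple (window code, node set, prototype values), the function names and the toy data are assumptions made for illustration; the clusters are written by hand and are not produced by the trend cluster discovery algorithm.

```python
from typing import Dict, List, Set, Tuple

# Assumed encoding of a triple [i, C_k, T_k]: window code, node ids, prototype values.
TrendCluster = Tuple[int, Set[str], List[float]]


def reconstruct(trend_clusters: List[TrendCluster]) -> Dict[str, List[float]]:
    """pi_i: assign to every node the prototype series of its trend cluster."""
    rebuilt: Dict[str, List[float]] = {}
    for _window_code, nodes, prototype in trend_clusters:
        for node in nodes:
            rebuilt[node] = list(prototype)
    return rebuilt


def max_abs_error(window: Dict[str, List[float]],
                  trend_clusters: List[TrendCluster]) -> float:
    """Largest |value(u, t) - pi_i(u)[t]| over all nodes and snapshots."""
    rebuilt = reconstruct(trend_clusters)
    return max(abs(value - estimate)
               for node, series in window.items()
               for value, estimate in zip(series, rebuilt[node]))


# Hypothetical window with w = 3 snapshots and three nodes.
window = {
    "a": [10.0, 11.0, 12.0],
    "b": [10.5, 11.5, 12.5],
    "c": [20.0, 19.0, 18.0],
}
# Hand-built trend clusters, assumed pure with respect to delta = 0.5.
clusters: List[TrendCluster] = [
    (1, {"a", "b"}, [10.25, 11.25, 12.25]),
    (1, {"c"}, [20.0, 19.0, 18.0]),
]
delta = 0.5
print(max_abs_error(window, clusters) <= delta)
```

Because every node inherits the prototype of its own cluster, the error check reduces to the purity condition of that cluster, which is the argument used in the proof.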
About this article
Cite this article
Appice, A., Ciampi, A. & Malerba, D. Summarizing numeric spatial data streams by trend cluster discovery. Data Min Knowl Disc 29, 84–136 (2015). https://doi.org/10.1007/s10618-013-0337-7
DOI: https://doi.org/10.1007/s10618-013-0337-7