An incremental algorithm for clustering spatial data streams: exploring temporal locality

  • Regular Paper
Knowledge and Information Systems

Abstract

Clustering sensor data discovers useful information hidden in sensor networks. In sensor networks, a sensor has two types of attributes: a geographic attribute (i.e., its spatial location) and non-geographic attributes (e.g., sensed readings). Sensor data are periodically collected and viewed as spatial data streams, where a spatial data stream consists of a sequence of data points exhibiting attributes in both the geographic and non-geographic domains. Previous studies have formulated a dual clustering problem for spatial data by considering similarity-connected relationships in both the geographic and non-geographic domains. However, clustering in stream environments is time-sensitive because sensor data are updated frequently. The readings from one sensor tend to remain similar over a period of time, a property referred to as temporal locality. Using the temporal locality of sensor data, this study proposes an incremental clustering (IC) algorithm to discover clusters efficiently. The IC algorithm comprises two phases: cluster prediction and cluster refinement. The cluster prediction phase estimates, from the previous clustering results, the probability that two sensors belong to the same cluster, and derives a coarse clustering result from this estimate. The cluster refinement phase then refines the coarse result. This study evaluates the performance of the IC algorithm using synthetic and real datasets. Experimental results show that the IC algorithm outperforms existing approaches and confirm its scalability. In addition, the effect of temporal locality on the IC algorithm is thoroughly examined in the experiments.

References

  1. Aggarwal CC, Han J, Wang J, Yu PS (2003) A framework for clustering evolving data streams. In: Proceedings of the 29th international conference on very large data bases (VLDB), pp 81–92

  2. Aggarwal CC, Yu PS (2010) On clustering massive text and categorical data streams. Knowl Inf Syst 24(2):171–196

  3. Aghabozorgi SR, Saybani MR, Wah TY (2012) Incremental clustering of time-series by fuzzy clustering. J Inf Sci Eng (JISE) 28(4):671–688

  4. Aïtelhadj A, Boughanem M, Mezghiche M, Souam F (2012) Using structural similarity for clustering xml documents. Knowl Inf Syst (KAIS) 32(1):109–139

  5. Bagnall AJ, Janacek GJ (2004) Clustering time series from arma models with clipped data. In: Proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 49–58

  6. Banerjee A, Ghosh J (2006) Scalable clustering algorithms with balancing constraints. Data Min Knowl Discov 13(3):365–395

  7. Beringer J, Hüllermeier E (2006) Online clustering of parallel data streams. Data Knowl Eng (DKE) 58(2):180–204

  8. Borovkova S, Permana FJ (2004) Modelling electricity prices by the potential jump-diffusion. In: Proceedings of the autumn school and international conference on stochastic finance (StochFin), pp 239–263

  9. Cao F, Ester M, Qian W, Zhou A (2006) Density-based clustering over an evolving data stream with noise. In: Proceedings of the sixth SIAM international conference on data mining (SDM), pp 326–337

  10. Cartea A, Figueroa MG (2005) Pricing in electricity markets: a mean reverting jump diffusion model with seasonality. Appl Math Finance 12(4):313–335

  11. Costa G, Manco G, Ortale R (2010) Density-based clustering of data streams at multiple resolutions. Data Min Knowl Discov (DMKD) 20(1):152–187

  12. Dai BR, Lin CR, Chen MS (2007) Constrained data clustering by depth control and progressive constraint relaxation. Int J Very Large Data Bases 16(2):201–217

  13. Davidson I, Ester M, Ravi SS (2007) Efficient incremental constrained clustering. In: Proceedings of the 13th ACM international conference on knowledge discovery and data mining (SIGKDD), pp 240–249

  14. Davidson I, Ravi SS (2005) Clustering with constraints: Feasibility issues and the k-means algorithm. In: Proceedings of the fifth SIAM international conference on data mining (SDM), pp 138–149

  15. Ester M, Kriegel HP, Sander J, Wimmer M, Xu X (1998) Incremental clustering for mining in a data warehousing environment. In: Proceedings of the 24th international conference on very large data bases (VLDB), pp 323–333

  16. Ge R, Ester M, Gao BJ, Hu Z, Bhattacharya B, Ben-Moshe B (2008) Joint cluster analysis of attribute data and relationship data: the connected k-center problem, algorithms and applications. ACM Trans Knowl Discov Data 2(2):7:1–7:35

  17. Ge R, Ester M, Jin W, Davidson I (2007) Constraint-driven clustering. In: Proceedings of the 13th ACM international conference on knowledge discovery and data mining (SIGKDD), pp 320–329

  18. Guha S, Meyerson A, Mishra N, Motwani R, O’Callaghan L (2003) Clustering data streams: theory and practice. IEEE Trans Knowl Data Eng (TKDE) 15(3):515–528

  19. Guo L, Ai C, Wang X, Cai Z, Li Y (2009) Real time clustering of sensory data in wireless sensor networks. In: Proceedings of the 28th IEEE international performance computing and communications conference (IPCCC), pp 33–40

  20. Halkidi M, Spiliopoulou M, Pavlou A (2012) A semi-supervised incremental clustering algorithm for streaming data. In: Proceedings of the 16th Pacific-Asia conference on advances in knowledge discovery and data mining (PAKDD), pp 578–590

  21. Han J, Kamber M (2000) Data mining: concepts and techniques. Morgan Kaufmann, New York

  22. Huang J, Zhang J (2010) Distributed dual cluster algorithm based on grid for sensor streams. Int J Digit Content Technol Appl (JDCTA) 4(9):225–233

  23. Kavitha V, Punithavalli M (2010) Clustering time series data stream - a literature survey. Int J Comput Sci Inf Secur (IJCSIS) 8(1):289–294

  24. Klein D, Kamvar SD, Manning CD (2002) From instance-level constraints to space-level constraints: making the most of prior knowledge in data clustering. In: Proceedings of the 19th international conference on machine learning (ICML), pp 307–314

  25. Liao ZX, Peng WC (2012) Clustering spatial data with a geographic constraint: exploring local search. Knowl Inf Syst 31(1):153–170

  26. Lin CR, Liu KH, Chen MS (2005) Dual clustering: integrating data clustering over optimization and constraint domains. IEEE Trans Knowl Data Eng 17(5):628–637

  27. Lin J, Vlachos M, Keogh EJ, Gunopulos D (2004) Iterative incremental clustering of time series. In: Proceedings of the ninth international conference on extending database technology (EDBT), pp 106–122

  28. Lühr S, Lazarescu M (2009) Incremental clustering of dynamic data streams using connectivity based representative points. Data Knowl Eng (DKE) 68(1):1–27

  29. Mirkin B (2011) Clustering for data mining: a data recovery approach. Taylor and Francis, London

  30. O’Callaghan L, Meyerson A, Motwani R, Mishra N, Guha S (2002) Streaming-data algorithms for high-quality clustering. In: Proceedings of the 18th IEEE international conference on data engineering (ICDE), pp 685–694

  31. Pensa RG, Ienco D, Meo R (2012) Hierarchical co-clustering: off-line and incremental approaches. Data Mining Knowl Discov (DMKD). doi:10.1007/s10618-012-0292-8

  32. Robert CP, Casella G (2004) Monte carlo statistical methods. Springer, New York

  33. Rodrigues PP, Gama J, Lopes L (2008) Clustering distributed sensor data streams. In: Proceedings of the European conference on machine learning and knowledge discovery in databases—part II (ECML PKDD), pp 282–297

  34. Rodrigues PP, Gama J, Pedroso JP (2008) Hierarchical clustering of time series data streams. IEEE Trans Knowl Data Eng (TKDE) 20(5):615–627

  35. Shi YB, Yuan CA, Huang Y, Wen YG (2010) A method of spatial clustering based on the combination of the spatial coordinate and attributes. In: Proceedings of the sixth international conference on information systems security (ICISS), pp 526–529

  36. Shi Y, Zhang L (2010) Coid: a cluster-outlier iterative detection approach to multi-dimensional data analysis. Knowl Inf Syst. doi:10.1007/s10115-010-0323-y

  37. Tai CH, Dai BR, Chen MS (2007) Incremental clustering in geography and optimization spaces. In: Proceedings of the 11th Pacific-Asia conference on knowledge discovery and data mining (PAKDD), pp 272–283

  38. Taiwan area national freeway bureau. http://www.freeway.gov.tw/

  39. Tan PN, Steinbach M, Kumar V (2006) Introduction to data mining. Addison-Wesley, New York

  40. Wagstaff K, Cardie C (2000) Clustering with instance-level constraints. In: Proceedings of the 17th international conference on machine learning (ICML), pp 1103–1110

  41. Wan L, Ng WK, Dang XH, Yu PS, Zhang K (2009) Density-based clustering of data streams at multiple resolutions. ACM Trans Knowl Discov Data (TKDD) 3(3):1–28

  42. Wang JW, Cheng CH (2007) An efficient method for estimating null values in relational databases. Knowl Inf Syst 12(3):379–394

  43. Wei LY, Peng WC (2009) Clustering data streams in optimization and geography domains. In: Proceedings of the 13th Pacific-Asia conference on knowledge discovery and data mining (PAKDD), pp 997–1005

  44. Yang J (2003) Dynamic clustering of evolving streams with a single pass. In: Proceedings of the 19th IEEE international conference on data engineering (ICDE), pp 695–697

  45. Zhou J, Guan J, Li P (2007) Dcad: a dual clustering algorithm for distributed spatial databases. Geo-Spatial Inf Sci 10(2):137–144

  46. Zhou A, Cao F, Yan Y, Sha C, He X (2007) Density-based clustering for real-time stream data. In: Proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 133–142

  47. Zhou A, Cao F, Yan Y, Sha C, He X (2007) Distributed data stream clustering: a fast em-based approach. In: Proceedings of the 23rd international conference on data engineering (ICDE), pp 736–745

Acknowledgments

We thank the anonymous reviewers for their very useful comments and suggestions. Wen-Chih Peng was supported in part by the National Science Council, Project Nos. 100-2218-E-009-016-MY3 and 100-2218-E-009-013-MY3, by the Taiwan MoE ATU Program, by HTC, and by Delta.

Author information

Corresponding author

Correspondence to Wen-Chih Peng.

Appendix A. Hierarchical-based clustering algorithm

A hierarchical-based clustering (HBC) algorithm has been proposed for solving dual clustering problems in spatial data streams [43]. In each time window, the HBC algorithm first constructs an SC-graph and places the explicit edges, together with their dissimilarity values, in a priority queue \(Q\) that always returns the explicit edge with the minimal dissimilarity value. As in a bottom-up hierarchical clustering algorithm, each vertex is initially regarded as a single cluster. The explicit edge with the minimal dissimilarity value is then removed from \(Q\). Let that edge be \(e_{e}(S_{i},S_{j})\). If \(S_{i}\) and \(S_{j}\) already belong to the same cluster, no merge is needed. However, if they belong to different clusters (i.e., \(S_{i}\in C_{i}, S_{j}\in C_{j}\), and \(C_{i}\ne C_{j}\)), both the connectivity requirement and the complete subgraph requirement (Section 3.1) must be verified. If both requirements are met, clusters \(C_{i}\) and \(C_{j}\) are merged to form a new cluster. This procedure is repeated until \(Q\) is empty.
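
The merging loop described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `hbc_sketch`, the `dissim` callback, and the threshold `delta` are hypothetical. Because Section 3.1 is not reproduced here, the sketch assumes that the connectivity requirement holds whenever an explicit edge joins the two clusters, and that the complete subgraph requirement means every cross-cluster pair has a dissimilarity of at most `delta`.

```python
import heapq
from itertools import product


def hbc_sketch(num_sensors, explicit_edges, dissim, delta):
    """Bottom-up merging over explicit edges (illustrative sketch only).

    num_sensors    -- sensors are indexed 0 .. num_sensors - 1
    explicit_edges -- iterable of (i, j) pairs, assumed to be the explicit
                      edges of an already constructed SC-graph
    dissim         -- hypothetical callback: non-geographic dissimilarity
    delta          -- hypothetical dissimilarity threshold
    """
    # Priority queue Q: the edge with minimal dissimilarity is popped first.
    queue = [(dissim(i, j), i, j) for i, j in explicit_edges]
    heapq.heapify(queue)

    # Each vertex starts as its own cluster.
    cluster_of = list(range(num_sensors))
    members = {c: {c} for c in range(num_sensors)}

    while queue:
        _, i, j = heapq.heappop(queue)
        ci, cj = cluster_of[i], cluster_of[j]
        if ci == cj:
            continue  # already in the same cluster, nothing to merge

        # Connectivity requirement: assumed satisfied, since the explicit
        # edge (i, j) links the two clusters.
        # Complete subgraph requirement (assumed form): every pair across
        # the two clusters must be within the threshold.
        if all(dissim(a, b) <= delta
               for a, b in product(members[ci], members[cj])):
            # Merge the smaller cluster into the larger one.
            if len(members[ci]) < len(members[cj]):
                ci, cj = cj, ci
            for v in members[cj]:
                cluster_of[v] = ci
            members[ci] |= members.pop(cj)

    return sorted(sorted(m) for m in members.values())


if __name__ == "__main__":
    # Four sensors on a line; readings are made-up illustration values.
    readings = [20.0, 20.5, 30.0, 30.2]
    diff = lambda a, b: abs(readings[a] - readings[b])
    print(hbc_sketch(4, [(0, 1), (1, 2), (2, 3)], diff, delta=1.0))
    # prints [[0, 1], [2, 3]]
```

In this toy run the edge between sensors 1 and 2 is popped last and rejected, because the pairwise check fails for the readings 20.0 and 30.0. The pairwise check costs time proportional to the product of the two cluster sizes, which keeps the sketch short but is not tuned for large windows.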

Cite this article

Wei, LY., Peng, WC. An incremental algorithm for clustering spatial data streams: exploring temporal locality. Knowl Inf Syst 37, 453–483 (2013). https://doi.org/10.1007/s10115-013-0636-8

  • DOI: https://doi.org/10.1007/s10115-013-0636-8
