Abstract
A phenomenon appears in a sensor network when a group of sensors continuously produces similar readings (i.e., data streams) over a period of time. Detecting such phenomena requires processing hundreds, and possibly thousands, of data streams in real time. This paper focuses on detecting environmental phenomena and determining possible correlations between them.
This paper proposes an efficient scheme for detecting and tracking phenomena, e.g., air pollution and oil spills. To achieve fast online response, the proposed algorithms use the Discrete Fourier Transform (DFT) to reduce the dimensionality of the streams. Each stream is represented by a point in a multidimensional grid in the frequency domain. The algorithm uses an improved unsupervised grid-based clustering technique to detect similar streams and to form clusters. The paper also proposes an efficient algorithm for detecting correlation among phenomena. This algorithm calculates the correlation coefficient in the frequency domain, reusing the DFT coefficients that were already calculated for detecting the phenomena, and it needs only a few of those coefficients.
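The reduction and grid-mapping steps described above can be illustrated with a minimal sketch. It is not the paper's improved clustering algorithm: it only maps each sliding window to its first f DFT coefficients and groups streams that fall into the same grid cell. The function names, the z-normalisation step, and the `cell_size` parameter are assumptions introduced for illustration.

```python
import cmath
import math
from collections import defaultdict


def dft_features(window, f):
    """Map a window of w readings to 2*f real numbers: the real and
    imaginary parts of DFT coefficients 1..f (the DC term is skipped).
    Assumption: the window is z-normalised first so that streams are
    compared by shape rather than by offset or scale."""
    w = len(window)
    mean = sum(window) / w
    std = math.sqrt(sum((v - mean) ** 2 for v in window) / w) or 1.0
    z = [(v - mean) / std for v in window]
    feats = []
    for k in range(1, f + 1):
        # Unitary DFT: divide by sqrt(w) so distances are preserved.
        coeff = sum(
            z[t] * cmath.exp(-2j * cmath.pi * k * t / w) for t in range(w)
        ) / math.sqrt(w)
        feats.extend((coeff.real, coeff.imag))
    return tuple(feats)


def grid_cell(point, cell_size):
    """Return the integer index of the grid cell that contains the point."""
    return tuple(math.floor(c / cell_size) for c in point)


def group_by_cell(streams, f=2, cell_size=0.5):
    """Group stream ids by the frequency-domain grid cell they fall into.
    streams: dict mapping a stream id to its current window of readings."""
    cells = defaultdict(list)
    for stream_id, window in streams.items():
        cells[grid_cell(dft_features(window, f), cell_size)].append(stream_id)
    return dict(cells)
```

Streams with similar readings produce nearby points and therefore tend to share a cell, so a candidate phenomenon can be found by inspecting cell occupancy instead of comparing every pair of streams. A full grid-based clustering would also merge dense neighboring cells, which this sketch omits.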
Experiments on synthetic data streams show that the proposed algorithm for detecting and tracking phenomena is much faster than the DBSCAN clustering technique, which is based on the R-tree index. At the same time, the proposed phenomena detection algorithm matches the clustering quality of DBSCAN while using only two DFT coefficients in most cases. The experimental results also show that the proposed technique for detecting correlation among phenomena performs as well as the traditional Pearson correlation formula, but is much faster.
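To make the comparison with the Pearson formula concrete, the following sketch estimates the correlation coefficient from a few DFT coefficients. It relies on a standard identity rather than on the paper's exact formulation: for two real windows that are mean-centred and scaled to unit norm, Parseval's theorem gives the Pearson coefficient as the sum of X_k times the conjugate of Y_k over all k, and truncating that sum to the first f coefficients (doubled for their conjugate-symmetric partners) approximates it when most of the energy lies in low frequencies. All function names are hypothetical.

```python
import cmath
import math


def unit_normalise(window):
    """Centre a window on its mean and scale it to unit Euclidean norm."""
    mean = sum(window) / len(window)
    centred = [v - mean for v in window]
    norm = math.sqrt(sum(c * c for c in centred))
    if norm == 0:
        raise ValueError("correlation is undefined for a constant window")
    return [c / norm for c in centred]


def dft_coeffs(values, f):
    """Return unitary DFT coefficients 1..f of a real sequence."""
    w = len(values)
    return [
        sum(values[t] * cmath.exp(-2j * cmath.pi * k * t / w) for t in range(w))
        / math.sqrt(w)
        for k in range(1, f + 1)
    ]


def pearson(x, y):
    """Exact Pearson correlation computed in the time domain."""
    return sum(a * b for a, b in zip(unit_normalise(x), unit_normalise(y)))


def dft_correlation(x, y, f):
    """Approximate Pearson correlation from the first f DFT coefficients.
    Requires f < w/2 so that no coefficient is counted twice."""
    if len(x) != len(y) or not 0 < f < len(x) / 2:
        raise ValueError("windows must have equal length w and 0 < f < w/2")
    X = dft_coeffs(unit_normalise(x), f)
    Y = dft_coeffs(unit_normalise(y), f)
    # Factor 2 accounts for the mirrored coefficients w-k of real signals.
    return 2 * sum((a * b.conjugate()).real for a, b in zip(X, Y))
```

Two phenomena could then be reported as correlated when this estimate exceeds a threshold such as CorrT. The estimate is exact when both windows contain only frequencies up to f, and it degrades as more energy sits in the discarded coefficients.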
Abbreviations
- x: Data stream vector in the time domain
- X: Data stream vector in the frequency domain
- N: The number of data streams
- w: The size of a sliding window
- f: The number of DFT coefficients used to represent a stream
- d: Distance between neighbor streams
- k: Number of neighbors considered for every stream
- C: The number of true clusters in a data set
- G: The number of clusters generated by the proposed algorithm
- CorrT: Correlation coefficient threshold
Communicated by Ahmed K. Elmagarmid.
Cite this article
Kamel, I., Al Aghbari, Z. & Awad, T. MG-join: detecting phenomena and their correlation in high dimensional data streams. Distrib Parallel Databases 28, 67–92 (2010). https://doi.org/10.1007/s10619-010-7065-4
DOI: https://doi.org/10.1007/s10619-010-7065-4