MG-join: detecting phenomena and their correlation in high dimensional data streams

Distributed and Parallel Databases

Abstract

A phenomenon appears in a sensor network when a group of sensors continuously produces similar readings (i.e., data streams) over a period of time. Detecting it involves processing hundreds, and possibly thousands, of data streams in real time. This paper focuses on detecting environmental phenomena and determining possible correlations between them.

This paper proposes an efficient scheme for detecting and tracking phenomena such as air pollution and oil spills. To achieve fast online response, the proposed algorithms use the Discrete Fourier Transform (DFT) to reduce the dimensionality of the streams. Each stream is represented by a point in a multidimensional grid in the frequency domain. An improved unsupervised grid-based clustering technique then detects similar streams and forms clusters. The paper also proposes an efficient algorithm for detecting correlation among phenomena. This algorithm calculates the correlation coefficient in the frequency domain, reusing the DFT coefficients already calculated for phenomenon detection, and it needs only a few of them.
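As a rough illustration of the dimensionality-reduction step, the sketch below keeps the first f DFT coefficients of each sliding window and maps them to a grid cell. It is a minimal sketch under stated assumptions, not the paper's algorithm: the function names, the unitary 1/sqrt(w) scaling, the choice to keep the DC coefficient, the use of real and imaginary parts as grid axes, and the `cell_size` parameter are all hypothetical. Grouping streams that share a cell stands in for the improved grid-based clustering, which the abstract does not specify.

```python
import cmath
import math
from collections import defaultdict


def dft_coefficients(window, f):
    """Return the first f DFT coefficients of a window, starting with the DC term.

    Assumption: unitary scaling (1/sqrt(w)), so Euclidean distances in the
    time domain are preserved in the full frequency domain (Parseval).
    """
    w = len(window)
    coeffs = []
    for m in range(f):
        total = sum(
            x * cmath.exp(-2j * math.pi * m * t / w) for t, x in enumerate(window)
        )
        coeffs.append(total / math.sqrt(w))
    return coeffs


def grid_cell(coeffs, cell_size):
    """Map a stream's coefficients to the index of its grid cell.

    Assumption: each coefficient contributes two axes (real and imaginary).
    """
    point = []
    for c in coeffs:
        point.extend((c.real, c.imag))
    return tuple(math.floor(v / cell_size) for v in point)


def group_by_cell(streams, f=2, cell_size=1.0):
    """Group stream ids by grid cell; streams in one cell are candidates
    for the same phenomenon."""
    cells = defaultdict(list)
    for stream_id, window in streams.items():
        cell = grid_cell(dft_coefficients(window, f), cell_size)
        cells[cell].append(stream_id)
    return dict(cells)
```

With w readings per window reduced to f complex coefficients, each stream becomes a point with 2f coordinates, so the grid lookup cost does not grow with the window size. A plain floor-based grid can split two nearby points across a cell boundary; a real clustering step would also need to inspect neighboring cells.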

Experiments on synthetic data streams show that the proposed algorithm for detecting and tracking phenomena is much faster than the DBSCAN clustering technique, which is based on the R-tree index. At the same time, in most cases the proposed phenomena detection algorithm matches the clustering quality of DBSCAN while using only two DFT coefficients. The experimental results also show that the proposed technique for detecting correlation among phenomena performs as well as the traditional Pearson correlation formula but is much faster.
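One way to see why a frequency-domain correlation can agree with the Pearson formula is Parseval's theorem: for windows normalized to zero mean and unit norm, the Pearson coefficient equals the inner product of their unitary DFT coefficients, and keeping only the first few coefficients gives an approximation. The sketch below illustrates that standard identity. It is an assumption that the paper's algorithm follows this form, and all function names are hypothetical.

```python
import cmath
import math


def normalise(window):
    """Shift a window to zero mean and scale it to unit Euclidean norm."""
    mean = sum(window) / len(window)
    centred = [x - mean for x in window]
    norm = math.sqrt(sum(v * v for v in centred))
    if norm == 0:
        raise ValueError("constant window has undefined correlation")
    return [v / norm for v in centred]


def unitary_dft(window, m):
    """Return the m-th DFT coefficient with 1/sqrt(w) scaling."""
    w = len(window)
    total = sum(
        x * cmath.exp(-2j * math.pi * m * t / w) for t, x in enumerate(window)
    )
    return total / math.sqrt(w)


def pearson(x, y):
    """Time-domain Pearson correlation coefficient."""
    return sum(a * b for a, b in zip(normalise(x), normalise(y)))


def dft_correlation(x, y, f):
    """Approximate the Pearson coefficient from DFT coefficients 1..f only.

    For real-valued windows, coefficient w - m is the complex conjugate of
    coefficient m, so each retained term is counted twice. This requires
    f < w / 2. The DC term is zero after normalisation and is skipped.
    """
    if not 1 <= f < len(x) / 2:
        raise ValueError("f must satisfy 1 <= f < w / 2")
    xn, yn = normalise(x), normalise(y)
    total = 0.0
    for m in range(1, f + 1):
        total += 2 * (unitary_dft(xn, m) * unitary_dft(yn, m).conjugate()).real
    return total
```

When most of the energy of both windows sits in the retained low frequencies, the truncated sum is close to the exact coefficient; for an odd window size w and f = (w - 1) / 2 the two agree up to rounding error.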

Abbreviations

x: Data stream vector in the time domain

X: Data stream vector in the frequency domain

N: The number of data streams

w: The size of a sliding window

f: The number of DFT coefficients used to represent a stream

d: Distance between neighbor streams

k: Number of neighbors considered for every stream

C: The number of true clusters in a data set

G: The number of clusters generated by the proposed algorithm

CorrT: Correlation coefficient threshold

Author information

Corresponding author

Correspondence to Ibrahim Kamel.

Additional information

Communicated by Ahmed K. Elmagarmid.

About this article

Cite this article

Kamel, I., Al Aghbari, Z. & Awad, T. MG-join: detecting phenomena and their correlation in high dimensional data streams. Distrib Parallel Databases 28, 67–92 (2010). https://doi.org/10.1007/s10619-010-7065-4
