Fast Discovery of Group Lag Correlations in Streams

Published: 01 December 2010

Abstract

The study of data streams has received considerable attention in various communities (theory, databases, data mining, networking), due to several important applications, such as network analysis, sensor monitoring, financial data analysis, and moving object tracking. Our goal in this article is to monitor multiple numerical streams and determine which pairs are correlated with lags, as well as the value of each such lag. Lag correlations and anticorrelations are frequent and very interesting in practice. For example, a decrease in interest rates typically precedes an increase in house sales by a few months; higher amounts of fluoride in drinking water may lead to fewer dental cavities some years later. Other lag settings include network analysis, sensor monitoring, financial data analysis, and tracking of moving objects. Such data streams are often correlated or anticorrelated, but with unknown lag.
We propose BRAID, a method for detecting lag correlations among data streams. BRAID handles data streams of semi-infinite length incrementally, quickly, and with small resource consumption. However, BRAID requires space and time quadratic in the number of streams k. We also propose ThinBRAID, which is even faster than BRAID, requiring O(k) space and time per time tick. Our theoretical analysis shows that BRAID and ThinBRAID can estimate lag correlations with little or, often, no error. Our experiments on real and realistic data show that BRAID and ThinBRAID detect the correct lag perfectly most of the time (the largest relative error was about 1%), while running significantly faster (up to 40,000 times) than the naïve implementation.
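
To make the problem statement concrete, the following is a minimal sketch of a brute-force lag search between two finite sequences. It is not BRAID or ThinBRAID, and it is not necessarily the naïve implementation the experiments compare against; it only illustrates what "correlated with lag" means. The function names (pearson, lag_correlation, best_lag) and the max_lag parameter are assumptions introduced for this illustration. Taking the largest absolute correlation means anticorrelations are detected as well.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    if n < 2:
        return 0.0
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant sequence has no defined correlation
    return cov / math.sqrt(var_a * var_b)


def lag_correlation(x, y, lag):
    """Correlation between x[t] and y[t + lag], i.e. y delayed by `lag` ticks."""
    n = min(len(x), len(y))
    return pearson(x[: n - lag], y[lag:n])


def best_lag(x, y, max_lag):
    """Try every lag in 0..max_lag (hypothetical bound) and return the
    (lag, correlation) pair with the largest absolute correlation."""
    scores = [(lag, lag_correlation(x, y, lag)) for lag in range(max_lag + 1)]
    return max(scores, key=lambda s: abs(s[1]))


# Example: y is x delayed by 5 ticks (zero-padded at the start).
x = [math.sin(0.3 * t) for t in range(200)]
y = [0.0] * 5 + x[:-5]
lag, r = best_lag(x, y, max_lag=10)
print(lag, round(r, 3))
```

This sketch rescans the whole history for every candidate lag and every pair of sequences, so its cost grows with both the sequence length and the number of lags tried. That is the kind of recomputation an incremental streaming method is designed to avoid.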

Cited By

  • (2017) On sensor selection in linked information networks. Computer Networks: The International Journal of Computer and Telecommunications Networking 126:C (100-113). DOI: 10.1016/j.comnet.2017.05.024. Online publication date: 24-Oct-2017
  • (2014) Early detection of drought impact on rice paddies in Indonesia by means of Niño 3.4 index. Theoretical and Applied Climatology 121:3-4 (669-684). DOI: 10.1007/s00704-014-1258-0. Online publication date: 2-Sep-2014

    Published In

    ACM Transactions on Knowledge Discovery from Data, Volume 5, Issue 1
    December 2010
    199 pages
    ISSN:1556-4681
    EISSN:1556-472X
    DOI:10.1145/1870096
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 01 December 2010
    Accepted: 01 May 2010
    Revised: 01 March 2010
    Received: 01 May 2009
    Published in TKDD Volume 5, Issue 1

    Author Tags

    1. Time-series
    2. cross-correlation
    3. data streams

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)21
    • Downloads (Last 6 weeks)0
    Reflects downloads up to 14 Feb 2025
