Efficient discovery of longest-lasting correlation in sequence databases

  • Regular Paper
  • The VLDB Journal

Abstract

Searching for similar subsequences is a core module of many analytical tasks in sequence databases. Typically, the similarity computation requires users to specify a length. However, there is no robust way to choose the proper length for different application needs. In this study, we examine a new query that returns the longest-lasting highly correlated subsequences in a sequence database, which is particularly helpful for analyses that lack prior knowledge of the query length. A baseline, yet expensive, solution is to calculate the correlations for every possible subsequence length. To boost performance, we study a space-constrained index that provides a tight correlation bound for subsequences of similar length and offset, using intra-object and inter-object grouping techniques. To the best of our knowledge, this is the first index to support a normalized distance metric over subsequences of arbitrary length. In addition, we study the use of a smart cache for disk-resident data (e.g., millions of sequence objects) and a graphics processing unit (GPU)-based parallel processing technique for frequently updated data (e.g., nonindexable streaming sequences) to compute the longest-lasting highly correlated subsequences. Extensive experimental evaluation on both real and synthetic sequence datasets verifies the efficiency and effectiveness of our proposed methods.
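
The baseline mentioned in the abstract (computing the correlation for every possible subsequence length) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: it assumes a single query and a single object sequence that are compared in lock-step at the same offset, uses Pearson correlation as the measure, and the names `pearson`, `longest_correlated`, `threshold`, and `min_len` are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def longest_correlated(q, s, threshold, min_len=3):
    """Brute-force search over every length (longest first) and every offset.

    Assumption: q and s are aligned, so the window [off, off + length)
    is compared at the same position in both sequences.
    Returns (offset, length) of the first qualifying window, or None.
    """
    n = min(len(q), len(s))
    for length in range(n, min_len - 1, -1):
        for off in range(n - length + 1):
            if pearson(q[off:off + length], s[off:off + length]) >= threshold:
                return off, length
    return None


if __name__ == "__main__":
    q = [1, 2, 3, 4, 5, 6]
    s = [1, 2, 3, 4, 0, 9]
    print(longest_correlated(q, s, threshold=0.99))
```

Because every candidate window recomputes its mean and variance from scratch, the cost grows quickly with the sequence length, which is the expense the index, cache, and GPU techniques in the article aim to avoid.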

Notes

  1. In this work the terms ‘correlation’ and ‘similarity’ are interchangeable.

  2. Lock-step measures may be preferable for sequences whose values are collected periodically (e.g., financial data and sensor values).

  3. http://finance.yahoo.com/.

  4. http://www.google.com/trends/correlate.

  5. BP is the abbreviation of British Petroleum.

  6. http://en.wikipedia.org/wiki/Deepwater_Horizon_oil_spill.

  7. http://www.pmel.noaa.gov/tao/.

  8. As reported by [44], a query is trivially correlated to two close subsequences, which may give a meaningless result.

  9. The values of \(\mu \) and \(\sigma \) may change significantly across subsequences of different lengths.

  10. The proof can be found in our preliminary study [29].

  11. The space overhead (i.e., O(2m)) of \(S_{q}\) and \(S_{q^2}\) is negligible.

  12. For clarity, the time complexity of every single step is stated in the pseudocode.

  13. Our techniques are also applicable to p-norm distance. The details are omitted for simplicity.

  14. This assumes that the number of objects is larger than the maximum number of threads supported by the GPU device.

  15. No locking is required as each bit flag is accessed only by its corresponding thread.

  16. http://finance.yahoo.com/.

  17. http://vision.stanford.edu/aditya86/ImageNetDogs/main.html.

  18. http://www.pmel.noaa.gov/tao/.

  19. http://degroup.cis.umac.mo/%7Eyuhong/lcs/.

  20. The length pruning technique is based on a lower bound derived from the maximum normalized z-value in a sequence over all possible means and variances at all lengths [37].
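
Notes 9 and 11 can be illustrated with a short sketch. It assumes that \(S_{q}\) and \(S_{q^2}\) denote running (prefix) sums of the query values and of their squares, which this page does not state explicitly, and the function names `build_prefix_sums` and `window_mean_std` are hypothetical. Under that assumption, two arrays of about m entries each are enough to obtain \(\mu \) and \(\sigma \) of any subsequence in constant time.

```python
from itertools import accumulate
from math import sqrt


def build_prefix_sums(q):
    """Return running sums of q and of q squared, each with a leading 0.

    Assumed reading of S_q and S_{q^2}: entry i holds the sum of the
    first i values (or squared values) of q.
    """
    s_q = [0.0] + list(accumulate(q))
    s_q2 = [0.0] + list(accumulate(v * v for v in q))
    return s_q, s_q2


def window_mean_std(s_q, s_q2, start, length):
    """Mean and population standard deviation of q[start:start + length]."""
    total = s_q[start + length] - s_q[start]
    total_sq = s_q2[start + length] - s_q2[start]
    mean = total / length
    # Clamp at zero to guard against tiny negative values from rounding.
    variance = max(total_sq / length - mean * mean, 0.0)
    return mean, sqrt(variance)


if __name__ == "__main__":
    q = [2, 4, 4, 4, 5, 5, 7, 9]
    s_q, s_q2 = build_prefix_sums(q)
    print(window_mean_std(s_q, s_q2, 0, 8))  # whole sequence
    print(window_mean_std(s_q, s_q2, 1, 3))  # the window [4, 4, 4]
```

The two calls give different means and standard deviations for the same sequence, which is the effect Note 9 describes: normalization statistics have to be recomputed for each subsequence length and offset.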

References

  1. Agrawal, R., Faloutsos, C., Swami, A.N.: Efficient similarity search in sequence databases. In: FODO, pp. 69–84 (1993)

  2. Assent, I., Krieger, R., Afschari, F., Seidl, T.: The TS-tree: efficient time series search and retrieval. In: EDBT, pp. 252–263 (2008)

  3. Athitsos, V., Papapetrou, P., Potamias, M., Kollios, G., Gunopulos, D.: Approximate embedding-based subsequence matching of time series. In: SIGMOD, pp. 365–378 (2008)

  4. Beckmann, N., Kriegel, H.-P., Schneider, R., Seeger, B.: The R*-Tree: an efficient and robust access method for points and rectangles. In: SIGMOD, pp. 322–331 (1990)

  5. Bentley, J.L.: Multidimensional binary search trees used for associative searching. Commun. ACM 18(9), 509–517 (1975)

  6. Bragge, T., Tarvainen, M., Karjalainen, P.A.: High-resolution QRS detection algorithm for sparsely sampled ECG recordings. University of Kuopio, Department of Applied Physics Report (2004)

  7. Camerra, A., Palpanas, T., Shieh, J., Keogh, E.J.: iSAX 2.0: Indexing and mining one billion time series. In: ICDM, pp. 58–67 (2010)

  8. Chan, K.P., Fu, A.W.-C.: Efficient time series matching by wavelets. In: ICDE, pp. 126–133 (1999)

  9. Chandola, V., Mithal, V., Kumar, V.: Comparative evaluation of anomaly detection techniques for sequence data. In: ICDM, pp. 743–748 (2008)

  10. Chang, C.-I.: Hyperspectral imaging: techniques for spectral detection and classification. Plenum Publishing Co., New York (2003)

  11. Chang, K., Deka, B., Hwu, W.W., Roth, D.: Efficient pattern-based time series classification on GPU. In: ICDM, pp. 131–140 (2012)

  12. Chen, Q., Chen, L., Lian, X., Liu, Y., Yu, J.X.: Indexable PLA for efficient similarity search. In: VLDB, pp. 435–446 (2007)

  13. Chen, Y., Keogh, E., Hu, B., Begum, N., Bagnall, A., Mueen, A., Batista, G.: The UCR time series classification archive, July (2015). www.cs.ucr.edu/~eamonn/time_series_data/

  14. Cole, R., Shasha, D., Zhao, X.: Fast window correlations over uncooperative time series. In: KDD, pp. 743–749 (2005)

  15. NVIDIA CUDA Programming Guide. http://docs.nvidia.com/cuda/cuda-cprogramming-guide/index.html

  16. Ding, H., Trajcevski, G., Scheuermann, P., Wang, X., Keogh, E.J.: Querying and mining of time series data: experimental comparison of representations and distance measures. PVLDB 1(2), 1542–1552 (2008)

  17. Duda, R.O., Hart, P.E., et al.: Pattern classification and scene analysis, vol. 3. Wiley, New York (1973)

  18. Faloutsos, C., Ranganathan, M., Manolopoulos, Y.: Fast subsequence matching in time-series databases. In: SIGMOD, pp. 419–429 (1994)

  19. Filho, R.F.S., Traina, A.J.M., Traina Jr., C., Faloutsos, C.: Similarity search without tears: the OMNI family of all-purpose access methods. In: ICDE, pp. 623–630 (2001)

  20. Jiang, X., Li, C., Luo, P., Wang, M., Yu, Y.: Prominent streak discovery in sequence data. In: KDD, pp. 1280–1288 (2011)

  21. Kahveci, T., Singh, A.K.: Optimizing similarity search for arbitrary length time series queries. IEEE TKDE 16(4), 418–433 (2004)

  22. Keogh, E.J., Chakrabarti, K., Mehrotra, S., Pazzani, M.J.: Locally adaptive dimensionality reduction for indexing large time series databases. In: SIGMOD, pp. 151–162 (2001)

  23. Keogh, E.J., Chakrabarti, K., Pazzani, M.J., Mehrotra, S.: Dimensionality reduction for fast similarity search in large time series databases. Knowl. Inf. Syst. 3(3), 263–286 (2001)

  24. Keogh, E.J., Kasetty, S.: On the need for time series data mining benchmarks: a survey and empirical demonstration. Data Min. Knowl. Discov. 7(4), 349–371 (2003)

  25. Keogh, E.J., Wei, L., Xi, X., Vlachos, M., Lee, S.-H., Protopapas, P.: Supporting exact indexing of arbitrarily rotated shapes and periodic time series under Euclidean and warping distance measures. VLDB J. 18(3), 611–630 (2009)

  26. Khosla, A., Jayadevaprakash, N., Yao, B., Li, F.-F.: Novel dataset for fine-grained image categorization: Stanford dogs. In: CVPR Workshop on Fine-Grained Visual Categorization (FGVC) (2011)

  27. Korn, F., Jagadish, H.V., Faloutsos, C.: Efficiently supporting ad hoc queries in large datasets of time sequences. In: SIGMOD, pp. 289–300 (1997)

  28. Kristoufek, L.: Measuring correlations between non-stationary series with DCCA coefficient. Phys. A 402, 291–298 (2014)

  29. Li, Y., U, L.H., Yiu, M.L., Gong, Z.: Discovering longest-lasting correlation in sequence databases. PVLDB 6(14), 1666–1677 (2013)

  30. Liao, T.W.: Clustering of time series data—a survey. Pattern Recogn. 38(11), 1857–1874 (2005)

  31. Lim, S.-H., Park, H., Kim, S.-W.: Using multiple indexes for efficient subsequence matching in time-series databases. Inf. Sci. 177(24), 5691–5706 (2007)

  32. Lin, J., Keogh, E.J., Wei, L., Lonardi, S.: Experiencing SAX: a novel symbolic representation of time series. Data Min. Knowl. Discov. 15(2), 107–144 (2007)

  33. Monte Carlo simulated stock price generator. http://www.investopedia.com/articles/investing/102715/simulating-stockprices-using-excel.asp

  34. Micikevicius, P.: Advanced CUDA C (2009). http://www.nvidia.com/content/GTC/documents/1029_GTC09.pdf

  35. Moon, B., Jagadish, H.V., Faloutsos, C., Saltz, J.H.: Analysis of the clustering properties of the Hilbert space-filling curve. IEEE TKDE 13(1), 124–141 (2001)

  36. Morton, G.M.: A Computer Oriented Geodetic Data Base and a New Technique in File Sequencing. International Business Machines Company, New York (1966)

  37. Mueen, A., Hamooni, H., Estrada, T.: Time series join on subsequence correlation. In: ICDM, pp. 450–459 (2014)

  38. Mueen, A., Keogh, E.J., Shamlo, N.B.: Finding time series motifs in disk-resident data. In: ICDM, pp. 367–376 (2009)

  39. Mueen, A., Keogh, E.J., Young, N.: Logical-shapelets: an expressive primitive for time series classification. In: KDD, pp. 1154–1162 (2011)

  40. Mueen, A., Keogh, E.J., Zhu, Q., Cash, S., Westover, M.B.: Exact discovery of time series motifs. In: SDM, pp. 473–484 (2009)

  41. Mueen, A., Nath, S., Liu, J.: Fast approximate correlation for massive time-series data. In: SIGMOD, pp. 171–182 (2010)

  42. Papapetrou, P., Athitsos, V., Potamias, M., Kollios, G., Gunopulos, D.: Embedding-based subsequence matching in time-series databases. ACM TODS 36(3), 17 (2011)

  43. Paparrizos, J., Gravano, L.: k-shape: Efficient and accurate clustering of time series. In: SIGMOD, pp. 1855–1870 (2015)

  44. Patel, P., Keogh, E.J., Lin, J., Lonardi, S.: Mining motifs in massive time series databases. In: ICDM, pp. 370–377 (2002)

  45. Rafiei, D.: On similarity-based queries for time series data. In: ICDE, pp. 410–417 (1999)

  46. Rakthanmanon, T., Campana, B.J.L., Mueen, A., Batista, G.E.A.P.A., Westover, M.B., Zhu, Q., Zakaria, J., Keogh, E.J.: Searching and mining trillions of time series subsequences under dynamic time warping. In: KDD, pp. 262–270 (2012)

  47. Sakurai, Y., Papadimitriou, S., Faloutsos, C.: BRAID: stream mining through group lag correlations. In: SIGMOD, pp. 599–610 (2005)

  48. Sart, D., Mueen, A., Najjar, W.A., Keogh, E.J., Niennattrakul, V.: Accelerating dynamic time warping subsequence search with GPUs and FPGAs. In: ICDM, pp. 1001–1006 (2010)

  49. Smith, J.E., Goodman, J.R.: Instruction cache replacement policies and organizations. IEEE Trans. Comput. 34(3), 234–241 (1985)

  50. Yi, B.-K., Faloutsos, C.: Fast time sequence indexing for arbitrary Lp norms. In: VLDB, pp. 385–394 (2000)

  51. Yianilos, P.N.: Data structures and algorithms for nearest neighbor search in general metric spaces. In: SODA, pp. 311–321 (1993)

  52. Zebende, G.: DCCA cross-correlation coefficient: quantifying level of cross-correlation. Phys. A 390(4), 614–618 (2011)

  53. Zhu, Y., Shasha, D.: StatStream: Statistical monitoring of thousands of data streams in real time. In: VLDB, pp. 358–369 (2002)

Acknowledgments

This work was supported by Grants MYRG109(Y1-L3)-FST12ULH, MYRG2014-00106-FST, MYRG105-FST13-GZG, and MYRG2015-00070-FST from UMAC Research Committee, Grants FDCT/106/2012/A3 and FDCT/116/2013/A3 from FDCT Macau, and Grant GRF 152043/15E from Hong Kong RGC.

Author information

Corresponding author

Correspondence to Leong Hou U.

About this article

Cite this article

Li, Y., U, L.H., Yiu, M.L. et al. Efficient discovery of longest-lasting correlation in sequence databases. The VLDB Journal 25, 767–790 (2016). https://doi.org/10.1007/s00778-016-0432-7

  • DOI: https://doi.org/10.1007/s00778-016-0432-7
