Abstract
The search for similar subsequences is a core module for various analytical tasks in sequence databases. Typically, the similarity computations require users to set a subsequence length. However, there is no robust means of choosing the proper length for different application needs. In this study, we examine a new query that returns the longest-lasting highly correlated subsequences in a sequence database, which is particularly helpful for analyses that lack prior knowledge of the query length. A baseline, yet expensive, solution is to calculate the correlations for every possible subsequence length. To boost performance, we study a space-constrained index that provides a tight correlation bound for subsequences of similar lengths and offsets by means of intraobject and interobject grouping techniques. To the best of our knowledge, this is the first index to support a normalized distance metric on arbitrary-length subsequences. In addition, we study the use of a smart cache for disk-resident data (e.g., millions of sequence objects) and a graphics processing unit (GPU)-based parallel processing technique for frequently updated data (e.g., nonindexable streaming sequences) to compute the longest-lasting highly correlated subsequences. Extensive experimental evaluation on both real and synthetic sequence datasets verifies the efficiency and effectiveness of our proposed methods.
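The baseline mentioned above (computing the correlation for every possible subsequence length) can be sketched as follows. This is only an illustrative sketch, not the authors' algorithm: it assumes Pearson correlation as the similarity measure, assumes that the query and object subsequences are compared at the same offset (a lock-step comparison), and uses hypothetical names (`pearson`, `longest_correlated`, `theta`, `min_len`) that do not come from the article.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def longest_correlated(query, objects, theta, min_len=3):
    """Brute-force search: try lengths from longest to shortest and return the
    first (length, object_index, start) whose aligned windows have
    correlation >= theta, or None if no window qualifies.
    Assumption: every object is at least as long as the query."""
    m = len(query)
    for length in range(m, min_len - 1, -1):
        for obj_id, obj in enumerate(objects):
            for start in range(m - length + 1):
                q_win = query[start:start + length]
                o_win = obj[start:start + length]
                if pearson(q_win, o_win) >= theta:
                    return length, obj_id, start
    return None

query = [1, 2, 3, 4, 0, 7]
objects = [[3, 3, 3, 3, 3, 3], [2, 4, 6, 8, 5, 1]]
result = longest_correlated(query, objects, theta=0.99)
print(result)
```

For a query of length m and n objects, this sketch evaluates on the order of n·m² windows at O(m) cost each, which is why the abstract calls the baseline expensive and motivates the index, cache, and GPU techniques.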
Notes
In this work the terms ‘correlation’ and ‘similarity’ are interchangeable.
Lock-step measures may be preferable for sequences whose values are collected periodically (e.g., financial data and sensor values).
BP is the abbreviation of British Petroleum.
As reported by [44], a query is trivially correlated to two close subsequences, which may give a meaningless result.
The values of \(\mu \) and \(\sigma \) may change significantly across subsequences of different lengths.
The proof can be found in our preliminary study [29].
The space overhead (i.e., O(2m)) of \(S_{q}\) and \(S_{q^2}\) is negligible.
For clarity, the time complexity of each step is stated in the pseudocode.
Our techniques are also applicable to p-norm distance. The details are omitted for simplicity.
We assume that the number of objects is larger than the maximum number of threads supported by the GPU device.
No locking is required as each bit flag is accessed only by its corresponding thread.
The length pruning technique is based on a lower bound derived from the maximum normalized z-value in a sequence over all possible means and variances at all lengths [37].
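The notes on \(\mu \), \(\sigma \), \(S_{q}\), and \(S_{q^2}\) can be illustrated with a short sketch. It assumes that \(S_{q}\) and \(S_{q^2}\) are running (prefix) sums of the query values and of their squares, which this excerpt does not define, and it uses the population standard deviation. The function names `build_prefix_sums` and `window_stats` are hypothetical.

```python
import math
from itertools import accumulate

def build_prefix_sums(q):
    """Return two arrays of m + 1 entries each: running sums of q and of q squared."""
    s_q = [0.0] + list(accumulate(q))
    s_q2 = [0.0] + list(accumulate(v * v for v in q))
    return s_q, s_q2

def window_stats(s_q, s_q2, start, length):
    """Mean and population standard deviation of q[start:start+length] in O(1)."""
    total = s_q[start + length] - s_q[start]
    total_sq = s_q2[start + length] - s_q2[start]
    mean = total / length
    # Clamp at zero to guard against tiny negative values from rounding.
    var = max(total_sq / length - mean * mean, 0.0)
    return mean, math.sqrt(var)

q = [1, 2, 3, 4, 0, 7]
s_q, s_q2 = build_prefix_sums(q)
mu_a, sigma_a = window_stats(s_q, s_q2, start=0, length=4)
mu_b, sigma_b = window_stats(s_q, s_q2, start=4, length=2)
print(mu_a, sigma_a)
print(mu_b, sigma_b)
```

Under these assumptions, the two arrays occupy space linear in the query length, and the mean and standard deviation needed to normalize any subsequence are available in constant time. The two windows in the example also show how strongly \(\mu \) and \(\sigma \) can differ between subsequences of different lengths.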
References
Agrawal, R., Faloutsos, C., Swami, A.N.: Efficient similarity search in sequence databases. In: FODO, pp. 69–84 (1993)
Assent, I., Krieger, R., Afschari, F., Seidl, T.: The TS-tree: efficient time series search and retrieval. In: EDBT, pp. 252–263 (2008)
Athitsos, V., Papapetrou, P., Potamias, M., Kollios, G., Gunopulos, D.: Approximate embedding-based subsequence matching of time series. In: SIGMOD, pp. 365–378 (2008)
Beckmann, N., Kriegel, H.-P., Schneider, R., Seeger, B.: The R*-Tree: an efficient and robust access method for points and rectangles. In: SIGMOD, pp. 322–331 (1990)
Bentley, J.L.: Multidimensional binary search trees used for associative searching. Commun. ACM 18(9), 509–517 (1975)
Bragge, T., Tarvainen, M., Karjalainen, P.A.: High-resolution QRS detection algorithm for sparsely sampled ECG recordings. University of Kuopio, Department of Applied Physics Report (2004)
Camerra, A., Palpanas, T., Shieh, J., Keogh, E.J.: iSAX 2.0: Indexing and mining one billion time series. In: ICDM, pp. 58–67 (2010)
Chan, K.P., Fu, A.W.-C.: Efficient time series matching by wavelets. In: ICDE, pp. 126–133 (1999)
Chandola, V., Mithal, V., Kumar, V.: Comparative evaluation of anomaly detection techniques for sequence data. In: ICDM, pp. 743–748 (2008)
Chang, C.-I.: Hyperspectral imaging: techniques for spectral detection and classification. Plenum Publishing Co., New York (2003)
Chang, K., Deka, B., Hwu, W.W., Roth, D.: Efficient pattern-based time series classification on GPU. In: ICDM, pp. 131–140 (2012)
Chen, Q., Chen, L., Lian, X., Liu, Y., Yu, J.X.: Indexable PLA for efficient similarity search. In: VLDB, pp. 435–446 (2007)
Chen, Y., Keogh, E., Hu, B., Begum, N., Bagnall, A., Mueen, A., Batista, G.: The UCR time series classification archive, July (2015). www.cs.ucr.edu/~eamonn/time_series_data/
Cole, R., Shasha, D., Zhao, X.: Fast window correlations over uncooperative time series. In: KDD, pp. 743–749 (2005)
NVIDIA CUDA Programming Guide. http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
Ding, H., Trajcevski, G., Scheuermann, P., Wang, X., Keogh, E.J.: Querying and mining of time series data: experimental comparison of representations and distance measures. PVLDB 1(2), 1542–1552 (2008)
Duda, R.O., Hart, P.E., et al.: Pattern classification and scene analysis, vol. 3. Wiley, New York (1973)
Faloutsos, C., Ranganathan, M., Manolopoulos, Y.: Fast subsequence matching in time-series databases. In: SIGMOD, pp. 419–429 (1994)
Filho, R.F.S., Traina, A.J.M., Traina Jr., C., Faloutsos, C.: Similarity search without tears: the omni family of all-purpose access methods. In: ICDE, pp. 623–630 (2001)
Jiang, X., Li, C., Luo, P., Wang, M., Yu, Y.: Prominent streak discovery in sequence data. In: KDD, pp. 1280–1288 (2011)
Kahveci, T., Singh, A.K.: Optimizing similarity search for arbitrary length time series queries. IEEE TKDE 16(4), 418–433 (2004)
Keogh, E.J., Chakrabarti, K., Mehrotra, S., Pazzani, M.J.: Locally adaptive dimensionality reduction for indexing large time series databases. In: SIGMOD, pp. 151–162 (2001)
Keogh, E.J., Chakrabarti, K., Pazzani, M.J., Mehrotra, S.: Dimensionality reduction for fast similarity search in large time series databases. Knowl. Inf. Syst. 3(3), 263–286 (2001)
Keogh, E.J., Kasetty, S.: On the need for time series data mining benchmarks: a survey and empirical demonstration. Data Min. Knowl. Discov. 7(4), 349–371 (2003)
Keogh, E.J., Wei, L., Xi, X., Vlachos, M., Lee, S.-H., Protopapas, P.: Supporting exact indexing of arbitrarily rotated shapes and periodic time series under Euclidean and warping distance measures. VLDB J. 18(3), 611–630 (2009)
Khosla, A., Jayadevaprakash, N., Yao, B., Li, F.-F.: Novel dataset for fine-grained image categorization: Stanford dogs. In: CVPR Workshop on Fine-Grained Visual Categorization (FGVC) (2011)
Korn, F., Jagadish, H.V., Faloutsos, C.: Efficiently supporting ad hoc queries in large datasets of time sequences. In: SIGMOD, pp. 289–300 (1997)
Kristoufek, L.: Measuring correlations between non-stationary series with DCCA coefficient. Phys. A 402, 291–298 (2014)
Li, Y., U, L.H., Yiu, M.L., Gong, Z.: Discovering longest-lasting correlation in sequence databases. PVLDB 6(14), 1666–1677 (2013)
Liao, T.W.: Clustering of time series data—a survey. Pattern Recogn. 38(11), 1857–1874 (2005)
Lim, S.-H., Park, H., Kim, S.-W.: Using multiple indexes for efficient subsequence matching in time-series databases. Inf. Sci. 177(24), 5691–5706 (2007)
Lin, J., Keogh, E.J., Wei, L., Lonardi, S.: Experiencing SAX: a novel symbolic representation of time series. Data Min. Knowl. Discov. 15(2), 107–144 (2007)
Monte Carlo simulated stock price generator. http://www.investopedia.com/articles/investing/102715/simulating-stock-prices-using-excel.asp
Micikevicius, P.: Advanced CUDA C (2009). http://www.nvidia.com/content/GTC/documents/1029_GTC09.pdf
Moon, B., Jagadish, H.V., Faloutsos, C., Saltz, J.H.: Analysis of the clustering properties of the Hilbert space-filling curve. IEEE TKDE 13(1), 124–141 (2001)
Morton, G.M.: A Computer Oriented Geodetic Data Base and a New Technique in File Sequencing. International Business Machines Company, New York (1966)
Mueen, A., Hamooni, H., Estrada, T.: Time series join on subsequence correlation. In: ICDM, pp. 450–459 (2014)
Mueen, A., Keogh, E.J., Shamlo, N.B.: Finding time series motifs in disk-resident data. In: ICDM, pp. 367–376 (2009)
Mueen, A., Keogh, E.J., Young, N.: Logical-shapelets: an expressive primitive for time series classification. In: KDD, pp. 1154–1162 (2011)
Mueen, A., Keogh, E.J., Zhu, Q., Cash, S., Westover, M.B.: Exact discovery of time series motifs. In: SDM, pp. 473–484 (2009)
Mueen, A., Nath, S., Liu, J.: Fast approximate correlation for massive time-series data. In: SIGMOD, pp. 171–182 (2010)
Papapetrou, P., Athitsos, V., Potamias, M., Kollios, G., Gunopulos, D.: Embedding-based subsequence matching in time-series databases. ACM TODS 36(3), 17 (2011)
Paparrizos, J., Gravano, L.: k-shape: Efficient and accurate clustering of time series. In: SIGMOD, pp. 1855–1870 (2015)
Patel, P., Keogh, E.J., Lin, J., Lonardi, S.: Mining motifs in massive time series databases. In: ICDM, pp. 370–377 (2002)
Rafiei, D.: On similarity-based queries for time series data. In: ICDE, pp. 410–417 (1999)
Rakthanmanon, T., Campana, B.J.L., Mueen, A., Batista, G.E.A.P.A., Westover, M.B., Zhu, Q., Zakaria, J., Keogh, E.J.: Searching and mining trillions of time series subsequences under dynamic time warping. In: KDD, pp. 262–270 (2012)
Sakurai, Y., Papadimitriou, S., Faloutsos, C.: BRAID: stream mining through group lag correlations. In: SIGMOD, pp. 599–610 (2005)
Sart, D., Mueen, A., Najjar, W.A., Keogh, E.J., Niennattrakul, V.: Accelerating dynamic time warping subsequence search with GPUs and FPGAs. In: ICDM, pp. 1001–1006 (2010)
Smith, J.E., Goodman, J.R.: Instruction cache replacement policies and organizations. IEEE Trans. Comput. 34(3), 234–241 (1985)
Yi, B.-K., Faloutsos, C.: Fast time sequence indexing for arbitrary Lp norms. In: VLDB, pp. 385–394 (2000)
Yianilos, P.N.: Data structures and algorithms for nearest neighbor search in general metric spaces. In: SODA, pp. 311–321 (1993)
Zebende, G.: DCCA cross-correlation coefficient: quantifying level of cross-correlation. Phys. A 390(4), 614–618 (2011)
Zhu, Y., Shasha, D.: StatStream: statistical monitoring of thousands of data streams in real time. In: VLDB, pp. 358–369 (2002)
Acknowledgments
This work was supported by Grants MYRG109(Y1-L3)-FST12ULH, MYRG2014-00106-FST, MYRG105-FST13-GZG, and MYRG2015-00070-FST from UMAC Research Committee, Grants FDCT/106/2012/A3 and FDCT/116/2013/A3 from FDCT Macau, and Grant GRF 152043/15E from Hong Kong RGC.
Cite this article
Li, Y., U, L.H., Yiu, M.L. et al. Efficient discovery of longest-lasting correlation in sequence databases. The VLDB Journal 25, 767–790 (2016). https://doi.org/10.1007/s00778-016-0432-7
DOI: https://doi.org/10.1007/s00778-016-0432-7