Efficient discovery of longest-lasting correlation in sequence databases

Li, Yuhong; U, Leong Hou; Yiu, Man Lung; Gong, Zhiguo

doi:10.1007/s00778-016-0432-7


Regular Paper
Published: 23 June 2016

Volume 25, pages 767–790, (2016)

The VLDB Journal

Yuhong Li¹,
Leong Hou U¹,
Man Lung Yiu² &
…
Zhiguo Gong¹

1909 Accesses
7 Citations

Abstract

The search for similar subsequences is a core module for various analytical tasks in sequence databases. Typically, the similarity computations require users to set a length. However, there is no robust means by which to define the proper length for different application needs. In this study, we examine a new query that returns the longest-lasting highly correlated subsequences in a sequence database, which is particularly helpful for analyses that lack prior knowledge of the query length. A baseline, yet expensive, solution is to calculate the correlations for every possible subsequence length. To boost performance, we study a space-constrained index that provides a tight correlation bound for subsequences of similar lengths and offsets by means of intraobject and interobject grouping techniques. To the best of our knowledge, this is the first index to support a normalized distance metric for subsequences of arbitrary length. In addition, we study the use of a smart cache for disk-resident data (e.g., millions of sequence objects) and a graphics processing unit (GPU)-based parallel processing technique for frequently updated data (e.g., nonindexable streaming sequences) to compute the longest-lasting highly correlated subsequences. Extensive experimental evaluation on both real and synthetic sequence datasets verifies the efficiency and effectiveness of our proposed methods.
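
The baseline mentioned in the abstract, computing correlations for every possible subsequence length, can be illustrated with a minimal sketch. It is not the authors' algorithm: the preview does not give the formal query definition, so the sketch assumes a single query `q` and a single object `o`, Pearson correlation as the measure, and lock-step windows that start at the same offset in both sequences. The function names and the `threshold` and `min_len` parameters are hypothetical.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation of two equal-length 1-D arrays (0.0 if either is constant)."""
    xc = x - x.mean()
    yc = y - y.mean()
    denom = np.sqrt((xc * xc).sum() * (yc * yc).sum())
    return float((xc * yc).sum() / denom) if denom > 0 else 0.0

def longest_correlated_brute_force(q, o, threshold, min_len=2):
    """Scan every length (longest first) and every aligned offset.

    Assumption: the window covers the same positions in q and o.
    Returns (length, offset) of the first qualifying window, or None.
    """
    q = np.asarray(q, dtype=float)
    o = np.asarray(o, dtype=float)
    m = min(len(q), len(o))
    for length in range(m, min_len - 1, -1):
        for off in range(0, m - length + 1):
            if pearson(q[off:off + length], o[off:off + length]) >= threshold:
                return length, off
    return None
```

Each correlation costs time linear in the window length and there are quadratically many windows, so this scan is cubic in the sequence length for a single pair of sequences, which suggests why the abstract calls the baseline expensive.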


Notes

In this work the terms ‘correlation’ and ‘similarity’ are interchangeable.
Lock-step measures may be preferable for sequences whose values are collected periodically (e.g., financial data and sensor values).
http://finance.yahoo.com/.
http://www.google.com/trends/correlate.
BP is the abbreviation of British Petroleum.
http://en.wikipedia.org/wiki/Deepwater_Horizon_oil_spill.
http://www.pmel.noaa.gov/tao/.
As reported by [44], a query is trivially correlated to two close subsequences, which may give a meaningless result.
The values of $\mu$ and $\sigma$ may change significantly across subsequences of different lengths.
The proof can be found in our preliminary study [29].
The space overhead (i.e., O(2m)) of $S_{q}$ and $S_{q^2}$ is negligible.
For clarity, the time complexity of every single step is stated in the pseudocode.
Our techniques are also applicable to p-norm distance. The details are omitted for simplicity.
We assume that the number of objects is larger than the maximum number of threads supported by the GPU device.
No locking is required as each bit flag is accessed only by its corresponding thread.
http://finance.yahoo.com/.
http://vision.stanford.edu/aditya86/ImageNetDogs/main.html.
http://www.pmel.noaa.gov/tao/.
http://degroup.cis.umac.mo/%7Eyuhong/lcs/.
The length pruning technique is based on a lower bound derived from the maximum normalized z-value in a sequence over all possible means and variances at all lengths [37].
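
The notes on $\mu$, $\sigma$, and the O(2m) overhead of $S_{q}$ and $S_{q^2}$ can be made concrete with a small sketch. It assumes, which this preview does not spell out, that $S_{q}$ and $S_{q^2}$ are running (prefix) sums of the query values and of their squares. Under that assumption, the mean and standard deviation of any subsequence follow from two subtractions per array, whatever its length. The function names are hypothetical.

```python
import numpy as np

def build_prefix_sums(q):
    """Return prefix sums of q and of q squared, each with len(q) + 1 entries."""
    q = np.asarray(q, dtype=float)
    s_q = np.concatenate(([0.0], np.cumsum(q)))
    s_q2 = np.concatenate(([0.0], np.cumsum(q * q)))
    return s_q, s_q2

def window_mean_std(s_q, s_q2, start, length):
    """Mean and population standard deviation of q[start:start + length] in O(1)."""
    total = s_q[start + length] - s_q[start]
    total_sq = s_q2[start + length] - s_q2[start]
    mean = total / length
    # Clamp tiny negative values caused by floating-point cancellation.
    var = max(total_sq / length - mean * mean, 0.0)
    return mean, var ** 0.5
```

For example, after `s_q, s_q2 = build_prefix_sums(q)`, calling `window_mean_std(s_q, s_q2, 1, 4)` gives the statistics of `q[1:5]` without rescanning the values. The two arrays together hold about 2m numbers for a query of length m.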

References

Agrawal, R., Faloutsos, C., Swami, A.N.: Efficient similarity search in sequence databases. In: FODO, pp. 69–84 (1993)
Assent, I., Krieger, R., Afschari, F., Seidl, T.: The TS-tree: efficient time series search and retrieval. In: EDBT, pp. 252–263 (2008)
Athitsos, V., Papapetrou, P., Potamias, M., Kollios, G., Gunopulos, D.: Approximate embedding-based subsequence matching of time series. In: SIGMOD, pp. 365–378 (2008)
Beckmann, N., Kriegel, H.-P., Schneider, R., Seeger, B.: The R*-Tree: an efficient and robust access method for points and rectangles. In: SIGMOD, pp. 322–331 (1990)
Bentley, J.L.: Multidimensional binary search trees used for associative searching. Commun. ACM 18(9), 509–517 (1975)
Bragge, T., Tarvainen, M., Karjalainen, P.A.: High-resolution QRS detection algorithm for sparsely sampled ECG recordings. University of Kuopio, Department of Applied Physics Report (2004)
Camerra, A., Palpanas, T., Shieh, J., Keogh, E.J.: iSAX 2.0: Indexing and mining one billion time series. In: ICDM, pp. 58–67 (2010)
Chan, K.P., Fu, A.W.-C.: Efficient time series matching by wavelets. In: ICDE, pp. 126–133 (1999)
Chandola, V., Mithal, V., Kumar, V.: Comparative evaluation of anomaly detection techniques for sequence data. In: ICDM, pp. 743–748 (2008)
Chang, C.-I.: Hyperspectral imaging: techniques for spectral detection and classification. Plenum Publishing Co., New York (2003)
Chang, K., Deka, B., Hwu, W.W., Roth, D.: Efficient pattern-based time series classification on GPU. In: ICDM, pp. 131–140 (2012)
Chen, Q., Chen, L., Lian, X., Liu, Y., Yu, J.X.: Indexable PLA for efficient similarity search. In: VLDB, pp. 435–446 (2007)
Chen, Y., Keogh, E., Hu, B., Begum, N., Bagnall, A., Mueen, A., Batista, G.: The UCR time series classification archive, July (2015). www.cs.ucr.edu/~eamonn/time_series_data/
Cole, R., Shasha, D., Zhao, X.: Fast window correlations over uncooperative time series. In: KDD, pp. 743–749 (2005)
NVIDIA CUDA Programming Guide. http://docs.nvidia.com/cuda/cuda-cprogramming-guide/index.html
Ding, H., Trajcevski, G., Scheuermann, P., Wang, X., Keogh, E.J.: Querying and mining of time series data: experimental comparison of representations and distance measures. PVLDB 1(2), 1542–1552 (2008)
Duda, R.O., Hart, P.E., et al.: Pattern classification and scene analysis, vol. 3. Wiley, New York (1973)
Faloutsos, C., Ranganathan, M., Manolopoulos, Y.: Fast subsequence matching in time-series databases. In: SIGMOD, pp. 419–429 (1994)
Filho, R.F.S., Traina, A.J.M., Traina Jr., C., Faloutsos, C.: Similarity search without tears: the omni family of all-purpose access methods. In: ICDE, pp. 623–630 (2001)
Jiang, X., Li, C., Luo, P., Wang, M., Yu, Y.: Prominent streak discovery in sequence data. In: KDD, pp. 1280–1288 (2011)
Kahveci, T., Singh, A.K.: Optimizing similarity search for arbitrary length time series queries. IEEE TKDE 16(4), 418–433 (2004)
Keogh, E.J., Chakrabarti, K., Mehrotra, S., Pazzani, M.J.: Locally adaptive dimensionality reduction for indexing large time series databases. In: SIGMOD, pp. 151–162 (2001)
Keogh, E.J., Chakrabarti, K., Pazzani, M.J., Mehrotra, S.: Dimensionality reduction for fast similarity search in large time series databases. Knowl. Inf. Syst. 3(3), 263–286 (2001)
Keogh, E.J., Kasetty, S.: On the need for time series data mining benchmarks: a survey and empirical demonstration. Data Min. Knowl. Discov. 7(4), 349–371 (2003)
Keogh, E.J., Wei, L., Xi, X., Vlachos, M., Lee, S.-H., Protopapas, P.: Supporting exact indexing of arbitrarily rotated shapes and periodic time series under Euclidean and warping distance measures. VLDB J. 18(3), 611–630 (2009)
Khosla, A., Jayadevaprakash, N., Yao, B., Li, F.-F.: Novel dataset for fine-grained image categorization: Stanford dogs. In: CVPR Workshop on Fine-Grained Visual Categorization (FGVC) (2011)
Korn, F., Jagadish, H.V., Faloutsos, C.: Efficiently supporting ad hoc queries in large datasets of time sequences. In: SIGMOD, pp. 289–300 (1997)
Kristoufek, L.: Measuring correlations between non-stationary series with DCCA coefficient. Phys. A 402, 291–298 (2014)
Li, Y., U, L.H., Yiu, M.L., Gong, Z.: Discovering longest-lasting correlation in sequence databases. PVLDB 6(14), 1666–1677 (2013)
Liao, T.W.: Clustering of time series data—a survey. Pattern Recogn. 38(11), 1857–1874 (2005)
Lim, S.-H., Park, H., Kim, S.-W.: Using multiple indexes for efficient subsequence matching in time-series databases. Inf. Sci. 177(24), 5691–5706 (2007)
Lin, J., Keogh, E.J., Wei, L., Lonardi, S.: Experiencing SAX: a novel symbolic representation of time series. Data Min. Knowl. Discov. 15(2), 107–144 (2007)
Monte Carlo simulated stock price generator. http://www.investopedia.com/articles/investing/102715/simulating-stockprices-using-excel.asp
Micikevicius, P.: Advanced CUDA C (2009). http://www.nvidia.com/content/GTC/documents/1029_GTC09.pdf
Moon, B., Jagadish, H.V., Faloutsos, C., Saltz, J.H.: Analysis of the clustering properties of the Hilbert space-filling curve. IEEE TKDE 13(1), 124–141 (2001)
Morton, G.M.: A Computer Oriented Geodetic Data Base and a New Technique in File Sequencing. International Business Machines Company, New York (1966)
Mueen, A., Hamooni, H., Estrada, T.: Time series join on subsequence correlation. In: ICDM, pp. 450–459 (2014)
Mueen, A., Keogh, E.J., Shamlo, N.B.: Finding time series motifs in disk-resident data. In: ICDM, pp. 367–376 (2009)
Mueen, A., Keogh, E.J., Young, N.: Logical-shapelets: an expressive primitive for time series classification. In: KDD, pp. 1154–1162 (2011)
Mueen, A., Keogh, E.J., Zhu, Q., Cash, S., Westover, M.B.: Exact discovery of time series motifs. In: SDM, pp. 473–484 (2009)
Mueen, A., Nath, S., Liu, J.: Fast approximate correlation for massive time-series data. In: SIGMOD, pp. 171–182 (2010)
Papapetrou, P., Athitsos, V., Potamias, M., Kollios, G., Gunopulos, D.: Embedding-based subsequence matching in time-series databases. ACM TODS 36(3), 17 (2011)
Paparrizos, J., Gravano, L.: k-shape: Efficient and accurate clustering of time series. In: SIGMOD, pp. 1855–1870 (2015)
Patel, P., Keogh, E.J., Lin, J., Lonardi, S.: Mining motifs in massive time series databases. In: ICDM, pp. 370–377 (2002)
Rafiei, D.: On similarity-based queries for time series data. In: ICDE, pp. 410–417 (1999)
Rakthanmanon, T., Campana, B.J.L., Mueen, A., Batista, G.E.A.P.A., Westover, M.B., Zhu, Q., Zakaria, J., Keogh, E.J.: Searching and mining trillions of time series subsequences under dynamic time warping. In: KDD, pp. 262–270 (2012)
Sakurai, Y., Papadimitriou, S., Faloutsos, C.: BRAID: stream mining through group lag correlations. In: SIGMOD, pp. 599–610 (2005)
Sart, D., Mueen, A., Najjar, W.A., Keogh, E.J., Niennattrakul, V.: Accelerating dynamic time warping subsequence search with GPUs and FPGAs. In: ICDM, pp. 1001–1006 (2010)
Smith, J.E., Goodman, J.R.: Instruction cache replacement policies and organizations. IEEE Trans. Comput. 34(3), 234–241 (1985)
Yi, B.-K., Faloutsos, C.: Fast time sequence indexing for arbitrary Lp norms. In: VLDB, pp. 385–394 (2000)
Yianilos, P.N.: Data structures and algorithms for nearest neighbor search in general metric spaces. In: SODA, pp. 311–321 (1993)
Zebende, G.: DCCA cross-correlation coefficient: quantifying level of cross-correlation. Phys. A 390(4), 614–618 (2011)
Zhu, Y., Shasha, D.: StatStream: statistical monitoring of thousands of data streams in real time. In: VLDB, pp. 358–369 (2002)


Acknowledgments

This work was supported by Grants MYRG109(Y1-L3)-FST12ULH, MYRG2014-00106-FST, MYRG105-FST13-GZG, and MYRG2015-00070-FST from UMAC Research Committee, Grants FDCT/106/2012/A3 and FDCT/116/2013/A3 from FDCT Macau, and Grant GRF 152043/15E from Hong Kong RGC.

Author information

Authors and Affiliations

Department of Computer and Information Science, University of Macau, Macau SAR, China
Yuhong Li, Leong Hou U & Zhiguo Gong
Department of Computing, Hong Kong Polytechnic University, Hong Kong SAR, China
Man Lung Yiu


Corresponding author

Correspondence to Leong Hou U.


About this article

Cite this article

Li, Y., U, L.H., Yiu, M.L. et al. Efficient discovery of longest-lasting correlation in sequence databases. The VLDB Journal 25, 767–790 (2016). https://doi.org/10.1007/s00778-016-0432-7


Received: 22 September 2015
Revised: 29 May 2016
Accepted: 06 June 2016
Published: 23 June 2016
Issue Date: December 2016
DOI: https://doi.org/10.1007/s00778-016-0432-7
