ABSTRACT
Time series analysis, as an application of high-dimensional data mining, is a common task in biochemistry, meteorology, climate research, biomedicine, and marketing. Similarity search in data of increasing dimensionality results in exponential growth of the search space, an effect known as the Curse of Dimensionality. A common way to postpone this effect is to approximate the original data, reducing its dimensionality prior to indexing. However, approximation involves loss of information, which also leads to exponential growth of the search space. Therefore, indexing an approximation with a high dimensionality, i.e., a high-quality approximation, is desirable.
We introduce Symbolic Fourier Approximation (SFA) and the SFA trie, which together allow indexing of not only large datasets but also high-dimensional approximations. This is done by exploiting the trade-off between the quality of the approximation and the degeneration of the index: each approximation is represented by a variable number of dimensions. Our experiments show that SFA combined with the SFA trie can index 5 to 10 times more dimensions than previous approaches. As a result, for exact similarity search on real-world and synthetic data, it reduces page accesses by a factor of 2 to 25 and CPU costs by a factor of 2 to 11.
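The abstract does not spell out how SFA words are built or how the trie uses a variable number of dimensions, so the following is only an illustrative sketch under stated assumptions: a series is approximated by its first few DFT coefficients, each coefficient is quantized to a symbol using equi-depth bins learned from sample data, and a prefix trie splits an overfull leaf by consuming one more symbol. All names (`dft_features`, `learn_breakpoints`, `to_word`, `TrieNode`) and the parameters `n_coeffs`, `alphabet_size`, and `leaf_capacity` are hypothetical and not taken from the paper.

```python
import cmath
import math
from bisect import bisect_right


def dft_features(series, n_coeffs):
    """Real and imaginary parts of DFT coefficients 1..n_coeffs (DC term skipped)."""
    n = len(series)
    feats = []
    for k in range(1, n_coeffs + 1):
        c = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(series)) / n
        feats.extend([c.real, c.imag])
    return feats


def learn_breakpoints(feature_rows, alphabet_size):
    """Equi-depth breakpoints per feature dimension (an assumed binning choice)."""
    breakpoints = []
    for dim in zip(*feature_rows):
        values = sorted(dim)
        cuts = [values[len(values) * i // alphabet_size]
                for i in range(1, alphabet_size)]
        breakpoints.append(cuts)
    return breakpoints


def to_word(features, breakpoints):
    """Map each feature value to a symbol 'a', 'b', ... by its bin index."""
    return "".join(chr(ord("a") + bisect_right(cuts, v))
                   for v, cuts in zip(features, breakpoints))


class TrieNode:
    """Prefix trie over words; leaves hold entries, inner nodes hold children."""

    def __init__(self, depth=0):
        self.depth = depth    # number of symbols consumed to reach this node
        self.children = {}    # symbol -> TrieNode
        self.entries = []     # (word, series_id) pairs, only at leaves

    def insert(self, word, series_id, leaf_capacity):
        node = self
        while node.children:  # descend, creating a new leaf if needed
            node = node.children.setdefault(word[node.depth],
                                            TrieNode(node.depth + 1))
        node.entries.append((word, series_id))
        node._split_if_needed(leaf_capacity)

    def _split_if_needed(self, leaf_capacity):
        # An overfull leaf is refined by one more symbol, so dense regions
        # end up indexed with longer prefixes than sparse ones.
        if len(self.entries) <= leaf_capacity:
            return
        if self.depth >= len(self.entries[0][0]):
            return  # no symbols left to split on
        entries, self.entries = self.entries, []
        for word, series_id in entries:
            child = self.children.setdefault(word[self.depth],
                                             TrieNode(self.depth + 1))
            child.entries.append((word, series_id))
        for child in self.children.values():
            child._split_if_needed(leaf_capacity)

    def find_leaf(self, word):
        node = self
        while node.children and word[node.depth] in node.children:
            node = node.children[word[node.depth]]
        return node


# Usage: four sinusoids of different frequency, 2 coefficients, 2 symbols.
series_set = [[math.sin(2 * math.pi * f * t / 16) for t in range(16)]
              for f in (1, 2, 3, 4)]
rows = [dft_features(s, n_coeffs=2) for s in series_set]
cuts = learn_breakpoints(rows, alphabet_size=2)
words = [to_word(r, cuts) for r in rows]

root = TrieNode()
for series_id, word in enumerate(words):
    root.insert(word, series_id, leaf_capacity=1)
```

In this sketch a leaf's depth is the number of symbols, and hence dimensions, actually used to index the series stored there, which is one plausible reading of "a variable number of dimensions". An exact similarity search would additionally need a lower-bounding distance on the words, which is not shown here.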