SFA: a symbolic Fourier approximation and index for similarity search in high dimensional datasets

Authors:
Patrick Schäfer, Zuse Institute Berlin, Berlin, Germany
Mikael Högqvist, Zuse Institute Berlin, Berlin, Germany

EDBT '12: Proceedings of the 15th International Conference on Extending Database Technology, March 2012, Pages 516–527. https://doi.org/10.1145/2247596.2247656

Published: 27 March 2012

ABSTRACT

Time series analysis, as an application of high-dimensional data mining, is a common task in biochemistry, meteorology, climate research, biomedicine, and marketing. Similarity search in data with increasing dimensionality results in exponential growth of the search space, referred to as the Curse of Dimensionality. A common approach to postpone this effect is to apply approximation to reduce the dimensionality of the original data prior to indexing. However, approximation involves loss of information, which also leads to exponential growth of the search space. Therefore, indexing an approximation with a high dimensionality, i.e., high quality, is desirable.
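The trade-off described above can be made concrete with a small sketch. It assumes a truncated discrete Fourier transform as the approximation, which the abstract does not spell out, and all function names are hypothetical. With an orthonormal DFT, Parseval's theorem implies that a distance computed on a subset of the coefficients never exceeds the true Euclidean distance. An index built on the approximation can therefore prune candidates without false dismissals, and the bound tightens as more coefficients (dimensions) are kept.

```python
import cmath
import math


def dft(series):
    """Naive O(n^2) DFT, scaled by 1/sqrt(n) so that it preserves distances."""
    n = len(series)
    scale = 1.0 / math.sqrt(n)
    return [
        scale * sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(series))
        for k in range(n)
    ]


def euclidean(a, b):
    """Exact Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def lower_bound(a, b, num_coeffs):
    """Distance measured on only the first num_coeffs DFT coefficients."""
    fa, fb = dft(a)[:num_coeffs], dft(b)[:num_coeffs]
    return math.sqrt(sum(abs(x - y) ** 2 for x, y in zip(fa, fb)))


# Two made-up series of length 8 (illustrative data only).
a = [0.0, 1.0, 3.0, 2.0, 0.5, -1.0, -2.0, -0.5]
b = [0.5, 0.5, 2.0, 3.0, 1.0, -0.5, -1.5, -1.0]

true_dist = euclidean(a, b)
# Each bound uses more coefficients than the previous one.
bounds = [lower_bound(a, b, k) for k in (1, 2, 4, 8)]
```

In this sketch the values in `bounds` never exceed `true_dist`, and using all eight coefficients recovers it up to rounding. Fewer coefficients give a looser bound and hence weaker pruning, which is the information loss the abstract refers to.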

We introduce Symbolic Fourier Approximation (SFA) and the SFA trie, which allow indexing not only large datasets but also high-dimensional approximations. This is done by exploiting the trade-off between the quality of the approximation and the degeneration of the index, using a variable number of dimensions to represent each approximation. Our experiments show that SFA combined with the SFA trie can scale up to a factor of 5 to 10 more indexed dimensions than previous approaches. Thus, it lowers page accesses by a factor of 2 to 25 and CPU costs by a factor of 2 to 11 for exact similarity search on real-world and synthetic data.
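The abstract does not describe the construction itself, so the following is only a rough sketch under stated assumptions, not the authors' implementation. It assumes that a symbolic word is obtained by quantizing the real and imaginary parts of the leading Fourier coefficients with per-position equi-depth bins learned from sample data, and that a prefix trie splits a leaf on the next symbol only when the leaf overflows. Every name and parameter here (`dft_features`, `learn_bins`, `to_word`, `PrefixTrie`, `capacity`) is hypothetical.

```python
import bisect
import cmath
import math


def dft_features(series, num_coeffs):
    """Real and imaginary parts of the first num_coeffs DFT coefficients."""
    n = len(series)
    feats = []
    for k in range(num_coeffs):
        c = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(series))
        feats.extend([c.real / math.sqrt(n), c.imag / math.sqrt(n)])
    return feats


def learn_bins(samples, num_coeffs, alphabet_size):
    """Equi-depth breakpoints per feature position, learned from samples."""
    columns = zip(*(dft_features(s, num_coeffs) for s in samples))
    bins = []
    for col in columns:
        ordered = sorted(col)
        bins.append([ordered[len(ordered) * i // alphabet_size]
                     for i in range(1, alphabet_size)])
    return bins


def to_word(series, bins, num_coeffs):
    """Replace each feature by the index of the bin it falls into."""
    feats = dft_features(series, num_coeffs)
    return tuple(bisect.bisect_right(b, f) for b, f in zip(bins, feats))


class PrefixTrie:
    """Leaves hold words; an overflowing leaf splits on the next symbol."""

    def __init__(self, depth=0, capacity=2):
        self.depth, self.capacity = depth, capacity
        self.children = {}  # symbol -> PrefixTrie
        self.entries = []   # (word, payload) pairs stored at a leaf

    def insert(self, word, payload):
        if self.children:  # inner node: descend by the symbol at this depth
            self._child(word).insert(word, payload)
            return
        self.entries.append((word, payload))
        if len(self.entries) > self.capacity and self.depth < len(word):
            pending, self.entries = self.entries, []
            for w, p in pending:  # redistribute using one more symbol
                self._child(w).insert(w, p)

    def _child(self, word):
        symbol = word[self.depth]
        if symbol not in self.children:
            self.children[symbol] = PrefixTrie(self.depth + 1, self.capacity)
        return self.children[symbol]

    def size(self):
        return len(self.entries) + sum(c.size() for c in self.children.values())

    def max_depth(self):
        if not self.children:
            return self.depth
        return max(c.max_depth() for c in self.children.values())


# Illustrative data: eight sine waves of different frequencies.
samples = [[math.sin(0.3 * t * (i + 1)) for t in range(16)] for i in range(8)]
bins = learn_bins(samples, num_coeffs=2, alphabet_size=4)
words = [to_word(s, bins, 2) for s in samples]

trie = PrefixTrie(capacity=2)
for i, word in enumerate(words):
    trie.insert(word, i)
```

In this sketch, leaves can end up at different depths, so sparsely populated regions are addressed by short prefixes while crowded regions use more symbols. This is one way to read "a variable number of dimensions to represent each approximation".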

Index Terms

SFA: a symbolic Fourier approximation and index for similarity search in high dimensional datasets
1. Information systems
  1. Information retrieval
    1. Information retrieval query processing
  2. Information storage systems
    1. Record storage systems
      1. Directory structures
        B-trees
2. Mathematics of computing
  1. Discrete mathematics
    1. Graph theory
      1. Trees
  2. Probability and statistics
    1. Statistical paradigms
      1. Time series analysis

Published in
EDBT '12: Proceedings of the 15th International Conference on Extending Database Technology
March 2012
643 pages
ISBN: 9781450307901
DOI: 10.1145/2247596
Editors:
Elke Rundensteiner, Worcester Polytechnic Institute
Volker Markl, Technische Universität Berlin, Germany
Ioana Manolescu, INRIA, France
Sihem Amer-Yahia, QCRI, Doha, Qatar
Felix Naumann, Hasso Plattner Institute, Potsdam, Germany
Ismail Ari, Ozyegin University, Turkey
Copyright © 2012 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
data mining
discretisation
indexing
symbolic representation
time series
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 7 of 10 submissions, 70%