DOI: 10.1145/2247596.2247656
research-article

SFA: a symbolic fourier approximation and index for similarity search in high dimensional datasets

Published: 27 March 2012

ABSTRACT

Time series analysis, as an application of high-dimensional data mining, is a common task in biochemistry, meteorology, climate research, biomedicine, and marketing. In similarity search, increasing dimensionality of the data causes exponential growth of the search space, an effect known as the curse of dimensionality. A common way to postpone this effect is to apply an approximation that reduces the dimensionality of the original data prior to indexing. However, approximation involves loss of information, which also leads to exponential growth of the search space. Therefore, indexing an approximation with a high dimensionality, i.e., of high quality, is desirable.
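To make the idea of approximating before indexing concrete, here is a minimal sketch of one well-known form of it: keep only the first few Fourier coefficients of each series. The abstract does not give the details of the approximation, so the orthonormal scaling, the naive O(n^2) transform, and all function names below are assumptions chosen for illustration. With this scaling, the distance between truncated coefficient vectors never exceeds the true Euclidean distance, and it gets tighter as more coefficients are kept.

```python
import cmath
import math


def dft(series):
    """Naive O(n^2) discrete Fourier transform with orthonormal 1/sqrt(n) scaling."""
    n = len(series)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(series))
        / math.sqrt(n)
        for k in range(n)
    ]


def approximate(series, num_coefficients):
    """Hypothetical helper: keep only the first num_coefficients coefficients."""
    return dft(series)[:num_coefficients]


def euclidean(a, b):
    """Euclidean distance; works for real or complex sequences of equal length."""
    return math.sqrt(sum(abs(x - y) ** 2 for x, y in zip(a, b)))


if __name__ == "__main__":
    a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
    b = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0]
    true_distance = euclidean(a, b)
    for m in range(1, len(a) + 1):
        lower_bound = euclidean(approximate(a, m), approximate(b, m))
        print(m, round(lower_bound, 4), "<=", round(true_distance, 4))
```

The trade-off in the paragraph above shows up directly: a small number of coefficients gives a loose bound (many false candidates to verify), while a large number gives a tight bound but more dimensions for the index to handle.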

We introduce Symbolic Fourier Approximation (SFA) and the SFA trie, which together allow indexing not only large datasets but also high-dimensional approximations. This is done by exploiting the trade-off between the quality of the approximation and the degeneration of the index, using a variable number of dimensions to represent each approximation. Our experiments show that SFA combined with the SFA trie scales to 5--10 times more indexed dimensions than previous approaches. As a result, it reduces page accesses by a factor of 2--25 and CPU costs by a factor of 2--11 for exact similarity search on real-world and synthetic data.
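The abstract does not describe how the SFA trie is built, so the following is only a hypothetical illustration of what "a variable number of dimensions" could look like in a prefix tree over symbolic words. The four-letter alphabet, the fixed breakpoints, the leaf capacity, and the names `to_word`, `TrieNode`, and `leaf_depths` are all assumptions, not the authors' method. Each approximated value is mapped to a symbol, and a leaf is split on one more symbol only when it holds too many words, so sparse regions are represented by short prefixes and dense regions by longer ones.

```python
from bisect import bisect_right

ALPHABET = "abcd"  # assumed alphabet; needs len(breakpoints) + 1 symbols


def to_word(values, breakpoints):
    """Map each real value to a symbol by locating it among sorted breakpoints."""
    return "".join(ALPHABET[bisect_right(breakpoints, v)] for v in values)


class TrieNode:
    """Prefix-tree node; assumes all inserted words have the same length."""

    def __init__(self, depth=0):
        self.depth = depth    # number of symbols consumed to reach this node
        self.children = {}    # symbol -> TrieNode, used by internal nodes
        self.words = []       # stored words, used by leaf nodes
        self.is_leaf = True

    def insert(self, word, capacity):
        node = self
        # Descend along the word's symbols until a leaf is reached.
        while not node.is_leaf:
            symbol = word[node.depth]
            if symbol not in node.children:
                node.children[symbol] = TrieNode(node.depth + 1)
            node = node.children[symbol]
        node.words.append(word)
        # Split an overfull leaf on one more symbol, if any symbol remains.
        if len(node.words) > capacity and node.depth < len(word):
            node.is_leaf = False
            pending, node.words = node.words, []
            for w in pending:
                node.insert(w, capacity)


def leaf_depths(node):
    """Collect the prefix length (number of dimensions used) of every leaf."""
    if node.is_leaf:
        return [node.depth]
    return [d for child in node.children.values() for d in leaf_depths(child)]


if __name__ == "__main__":
    root = TrieNode()
    for word in ["aaaa", "aaab", "aabb", "bbbb", "bcdd"]:
        root.insert(word, capacity=2)
    print(sorted(leaf_depths(root)))
```

In this toy run the words sharing the prefix `aa` force deeper splits than the two words starting with `b`, so the leaves end up at different depths. That is the sense in which the number of indexed dimensions can vary per entry.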


          • Published in

            EDBT '12: Proceedings of the 15th International Conference on Extending Database Technology
            March 2012
            643 pages
            ISBN: 9781450307901
            DOI: 10.1145/2247596

            Copyright © 2012 ACM


            Publisher: Association for Computing Machinery, New York, NY, United States


            Acceptance Rates

            Overall Acceptance Rate: 7 of 10 submissions, 70%
