Generating data series query workloads

Zoumpatianos, Kostas; Lou, Yin; Ileana, Ioana; Palpanas, Themis; Gehrke, Johannes

doi:10.1007/s00778-018-0513-x

Regular Paper
Published: 17 July 2018

Volume 27, pages 823–846 (2018)

The VLDB Journal

Kostas Zoumpatianos¹,³ (ORCID: orcid.org/0000-0002-6221-8254),
Yin Lou²,
Ioana Ileana³,
Themis Palpanas³ &
Johannes Gehrke⁴

Abstract

Data series (including time series) have attracted considerable interest in recent years. Most of the research has focused on how to efficiently support similarity or nearest neighbor queries over large data series collections (an important data mining task), and several data series summarization and indexing methods have been proposed to solve this problem. Up to this point, very little attention has been paid to properly evaluating such index structures, with most previous works relying solely on randomly selected data series as queries. In this work, we show that random workloads are inherently unsuitable for the task at hand, and we argue that query workloads need to be generated carefully. We define measures that capture the characteristics of queries, and we propose a method for generating workloads with the desired properties, that is, workloads that can effectively evaluate and compare data series summarizations and indexes. In our experimental evaluation, with carefully controlled query workloads, we shed light on key factors affecting the performance of nearest neighbor search in large data series collections. This is the first paper to introduce a method for quantifying the hardness of data series queries, together with the ability to generate queries of predefined hardness.

Notes

Note that when these values are measured over time (usually at fixed time intervals), we call them time series. However, time series are just one special case of data series: A series can also be defined over other measures (e.g., mass in mass spectroscopy, position in genome sequences, angle in radial chemical profiles, etc.). For the rest of this paper, we will use the terms sequence, data series, and time series interchangeably.
Website: http://www.mi.parisdescartes.fr/~themisp/bends/
This work (built on our preliminary version [46]) includes a more precise formal definition of the problem, a deeper analysis of previous workloads, a robust geometric solution for placing nearest neighbors at predefined distances from a query that removes earlier limitations, and an expanded experimental evaluation section.
In this work, we use the well-known FFT algorithm.
Informally, the effort is the amount of work that an index needs to perform. We formally define the notion of effort later in this section.
A similar definition has been proposed in the past [5].
We also use the same datasets in our experimental section.
ftp://ftp.ensembl.org/pub/release-42/
This algorithm iterates over all symbols in the DNA sequence and constructs the series as a cumulative sum, which increases by 2 for every appearance of the base “A,” by 1 for “G” and decreases by 1 and 2 for each appearance of “C” and “T,” respectively.
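
The conversion described in the last note can be sketched in a few lines of Python. This is a minimal illustration based only on the step sizes stated above, not the authors' implementation: the function name `dna_to_series`, the upper-casing of the input, and the choice to skip symbols other than A, G, C, and T (for example "N") are assumptions made here for the example.

```python
from itertools import accumulate

# Step sizes taken from the note above: A: +2, G: +1, C: -1, T: -2.
STEP = {"A": 2, "G": 1, "C": -1, "T": -2}


def dna_to_series(sequence, skip_unknown=True):
    """Convert a DNA string into a data series via a cumulative sum.

    Assumption (not from the paper): symbols outside A/G/C/T are skipped
    by default, or raise an error when skip_unknown is False.
    """
    steps = []
    for symbol in sequence.upper():
        if symbol in STEP:
            steps.append(STEP[symbol])
        elif not skip_unknown:
            raise ValueError(f"unexpected symbol: {symbol!r}")
    # Running total of the per-symbol steps gives the series values.
    return list(accumulate(steps))


print(dna_to_series("AGCT"))  # [2, 3, 2, 0]
```

Each value of the resulting series depends on every symbol seen so far, so the series behaves like a walk whose direction is set by the local base composition.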

References

1. Agrawal, R., Faloutsos, C., Swami, A.: Efficient similarity search in sequence databases. In: FODO (1993)
2. Assent, I., Krieger, R., Afschari, F., Seidl, T.: The TS-tree: Efficient time series search and retrieval. In: EDBT (2008)
3. Bagnall, A., Lines, J., Bostrom, A., Large, J., Keogh, E.J.: The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances. Data Min. Knowl. Discov. 31(3), 606–660 (2017). https://doi.org/10.1007/s10618-016-0483-9
4. Bay, S.D., Kibler, D., Pazzani, M.J., Smyth, P.: The UCI KDD archive of large data sets for data mining research and experimentation. In: SIGKDD Explorations (2000)
5. Beyer, K., Goldstein, J., Ramakrishnan, R., Shaft, U.: When is “nearest neighbor” meaningful? In: ICDT (1999)
6. Camerra, A., Palpanas, T., Shieh, J., Keogh, E.: iSAX 2.0: Indexing and mining one billion time series. In: ICDM (2010)
7. Camerra, A., Shieh, J., Palpanas, T., Rakthanmanon, T., Keogh, E.: Beyond one billion time series: indexing and mining very large time series collections with iSAX2+. KAIS (2013)
8. Chakrabarti, K., Keogh, E., Mehrotra, S., Pazzani, M.: Locally adaptive dimensionality reduction for indexing large time series databases. In: SIGMOD (2002)
9. Chan, K.P., Fu, A.C.: Efficient time series matching by wavelets. In: ICDE (1999)
10. Chen, Q., Chen, L., Lian, X., Liu, Y., Yu, J.X.: Indexable PLA for efficient similarity search. In: VLDB (2007)
11. Chow, C., Mokbel, M.F., Bao, J., Liu, X.: Query-aware location anonymization for road networks. GeoInformatica 15(3), 571–607 (2011). https://doi.org/10.1007/s10707-010-0117-0
12. Dallachiesa, M., Nushi, B., Mirylenka, K., Palpanas, T.: Uncertain time-series similarity: Return to the basics. In: VLDB (2012)
13. Dallachiesa, M., Palpanas, T., Ilyas, I.F.: Top-k nearest neighbor search in uncertain data series. In: VLDB (2015)
14. Das, G., Gunopulos, D., Mannila, H.: Finding similar time series. In: Principles of Data Mining and Knowledge Discovery, First European Symposium, PKDD ’97, Trondheim, Norway, June 24-27, 1997, Proceedings, pp. 88–100 (1997). https://doi.org/10.1007/3-540-63223-9_109
15. Faloutsos, C., Ranganathan, M., Manolopoulos, Y.: Fast subsequence matching in time-series databases. In: SIGMOD (1994)
16. Fu, A.W., Leung, O.T., Keogh, E.J., Lin, J.: Finding time series discords based on Haar transform. In: Advanced Data Mining and Applications, Second International Conference, ADMA 2006, Xi’an, China, August 14-16, 2006, Proceedings, pp. 31–41 (2006). https://doi.org/10.1007/11811305_3
17. Goldin, D.Q., Kanellakis, P.C.: On similarity queries for time-series data: Constraint specification and implementation. In: Principles and Practice of Constraint Programming (1995)
18. Guttman, A.: R-trees: A dynamic index structure for spatial searching. In: SIGMOD (1984)
19. Huijse, P., Estévez, P.A., Protopapas, P., Principe, J.C., Zegers, P.: Computational intelligence challenges and applications on large-scale astronomical time series databases. IEEE Comp. Int. Mag. 9(3), 27–39 (2014)
20. Kashino, K., Smith, G., Murase, H.: Time-series active search for quick retrieval of audio and video. In: ICASSP (1999)
21. Kashyap, S., Karras, P.: Scalable kNN search on vertically stored time series. In: KDD (2011)
22. Keogh, E.: Machine learning in time series databases (and everything is a time series!). In: Tutorial at the AAAI International Conference on Artificial Intelligence, vol. 2 (2011)
23. Keogh, E., Chakrabarti, K., Pazzani, M., Mehrotra, S.: Dimensionality reduction for fast similarity search in large time series databases. KAIS 3 (2000)
24. Keogh, E., Pazzani, M.: Scaling up dynamic time warping to massive datasets. In: PKDD (1999)
25. Korn, F., Jagadish, H.V., Faloutsos, C.: Efficiently supporting ad hoc queries in large datasets of time sequences. In: SIGMOD (1997)
26. Kremer, H., Günnemann, S., Ivanescu, A.M., Assent, I., Seidl, T.: Efficient processing of multiple DTW queries in time series databases. In: SSDBM (2011)
27. Li, C.S., Yu, P., Castelli, V.: HierarchyScan: a hierarchical similarity search algorithm for databases of long sequences. In: ICDE (1996)
28. Lin, J., Keogh, E., Lonardi, S., Chiu, B.: A symbolic representation of time series, with implications for streaming algorithms. In: DMKD (2003)
29. Lin, J., Keogh, E.J., Wei, L., Lonardi, S.: Experiencing SAX: a novel symbolic representation of time series. Data Min. Knowl. Discov. 15(2), 107–144 (2007)
30. Lin, J., Khade, R., Li, Y.: Rotation-invariant similarity in time series using bag-of-patterns representation. J. Intell. Inf. Syst. 39(2), 287–315 (2012)
31. Prabhakar, S., Xia, Y., Kalashnikov, D.V., Aref, W.G., Hambrusch, S.E.: Query indexing and velocity constrained indexing: scalable techniques for continuous queries on moving objects. IEEE Trans. Comput. 51(10), 1124–1140 (2002). https://doi.org/10.1109/TC.2002.1039840
32. Rafiei, D., Mendelzon, A.: Similarity-based queries for time series data. In: SIGMOD (1997)
33. Rafiei, D., Mendelzon, A.: Efficient retrieval of similar time sequences using DFT. In: ICDE (1998)
34. Rakthanmanon, T., Campana, B., Mueen, A., Batista, G., Westover, B., Zhu, Q., Zakaria, J., Keogh, E.: Searching and mining trillions of time series subsequences under dynamic time warping. In: KDD (2012)
35. Ratanamahatana, C.A., Lin, J., Gunopulos, D., Keogh, E.J., Vlachos, M., Das, G.: Mining time series data. In: Data Mining and Knowledge Discovery Handbook, 2nd ed., pp. 1049–1077 (2010). https://doi.org/10.1007/978-0-387-09823-4_56
36. Ravi Kanth, K.V., Agrawal, D., Singh, A.: Dimensionality reduction for similarity searching in dynamic databases. In: SIGMOD (1998)
37. Schäfer, P., Högqvist, M.: SFA: A symbolic Fourier approximation and index for similarity search in high dimensional datasets. In: EDBT (2012)
38. Shasha, D.: Tuning time series queries in finance: Case studies and recommendations. IEEE Data Eng. Bull. 22(2), 40–46 (1999)
39. Shieh, J., Keogh, E.: iSAX: Indexing and mining terabyte sized time series. In: KDD (2008)
40. Wang, X., Mueen, A., Ding, H., Trajcevski, G., Scheuermann, P., Keogh, E.: Experimental comparison of representation methods and distance measures for time series data. DMKD 26(2), 275–309 (2013)
41. Wang, Y., Wang, P., Pei, J., Wang, W., Huang, S.: A data-adaptive and dynamic segmentation index for whole matching on time series. In: VLDB (2013)
42. Ye, L., Keogh, E.J.: Time series shapelets: a new primitive for data mining. In: KDD (2009)
43. Yi, B.K., Jagadish, H., Faloutsos, C.: Efficient retrieval of similar time sequences under time warping. In: ICDE (1998)
44. Zoumpatianos, K., Idreos, S., Palpanas, T.: Indexing for interactive exploration of big data series. In: SIGMOD (2014)
45. Zoumpatianos, K., Idreos, S., Palpanas, T.: RINSE: Interactive data series exploration. In: VLDB (2015)
46. Zoumpatianos, K., Lou, Y., Palpanas, T., Gehrke, J.: Query workloads for data series indexes. In: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, August 10-13, 2015, pp. 1603–1612 (2015)

Author information

Authors and Affiliations

Harvard University, Cambridge, USA
Kostas Zoumpatianos
Airbnb Inc., San Francisco, USA
Yin Lou
LIPADE, Paris Descartes University, Paris, France
Kostas Zoumpatianos, Ioana Ileana & Themis Palpanas
Microsoft Inc., Redmond, USA
Johannes Gehrke

Corresponding author

Correspondence to Kostas Zoumpatianos.

About this article

Cite this article

Zoumpatianos, K., Lou, Y., Ileana, I. et al. Generating data series query workloads. The VLDB Journal 27, 823–846 (2018). https://doi.org/10.1007/s00778-018-0513-x

Received: 20 December 2017
Revised: 29 June 2018
Accepted: 10 July 2018
Published: 17 July 2018
Issue Date: December 2018
DOI: https://doi.org/10.1007/s00778-018-0513-x
