Skip to main content
Log in

DAMP: accurate time series anomaly detection on trillions of datapoints and ultra-fast arriving data streams

  • Published:
Data Mining and Knowledge Discovery Aims and scope Submit manuscript

Abstract

Time series anomaly detection is one of the most active areas of research in data mining, with dozens of new approaches been suggested each year. In spite of all these creative solutions proposed for this problem, recent empirical evidence suggests that the time series discord, a relatively simple twenty-year old distance-based technique, remains among the state-of-art techniques. While there are many algorithms for computing the time series discords, they all have limitations. First, they are limited to the batch case, whereas the online case is more actionable. Second, these algorithms exhibit poor scalability beyond tens of thousands of datapoints. In this work we introduce DAMP, a novel algorithm that addresses both these issues. DAMP computes exact left-discords on fast arriving streams, at up to 300,000 Hz using a commodity desktop. This allows us to find time series discords in datasets with trillions of datapoints for the first time. We will demonstrate the utility of our algorithm with the most ambitious set of time series anomaly detection experiments ever conducted. We will further show that our speedup improvements can be applied in the multidimensional case.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13
Fig. 14
Fig. 15
Fig. 16
Fig. 17
Fig. 18
Fig. 19
Fig. 20
Fig. 21
Fig. 22
Fig. 23
Fig. 24

Similar content being viewed by others

Notes

  1. Note that some papers misattribute the success of their anomaly detection to the Matrix Profile or to HOTSAX, but these are simple different algorithms to compute time series discords.

  2. This observation is true for heartbeats, gaits, machine cycles etc. One exception is for events tied to a cultural calendar. For example, for taxi demand or electrical power demand, the most similar day to any given day, is not the previous day, but the same day one week earlier.

  3. Actually, using exactly one heartbeat (or pattern more generally), may make the downstream algorithms brittle to the choice of the starting point of the heartbeat. To bypass this issue, we always extract two consecutive beats.

  4. Here we explain “non-trivial anomaly detector”. Simple rule-based conditionals such as: “if the time series ever reports a value that is higher than any value you have seen before, then flag anomaly” could be used as an anomaly detector, and could be instantaneously instantiated. By non-trivial we mean any TSAD algorithm that examines each subsequence for any information about shape, autocorrelation, Markov properties etc., and compares this information (in the most general sense), to a model gleaned from training data. The reader will appreciate that this includes essentially all proposed anomaly detectors in the literature.

References

  • Aubet F-X, Zügner D, Gasthaus J (2021) Monte Carlo EM for deep time series anomaly detection. arXiv:2112.14436 [cs, stat]

  • Audibert J, Marti S, Guyard F, Zuluaga MA (2021) From univariate to multivariate time series anomaly detection with non-local information. In: Lemaire V, Malinowski S, Bagnall A et al (eds) Advanced analytics and learning on temporal data. Springer International Publishing, Cham, pp 186–194

    Chapter  Google Scholar 

  • Boniol P, Linardi M, Roncallo F et al (2021a) Unsupervised and scalable subsequence anomaly detection in large data series. VLDB J 30:909–931. https://doi.org/10.1007/s00778-021-00655-8

    Article  Google Scholar 

  • Boniol P, Paparrizos J, Palpanas T, Franklin MJ (2021b) SAND: streaming subsequence anomaly detection. Proc VLDB Endow 14:1717–1729

    Article  Google Scholar 

  • Case Western Reserve University Bearing Data Center (2021) Available: https://csegroups.case.edu/bearingdatacenter/home. Accessed: Nov. 15, 2021

  • CNC Crashes. Video. (15 Feb 2018). from https://youtu.be/t2tBtZCa7j4?t=205. Retrieved December 20, 2021

  • Daigavane A, Wagstaff KL, Doran G et al (2022) Unsupervised detection of Saturn magnetic field boundary crossings from plasma spectrometer data. Comput Geosci 161:105040

    Article  Google Scholar 

  • DAMP (2022) https://sites.google.com/view/discord-aware-matrix-profile

  • Dau HA, Bagnall A, Kamgar K et al (2019) The UCR time series archive. IEEE/CAA J Autom Sin 6:1293–1305. https://doi.org/10.1109/JAS.2019.1911747

    Article  Google Scholar 

  • Doshi K, Abudalou S, Yilmaz Y (2022) TiSAT: time series anomaly transformer. arXiv:2203.05167 [cs, eess, stat]

  • Higham NJ (2002) Accuracy and stability of numerical algorithms, 2 edn. ISBN: 978-0-89871-521-7

  • Hundman K, Constantinou V, Laporte C et al (2018) Detecting spacecraft anomalies using LSTMs and nonparametric dynamic thresholding. In: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining. ACM, London United Kingdom, pp 387–395

  • Imani S, Madrid F, Ding W et al (2020) Introducing time series snippets: a new primitive for summarizing long time series. Data Min Knowl Disc 34:1713–1743. https://doi.org/10.1007/s10618-020-00702-y

    Article  MathSciNet  MATH  Google Scholar 

  • Keogh E (2021) Irrational exuberance why we should not believe 95% of papers on time series anomaly detection. In: 7th SIGKDD workshop on mining and learning from time series at SIGKDD 2021. Workshop Keynote https://www.youtube.com/watch?v=Vg1p3DouX8w&t=324s

  • Khansa HE, Gervet C and Brouillet A (2012) Prominent discord discovery with matrix profile: application to climate data insight. In: 10th international conference of advanced computer science & information technology (ACSIT 2022) May 21~22, 2022, Zurich, Switzerland

  • Kirti R, Karadi R (2012) Cardiac tamponade: atypical presentations after cardiac surgery. Acute Med 11:93–96

    Article  Google Scholar 

  • Mueen A, Zhu Y, Yeh M et al (2017) The fastest similarity search algorithm for time series subsequences under euclidean distance. http://www.cs.unm.edu/~mueen/FastestSimilaritySearch.htmlAccessed 24 Janurary, 2022

  • Nakamura T, Imamura M, Mercer R, Keogh E (2020) Merlin: parameter-free discovery of arbitrary length anomalies in massive time series archives. In: 2020 IEEE international conference on data mining (ICDM). IEEE, Sorrento, Italy, pp 1190–1195

  • National Weather Service. January 24, 2019 heavy rain and flooding. From https://www.weather.gov/aly/24Jan19HeavyRainFlood. Retrieved May 1 2022

  • Neupane D, Seok J (2020) Bearing fault detection and diagnosis using case western reserve university dataset with deep learning approaches: A review. IEEE Access 8:93155–93178. https://doi.org/10.1109/ACCESS.2020.2990528

    Article  Google Scholar 

  • Nilsson F (2022) Joint human-machine exploration of industrial time series using the matrix profile. In: Halmstad university, school of information technology, Halmstad embedded and intelligent systems research (EIS), CAISR—center for applied intelligent systems research

  • Palpanas T (2022) Personal communication June 4th 2022

  • Paparrizos J, Kang Y, Boniol P et al (2022) TSB-UAD: An end-to-end benchmark suite for univariate time-series anomaly detection. In: Proceedings of the VLDB endowment (PVLDB) journal

  • Park D, Hoshi Y, Kemp CC (2018) A multimodal anomaly detector for robot-assisted feeding using an LSTM-based variational autoencoder. IEEE Robot Autom Lett 3:1544–1551. https://doi.org/10.1109/LRA.2018.2801475

    Article  Google Scholar 

  • Park JY, Wilson E, Parker A, Nagy Z (2020) The good, the bad, and the ugly: data-driven load profile discord identification in a large building portfolio. Energy Build 215:109892

    Article  Google Scholar 

  • Silive.com. Wild storm pelts Staten Island with giant hail—‘threat of tornado has passed’ from https://www.silive.com/news/2019/05/nws-issues-tornado-warning-for-staten-island.html. Retrieved May 1 2022

  • Su Y, Zhao Y, Niu C et al (2019) Robust anomaly detection for multivariate time series through stochastic recurrent neural network pp 2828–2837

  • Thill M, Konen W, Bäck T (2020) Time series encodings with temporal convolutional networks. Springer, Cham, pp 161–173

    Google Scholar 

  • Truong HT, Ta BP, Le QA et al (2022) Light-weight federated learning-based anomaly detection for time-series data in industrial control systems. Comput Ind 140:103692. https://doi.org/10.1016/j.compind.2022.103692

    Article  Google Scholar 

  • Wastewater News. Valentine’s day storm slams California, pushing water agencies to the edge. From www.news.cornell.edu/Chronicle/00/5.18.00/wireless_class.html. Retrieved Dec 1 2021

  • Wikipedia. Leap year problem. from https://en.wikipedia.org/wiki/Leap_year_problem. Retrieved December 1, 2021

  • Wu R, Keogh E (2021) Current time series anomaly detection benchmarks are flawed and are creating the illusion of progress. IEEE Trans Knowl Data Eng. https://doi.org/10.1109/TKDE.2021.3112126

    Article  Google Scholar 

  • Yeh C-CM, Zheng Y, Wang J et al (2021) Error-bounded approximate time series joins using compact dictionary representations of time series. CoRR abs arXiv:2112.12965

  • Yeh C-CM, Zhu Y, Dau HA et al (2019) Online amnestic dtw to allow real-time golden batch monitoring. pp 2604–2612

  • Zheng X, Xu N, Trinh L et al (2021) PSML: a multi-scale time-series dataset for machine learning in decarbonized energy grids. arXiv preprint arXiv: 2110.06324

  • Zhu Y, Yeh C-CM, Zimmerman Z et al (2018) Matrix profile XI: SCRIMP++: time series motif discovery at interactive speeds. In: IEEE pp 837–846

Download references

Acknowledgements

We sincerely thank the authors of Boniol et al. (2021a) for their help in creating Fig. 23.

Funding

This research was supported by NSF OIA-1757207, CNS-2008910 and RI-2104537, the French National Research Agency (ANR-19-P3IA-0002), and NSF 2103976, Mitsubishi, Visa and Toyota.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Yue Lu.

Ethics declarations

Conflict of interest

The authors have no conflict of interest.

Ethical approval

This research complies with all ethical guidelines at the three institutions represented.

Human or animal rights

No human subjects were used by the current authors. The ECG data was collected by others over a decade ago, under strict human subject protocols.

Additional information

Communicated by Johannes Fürnkranz.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: tool to set window size

Appendix: tool to set window size

For most of the experiments in this paper, we use human intuition to manually set the value of the parameter m. We followed the well-known “folk wisdom” that was explained and tested in Nakamura et al. (2020). The idea is simply this. Plot a subset of the data, estimate the length of a single period “by-eye” and set m to a value a little less than a period. Here we used m = 90% of estimated period length. In (Nakamura et al. 2020) the authors show that in most cases, the anomalies returned doing this, would not change even if you set m to twice or half this value (They were using the full MP, not the left MP, but we found similar results with DAMP).

However, for the experiment in Table 9, we want to completely exclude any human intervention or tuning. Therefore, we used the following four lines of Matlab code to automatically calculate the window size for the 250 KDD Cup 2021 datasets.

figure a

The window size is obtained by finding the peak of autocorrelation in the range of 10–1000 (the value of parameter m is limited to the range of 10–1000). To avoid the ‘findpeaks’ function returning a null value (very unlikely, but possible), we set the default value of m to 1000.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Lu, Y., Wu, R., Mueen, A. et al. DAMP: accurate time series anomaly detection on trillions of datapoints and ultra-fast arriving data streams. Data Min Knowl Disc 37, 627–669 (2023). https://doi.org/10.1007/s10618-022-00911-7

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10618-022-00911-7

Keywords

Navigation