Abstract
Subsequence anomaly (or outlier) detection in long sequences is an important problem with applications in a wide range of domains. However, the approaches proposed so far in the literature have severe limitations: they either require prior domain knowledge or become cumbersome and expensive to use in situations with recurrent anomalies of the same type. In this work, we address these problems and propose NormA, a novel approach suitable for domain-agnostic anomaly detection. NormA is based on a new data series primitive, which makes it possible to detect anomalies based on their (dis)similarity to a model that represents normal behavior. The experimental results on several real datasets demonstrate that the proposed approach correctly identifies all single and recurrent anomalies of various types, with no prior knowledge of the characteristics of these anomalies (except for their length). Moreover, it outperforms the current state-of-the-art algorithms by a large margin in terms of accuracy, while being orders of magnitude faster.
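The general idea of scoring subsequences by their dissimilarity to a model of normal behavior can be illustrated with a minimal Python sketch. This is not the NormA algorithm itself: the model representation (a list of hypothetical `(pattern, weight)` pairs), the function names, and the use of z-normalized Euclidean distance are assumptions made purely for illustration, and the normal pattern here is simply taken from a prefix of the series that is assumed to be anomaly-free.

```python
import numpy as np

def znorm(x):
    """Z-normalize a sequence; constant sequences map to all zeros."""
    x = np.asarray(x, dtype=float)
    std = x.std()
    return (x - x.mean()) / std if std > 1e-12 else np.zeros_like(x)

def min_dist(sub, pattern):
    """Smallest z-normalized Euclidean distance between `sub` and any
    window of `pattern` that has the same length as `sub`."""
    l = len(sub)
    s = znorm(sub)
    return min(
        np.linalg.norm(s - znorm(pattern[i:i + l]))
        for i in range(len(pattern) - l + 1)
    )

def anomaly_scores(series, l, model):
    """Score every subsequence of length `l` by its weighted distance to the
    patterns in `model` (a list of (pattern, weight) pairs; an assumed,
    illustrative representation of normal behavior)."""
    n = len(series) - l + 1
    return np.array([
        sum(w * min_dist(series[j:j + l], p) for p, w in model)
        for j in range(n)
    ])

# Synthetic example: a sine wave with period 50 and one injected anomaly.
t = np.arange(1000)
series = np.sin(2 * np.pi * t / 50)
series[600:650] = np.linspace(-1.0, 1.0, 50)   # anomalous ramp

# Assumed normal model: two periods taken from the start of the series.
model = [(series[:100].copy(), 1.0)]
scores = anomaly_scores(series, 50, model)
top = int(np.argmax(scores))                   # start of the highest-scoring subsequence
```

In this sketch, subsequences that repeat the normal pattern receive scores close to zero, while subsequences overlapping the injected ramp receive clearly higher scores. Only the anomaly length (here 50) is supplied as input, which mirrors the assumption stated above.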
















Change history
31 August 2021
A Correction to this paper has been published: https://doi.org/10.1007/s00778-021-00678-1
Notes
If the dimension that imposes the ordering of the sequence is time, then we talk about time series. In the rest of this paper, we use the terms sequence, data series, and time series interchangeably.
The authors of these papers define the problem as kth-discord discovery.
Cite this article
Boniol, P., Linardi, M., Roncallo, F. et al. Unsupervised and scalable subsequence anomaly detection in large data series. The VLDB Journal 30, 909–931 (2021). https://doi.org/10.1007/s00778-021-00655-8
DOI: https://doi.org/10.1007/s00778-021-00655-8