Abstract
The aim of this work is to obtain a useful anomaly definition for the online analysis of time series. The idea is to develop an anomaly concept that remains sustainable for long-lived and frequently updated data streams. As a solution, we provide an adaptation of the discord concept, which has been used successfully for anomaly detection in time series. An online approach implies frequent processing of a data stream in order to provide timely anomaly alerts. This requires a modification, since discord search in its original definition is not exactly decomposable. A statistical approach, which rates the significance of the discords found in each analysis, made it possible to obtain a solution in which the number of false positives is minimized. The new online anomalies are called significant online discords (sods). As a novel feature, sod search determines the number of anomalies in the time series under investigation. The search for sods has been implemented and its properties validated with synthetic and real data. As a result, we found that sods can be considered a useful new tool for anomaly detection in fast-streaming time series or Big Data contexts.
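For readers unfamiliar with the discord concept that sods adapt, the following is a minimal sketch of the classical offline definition: the discord of a time series is the fixed-length subsequence whose nearest non-overlapping neighbor is farthest away. This is not the sod search of this work. The function name `brute_force_discord`, the window length `m`, and the use of plain (not z-normalized) Euclidean distance are assumptions made for illustration, and the statistical significance rating described above is not shown.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_discord(series, m):
    """Return (start, nn_dist) of the length-m subsequence whose nearest
    non-overlapping neighbor is farthest away (quadratic-time search)."""
    n_sub = len(series) - m + 1
    if n_sub <= m:
        raise ValueError("series too short for non-overlapping windows")
    subs = [series[i:i + m] for i in range(n_sub)]
    best_start, best_dist = -1, -1.0
    for i in range(n_sub):
        nn = math.inf
        for j in range(n_sub):
            # Skip self-matches: windows that overlap window i.
            if abs(i - j) >= m:
                nn = min(nn, euclidean(subs[i], subs[j]))
        # Keep the window with the largest nearest-neighbor distance.
        if nn != math.inf and nn > best_dist:
            best_start, best_dist = i, nn
    return best_start, best_dist


# Illustrative data: a periodic signal with one injected spike at index 20.
series = [0, 1, 0, -1] * 10
series[20] = 5
start, dist = brute_force_discord(series, 4)
# The reported window covers the spike; every other window has an
# identical copy elsewhere in the series and thus a neighbor at distance 0.
```

The sketch always returns exactly one discord, whether or not the series contains a real anomaly. That limitation is what motivates rating the significance of each discord when the search is repeated on successive portions of a stream.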
Acknowledgements
PA would like to thank Audrey Adams for editing suggestions.
Cite this article
Avogadro, P., Palonca, L. & Dominoni, M.A. Online anomaly search in time series: significant online discords. Knowl Inf Syst 62, 3083–3106 (2020). https://doi.org/10.1007/s10115-020-01453-4
DOI: https://doi.org/10.1007/s10115-020-01453-4