Abstract
Time series data remains a perennially important datatype considered in data mining. In the last decade there has been an increasing realization that time series data can be best understood by reasoning about time series subsequences on the basis of their similarity to other subsequences: the two most familiar such time series concepts being motifs and discords. Time series motifs refer to two particularly close subsequences, whereas time series discords indicate subsequences that are far from their nearest neighbors. However, we argue that it can sometimes be useful to simultaneously reason about a subsequence’s closeness to certain data and its distance to other data. In this work we introduce a novel primitive called the Contrast Profile that allows us to efficiently compute such a definition in a principled way. As we will show, the Contrast Profile has many downstream uses, including anomaly detection, data exploration, and preprocessing unstructured data for classification. We demonstrate the utility of the Contrast Profile by showing how it allows end-to-end classification in datasets with tens of billions of datapoints, and how it can be used to explore datasets and reveal subtle patterns that might otherwise escape our attention. Moreover, we demonstrate the generality of the Contrast Profile by presenting detailed case studies in domains as diverse as seismology, animal behavior, and cardiology.
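The idea of scoring a subsequence by both its closeness to some data and its distance from other data can be illustrated with a minimal brute-force sketch. This is not the authors' algorithm. The function names (`znorm`, `nn_distances`, `contrast_score`), the z-normalized Euclidean distance, the trivial-match exclusion zone of `m // 2`, and the clipping at zero are all assumptions made for illustration. The efficient computation the abstract refers to is not reproduced here.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def znorm(x):
    """Z-normalize a 1-D array; a constant window maps to all zeros."""
    sd = x.std()
    return (x - x.mean()) / sd if sd > 0 else np.zeros_like(x, dtype=float)


def nn_distances(query_series, ref_series, m, exclude_self=False):
    """For each length-m window of query_series, return the z-normalized
    Euclidean distance to its nearest length-m window in ref_series.

    With exclude_self=True (ref_series is the query itself), windows whose
    start positions differ by at most m // 2 are skipped, so a window
    cannot match itself or a heavily overlapping neighbor.
    """
    q = np.array([znorm(w) for w in sliding_window_view(query_series, m)])
    r = np.array([znorm(w) for w in sliding_window_view(ref_series, m)])
    # Brute-force all-pairs distances: fine for a demo, quadratic in length.
    d = np.sqrt(((q[:, None, :] - r[None, :, :]) ** 2).sum(axis=2))
    if exclude_self:
        i = np.arange(len(q))
        d[np.abs(i[:, None] - i[None, :]) <= m // 2] = np.inf
    return d.min(axis=1)


def contrast_score(pos, neg, m):
    """High where a window of `pos` repeats elsewhere in `pos` (small
    'near' distance) yet has no close match in `neg` (large 'far')."""
    near = nn_distances(pos, pos, m, exclude_self=True)
    far = nn_distances(pos, neg, m)
    return np.maximum(far - near, 0.0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m = 30
    neg = rng.normal(size=300)   # series that lacks the pattern
    pos = rng.normal(size=300)   # series that contains it twice
    pattern = np.sin(np.linspace(0, 4 * np.pi, m))
    pos[50:80] = pattern + 0.05 * rng.normal(size=m)
    pos[200:230] = pattern + 0.05 * rng.normal(size=m)
    score = contrast_score(pos, neg, m)
    print("highest-contrast window starts at", int(np.argmax(score)))
```

In this toy setup the score should peak at or near one of the two planted pattern locations, since those windows have a close neighbor within `pos` and nothing similar in `neg`.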
Notes
The original paper lists “two Australian fur seals (Arctocephalus pusillus doriferus) and three New Zealand fur seals (Arctocephalus forsteri)”; however, it is now commonly accepted that these are the same species, united under the more neutral name of Antipodean fur seals.
This is what logically must be done; however, by caching distance calculations and recomputing only the values that could have changed, the time and space overhead for the Kth-1 Plato is inconsequential.
References
Abdoli A, Murillo AC, Yeh C-CM et al (2018) Time series classification to improve poultry welfare. In: 2018 17th IEEE international conference on machine learning and applications (ICMLA), pp 635–642
Abdoli A, Murillo AC, Gerry AC, Keogh EJ (2019) Time series classification: lessons learned in the (literal) field while studying chicken behavior. In: 2019 IEEE international conference on big data (Big Data), pp 5962–5964
Abdoli A, Alaee S, Imani S, et al (2020) Fitbit for chickens? Time series data mining can increase the productivity of poultry farms. In: Proceedings of the 26th ACM SIGKDD International conference on knowledge discovery & data mining. Association for Computing Machinery, New York, NY, USA, pp 3328–3336
Alaee S, Mercer R, Kamgar K, Keogh E (2021) Time series motifs discovery under DTW allows more robust discovery of conserved structure. Data Min Knowl Discov 35:863–910. https://doi.org/10.1007/s10618-021-00740-0
Allen R (1982) Automatic phase pickers: their present use and future prospects. Bull Seismol Soc Am 72:S225–S242. https://doi.org/10.1785/BSSA07206B0225
Aquarium of the Pacific (2017) Galumphing: how seals move on land
Beentjes MP (1990) Comparative terrestrial locomotion of the Hooker’s sea lion (Phocarctos hookeri) and the New Zealand fur seal (Arctocephalus forsteri): evolutionary and ecological implications. Zool J Linn Soc 98:307–325. https://doi.org/10.1111/j.1096-3642.1990.tb01204.x
Bergen KJ, Johnson PA, de Hoop MV, Beroza GC (2019) Machine learning for data-driven discovery in solid Earth geoscience. Science 363:eaau0323. https://doi.org/10.1126/science.aau0323
Beyer K, Goldstein J, Ramakrishnan R, Shaft U (1999) When is “nearest neighbor” meaningful? In: Beeri C, Buneman P (eds) Database theory—ICDT’99. Springer, Berlin, Heidelberg, pp 217–235
Breiman L, Friedman JH, Olshen RA, Stone CJ (2017) Classification and regression trees. Routledge, Boca Raton
Bu Y, Chen L, Fu AW-C, Liu D (2009) Efficient anomaly monitoring over moving object trajectory streams. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pp 159–168
Dau HA, Bagnall A, Kamgar K et al (2019) Welcome to the UCR time series classification/clustering page. https://www.cs.ucr.edu/~eamonn/time_series_data_2018/. Accessed 17 Jan 2021
Dietterich TG, Lathrop RH, Lozano-Pérez T (1997) Solving the multiple instance problem with axis-parallel rectangles. Artif Intell 89:31–71. https://doi.org/10.1016/S0004-3702(96)00034-3
Duputel Z, Tsai VC, Rivera L, Kanamori H (2013) Using centroid time-delays to characterize source durations and identify earthquakes with unique characteristics. Earth Planet Sci Lett 374:92–100. https://doi.org/10.1016/j.epsl.2013.05.024
Field EH, Arrowsmith RJ, Biasi GP et al (2014) Uniform California earthquake rupture forecast, version 3 (UCERF3)—The time-independent model. Bull Seismol Soc Am 104:1122–1180. https://doi.org/10.1785/0120130164
Goldberger AL, Amaral LAN, Glass L et al (2000) Physiobank, physiotoolkit, and physionet. Circulation 101:e215–e220. https://doi.org/10.1161/01.CIR.101.23.e215
Guan X, Raich R, Wong W-K (2016) Efficient multi-instance learning for activity recognition from time series data using an auto-regressive hidden Markov model. In: Proceedings of the 33rd international conference on machine learning. PMLR, pp 2330–2339
Hu B, Chen Y, Keogh E (2016) Classification of streaming time series under more realistic assumptions. Data Min Knowl Discov 30:403–437. https://doi.org/10.1007/s10618-015-0415-0
Hutton K, Woessner J, Hauksson E (2010) Earthquake monitoring in southern California for seventy-seven years (1932–2008). Bull Seismol Soc Am 100:423–446. https://doi.org/10.1785/0120090130
Kouadri WM, Ouziri M, Benbernou S et al (2020) Quality of sentiment analysis tools: the reasons of inconsistency. Proc VLDB Endow 14:668–681
Ladds MA, Thompson AP, Slip DJ et al (2016) Seeing it all: evaluating supervised machine learning methods for the classification of diverse otariid behaviours. PLoS ONE 11:e0166898. https://doi.org/10.1371/journal.pone.0166898
Lin J, Keogh E (2006) Group SAX: Extending the notion of contrast sets to time series and multimedia data. Knowledge discovery in databases: PKDD 2006. Springer, Berlin, Heidelberg, pp 284–296
MATLAB (n.d.) Sequence classification using deep learning. https://www.mathworks.com/help/deeplearning/ug/classify-sequence-data-using-lstm-networks.html. Accessed 21 Jan 2021
Mercer R (2021) Contrast profile. https://sites.google.com/view/contrastprofile. Accessed 5 Jan 2021
Mueen A, Keogh E, Young N (2011) Logical-shapelets: an expressive primitive for time series classification. In: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. Association for Computing Machinery, New York, NY, USA, pp 1154–1162
Mueen A, Zhu Y, Yeh CM et al (2015) The fastest similarity search algorithm for time series subsequences under Euclidean distance. https://www.cs.unm.edu/~mueen/FastestSimilaritySearch.html. Accessed 18 Jan 2021
Murillo AC, Abdoli A, Blatchford RA et al (2020) Parasitic mites alter chicken behaviour and negatively impact animal welfare. Sci Rep 10:8236. https://doi.org/10.1038/s41598-020-65021-0
Nakamura T, Imamura M, Mercer R, Keogh E (2020) MERLIN: parameter-free discovery of arbitrary length anomalies in massive time series archives. In: 2020 IEEE international conference on data mining (ICDM), pp 1190–1195
NCEDC (2014) Northern California earthquake data center
Pedestrian Counting System (2013b) City of Melbourne—Pedestrian counting system. In: Pedestrian Counting System. http://www.pedestrian.melbourne.vic.gov.au/#date=28-10-2021&time=8. Accessed 27 Oct 2021
Petersen MD, Mueller CS, Haller KM et al (2014) 2014 update of the United States national seismic hazard maps 8
Raghu N, Manjunatha KN (2019) Arrhythmia detection using machine learning techniques
Rakthanmanon T, Keogh E (2013) Fast shapelets: a scalable algorithm for discovering time series shapelets. In: Proceedings of the 2013 SIAM international conference on data mining. Society for Industrial and Applied Mathematics, pp 668–676
Rakthanmanon T (2013) Fast shapelets—Supporting website. http://alumni.cs.ucr.edu/~rakthant/FastShapelet/. Accessed 28 Sep 2021
Ross ZE, Trugman DT, Hauksson E, Shearer PM (2019) Searching for hidden earthquakes in Southern California. Science. https://doi.org/10.1126/science.aaw6888
Rost S, Thomas C (2002) Array seismology: methods and applications. Rev Geophys. https://doi.org/10.1029/2000RG000100
SCEDC (n.d.) Southern California Earthquake Data Center at Caltech. https://scedc.caltech.edu/faq/index.html#reviewed. Accessed 5 Oct 2021
Schaff DP, Waldhauser F (2005) Waveform cross-correlation-based differential travel-time measurements at the Northern California Seismic Network. Bull Seismol Soc Am 95:2446–2461. https://doi.org/10.1785/0120040221
Scholz J-R, Widmer-Schnidrig R, Davis P et al (2020) Detection, analysis, and removal of glitches from InSight’s seismic data From Mars. Earth Space Sci 7:e2020EA001317. https://doi.org/10.1029/2020EA001317
Senobari NS, Funning GJ, Keogh E et al (2018) Super-efficient cross-correlation (SEC-C): a fast matched filtering code suitable for desktop computers. Seismol Res Lett 90:322–334. https://doi.org/10.1785/0220180122
Sharma BK, Kumar A, Murthy VM (2010) Evaluation of seismic events detection algorithms. J Geol Soc India 75:533–538. https://doi.org/10.1007/s12594-010-0042-8
Shelly DR, Beroza GC, Ide S, Nakamula S (2006) Low-frequency earthquakes in Shikoku, Japan, and their relationship to episodic tremor and slip. Nature 442:188–191. https://doi.org/10.1038/nature04931
Trnkoczy A (1999) Understanding and parameter setting of STA/LTA trigger algorithm, p 20
Wiemer S, Wyss M (2000) Minimum magnitude of completeness in earthquake catalogs: examples from Alaska, the Western United States, and Japan. Bull Seismol Soc Am 90:859–869. https://doi.org/10.1785/0119990114
Willett DS, George J, Willett NS et al (2016) Machine learning for characterization of insect vector feeding. PLoS Comput Biol 12:e1005158. https://doi.org/10.1371/journal.pcbi.1005158
Ye L, Keogh E (2009) Time series shapelets: a new primitive for data mining. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. Association for Computing Machinery, New York, NY, USA, pp 947–956
Yeh CM, Zhu Y, Ulanova L et al (2016) Matrix Profile I: all pairs similarity joins for time series: a unifying view that includes motifs, discords and shapelets. In: 2016 IEEE 16th international conference on data mining (ICDM), pp 1317–1322
Yeh CM, Zhu Y, Dau HA et al (2019) Online amnestic DTW to allow real-time golden batch monitoring. In: Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining. ACM, Anchorage AK USA, pp 2604–2612
Yildirim O, Baloglu UB, Tan R-S et al (2019) A new approach for arrhythmia classification using deep coded features and LSTM networks. Comput Methods Programs Biomed 176:121–133. https://doi.org/10.1016/j.cmpb.2019.05.004
Yoon CE, O’Reilly O, Bergen KJ, Beroza GC (2015) Earthquake detection through computationally efficient similarity search. Sci Adv. https://doi.org/10.1126/sciadv.1501057
Zhu Y, Zimmerman Z, Shakibay Senobari N et al (2018) Exploiting a novel algorithm and GPUs to break the ten quadrillion pairwise comparisons barrier for time series motifs and joins. Knowl Inf Syst 54:203–236. https://doi.org/10.1007/s10115-017-1138-x
Zhu Y, Gharghabi S, Silva DF et al (2020) The Swiss army knife of time series data mining: ten useful things you can do with the matrix profile and ten lines of code. Data Min Knowl Discov 34:949–979. https://doi.org/10.1007/s10618-019-00668-6
Zilberstein S, Russell S (1995) Approximate reasoning using anytime algorithms. In: Natarajan S (ed) Imprecise and approximate computation. Springer, US, Boston, MA, pp 43–62
Acknowledgements
We thank all the creators of the data sets used in this work. We give a special thanks to Dr. Ladds for helping us confirm behaviors found in (Ladds et al. 2016).
Funding
Funding was provided by NSF IIS 2103976 and USDA NIFA-2017-67012-26100.
About this article
Cite this article
Mercer, R., Alaee, S., Abdoli, A. et al. Introducing the contrast profile: a novel time series primitive that allows real world classification. Data Min Knowl Disc 36, 877–915 (2022). https://doi.org/10.1007/s10618-022-00824-5
DOI: https://doi.org/10.1007/s10618-022-00824-5