Introducing the contrast profile: a novel time series primitive that allows real world classification

Data Mining and Knowledge Discovery

Abstract

Time series data remains a perennially important datatype considered in data mining. In the last decade there has been an increasing realization that time series data can be best understood by reasoning about time series subsequences on the basis of their similarity to other subsequences: the two most familiar such time series concepts being motifs and discords. Time series motifs refer to two particularly close subsequences, whereas time series discords indicate subsequences that are far from their nearest neighbors. However, we argue that it can sometimes be useful to simultaneously reason about a subsequence’s closeness to certain data and its distance to other data. In this work we introduce a novel primitive called the Contrast Profile that allows us to efficiently compute such a definition in a principled way. As we will show, the Contrast Profile has many downstream uses, including anomaly detection, data exploration, and preprocessing unstructured data for classification. We demonstrate the utility of the Contrast Profile by showing how it allows end-to-end classification in datasets with tens of billions of datapoints, and how it can be used to explore datasets and reveal subtle patterns that might otherwise escape our attention. Moreover, we demonstrate the generality of the Contrast Profile by presenting detailed case studies in domains as diverse as seismology, animal behavior, and cardiology.
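
The idea of scoring a subsequence by its closeness to some data and its distance from other data can be made concrete with a small brute-force sketch. The abstract does not give the formula, so everything below is an illustrative assumption rather than the authors' implementation: the function names (`znorm_windows`, `pairwise_dist`, `contrast_sketch`) are hypothetical, and so are the scoring rule (nearest-neighbor distance to the other series minus nearest-neighbor distance within the same series, clipped at zero), the `sqrt(2m)` scaling, and the `m // 2` exclusion zone.

```python
import numpy as np


def znorm_windows(ts, m):
    """Return every length-m subsequence of ts as a z-normalized row."""
    w = np.lib.stride_tricks.sliding_window_view(np.asarray(ts, dtype=float), m)
    std = w.std(axis=1, keepdims=True)
    std[std == 0] = 1.0  # constant windows become all zeros instead of dividing by 0
    return (w - w.mean(axis=1, keepdims=True)) / std


def pairwise_dist(a, b):
    """Euclidean distance between every row of a and every row of b."""
    sq = (a * a).sum(axis=1)[:, None] + (b * b).sum(axis=1)[None, :] - 2.0 * (a @ b.T)
    return np.sqrt(np.maximum(sq, 0.0))  # clamp tiny negative rounding errors


def contrast_sketch(pos, neg, m):
    """Assumed score: high when a window of pos recurs in pos but has no close match in neg."""
    p, n = znorm_windows(pos, m), znorm_windows(neg, m)
    d_self = pairwise_dist(p, p)
    idx = np.arange(len(p))
    # Exclusion zone (assumed m // 2): ignore trivial matches with overlapping windows.
    d_self[np.abs(idx[:, None] - idx[None, :]) < m // 2] = np.inf
    near_same = d_self.min(axis=1)                # closeness to the same data
    near_other = pairwise_dist(p, n).min(axis=1)  # distance to the other data
    return np.maximum(near_other - near_same, 0.0) / np.sqrt(2 * m)


# Synthetic demo: one pattern planted twice in `pos`, absent from `neg`.
rng = np.random.default_rng(0)
m = 32
pattern = 3.0 * np.sin(np.linspace(0.0, 2.0 * np.pi, m))
pos = rng.normal(size=400)
pos[60:60 + m] = pattern
pos[250:250 + m] = pattern
neg = rng.normal(size=400)
cp = contrast_sketch(pos, neg, m)
```

In this toy setting the windows starting at indices 60 and 250 should score well above a typical noise window, because they match each other closely and match nothing in `neg`. The all-pairs distance matrices cost quadratic time and memory, so the sketch is only suitable for short series.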

Notes

  1. The original paper lists “two Australian fur seals (Arctocephalus pusillus doriferus) and three New Zealand fur seals (Arctocephalus forsteri)”; however, it is now commonly accepted that these are the same species, united under the more neutral name of Antipodean fur seals.

  2. This is what logically must be done; however, by caching distance calculations and recomputing only the values that could have changed, the time and space overhead for the Kth-1 Plato is inconsequential.

References

  • Abdoli A, Murillo AC, Yeh C-CM et al (2018) Time series classification to improve poultry welfare. In: 2018 17th IEEE international conference on machine learning and applications (ICMLA), pp 635–642

  • Abdoli A, Murillo AC, Gerry AC, Keogh EJ (2019) Time series classification: lessons learned in the (literal) field while studying chicken behavior. In: 2019 IEEE international conference on big data (Big Data), pp 5962–5964

  • Abdoli A, Alaee S, Imani S, et al (2020) Fitbit for chickens? Time series data mining can increase the productivity of poultry farms. In: Proceedings of the 26th ACM SIGKDD International conference on knowledge discovery & data mining. Association for Computing Machinery, New York, NY, USA, pp 3328–3336

  • Alaee S, Mercer R, Kamgar K, Keogh E (2021) Time series motifs discovery under DTW allows more robust discovery of conserved structure. Data Min Knowl Discov 35:863–910. https://doi.org/10.1007/s10618-021-00740-0

  • Allen R (1982) Automatic phase pickers: their present use and future prospects. Bull Seismol Soc Am 72:S225–S242. https://doi.org/10.1785/BSSA07206B0225

  • Aquarium of the Pacific (2017) Galumphing: how seals move on land

  • Beentjes MP (1990) Comparative terrestrial locomotion of the Hooker’s sea lion (Phocarctos hookeri) and the New Zealand fur seal (Arctocephalus forsteri): evolutionary and ecological implications. Zool J Linn Soc 98:307–325. https://doi.org/10.1111/j.1096-3642.1990.tb01204.x

  • Bergen KJ, Johnson PA, de Hoop MV, Beroza GC (2019) Machine learning for data-driven discovery in solid Earth geoscience. Science 363:eaau0323. https://doi.org/10.1126/science.aau0323

  • Beyer K, Goldstein J, Ramakrishnan R, Shaft U (1999) When is “nearest neighbor” meaningful? In: Beeri C, Buneman P (eds) Database theory—ICDT’99. Springer, Berlin, Heidelberg, pp 217–235

  • Breiman L, Friedman JH, Olshen RA, Stone CJ (2017) Classification and regression trees. Routledge, Boca Raton

  • Bu Y, Chen L, Fu AW-C, Liu D (2009) Efficient anomaly monitoring over moving object trajectory streams. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pp 159–168

  • Dau HA, Bagnall A, Kamgar K et al (2019) Welcome to the UCR time series classification/clustering page. https://www.cs.ucr.edu/~eamonn/time_series_data_2018/. Accessed 17 Jan 2021

  • Dietterich TG, Lathrop RH, Lozano-Pérez T (1997) Solving the multiple instance problem with axis-parallel rectangles. Artif Intell 89:31–71. https://doi.org/10.1016/S0004-3702(96)00034-3

  • Duputel Z, Tsai VC, Rivera L, Kanamori H (2013) Using centroid time-delays to characterize source durations and identify earthquakes with unique characteristics. Earth Planet Sci Lett 374:92–100. https://doi.org/10.1016/j.epsl.2013.05.024

  • Field EH, Arrowsmith RJ, Biasi GP et al (2014) Uniform California earthquake rupture forecast, version 3 (UCERF3)—The time-independent model. Bull Seismol Soc Am 104:1122–1180. https://doi.org/10.1785/0120130164

  • Goldberger AL, Amaral LAN, Glass L et al (2000) Physiobank, physiotoolkit, and physionet. Circulation 101:e215–e220. https://doi.org/10.1161/01.CIR.101.23.e215

  • Guan X, Raich R, Wong W-K (2016) Efficient multi-instance learning for activity recognition from time series data using an auto-regressive hidden Markov model. In: Proceedings of the 33rd international conference on machine learning. PMLR, pp 2330–2339

  • Hu B, Chen Y, Keogh E (2016) Classification of streaming time series under more realistic assumptions. Data Min Knowl Discov 30:403–437. https://doi.org/10.1007/s10618-015-0415-0

  • Hutton K, Woessner J, Hauksson E (2010) Earthquake monitoring in southern California for seventy-seven years (1932–2008). Bull Seismol Soc Am 100:423–446. https://doi.org/10.1785/0120090130

  • Kouadri WM, Ouziri M, Benbernou S et al (2020) Quality of sentiment analysis tools: the reasons of inconsistency. Proc VLDB Endow 14:668–681

  • Ladds MA, Thompson AP, Slip DJ et al (2016) Seeing it all: evaluating supervised machine learning methods for the classification of diverse otariid behaviours. PLoS ONE 11:e0166898. https://doi.org/10.1371/journal.pone.0166898

  • Lin J, Keogh E (2006) Group SAX: Extending the notion of contrast sets to time series and multimedia data. Knowledge discovery in databases: PKDD 2006. Springer, Berlin, Heidelberg, pp 284–296

  • MATLAB (n.d.) Sequence classification using deep learning. https://www.mathworks.com/help/deeplearning/ug/classify-sequence-data-using-lstm-networks.html. Accessed 21 Jan 2021

  • Mercer R (2021) Contrast profile. https://sites.google.com/view/contrastprofile. Accessed 5 Jan 2021

  • Mueen A, Keogh E, Young N (2011) Logical-shapelets: an expressive primitive for time series classification. In: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. Association for Computing Machinery, New York, NY, USA, pp 1154–1162

  • Mueen A, Zhu Y, Yeh CM et al (2015) The fastest similarity search algorithm for time series subsequences under Euclidean distance. https://www.cs.unm.edu/~mueen/FastestSimilaritySearch.html. Accessed 18 Jan 2021

  • Murillo AC, Abdoli A, Blatchford RA et al (2020) Parasitic mites alter chicken behaviour and negatively impact animal welfare. Sci Rep 10:8236. https://doi.org/10.1038/s41598-020-65021-0

  • Nakamura T, Imamura M, Mercer R, Keogh E (2020) MERLIN: parameter-free discovery of arbitrary length anomalies in massive time series archives. In: 2020 IEEE international conference on data mining (ICDM), pp 1190–1195

  • NCEDC (2014) Northern California earthquake data center

  • Pedestrian Counting System (2013b) City of Melbourne—Pedestrian counting system. In: Pedestrian Counting System. http://www.pedestrian.melbourne.vic.gov.au/#date=28-10-2021&time=8. Accessed 27 Oct 2021

  • Petersen MD, Mueller CS, Haller KM et al (2014) 2014 update of the United States national seismic hazard maps 8

  • Raghu N, Manjunatha KN (2019) Arrhythmia detection using machine learning techniques

  • Rakthanmanon T, Keogh E (2013) Fast shapelets: a scalable algorithm for discovering time series shapelets. In: Proceedings of the 2013 SIAM international conference on data mining. Society for Industrial and Applied Mathematics, pp 668–676

  • Rakthanmanon T (2013) Fast shapelets—Supporting website. http://alumni.cs.ucr.edu/~rakthant/FastShapelet/. Accessed 28 Sep 2021

  • Ross ZE, Trugman DT, Hauksson E, Shearer PM (2019) Searching for hidden earthquakes in Southern California. Science. https://doi.org/10.1126/science.aaw6888

  • Rost S, Thomas C (2002) Array seismology: methods and applications. Rev Geophys. https://doi.org/10.1029/2000RG000100

  • SCEDC (n.d.) Southern California Earthquake Data Center at Caltech. https://scedc.caltech.edu/faq/index.html#reviewed. Accessed 5 Oct 2021

  • Schaff DP, Waldhauser F (2005) Waveform cross-correlation-based differential travel-time measurements at the Northern California Seismic Network. Bull Seismol Soc Am 95:2446–2461. https://doi.org/10.1785/0120040221

  • Scholz J-R, Widmer-Schnidrig R, Davis P et al (2020) Detection, analysis, and removal of glitches from InSight’s seismic data from Mars. Earth Space Sci 7:e2020EA001317. https://doi.org/10.1029/2020EA001317

  • Senobari NS, Funning GJ, Keogh E et al (2018) Super-efficient cross-correlation (SEC-C): a fast matched filtering code suitable for desktop computers. Seismol Res Lett 90:322–334. https://doi.org/10.1785/0220180122

  • Sharma BK, Kumar A, Murthy VM (2010) Evaluation of seismic events detection algorithms. J Geol Soc India 75:533–538. https://doi.org/10.1007/s12594-010-0042-8

  • Shelly DR, Beroza GC, Ide S, Nakamula S (2006) Low-frequency earthquakes in Shikoku, Japan, and their relationship to episodic tremor and slip. Nature 442:188–191. https://doi.org/10.1038/nature04931

  • Trnkoczy A (1999) Understanding and parameter setting of STA/LTA trigger algorithm, p 20

  • Wiemer S, Wyss M (2000) Minimum magnitude of completeness in earthquake catalogs: examples from Alaska, the Western United States, and Japan. Bull Seismol Soc Am 90:859–869. https://doi.org/10.1785/0119990114

  • Willett DS, George J, Willett NS et al (2016) Machine learning for characterization of insect vector feeding. PLoS Comput Biol 12:e1005158. https://doi.org/10.1371/journal.pcbi.1005158

  • Ye L, Keogh E (2009) Time series shapelets: a new primitive for data mining. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. Association for Computing Machinery, New York, NY, USA, pp 947–956

  • Yeh CM, Zhu Y, Ulanova L et al (2016) Matrix Profile I: all pairs similarity joins for time series: a unifying view that includes motifs, discords and shapelets. In: 2016 IEEE 16th international conference on data mining (ICDM), pp 1317–1322

  • Yeh CM, Zhu Y, Dau HA et al (2019) Online amnestic DTW to allow real-time golden batch monitoring. In: Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining. ACM, Anchorage AK USA, pp 2604–2612

  • Yildirim O, Baloglu UB, Tan R-S et al (2019) A new approach for arrhythmia classification using deep coded features and LSTM networks. Comput Methods Programs Biomed 176:121–133. https://doi.org/10.1016/j.cmpb.2019.05.004

  • Yoon CE, O’Reilly O, Bergen KJ, Beroza GC (2015) Earthquake detection through computationally efficient similarity search. Sci Adv. https://doi.org/10.1126/sciadv.1501057

  • Zhu Y, Zimmerman Z, Shakibay Senobari N et al (2018) Exploiting a novel algorithm and GPUs to break the ten quadrillion pairwise comparisons barrier for time series motifs and joins. Knowl Inf Syst 54:203–236. https://doi.org/10.1007/s10115-017-1138-x

  • Zhu Y, Gharghabi S, Silva DF et al (2020) The Swiss army knife of time series data mining: ten useful things you can do with the matrix profile and ten lines of code. Data Min Knowl Discov 34:949–979. https://doi.org/10.1007/s10618-019-00668-6

  • Zilberstein S, Russell S (1995) Approximate reasoning using anytime algorithms. In: Natarajan S (ed) Imprecise and approximate computation. Springer, US, Boston, MA, pp 43–62

Acknowledgements

We thank all the creators of the data sets used in this work. We give a special thanks to Dr. Ladds for helping us confirm behaviors found in (Ladds et al. 2016).

Funding

Funding was provided by NSF IIS 2103976 and USDA NIFA-2017-67012-26100.

Author information

Corresponding author

Correspondence to Ryan Mercer.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Mercer, R., Alaee, S., Abdoli, A. et al. Introducing the contrast profile: a novel time series primitive that allows real world classification. Data Min Knowl Disc 36, 877–915 (2022). https://doi.org/10.1007/s10618-022-00824-5

  • DOI: https://doi.org/10.1007/s10618-022-00824-5
