Accelerating pattern-based time series classification: a linear time and space string mining approach

Raza, Atif; Kramer, Stefan

doi:10.1007/s10115-019-01378-7

Regular Paper
Published: 12 July 2019

Volume 62, pages 1113–1141, (2020)
Knowledge and Information Systems

Abstract

Subsequences-based time series classification algorithms provide interpretable and generally more accurate classification models compared to the nearest neighbor approach, albeit at a considerably higher computational cost. A number of discretized time series-based algorithms have been proposed to reduce the computational complexity of these algorithms; however, the asymptotic time complexity of the proposed algorithms is also cubic or higher-order polynomial. We present a remarkably fast and resource-efficient time series classification approach which employs a linear time and space string mining algorithm for extracting frequent patterns from discretized time series data. Compared to other subsequence or pattern-based classification algorithms, the proposed approach only requires a few parameters, which can be chosen arbitrarily and do not require any fine-tuning for different datasets. The time series data are discretized using symbolic aggregate approximation, and frequent patterns are extracted using a string mining algorithm. An independence test is used to select the most discriminative frequent patterns, which are subsequently used to create a transformed version of the time series data. Finally, a classification model can be trained using any off-the-shelf algorithm. Extensive empirical evaluations demonstrate the competitive classification accuracy of our approach compared to other state-of-the-art approaches. The experiments also show that our approach is at least one to two orders of magnitude faster than the existing pattern-based methods due to the extremely fast frequent pattern extraction, which is the most computationally intensive process in pattern-based time series classification approaches.
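
The discretization step named in the abstract, symbolic aggregate approximation (SAX), can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `sax_word`, the four-letter alphabet, the use of the population standard deviation, and the requirement that the series length be divisible by the number of segments are all assumptions made for illustration. The sketch z-normalizes a series, averages it over equal-length segments, and maps each average to a symbol using breakpoints that split the standard Gaussian into equiprobable regions.

```python
from statistics import NormalDist, mean, pstdev


def sax_word(series, n_segments, alphabet="abcd"):
    """Discretize one series into a SAX word (illustrative sketch).

    Assumes len(series) is an integer multiple of n_segments.
    """
    mu, sigma = mean(series), pstdev(series)
    # z-normalize; a constant series is mapped to all zeros
    z = [(x - mu) / sigma if sigma > 0 else 0.0 for x in series]

    # piecewise aggregate approximation: mean of each equal-length segment
    seg_len = len(z) // n_segments
    paa = [mean(z[i * seg_len:(i + 1) * seg_len]) for i in range(n_segments)]

    # breakpoints that cut N(0, 1) into len(alphabet) equiprobable regions
    k = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(j / k) for j in range(1, k)]

    # symbol index = number of breakpoints at or below the segment mean
    return "".join(alphabet[sum(v >= b for b in breakpoints)] for v in paa)


print(sax_word([1, 1, 2, 2, 3, 3, 4, 4], n_segments=4))  # abcd
```

The resulting words are ordinary strings, which is what allows a string mining algorithm to be applied to them.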

Notes

Throughout the text, we refer to real-valued time series segments as “subsequences” and the discretized/symbolic segments as “patterns.”
The presented mathematical notation is for the simple case of integer values of p; later SAX refinements enable handling non-integer window sizes as well.
Research suggests that a large number of time series datasets follow the Gaussian distribution. For the minority of datasets which do not follow this assumption, selecting the breakpoints using the Gaussian curve can deteriorate the efficiency of SAX; however, the “correctness of the algorithm is unaffected” [10].
Usually, multi-view learning refers to learning with different sets of features of vectorial data; here, however, we use the term for multiple representations of the same time series data originating from different parameterizations.
1. This criterion and procedure are not to be confused with closed or open/free patterns [16]. 2. Note that there can be two patterns p and q, with one pattern p being more general than the other, $p \prec q$, both having the same value of $\chi ^2$ ($\chi ^2(p) = \chi ^2(q)$), but yet occurring in different sets of positive and negative examples. However, this should be expected to be a rather infrequent case. The overall filtering procedure of patterns just makes sure that the patterns are frequent enough in the positives, infrequent enough in the negatives, highly discriminative and, given the same discriminative power, as general as possible.
A discussion about calculation of the $\chi ^2$ test statistic and the information gain is provided in “Appendix.”
http://www.timeseriesclassification.com/.
Our implementation is available from https://github.com/atifraza/MiSTiCl.
http://www.cs.ucr.edu/~eamonn/time_series_data/.
Following the parameter settings provided by the UEA Time Series Repository.
The results for ST have been taken from the UEA Time Series Repository.

References

1. Breiman L (2001) Random forests. Mach Learn 45(1):5–32. https://doi.org/10.1023/A:1010933404324
2. Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30
3. Dhaliwal J, Puglisi SJ, Turpin A (2012) Practical efficient string mining. IEEE Trans Knowl Data Eng 24(4):735–744. https://doi.org/10.1109/TKDE.2010.242
4. Ding H, Trajcevski G, Scheuermann P, Wang X, Keogh E (2008) Querying and mining of time series data. Proc VLDB Endow 1(2):1542–1552
5. Fischer J, Heun V, Kramer S (2005) Fast frequent string mining using suffix arrays. In: 5th International conference on data mining, IEEE, ICDM ’05, pp 609–612. https://doi.org/10.1109/ICDM.2005.62
6. Fischer J, Heun V, Kramer S (2006) Optimal string mining under frequency constraints. In: Knowledge discovery in databases, PKDD 2006, lecture notes in computer science, vol 4213. Springer, Berlin, pp 139–150. https://doi.org/10.1007/11871637_17
7. Freund Y (1995) Boosting a weak learning algorithm by majority. Inf Comput 121(2):256–285. https://doi.org/10.1006/inco.1995.1136
8. Geurts P, Ernst D, Wehenkel L (2006) Extremely randomized trees. Mach Learn 63(1):3–42. https://doi.org/10.1007/s10994-006-6226-1
9. Hills J, Lines J, Baranauskas E, Mapp J, Bagnall A (2014) Classification of time series by shapelet transformation. Data Min Knowl Discov 28(4):851–881. https://doi.org/10.1007/s10618-013-0322-1
10. Lin J, Keogh E, Wei L, Lonardi S (2007) Experiencing SAX: a novel symbolic representation of time series. Data Min Knowl Discov 15(2):107–144. https://doi.org/10.1007/s10618-007-0064-z
11. Lin J, Khade R, Li Y (2012) Rotation-invariant similarity in time series using bag-of-patterns representation. J Intell Inf Syst 39(2):287–315. https://doi.org/10.1007/s10844-012-0196-5
12. Rakthanmanon T, Keogh E (2013) Fast shapelets: a scalable algorithm for discovering time series shapelets. In: Proceedings of the 2013 SIAM international conference on data mining, SDM, Society for Industrial and Applied Mathematics, pp 668–676. https://doi.org/10.1137/1.9781611972832.74
13. Schäfer P (2015) The BOSS is concerned with time series classification in the presence of noise. Data Min Knowl Discov 29(6):1505–1530. https://doi.org/10.1007/s10618-014-0377-7
14. Schäfer P (2016) Scalable time series classification. Data Min Knowl Discov 30(5):1273–1298. https://doi.org/10.1007/s10618-015-0441-y
15. Senin P, Malinchik S (2013) SAX-VSM: interpretable time series classification using SAX and vector space model. In: 13th International conference on data mining, IEEE, ICDM ’13, pp 1175–1180. https://doi.org/10.1109/ICDM.2013.52
16. Toivonen H (2017) Frequent pattern. In: Sammut C, Webb GI (eds) Encyclopedia of machine learning and data mining. Springer, Boston, pp 524–529. https://doi.org/10.1007/978-1-4899-7687-1_318
17. Ye L, Keogh E (2011) Time series shapelets: a novel technique that allows accurate, interpretable and fast classification. Data Min Knowl Discov 22(1):149–182. https://doi.org/10.1007/s10618-010-0179-5

Acknowledgements

We are grateful to the reviewers for their comments and suggestions which helped in improving the quality of this paper. The first author was supported by a scholarship from the Higher Education Commission (HEC), Pakistan, and the German Academic Exchange Service (DAAD), Germany.

Author information

Authors and Affiliations

Institute of Computer Science, Johannes Gutenberg University Mainz, Staudingerweg 9, Mainz, Germany
Atif Raza & Stefan Kramer

Corresponding author

Correspondence to Atif Raza.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: Calculating independence test statistics

Section 3.3 provides the algorithmic details for selecting frequent patterns based on their discriminative power. Either the $\chi ^2$ independence test statistic or the information gain value can be used to quantify the discriminative power of a given pattern, i.e., how effectively it identifies the instances of a given class. This section explains how to calculate these statistics from the occurrence frequency of a pattern in the positive and negative class dataset splits. The notation used in the following discussion is given below.

Symbol	Representation
$\widehat{P}$	Positive class dataset
$\widehat{N}$	Negative class dataset
$N_{\widehat{P}}$	Number of instances in $\widehat{P}$
$N_{\widehat{N}}$	Number of instances in $\widehat{N}$
p	Frequent pattern
$f_{\widehat{P}}$	Occurrence frequency of p in $\widehat{P}$
$f_{\widehat{N}}$	Occurrence frequency of p in $\widehat{N}$

1.1 Calculating the $\chi ^2$ test statistic

The $\chi ^2$ test statistic is calculated based on observed ($O_{ij}$) and expected ($E_{ij}$) values for the given categorical variables. The formula for calculating the $\chi ^2$ statistic is given below.

$$\begin{aligned} \chi ^2 = \sum _{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \end{aligned}$$

Observed values ($O_{ij}$) correspond to the number of instances observed as belonging to a certain categorical variable. In our case, it is the number of instances labeled as belonging to the positive or negative class given a particular frequent pattern. This can be determined using the instance counts of the positive and negative class datasets and occurrence frequency values of the given pattern in the respective dataset splits. Based on these values, a contingency table can be created as follows.

	Dataset splits
	Positive, $\widehat{P}$	Negative, $\widehat{N}$	Row sum
With p	$O_{11}=\lfloor f_{\widehat{P}} \times N_{\widehat{P}} \rceil$	$O_{12}=\lfloor f_{\widehat{N}} \times N_{\widehat{N}} \rceil$	$RSum_{1.} = O_{11}+O_{12}$
Without p	$O_{21}=N_{\widehat{P}}-O_{11}$	$O_{22}=N_{\widehat{N}}-O_{12}$	$RSum_{2.} = O_{21}+O_{22}$
Column sum	$CSum_{.1}=O_{11}+O_{21}$	$CSum_{.2}=O_{12}+O_{22}$	$n=\sum _{i,j} O_{ij}$

The rows of this contingency table correspond to the number of instances containing or not containing the given pattern p, while the columns correspond to the positive and negative dataset splits, respectively. The two row sums add up to n, as do the two column sums, where n is the total number of instances in the positive and negative dataset splits. Finally, the expected values ($E_{ij}$) are calculated using the following formula.

$$\begin{aligned} E_{ij} = \frac{RSum_{i.} \times CSum_{.j}}{n} \end{aligned}$$

The $\chi ^2$ test statistic measures whether the occurrence of the frequent pattern is related to the dataset split (positive or negative) an instance belongs to. If the pattern occurs with a similar relative frequency in both dataset splits, the observed counts are close to the expected counts and the $\chi ^2$ value is close to zero, which signifies that the pattern does not separate the two splits. We can therefore order the frequent patterns by their $\chi ^2$ statistic and select the ones with the largest values, i.e., those whose occurrence differs most between the two dataset splits.
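
The calculation above can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' code: the function name `chi_square`, the round-half-up rule used for $\lfloor \cdot \rceil$, and the choice to skip cells whose expected value is zero are assumptions. The arguments correspond to $f_{\widehat{P}}$, $f_{\widehat{N}}$, $N_{\widehat{P}}$ and $N_{\widehat{N}}$ from the notation table.

```python
from math import floor


def chi_square(f_pos, f_neg, n_pos, n_neg):
    """Chi-square statistic of a pattern from its relative frequencies."""
    # Observed counts of instances containing the pattern, rounded to the
    # nearest integer (ties rounded up; the tie rule is an assumption).
    o11 = floor(f_pos * n_pos + 0.5)
    o12 = floor(f_neg * n_neg + 0.5)
    observed = [[o11, o12],
                [n_pos - o11, n_neg - o12]]

    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    n = sum(row_sums)

    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / n
            if expected > 0:  # assumption: empty rows/columns contribute nothing
                stat += (observed[i][j] - expected) ** 2 / expected
    return stat


# Hypothetical pattern seen in 80% of 50 positives and 20% of 50 negatives.
print(chi_square(0.8, 0.2, 50, 50))
# A pattern that is equally frequent in both splits.
print(chi_square(0.5, 0.5, 40, 60))
```

With these hypothetical inputs, the first pattern yields observed counts of 40, 10, 10 and 40 against an expected count of 25 in every cell, while the second pattern matches its expected counts exactly and therefore scores zero.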

1.2 Calculating the information gain value

Entropy (H) is a measure for establishing whether a dataset has a uniform or varying distribution in terms of the different classes of instances. Given a dataset with positive and negative class instances, the entropy of the dataset can be calculated using the following formula.

$$\begin{aligned} H = -\Bigg (\frac{N_{\widehat{P}}}{N_{\widehat{P}}+N_{\widehat{N}}} \times \mathrm{log}_2 \frac{N_{\widehat{P}}}{N_{\widehat{P}}+N_{\widehat{N}}}\Bigg ) -\Bigg (\frac{N_{\widehat{N}}}{N_{\widehat{P}}+N_{\widehat{N}}} \times \mathrm{log}_2 \frac{N_{\widehat{N}}}{N_{\widehat{P}}+N_{\widehat{N}}}\Bigg ) \end{aligned}$$
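
As a worked illustration with hypothetical counts (not taken from the paper), assume $N_{\widehat{P}} = 40$ and $N_{\widehat{N}} = 60$. The class proportions are then 0.4 and 0.6, and the entropy evaluates approximately as follows.

$$\begin{aligned} H = -\big (0.4 \times \mathrm{log}_2\, 0.4\big ) - \big (0.6 \times \mathrm{log}_2\, 0.6\big ) \approx 0.529 + 0.442 = 0.971 \end{aligned}$$

A perfectly balanced dataset ($N_{\widehat{P}} = N_{\widehat{N}}$) gives the maximum value $H = 1$, whereas a dataset containing instances of a single class gives $H = 0$ under the usual convention $0 \times \mathrm{log}_2\, 0 = 0$.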

If a pattern p occurs frequently in either class of instances in the dataset, we can split the dataset into two subsets based on the presence or absence of this pattern in each of the instances. The entropy of the subset of instances containing p, $H_{\widehat{P}}$, and of the subset of instances not containing p, $H_{\widehat{N}}$, can then be calculated using the following equations.

$$\begin{aligned} H_{\widehat{P}} = {}& -\Bigg ( \frac{ f_{\widehat{P}} \times N_{\widehat{P}} }{ f_{\widehat{P}} \times N_{\widehat{P}} + f_{\widehat{N}} \times N_{\widehat{N}} } \times \mathrm{log}_2 \frac{ f_{\widehat{P}} \times N_{\widehat{P}} }{ f_{\widehat{P}} \times N_{\widehat{P}} + f_{\widehat{N}} \times N_{\widehat{N}} } \Bigg ) \\ & -\Bigg ( \frac{ f_{\widehat{N}} \times N_{\widehat{N}} }{ f_{\widehat{P}} \times N_{\widehat{P}} + f_{\widehat{N}} \times N_{\widehat{N}} } \times \mathrm{log}_2 \frac{ f_{\widehat{N}} \times N_{\widehat{N}} }{ f_{\widehat{P}} \times N_{\widehat{P}} + f_{\widehat{N}} \times N_{\widehat{N}} } \Bigg ) \\ H_{\widehat{N}} = {}& -\Bigg ( \frac{ (1-f_{\widehat{P}}) \times N_{\widehat{P}} }{ (1-f_{\widehat{P}}) \times N_{\widehat{P}} + (1-f_{\widehat{N}}) \times N_{\widehat{N}} } \times \mathrm{log}_2 \frac{ (1-f_{\widehat{P}}) \times N_{\widehat{P}} }{ (1-f_{\widehat{P}}) \times N_{\widehat{P}} + (1-f_{\widehat{N}}) \times N_{\widehat{N}} } \Bigg ) \\ & -\Bigg ( \frac{ (1-f_{\widehat{N}}) \times N_{\widehat{N}} }{ (1-f_{\widehat{P}}) \times N_{\widehat{P}} + (1-f_{\widehat{N}}) \times N_{\widehat{N}} } \times \mathrm{log}_2 \frac{ (1-f_{\widehat{N}}) \times N_{\widehat{N}} }{ (1-f_{\widehat{P}}) \times N_{\widehat{P}} + (1-f_{\widehat{N}}) \times N_{\widehat{N}} } \Bigg ) \end{aligned}$$

Using the entropy values of the source dataset and the positive and negative subsets, we can calculate the information gain value using the following formula.

$$\begin{aligned} IG = H - \Bigg ( \frac{f_{\widehat{P}} \times N_{\widehat{P}} + f_{\widehat{N}} \times N_{\widehat{N}}}{N_{\widehat{P}}+N_{\widehat{N}}} \times H_{\widehat{P}} + \frac{(1-f_{\widehat{P}}) \times N_{\widehat{P}} + (1-f_{\widehat{N}}) \times N_{\widehat{N}}}{N_{\widehat{P}}+N_{\widehat{N}}} \times H_{\widehat{N}} \Bigg ) \end{aligned}$$

If the frequent pattern effectively distinguishes between the two classes, the positive and negative class subsets will have very few or no instances of the other class resulting in a smaller value of entropy for the two subsets. This in turn will cause a higher information gain value indicating that the pattern is a good candidate for distinguishing between the two classes of instances. If, however, the converse is true, then the pattern is not a good candidate. This way the candidates can be selected on the basis of their discriminative power.
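
The entropy and information gain formulas of this section can be combined into a short Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the names `entropy` and `information_gain` are hypothetical, the counts are left unrounded exactly as in the formulas, and terms with a zero count are treated as contributing zero (the usual convention $0 \times \mathrm{log}_2\, 0 = 0$).

```python
from math import log2


def entropy(count_a, count_b):
    """Binary entropy of a set holding count_a and count_b instances."""
    total = count_a + count_b
    if total == 0:
        return 0.0  # assumption: an empty subset has zero entropy
    h = 0.0
    for count in (count_a, count_b):
        if count > 0:  # convention: 0 * log2(0) = 0
            p = count / total
            h -= p * log2(p)
    return h


def information_gain(f_pos, f_neg, n_pos, n_neg):
    """Information gain of splitting the dataset on presence of a pattern."""
    # Instances containing the pattern, per class.
    with_pos, with_neg = f_pos * n_pos, f_neg * n_neg
    # Instances not containing the pattern, per class.
    without_pos, without_neg = (1 - f_pos) * n_pos, (1 - f_neg) * n_neg

    n = n_pos + n_neg
    h = entropy(n_pos, n_neg)                      # H of the full dataset
    h_with = entropy(with_pos, with_neg)           # subset containing p
    h_without = entropy(without_pos, without_neg)  # subset not containing p

    weighted = ((with_pos + with_neg) / n * h_with
                + (without_pos + without_neg) / n * h_without)
    return h - weighted


# Hypothetical pattern present in every positive and no negative instance.
print(information_gain(1.0, 0.0, 50, 50))
# A pattern that is equally frequent in both classes.
print(information_gain(0.5, 0.5, 40, 60))
```

A pattern that separates the classes perfectly recovers the full entropy of the dataset as gain, while a pattern that is equally frequent in both classes leaves the class proportions unchanged in both subsets and yields a gain of zero, up to floating-point error.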

About this article

Cite this article

Raza, A., Kramer, S. Accelerating pattern-based time series classification: a linear time and space string mining approach. Knowl Inf Syst 62, 1113–1141 (2020). https://doi.org/10.1007/s10115-019-01378-7

Received: 24 August 2018
Revised: 15 June 2019
Accepted: 23 June 2019
Published: 12 July 2019
Issue Date: March 2020
DOI: https://doi.org/10.1007/s10115-019-01378-7
