Pattern Matching Under \(\textrm{DTW}\) Distance

  • Conference paper
String Processing and Information Retrieval (SPIRE 2022)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13617))

Abstract

In this work, we consider the problem of pattern matching under the dynamic time warping (\(\textrm{DTW}\)) distance, motivated by potential applications in the analysis of biological data produced by third-generation sequencing. To measure the \(\textrm{DTW}\) distance between two strings, one must “warp” them, that is, double some letters in the strings to obtain two equal-length strings, and then sum the distances between the letters in the corresponding positions. When the distances between letters are integers, we show that for a pattern P with m runs and a text T with n runs:

  1. There is an \(\mathcal {O}(m+n)\)-time algorithm that computes all locations where the \(\textrm{DTW}\) distance from P to T is at most 1;

  2. There is an \(\mathcal {O}(kmn)\)-time algorithm that computes all locations where the \(\textrm{DTW}\) distance from P to T is at most k.

As a corollary of the second result, we also derive an approximation algorithm for general metrics on the alphabet.
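
To make the definitions above concrete, here is a minimal Python sketch of the textbook quadratic-time dynamic program for the \(\textrm{DTW}\) distance and for its pattern-matching variant. It is not the run-length-based algorithm of the paper. The function names, the helper `unit_dist`, and the reading of a "location" as a 1-based end position of a matching substring of T are assumptions made for illustration.

```python
from math import inf


def unit_dist(x, y):
    """Hypothetical integer letter distance: 0 for equal letters, 1 otherwise."""
    return 0 if x == y else 1


def dtw(x, y, dist):
    """Textbook DTW distance between two non-empty strings (quadratic time)."""
    n, m = len(x), len(y)
    # D[a][b] is the DTW distance between the prefixes x[:a] and y[:b].
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0
    for a in range(1, n + 1):
        for b in range(1, m + 1):
            best_prev = min(D[a - 1][b - 1], D[a - 1][b], D[a][b - 1])
            D[a][b] = best_prev + dist(x[a - 1], y[b - 1])
    return D[n][m]


def dtw_match_ends(p, t, k, dist):
    """1-based end positions b such that some substring of t ending at b
    is within DTW distance k of p (naive O(|p| * |t|) scan)."""
    n = len(t)
    prev = [0] * (n + 1)  # row 0 is all zeros: a match may start anywhere in t
    for a in range(1, len(p) + 1):
        cur = [inf] * (n + 1)
        for b in range(1, n + 1):
            cur[b] = min(prev[b - 1], prev[b], cur[b - 1]) + dist(p[a - 1], t[b - 1])
        prev = cur
    return [b for b in range(1, n + 1) if prev[b] <= k]
```

For instance, `dtw("aab", "ab", unit_dist)` is 0, because doubling the letter `a` in `"ab"` makes the two strings equal.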

This work was partially funded by the grants ANR-20-CE48-0001, ANR-19-CE45-0008 SeqDigger and ANR-19-CE48-0016 from the French National Research Agency.

Notes

  1. The preprocessing time \(\mathcal {O}(|\varSigma |^2 \log L)\) that is required to embed \(\mu \) into a well-separated metric is not accounted for in the runtime of the algorithm.

References

  1. Abboud, A., Backurs, A., Williams, V.V.: Tight hardness results for LCS and other sequence similarity measures. In: FOCS 2015, pp. 59–78. IEEE Computer Society (2015). https://doi.org/10.1109/FOCS.2015.14

  2. Amarasinghe, S.L., Su, S., Dong, X., Zappia, L., Ritchie, M.E., Gouil, Q.: Opportunities and challenges in long-read sequencing data analysis. Genome Biol. 21(1), 1–16 (2020)

  3. Bansal, N., Buchbinder, N., Madry, A., Naor, J.: A polylogarithmic-competitive algorithm for the k-server problem. In: FOCS 2011, pp. 267–276 (2011). https://doi.org/10.1109/FOCS.2011.63

  4. Braverman, V., Charikar, M., Kuszmaul, W., Woodruff, D.P., Yang, L.F.: The one-way communication complexity of dynamic time warping distance. In: SoCG 2019. LIPIcs, vol. 129, pp. 16:1–16:15 (2019). https://doi.org/10.4230/LIPIcs.SoCG.2019.16

  5. Bringmann, K., Künnemann, M.: Quadratic conditional lower bounds for string problems and dynamic time warping. In: FOCS 2015, pp. 79–97 (2015). https://doi.org/10.1109/FOCS.2015.15

  6. Chen, J.Q., Wu, Y., Yang, H., Bergelson, J., Kreitman, M., Tian, D.: Variation in the ratio of nucleotide substitution and indel rates across genomes in mammals and bacteria. Mol. Biol. Evol. 26(7), 1523–1531 (2009). https://doi.org/10.1093/molbev/msp063

  7. Driemel, A., Silvestri, F.: Locality-sensitive hashing of curves. In: SoCG 2017. LIPIcs, vol. 77, pp. 37:1–37:16 (2017). https://doi.org/10.4230/LIPIcs.SoCG.2017.37

  8. Dupont, M., Marteau, P.-F.: Coarse-DTW for sparse time series alignment. In: Douzal-Chouakria, A., Vilar, J.A., Marteau, P.-F. (eds.) AALTD 2015. LNCS (LNAI), vol. 9785, pp. 157–172. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-44412-3_11

  9. Emiris, I.Z., Psarros, I.: Products of Euclidean metrics and applications to proximity questions among curves. In: SoCG 2018. LIPIcs, vol. 99, pp. 37:1–37:13 (2018). https://doi.org/10.4230/LIPIcs.SoCG.2018.37

  10. Fakcharoenphol, J., Rao, S., Talwar, K.: A tight bound on approximating arbitrary metrics by tree metrics. In: STOC 2003, pp. 448–455 (2003). https://doi.org/10.1145/780542.780608

  11. Froese, V., Jain, B.J., Rymar, M., Weller, M.: Fast exact dynamic time warping on run-length encoded time series. CoRR abs/1903.03003 (2019)

  12. Gold, O., Sharir, M.: Dynamic time warping and geometric edit distance: breaking the quadratic barrier. ACM Trans. Algorithms 14(4), 50:1–50:17 (2018). https://doi.org/10.1145/3230734

  13. Gonzalez-Garay, M.L.: Introduction to isoform sequencing using Pacific Biosciences technology (Iso-Seq). In: Wu, J. (ed.) Transcriptomics and Gene Regulation. TRBIO, vol. 9, pp. 141–160. Springer, Dordrecht (2016). https://doi.org/10.1007/978-94-017-7450-5_6

  14. Huang, Y.T., Liu, P.Y., Shih, P.W.: Homopolish: a method for the removal of systematic errors in nanopore sequencing by homologous polishing. Genome Biol. 22(1), 95 (2021). https://doi.org/10.1186/s13059-021-02282-6

  15. Hwang, Y., Gelfand, S.B.: Sparse dynamic time warping. In: Perner, P. (ed.) MLDM 2017. LNCS (LNAI), vol. 10358, pp. 163–175. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-62416-7_12

  16. Hwang, Y., Gelfand, S.B.: Binary sparse dynamic time warping. In: MLDM 2019, pp. 748–759. ibai Publishing (2019)

  17. Kuszmaul, W.: Dynamic time warping in strongly subquadratic time: algorithms for the low-distance regime and approximate evaluation. In: ICALP 2019. LIPIcs, vol. 132, pp. 80:1–80:15 (2019). https://doi.org/10.4230/LIPIcs.ICALP.2019.80

  18. Kuszmaul, W.: Dynamic time warping in strongly subquadratic time: algorithms for the low-distance regime and approximate evaluation. CoRR abs/1904.09690 (2019). https://doi.org/10.48550/ARXIV.1904.09690

  19. Kuszmaul, W.: Binary dynamic time warping in linear time. CoRR abs/2101.01108 (2021)

  20. Landau, G.M., Myers, E.W., Schmidt, J.P.: Incremental string comparison. SIAM J. Comput. 27(2), 557–582 (1998). https://doi.org/10.1137/S0097539794264810

  21. Landau, G.M., Vishkin, U.: Fast string matching with k differences. J. Comput. Syst. Sci. 37(1), 63–78 (1988). https://doi.org/10.1016/0022-0000(88)90045-1

  22. Li, H.: Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34(18), 3094–3100 (2018). https://doi.org/10.1093/bioinformatics/bty191

  23. Mahmoud, M., Gobet, N., Cruz-Dávalos, D.I., Mounier, N., Dessimoz, C., Sedlazeck, F.J.: Structural variant calling: the long and the short of it. Genome Biol. 20(1), 1–14 (2019). https://doi.org/10.1186/s13059-019-1828-7

  24. Mueen, A., Chavoshi, N., Abu-El-Rub, N., Hamooni, H., Minnich, A.: AWarp: fast warping distance for sparse time series. In: ICDM 2016, pp. 350–359. IEEE (2016)

  25. Nishi, A., Nakashima, Y., Inenaga, S., Bannai, H., Takeda, M.: Towards efficient interactive computation of dynamic time warping distance. In: Boucher, C., Thankachan, S.V. (eds.) SPIRE 2020. LNCS, vol. 12303, pp. 27–41. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-59212-7_3

  26. Sakai, Y., Inenaga, S.: A reduction of the dynamic time warping distance to the longest increasing subsequence length. In: ISAAC 2020. LIPIcs, vol. 181, pp. 6:1–6:16 (2020). https://doi.org/10.4230/LIPIcs.ISAAC.2020.6

  27. Sakoe, H., Chiba, S.: Dynamic programming algorithm optimization for spoken word recognition. IEEE Trans. Acoust. Speech Sig. Process. 26(1), 43–49 (1978)

Author information

Corresponding author

Correspondence to Garance Gourdel.

Appendices

Appendix A

Lemma 2

Consider a block \(B = D[i_p\mathinner {.\,.}j_p, i_t \mathinner {.\,.}j_t]\) and a cell \((a,b)\) in it. If \(i_p \le a < j_p\), then \(D[a,b] \le D[a+1,b]\), and if \(i_t \le b < j_t\), then \(D[a,b] \le D[a,b+1]\).

Proof

Let us first give an equivalent statement of the lemma: if \((a,b)\) and \((a+1,b)\) are in the same block, then \(D[a,b] \le D[a+1,b]\), and if \((a,b)\) and \((a,b+1)\) are in the same block, then \(D[a,b] \le D[a,b+1]\).

We show the lemma by induction on \(a+b\). The base cases of the induction are the cells such that \(a = 0\) or \(b = 0\), and for them the statement holds by the definition of D. Consider now a cell \((a,b)\), where \(a,b \ge 1\). Assume that the induction assumption holds for all cells \((x,y)\) such that \(x+y < a+b\). By Eq. 1, we have:

$$\begin{aligned}&D[a, b] = \min \{ D[a-1, b-1], D[a-1, b], D[a, b-1]\} +d\\&D[a+1, b] = \min \{ D[a, b-1], D[a, b], D[a+1, b-1]\} + d\\&D[a, b+1] = \min \{ D[a-1, b], D[a-1, b+1], D[a, b]\} + d\\ \end{aligned}$$

Assume that \((a,b)\) and \((a+1,b)\) are in the same block. We have \(D[a,b] \le D[a, b-1]+d\) and trivially \(D[a,b] \le D[a,b] + d\). By the induction assumption, \(D[a,b-1] \le D[a+1,b-1]\) (the cells \((a,b-1)\) and \((a+1,b-1)\) must belong to the same block). Therefore,

$$\begin{aligned} D[a+1,b]&= \min \{ D[a, b-1], D[a, b], D[a+1, b-1]\} + d \\&= \min \{ D[a, b-1] + d, D[a, b] + d, D[a+1, b-1] + d\} \\&\ge \min \{D[a,b], D[a,b], D[a,b-1]+d\} \\&\ge \min \{D[a,b], D[a,b], D[a,b]\} = D[a,b]. \end{aligned}$$

Assume now that \((a,b)\) and \((a,b+1)\) are in the same block. We have \(D[a,b] \le D[a-1, b]+d\). Furthermore, as \((a-1,b)\) and \((a-1,b+1)\) are in the same block, we have \(D[a-1,b] \le D[a-1,b+1]\) by the induction assumption. Therefore,

$$\begin{aligned} D[a,b+1]&= \min \{ D[a-1, b], D[a-1, b+1], D[a, b]\} + d\\&= \min \{ D[a-1, b] + d, D[a-1, b+1] + d, D[a, b] + d\}\\&\ge \min \{D[a-1,b]+d, D[a-1,b]+d, D[a,b]\}\\&\ge \min \{D[a,b], D[a,b], D[a,b]\} = D[a,b]. \end{aligned}$$

This concludes the proof of the lemma.    \(\square \)
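
The monotonicity stated in Lemma 2 can be sanity-checked numerically. The sketch below builds the table D with the textbook recurrence and tests both inequalities inside every block, where a block is taken to be the intersection of a run of P with a run of T. The border convention (row 0 equal to 0, column 0 equal to infinity), the letter distance `letter_gap`, and all function names are assumptions made for illustration and may differ from the definitions in the paper.

```python
from math import inf
import random


def runs(s):
    """Maximal runs of s as (start, end) pairs, 1-based and inclusive."""
    out, start = [], 0
    for i in range(1, len(s) + 1):
        if i == len(s) or s[i] != s[start]:
            out.append((start + 1, i))
            start = i
    return out


def letter_gap(x, y):
    """Hypothetical integer letter distance: gap between character codes."""
    return abs(ord(x) - ord(y))


def match_table(p, t, dist):
    """D[a][b]: smallest DTW distance between p[:a] and a suffix of t[:b]."""
    m, n = len(p), len(t)
    D = [[inf] * (n + 1) for _ in range(m + 1)]
    D[0] = [0] * (n + 1)  # assumed border: a match may start anywhere in t
    for a in range(1, m + 1):
        for b in range(1, n + 1):
            D[a][b] = min(D[a - 1][b - 1], D[a - 1][b], D[a][b - 1]) + dist(p[a - 1], t[b - 1])
    return D


def monotone_in_blocks(p, t, dist):
    """True if D is non-decreasing downwards and rightwards inside each block."""
    D = match_table(p, t, dist)
    for ip, jp in runs(p):
        for it, jt in runs(t):
            for a in range(ip, jp + 1):
                for b in range(it, jt + 1):
                    if a < jp and D[a][b] > D[a + 1][b]:
                        return False
                    if b < jt and D[a][b] > D[a][b + 1]:
                        return False
    return True


def check_random(trials, seed):
    """Run the block-monotonicity check on random strings over {a, b, c}."""
    rng = random.Random(seed)
    for _ in range(trials):
        p = "".join(rng.choice("abc") for _ in range(rng.randint(1, 8)))
        t = "".join(rng.choice("abc") for _ in range(rng.randint(1, 12)))
        if not monotone_in_blocks(p, t, letter_gap):
            return False
    return True
```

A call such as `check_random(200, 0)` exercises the check on 200 random instances. It is only a test on small inputs and does not replace the inductive proof.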

Appendix B

Theorem 2

Suppose we are given run-length encodings of a pattern P with m runs and of a text T with n runs over an alphabet \(\varSigma \). Assume that the \(\textrm{DTW}\) distance is specified by a metric \(\mu \) on \(\varSigma \), and suppose that the ratio between the largest and the smallest non-zero distances between the letters of \(\varSigma \) is at most exponential in \(L = \max \{|P|,|T|\}\). For any \(0< \varepsilon < 1\), there is an \(\mathcal {O}(L^{1-\varepsilon } \cdot mn \log ^3 L)\)-time algorithm that computes an \(\mathcal {O}(L^{\varepsilon })\)-approximation of the smallest \(\textrm{DTW}\) distance between P and a substring of T correctly with high probability (see Note 1).

Proof

Any metric \(\mu \) can be embedded in \(\mathcal {O}(\sigma ^2)\) time into a well-separated tree metric \(\mu _\tau \) of depth \(\mathcal {O}(\log \sigma )\) with expected distortion \(\mathcal {O}(\log \sigma )\) (see [10] and [3, Theorem 2.4]). Furthermore, the ratio between the smallest distance and the largest distance grows at most polynomially. Formally, for any two letters \(a, b\) we have \(\mu (a,b) \le \mu _\tau (a,b)\) and \(\mathbb {E}(\mu _\tau (a,b)) \le \mathcal {O}(\log \sigma ) \cdot \mu (a,b)\). Therefore, we have:

$$\begin{aligned} \textrm{DTW}_{\mu }(X,Y)&\le \textrm{DTW}_{\mu _\tau }(X,Y) \end{aligned} \tag{4}$$
$$\begin{aligned} \mathbb {E}(\textrm{DTW}_{\mu _\tau }(X,Y))&\le \mathcal {O}(\log \sigma ) \cdot \textrm{DTW}_\mu (X,Y) \end{aligned} \tag{5}$$

Let \(\delta = \min _{S-\text { substr. of }T} \textrm{DTW}_\mu (P,S)\) and \(\delta _\tau = \min _{S-\text { substr. of }T} \textrm{DTW}_{\mu _\tau } (P,S)\). Assume that \(\delta \) is realised on a substring X, and \(\delta _\tau \) on a substring \(X_\tau \). By Eq. 4, we then obtain:

$$\delta = \textrm{DTW}_\mu (P,X) \le \textrm{DTW}_\mu (P,X_\tau ) \le \delta _\tau $$

And Eq. 5 gives the following:

$$\mathbb {E}(\delta _\tau ) \le \mathbb {E}(\textrm{DTW}_{\mu _\tau } (P,X)) \le \mathcal {O}(\log \sigma ) \cdot \textrm{DTW}_\mu (P,X) = \mathcal {O}(\log \sigma ) \cdot \delta $$

We apply the embedding \(\log L\) times independently to obtain well-separated tree metrics \(\mu _\tau ^i\), \(i = 1, 2, \ldots , \log L\). From above and by Chernoff bounds,

$$\min _i \min _{S-\text { substring of }T} \textrm{DTW}_{\mu _\tau ^i}(P,S)$$

gives an \(\mathcal {O}(\log \sigma ) = \mathcal {O}(\log L)\) approximation of \(\delta \) with high probability and can be computed in time \(\mathcal {O}(L^{1-\varepsilon } \cdot mn \log ^3 L)\) by Lemma 6, concluding the proof of the theorem.    \(\square \)
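
The deterministic half of the argument, namely that a dominating metric can only increase the optimum (\(\delta \le \delta _\tau \)), can be illustrated with a small Python sketch. The alphabet {0, 1, 2, 3}, the line metric `mu`, and the hand-built two-level tree metric `mu_tau` are hypothetical choices made for illustration. The sketch does not implement the randomized embedding of [10], nor the algorithm of Lemma 6.

```python
from math import inf


def min_substring_dtw(p, t, dist):
    """Smallest DTW distance between p and a non-empty substring of t
    (naive O(|p| * |t|) dynamic program over non-empty sequences)."""
    n = len(t)
    prev = [0] * (n + 1)  # row 0 is all zeros: a match may start anywhere in t
    for a in range(1, len(p) + 1):
        cur = [inf] * (n + 1)
        for b in range(1, n + 1):
            cur[b] = min(prev[b - 1], prev[b], cur[b - 1]) + dist(p[a - 1], t[b - 1])
        prev = cur
    return min(prev[1:])


def mu(x, y):
    """Hypothetical original metric: letters 0..3 placed on a line."""
    return abs(x - y)


def mu_tau(x, y):
    """Hypothetical tree metric dominating mu: leaves {0, 1} and {2, 3} hang
    under two internal nodes joined at a root, with every edge of weight 1."""
    if x == y:
        return 0
    return 2 if x // 2 == y // 2 else 4


# Every pair of letters satisfies mu <= mu_tau, which is all that Eq. 4 needs.
assert all(mu(x, y) <= mu_tau(x, y) for x in range(4) for y in range(4))

P = [0, 3, 1]
T = [2, 2, 0, 1, 3, 3, 1]
delta = min_substring_dtw(P, T, mu)
delta_tau = min_substring_dtw(P, T, mu_tau)
```

With these inputs `delta` never exceeds `delta_tau`: any warping that is optimal under `mu_tau` costs at most as much under `mu`. The expected upper bound of Eq. 5 depends on the random embedding and is not reproduced here.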

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Gourdel, G., Driemel, A., Peterlongo, P., Starikovskaya, T. (2022). Pattern Matching Under \(\textrm{DTW}\) Distance. In: Arroyuelo, D., Poblete, B. (eds) String Processing and Information Retrieval. SPIRE 2022. Lecture Notes in Computer Science, vol 13617. Springer, Cham. https://doi.org/10.1007/978-3-031-20643-6_23

  • DOI: https://doi.org/10.1007/978-3-031-20643-6_23

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20642-9

  • Online ISBN: 978-3-031-20643-6

  • eBook Packages: Computer Science, Computer Science (R0)
