Subsequence versus substring constraints in sequence pattern languages

  • Original Article
  • Acta Informatica

Abstract

A family of logics for expressing patterns in sequences is investigated. The logics are all fragments of first-order logic, but they are variable-free. Instead, they use substring and subsequence constraints as basic propositions. Propositions expressing constraints on the beginning or the end of the sequence are also available. Wildcards can also be used, which is important when the alphabet is not fixed, as is typical in database applications. The maximal logic, with all four features (substring constraints, subsequence constraints, begin–end constraints, and wildcards), turns out to be equivalent to the family of star-free regular languages of dot-depth at most one. We investigate the lattice formed by taking all possible combinations of these four features, and show it to be strict. For instance, we formally confirm what might intuitively be expected, namely, that boolean combinations of substring constraints are not sufficient to express subsequence constraints, and vice versa. We show that an expressiveness hierarchy results from allowing multiple wildcards. We also investigate what happens with regular expressions when concatenation is replaced by subsequencing. Finally, we study the expressiveness of our logic relative to first-order logic.
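
As an informal illustration of the two basic kinds of constraint, the following Python sketch checks whether a pattern occurs in a word as a substring (contiguously) or as a subsequence (in order, with gaps allowed). The function names `is_substring` and `is_subsequence` are hypothetical and chosen for this example only; the sketch does not reproduce the formal syntax or semantics of the logics studied in the article.

```python
def is_substring(pattern: str, word: str) -> bool:
    """Return True if pattern occurs in word as one contiguous block."""
    return pattern in word


def is_subsequence(pattern: str, word: str) -> bool:
    """Return True if the letters of pattern occur in word in order, gaps allowed."""
    letters = iter(word)
    # `ch in letters` consumes the iterator up to the first match of ch,
    # so later letters of the pattern are searched only to the right of it.
    return all(ch in letters for ch in pattern)


if __name__ == "__main__":
    word = "abcab"
    print(is_substring("bca", word))    # contiguous occurrence of "bca"
    print(is_subsequence("aab", word))  # a, a, b appear in order with gaps
    print(is_substring("aab", word))    # no contiguous "aab" in the word
```

The word `abcab` contains `aab` as a subsequence but not as a substring, which is the kind of distinction the two constraint types draw.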

Notes

  1. If we identify a condition with the set of sequences (words) that satisfy it, a condition is indeed a formal language.

  2. Note that mixing substring and subsequence constraints is the same as allowing “globbing” wildcards \(*\) in substring patterns, familiar from the Unix shell where the previous pattern would be written as \(ab*bc*c*a\).

  3. Over the one-letter alphabet, all logics we consider collapse to the family of finite or cofinite languages.

  4. We note that, motivated in part by database applications, formal language theory has now been revisited to accommodate an unknown (or infinite) alphabet [18].

  5. By the connections given in the previous section, this theorem includes as a special case the long-known fact that the family of locally testable languages is strictly included in the family of star-free regular languages.

  6. The definition of \(\mathcal {R}_{k+2}^{+}\) considered in [21] contains all words of the form (1) where each \(c_i\) does not appear in \(w_i\). However, this is merely a more restricted version of the language \(\mathcal {U}(\psi _{k+1})\). The proof in [21, Lemma 4.3] still trivially holds even if we drop the requirement that each \(c_i\) does not appear in \(w_i\).

  7. Uniformly over all alphabets, or over an infinite alphabet, this is another matter [9].
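
The globbing pattern mentioned in Note 2 can be tried out with Python's standard `fnmatch` module, which implements Unix shell-style wildcards. This is only a sketch: `fnmatchcase` matches the whole word against the pattern, so the pattern is anchored at both ends here, and whether that anchoring agrees with the article's intended reading of the pattern is an assumption. The helper name `matches` is hypothetical.

```python
from fnmatch import fnmatchcase

# Shell-style pattern from Note 2: each * matches any (possibly empty) run of letters.
PATTERN = "ab*bc*c*a"


def matches(word: str) -> bool:
    """Return True if the whole word matches PATTERN (case-sensitive)."""
    return fnmatchcase(word, PATTERN)


if __name__ == "__main__":
    for word in ["abbcca", "abxbcyczza", "abca"]:
        print(word, matches(word))
```

The blocks `ab`, `bc`, `c`, and `a` must each occur contiguously and in this order, while the letters between them are unconstrained.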

References

  1. Büchi, J.R.: Weak second-order arithmetic and finite automata. Zeitschrift für Mathematische Logik und Grundlagen der Mathematik 6, 66–92 (1960)

  2. Brzozowski, J.A., Knast, R.: The dot-depth hierarchy of star-free languages is infinite. J. Comput. Syst. Sci. 16, 37–55 (1978)

  3. Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. Addison-Wesley, Boston (1999)

  4. Cohen, R.S., Brzozowski, J.A.: Dot-depth of star-free events. J. Comput. Syst. Sci. 5(1), 1–16 (1971)

  5. Dong, G., Pei, J.: Sequence Data Mining. Springer, Berlin (2007)

  6. Faloutsos, Ch., Ranganathan, M., Manolopoulos, Y.: Fast subsequence matching in time-series databases. In: Proceedings ACM SIGMOD International Conference on Management of Data, pp. 419–429 (1994)

  7. Genkin, D., Kaminski, M., Peterfreund, L.: Closure under reversal of languages over infinite alphabets. In: Fomin, E., Podolskii, V. (eds.) Computer Science Symposium in Russia, Proceedings (CSR), volume 10846 of Lecture Notes in Computer Science, Springer, pp. 145–156 (2018)

  8. Jagadish, H.V., et al.: Making database systems usable. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 13–24 (2007)

  9. Kaminski, M., Tan, T.: Regular expressions for languages over infinite alphabets. Fundam. Inf. 69, 301–318 (2006)

  10. Loeffen, A.: Text databases: a survey of text models and systems. SIGMOD Record 23(1), 97–106 (1994)

  11. McNaughton, R., Papert, S.: Counter-Free Automata. MIT Press, Cambridge (1971)

  12. Neven, F., Schwentick, T., Vianu, V.: Finite state machines for strings over infinite alphabets. ACM Trans. Comput. Logic 5(3), 403–435 (2004)

  13. Patel, J.M.: Special issue on querying biological sequences. IEEE Data Eng. Bull. 27(3) (2004)

  14. Peterfreund, L.: Closure under reversal of languages over infinite alphabets: a case study. Master thesis, Department of Computer Science, Technion—Israel Institute of Technology (2015)

  15. Pin, J.E.: Syntactic semigroups. In: Rozenberg, G., Salomaa, A. (eds.) Handbook of Formal Languages, vol. 1, chapter 10. Springer (1997)

  16. Pin, J.E.: The dot-depth hierarchy, 45 years later. The role of theory in computer science, pp. 177–202 (2017)

  17. Place, T., van Rooijen, L., Zeitoun, M.: Separating regular languages by locally testable and locally threshold testable languages. Logical Methods Comput. Sci. 10(3) (2014)

  18. Segoufin, L.: Automata and logics for words and trees over an infinite alphabet. In: Ésik, Z. (ed.) Computer Science Logic, Proceedings (CSL), volume 4207 of Lecture Notes in Computer Science, Springer, pp. 41–57 (2006)

  19. Simon, I.: Piecewise testable events. In: Brakhage, H. (ed.) Automata Theory and Formal Languages, Proceedings, volume 33 of Lecture Notes in Computer Science, Springer, pp. 214–222 (1975)

  20. Tan, T.: On pebble automata for data languages with decidable emptiness problem. J. Comput. Syst. Sci. 76(8), 778–791 (2010)

  21. Tan, T.: Graph reachability and pebble automata over infinite alphabets. ACM Trans. Comput. Logic 14(3), 19 (2013)

  22. Thomas, W.: A concatenation game and the dot-depth hierarchy. In: Computation Theory and Logic, volume 270 of Lecture Notes in Computer Science, Springer-Verlag, pp. 415–426 (1987)

  23. Thomas, W.: Languages, automata, and logic. In: Rozenberg, G., Salomaa, A. (eds.) Handbook of Formal Languages, vol. 3, chapter 7. Springer (1997)

  24. Wang, J.T.L., Shapiro, B.A., Shasha, D. (eds.): Pattern Discovery in Biomolecular Data. Oxford University Press, Oxford (1999)

Acknowledgements

We would like to thank the anonymous referees for their careful and helpful comments in improving our paper. We also thank Frank Neven for suggesting the connection to locally testable languages, and Jean-Eric Pin for his encouragement and help in proving Theorem 6.

Author information

Corresponding author

Correspondence to Tony Tan.

Additional information

The second author acknowledges the generous financial support from Taiwan Ministry of Science and Technology under Grant No. 107-2221-E-002-026-MY2.

Cite this article

Engels, S., Tan, T. & Van den Bussche, J. Subsequence versus substring constraints in sequence pattern languages. Acta Informatica 58, 35–56 (2021). https://doi.org/10.1007/s00236-019-00347-5
