Abstract
The pattern matching problem is to find all occurrences of a given pattern in an input text. We consider the case where the pattern is a stochastic regular language, in which each pattern string has its own probability. The problem is then to find all matches, reported as (start, end) index pairs in the text, whose probability is larger than a given threshold probability. Pattern matching is frequently applied to streaming data in several applications, where finding the start index of a match is often very challenging. We design an efficient algorithm for the stochastic pattern matching problem over streaming data, based on transforming the pattern PFA (probabilistic finite automaton) into a weighted automaton and on a constant bound on the number of backtracks required to find a start index while reading the streaming input. We also employ heuristics that reduce the number of backtracks, which improves the practical runtime of our algorithm. We establish the tight theoretical runtime of the proposed algorithm and experimentally demonstrate its practical performance. Finally, we show a possible application of our algorithm to another stochastic pattern matching problem, in which we search for the maximum-probability substring of a text that is a superstring of a specified string.
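To make the problem statement concrete, the following is a minimal brute-force sketch in Python. It is not the algorithm of the paper: it uses neither the weighted-automaton transformation nor the backtracking bound, and it is not a streaming procedure. It only illustrates what it means to report every (start, end) pair whose substring probability under a PFA exceeds a threshold. The function name `naive_matches`, the dictionary encoding of the PFA, the half-open end index, and the toy automaton (which assigns probability 0.5^(n+1) to the string a^n b) are all assumptions made for illustration.

```python
# Hypothetical toy PFA (an assumption for illustration, not from the paper).
# State 0 is initial; state 1 is final. The string a^n b has probability 0.5 ** (n + 1).
init = {0: 1.0}                      # initial probability of each state
final = {1: 1.0}                     # final (stopping) probability of each state
trans = {                            # (state, symbol) -> list of (next_state, probability)
    (0, "a"): [(0, 0.5)],
    (0, "b"): [(1, 0.5)],
}


def naive_matches(text, init, trans, final, threshold):
    """Return all (start, end) pairs, end exclusive, such that the PFA
    assigns text[start:end] a probability strictly above `threshold`.

    Brute force: one forward pass per start index.
    """
    matches = []
    for start in range(len(text)):
        alpha = dict(init)           # forward weights after reading text[start:end]
        for end in range(start, len(text)):
            nxt = {}
            for q, w in alpha.items():
                for r, p in trans.get((q, text[end]), []):
                    nxt[r] = nxt.get(r, 0.0) + w * p
            alpha = nxt
            if not alpha:            # no live state: no longer substring can match
                break
            prob = sum(w * final.get(q, 0.0) for q, w in alpha.items())
            if prob > threshold:
                matches.append((start, end + 1))
    return matches


if __name__ == "__main__":
    print(naive_matches("aabab", init, trans, final, 0.2))
```

This baseline restarts a forward pass at every start index, so it re-reads the text many times. That repeated work is what makes start indices hard to recover when the text arrives as a stream.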
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Cognetta, M., Han, YS. (2018). Online Stochastic Pattern Matching. In: Câmpeanu, C. (eds) Implementation and Application of Automata. CIAA 2018. Lecture Notes in Computer Science, vol 10977. Springer, Cham. https://doi.org/10.1007/978-3-319-94812-6_11
DOI: https://doi.org/10.1007/978-3-319-94812-6_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-94811-9
Online ISBN: 978-3-319-94812-6
eBook Packages: Computer Science, Computer Science (R0)