Abstract
We present an efficient algorithm for finding approximate repetitions in a given sequence of characters. First, we define a class of simple regular expressions, which have star-height one and contain no union operations, and a stochastic mutation process of a given length over a string of characters. Then, assuming that a given string was obtained by applying this mutation process to some sufficiently long word generated by a simple regular expression, we try to restore the expression. We prove that, to within some reasonable accuracy, this is always possible if the length of the mutation process is bounded relative to the length of the example. We provide an algorithm that restores the expression in time linear in the length of the example and no worse than quadratic in the length of the expression. We discuss some extensions of the method and possible applications to bioinformatics.
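To make the setting concrete, the following minimal Python sketch builds a word from a simple regular expression and then corrupts it. It is an illustration only, not the paper's algorithm: the list-of-pairs representation, the function names generate_word and mutate, and the noise model (a fixed number of random single-character insertions, deletions, and substitutions) are all assumptions, since the abstract does not give the formal definitions.

```python
import random

# Hypothetical representation (an assumption, not the paper's notation):
# a simple regular expression is a list of (block, starred) pairs, e.g.
# [("ab", False), ("cd", True), ("e", False)] stands for ab(cd)*e.
# With star-height one and no union, each starred block is a plain string.

def generate_word(expr, repeats):
    """Unroll every starred block `repeats` times and concatenate."""
    return "".join(block * repeats if starred else block
                   for block, starred in expr)

def mutate(word, length, alphabet, rng):
    """Apply `length` random single-character edits (an assumed noise model)."""
    chars = list(word)
    for _ in range(length):
        op = rng.choice(("insert", "delete", "substitute"))
        if op == "insert" or not chars:
            # Insert at any position, including the end of the string.
            chars.insert(rng.randrange(len(chars) + 1), rng.choice(alphabet))
        elif op == "delete":
            del chars[rng.randrange(len(chars))]
        else:
            chars[rng.randrange(len(chars))] = rng.choice(alphabet)
    return "".join(chars)

expr = [("ab", False), ("cd", True), ("e", False)]
clean = generate_word(expr, repeats=5)   # the uncorrupted example word
noisy = mutate(clean, length=2, alphabet="abcde", rng=random.Random(0))
```

Under these assumptions, the learning task described in the abstract is to recover something close to expr from noisy alone, when the number of edits is small relative to the length of the word.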
Copyright information
© 1994 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Brāzma, A. (1994). Efficient algorithm for learning simple regular expressions from noisy examples. In: Arikawa, S., Jantke, K.P. (eds) Algorithmic Learning Theory. AII ALT 1994 1994. Lecture Notes in Computer Science, vol 872. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-58520-6_69
DOI: https://doi.org/10.1007/3-540-58520-6_69
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-58520-6
Online ISBN: 978-3-540-49030-2
eBook Packages: Springer Book Archive