Abstract
We consider the problem of learning unions of certain pattern languages from positive examples. We restrict attention to the regular patterns, i.e., patterns in which each variable symbol can appear only once, and to the substring patterns, a subclass of the regular patterns of the form xαy, where x and y are variables and α is a string of constant symbols. We present an algorithm that, given a set of strings, finds a good collection of patterns covering this set. A ‘good covering’ is defined as the most probable collection of patterns to be present in the examples, assuming a simple probabilistic model, or equivalently as the collection selected by the Minimum Description Length (MDL) principle. Our algorithm is shown to approximate the optimal cover within a logarithmic factor. This extends a similar recent result for the so-called simple patterns. For substring patterns the running time of the algorithm is O(nN), where n is the number of sequences and N is their total length.
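To make the covering task concrete, here is a minimal Python sketch for substring patterns, under several assumptions that are not taken from the paper. It uses the standard greedy weighted set-cover heuristic, which is one well-known way to obtain a logarithmic approximation factor; the abstract does not say that the authors' algorithm has exactly this form. The names `candidate_substrings`, `toy_cost`, and `greedy_cover` are hypothetical, `toy_cost` is an arbitrary stand-in rather than the MDL cost, and the naive enumeration of substrings does not achieve the O(nN) running time quoted above.

```python
from typing import Callable, Dict, List, Set


def candidate_substrings(strings: List[str]) -> Dict[str, Set[int]]:
    """Map each non-empty substring alpha to the indices of the strings containing it.

    Assumption: the substring pattern x alpha y matches a string exactly when
    alpha occurs in it (variables may be replaced by the empty string).
    """
    occ: Dict[str, Set[int]] = {}
    for idx, s in enumerate(strings):
        for i in range(len(s)):
            for j in range(i + 1, len(s) + 1):
                occ.setdefault(s[i:j], set()).add(idx)
    return occ


def toy_cost(alpha: str) -> float:
    """Hypothetical cost of a pattern; NOT the paper's description-length formula.

    Longer constant strings are treated as slightly cheaper, purely for illustration.
    """
    return 1.0 + 1.0 / len(alpha)


def greedy_cover(
    strings: List[str], cost: Callable[[str], float] = toy_cost
) -> List[str]:
    """Greedy weighted set cover: repeatedly pick the pattern with the lowest
    cost per newly covered string until every non-empty string is covered."""
    occ = candidate_substrings(strings)
    # Empty strings contain no non-empty alpha, so they are left out of the cover.
    uncovered = {i for i, s in enumerate(strings) if s}
    chosen: List[str] = []
    while uncovered:
        best = min(
            (a for a in occ if occ[a] & uncovered),
            # Ties are broken alphabetically so the result is deterministic.
            key=lambda a: (cost(a) / len(occ[a] & uncovered), a),
        )
        chosen.append(best)
        uncovered -= occ[best]
    return chosen


if __name__ == "__main__":
    examples = ["abcde", "xabcy", "zzabc", "qrs", "rsq"]
    # Each chosen alpha stands for the substring pattern x alpha y.
    print(greedy_cover(examples))
```

With the toy cost, the five example strings are covered by the two constants `abc` and `rs`, i.e., by the patterns x abc y and x rs y. A different cost function can be passed as `cost` to experiment with other trade-offs between pattern specificity and the number of patterns.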
© 1996 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Brāzma, A., Ukkonen, E., Vilo, J. (1996). Discovering unbounded unions of regular pattern languages from positive examples. In: Asano, T., Igarashi, Y., Nagamochi, H., Miyano, S., Suri, S. (eds) Algorithms and Computation. ISAAC 1996. Lecture Notes in Computer Science, vol 1178. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0009485
DOI: https://doi.org/10.1007/BFb0009485
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-62048-8
Online ISBN: 978-3-540-49633-5
eBook Packages: Springer Book Archive