Discovering unbounded unions of regular pattern languages from positive examples

Extended abstract

  • Session 3b: Invited Presentation
  • Conference paper
Algorithms and Computation (ISAAC 1996)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1178)

Abstract

We consider the problem of learning unions of certain pattern languages from positive examples. We restrict attention to regular patterns, i.e., patterns in which each variable symbol appears at most once, and to substring patterns, a subclass of regular patterns of the form xαy, where x and y are variables and α is a string of constant symbols. We present an algorithm that, given a set of strings, finds a good collection of patterns covering this set. A ‘good covering’ is defined as the most probable collection of patterns to be present in the examples under a simple probabilistic model or, equivalently, as the collection selected by the Minimum Description Length (MDL) principle. The algorithm is shown to approximate the optimal cover within a logarithmic factor, which extends a similar recent result for the so-called simple patterns. For substring patterns the running time of the algorithm is O(nN), where n is the number of sequences and N is their total length.

Editor information

Tetsuo Asano, Yoshihide Igarashi, Hiroshi Nagamochi, Satoru Miyano, Subhash Suri

Copyright information

© 1996 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Brāzma, A., Ukkonen, E., Vilo, J. (1996). Discovering unbounded unions of regular pattern languages from positive examples. In: Asano, T., Igarashi, Y., Nagamochi, H., Miyano, S., Suri, S. (eds) Algorithms and Computation. ISAAC 1996. Lecture Notes in Computer Science, vol 1178. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0009485

  • DOI: https://doi.org/10.1007/BFb0009485

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-62048-8

  • Online ISBN: 978-3-540-49633-5

  • eBook Packages: Springer Book Archive
