Best Fitting Fixed-Length Substring Patterns for a Set of Strings

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3595)

Abstract

Finding a pattern, or a set of patterns, that best characterizes a set of strings is considered important in the context of Knowledge Discovery as applied in Molecular Biology. Our main objective is to address the problem of “over-generalization”, the phenomenon in which a characterization is so general that it potentially includes many incorrect examples. To overcome this, we formally define a criterion for a most fitting language for a set of strings via a natural notion of density. We show how the problem can be solved by solving the membership problem and the counting problem, and we study the runtime complexity of the problem with respect to three solution spaces derived from unions of the languages generated from fixed-length substring patterns. For two of these, we show that the problem is solvable in time polynomial in the input size. In the third case, however, the problem turns out to be NP-complete.
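The abstract does not state the formal definitions, so the following Python sketch only illustrates the two subproblems it names, under assumptions that may differ from the paper. It assumes a fixed-length substring pattern is a plain string `w` and that its language is the set of strings containing `w` as a contiguous substring. The function names and the ratio computed by `illustrative_density` are hypothetical stand-ins, not the authors' definition of density.

```python
def contains(pattern: str, s: str) -> bool:
    """Membership: does s contain pattern as a contiguous substring?"""
    return pattern in s


def count_containing(pattern: str, alphabet: str, n: int) -> int:
    """Counting: number of length-n strings over alphabet containing pattern.

    Counts the strings that avoid the pattern with a dynamic program over
    the states of the KMP matching automaton, then subtracts from |alphabet|^n.
    Assumes the characters of alphabet are distinct.
    """
    m = len(pattern)
    if m == 0:
        return len(alphabet) ** n

    # KMP failure function: fail[i] is the length of the longest proper
    # prefix of pattern[:i + 1] that is also a suffix of it.
    fail = [0] * m
    k = 0
    for i in range(1, m):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k

    def step(state: int, ch: str) -> int:
        # Next automaton state after reading ch with `state` characters matched.
        while state and pattern[state] != ch:
            state = fail[state - 1]
        return state + 1 if pattern[state] == ch else state

    # avoid[j]: number of strings read so far with no full match and j
    # pattern characters currently matched.
    avoid = [0] * m
    avoid[0] = 1
    for _ in range(n):
        nxt = [0] * m
        for state, ways in enumerate(avoid):
            if ways:
                for ch in alphabet:
                    t = step(state, ch)
                    if t < m:  # drop transitions that complete a match
                        nxt[t] += ways
        avoid = nxt
    return len(alphabet) ** n - sum(avoid)


def illustrative_density(pattern: str, alphabet: str, sample: list[str]) -> float:
    """Hypothetical ratio (an assumption, not the paper's definition):
    distinct sample strings covered by the pattern, divided by the number of
    strings in the pattern's language that have one of the sample's lengths."""
    covered = {s for s in sample if contains(pattern, s)}
    lengths = {len(s) for s in sample}
    total = sum(count_containing(pattern, alphabet, n) for n in lengths)
    return len(covered) / total if total else 0.0
```

With the sample `["aab", "abb"]` over the alphabet `"ab"`, the pattern `"ab"` covers both strings and matches 4 of the 8 strings of length 3, giving a ratio of 0.5. The shorter pattern `"a"` also covers both but matches 7 strings, giving 2/7. Under this assumed measure the more general pattern scores lower, which is the kind of over-generalization the abstract describes.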

This work was partially supported by the Scientific Grant in Aid of the Ministry of Education, Science, Sports, Culture and Technology of Japan.

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ono, H., Ng, Y.K. (2005). Best Fitting Fixed-Length Substring Patterns for a Set of Strings. In: Wang, L. (ed.) Computing and Combinatorics. COCOON 2005. Lecture Notes in Computer Science, vol 3595. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11533719_26

  • DOI: https://doi.org/10.1007/11533719_26

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-28061-3

  • Online ISBN: 978-3-540-31806-4

  • eBook Packages: Computer Science, Computer Science (R0)
