Best Fitting Fixed-Length Substring Patterns for a Set of Strings

Ono, Hirotaka; Ng, Yen Kaow

doi:10.1007/11533719_26


Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3595)

Abstract

Finding a pattern, or a set of patterns, that best characterizes a set of strings is considered important in the context of Knowledge Discovery as applied in Molecular Biology. Our main objective is to address the problem of “over-generalization”, the phenomenon in which a characterization is so general that it potentially includes many incorrect examples. To overcome this, we formally define a criterion for a most fitting language for a set of strings, via a natural notion of density. We show how the problem can be solved by solving the membership problem and the counting problem, and we study the runtime complexity of the problem with respect to three solution spaces derived from unions of the languages generated from fixed-length substring patterns. Two of these we show to be solvable in time polynomial in the input size. In the third case, however, the problem turns out to be NP-complete.
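The abstract does not spell out its definitions, so the following is only a minimal sketch of how membership and counting might combine into a density-style score. It assumes that a fixed-length substring pattern matches any string containing it as a substring, that all patterns share one length k, and that all sample strings share one length n. The names `is_member`, `count_members`, and `density`, and the particular ratio used for density, are hypothetical illustrations rather than the authors' formulation.

```python
from itertools import product


def is_member(s, patterns):
    """Membership: does s contain at least one pattern as a substring?"""
    return any(p in s for p in patterns)


def count_members(n, patterns, alphabet):
    """Count strings of length n over `alphabet` that contain at least one
    pattern. Assumes every pattern has the same fixed length k."""
    patterns = set(patterns)
    if not patterns:
        return 0
    k = len(next(iter(patterns)))
    if any(len(p) != k for p in patterns):
        raise ValueError("all patterns must have the same length")
    if n < k:
        return 0
    # avoid[w]: number of pattern-free strings whose last k-1 symbols are w.
    avoid = {"".join(t): 1 for t in product(alphabet, repeat=k - 1)}
    for _ in range(n - k + 1):
        nxt = {}
        for w, c in avoid.items():
            for a in alphabet:
                if w + a in patterns:
                    continue  # appending a would complete a pattern
                key = (w + a)[1:]
                nxt[key] = nxt.get(key, 0) + c
        avoid = nxt
    # Strings containing a pattern = all strings minus pattern-free strings.
    return len(alphabet) ** n - sum(avoid.values())


def density(sample, patterns, alphabet):
    """Assumed density: distinct covered sample strings divided by the number
    of length-n strings in the language (an illustrative choice only)."""
    n = len(sample[0])
    covered = len({s for s in sample if is_member(s, patterns)})
    total = count_members(n, patterns, alphabet)
    return covered / total if total else 0.0
```

Under these assumptions, a pattern set that covers the same sample strings with a smaller language scores higher, which is one way to penalize over-generalization. The counting step runs in time proportional to n times the number of length-(k-1) windows times the alphabet size, so it is only practical for small k.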

This work was partially supported by the Scientific Grant in Aid of the Ministry of Education, Science, Sports, Culture and Technology of Japan.

Author information

Authors and Affiliations

Department of Computer Science and Communication Engineering, Kyushu University, 6-10-1, Hakozaki, Fukuoka, 812-8581, Japan
Hirotaka Ono
Graduate School of Computer Science and Systems Engineering, Kyushu Institute of Technology, Iizuka, 820, Japan
Yen Kaow Ng

Editor information

Editors and Affiliations

Department of Computer Science, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong
Lusheng Wang

About this paper

Cite this paper

Ono, H., Ng, Y.K. (2005). Best Fitting Fixed-Length Substring Patterns for a Set of Strings. In: Wang, L. (eds) Computing and Combinatorics. COCOON 2005. Lecture Notes in Computer Science, vol 3595. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11533719_26

DOI: https://doi.org/10.1007/11533719_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-28061-3
Online ISBN: 978-3-540-31806-4
eBook Packages: Computer Science, Computer Science (R0)
