Abstract
Finding a pattern, or a set of patterns that best characterizes a set of strings is considered important in the context of Knowledge Discovery as applied in Molecular Biology. Our main objective is to address the problem of “over-generalization”, which is the phenomenon that a characterization is so general that it potentially includes many incorrect examples. To overcome this we formally define a criteria for a most fitting language for a set of strings, via a natural notion of density. We show how the problem can be solved by solving the membership problem and counting problem, and we study the runtime complexities of the problem with respect to three solution spaces derived from unions of the languages generated from fixed-length substring patterns. Two of these we show to be solvable in time polynomial to the input size. In the third case, however, the problem turns out to be NP-complete.
This work was partially supported by the Scientific Grant in Aid of the Ministry of Education, Science, Sports, Culture and Technology of Japan.
This is a preview of subscription content, log in via an institution.
Buying options
Tax calculation will be finalised at checkout
Purchases are for personal use only
Learn about institutional subscriptionsPreview
Unable to display preview. Download preview PDF.
References
Angluin, D.: Finding patterns common to a set of strings. Journal of Computer and System Sciences 21, 46–62 (1980)
Arimura, H., Fujino, R., Shinohara, T., Arikawa, S.: Protein motif discovery from positive examples by Minimal Multiple Generalization over regular patterns. In: Proceedings of the Genome Informatics Workshop, pp. 39–48 (1994)
Arimura, H., Shinohara, T., Otsuki, S.: Finding minimal generalizations for unions of pattern languages and its application to inductive inference from positive data. In: Proc. Annual Symp. on Theoretical Aspects of Computer Sci. (1994)
Brazma, A., Jonassen, I., Eidhammer, I., Gilbert, D.: Approaches to the automatic discovery of patterns in biosequences. J. Comp. Biol. 5(2), 277–304 (1998)
Brāzma, A., Ukkonen, E., Vilo, J.: Discovering unbounded unions of regular pattern languages from positive examples. In: Nagamochi, H., Suri, S., Igarashi, Y., Miyano, S., Asano, T. (eds.) ISAAC 1996. LNCS, vol. 1178, Springer, Heidelberg (1996)
Brejova, B., Vinar, T., Li, M.: Pattern Discovery: Methods and Software, pp. 491–522. Humana Press, Totowa (2003)
Chan, C.-Y., Garofalakis, M., Rastogi, R.: RE-tree: an efficient index structure for regular expressions. The VLDB Journal 12(2), 102–119 (2003)
Garey, M., Johnson, D.: Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Co., New York (1979)
Kannan, S., Sweedyk, Z., Mahaney, S.: Counting and random generation of strings in regular languages. In: Proceedings of the sixth annual ACM-SIAM symposium on Discrete algorithms, pp. 551–557. Society for Industrial and Applied Mathematics (1995)
Sato, M., Mukouchi, Y., Zheng, D.: Characteristic sets for unions of regular pattern languages and compactness. In: Richter, M.M., Smith, C.H., Wiehagen, R., Zeugmann, T. (eds.) ALT 1998. LNCS (LNAI), vol. 1501, pp. 220–233. Springer, Heidelberg (1998)
Shinohara, T.: Polynomial time inference of extended regular pattern languages. In: Goto, E., Nakajima, R., Yonezawa, A., Nakata, I., Furukawa, K. (eds.) RIMS 1982. LNCS, vol. 147, pp. 115–127. Springer, Heidelberg (1983)
Uemura, J., Sato, M.: Compactness and learning of classes of unions of erasing regular pattern languages. In: Cesa-Bianchi, N., Numao, M., Reischuk, R. (eds.) ALT 2002. LNCS (LNAI), vol. 2533, pp. 293–307. Springer, Heidelberg (2002)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ono, H., Ng, Y.K. (2005). Best Fitting Fixed-Length Substring Patterns for a Set of Strings. In: Wang, L. (eds) Computing and Combinatorics. COCOON 2005. Lecture Notes in Computer Science, vol 3595. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11533719_26
Download citation
DOI: https://doi.org/10.1007/11533719_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-28061-3
Online ISBN: 978-3-540-31806-4
eBook Packages: Computer ScienceComputer Science (R0)