Discovering Best Variable-Length-Don’t-Care Patterns

Inenaga, Shunsuke; Bannai, Hideo; Shinohara, Ayumi; Takeda, Masayuki; Arikawa, Setsuo

doi:10.1007/3-540-36182-0_10

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2534)

Included in the following conference series:

International Conference on Discovery Science

Abstract

A variable-length-don’t-care pattern (VLDC pattern) is an element of the set Π = (Σ ∪ {⋆})^*, where Σ is an alphabet and ⋆ is a wildcard that matches any string in Σ^*. Given two sets of strings, we consider the problem of finding the VLDC pattern that is the most common to one set and the least common to the other. We present a practical algorithm that finds such best VLDC patterns exactly and is sped up considerably by pruning heuristics. We introduce two versions of the algorithm: one employs a pattern matching machine (PMM), whereas the other uses an index structure called the Wildcard Directed Acyclic Word Graph (WDAWG). In addition, we consider the more general problem of finding the best pair ⟨q, k⟩, where k is the window size that specifies the length of an occurrence of the VLDC pattern q matching a string ω. We present three algorithms that solve this problem with pruning heuristics, using dynamic programming (DP), PMMs, and WDAWGs, respectively. Although both problems are NP-hard, we show experimentally that our algorithms run remarkably fast.
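
To make the definitions in the abstract concrete, the following Python sketch matches a VLDC pattern against a string, checks a windowed pair ⟨q, k⟩, and searches for a best pattern by brute force. It is an illustration only, not the paper's PMM or WDAWG algorithm, and it uses no pruning. Several details are assumptions that the abstract does not fix: a pattern must match the whole string (so a substring-style pattern is written with a leading and trailing ⋆), the window condition is read as "some substring of length at most k is matched by q", and the score is a simple difference of match counts standing in for whatever score function the paper uses. All function names are hypothetical.

```python
from itertools import product

WILDCARD = "⋆"  # matches any (possibly empty) string over the alphabet


def vldc_matches(pattern: str, text: str) -> bool:
    """Return True if `pattern` matches the whole of `text` (assumed semantics)."""
    n = len(text)
    # reach[j] is True when the pattern symbols read so far can match text[:j].
    reach = [True] + [False] * n
    for symbol in pattern:
        if symbol == WILDCARD:
            # The wildcard absorbs any string, so reachability spreads rightwards.
            seen = False
            for j in range(n + 1):
                seen = seen or reach[j]
                reach[j] = seen
        else:
            # An ordinary symbol consumes exactly one equal character.
            nxt = [False] * (n + 1)
            for j in range(n):
                if reach[j] and text[j] == symbol:
                    nxt[j + 1] = True
            reach = nxt
    return reach[n]


def vldc_matches_in_window(pattern: str, text: str, k: int) -> bool:
    """Assumed reading of <q, k>: some substring of length <= k is matched by q."""
    n = len(text)
    return any(
        vldc_matches(pattern, text[i:j])
        for i in range(n + 1)
        for j in range(i, min(n, i + k) + 1)
    )


def score(pattern: str, positives: list[str], negatives: list[str]) -> int:
    """Placeholder score: matches in one set minus matches in the other."""
    hits_pos = sum(vldc_matches(pattern, s) for s in positives)
    hits_neg = sum(vldc_matches(pattern, s) for s in negatives)
    return hits_pos - hits_neg


def best_pattern(positives: list[str], negatives: list[str],
                 alphabet: str, max_len: int) -> str:
    """Exhaustively try every pattern of length 1..max_len (no pruning)."""
    candidates = (
        "".join(symbols)
        for length in range(1, max_len + 1)
        for symbols in product(alphabet + WILDCARD, repeat=length)
    )
    return max(candidates, key=lambda q: score(q, positives, negatives))
```

For example, `vldc_matches("⋆ab⋆c", "xxabyyc")` holds because the wildcards can absorb `xx`, `yy`, and the empty string around the fixed symbols. The exhaustive search grows exponentially with `max_len`, which is the cost that the pruning heuristics mentioned in the abstract are meant to reduce.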

Author information

Authors and Affiliations

Department of Informatics, Kyushu University, 33, Fukuoka, 812-8581, Japan
Shunsuke Inenaga, Ayumi Shinohara, Masayuki Takeda & Setsuo Arikawa
PRESTO, Japan Science and Technology Corporation (JST), Japan
Ayumi Shinohara & Masayuki Takeda
Human Genome Center, University of Tokyo, 108-8639, Tokyo, Japan
Hideo Bannai

Editor information

Editors and Affiliations

Deutsches Forschungszentrum für Künstliche Intelligenz, Stuhlsatzenhausweg 3, 66123, Saarbrücken, Germany
Steffen Lange
National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, 101-8430, Tokyo, Japan
Ken Satoh
Department of Computer Science, University of Maryland, College Park, 20742, Maryland, MD, USA
Carl H. Smith

Cite this paper

Inenaga, S., Bannai, H., Shinohara, A., Takeda, M., Arikawa, S. (2002). Discovering Best Variable-Length-Don’t-Care Patterns. In: Lange, S., Satoh, K., Smith, C.H. (eds) Discovery Science. DS 2002. Lecture Notes in Computer Science, vol 2534. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36182-0_10

DOI: https://doi.org/10.1007/3-540-36182-0_10
Published: 08 November 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-00188-1
Online ISBN: 978-3-540-36182-4
eBook Packages: Springer Book Archive
