A DNA Index Structure Using Frequency and Position Information of Genetic Alphabet

Kim, Woo-Cheol; Park, Sanghyun; Won, Jung-Im; Kim, Sang-Wook; Yoon, Jee-Hee

doi:10.1007/11430919_21

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3518)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

Exact match queries, wildcard match queries, and k-mismatch queries are widely used in molecular biology applications, including searches for ESTs (expressed sequence tags) and DNA transcription factors. In this paper, we propose an efficient indexing and query processing mechanism for such queries. Our indexing method places a sliding window at every possible position of a DNA sequence and extracts a signature from each window based on the occurrence frequency of each nucleotide. It then stores the resulting set of signatures in a multi-dimensional index such as the R*-tree. In addition, by assigning a weight to each position within a window, it prevents signatures from being concentrated around a few spots in the index space. Our query processing method converts a query sequence into a multi-dimensional rectangle and searches the index for the signatures that overlap the rectangle.
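The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: the position weights are simply 1, 2, ..., w; the wildcard symbol is `N`; the query has the same length as the window; and a linear scan over the signatures stands in for the R*-tree. The function names (`signature`, `build_index`, `query_rectangle`, `search`) are hypothetical.

```python
from typing import List, Tuple

ALPHABET = "ACGT"  # one signature dimension per nucleotide


def signature(window: str, weights: List[int]) -> Tuple[int, ...]:
    """Position-weighted occurrence count of each nucleotide in a window."""
    sig = [0] * len(ALPHABET)
    for ch, w in zip(window, weights):
        sig[ALPHABET.index(ch)] += w
    return tuple(sig)


def build_index(seq: str, weights: List[int]) -> List[Tuple[int, Tuple[int, ...]]]:
    """Slide a window over every position and record (position, signature).

    A plain list stands in for the multi-dimensional index (assumption).
    """
    w = len(weights)
    return [(pos, signature(seq[pos:pos + w], weights))
            for pos in range(len(seq) - w + 1)]


def query_rectangle(query: str, weights: List[int], wildcard: str = "N"):
    """Turn a query into a rectangle [low, high] in signature space.

    Fixed symbols contribute to the lower corner; the total weight of the
    wildcard positions may land in any dimension, so it widens every side.
    """
    low = [0] * len(ALPHABET)
    slack = 0
    for ch, w in zip(query, weights):
        if ch == wildcard:
            slack += w
        else:
            low[ALPHABET.index(ch)] += w
    high = [v + slack for v in low]
    return low, high


def search(seq: str, index, query: str, weights: List[int],
           wildcard: str = "N") -> List[int]:
    """Filter by rectangle containment, then verify each candidate window."""
    low, high = query_rectangle(query, weights, wildcard)
    hits = []
    for pos, sig in index:
        if all(l <= s <= h for l, s, h in zip(low, sig, high)):
            window = seq[pos:pos + len(query)]
            if all(q == wildcard or q == c for q, c in zip(query, window)):
                hits.append(pos)
    return hits


weights = [1, 2, 3, 4]            # assumed linear position weights
seq = "ACGTACGTTACG"
index = build_index(seq, weights)
exact_hits = search(seq, index, "ACGT", weights)
wildcard_hits = search(seq, index, "TNCG", weights)
```

With uniform weights, the windows `ACGT` and `TGCA` would both map to the signature (1, 1, 1, 1); with the assumed linear weights they map to (1, 2, 3, 4) and (4, 3, 2, 1), which shows how position weighting spreads signatures across the index space. The final verification step is needed because different windows can still share a signature or fall inside the same rectangle.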

This work was supported by the Korea Research Foundation Grant (KRF-2004-003-D00302), the Basic Research Program Grant (Grant R04-2003-000-10048-0), and the IT Research Center via Cheju National University.

Author information

Authors and Affiliations

Department of Computer Science, Yonsei University, Korea
Woo-Cheol Kim, Sanghyun Park & Jung-Im Won
College of Information and Communications, Hanyang University, Korea
Sang-Wook Kim
Division of Information Engineering and Telecommunications, Hallym University, Korea
Jee-Hee Yoon

Editor information

Editors and Affiliations

Japan Advanced Institute of Science and Technology, Asahidai 1-1, 923-12292, Nomi, Japan
Tu Bao Ho
University of Hong Kong, Pokfulam Road, Hong Kong, China
David Cheung
Department of Computer Science and Engineering, Arizona State University, Tempe, Arizona, USA
Huan Liu

About this paper

Cite this paper

Kim, WC., Park, S., Won, JI., Kim, SW., Yoon, JH. (2005). A DNA Index Structure Using Frequency and Position Information of Genetic Alphabet. In: Ho, T.B., Cheung, D., Liu, H. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2005. Lecture Notes in Computer Science, vol 3518. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11430919_21

DOI: https://doi.org/10.1007/11430919_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-26076-9
Online ISBN: 978-3-540-31935-1
eBook Packages: Computer Science, Computer Science (R0)
