Filtering Bio-sequence Based on Sequence Descriptor

Hsieh, Te-Wen; Kuo, Huang-Cheng; Huang, Jen-Peng

doi:10.1007/11691730_3

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 3916)

Abstract

Similarity searching in biological sequence databases has received substantial attention over the past decade, especially since the sequencing of the human genome. As databases continue to grow, fast similarity search is becoming an important issue. A promising approach to indexing bio-sequences is to transform sequences into numerical vectors, called sequence descriptors, that can be stored in a multidimensional data structure. In this paper, we present an effective sequence transformation method, called SD (Sequence Descriptor), which uses multiple features of a sequence, including Count, RPD (Relative Position Dispersion), and APD (Absolute Position Dispersion), to represent the original sequence data. In contrast to the q-gram transformation method, SD avoids the problem of exponentially growing vector size. We also present a transformation, called ST (Segment Transformation), which recursively divides sequence data into equal-length subsequences, transforms each subsequence, and concatenates the results. Experiments on human genome data show that our transformation method is more effective than the q-gram transformation method.
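The contrast between a q-gram vector and a per-symbol descriptor can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not define RPD or APD, so only the Count feature is shown. The function names, the DNA alphabet `ACGT`, the choice to split each segment in half, and the choice to concatenate only the leaf segments are all assumptions made for illustration.

```python
from itertools import product

ALPHABET = "ACGT"  # assumed DNA alphabet; not specified in the abstract


def qgram_vector(seq, q):
    """Frequency vector over all len(ALPHABET)**q possible q-grams."""
    index = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=q))}
    vec = [0] * len(index)
    for i in range(len(seq) - q + 1):
        gram = seq[i:i + q]
        if gram in index:  # ignore q-grams with symbols outside ALPHABET
            vec[index[gram]] += 1
    return vec


def count_descriptor(seq):
    """Count feature only: one entry per alphabet symbol (hypothetical helper)."""
    return [seq.count(c) for c in ALPHABET]


def segment_transform(seq, depth, transform=count_descriptor):
    """Assumed reading of ST: halve the sequence `depth` times,
    transform each leaf segment, and concatenate the results."""
    if depth == 0 or len(seq) < 2:
        return transform(seq)
    mid = len(seq) // 2
    left = segment_transform(seq[:mid], depth - 1, transform)
    right = segment_transform(seq[mid:], depth - 1, transform)
    return left + right


if __name__ == "__main__":
    for q in (1, 2, 3, 4):
        print(q, len(qgram_vector("AACCGGTT", q)))  # dimension is 4**q
    print(count_descriptor("AACCGGTT"), count_descriptor("TTGGCCAA"))
    print(segment_transform("AACCGGTT", 1), segment_transform("TTGGCCAA", 1))
```

In this sketch the q-gram vector has 4^q entries, while the Count vector keeps one entry per symbol regardless of q. The sequences `AACCGGTT` and `TTGGCCAA` have identical global counts, but their segment-wise vectors differ, which suggests why concatenating per-segment descriptors can retain positional information that a single global count loses.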

Author information

Authors and Affiliations

Department of Computer Science and Information Engineering, National Chiayi University, Taiwan
Te-Wen Hsieh & Huang-Cheng Kuo
Department of Information Management, Southern Taiwan University of Technology, Taiwan
Jen-Peng Huang

Editor information

Editors and Affiliations

School of Computer Engineering, Nanyang Technological University, 639798, Singapore
Jinyan Li
The Hong Kong University of Science and Technology, Hong Kong, China
Qiang Yang
Intelligent Systems Centre and School of Computer Engineering, Nanyang Technological University, Nanyang Avenue, 639798, Singapore
Ah-Hwee Tan

About this paper

Cite this paper

Hsieh, TW., Kuo, HC., Huang, JP. (2006). Filtering Bio-sequence Based on Sequence Descriptor. In: Li, J., Yang, Q., Tan, AH. (eds) Data Mining for Biomedical Applications. BioDM 2006. Lecture Notes in Computer Science, vol 3916. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11691730_3

DOI: https://doi.org/10.1007/11691730_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-33104-9
Online ISBN: 978-3-540-33105-6
eBook Packages: Computer Science, Computer Science (R0)
