Abstract
Similarity searching in biological sequence databases has received substantial attention over the past decade, especially since the sequencing of the human genome. As database sizes continue to grow, fast similarity search is becoming an important issue. Transforming sequences into numerical vectors, called sequence descriptors, and storing them in a multidimensional data structure is a promising method for indexing bio-sequences. In this paper, we present an effective sequence transformation method, called SD (Sequence Descriptor), which uses multiple features of a sequence, including Count, RPD (Relative Position Dispersion), and APD (Absolute Position Dispersion), to represent the original sequence data. In contrast to the q-gram transformation method, SD avoids the problem of exponentially growing vector size. We also present a transformation, called ST (Segment Transformation), which recursively divides sequence data into equal-length subsequences, transforms each subsequence, and concatenates the resulting vectors. Experiments on human genome data show that our transformation method is more effective than the q-gram transformation method.
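The contrast in vector size can be illustrated with a minimal Python sketch. The abstract does not define Count, RPD, or APD, so the formulas below are assumptions chosen only for illustration: the standard deviation of gaps between consecutive occurrences stands in for RPD, and the standard deviation of absolute positions stands in for APD. The function names, the DNA alphabet, and the halving scheme used for the segment step are likewise hypothetical and are not taken from the paper.

```python
from itertools import product
from statistics import pstdev

ALPHABET = "ACGT"  # assumption: DNA sequences over a four-letter alphabet


def qgram_vector(seq, q):
    """Frequency vector over all len(ALPHABET)**q possible q-grams."""
    index = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=q))}
    vec = [0] * len(index)
    for i in range(len(seq) - q + 1):
        gram = seq[i:i + q]
        if gram in index:  # skip q-grams containing symbols outside ALPHABET
            vec[index[gram]] += 1
    return vec


def descriptor(seq):
    """Hypothetical descriptor: three features per alphabet symbol.

    count          -> number of occurrences of the symbol
    gap dispersion -> stand-in for RPD (assumed definition)
    pos dispersion -> stand-in for APD (assumed definition)
    """
    vec = []
    for sym in ALPHABET:
        pos = [i for i, c in enumerate(seq) if c == sym]
        gaps = [b - a for a, b in zip(pos, pos[1:])]
        vec.append(float(len(pos)))
        vec.append(pstdev(gaps) if gaps else 0.0)
        vec.append(pstdev(pos) if pos else 0.0)
    return vec


def segment_transform(seq, depth):
    """Assumed segment step: halve recursively, transform, concatenate."""
    if depth == 0:
        return descriptor(seq)
    mid = len(seq) // 2
    return segment_transform(seq[:mid], depth - 1) + segment_transform(seq[mid:], depth - 1)


if __name__ == "__main__":
    s = "ACGTACGTTTGACCAA"
    for q in (2, 4, 6):
        print("q-gram vector size for q =", q, "is", len(qgram_vector(s, q)))
    print("descriptor size:", len(descriptor(s)))
    print("segment transform size at depth 2:", len(segment_transform(s, 2)))
```

Under these assumptions the q-gram vector has one entry per possible q-gram, so its size is the alphabet size raised to the power q, while the descriptor keeps three values per symbol regardless of q. The segmented vector grows only with the number of segments (a factor of two per level in this sketch).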
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Hsieh, TW., Kuo, HC., Huang, JP. (2006). Filtering Bio-sequence Based on Sequence Descriptor. In: Li, J., Yang, Q., Tan, AH. (eds) Data Mining for Biomedical Applications. BioDM 2006. Lecture Notes in Computer Science, vol 3916. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11691730_3
DOI: https://doi.org/10.1007/11691730_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-33104-9
Online ISBN: 978-3-540-33105-6
eBook Packages: Computer Science (R0)