Filtering Bio-sequence Based on Sequence Descriptor

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 3916)

Abstract

Similarity searching in biological sequence databases has received substantial attention in the past decade, especially since the sequencing of the human genome. As database sizes continue to grow, fast similarity search is becoming an important issue. Transforming sequences into numerical vectors, called sequence descriptors, and storing them in a multidimensional data structure is a promising method for indexing bio-sequences. In this paper, we present an effective sequence transformation method called SD (Sequence Descriptor), which uses multiple features of a sequence, including Count, RPD (Relative Position Dispersion), and APD (Absolute Position Dispersion), to represent the original sequence data. In contrast to the q-gram transformation method, SD avoids the problem of exponentially growing vector size. We also present a transformation called ST (Segment Transformation), which recursively divides sequence data into equal-length subsequences, transforms each subsequence, and concatenates the results. Experiments on human genome data show that our transformation method is more effective than the q-gram transformation method.
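To make the contrast with q-grams concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a DNA alphabet of A, C, G, T, uses per-symbol counts as a stand-in for the Count feature only (the abstract does not define RPD or APD, so they are omitted), and assumes that the segment transformation halves the sequence at each level. The function names `qgram_vector`, `count_descriptor`, and `segment_transform` are hypothetical.

```python
from itertools import product

ALPHABET = "ACGT"  # assumed DNA alphabet; not specified in the abstract


def qgram_vector(seq, q):
    """Count every possible q-gram; the vector has len(ALPHABET) ** q entries."""
    index = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=q))}
    vec = [0] * len(index)
    for i in range(len(seq) - q + 1):
        gram = seq[i:i + q]
        if gram in index:  # skip q-grams containing symbols outside ALPHABET
            vec[index[gram]] += 1
    return vec


def count_descriptor(seq):
    """Per-symbol counts: a stand-in for the Count feature only (assumption)."""
    return [seq.count(c) for c in ALPHABET]


def segment_transform(seq, depth, describe=count_descriptor):
    """Halve the sequence `depth` times (assumed split rule), describe each
    leaf segment, and concatenate the leaf vectors from left to right."""
    if depth == 0:
        return describe(seq)
    mid = len(seq) // 2
    left = segment_transform(seq[:mid], depth - 1, describe)
    right = segment_transform(seq[mid:], depth - 1, describe)
    return left + right


if __name__ == "__main__":
    s = "AACCGGTT"
    print(len(qgram_vector(s, 3)))   # q-gram vector size: 4 ** 3
    print(segment_transform(s, 1))   # two segments, four counts each
```

In this sketch the q-gram vector grows as 4^q, whereas the segmented count vector grows as 4 · 2^depth, so its size is governed by the number of segments rather than by the q-gram length.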



Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Hsieh, TW., Kuo, HC., Huang, JP. (2006). Filtering Bio-sequence Based on Sequence Descriptor. In: Li, J., Yang, Q., Tan, AH. (eds) Data Mining for Biomedical Applications. BioDM 2006. Lecture Notes in Computer Science, vol 3916. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11691730_3

  • DOI: https://doi.org/10.1007/11691730_3

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-33104-9

  • Online ISBN: 978-3-540-33105-6

  • eBook Packages: Computer Science, Computer Science (R0)
