BFT: Bit Filtration Technique for Approximate String Join in Biological Databases

  • Conference paper
String Processing and Information Retrieval (SPIRE 2003)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2857)

Abstract

Joining massive tables in relational databases has received substantial attention in the past decade. Numerous filtration and indexing techniques have been proposed to mitigate the curse of dimensionality. This paper proposes a novel approach that maps the problem of pairwise whole-genome comparison onto an approximate join operation in the well-established relational database context. We propose a novel Bit Filtration Technique (BFT) based on vector transformation, and we further apply the DFT (Discrete Fourier Transformation) and DWT (Discrete Wavelet Transformation, Haar) dimensionality reduction techniques as a pre-processing filtration step that effectively reduces the search space and running time of the join operation. Our empirical results on a number of Prokaryote and Eukaryote DNA contig datasets demonstrate very efficient filtration that prunes non-relevant portions of the database while incurring no false negatives, with running times up to 50 times faster than traditional dynamic programming and q-gram approaches. BFT may easily be incorporated as a pre-processing step for any of the well-known sequence search heuristics, such as BLAST, QUASAR and FastA, for the purpose of pairwise whole-genome comparison. We analyze the precision of applying BFT and other transformation-based dimensionality reduction techniques, and finally discuss the imposed trade-offs.
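
The abstract does not spell out how BFT itself is constructed, so the following is only a minimal sketch of the general filter-then-verify idea behind an approximate string join with no false negatives. It is not the paper's method. As an assumption made purely for illustration, it substitutes a simple character-count vector for the vector transformation: a single edit operation changes the L1 distance between two count vectors by at most 2, so half of that distance (rounded up) is a lower bound on the edit distance, and any pair whose bound exceeds the threshold can be pruned safely. The names `edit_distance`, `freq_lower_bound` and `approximate_join` are hypothetical.

```python
from collections import Counter
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # match or substitution
        prev = curr
    return prev[-1]


def freq_lower_bound(a: str, b: str) -> int:
    """Lower bound on edit distance from character counts (illustrative only).

    One edit changes the L1 distance between the count vectors by at most 2,
    so ceil(L1 / 2) never exceeds the true edit distance.
    """
    ca, cb = Counter(a), Counter(b)
    l1 = sum(abs(ca[ch] - cb[ch]) for ch in set(ca) | set(cb))
    return (l1 + 1) // 2


def approximate_join(left, right, k):
    """Return (pairs within edit distance k, number of pairs verified)."""
    matches, verified = [], 0
    for a, b in product(left, right):
        if freq_lower_bound(a, b) > k:
            continue  # safely pruned: the true distance must exceed k
        verified += 1
        if edit_distance(a, b) <= k:
            matches.append((a, b))
    return matches, verified


if __name__ == "__main__":
    left = ["ACGTACGT", "TTTTTTTT"]
    right = ["ACGTACGA", "GGGGCCCC"]
    matches, verified = approximate_join(left, right, k=1)
    print(matches, verified)
```

In this toy run only one of the four candidate pairs survives the filter and reaches the dynamic-programming step. Because the filter is a true lower bound, it can discard pairs but never a real match, which is the property the abstract describes as incurring no false negatives.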

This research was supported by NSF grants EIA02-05675, EIA99-86057, EIA00-80134, and IIS02-09112.

Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Aghili, S.A., Agrawal, D., El Abbadi, A. (2003). BFT: Bit Filtration Technique for Approximate String Join in Biological Databases. In: Nascimento, M.A., de Moura, E.S., Oliveira, A.L. (eds) String Processing and Information Retrieval. SPIRE 2003. Lecture Notes in Computer Science, vol 2857. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-39984-1_25

  • DOI: https://doi.org/10.1007/978-3-540-39984-1_25

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-20177-9

  • Online ISBN: 978-3-540-39984-1

  • eBook Packages: Springer Book Archive
