BFT: Bit Filtration Technique for Approximate String Join in Biological Databases

Aghili, S. Alireza; Agrawal, Divyakant; El Abbadi, Amr

doi:10.1007/978-3-540-39984-1_25

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2857)

Included in the following conference series:

International Symposium on String Processing and Information Retrieval

Abstract

Joining massive tables in relational databases has received substantial attention in the past decade. Numerous filtration and indexing techniques have been proposed to mitigate the curse of dimensionality. This paper maps the problem of pairwise whole-genome comparison onto an approximate join operation in the well-established relational database context. We propose a novel Bit Filtration Technique (BFT) based on vector transformation, and we further apply the DFT (Discrete Fourier Transformation) and DWT (Discrete Wavelet Transformation, Haar) dimensionality reduction techniques as a pre-processing filtration step that effectively reduces the search space and running time of the join operation. Our empirical results on a number of Prokaryote and Eukaryote DNA contig datasets demonstrate efficient filtration that prunes non-relevant portions of the database while incurring no false negatives, with running times up to 50 times faster than traditional dynamic programming and q-gram approaches. BFT may easily be incorporated as a pre-processing step for any of the well-known sequence search heuristics, such as BLAST, QUASAR and FastA, for the purpose of pairwise whole-genome comparison. We analyze the precision of applying BFT and other transformation-based dimensionality reduction techniques, and finally discuss the imposed trade-offs.

This research was supported by NSF grants EIA02-05675, EIA99-86057, EIA00-80134, and IIS02-09112.
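
The abstract describes a filter-then-verify approximate join: a cheap transformation prunes candidate pairs, incurring no false negatives, before an expensive dynamic-programming comparison. The sketch below illustrates that general pattern only. It is not the authors' BFT, whose bit-level details are not given in the abstract. As a stand-in filter it uses a character-frequency lower bound on edit distance, and the function names, the threshold `k`, and the toy sequences are assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def frequency_distance(u, v):
    """Lower bound on edit distance computed from character counts.

    A single insertion, deletion, or substitution changes the surplus
    of u over v by at most 1, and the surplus of v over u by at most 1,
    so max(surplus_u, surplus_v) never exceeds the true edit distance.
    """
    cu, cv = Counter(u), Counter(v)
    symbols = cu.keys() | cv.keys()
    surplus_u = sum(max(cu[c] - cv[c], 0) for c in symbols)
    surplus_v = sum(max(cv[c] - cu[c], 0) for c in symbols)
    return max(surplus_u, surplus_v)


def edit_distance(u, v):
    """Classic dynamic-programming edit distance (unit costs)."""
    prev = list(range(len(v) + 1))
    for i, a in enumerate(u, 1):
        cur = [i]
        for j, b in enumerate(v, 1):
            cur.append(min(prev[j] + 1,              # delete a
                           cur[j - 1] + 1,           # insert b
                           prev[j - 1] + (a != b)))  # match or substitute
        prev = cur
    return prev[-1]


def approximate_join(left, right, k):
    """Return pairs within edit distance k, plus the number of pairs
    that survived the filter and needed full verification."""
    matches, verified = [], 0
    for u, v in product(left, right):
        if frequency_distance(u, v) > k:
            continue  # safe to prune: the lower bound already exceeds k
        verified += 1
        if edit_distance(u, v) <= k:
            matches.append((u, v))
    return matches, verified


# Hypothetical toy data, not taken from the paper's datasets.
left = ["ACGTAC", "TTTTGG"]
right = ["ACGTTC", "GGGGCC", "TTTGGG"]
matches, verified = approximate_join(left, right, k=1)
print(matches, verified)
```

Because the filter is a lower bound on the true distance, every pair it discards is guaranteed to be a non-match, which is the sense in which a filter of this kind has no false negatives. On this toy input only two of the six candidate pairs reach the dynamic-programming step.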

Author information

Authors and Affiliations

Department of Computer Science, University of California-Santa Barbara, Santa Barbara, CA, 93106, USA
S. Alireza Aghili, Divyakant Agrawal & Amr El Abbadi

Editor information

Editors and Affiliations

Department of Computing Science, University of Alberta, Canada
Mario A. Nascimento
Universidade Federal do Amazonas, Manaus, AM, Brasil
Edleno S. de Moura
INESC-ID/IST, R. Alves Redol 9, 1000, Lisboa, Portugal
Arlindo L. Oliveira

About this paper

Cite this paper

Aghili, S.A., Agrawal, D., El Abbadi, A. (2003). BFT: Bit Filtration Technique for Approximate String Join in Biological Databases. In: Nascimento, M.A., de Moura, E.S., Oliveira, A.L. (eds) String Processing and Information Retrieval. SPIRE 2003. Lecture Notes in Computer Science, vol 2857. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-39984-1_25

DOI: https://doi.org/10.1007/978-3-540-39984-1_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-20177-9
Online ISBN: 978-3-540-39984-1
eBook Packages: Springer Book Archive
