Abstract
Joining massive tables in relational databases has received substantial attention in the past decade. Numerous filtration and indexing techniques have been proposed to mitigate the curse of dimensionality. This paper proposes a novel approach that maps the problem of pairwise whole-genome comparison onto an approximate join operation in the well-established relational database context. We propose a novel Bit Filtration Technique (BFT) based on vector transformation, and we further apply the DFT (Discrete Fourier Transformation) and DWT (Discrete Wavelet Transformation, Haar) dimensionality reduction techniques as a pre-processing filtration step that effectively reduces the search space and running time of the join operation. Our empirical results on a number of Prokaryote and Eukaryote DNA contig datasets demonstrate very efficient filtration that prunes non-relevant portions of the database while incurring no false negatives, and runs up to 50 times faster than traditional dynamic programming and q-gram approaches. BFT may easily be incorporated as a pre-processing step for any of the well-known sequence search heuristics, such as BLAST, QUASAR, and FastA, for the purpose of pairwise whole-genome comparison. We analyze the precision of BFT and of the other transformation-based dimensionality reduction techniques, and finally discuss the trade-offs they impose.
This research was supported by the NSF grants under EIA02-05675, EIA99-86057, EIA00-80134, and IIS02-09112.
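The abstract does not spell out how BFT itself works, so the following is only a rough sketch of the general idea of transformation-based filtration for an approximate string join, not the authors' method. It assumes each DNA string is mapped to a vector of A/C/G/T counts and uses the well-known fact that the frequency distance between two such vectors never exceeds the edit distance between the strings, so pruning on it cannot introduce false negatives. All names (`frequency_vector`, `frequency_distance`, `filtered_join`) and the threshold `k` are hypothetical.

```python
from collections import Counter

ALPHABET = "ACGT"  # assumed DNA alphabet; other symbols are ignored by the filter


def frequency_vector(s):
    """Map a string to its vector of per-symbol counts over ALPHABET."""
    counts = Counter(s)
    return tuple(counts.get(c, 0) for c in ALPHABET)


def frequency_distance(u, v):
    """Lower bound on edit distance: max of total surplus and total deficit."""
    surplus = sum(a - b for a, b in zip(u, v) if a > b)
    deficit = sum(b - a for a, b in zip(u, v) if b > a)
    return max(surplus, deficit)


def edit_distance(a, b):
    """Classic dynamic-programming edit distance, kept to two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # match or substitution
        prev = cur
    return prev[-1]


def filtered_join(left, right, k):
    """Return pairs within edit distance k, plus the number of pairs verified."""
    right_vecs = [(r, frequency_vector(r)) for r in right]
    result, verified = [], 0
    for l in left:
        lv = frequency_vector(l)
        for r, rv in right_vecs:
            if frequency_distance(lv, rv) > k:
                continue  # safely pruned: edit distance must also exceed k
            verified += 1
            if edit_distance(l, r) <= k:
                result.append((l, r))
    return result, verified


if __name__ == "__main__":
    left = ["ACGTACGT", "AAAAAAAA"]
    right = ["ACGTACGA", "CCCCCCCC", "AAAAAAAT"]
    pairs, verified = filtered_join(left, right, k=1)
    print(pairs, verified)
```

In this toy run only the pairs that survive the cheap vector test reach the quadratic-time verification step, which is the kind of saving a filtration stage aims for. The filter can still let through pairs that fail verification, so it trades precision for speed without losing any true matches.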
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Aghili, S.A., Agrawal, D., El Abbadi, A. (2003). BFT: Bit Filtration Technique for Approximate String Join in Biological Databases. In: Nascimento, M.A., de Moura, E.S., Oliveira, A.L. (eds) String Processing and Information Retrieval. SPIRE 2003. Lecture Notes in Computer Science, vol 2857. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-39984-1_25
DOI: https://doi.org/10.1007/978-3-540-39984-1_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-20177-9
Online ISBN: 978-3-540-39984-1