DRESS: dimensionality reduction for efficient sequence search

Kotsifakos, Alexios; Stefan, Alexandra; Athitsos, Vassilis; Das, Gautam; Papapetrou, Panagiotis

doi:10.1007/s10618-015-0413-2


Published: 21 March 2015

Volume 29, pages 1280–1311 (2015)

Data Mining and Knowledge Discovery

Alexios Kotsifakos¹,
Alexandra Stefan¹,
Vassilis Athitsos¹,
Gautam Das¹ &
Panagiotis Papapetrou²


Abstract

Similarity search in large sequence databases is a problem ubiquitous in a wide range of application domains, including searching biological sequences. In this paper we focus on protein and DNA data, and we propose a novel approximate method for speeding up range queries under the edit distance. Our method works in a filter-and-refine manner, and its key novelty is a query-sensitive mapping that transforms the original string space to a new string space of reduced dimensionality. Specifically, it first identifies the $t$ most frequent codewords in the query, and then uses these codewords to convert both the query and the database to a more compact representation. This is achieved by replacing every occurrence of each codeword with a new letter and by removing the remaining parts of the strings. Using this new representation, our method identifies a set of candidate matches that are likely to satisfy the range query, and finally refines these candidates in the original space. The main advantage of our method, compared to alternative methods for whole sequence matching under the edit distance, is that it does not require any training to create the mapping, and it can handle large query lengths with negligible losses in accuracy. Our experimental evaluation demonstrates that, for higher range values and large query sizes, our method produces significantly lower costs and runtimes compared to two state-of-the-art competitor methods.
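
The filter-and-refine idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions that the abstract does not specify: codewords are taken to be fixed-length substrings of length `k`, occurrences are replaced in a left-to-right, non-overlapping scan, and the threshold used in the reduced space (`reduced_radius`) defaults to the original range as a placeholder. All function and parameter names are hypothetical.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # match or substitute
        prev = curr
    return prev[-1]


def top_codewords(query: str, t: int, k: int) -> list[str]:
    """Return the t most frequent length-k substrings of the query.

    Assumption: a codeword is a fixed-length substring; frequency ties
    are broken alphabetically so the result is deterministic.
    """
    counts = Counter(query[i:i + k] for i in range(len(query) - k + 1))
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:t]]


def reduce_string(s: str, codewords: list[str], k: int) -> str:
    """Replace each codeword occurrence with a new letter; drop the rest."""
    letter = {w: chr(ord("a") + idx) for idx, w in enumerate(codewords)}
    out = []
    i = 0
    while i <= len(s) - k:
        window = s[i:i + k]
        if window in letter:
            out.append(letter[window])
            i += k  # assumption: occurrences do not overlap
        else:
            i += 1  # this character is not part of a codeword: removed
    return "".join(out)


def range_query(query, database, radius, t=2, k=2, reduced_radius=None):
    """Filter in the reduced space, then refine in the original space."""
    codewords = top_codewords(query, t, k)
    reduced_query = reduce_string(query, codewords, k)
    if reduced_radius is None:
        reduced_radius = radius  # placeholder threshold, an assumption
    results = []
    for s in database:
        reduced_s = reduce_string(s, codewords, k)
        # Filter step: cheap comparison on the compact representation.
        if edit_distance(reduced_query, reduced_s) <= reduced_radius:
            # Refine step: exact edit distance on the original strings.
            if edit_distance(query, s) <= radius:
                results.append(s)
    return results
```

Because the mapping is built from the query alone, the sketch needs no training phase, but the database strings are reduced at query time. The filter is approximate: a string within the range in the original space may still be rejected in the reduced space, so true matches can be missed.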



Acknowledgments

The work of Vassilis Athitsos was partially supported by National Science Foundation grants IIS-1055062, CNS-1059235, CNS-1035913, and CNS-1338118. The work of Gautam Das was partially supported by the National Science Foundation under grants 0812601, 0915834, and 1018865, and by grants from Microsoft Research.

Conflict of interest

The authors declare that they have no conflict of interest.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, University of Texas at Arlington, Arlington, TX, USA
Alexios Kotsifakos, Alexandra Stefan, Vassilis Athitsos & Gautam Das
Department of Computer and Systems Sciences, Stockholm University, Stockholm, Sweden
Panagiotis Papapetrou


Corresponding author

Correspondence to Panagiotis Papapetrou.

Additional information

Responsible editor: Joao Gama, Indre Zliobaite, Alipio Jorge and Concha Bielza.


About this article

Cite this article

Kotsifakos, A., Stefan, A., Athitsos, V. et al. DRESS: dimensionality reduction for efficient sequence search. Data Min Knowl Disc 29, 1280–1311 (2015). https://doi.org/10.1007/s10618-015-0413-2


Received: 25 July 2014
Accepted: 10 March 2015
Published: 21 March 2015
Issue Date: September 2015
DOI: https://doi.org/10.1007/s10618-015-0413-2
