Simple and Practical Sequence Nearest Neighbors with Block Operations

Muthukrishnan, S. Muthu; Ṣahinalp, S. Cenk

doi:10.1007/3-540-45452-7_22

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2373)

Included in the following conference series:

Annual Symposium on Combinatorial Pattern Matching

Abstract

The sequence nearest neighbors problem can be defined as follows. Given a database D of n sequences, preprocess D so that, given any query sequence Q, one can quickly find a sequence S in D for which d(S, Q) ≤ d(T, Q) for any other sequence T in D. Here d(S, Q) denotes the “distance” between sequences S and Q, which can be defined as the minimum number of “edit operations” needed to transform one sequence into the other. The edit operations considered in this paper include single character edits (insertions, deletions, replacements) as well as block (substring) edits (copying, uncopying and relocating blocks).
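As a point of reference, the problem statement above can be sketched as a brute-force baseline. This is only an illustration under a simplifying assumption: it uses the character-level edit distance (insertions, deletions, replacements) and ignores block edits entirely, and it scans the whole database instead of preprocessing it. The function names edit_distance and nearest_neighbor are hypothetical and do not come from the paper.

```python
def edit_distance(a: str, b: str) -> int:
    """Character-level edit distance: insertions, deletions, replacements.

    Assumption: block edits (copy, uncopy, relocate) are NOT modeled here.
    """
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # transforming a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # replace ca by cb, or match
        prev = curr
    return prev[-1]


def nearest_neighbor(database, query):
    """Return a sequence S in database minimizing d(S, query) by linear scan."""
    return min(database, key=lambda s: edit_distance(s, query))


if __name__ == "__main__":
    db = ["ACGTACGT", "TTTTTTTT", "ACGTTCGT"]
    print(nearest_neighbor(db, "ACGTACGA"))
```

The linear scan costs one distance computation per database sequence for every query, which is the cost that a preprocessed data structure is meant to avoid.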

One of the main application domains for the sequence nearest neighbors problem is computational genomics, where available tools for sequence comparison and search usually focus on edit operations involving single characters only. While such tools are useful for capturing certain evolutionary mechanisms (mainly point mutations), they may have limited applicability for understanding the mechanisms of segmental rearrangements (duplications, translocations and deletions) underlying genome evolution. Recent improvements in the resolution of the human genome composition suggest that such segmental rearrangements are much more common than previously estimated. Thus there is a substantial need for incorporating similarity measures that capture block edit operations in genomic sequence comparison and search. Unfortunately, even the computation of a block edit distance between two sequences under any set of non-trivial edit operations is NP-hard.

The first efficient data structure for approximate sequence nearest neighbor search for any set of non-trivial edit operations was described in [11]; the measure considered in this paper is the block edit distance. This method achieves preprocessing time and space polynomial in the size of D and query time near-linear in the size of Q by allowing an approximation factor of O(log l (log* l)²). The approach involves embedding sequences into Hamming space so that approximating Hamming distances estimates sequence block edit distances within the approximation ratio above.
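To make the idea of comparing sequences through binary vectors concrete, here is a toy sketch. It is explicitly not the embedding of [11], which is considerably more involved and is not reproduced here, and it carries no approximation guarantee. The stand-in mapping used below (an assumption chosen only for illustration) marks which length-q substrings occur in a sequence, so that relocating a block changes only the coordinates for substrings that straddle the block boundaries. The names qgram_embedding and hamming are hypothetical.

```python
from itertools import product


def qgram_embedding(seq, q=2, alphabet="ACGT"):
    """Map a sequence to a 0/1 vector with one coordinate per possible q-gram.

    Toy stand-in for an embedding into Hamming space; NOT the method of [11].
    """
    present = {seq[i:i + q] for i in range(len(seq) - q + 1)}
    return [1 if "".join(g) in present else 0
            for g in product(alphabet, repeat=q)]


def hamming(u, v):
    """Number of coordinates in which two equal-length vectors differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(u, v))


if __name__ == "__main__":
    s = "AAAACCCC"
    t = "CCCCAAAA"  # s with its two halves swapped (one block relocation)
    print(hamming(qgram_embedding(s), qgram_embedding(t)))
```

In this example the two vectors differ only in the coordinates for AC and CA, whereas a character-by-character comparison of the two strings would report a mismatch at every position.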

In this study we focus on simplification and experimental evaluation of the method of [11]. We first describe how we implement and test the accuracy of the transformations provided in [11] in terms of estimating the block edit distance under controlled data sets. Then, based on the Hamming distance estimator described in [3], we present a data structure for computing approximate nearest neighbors in Hamming space; this is simpler than the well-known ones in [9,6]. We finally report on how well the combined data structure performs for sequence nearest neighbor search under block edit distance.
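One generic way to estimate Hamming distances from short summaries is to compare vectors on a fixed random subset of coordinates and scale the mismatch count. The sketch below illustrates that idea only; it is an assumption made for illustration and should not be read as the estimator of [3] or as the data structure presented in the paper. All names (make_positions, sketch, estimate_hamming, approx_nearest) are hypothetical.

```python
import random


def make_positions(dim, k, seed=0):
    """Choose k distinct coordinates out of dim, fixed once for all vectors."""
    rng = random.Random(seed)
    return rng.sample(range(dim), k)


def sketch(vec, positions):
    """Keep only the sampled coordinates of a 0/1 vector."""
    return [vec[p] for p in positions]


def estimate_hamming(sk_u, sk_v, dim):
    """Scale the mismatches seen on the sample up to the full dimension."""
    mismatches = sum(x != y for x, y in zip(sk_u, sk_v))
    return mismatches * dim / len(sk_u)


def approx_nearest(database, query, positions):
    """Index of the database vector with the smallest estimated distance."""
    dim = len(query)
    q = sketch(query, positions)
    return min(range(len(database)),
               key=lambda i: estimate_hamming(sketch(database[i], positions), q, dim))


if __name__ == "__main__":
    random_bits = random.Random(1)
    dim = 64
    db = [[random_bits.randint(0, 1) for _ in range(dim)] for _ in range(5)]
    positions = make_positions(dim, k=16)
    print(approx_nearest(db, db[3], positions))
```

Sampling fewer coordinates makes each comparison cheaper but the estimate noisier; when k equals the dimension the estimate coincides with the exact Hamming distance.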

Supported in part by an NSF Career Award and by the Charles B. Wang Foundation.

References

1. A. N. Arslan, O. Egecioglu, P. A. Pevzner. A new approach to sequence comparison: normalized sequence alignment. Proceedings of RECOMB, 2001.
2. J. A. Bailey, A. M. Yavor, H. F. Massa, B. J. Trask, E. E. Eichler. Segmental duplications: organization and impact within the current human genome project assembly. Genome Research 11(6), Jun 2001.
3. G. Cormode, M. Paterson, S. C. Sahinalp, U. Vishkin. Communication complexity of document exchange. Proc. ACM-SIAM Symp. on Discrete Algorithms, 2000.
4. G. Cormode, S. Muthukrishnan, S. C. Sahinalp. Permutation editing and matching via embeddings. Proc. ICALP, 2001.
5. D. F. Feng, R. F. Doolittle. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evol 25(4):351–360, 1987.
6. P. Indyk, R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. Proc. ACM Symp. on Theory of Computing, 604–613, 1998.
7. Jackson, Strachan, Dover. Human Genome Evolution. Bios Scientific Publishers, 1996.
8. Y. Ji, E. E. Eichler, S. Schwartz, R. D. Nicholls. Structure of chromosomal duplications and their role in mediating human genomic disorders. Genome Research 10, 2000.
9. E. Kushilevitz, R. Ostrovsky, Y. Rabani. Efficient search for approximate nearest neighbor in high dimensional spaces. Proc. ACM Symposium on Theory of Computing, 614–623, 1998.
10. D. Lopresti, A. Tomkins. Block edit models for approximate string matching. Theoretical Computer Science, 1996.
11. S. Muthukrishnan, S. C. Sahinalp. Approximate nearest neighbors and sequence comparison with block operations. Proc. ACM Symposium on Theory of Computing, 2000.
12. V. I. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Cybernetics and Control Theory, 10(8):707–710, 1966.
13. V. Bafna, P. A. Pevzner. Sorting by transpositions. SIAM J. Discrete Math, 11:224–240, 1998.
14. D. Shapira, J. Storer. Edit distance with move operations. Proceedings of CPM, 2002.
15. S. C. Sahinalp, U. Vishkin. Approximate and dynamic matching of patterns using a labeling paradigm. Proceedings of IEEE Symposium on Foundations of Computer Science, 1996.
16. G. P. Smith. Evolution of repeated DNA sequences by unequal crossover. Science, vol. 191, pp. 528–535.
17. J. D. Thompson, D. G. Higgins, T. J. Gibson. Clustal-W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties and weight matrix choice. Nucleic Acid Research, Vol. 22, No. 22, 1994.
18. L. Wang, T. Jiang. On the complexity of multiple sequence alignment. Journal of Computational Biology, 1:337–348, 1994.
19. C. Venter et al. The sequence of the human genome. Science, 16:291, Feb 2001.

Author information

Authors and Affiliations

Dept of CS, Rutgers University and AT&T Labs - Research, Florham Park, NJ, USA
S. Muthu Muthukrishnan
Dept of EECS, Dept of Genetics, Center for Computational Genomics, CWRU, Cleveland, OH, USA
S. Cenk Ṣahinalp

Editor information

Editors and Affiliations

Department of Electrical Engineering and Computer Science, University of Padova, Via Gradenigo 6/A, 35131, Padova, Italy
Alberto Apostolico
Department of Informatics, Kyushu University, Fukuoka 812-8581, Japan
Masayuki Takeda

About this paper

Cite this paper

Muthukrishnan, S.M., Ṣahinalp, S.C. (2002). Simple and Practical Sequence Nearest Neighbors with Block Operations. In: Apostolico, A., Takeda, M. (eds) Combinatorial Pattern Matching. CPM 2002. Lecture Notes in Computer Science, vol 2373. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45452-7_22

DOI: https://doi.org/10.1007/3-540-45452-7_22
Published: 21 June 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43862-5
Online ISBN: 978-3-540-45452-6
eBook Packages: Springer Book Archive
