Abstract
The sequence nearest neighbors problem can be defined as follows. Given a database D of n sequences, preprocess D so that given any query sequence Q, one can quickly find a sequence S in D for which d(S, Q) ≤ d(T, Q) for any other sequence T in D. Here d(S, Q) denotes the “distance” between sequences S and Q, defined as the minimum number of “edit operations” needed to transform one sequence into the other. The edit operations considered in this paper include single character edits (insertions, deletions, replacements) as well as block (substring) edits (copying, uncopying and relocating blocks).
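As a point of reference for the definition above, the following is a minimal brute-force sketch that scans the whole database with the classical single-character edit distance. It is only a baseline under stated assumptions: the function names `levenshtein` and `nearest_neighbor` are illustrative, no preprocessing is done, and block operations are not modeled at all.

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions and
    replacements needed to turn s into t (no block operations)."""
    prev = list(range(len(t) + 1))          # distances from "" to prefixes of t
    for i, a in enumerate(s, start=1):
        curr = [i]                          # distance from s[:i] to ""
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,                # delete a
                curr[j - 1] + 1,            # insert b
                prev[j - 1] + (a != b),     # replace a by b, or match
            ))
        prev = curr
    return prev[-1]


def nearest_neighbor(database, query, dist=levenshtein):
    """Return a sequence S in database minimizing dist(S, query).
    Linear scan: one distance computation per database sequence."""
    return min(database, key=lambda s: dist(s, query))


if __name__ == "__main__":
    db = ["ACGT", "TTTT", "GGGG"]
    print(nearest_neighbor(db, "ACGA"))
```

The scan costs one quadratic-time distance computation per database entry, which is the cost a preprocessed data structure is meant to avoid.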
One of the main application domains for the sequence nearest neighbors problem is computational genomics, where available tools for sequence comparison and search usually focus on edit operations involving single characters only. While such tools are useful for capturing certain evolutionary mechanisms (mainly point mutations), they may have limited applicability for understanding the mechanisms of segmental rearrangements (duplications, translocations and deletions) underlying genome evolution. Recent improvements in the resolution of the human genome composition suggest that such segmental rearrangements are much more common than previously estimated. Thus there is a substantial need for incorporating similarity measures that capture block edit operations in genomic sequence comparison and search. Unfortunately, even the computation of a block edit distance between two sequences under any set of non-trivial edit operations is NP-hard.
The first efficient data structure for approximate sequence nearest neighbor search for any set of non-trivial edit operations was described in [11]; the measure considered in this paper is the block edit distance. This method achieves preprocessing time and space polynomial in the size of D and query time near-linear in the size of Q by allowing an approximation factor of O(log l (log* l)^2). The approach involves embedding sequences into Hamming space so that approximating Hamming distances estimates sequence block edit distances within the approximation ratio above.
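To give some intuition for why a Hamming-space embedding can be tolerant of block edits, here is a toy sketch. It is not the transformation of [11]: it swaps in a much simpler k-mer indicator vector (one coordinate per possible length-k substring), and the names `kmer_indicator` and `hamming`, the DNA alphabet, and the choice k = 2 are all assumptions made for illustration. Relocating a block changes only the k-mers that straddle the block boundaries, so the Hamming distance stays small even when many characters move.

```python
from itertools import product


def kmer_indicator(seq, k, alphabet="ACGT"):
    """Binary vector with one coordinate per possible k-mer over alphabet;
    a coordinate is 1 if that k-mer occurs somewhere in seq."""
    present = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return [1 if "".join(p) in present else 0
            for p in product(alphabet, repeat=k)]


def hamming(u, v):
    """Number of coordinates in which two equal-length vectors differ."""
    return sum(a != b for a, b in zip(u, v))


if __name__ == "__main__":
    s = "AACCGGTT"
    t = "GGTTAACC"   # s with the block "GGTT" relocated to the front
    print(hamming(kmer_indicator(s, 2), kmer_indicator(t, 2)))
```

In this example only the 2-mers spanning the old and new junctions ("CG" and "TA") differ between the two vectors. This toy vector carries no approximation guarantee of the kind stated above.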
In this study we focus on simplification and experimental evaluation of the method of [11]. We first describe how we implement and test the accuracy of the transformations provided in [11] in terms of estimating the block edit distance on controlled data sets. Then, based on the Hamming distance estimator described in [3], we present a data structure for computing approximate nearest neighbors in Hamming space; this is simpler than the well-known ones in [9,6]. We finally report on how well the combined data structure performs for sequence nearest neighbor search under block edit distance.
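The general idea of searching Hamming space through distance estimates can be sketched as follows. This is a generic coordinate-sampling estimator, not necessarily the estimator of [3] or the data structure evaluated in the paper; the helper names (`sample_positions`, `sketch`, `estimate_hamming`, `approx_nearest`) and the fixed seed are assumptions for illustration. Each vector is reduced to its values at k sampled coordinates, and the mismatch count on the sample is rescaled to the full dimension.

```python
import random


def sample_positions(dim, k, seed=0):
    """Choose k distinct coordinates out of dim, shared by all vectors."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(dim), k))


def sketch(bits, positions):
    """Keep only the sampled coordinates of a binary vector."""
    return [bits[p] for p in positions]


def estimate_hamming(sketch_a, sketch_b, dim):
    """Rescale the mismatch count on the sample to the full dimension."""
    mismatches = sum(a != b for a, b in zip(sketch_a, sketch_b))
    return mismatches * dim / len(sketch_a)


def approx_nearest(db_sketches, query_sketch, dim):
    """Index of the database sketch with the smallest estimated distance."""
    return min(range(len(db_sketches)),
               key=lambda i: estimate_hamming(db_sketches[i], query_sketch, dim))


if __name__ == "__main__":
    dim, k = 8, 4
    positions = sample_positions(dim, k)
    database = [[1] * dim, [0, 1, 1, 1, 0, 0, 0, 1]]
    query = [0, 1, 0, 1, 0, 1, 0, 1]
    db_sketches = [sketch(v, positions) for v in database]
    print(approx_nearest(db_sketches, sketch(query, positions), dim))
```

With k equal to the full dimension the estimate coincides with the exact Hamming distance; smaller k trades accuracy for less storage and faster comparisons.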
Supported in part by an NSF Career Award and by the Charles B. Wang Foundation.
References
[1] A. N. Arslan, O. Egecioglu, P. A. Pevzner. A new approach to sequence comparison: normalized sequence alignment. Proceedings of RECOMB, 2001.
[2] J. A. Bailey, A. M. Yavor, H. F. Massa, B. J. Trask, E. E. Eichler. Segmental duplications: organization and impact within the current human genome project assembly. Genome Research 11(6), Jun 2001.
[3] G. Cormode, M. Paterson, S. C. Sahinalp and U. Vishkin. Communication Complexity of Document Exchange. Proc. ACM-SIAM Symp. on Discrete Algorithms, 2000.
[4] G. Cormode, S. Muthukrishnan, S. C. Sahinalp. Permutation editing and matching via Embeddings. Proc. ICALP, 2001.
[5] D. F. Feng, R. F. Doolittle. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evol. 25(4):351–60, 1987.
[6] P. Indyk and R. Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. Proc. ACM Symp. on Theory of Computing, 1998, 604–613.
[7] Jackson, Strachan, Dover. Human Genome Evolution. Bios Scientific Publishers, 1996.
[8] Y. Ji, E. E. Eichler, S. Schwartz, R. D. Nicholls. Structure of Chromosomal Duplications and their Role in Mediating Human Genomic Disorders. Genome Research 10, 2000.
[9] E. Kushilevitz, R. Ostrovsky and Y. Rabani. Efficient search for approximate nearest neighbor in high dimensional spaces. Proc. ACM Symposium on Theory of Computing, 1998, 614–623.
[10] D. Lopresti and A. Tomkins. Block edit models for approximate string matching. Theoretical Computer Science, 1996.
[11] S. Muthukrishnan and S. C. Sahinalp. Approximate nearest neighbors and sequence comparison with block operations. Proc. ACM Symposium on Theory of Computing, 2000.
[12] V. I. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Cybernetics and Control Theory, 10(8):707–710, 1966.
[13] V. Bafna, P. A. Pevzner. Sorting by transpositions. SIAM J. Discrete Math, 11, 224–240, 1998.
[14] D. Shapira and J. Storer. Edit distance with move operations. Proceedings of CPM, 2002.
[15] S. C. Sahinalp and U. Vishkin. Approximate and Dynamic Matching of Patterns Using a Labeling Paradigm. Proceedings of IEEE Symposium on Foundations of Computer Science, 1996.
[16] G. P. Smith. Evolution of Repeated DNA Sequences by Unequal Crossover. Science, vol 191, pp 528–535.
[17] J. D. Thompson, D. G. Higgins, T. J. Gibson. Clustal-W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties and weight matrix choice. Nucleic Acid Research, Vol. 22, No. 22, 1994.
[18] L. Wang and T. Jiang. On the complexity of multiple sequence alignment. Journal of Computational Biology, 1:337–348, 1994.
[19] C. Venter et al. The sequence of the human genome. Science, 16:291, Feb 2001.
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Muthukrishnan, S.M., Ṣahinalp, S.C. (2002). Simple and Practical Sequence Nearest Neighbors with Block Operations. In: Apostolico, A., Takeda, M. (eds) Combinatorial Pattern Matching. CPM 2002. Lecture Notes in Computer Science, vol 2373. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45452-7_22
DOI: https://doi.org/10.1007/3-540-45452-7_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43862-5
Online ISBN: 978-3-540-45452-6
eBook Packages: Springer Book Archive