Simple and Practical Sequence Nearest Neighbors with Block Operations

  • Conference paper
Combinatorial Pattern Matching (CPM 2002)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2373)

Abstract

The sequence nearest neighbors problem can be defined as follows. Given a database D of n sequences, preprocess D so that, given any query sequence Q, one can quickly find a sequence S in D for which d(S, Q) ≤ d(T, Q) for any other sequence T in D. Here d(S, Q) denotes the “distance” between sequences S and Q, which can be defined as the minimum number of “edit operations” needed to transform one sequence into the other. The edit operations considered in this paper include single character edits (insertions, deletions, replacements) as well as block (substring) edits (copying, uncopying and relocating blocks).
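
As a point of reference, the problem statement above can be sketched as a linear scan over D. This is only an illustration under stated assumptions: it uses the classic single-character edit distance (insertions, deletions, replacements) as a stand-in for d and does not model block edits, and the names `edit_distance` and `nearest_neighbor` are hypothetical rather than taken from the paper.

```python
from typing import Sequence


def edit_distance(s: str, t: str) -> int:
    """Single-character edit distance (insert, delete, replace).

    Assumption: this is a stand-in for d; block operations are NOT modeled.
    """
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a from s
                curr[j - 1] + 1,         # insert b into s
                prev[j - 1] + (a != b),  # replace a by b, or match
            ))
        prev = curr
    return prev[-1]


def nearest_neighbor(database: Sequence[str], query: str) -> str:
    """Return a sequence S in the database minimizing d(S, query) by linear scan."""
    if not database:
        raise ValueError("database must be non-empty")
    return min(database, key=lambda s: edit_distance(s, query))
```

A scan like this computes n distances per query, which is the cost that a preprocessed data structure is meant to avoid.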

One of the main application domains for the sequence nearest neighbors problem is computational genomics, where available tools for sequence comparison and search usually focus on edit operations involving single characters only. While such tools are useful for capturing certain evolutionary mechanisms (mainly point mutations), they may have limited applicability for understanding the mechanisms of segmental rearrangements (duplications, translocations and deletions) underlying genome evolution. Recent improvements towards the resolution of the human genome composition suggest that such segmental rearrangements are much more common than was previously estimated. Thus there is a substantial need for incorporating similarity measures that capture block edit operations in genomic sequence comparison and search. Unfortunately, even the computation of a block edit distance between two sequences under any set of non-trivial edit operations is NP-hard.

The first efficient data structure for approximate sequence nearest neighbor search under any set of non-trivial edit operations was described in [11]; the measure considered in this paper is the block edit distance. This method achieves preprocessing time and space polynomial in the size of D and query time near-linear in the size of Q by allowing an approximation factor of O(log l (log* l)^2). The approach involves embedding sequences into Hamming space so that approximating Hamming distances estimates sequence block edit distances within the approximation ratio above.

In this study we focus on simplification and experimental evaluation of the method of [11]. We first describe how we implement the transformations provided in [11] and test their accuracy in estimating the block edit distance on controlled data sets. Then, based on the Hamming distance estimator described in [3], we present a data structure for computing approximate nearest neighbors in Hamming space; this is simpler than the well-known ones in [9,6]. We finally report on how well the combined data structure performs for sequence nearest neighbor search under block edit distance.
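
To illustrate what estimating a Hamming distance can look like, here is a minimal sketch based on uniform coordinate sampling. This is a generic technique swapped in for illustration only; it is not the estimator of [3], and the function names and the sample-size parameter `k` are hypothetical.

```python
import random
from typing import Optional, Sequence


def hamming_distance(x: Sequence[int], y: Sequence[int]) -> int:
    """Exact Hamming distance: number of coordinates where x and y differ."""
    if len(x) != len(y):
        raise ValueError("vectors must have equal length")
    return sum(a != b for a, b in zip(x, y))


def estimate_hamming(
    x: Sequence[int],
    y: Sequence[int],
    k: int,
    rng: Optional[random.Random] = None,
) -> float:
    """Estimate the Hamming distance from k uniformly sampled coordinates.

    Coordinates are sampled with replacement; the observed mismatch
    fraction is scaled by the vector length n, which gives an unbiased
    estimate of the exact distance.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have equal length")
    if k <= 0:
        raise ValueError("k must be positive")
    n = len(x)
    if n == 0:
        return 0.0
    rng = rng or random.Random()
    mismatches = 0
    for _ in range(k):
        i = rng.randrange(n)
        mismatches += x[i] != y[i]
    return n * mismatches / k
```

Larger values of `k` reduce the variance of the estimate at the cost of reading more coordinates; the exact `hamming_distance` is included only as a baseline for comparison.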

Supported in part by an NSF Career Award and by Charles B. Wang foundation.

References

  1. A. N. Arslan, O. Egecioglu, P. A. Pevzner. A new approach to sequence comparison: normalized sequence alignment, Proceedings of RECOMB 2001.

  2. Bailey J.A., Yavor A.M., Massa H.F., Trask B.J., Eichler E.E., Segmental duplications: organization and impact within the current human genome project assembly, Genome Research 11(6), Jun 2001.

  3. G. Cormode, M. Paterson, S. C. Sahinalp and U. Vishkin. Communication Complexity of Document Exchange. Proc. ACM-SIAM Symp. on Discrete Algorithms, 2000.

  4. G. Cormode, S. Muthukrishnan, S. C. Sahinalp. Permutation editing and matching via Embeddings. Proc. ICALP, 2001.

  5. Feng D.F., Doolittle R.F., Progressive sequence alignment as a prerequisite to correct phylogenetic trees, J Mol Evol. 1987;25(4):351–60.

  6. P. Indyk and R. Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. Proc. ACM Symp. on Theory of Computing, 1998, 604–613.

  7. Jackson, Strachan, Dover, Human Genome Evolution, Bios Scientific Publishers, 1996.

  8. Y. Ji, E. E. Eichler, S. Schwartz, R. D. Nicholls, Structure of Chromosomal Duplications and their Role in Mediating Human Genomic Disorders, Genome Research 10, 2000.

  9. E. Kushilevitz, R. Ostrovsky and Y. Rabani. Efficient search for approximate nearest neighbor in high dimensional spaces. Proc. ACM Symposium on Theory of Computing, 1998, 614–623.

  10. D. Lopresti and A. Tomkins. Block edit models for approximate string matching. Theoretical Computer Science, 1996.

  11. S. Muthukrishnan and S. C. Sahinalp, Approximate nearest neighbors and sequence comparison with block operations. Proc. ACM Symposium on Theory of Computing, 2000.

  12. V. I. Levenshtein, Binary codes capable of correcting deletions, insertions and reversals, Cybernetics and Control Theory, 10(8):707–710, 1966.

  13. V. Bafna, P. A. Pevzner, Sorting by transpositions. SIAM J. Discrete Math, 11, 224–240, 1998.

  14. D. Shapira and J. Storer, Edit distance with move operations, Proceedings of CPM, (2002).

  15. S. C. Sahinalp and U. Vishkin, Approximate and Dynamic Matching of Patterns Using a Labeling Paradigm, Proceedings of IEEE Symposium on Foundations of Computer Science, (1996).

  16. George P. Smith, Evolution of Repeated DNA Sequences by Unequal Crossover, Science, vol 191, pp 528–535.

  17. J. D. Thompson, D. G. Higgins, T. J. Gibson, Clustal-W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties and weight matrix choice, Nucleic Acid Research 1994, Vol. 22, No. 22.

  18. L. Wang and T. Jiang, On the complexity of multiple sequence alignment, Journal of Computational Biology, 1:337–348, 1994.

  19. C. Venter et al., The sequence of the human genome, Science, 16:291, Feb 2001.

Copyright information

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Muthukrishnan, S.M., Ṣahinalp, S.C. (2002). Simple and Practical Sequence Nearest Neighbors with Block Operations. In: Apostolico, A., Takeda, M. (eds) Combinatorial Pattern Matching. CPM 2002. Lecture Notes in Computer Science, vol 2373. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45452-7_22

  • DOI: https://doi.org/10.1007/3-540-45452-7_22

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-43862-5

  • Online ISBN: 978-3-540-45452-6

  • eBook Packages: Springer Book Archive
