Abstract
Although hash-based approaches to sequence alignment and genome assembly are long established, their utility is predicated on the rapid identification of exact k-mers from a hash-map or similar data structure. We describe how a fuzzy hash-map can be applied to quickly and accurately align a prokaryotic genome to the reference genome of a related species. Using this technique, a draft genome of Mycoplasma genitalium, sampled at 1X coverage, was accurately anchored against the genome of Mycoplasma pneumoniae. The fuzzy approach to alignment ordered and orientated more than 65% of the reads from the draft genome in under 10 seconds, with an error rate of less than 1.5%. Without sacrificing execution speed, fuzzy hash-maps also provide a mechanism for error tolerance and variability in k-mer-centric sequence alignment and assembly applications.
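To make the idea of a fuzzy k-mer lookup concrete, the following is a minimal sketch and not the authors' implementation. It assumes that "fuzzy" means tolerating a single substitution (Hamming distance 1) and that a read is anchored at the reference offset and strand supported by the most k-mers. The function names (build_index, neighbours, fuzzy_get, anchor), the one-mismatch tolerance, and the voting scheme are all illustrative assumptions; the fuzzy hash-map described in the paper may use a different similarity measure and data structure.

```python
from collections import Counter, defaultdict


def build_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def neighbours(kmer, alphabet="ACGT"):
    """Yield the k-mer itself, then every k-mer at Hamming distance 1."""
    yield kmer
    for i, base in enumerate(kmer):
        for alt in alphabet:
            if alt != base:
                yield kmer[:i] + alt + kmer[i + 1:]


def fuzzy_get(index, kmer):
    """Return reference positions for a k-mer, tolerating one mismatch.

    Assumption: fuzziness is emulated by probing an exact hash-map with
    every one-substitution variant of the query k-mer.
    """
    hits = []
    for candidate in neighbours(kmer):
        hits.extend(index.get(candidate, ()))
    return hits


def reverse_complement(seq):
    """Reverse-complement a DNA string over the alphabet A, C, G, T."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def anchor(read, index, k):
    """Return the (offset, strand) backed by the most k-mers, or None.

    Each k-mer hit votes for the reference offset at which the whole read
    (or its reverse complement, strand "-") would start.
    """
    votes = Counter()
    for strand, seq in (("+", read), ("-", reverse_complement(read))):
        for j in range(len(seq) - k + 1):
            for pos in fuzzy_get(index, seq[j:j + k]):
                votes[(pos - j, strand)] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]
```

In this sketch, a read that carries a sequencing error or a genuine inter-species substitution still contributes votes from the k-mers that overlap the mismatch, which an exact k-mer lookup would discard. The cost is 3k + 1 probes per k-mer rather than one, so the lookup remains constant-time for a fixed k.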
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Healy, J., Chambers, D. (2011). Fast and Accurate Genome Anchoring Using Fuzzy Hash Maps. In: Rocha, M.P., Rodríguez, J.M.C., Fdez-Riverola, F., Valencia, A. (eds) 5th International Conference on Practical Applications of Computational Biology & Bioinformatics (PACBB 2011). Advances in Intelligent and Soft Computing, vol 93. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19914-1_21
DOI: https://doi.org/10.1007/978-3-642-19914-1_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-19913-4
Online ISBN: 978-3-642-19914-1
eBook Packages: Engineering (R0)