WHAM: A High-Throughput Sequence Alignment Method

Published: 01 December 2012

Abstract

Over the last decade, the cost of producing genomic sequences has dropped dramatically thanks to so-called next-generation sequencing methods. However, these methods depend critically on fast and sophisticated data processing for aligning a set of query sequences to a reference genome using rich string matching models. This work focuses on the design, development, and evaluation of a data processing system for this crucial “short read alignment” problem. Our system, called WHAM, employs hash-based indexing methods and bitwise operations for sequence alignment. It supports rich match models and is significantly faster than existing state-of-the-art methods. In addition, its relative speedup over existing methods is poised to grow as read lengths increase in the future.
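To make the two ingredients named in the abstract concrete, the following is a minimal illustrative sketch of how a hash-based index and bitwise comparison can be combined for mismatch-only alignment. It is not WHAM's actual implementation: the 2-bit base encoding, the pigeonhole seeding strategy, and the names `pack`, `mismatches`, `build_index`, and `align` are all assumptions made for this example, and only the bases A, C, G, and T are handled.

```python
from collections import defaultdict

# Assumed 2-bit encoding of the four DNA bases (illustrative choice).
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack(seq):
    """Pack a DNA string into one integer, 2 bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


def mismatches(a, b, length):
    """Count differing bases between two packed sequences of `length` >= 1 bases."""
    diff = a ^ b                               # non-zero 2-bit pair = mismatching base
    low_mask = int("01" * length, 2)           # selects the low bit of every pair
    folded = (diff | (diff >> 1)) & low_mask   # low bit set iff the pair differs
    return bin(folded).count("1")              # population count


def build_index(reference, seed_len):
    """Hash index mapping every substring of length seed_len to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - seed_len + 1):
        index[reference[pos:pos + seed_len]].append(pos)
    return index


def align(read, reference, index, seed_len, max_mismatches):
    """Return sorted reference positions where read matches within max_mismatches."""
    n = len(read)
    if n < (max_mismatches + 1) * seed_len:
        raise ValueError("read too short for the requested number of seeds")
    packed_read = pack(read)
    hits = set()
    # Pigeonhole argument: among max_mismatches + 1 disjoint seeds,
    # at least one must match the reference exactly.
    for s in range(max_mismatches + 1):
        offset = s * seed_len
        seed = read[offset:offset + seed_len]
        for pos in index.get(seed, ()):
            start = pos - offset
            if start < 0 or start + n > len(reference):
                continue
            window = pack(reference[start:start + n])
            # Verify the candidate with XOR plus population count.
            if mismatches(packed_read, window, n) <= max_mismatches:
                hits.add(start)
    return sorted(hits)


reference = "ACGTACGTTTGACCAGTA"
index = build_index(reference, seed_len=4)
print(align("TTGACGAG", reference, index, seed_len=4, max_mismatches=1))  # prints [8]
```

In this sketch the hash lookup narrows the search to a few candidate positions, and the XOR-based check verifies each candidate with a handful of word-level operations instead of a base-by-base loop. A production aligner would pack the reference once rather than re-packing each window, and would also need to handle insertions, deletions, and ambiguous bases, none of which are covered here.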



Published In

ACM Transactions on Database Systems, Volume 37, Issue 4
December 2012
345 pages
ISSN:0362-5915
EISSN:1557-4644
DOI:10.1145/2389241
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 December 2012
Accepted: 01 August 2012
Revised: 01 August 2012
Received: 01 October 2011
Published in TODS Volume 37, Issue 4

Author Tags

  1. Sequence alignment
  2. approximate string matching
  3. bit-parallelism

Qualifiers

  • Research-article
  • Research
  • Refereed

Cited By

  • (2023) Retracted: Maternal traditional Chinese medicine exposure and risk of congenital malformations: a multicenter prospective cohort study. Acta Obstetricia et Gynecologica Scandinavica 102:6 (735-743). DOI: 10.1111/aogs.14553. Online publication date: 19-Apr-2023
  • (2022) Unlocking Personalized Healthcare on Modern CPUs/GPUs: Three-way Gene Interaction Study. 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (146-156). DOI: 10.1109/IPDPS53621.2022.00023. Online publication date: May-2022
  • (2020) Unique k-mers as Strain-Specific Barcodes for Phylogenetic Analysis and Natural Microbiome Profiling. International Journal of Molecular Sciences 21:3 (944). DOI: 10.3390/ijms21030944. Online publication date: 31-Jan-2020
  • (2019) Effectiveness of pharmaceutical interventions for meibomian gland dysfunction: An evidence‐based review of clinical trials. Clinical & Experimental Ophthalmology 47:5 (658-668). DOI: 10.1111/ceo.13460. Online publication date: 18-Feb-2019
  • (2019) RuleTailor: Optimizing Flow Table Updates in OpenFlow Switches With Rule Transformations. IEEE Transactions on Network and Service Management 16:4 (1581-1594). DOI: 10.1109/TNSM.2019.2947217. Online publication date: Dec-2019
  • (2019) Kmerind. IEEE/ACM Transactions on Computational Biology and Bioinformatics 16:4 (1117-1131). DOI: 10.1109/TCBB.2017.2760829. Online publication date: 1-Jul-2019
  • (2018) Co-occurrence pattern mining based on a biological approximation scoring matrix. Pattern Analysis & Applications 21:4 (977-996). DOI: 10.5555/3288219.3288227. Online publication date: 1-Nov-2018
  • (2017) Research on Real Time Processing and Intelligent Analysis Technology of Power Big Data. Proceedings of the International Conference on Big Data and Internet of Thing (43-47). DOI: 10.1145/3175684.3175717. Online publication date: 20-Dec-2017
  • (2017) Composed sketch framework for quantiles and cardinality queries over big data streams. Proceedings of the ACM Turing 50th Celebration Conference - China (1-10). DOI: 10.1145/3063955.3063995. Online publication date: 12-May-2017
  • (2017) Massively Parallel Processing of Whole Genome Sequence Data. Proceedings of the 2017 ACM International Conference on Management of Data (187-202). DOI: 10.1145/3035918.3064048. Online publication date: 9-May-2017
