skip to main content
research-article

Efficient relaxed search in hierarchically clustered sequence datasets

Published: 19 July 2012 Publication History

Abstract

This article presents a new algorithm for finding oligonucleotide signatures that are specific and sensitive for organisms or groups of organisms in large-scale sequence datasets. We assume that the organisms have been organized in a hierarchy, for example, a phylogenetic tree. The resulting signatures, binding sites for primers and probes, match the maximum possible number of organisms in the target group while having at most k matches outside of the target group.
The key step in the algorithm is the use of the lowest common ancestor (LCA) to search the organism hierarchy; this allows the combinatorial problem in almost linear time (empirically observed) to be solved. The presented algorithm improves performance by several orders of magnitude in terms of both memory consumption and runtime when compared to the best-known previous algorithms while giving identical, exact solutions.
This article gives a formal description of the algorithm, discusses details of our concrete, publicly available implementation, and presents the results from our performance evaluation.

References

[1]
Abouelhoda, M. I., Kurtz, S., and Ohlebusch, E. 2004. Replacing suffix trees with enhanced suffix arrays. J. Discrete Algor. 2, 1, 53--86.
[2]
Amann, R. and Fuchs, B. M. 2008. Single-cell identification in microbial communities by improved fluorescence in situ hybridization techniques. Nat. Rev. Microbiol. 6, 5, 339--348.
[3]
Amann, R. I., Ludwig, W., and Schleifer, K. H. 1995. Phylogenetic identification and in situ detection of individual microbial cells without cultivation. Microbiol Rev. 59, 1, 143--169.
[4]
Bader, K. C., Eissler, T., Evans, N., GauthierDickey, C., Grothoff, C., Grothoff, K., Keene, J., Meier, H., Ritzdorf, C., and Rutherford, M. J. 2010. Distributed stream processing with DUP. In Network and Parallel Computing, Lecture Notes in Computer Science, vol. 6289, Springer, Berlin, 232--246.
[5]
Bader, K. C., Grothoff, C., and Meier, H. 2011. Comprehensive and relaxed search for oligonucleotide signatures in hierarchically clustered sequence datasets. Bioinformatics 27, 1546--1554.
[6]
Bartlett, J. M. S. and Stirling, D. 2003. A short history of the polymerase chain reaction. In PCR Protocols, J. M. Bartlett and D. Stirling, Eds., Methods in Molecular Biology Series, vol. 226, Humana Press, 3--6. 10.1385/1-59259-384-4:3.
[7]
Bender, M. and Farach-Colton, M. 2000. The lca problem revisited. In LATIN 2000: Theoretical Informatics, G. Gonnet and A. Viola, Eds., Lecture Notes in Computer Science Series, vol. 1776, Springer Berlin, 88--94.
[8]
Chung, W. H., Rhee, S. K., Wan, X. F., Bae, J. W., Quan, Z. X., and Park, Y. H. 2005. Design of long oligonucleotide probes for functional gene detection in a microbial community. Bioinformatics 21, 4092--4100.
[9]
Cole, J. R., Wang, Q., Cardenas, E., Fish, J., Chai, B., Farris, R. J., Kulam-Syed-Mohideen, A. S., McGarrell, D. M., Marsh, T., Garrity, G. M., and Tiedje, J. M. 2009. The Ribosomal Database Project: Improved alignments and new tools for rRNA analysis. Nucleic Acids Res. 37, D141--145.
[10]
DeSantis, T. Z., Hugenholtz, P., Larsen, N., Rojas, M., Brodie, E. L., Keller, K., Huber, T., Dalevi, D., Hu, P., and Andersen, G. L. 2006. Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl. Environ. Micorbiol. 72, 7, 5069--5072.
[11]
Doring, A., Weese, D., Rausch, T., and Reinert, K. 2008. Seqan an efficient, generic c++ library for sequence analysis. BMC Bioinformatics 9, 1, 11.
[12]
Duitama, J., Kumar, D. M., Hemphill, E., Khan, M., Mandoiu, I. I., and Nelson, C. E. 2009. PrimerHunter: a primer design tool for PCR-based virus subtype identification. Nucleic Acids Res. 37, 2483--2492.
[13]
Eissler, T., Hodges, C. P., and Meier, H. 2011. Ptpan—overcoming memory limitations in oligonucleotide string matching for primer/probe design. Bioinformatics 27, 20, 2797--2805.
[14]
Feng, S. and Tillier, E. R. M. R. 2007. A fast and flexible approach to oligonucleotide probe design for genomes and gene families. Bioinformatics 23, 1195--1202.
[15]
Ficetola, G. F., Coissac, E., Zundel, S., Riaz, T., Shehzad, W., Bessiere, J., Taberlet, P., and Pompanon, F. 2010. An in silico approach for the evaluation of DNA barcodes. BMC Genomics 11, 434.
[16]
Fischer, J. and Heun, V. 2006. Theoretical and practical improvements on the rmq-problem, with applications to lca and lce. In Combinatorial Pattern Matching, M. Lewenstein and G. Valiente, Eds., Lecture Notes in Computer Science, vol. 4009, Springer, Berlin, 36--48.
[17]
Fischer, J. and Heun, V. 2007. A new succinct representation of rmq-information and improvements in the enhanced suffix array. In Combinatorics, Algorithms, Probabilistic and Experimental Methodologies, B. Chen, M. Paterson, and G. Zhang, Eds., Lecture Notes in Computer Science, vol. 4614, Springer, Berlin, 459--470.
[18]
Gabow, H. N., Bentley, J. L., and Tarjan, R. E. 1984. Scaling and related techniques for geometry problems. In Proceedings of the 16th Annual ACM Symposium on Theory of Computing (STOC'84). ACM, New York, 135--143.
[19]
Kurtz, S., Phillippy, A., Delcher, A. L., Smoot, M., Shumway, M., Antonescu, C., and Salzberg, S. L. 2004. Versatile and open software for comparing large genomes. Genome Biol. 5, R12.
[20]
Lee, H. P., Sheu, T.-F., and Tang, C. Y. 2010. A parallel and incremental algorithm for efficient unique signature discovery on dna databases. BMC Bioinformatics 11, 132.
[21]
Loy, A., Maixner, F., Wagner, M., and Horn, M. 2007. probeBase--an online resource for rRNA-targeted oligonucleotide probes: new features 2007. Nucleic Acids Res. 35 (Database issue) D800--804.
[22]
Ludwig, W., Strunk, O., Westram, R., Richter, L., Meier, H., Yadhukumar, et al. 2004. ARB: a software environment for sequence data. Nucleic Acids Res. 32, 4, 1363--1371.
[23]
Olsen, G. 1990. The “Newick's 8:45” tree format. http://evolution.genetics.washington.edu/phylip/newick_doc.html.
[24]
Peterlongo, P., Nicolas, J., Lavenier, D., Vorch, R., and Querellou, J. 2009. c-gamma:comparative genome analysis of molecular markers. In Pattern Recognition in Bioinformatics, V. Kadirkamanathan, G. Sanguinetti, M. Girolami, M. Niranjan, and J. Noirel, Eds., Lecture Notes in Computer Science, vol. 5780, Springer, Berlin, 255--269.
[25]
Phillippy, A. M., Ayanbule, K., Edwards, N. J., and Salzberg, S. L. 2009. Insignia: a DNA signature search web server for diagnostic assay development. Nucleic Acids Res. 37, W229--234.
[26]
Phillippy, A. M., Mason, J. A., Ayanbule, K., Sommer, D. D., Taviani, E., Huq, A., Colwell, R. R., Knight, I. T., and Salzberg, S. L. 2007. Comprehensive DNA signature discovery and validation. PLoS Comput. Biol. 3, e98.
[27]
Price, M. N., Dehal, P. S., and Arkin, A. P. 2010. Fasttree 2 approximately maximum-likelihood trees for large alignments. PLoS ONE 5, 3, e9490.
[28]
Pruesse, E., Quast, C., Knittel, K., Fuchs, B. M., Ludwig, W., Peplies, J., and Glöckner, F. O. 2007. SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res. 35, 21, 7188--7196.
[29]
Riaz, T., Shehzad, W., Viari, A., Pompanon, F., Taberlet, P., and Coissac, E. 2011. ecoPrimers: inference of new DNA barcode markers from whole genome sequence analysis. Nucleic Acids Res. 39, e145.
[30]
Stahl, D. A., Flesher, B., Mansfield, H. R., and Montgomery, L. 1988. Use of phylogenetically based hybridization probes for studies of ruminal microbial ecology. Appl. Environ. Microbiol. 54, 1079--1084.
[31]
Stamatakis, A. 2006. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22, 2688--2690.

Index Terms

  1. Efficient relaxed search in hierarchically clustered sequence datasets

      Recommendations

      Comments

      Information & Contributors

      Information

      Published In

      cover image ACM Journal of Experimental Algorithmics
      ACM Journal of Experimental Algorithmics  Volume 17, Issue
      2012
      427 pages
      ISSN:1084-6654
      EISSN:1084-6654
      DOI:10.1145/2133803
      Issue’s Table of Contents
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 19 July 2012
      Accepted: 01 April 2012
      Revised: 01 April 2012
      Received: 01 January 2012
      Published in JEA Volume 17

      Author Tags

      1. Data mining
      2. oligonucleotide
      3. phylogenetic
      4. primer design
      5. probe design
      6. ribosomal RNA
      7. signature
      8. tree

      Qualifiers

      • Research-article
      • Research
      • Refereed

      Funding Sources

      Contributors

      Other Metrics

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • 0
        Total Citations
      • 258
        Total Downloads
      • Downloads (Last 12 months)2
      • Downloads (Last 6 weeks)0
      Reflects downloads up to 19 Feb 2025

      Other Metrics

      Citations

      View Options

      Login options

      Full Access

      View options

      PDF

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader

      Figures

      Tables

      Media

      Share

      Share

      Share this Publication link

      Share on social media