DOI: 10.1145/3019612.3019614
research-article

A profile-based probabilistic approach for the detection of anomalies in the cytochrome C oxidase I amplicon sequences

Published: 03 April 2017

ABSTRACT

The cytochrome c oxidase I (COI) gene is among the most popular markers for molecular biodiversity estimation. In essence, COI-based approaches for taxonomic identification rely on comprehensive reference databases to assign unknown sequences to known species and/or to enhance the identification of new species. As such, for COI-based methods to be effective, the accuracy and integrity of reference databases are critical. However, as COI repositories grow, it becomes difficult to manually curate and validate user-contributed data. Errors that go uncorrected, in turn, propagate into predictions, thereby reinforcing the cycle. Here, we propose a new computationally efficient approach for identifying anomalies that are due either to systematic biases (indels and chimeras) or to user error (mistranslation and misclassification). Our approach uses COI reference alignments to model substitutions across the marker. The resulting model is subsequently used to screen for sequences that fit the model poorly. Analysis of the complete set of curated Insecta COI reference sequences identifies numerous anomalous sequences, which makes a strong case for new strategies to screen publicly available COI references.
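
The abstract does not spell out the model, so the following is only a minimal sketch of the general idea of profile-based screening, not the authors' implementation. It assumes sequences are already aligned to the reference alignment and contain only the symbols `ACGT-`. The function names, the additive pseudocount, and the fixed `threshold` are all hypothetical choices made for illustration.

```python
import math
from collections import Counter

ALPHABET = "ACGT-"  # assumption: aligned nucleotides plus the gap symbol


def build_profile(alignment, pseudocount=1.0):
    """Return one dict per column mapping each symbol to its log-probability.

    Additive smoothing (the pseudocount) keeps unseen symbols from getting
    probability zero. Input sequences are assumed to use only ALPHABET.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    total = len(alignment) + pseudocount * len(ALPHABET)
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment)
        profile.append(
            {sym: math.log((counts[sym] + pseudocount) / total) for sym in ALPHABET}
        )
    return profile


def mean_log_likelihood(profile, seq):
    """Average per-column log-probability of an aligned sequence."""
    if len(seq) != len(profile):
        raise ValueError("sequence length must match the profile length")
    return sum(col[sym] for col, sym in zip(profile, seq)) / len(profile)


def flag_anomalies(profile, candidates, threshold):
    """Names of candidate sequences scoring below the (hypothetical) threshold."""
    return [
        name
        for name, seq in candidates.items()
        if mean_log_likelihood(profile, seq) < threshold
    ]


# Toy data, invented for illustration only.
reference = ["ACGTAC", "ACGTAC", "ACGTAT", "ACGTAC"]
profile = build_profile(reference)
candidates = {"ok": "ACGTAC", "odd": "GTACGT"}
flagged = flag_anomalies(profile, candidates, threshold=-1.5)
```

In this toy setup the sequence that disagrees with the reference at every column scores far lower than the one that matches it, so only the former is flagged. A real screen would need a principled way to choose the cutoff, which this sketch leaves open.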

        • Published in

          SAC '17: Proceedings of the Symposium on Applied Computing
          April 2017
          2004 pages
          ISBN:9781450344869
          DOI:10.1145/3019612

          Copyright © 2017 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Acceptance Rates

          Overall Acceptance Rate: 1,650 of 6,669 submissions, 25%