ABSTRACT
The cytochrome c oxidase 1 (COI) gene is among the most popular markers for molecular biodiversity estimation. COI-based approaches for taxonomic identification rely on comprehensive reference databases to assign unknown sequences to known species and/or to enhance the identification of new species. For these methods to be effective, the accuracy and integrity of the reference databases are therefore critical. However, as COI repositories grow, it becomes difficult to manually curate and validate user-contributed data. Errors in the references then propagate into taxonomic predictions, which in turn reinforces the cycle. Here, we propose a new computationally efficient approach for identifying anomalies that are due either to systematic biases (indels and chimeras) or to user error (mistranslation and misclassification). Our approach uses COI reference alignments to model substitutions across the marker. The resulting model is subsequently used to screen for sequences that fit the model poorly. Analysis of the complete set of curated Insecta COI reference sequences identifies numerous anomalous sequences, which makes a strong case for the importance of new strategies to screen publicly available COI references.
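The abstract does not specify the form of the substitution model, so the following is only a minimal sketch of the general idea of profile-based screening, not the authors' implementation. It assumes a simple position-specific profile with independent columns, estimated from an aligned reference set, and flags query sequences whose mean per-column log-likelihood falls below a cutoff. The function names, the pseudocount, and the cutoff value are all hypothetical choices made for illustration.

```python
import math
from collections import Counter

ALPHABET = "ACGT-"  # assumption: nucleotides plus the alignment gap symbol


def build_profile(alignment, pseudocount=1.0):
    """Estimate per-column log-probabilities from aligned reference sequences.

    Illustrative independent-column profile; the pseudocount keeps unseen
    symbols from receiving zero probability.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment if seq[col] in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        profile.append(
            {s: math.log((counts[s] + pseudocount) / total) for s in ALPHABET}
        )
    return profile


def mean_log_likelihood(seq, profile):
    """Average per-column log-probability of an aligned sequence under the profile."""
    if len(seq) != len(profile):
        raise ValueError("sequence must be aligned to the profile length")
    # Symbols outside ALPHABET (e.g. 'N') get the column's lowest log-probability.
    return sum(col.get(s, min(col.values())) for s, col in zip(seq, profile)) / len(profile)


def flag_anomalies(queries, profile, cutoff):
    """Return the names of aligned query sequences scoring below the cutoff.

    `cutoff` is a hypothetical, user-chosen threshold on the mean log-likelihood.
    """
    return [name for name, seq in queries.items()
            if mean_log_likelihood(seq, profile) < cutoff]


# Toy data, invented purely for illustration.
references = ["ACGTAC", "ACGTAC", "ACGTAC", "ACGAAC"]
queries = {"typical": "ACGTAC", "divergent": "TTTTTT"}

profile = build_profile(references)
flagged = flag_anomalies(queries, profile, cutoff=-1.0)
print(flagged)
```

In practice the cutoff would have to be calibrated against the score distribution of the reference set itself. A whole-sequence average like this can also miss short local anomalies such as a chimeric segment, which a windowed or segment-based score would be better placed to catch.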
Index Terms
- A profile-based probabilistic approach for the detection of anomalies in the cytochrome C oxidase I amplicon sequences