ABSTRACT
A necessary step in many metagenomic studies is to determine which organisms are present in a sample. Knowledge of the similarity among the genomes of those organisms allows high-throughput sequencing reads to be mapped more accurately to the correct genome for expression quantification. This study investigates current metrics of genome similarity as they relate to cross mapping percentage, defined as the percentage of sequence reads from one organism that map to another organism's genome, and aims to establish a new metric for genome similarity that incorporates cross mapping percentage. Paired-end reads were generated using Artificial FASTQ Generator (AFG) for 10 organisms in two categories: host and pathogen. The reads were mapped to reference genomes with Bowtie2 and the cross mapping percentage was calculated. Cross mapping percentages were higher for organisms with a lower calculated genomic distance, which led to the conclusion that hosts and pathogens could easily be distinguished, while pathogens and other microbial genomes were harder to separate from one another. The genomes were aligned using MUMmer and an overall percent similarity between the sequences was determined. A metric for genome similarity was established by modifying the formulas used within DSMZ's Genome-to-Genome Distance Calculator (GGDC) to incorporate cross mapping percentages. Modifying the formulas did not change the trend present in the genomic distance values, which supports the conclusion that cross mapping percentage, distance calculated with the original formulas, and distance calculated with the new formulas are interchangeable. This work helps establish the resolution at which organisms in a sample can be distinguished using whole-genome sequence information, that is, how similar organisms can be and still be distinguished in a metagenomic study for the purposes of computing expression values.
These findings allow organisms in metagenomic studies to be better identified and expression to be quantified accurately in metatranscriptomic studies.
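As an illustration of the cross mapping percentage defined above, the following is a minimal Python sketch, not the authors' pipeline. It assumes the number of reads from a source organism that mapped to another organism's genome has already been counted from the aligner output; the function names, organism labels, and read counts are hypothetical and chosen only for the example.

```python
from typing import Dict, Tuple


def cross_mapping_percentage(reads_mapped_to_other: int, total_reads: int) -> float:
    """Percentage of one organism's reads that map to another organism's genome."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= reads_mapped_to_other <= total_reads:
        raise ValueError("mapped read count must lie between 0 and total_reads")
    return 100.0 * reads_mapped_to_other / total_reads


def cross_mapping_table(
    mapped_counts: Dict[Tuple[str, str], int],
    total_reads: Dict[str, int],
) -> Dict[Tuple[str, str], float]:
    """Compute the percentage for every (source organism, target genome) pair.

    mapped_counts[(source, target)] is the number of reads simulated from
    `source` that mapped to the reference genome of `target`.
    """
    return {
        (source, target): cross_mapping_percentage(count, total_reads[source])
        for (source, target), count in mapped_counts.items()
    }


# Hypothetical counts, for illustration only (not results from the study).
example_totals = {"pathogen_A": 1000, "host_H": 2000}
example_counts = {("pathogen_A", "host_H"): 10, ("host_H", "pathogen_A"): 4}
example_table = cross_mapping_table(example_counts, example_totals)
```

The percentage is normalised by the number of reads from the source organism, so the table is not necessarily symmetric: reads from organism A mapping to B and reads from B mapping to A are reported separately.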
- Frampton, Matthew, and Richard Houlston. "Generation of Artificial FASTQ Files to Evaluate the Performance of Next-Generation Sequencing Pipelines." PLoS ONE 7.11 (2012): Web. 29 July 2015.
- Kurtz, Stefan et al. "Versatile and Open Software for Comparing Large Genomes." Genome Biology 5.2 (2004): Web. 29 July 2015.
- Langmead B, Salzberg SL: Fast gapped-read alignment with Bowtie 2. Nat Methods 2012, 9:357-359.
- Meier-Kolthoff, J. P., Auch, A. F., Klenk, H.-P., Göker, M. Genome sequence-based species delimitation with confidence intervals and improved distance functions. BMC Bioinformatics 14:60, 2013.
- Monaco MK et al. (2014). Gramene 2013: comparative plant genomics resources. Nucleic Acids Res. 42 (D1): D1193-D1199. PMID:24217918. doi: 10.1093/nar/gkt1110.
- National Center for Biotechnology Information (NCBI). Web. http://www.ncbi.nlm.nih.gov/nuccore.
Index Terms
- Investigating genome similarity through cross mapping percentage