Elsevier

Digital Signal Processing

Volume 88, May 2019, Pages 90-100

BagGMM: Calling copy number variation by bagging multiple Gaussian mixture models from tumor and matched normal next-generation sequencing data

https://doi.org/10.1016/j.dsp.2019.01.025

Abstract

Copy number variations (CNVs) contribute significantly to human genomic variability, and some of them lead to disease. However, effective detection of CNVs from whole-genome next-generation sequencing (NGS) data remains challenging. Here, we present BagGMM, a new method to call CNVs using tumor-normal matched samples from NGS data. BagGMM extracts read depth ratios of tumor samples to normal samples, divides the genomic sequences into segments by sliding windows to compute the average coverage ratio of each segment, filters candidate deletions and duplications based on a coarse criterion of coverage ratio, and then builds a Gaussian Mixture Model (GMM) on the remaining ratios to identify the copy number states that are still ambiguous after filtration. Bagging multiple GMMs, instead of using a single GMM, reduces false positive calls and thus enhances the detection power of BagGMM. Considering the computation speed of GMMs and false positive calls, we employ a segmentation procedure, “large window and then small windows”, which also helps to determine the boundaries of CNV regions. We apply BagGMM to three simulation datasets and two groups of human whole genome sequencing (WGS) data, from breast cancer patients and ovarian cancer patients, to identify CNVs. All experiments demonstrate that BagGMM is capable of robustly identifying CNVs of different sizes and states. The performance of this tool is compared with that of four existing peer CNV detection methods. BagGMM shows a significant improvement in both sensitivity and specificity for detecting both copy number gains and losses.

Introduction

Copy number variation (CNV) is defined as a gain or a loss of genome sequences where the DNA copy number deviates from the normal copy number. CNV, as a type of structural variation that plays an important role in human disease, is of widespread concern [1]. Biomedical research has estimated that a substantial proportion of the human genome consists of CNVs, and more than 1000 CNV regions with a frequency above 1% have been identified in the genome [2]. Most CNVs have been reported to contribute substantially to phenotypic variation and common diseases by disrupting genes, altering gene dosage, and perturbing expression levels [3], [4]. Some CNVs encompassing genes have also been demonstrated to foster activation of oncogenes or inactivation of tumor suppressors in cancer cells [5], [6]. Therefore, accurate inference of CNVs is of particular interest and importance for identifying disease-causing regulatory variants and genes.

With the breakthrough of high-throughput platforms, next-generation sequencing (NGS) technology has been widely applied to measure the relative copy number between patients and normal counterparts over a set of genomic regions, achieving both high resolution and high accuracy [7]. As a result, whole-genome sequencing (WGS) [8], whole-exome sequencing (WES) [9], and targeted capture sequencing have become the primary NGS strategies for CNV detection and for studying the relationship between these variations and human diseases [10]. Many CNV detection tools have been developed for each strategy. CNV detection tools based on targeted capture sequencing are designed around the capture, which is relatively efficient for CNV detection in novel regions. However, the performance of this kind of tool is currently limited by the capture design. In contrast, CNV detection tools based on WGS are more effective for detecting CNVs in novel regions than tools based on targeted capture sequencing, in spite of their higher cost. In this work, all analyses of CNVs are based on WGS data.

Existing methods for the detection of CNVs from NGS data have been reported for whole genome sequences [11], [12]. In general, these can be grouped into four categories: (1) read depth (RD), (2) paired-end, (3) split-read, and (4) assembly. The underlying assumption of RD-based methods is that the number of sequencing reads aligned to a location of the genome is proportional to the number of copies at that location. The copy number of a region is therefore estimated by counting the number of reads aligned to it. Under this assumption, RD-based methods identify somatic CNVs by comparing the RD in particular genomic regions between case and matched control samples. Methods in this category include mHMM [13], SeqCNV [2], Control-FREEC [14], and XCAVATOR [3]. In paired-end mapping methods, insertions or deletions are detected by comparing the distance between mapped end pairs on a reference genome with the known insert size [15]. This type of approach is mostly used for identifying other types of structural variation (beyond CNVs), such as inversions and translocations. Split-read based methods detect CNVs by finding and evaluating gaps in sequence alignment. In the assembly approach, short reads are used to assemble genomic regions by connecting overlapping short reads or contigs, and CNVs are detected by comparing the assembled contigs to the reference genome. Compared with the other CNV detection methods mentioned above, RD-based methods achieve higher precision at a lower cost [16], [17]. For this reason, RD-based methods have recently become a major approach for the identification of CNVs.

RD-based methods for the detection of CNVs are based on the assumption that the sequencing process is uniform. Consequently, the number of reads mapped to a region (the total coverage of a region) is expected to be proportional to the number of times that the region appears in the genome [18]. Following this idea, the absolute copy number can be inferred from the ratio of read counts in diseased samples to those in matched normal samples.
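The proportionality assumption above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `window_ratios` and `estimate_copy_number`, the use of per-base depth lists, the diploid baseline (`ploidy=2`), and the absence of any library-size, GC, or tumor-purity correction are all simplifying assumptions made for illustration.

```python
def window_ratios(tumor_depth, normal_depth, window):
    """Tumor/normal read-depth ratio for each non-overlapping window.

    tumor_depth and normal_depth are per-position depths of equal length
    (a hypothetical input format). Assumes both samples were sequenced to
    a comparable depth, so no library-size normalization is applied.
    """
    ratios = []
    for start in range(0, len(tumor_depth) - window + 1, window):
        t = sum(tumor_depth[start:start + window])
        n = sum(normal_depth[start:start + window])
        # A window with no coverage in the normal sample has an undefined
        # ratio; it is recorded as None rather than dropped.
        ratios.append(t / n if n > 0 else None)
    return ratios


def estimate_copy_number(ratio, ploidy=2):
    """Nearest integer copy number implied by a tumor/normal ratio.

    Assumes the matched normal carries `ploidy` copies of the region.
    """
    return int(ratio * ploidy + 0.5)


# Toy example: a normal region, a single-copy loss, and a single-copy gain.
tumor = [10] * 4 + [5] * 4 + [15] * 4
normal = [10] * 12
ratios = window_ratios(tumor, normal, window=4)
copies = [estimate_copy_number(r) for r in ratios]
```

In this toy input the three windows have ratios 1.0, 0.5, and 1.5, which map to copy numbers 2, 1, and 3 under the diploid assumption. Real data are noisier, which is why a direct rounding rule like this is usually replaced by a statistical model.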

Although the aforementioned RD-based methods are capable of detecting CNVs with high accuracy, many studies have shown that each RD-based approach detects variation events with specific structural characteristics [19], [20]. Each method has its advantages; however, its detection power and sensitivity are still limited by low coverage data. Control-FREEC segments and normalizes read counts, and then provides an estimate of polymerization to detect CNV break points. Nevertheless, Control-FREEC is more sensitive to copy gains than to copy losses, as inferred from the results of the experiments that follow. mHMM uses a K-means method to form windows by joining adjacent genomic sites and then establishes a Hidden Markov Model to discriminate copy number states; it detects CNVs with high sensitivity but also with a high false positive rate. SeqCNV builds a statistical method that uses maximum penalized likelihood estimation to evaluate CNV boundaries and copy number ratios. However, it reports only the type of copy number change (gain or loss) and not the number of copies in a CNV event. The detection power and sensitivity of SeqCNV are also limited by low coverage depth data. XCAVATOR builds a normalization procedure for read counts using their statistical properties and biases, and combines the SLM and FastCall algorithms to detect CNV regions. Although it can discriminate five discrete copy number states (two- and single-copy deletions, neutral, three- and multiple-copy duplications), XCAVATOR detects duplications with lower sensitivity than deletions. When the variance of RD signals is large, XCAVATOR lacks sensitivity for detecting signal shifts.

Currently available tools, except for XCAVATOR, classify only three states (deletion, normal, and amplification) and cannot discriminate homozygous deletions (two-copy deletions) from heterozygous deletions (single-copy deletions). This limits the use of sequencing data for predicting exact DNA copy numbers.

To overcome this limitation and discriminate homozygous deletions from heterozygous deletions with high specificity and recall, we have developed BagGMM, an analytic toolset that uses bagging of multiple Gaussian mixture models to detect CNVs from a pair of case–control genome sequencing data. In this method, we first divide genomic sequences into segments by non-overlapping sliding windows according to genomic coordinates. The underlying assumption is that genomic loci within a segment share the same copy number state. For each segment, we divide its read count in the case sample by that in the control sample to obtain the coverage ratio of that segment. Here, we make a Gaussian assumption: all coverage ratio values belonging to a given copy number state (e.g., homozygous deletion, heterozygous deletion, normal, or duplication) in a genome follow a Gaussian distribution. For example, if there are three copy number states, the coverage ratios will cluster into three different distributions. Therefore, the coverage ratios of a genome with several copy number states follow a mixture of Gaussian distributions. Specifically, we employ a segmentation procedure, “large windows with small windows”, to divide genomic sequences into segments with a focus on reducing the computing time and determining CNV boundaries. Then, we assign segments to distinct copy number states via a coarse criterion of coverage ratio and set aside the segments whose copy number state is ambiguous. As a further step, based on the above-mentioned assumption, we build a Gaussian Mixture Model (GMM) for the coverage ratios of the remaining ambiguous segments to identify CNVs, in which different Gaussian kernels represent different copy number distributions. However, a single GMM produces many potential false positive events. To solve this problem, we employ a bagging idea: multiple GMMs are built by repeatedly re-sampling the coverage ratios.
In summary, BagGMM is suited for the detection of CNVs with high specificity. The code implementing our method is publicly available and can be accessed at https://github.com/tudui123/BagGMM.
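The bagging idea described above can be sketched in a few lines. This is an illustrative approximation under stated assumptions, not the published BagGMM code: the hand-written one-dimensional EM fit, the initial means in `INIT_MEANS`, the number of bootstrap models, the majority-vote aggregation, and the rule that labels a component by the copy number nearest to `ploidy * mean` are all choices made here for the example, and every function name is hypothetical.

```python
import math
import random
from collections import Counter

# Assumed expected tumor/normal ratios for copy numbers 0..3 (diploid normal).
INIT_MEANS = [0.0, 0.5, 1.0, 1.5]


def responsibilities(x, weights, means, variances):
    """Posterior probability of each Gaussian component for one ratio x."""
    dens = [w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
            for w, m, v in zip(weights, means, variances)]
    total = sum(dens)
    if total == 0.0:  # x is far from every component: use the nearest mean
        nearest = min(range(len(means)), key=lambda j: abs(x - means[j]))
        return [1.0 if j == nearest else 0.0 for j in range(len(means))]
    return [d / total for d in dens]


def fit_gmm_1d(xs, init_means, n_iter=30, min_var=1e-4):
    """Fit a 1-D Gaussian mixture by EM, starting from the given means."""
    k = len(init_means)
    means, variances, weights = list(init_means), [0.01] * k, [1.0 / k] * k
    for _ in range(n_iter):
        resp = [responsibilities(x, weights, means, variances) for x in xs]
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-9:
                continue  # effectively empty component: keep old parameters
            weights[j] = nj / len(xs)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, min_var)  # floor avoids collapse to a point
    return weights, means, variances


def bagged_gmm_states(ratios, n_models=15, ploidy=2, seed=0):
    """Majority-vote copy number per ratio over GMMs fit to bootstrap samples."""
    rng = random.Random(seed)
    votes = [Counter() for _ in ratios]
    for _ in range(n_models):
        sample = rng.choices(ratios, k=len(ratios))  # resample with replacement
        weights, means, variances = fit_gmm_1d(sample, INIT_MEANS)
        for i, x in enumerate(ratios):
            resp = responsibilities(x, weights, means, variances)
            best = max(range(len(resp)), key=lambda j: resp[j])
            # Label the winning component by the copy number nearest its mean.
            votes[i][int(ploidy * means[best] + 0.5)] += 1
    return [v.most_common(1)[0][0] for v in votes]


# Toy ratios clustered around 0.5, 1.0, and 1.5 (ten values per cluster).
offsets = [-0.04, -0.02, 0.0, 0.02, 0.04] * 2
toy_ratios = [c + d for c in (0.5, 1.0, 1.5) for d in offsets]
toy_states = bagged_gmm_states(toy_ratios)
```

Voting across bootstrap fits means that a call made by only a few of the resampled models is outvoted, which is one plausible way bagging could suppress false positives. How BagGMM actually aggregates its models is not stated in this section.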

To demonstrate the power of our method, we use BagGMM to analyze simulation datasets and real WGS datasets and compare its performance with that of four other state-of-the-art tools. All analyses demonstrate that our computational pipeline is capable of detecting CNVs in WGS data and outperforms all other compared tools.

Overview

Our method uses a read depth approach with a matched tumor-normal sample pair in every analytical run to correct for coverage depth variability due to differences in GC content, repeat elements, poor mapping arising from high local SNP densities, and total sequencing, resulting in a low number of false-positive calls. Fig. 1 shows an overview of our method, which proceeds in steps: (i) data preprocessing; (ii) large windows with small windows; (iii) filtration and detection; (iv) GMM detection on

Results

In this section, we evaluate the proposed BagGMM on both simulation datasets and real datasets separately, and compare its performance with that of other state-of-the-art peer methods: Control-FREEC, mHMM, SeqCNV, and XCAVATOR. The real datasets consist of six matched-pair breast cancer whole genomes and two matched ovarian cancer whole genome samples, downloaded from the European Genome-Phenome Archive (https://www.ebi.ac.uk/).

We preprocess sequencing reads obtained from a tumor genome and its

Discussion and conclusion

In this paper, we present a tool named BagGMM that can be used to comprehensively normalize sequencing coverage in large-scale genome sequencing and discover CNVs using this rich information. We demonstrate that BagGMM has high specificity along with high sensitivity to reliable known calls. BagGMM also permits the high-resolution discovery of partial gene disruptions, a form of structural variation potentially involved in disease pathology [31], [32], a possible burden of gene-disrupting

Conflict of interest statement

All authors declare that they have no conflict of interest.

Acknowledgements

We thank the joint Editor and referees for their thoughtful comments, which greatly improved the presentation of the paper. This work is supported by the Natural Science Foundation of China (No. 61571341), the Natural Science Foundation of Shaanxi Province in China (No. 2017JM6036), the Research Projects of Weinan Normal University (No. 18YKF04), and the Projects of Integration Research of Weinan Normal University (No. 18JMR07).

Yaoyao Li received a bachelor's degree in Management from Harbin Medical University in 2015. She is a third-year MD-PhD student in the School of Computer Science and Technology, Xidian University, China. Her research focuses on detecting copy number variations and other bio-models from next-generation sequencing data using statistics and machine learning algorithms.

References (41)

  • R. Tan et al., An evaluation of copy number variation detection tools from whole-exome sequencing data, Human Mutat. (2014)
  • Z. Yu et al., CLImAT: accurate detection of copy number alteration and loss of heterozygosity in impure and aneuploid tumor samples using whole-genome sequencing data, Sci. Found. China (2015)
  • Y. Zhang et al., DeAnnCNV: a tool for online detection and annotation of copy number variations from whole-exome sequencing data, Nucleic Acids Res. (2015)
  • L.F. Johansson et al., CoNVaDING: single exon variation detection in targeted NGS data, Human Mutat. (2016)
  • D.Y. Chiang et al., High-resolution mapping of copy-number alterations with massively parallel sequencing, Nat. Methods (2009)
  • C. Xie et al., CNV-seq, a new method to detect copy number variation using high-throughput sequencing, BMC Bioinform. (2009)
  • V. Boeva et al., Control-FREEC: a tool for assessing copy number and allelic content using next-generation sequencing data, Bioinformatics (2011)
  • J.O. Korbel et al., Paired-end mapping reveals extensive structural variation in the human genome, Science (2007)
  • A. Abyzov et al., CNVnator: an approach to discover, genotype and characterize typical and atypical CNVs from family and population genome sequencing, Genome Res. (2011)
  • E. Bellos et al., cnvHiTSeq: integrative models for high-resolution copy number variation detection and genotyping using population sequencing data, Genome Biol. (2012)
Junying Zhang, Ph.D., is a Professor and academic leader at Xi'an University of Electronic Science and Technology. She is an IEEE member, a senior member of the China Electronics Society, and a senior member of the China Computer Society. She serves as an emergency management expert of Shaanxi Province, an evaluation expert for overseas study in Shaanxi Province, a national study fund project evaluation expert, a National Natural Science Foundation project evaluation expert, a Shaanxi Provincial Education Department Nature Scientific Fund project review expert, a Ningbo Science and Technology Plan project review expert, and a Beijing Natural Science Foundation project review expert, among other roles. She is also an expert reviewer for Chinese Science, Automation Journal, Electronic Journal, Neurocomputing, Digital Signal Processing, BMC Bioinformatics, ACM Computing Surveys and other publications.

Her papers have been published in “J.VLSI SP”, “Computational Intelligence and Neuroscience”, “Information Technology Journal”, “IEEE Transactions on Information Technology in Biomedicine”, “IEEE Transactions on Nuclear Science”, “IEEE Transactions on NanoBioscience”, “Progress in Natural Science”, “Science in China”, “Journal of Electronics”, “Journal of Automation”, “Journal of Communications”, “Journal of Physics”, “Control and Decision”, “Photoelectron and Laser”, “Applied Intelligence” and at a number of important academic conferences.

Xiguo Yuan is an Associate Professor in the School of Computer Science and Technology, Xidian University. He received a Ph.D. in Computer Application Technology from Xidian University in June 2011, with a doctoral thesis titled “Simulation of Genomic Variation and Identification of Genome Models”. From 2009 to 2010, he was jointly trained by Xidian University and the CBIL laboratory of Virginia Polytechnic University.

He is interested in analyzing biomolecular data with computational methods to reveal their underlying meaning. Specifically, he designs and combines machine learning algorithms, probabilistic methods, and statistical tests to detect or identify variant sites or fragments in the genome and to discover patterns with biological functions. His research group focuses on the analysis of single nucleotide polymorphism, copy number variation, methylation and other data. Related papers have been published in international journals including “BMC Genomics”, “Journal of Computational Biology”, and “PLoS ONE”.
