Abstract
To address parameter optimization and the insufficient use of split reads in copy number variation (CNV) detection, this paper proposes a new definition of relative read depth (RRD) and a randomized sampling strategy (RGN). Compared with raw read depth, RRD is only weakly correlated with GC content, mappability, and the width of the analysis windows tiled along the genome. The RGN strategy is based on weighted sampling, which speeds up the analysis of read count data. We then propose an improved CNV detection algorithm based on a hidden Markov model (CNV-HMM). The HMM detects abnormal signals in the read count data and outputs candidate CNVs. As a final step, the candidate CNVs are filtered using split reads to improve the performance of CNV-HMM. Experimental results show that CNV-HMM has higher sensitivity and accuracy for CNV detection than most current detection algorithms and is applicable to both diploid animals and plants.
Cite this article
Yang, H., Zhu, D. Improved detection algorithm for copy number variations based on hidden Markov model. Multimed Tools Appl 79, 9237–9253 (2020). https://doi.org/10.1007/s11042-019-7368-z
DOI: https://doi.org/10.1007/s11042-019-7368-z