Abstract
Genotyping structural variations in the presence of copy number variations (CNVs) is a challenging problem that is still in its infancy. CNVs, a prevalent and critical form of genetic variation that causes abnormal copy numbers of large genomic regions in cells, often affect transcription and contribute to a variety of diseases. The characteristics of CNVs make existing genotyping features and algorithms ambiguous, so heterozygous variations may be erroneously genotyped as homozygous, which seriously affects the accuracy of downstream analysis. As the allelic copy number increases, the genotyping error rate rises sharply. Instances with some copy numbers play an auxiliary role in the genotyping classification problem, whereas others seriously interfere with the accuracy of the model. Motivated by these observations, we propose a transfer learning-based method that genotypes structural variations accurately while accounting for CNVs. The method first divides the instances by allelic copy number and trains the basic machine learning framework on the different genotype datasets. It maximizes the weights of the instances that contribute to classification and minimizes the weights of the instances that hinder correct genotyping. By adjusting the weights of instances with different allelic copy numbers, the method maximizes the contribution of all instances to genotyping and minimizes the genotyping errors of heterozygous variations caused by CNVs. We applied the proposed method to both simulated and real datasets and compared it with popular algorithms including GATK, Facets and Gindel. The experimental results demonstrate that the proposed method outperforms the others in terms of accuracy, stability and efficiency. The source code has been uploaded to github/TrinaZ/CNVtransfer for academic use only.
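The abstract does not spell out the weighting rule, so the following is only a minimal sketch of the general idea, assuming a TrAdaBoost-style update in the spirit of the boosting-for-transfer-learning work listed in the references. The function name `tradaboost_step`, its parameters, and the split into a "target" copy-number group and "auxiliary" groups are hypothetical and are not taken from the authors' code. Misgenotyped auxiliary instances lose weight, while misgenotyped target instances gain weight.

```python
import math


def tradaboost_step(weights, is_target, wrong, n_rounds):
    """One TrAdaBoost-style reweighting step (illustrative assumption only).

    weights   : current non-negative instance weights
    is_target : True for instances in the copy-number group being genotyped,
                False for auxiliary instances with other allelic copy numbers
    wrong     : True where the current base classifier misgenotyped the instance
    n_rounds  : total number of boosting rounds planned
    """
    # Guard against log(0) when no auxiliary instances are present.
    n_source = max(sum(1 for t in is_target if not t), 1)

    # Fixed shrink factor applied to misclassified auxiliary instances.
    beta_src = 1.0 / (1.0 + math.sqrt(2.0 * math.log(n_source) / n_rounds))

    # Weighted error is measured on the target group only.
    target_total = sum(w for w, t in zip(weights, is_target) if t)
    target_wrong = sum(
        w for w, t, e in zip(weights, is_target, wrong) if t and e
    )
    eps = target_wrong / target_total
    # Clamp so the update stays finite and the error stays below 1/2.
    eps = min(max(eps, 1e-10), 0.499)
    beta_tgt = eps / (1.0 - eps)

    updated = []
    for w, t, e in zip(weights, is_target, wrong):
        if not e:
            updated.append(w)             # correct: weight unchanged
        elif t:
            updated.append(w / beta_tgt)  # hard target instance: boost
        else:
            updated.append(w * beta_src)  # misleading auxiliary: shrink
    total = sum(updated)
    return [w / total for w in updated]


# Toy example: two auxiliary and four target instances, uniform start.
start = [1.0 / 6] * 6
groups = [False, False, True, True, True, True]
errors = [True, False, True, False, False, False]
new_weights = tradaboost_step(start, groups, errors, n_rounds=10)
```

In this toy run the misgenotyped auxiliary instance ends up with less weight than a correctly genotyped one, and the misgenotyped target instance ends up with more, which is the qualitative behaviour the abstract describes. A base classifier would be retrained on the new weights in each round.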
References
Lu X, Chen X, Forney C, Donmez O, Miller D, Parameswaran S, Hong T, Huang Y, Pujato M, Cazares T, Miraldi E R, Ray J P, De Boer C G, Harley J B, Weirauch M T, Kottyan L C. Global discovery of lupus genetic risk variant allelic enhancer activity. Nature Communications, 2021, 12(1): 1611
Alkan C, Coe B P, Eichler E E. Genome structural variation discovery and genotyping. Nature Reviews Genetics, 2011, 12(5): 363–376
Zhang Z, Cheng H, Hong X, Di Narzo A F, Franzen O, Peng S, Ruusalepp A, Kovacic J C, Bjorkegren J L M, Wang X, Hao K. EnsembleCNV: an ensemble machine learning algorithm to identify and genotype copy number variation using SNP array data. Nucleic Acids Research, 2019, 47(7): e39
Zhao M, Wang Q, Wang Q, Jia P, Zhao Z. Computational tools for copy number variation (CNV) detection using next-generation sequencing data: features and perspectives. BMC Bioinformatics, 2013, 14(S11): S1
Zhang C, Cai H, Huang J, Song Y. nbCNV: a multi-constrained optimization model for discovering copy number variants in single-cell sequencing data. BMC Bioinformatics, 2016, 17: 384
Iranmanesh S M, Guo N L. Integrated DNA copy number and gene expression regulatory network analysis of non-small cell lung cancer metastasis. Cancer Informatics, 2014, 13(S5): 13–23
Conrad D F, Pinto D, Redon R, Feuk L, Gokcumen O, et al. Origins and functional impact of copy number variation in the human genome. Nature, 2010, 464(7289): 704–712
Chiang C, Scott A J, Davis J R, Tsang E K, Li X, Kim Y, Hadzic T, Damani F N, Ganel L, GTEx Consortium, Montgomery S B, Battle A, Conrad D F, Hall I M. The impact of structural variation on human gene expression. Nature Genetics, 2017, 49(5): 692–699
Chen P, Huang W, Shao W, Cai H. Discrimination of recurrent CNVs from individual ones from multisample aCGH by jointly constrained minimization. In: Proceedings of the 6th ACM Conference on Bioinformatics, Computational Biology and Health Informatics. 2015, 186–193
Xu B, Cai H, Zhang C, Yang X, Han G. Copy number variants calling for single cell sequencing data by multi-constrained optimization. Computational Biology and Chemistry, 2016, 63: 15–20
Lu C, Xie M, Wendl M C, Wang J, McLellan M D, et al. Patterns and functional implications of rare germline variants across 12 cancer types. Nature Communications, 2015, 6: 10086
Freed D, Aldana R, Weber J A, Edwards J S. The Sentieon genomics tools - a fast and accurate solution to variant calling from next-generation sequence data. bioRxiv, 2017, DOI: 10.1101/115717
Chu C, Zhang J, Wu Y. GINDEL: accurate genotype calling of insertions and deletions from low coverage population sequence reads. PLoS One, 2014, 9(11): e113324
Sudmant P, Rausch T, Gardner E J, Handsaker R E, Abyzov A, et al. An integrated map of structural variation in 2,504 human genomes. Nature, 2015, 526(7571): 75–81
Liaw A, Wiener M. Classification and regression by randomForest. R News, 2002, 2(3): 18–22
Nørgaard M, Ravn O, Poulsen N K, Hansen L K. Neural Networks for Modeling and Control of Dynamic Systems: A Practitioner’s Handbook. London: Springer, 2000, 246
Chang C C, Lin C J. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2011, 2(3): 1–27
Breiman L, Friedman J H, Olshen R A, Stone C J. Classification and regression trees (CART). Biometrics, 1984, 40(3): 358
Kohavi R, John G H. Wrappers for feature subset selection. Artificial Intelligence, 1997, 97(1–2): 273–324
Dai W, Yang Q, Xue G R, Yu Y. Boosting for transfer learning. In: Proceedings of the 24th International Conference on Machine Learning. 2007, 193–200
Shen R, Seshan V E. FACETS: allele-specific copy number and clonal heterogeneity analysis tool for high-throughput DNA sequencing. Nucleic Acids Research, 2016, 44(16): e131
Auton A, Abecasis G R, Altshuler D M, Durbin R M, Abecasis G R, et al. A global reference for human genetic variation. Nature, 2015, 526(7571): 68–74
Cao D S, Liang Y Z, Xu Q S, Zhang L X, Hu Q N, Li H D. Feature importance sampling-based adaptive random forest as a useful tool to screen underlying lead compounds. Journal of Chemometrics, 2011, 25(4): 201–207
Acknowledgements
The work was supported by the National Natural Science Foundation of China (Grant No. 31701150) and the Fundamental Research Funds for the Central Universities (CXTD2017003).
Supporting information
The supporting information is available online at journal.hep.com.cn and link.springer.com.
Tian Zheng is currently a PhD candidate in the School of Computer Science and Technology, Xi’an Jiaotong University, China. Her current research interests include bioinformatics, machine learning and data mining.
Xinyang Qian is studying for his master’s degree in the School of Computer Science and Technology, Xi’an Jiaotong University, China. His current research interests include bioinformatics and machine learning.
Jiayin Wang is a Professor in the School of Computer Science and Technology, Xi’an Jiaotong University, China. His current research interests include cancer genomics and bioinformatics.
Cite this article
Zheng, T., Qian, X. & Wang, J. A structural variation genotyping algorithm enhanced by CNV quantitative transfer. Front. Comput. Sci. 16, 166905 (2022). https://doi.org/10.1007/s11704-021-1177-z
DOI: https://doi.org/10.1007/s11704-021-1177-z