A structural variation genotyping algorithm enhanced by CNV quantitative transfer

Zheng, Tian; Qian, Xinyang; Wang, Jiayin

doi:10.1007/s11704-021-1177-z

Research Article
Published: 02 April 2022

Volume 16, article number 166905 (2022)

Frontiers of Computer Science

Tian Zheng^1,2,
Xinyang Qian^1,2 &
Jiayin Wang^1,2

Abstract

Genotyping structural variations in the presence of copy number variations (CNVs) is a challenging problem that is still in its infancy. CNVs, a prevalent and critical form of genetic variation that causes abnormal copy numbers of large genomic regions in cells, often affect transcription and contribute to a variety of diseases. The characteristics of CNVs often make existing genotyping features and algorithms ambiguous, which may cause heterozygous variations to be erroneously genotyped as homozygous and seriously affect the accuracy of downstream analysis. As the allelic copy number increases, the genotyping error rate rises sharply. Some instances with different copy numbers play an auxiliary role in the genotyping classification problem, while others seriously interfere with the accuracy of the model. Motivated by these observations, we propose a transfer learning-based method to genotype structural variations accurately in the presence of CNVs. The method first partitions the instances by allelic copy number and trains the basic machine learning framework on the different genotype datasets. It then maximizes the weights of the instances that contribute to classification and minimizes the weights of those that hinder correct genotyping. By adjusting the weights of the instances with different allelic copy numbers, the contribution of all instances to genotyping can be maximized, and the genotyping errors of heterozygous variations caused by CNVs can be minimized. We applied the proposed method to both simulated and real datasets and compared it with several popular algorithms, including GATK, Facets and Gindel. The experimental results demonstrate that the proposed method outperforms the others in terms of accuracy, stability and efficiency. The source code has been uploaded to github/TrinaZ/CNVtransfer for academic use only.
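
The instance-weighting idea in the abstract can be illustrated with a minimal sketch. It assumes a TrAdaBoost-style update, which may differ from the authors' actual procedure, and it uses a single hypothetical feature (alternate-allele read fraction), a decision stump as the weak learner, and made-up data values. Here "source" stands for instances with other allelic copy numbers and "target" for instances with the copy number being genotyped; the function names fit_stump and reweight are illustrative and are not taken from the CNVtransfer repository.

```python
import math

def fit_stump(xs, ys, ws):
    """Weighted decision stump on one feature: predict 1 if polarity * (x - thr) > 0."""
    best = None
    for thr in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, ws)
                      if (1 if polarity * (x - thr) > 0 else 0) != y)
            if best is None or err < best[0]:
                best = (err, thr, polarity)
    _, thr, polarity = best
    return lambda x: 1 if polarity * (x - thr) > 0 else 0

def reweight(source, target, rounds=5):
    """TrAdaBoost-style reweighting (an assumption, not the paper's exact method).

    source, target: lists of (feature, label), label 0 = heterozygous, 1 = homozygous.
    Returns the final (source_weights, target_weights).
    """
    n, m = len(source), len(target)
    data = source + target
    xs = [x for x, _ in data]
    ys = [y for _, y in data]
    w = [1.0] * (n + m)
    # Fixed shrink factor applied to source instances the weak learner gets wrong.
    beta_src = 1.0 / (1.0 + math.sqrt(2.0 * math.log(n) / rounds))
    for _ in range(rounds):
        total = sum(w)
        p = [wi / total for wi in w]
        h = fit_stump(xs, ys, p)
        miss = [abs(h(x) - y) for x, y in data]
        # Weighted error measured on the target instances only.
        eps = sum(w[i] * miss[i] for i in range(n, n + m)) / sum(w[n:])
        eps = min(max(eps, 1e-6), 0.499)  # keep beta_tgt finite and below 1
        beta_tgt = eps / (1.0 - eps)
        for i in range(n):
            w[i] *= beta_src ** miss[i]      # down-weight misleading source instances
        for i in range(n, n + m):
            w[i] *= beta_tgt ** (-miss[i])   # up-weight hard target instances
    return w[:n], w[n:]

# Made-up illustrative data: the last source instance (0.80, homozygous)
# conflicts with the target set, where fractions near 0.8 are heterozygous.
source = [(0.50, 0), (0.48, 0), (0.98, 1), (1.00, 1), (0.80, 1)]
target = [(0.74, 0), (0.78, 0), (0.82, 0), (0.97, 1), (0.99, 1)]

src_w, tgt_w = reweight(source, target)
print("source weights:", [round(v, 3) for v in src_w])
print("target weights:", [round(v, 3) for v in tgt_w])
```

On this toy data the conflicting source instance ends with the smallest source weight, while the source instances that agree with the target keep their initial weight. This mirrors the abstract's description of shrinking the influence of instances that hinder correct genotyping.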

Acknowledgements

The work was supported by the National Natural Science Foundation of China (Grant No. 31701150) and the Fundamental Research Funds for the Central Universities (CXTD2017003).

Author information

Authors and Affiliations

Department of Computer Science and Technology, School of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an, 710049, China
Tian Zheng, Xinyang Qian & Jiayin Wang
Institute of Data Science and Information Quality, Shaanxi Engineering Research Center of Medical and Health Big Data, Xi’an Jiaotong University, Xi’an, 710049, China
Tian Zheng, Xinyang Qian & Jiayin Wang

Corresponding author

Correspondence to Jiayin Wang.

Additional information

Supporting information

The supporting information is available online at journal.hep.com.cn and link.springer.com.

Tian Zheng is currently a PhD candidate in the School of Computer Science and Technology, Xi’an Jiaotong University, China. Her current research interests include bioinformatics, machine learning and data mining.

Xinyang Qian is studying for his master’s degree in the School of Computer Science and Technology, Xi’an Jiaotong University, China. His current research interests include bioinformatics and machine learning.

Jiayin Wang is a Professor in the School of Computer Science and Technology, Xi’an Jiaotong University, China. His current research interests include cancer genomics and bioinformatics.
