A structural variation genotyping algorithm enhanced by CNV quantitative transfer

  • Research Article
  • Frontiers of Computer Science

Abstract

Genotyping structural variations in the presence of copy number variations (CNVs) is a challenging problem that is still in its infancy. CNVs, a prevalent and critical form of genetic variation that causes abnormal copy numbers of large genomic regions in cells, often affect transcription and contribute to a variety of diseases. The characteristics of CNVs make existing genotyping features and algorithms ambiguous, so heterozygous variations may be erroneously genotyped as homozygous, which seriously affects the accuracy of downstream analysis. As the allelic copy number increases, the genotyping error rate rises sharply. Some instances with different copy numbers play an auxiliary role in the genotyping classification problem, while others seriously interfere with the accuracy of the model. Motivated by these observations, we propose a transfer learning-based method that accurately genotypes structural variations while accounting for CNVs. The method first partitions the instances by allelic copy number and trains the basic machine learning framework on the different genotype datasets. It maximizes the weights of the instances that contribute to classification and minimizes the weights of the instances that hinder correct genotyping. By adjusting the weights of the instances with different allelic copy numbers, the contribution of all instances to genotyping can be maximized, and the genotyping errors of heterozygous variations caused by CNVs can be minimized. We applied the proposed method to both simulated and real datasets and compared it with several popular algorithms, including GATK, FACETS and GINDEL. The experimental results demonstrate that the proposed method outperforms the others in terms of accuracy, stability and efficiency. The source code has been uploaded to github/TrinaZ/CNVtransfer for academic use only.
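The abstract does not specify the exact weight-update rule, so the following is only a minimal sketch of instance reweighting in the style of TrAdaBoost (Dai et al., 2007), which may differ from the authors' implementation. It assumes that "target" instances share the allelic copy number being genotyped and that "source" instances come from other copy numbers; the function name `tradaboost_reweight` and all inputs are hypothetical.

```python
import math


def tradaboost_reweight(weights, is_target, misclassified, n_rounds):
    """One TrAdaBoost-style weight update (illustrative sketch, not the paper's code).

    weights       -- current instance weights
    is_target     -- True for target instances, False for auxiliary (source) ones
    misclassified -- True where the current base learner predicted the wrong genotype
    n_rounds      -- total number of boosting rounds planned
    """
    n_source = sum(1 for t in is_target if not t)
    if n_source < 1 or n_source == len(weights):
        raise ValueError("need at least one source and one target instance")

    # Fixed factor (< 1 when n_source > 1) that shrinks misclassified source instances.
    beta_src = 1.0 / (1.0 + math.sqrt(2.0 * math.log(n_source) / n_rounds))

    # Weighted error of the base learner, measured on target instances only.
    target_total = sum(w for w, t in zip(weights, is_target) if t)
    target_err = sum(
        w for w, t, m in zip(weights, is_target, misclassified) if t and m
    ) / target_total
    # Clamp so the ratio below stays finite and below 1 (assumption of this sketch).
    target_err = min(max(target_err, 1e-10), 0.499)
    beta_tgt = target_err / (1.0 - target_err)

    updated = []
    for w, t, m in zip(weights, is_target, misclassified):
        if not m:
            updated.append(w)             # correctly classified: unchanged
        elif t:
            updated.append(w / beta_tgt)  # hard target instance: weight grows
        else:
            updated.append(w * beta_src)  # misleading source instance: weight shrinks
    total = sum(updated)
    return [w / total for w in updated]


# Toy example: two source and three target instances with uniform weights.
new_weights = tradaboost_reweight(
    weights=[0.2] * 5,
    is_target=[False, False, True, True, True],
    misclassified=[True, False, True, False, False],
    n_rounds=10,
)
```

In this toy run the misclassified source instance ends up with less weight than the correctly classified instances, while the misclassified target instance ends up with more, which mirrors the idea of down-weighting instances that hinder genotyping and emphasizing those that matter for the target copy number.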

Acknowledgements

The work was supported by the National Natural Science Foundation of China (Grant No. 31701150) and the Fundamental Research Funds for the Central Universities (CXTD2017003).

Author information

Corresponding author

Correspondence to Jiayin Wang.

Additional information

Supporting information

The supporting information is available online at journal.hep.com.cn and link.springer.com.

Tian Zheng is currently a PhD candidate in the School of Computer Science and Technology, Xi’an Jiaotong University, China. Her current research interests include bioinformatics, machine learning and data mining.

Xinyang Qian is studying for his master’s degree in the School of Computer Science and Technology, Xi’an Jiaotong University, China. His current research interests include bioinformatics and machine learning.

Jiayin Wang is a Professor in the School of Computer Science and Technology, Xi’an Jiaotong University, China. His current research interests include cancer genomics and bioinformatics.

About this article

Cite this article

Zheng, T., Qian, X. & Wang, J. A structural variation genotyping algorithm enhanced by CNV quantitative transfer. Front. Comput. Sci. 16, 166905 (2022). https://doi.org/10.1007/s11704-021-1177-z

  • DOI: https://doi.org/10.1007/s11704-021-1177-z
