ABSTRACT
Although next-generation sequencing technologies have made it possible to quickly generate large collections of sequences, current genomic data still suffer from small sample sizes, imbalances, and biases caused by factors such as disease rarity, the cost of testing, and concerns about privacy and security. To address these limitations, we develop a Population-scale Genomic Data Augmentation method based on Conditional Generative Adversarial Networks (PG-cGAN), which increases the amount and diversity of genomic data by transforming samples already in the data rather than collecting new samples. Both the generator and the discriminator in PG-cGAN are built from stacked convolutional layers to capture the underlying population structure. Our results for augmenting genotypes in human leukocyte antigen (HLA) regions show that PG-cGAN can generate new genotypes whose population structure, variant frequency distributions, and linkage disequilibrium (LD) patterns are similar to those of the original data. Because the input to PG-cGAN is the original genomic data, with no assumptions about prior knowledge, the method can be extended to enrich many other types of biomedical data and beyond.
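As a rough illustration of how "similar variant frequency distributions and LD patterns" could be checked, the sketch below compares a real and a synthetic genotype matrix. It is not the evaluation code used in the paper. It assumes genotypes are encoded as 0/1/2 alternate-allele dosages, uses the squared Pearson correlation of dosages as a simple proxy for LD, and all function names are hypothetical.

```python
from itertools import combinations


def allele_frequencies(genotypes):
    """Alternate-allele frequency per variant (rows = samples, values = 0/1/2)."""
    n_samples = len(genotypes)
    n_variants = len(genotypes[0])
    return [
        sum(row[j] for row in genotypes) / (2 * n_samples)
        for j in range(n_variants)
    ]


def dosage_r2(x, y):
    """Squared Pearson correlation of two dosage vectors (assumed LD proxy)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # monomorphic variant: correlation undefined, report 0
    return (cov * cov) / (var_x * var_y)


def pairwise_ld(genotypes):
    """r^2 for every pair of variants, keyed by (i, j) with i < j."""
    n_variants = len(genotypes[0])
    columns = [[row[j] for row in genotypes] for j in range(n_variants)]
    return {
        (i, j): dosage_r2(columns[i], columns[j])
        for i, j in combinations(range(n_variants), 2)
    }


def mean_abs_gap(a, b):
    """Mean absolute difference between two equal-length sequences."""
    return sum(abs(u - v) for u, v in zip(a, b)) / len(a)


def compare(real, synthetic):
    """Summarise frequency and LD differences between two genotype matrices."""
    freq_gap = mean_abs_gap(allele_frequencies(real), allele_frequencies(synthetic))
    ld_real = pairwise_ld(real)
    ld_syn = pairwise_ld(synthetic)
    keys = sorted(ld_real)
    ld_gap = mean_abs_gap([ld_real[k] for k in keys], [ld_syn[k] for k in keys])
    return {"freq_gap": freq_gap, "ld_gap": ld_gap}


if __name__ == "__main__":
    # Toy matrices for illustration only; not data from the study.
    real = [[0, 1, 2], [1, 1, 2], [2, 0, 0], [0, 2, 1]]
    synthetic = [[0, 1, 2], [1, 2, 2], [2, 0, 0], [0, 2, 1]]
    print(compare(real, synthetic))
```

Smaller gaps indicate that the synthetic genotypes track the real ones more closely on these two summaries. Population structure would need a separate check, for example a principal component analysis of the combined samples.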
Index Terms
- Population-scale Genomic Data Augmentation Based on Conditional Generative Adversarial Networks