Population-scale Genomic Data Augmentation Based on Conditional Generative Adversarial Networks

Authors:
Junjie Chen, Temple University, Philadelphia, PA, USA
Mohammad Erfan Mowlaei, Temple University, Philadelphia, PA, USA
Xinghua Shi, Temple University, Philadelphia, PA, USA

BCB '20: Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, September 2020, Article No.: 26, Pages 1–6. https://doi.org/10.1145/3388440.3412475

Published: 10 November 2020

ABSTRACT

Although next-generation sequencing technologies have made it possible to generate large collections of sequences quickly, current genomic data sets still suffer from small sample sizes, imbalance, and bias, owing to factors such as disease rarity, test affordability, and privacy and security concerns. To address these limitations, we develop Population-scale Genomic Data Augmentation based on Conditional Generative Adversarial Networks (PG-cGAN), which increases the amount and diversity of genomic data by transforming samples already in the data rather than collecting new samples. Both the generator and the discriminator in PG-cGAN are built from stacked convolutional layers to capture the underlying population structure. Our results on augmenting genotypes in human leukocyte antigen (HLA) regions show that PG-cGAN can generate new genotypes whose population structure, variant frequency distributions, and linkage disequilibrium (LD) patterns are similar to those of the original data. Because PG-cGAN takes the original genomic data as input and makes no assumptions about prior knowledge, it can be extended to enrich many other types of biomedical data and beyond.
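Two of the similarity criteria named in the abstract, variant frequency distributions and LD patterns, can be illustrated with a small sketch. This is not the authors' evaluation code. It assumes genotypes are encoded as alternate-allele counts (0, 1, or 2) with one row per individual, it approximates LD by the squared Pearson correlation between dosages at two variants, and the function names and toy matrices are hypothetical.

```python
from statistics import mean


def allele_frequencies(genotypes):
    """Alternate-allele frequency per variant.

    Assumes genotypes[i][j] is the alternate-allele count (0, 1, or 2)
    of individual i at variant j, so each individual carries 2 alleles.
    """
    n_individuals = len(genotypes)
    n_variants = len(genotypes[0])
    return [
        sum(row[j] for row in genotypes) / (2 * n_individuals)
        for j in range(n_variants)
    ]


def ld_r2(genotypes, j, k):
    """Approximate LD between variants j and k as the squared Pearson
    correlation of their dosage vectors (an assumption of this sketch)."""
    x = [row[j] for row in genotypes]
    y = [row[k] for row in genotypes]
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a monomorphic variant carries no LD signal
    return cov * cov / (vx * vy)


# Toy data only: 4 individuals x 3 variants, not taken from the paper.
real = [[0, 0, 1], [1, 1, 0], [2, 2, 1], [1, 1, 0]]
synthetic = [[0, 0, 0], [1, 1, 1], [2, 2, 1], [1, 1, 1]]

# Largest per-variant gap in allele frequency between the two sets.
freq_gap = max(
    abs(p - q)
    for p, q in zip(allele_frequencies(real), allele_frequencies(synthetic))
)

# Gap in pairwise LD for one pair of variants.
ld_gap = abs(ld_r2(real, 0, 2) - ld_r2(synthetic, 0, 2))
```

Small values of `freq_gap` and `ld_gap` would indicate that the synthetic genotypes resemble the real ones on these two measures; a full comparison would cover all variant pairs within a region and add a population-structure check such as principal component analysis.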

Index Terms

1. Applied computing
  1. Life and medical sciences
    1. Computational biology
      1. Computational genomics

Published in

BCB '20: Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics
September 2020
193 pages
ISBN:9781450379649
DOI:10.1145/3388440

Copyright © 2020 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
data augmentation
deep learning
generative adversarial networks
genomics
machine learning
Qualifiers
- research-article
- Research
- Refereed limited
Acceptance Rates
Overall Acceptance Rate: 254 of 885 submissions, 29%