DOI: 10.1145/3388440.3412475
research-article

Population-scale Genomic Data Augmentation Based on Conditional Generative Adversarial Networks

Published: 10 November 2020

ABSTRACT

Although next-generation sequencing technologies have made it possible to quickly generate large collections of sequences, current genomic data still suffer from small data sizes, imbalances, and biases due to factors including disease rarity, test affordability, and concerns about privacy and security. To address these limitations, we develop Population-scale Genomic Data Augmentation based on Conditional Generative Adversarial Networks (PG-cGAN), which increases the amount and diversity of genomic data by transforming samples already in the data rather than collecting new samples. Both the generator and the discriminator in PG-cGAN are stacked with convolutional layers to capture the underlying population structure. Our results for augmenting genotypes in human leukocyte antigen (HLA) regions show that PG-cGAN can generate new genotypes with population structure, variant frequency distributions, and linkage disequilibrium (LD) patterns similar to those of the original data. Because the input to PG-cGAN is the original genomic data, with no assumptions about prior knowledge, it can be extended to enrich many other types of biomedical data and beyond.
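
The abstract names variant frequency distributions and LD patterns as two of the properties compared between original and generated genotypes. The sketch below shows one common way such a comparison could be set up; it is not the authors' evaluation code. It assumes genotypes are encoded as alternate-allele dosages (0, 1, 2) in a samples-by-variants matrix, and it uses the squared Pearson correlation between dosages as a stand-in for LD r², a frequent approximation for unphased data. The function names and the toy matrices are hypothetical.

```python
from statistics import mean


def allele_frequencies(genotypes):
    """Alternate-allele frequency per variant.

    `genotypes` is a samples x variants list of lists holding dosages 0/1/2
    (an assumed encoding; the paper's exact input format is not given here).
    """
    n_samples = len(genotypes)
    n_variants = len(genotypes[0])
    return [
        sum(row[j] for row in genotypes) / (2 * n_samples)
        for j in range(n_variants)
    ]


def ld_r2(genotypes, i, j):
    """Squared Pearson correlation between dosages at variants i and j."""
    x = [row[i] for row in genotypes]
    y = [row[j] for row in genotypes]
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Monomorphic variant: r^2 is undefined; 0.0 is a convention of this sketch.
        return 0.0
    return cov * cov / (var_x * var_y)


def max_frequency_gap(real, synthetic):
    """Largest per-variant difference in allele frequency between two datasets."""
    real_af = allele_frequencies(real)
    synth_af = allele_frequencies(synthetic)
    return max(abs(r - s) for r, s in zip(real_af, synth_af))


# Toy matrices (4 samples x 3 variants), made up purely for illustration.
real = [
    [0, 0, 2],
    [1, 1, 1],
    [2, 2, 0],
    [1, 1, 1],
]
synthetic = [
    [0, 0, 2],
    [2, 2, 0],
    [1, 1, 2],
    [2, 1, 0],
]

if __name__ == "__main__":
    print("real allele frequencies:", allele_frequencies(real))
    print("synthetic allele frequencies:", allele_frequencies(synthetic))
    print("max frequency gap:", max_frequency_gap(real, synthetic))
    print("real r^2 (variants 0, 2):", ld_r2(real, 0, 2))
    print("synthetic r^2 (variants 0, 2):", ld_r2(synthetic, 0, 2))
```

In practice the same two summaries would be computed over all variants and variant pairs in a region and compared as distributions or heatmaps; the pairwise loop is omitted here to keep the sketch short.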

    Published in

      BCB '20: Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics
      September 2020
      193 pages
      ISBN: 9781450379649
      DOI: 10.1145/3388440

      Copyright © 2020 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Qualifiers

      • research-article
      • Research
      • Refereed limited

      Acceptance Rates

      Overall Acceptance Rate: 254 of 885 submissions, 29%
