DOI: 10.1145/3107411.3107427
research-article

Exploring Frequented Regions in Pan-Genomic Graphs

Published: 20 August 2017

ABSTRACT

We consider the problem of identifying regions within a pan-genome de Bruijn graph that are traversed by many sequence paths. We call such regions, together with the subpaths that traverse them, frequented regions (FRs). In this work we formalize the FR problem and describe an efficient algorithm for finding FRs. We then propose applications of FRs based on machine learning and pan-genome graph simplification. We demonstrate the effectiveness of these applications using data sets for the organisms Staphylococcus aureus (bacteria) and Saccharomyces cerevisiae (yeast). We corroborate the biological relevance of FRs, for example by identifying introgressions in yeast that aid in alcohol tolerance, and show that FRs are useful for classifying yeast strains by industrial use and for visualizing pan-genomic space.
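To make the idea of a region "traversed by many sequence paths" concrete, the following is a minimal Python sketch, not the paper's algorithm or its formal definition. It assumes a simplified criterion: a path supports a candidate node set if it visits at least a fraction alpha of the set's nodes, and the set counts as frequented if enough paths support it. The names supporting_paths, is_frequented, alpha, and min_support, and the toy strain paths, are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set


def supporting_paths(paths: Dict[str, List[int]], region: Set[int], alpha: float) -> Set[str]:
    """Return the names of paths that visit at least alpha * |region| nodes of the region.

    This is an illustrative simplification (an assumption), not the paper's definition.
    """
    needed = alpha * len(region)
    return {
        name
        for name, nodes in paths.items()
        if len(region.intersection(nodes)) >= needed
    }


def is_frequented(paths: Dict[str, List[int]], region: Set[int],
                  alpha: float, min_support: int) -> bool:
    """A region is treated as frequented if at least min_support paths support it."""
    return len(supporting_paths(paths, region, alpha)) >= min_support


# Toy data: each sequence is a path (list of node ids) through a hypothetical graph.
paths = {
    "strain_a": [1, 2, 3, 4, 5],
    "strain_b": [1, 2, 9, 4, 5],  # skips node 3 of the region
    "strain_c": [7, 8, 9],        # never enters the region
}
region = {2, 3, 4}

print(sorted(supporting_paths(paths, region, alpha=0.6)))
print(is_frequented(paths, region, alpha=0.6, min_support=2))
```

With alpha below 1, a path that misses a few nodes of the region (as strain_b does here) still counts as support, which is what lets such a criterion tolerate small variations between sequences.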

Published in

ACM-BCB '17: Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics
August 2017
800 pages
ISBN: 9781450347228
DOI: 10.1145/3107411

Copyright © 2017 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Acceptance Rates

ACM-BCB '17 paper acceptance rate: 42 of 132 submissions, 32%
Overall acceptance rate: 254 of 885 submissions, 29%
