ABSTRACT
We consider the problem of identifying regions within a pan-genome de Bruijn graph that are traversed by many sequence paths. We define such regions, together with the subpaths that traverse them, as frequented regions (FRs). In this work we formalize the FR problem and describe an efficient algorithm for finding FRs. We then propose applications of FRs based on machine learning and pan-genome graph simplification, and demonstrate their effectiveness using data sets for the organisms Staphylococcus aureus (bacteria) and Saccharomyces cerevisiae (yeast). We corroborate the biological relevance of FRs, for example by identifying introgressions in yeast that aid in alcohol tolerance, and show that FRs are useful for classifying yeast strains by industrial use and for visualizing pan-genomic space.