Exploring Frequented Regions in Pan-Genomic Graphs

Authors:
Alan Cleary, Montana State University, Bozeman, MT, USA
Indika Kahanda, Montana State University, Bozeman, MT, USA
Brendan Mumey, Montana State University, Bozeman, MT, USA
Joann Mudge, National Center for Genome Resources, Santa Fe, NM, USA
Thiruvarangan Ramaraj, National Center for Genome Resources, Santa Fe, NM, USA

ACM-BCB '17: Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, August 2017, Pages 89–97. https://doi.org/10.1145/3107411.3107427

Published: 20 August 2017

ABSTRACT

We consider the problem of identifying regions within a pan-genome de Bruijn graph that are traversed by many sequence paths. We define such regions, together with the subpaths that traverse them, as frequented regions (FRs). In this work we formalize the FR problem and describe an efficient algorithm for finding FRs. We then propose applications of FRs based on machine learning and pan-genome graph simplification, and demonstrate their effectiveness using data sets for the organisms Staphylococcus aureus (bacteria) and Saccharomyces cerevisiae (yeast). We corroborate the biological relevance of FRs, for example by identifying introgressions in yeast that aid in alcohol tolerance, and show that FRs are useful for classifying yeast strains by industrial use and for visualizing pan-genomic space.

Index Terms

1. Applied computing
  1. Life and medical sciences
    1. Computational biology
      1. Computational genomics
2. Computing methodologies
  1. Machine learning
    1. Machine learning algorithms
      1. Feature selection


Published in
ACM-BCB '17: Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics
August 2017
800 pages
ISBN: 9781450347228
DOI: 10.1145/3107411
General Chairs: Nurit Haspel (University of Massachusetts Boston, USA), Lenore J. Cowen (Tufts University, USA)
Program Chairs: Amarda Shehu (George Mason University, USA), Tamer Kahveci (University of Florida, USA), Giuseppe Pozzi (Politecnico di Milano, Italy)
Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
classification
pan-genomics
visualization
Qualifiers
- research-article
Acceptance Rates
ACM-BCB '17 Paper Acceptance Rate: 42 of 132 submissions, 32%
Overall Acceptance Rate: 254 of 885 submissions, 29%
Article Metrics
- Total Citations: 2
- Total Downloads: 253
- Downloads (Last 12 months): 25
- Downloads (Last 6 weeks): 5