A Flow Procedure for the Linearization of Genome Sequence Graphs

Haussler, David; Smuga-Otto, Maciej; Paten, Benedict; Novak, Adam M.; Nikitin, Sergei; Zueva, Maria; Miagkov, Dmitrii

doi:10.1007/978-3-319-56970-3_3

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 10229)

Included in the following conference series:

International Conference on Research in Computational Molecular Biology

Abstract

Efforts to incorporate human genetic variation into the reference human genome have converged on the idea of a graph representation of genetic variation within a species, a genome sequence graph. A sequence graph represents a set of individual haploid reference genomes as paths in a single graph. When that set of reference genomes is sufficiently diverse, the sequence graph implicitly contains all frequent human genetic variations, including translocations, inversions, deletions, and insertions.

Representing a set of genomes as a sequence graph raises certain challenges. One of the most important is graph linearization, which is essential for efficient storage and access as well as for natural graph visualization and compatibility with other tools. The goal of graph linearization is to order the nodes of the graph so that operations such as access, traversal and visualization are as efficient and effective as possible.

A new algorithm for the linearization of sequence graphs, called the flow procedure, is proposed in this paper. Comparative experimental evaluation of the flow procedure against other algorithms shows that it outperforms its rivals in the metrics most relevant to sequence graphs.

References

1. Paten, B., Novak, A., Haussler, D.: Mapping to a reference genome structure. arXiv:1404.5010
2. Paten, B., Novak, A.M., Garrison, E., Hickey, G.: Superbubbles, ultrabubbles and cacti. In: Proceedings of RECOMB 2017 (2017)
3. Baharev, A., Schichl, H., Neumaier, A., Achterberg, T.: An exact method for the minimum feedback arc set problem (2016)
4. Karp, R.M.: Reducibility among combinatorial problems. In: Miller, R.E., Thatcher, J.W., Bohlinger, J.D. (eds.) Complexity of Computer Computations. The IBM Research Symposia Series, pp. 85–103. Springer US, New York (1972)
5. Brandenburg, F., Hanauer, K.: Sorting heuristics for the feedback arc set problem. Technical report. Number MIP-1104 (2011)
6. Gavril, F.: Some NP-complete problems on graphs. In: Proceedings of the 11th Conference on Information Sciences and Systems, pp. 91–95 (1977)
7. Martí, R., Pantrigo, J., Duarte, A., Pardo, E.: Branch and bound for the cutwidth minimization problem. Comput. Oper. Res. 40, 137–149 (2013). doi:10.1016/j.cor.2012.05.016
8. Cormen, T., Leiserson, C., Rivest, R., Stein, C.: Introduction to Algorithms. MIT Press, Cambridge (2009)
9. Medvedev, P., Brudno, M.: Maximum likelihood genome assembly. J. Comput. Biol. 16, 1101–1116 (2009). doi:10.1089/cmb.2009.0047
10. Ford, L.R., Fulkerson, D.R.: Flows in Networks. Princeton University Press, Princeton (1962)
11. https://www.bioconductor.org/packages/release/bioc/html/RSVSim.html
12. https://github.com/vgteam/vg
13. http://biorxiv.org/content/early/2017/01/18/101501
14. Kahn, A.: Topological sorting of large networks. Commun. ACM 5, 558–562 (1962). doi:10.1145/368996.369025
15. Eades, P., Lin, X., Smyth, W.: A fast and effective heuristic for the feedback arc set problem. Inf. Process. Lett. 47, 319–323 (1993). doi:10.1016/0020-0190(93)90079-O
16. Nguyen, N., Hickey, G., Zerbino, D., Raney, B., Earl, D., Armstrong, J., Kent, W., Haussler, D., Paten, B.: Building a pan-genome reference for a population. J. Comput. Biol. 22, 387–401 (2015). doi:10.1089/cmb.2014.0146

Acknowledgements

We’d like to thank Erik Garrison and Glenn Hickey for helpful conversations. This work was supported by the National Human Genome Research Institute of the National Institutes of Health under Award Number 5U54HG007990 and by grants from the W.M. Keck Foundation and the Simons Foundation. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Author information

Authors and Affiliations

UC Santa Cruz Genomics Institute, University of California, Santa Cruz, California, USA
David Haussler, Maciej Smuga-Otto, Benedict Paten & Adam M. Novak
Life Sciences Business Unit, EPAM Systems, Inc., Newtown, Pennsylvania, USA
Sergei Nikitin, Maria Zueva & Dmitrii Miagkov

Corresponding author

Correspondence to Benedict Paten.

Editor information

Editors and Affiliations

Indiana University Bloomington, Bloomington, Indiana, USA
S. Cenk Sahinalp

Appendix

1.1 Some Details of the FP Algorithm

The order in which we find the in- and outgrowths matters. First, we traverse the backbone from end to start, finding the outgrowth for each node; then we traverse it from start to end, finding the ingrowth for each node. An in- or outgrowth includes only those nodes that did not end up in any previously found out- or ingrowth (see Fig. 14).
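The two-pass order described above can be sketched in Python. This is an illustration under stated assumptions, not the authors' implementation: the graph is simplified to plain directed arcs given as `(tail, head)` pairs, an outgrowth of a backbone node is taken to be the unclaimed non-backbone nodes reachable from it along arcs, and an ingrowth the unclaimed non-backbone nodes from which it can be reached. The appendix does not restate these definitions, and the names `reachable` and `find_growths` are hypothetical.

```python
from collections import defaultdict


def reachable(start, adjacency, blocked):
    """Nodes reachable from `start` via `adjacency`, never entering `blocked`."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in adjacency[node]:
            if nxt not in blocked and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def find_growths(backbone, arcs):
    """Return (outgrowths, ingrowths), each a dict: backbone node -> set of nodes.

    Assumed definitions (not taken from the paper): growths are found by
    plain reachability, and the backbone node itself is not included.
    """
    succ, pred = defaultdict(list), defaultdict(list)
    for tail, head in arcs:
        succ[tail].append(head)
        pred[head].append(tail)

    claimed = set(backbone)  # nodes that may no longer join a growth
    outgrowths, ingrowths = {}, {}

    # Pass 1: backbone from end to start, collecting outgrowths.
    for node in reversed(backbone):
        grown = reachable(node, succ, claimed)
        if grown:
            outgrowths[node] = grown
            claimed |= grown

    # Pass 2: backbone from start to end, collecting ingrowths.
    for node in backbone:
        grown = reachable(node, pred, claimed)
        if grown:
            ingrowths[node] = grown
            claimed |= grown

    return outgrowths, ingrowths
```

Because `claimed` grows as the passes proceed, a node that could belong to several growths is assigned to the first one that reaches it, which is why the traversal order affects the result.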

1.2 Step-by-Step Algorithm Run

We start from the point at which the minimum cut has already been found and removed. We traverse the backbone (CGATC) from beginning to end and find the in-growth CCGA (the upper 3 nodes and A from the backbone). For this in-growth we run the entire flow procedure recursively. To find its backbone, we start from A and follow incoming maximum-weight arcs, which gives CCA; we then run the min cut search and remove the CG arc. Next we recursively traverse the CCA backbone from beginning to end, looking for in-growths, and find GC. Running the procedure on it arranges these two nodes in the obvious way. We insert the result into the backbone CCA with the G before the second C (the one that had the in-arc). Thus, we get . All nodes of this part are now sorted, so the recursion ends and we insert the resulting in-growth into the backbone of the source graph. Inserting into the backbone we get .

There are no other in-growths, so we turn to the out-growths, traversing the backbone from end to beginning. We find the out-growth GGC. It consists of 3 consecutive nodes, so the recursive procedure returns the natural order GGC. We insert it into the backbone and get . Then we look for the next out-growth and find CTCA, starting from the first node of the backbone. We run the procedure on it recursively: it finds the backbone CTA, removes the min cut, finds the in-growth CA and inserts its C before the A: . There are no other in- or out-growths, so this part of the algorithm is finished and we insert the nodes into the original backbone, finally getting .

1.3 Test Data Set Modeling

To simulate the test data, we used the RSVSim package (version 1.14.0) from Bioconductor (Release 3.4). As a reference genome, we took BSgenome.Hsapiens.UCSC.hg38 (version 1.4.1), alternative branch chr13_KI270842v1_alt, which is 37287 nucleotides long. Using the simulateSV command of the RSVSim package, we modeled genome fragments of 10 individuals with a given set of variations. The resulting FASTA files were passed as input to the msga command of the vg utility [12], which produced a sequence graph (*.gfa format). This graph was the input to the commands vg sort-f (Eades) and vg sort (Flow procedure) of the vg utility [12]. Finally, we obtained text files with the graph nodes ordered by the Kahn, Eades, and flow procedure linearizations, respectively. To analyze the algorithms, we wrote our own software to count the feedback arcs and measure the cut width of each of these orderings. To reduce the effect of random variation, we repeated the procedure 20 times for each set of variations and averaged the results.
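The two metrics can be computed from a node ordering and an arc list in a few lines. The sketch below is not the authors' evaluation software. It assumes a simplified directed graph, the common definition of a feedback arc as an arc whose head precedes its tail in the ordering, and the common definition of cut width as the largest number of arcs crossing any gap between consecutive nodes. The function name `linearization_metrics` is hypothetical.

```python
def linearization_metrics(order, arcs):
    """Return (feedback_arc_count, cut_width) for a linear order of nodes.

    order: list of node ids, left to right.
    arcs:  iterable of (tail, head) pairs; every node must appear in `order`.
    """
    pos = {node: i for i, node in enumerate(order)}

    feedback = 0
    # diff is a difference array over gaps: gap i lies between positions i and i+1.
    diff = [0] * len(order)
    for tail, head in arcs:
        if pos[tail] > pos[head]:
            feedback += 1  # arc points right to left
        lo, hi = sorted((pos[tail], pos[head]))
        # The arc crosses gaps lo .. hi-1 (none at all for a self-loop).
        diff[lo] += 1
        diff[hi] -= 1

    cut_width = running = 0
    for delta in diff:
        running += delta  # number of arcs crossing the current gap
        cut_width = max(cut_width, running)

    return feedback, cut_width
```

For example, with arcs `a→b`, `b→c`, `c→d`, `a→d`, `c→a`, the order `a, b, c, d` has one feedback arc (`c→a`) and at most three arcs crossing any gap, while the order `a, c, b, d` has two feedback arcs and four arcs crossing the middle gap.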

We created the variation sets as follows. To the modelled genome fragments we added 5 variation types: insertions, deletions, duplications, inversions, and translocations. The positions of all variations were uniformly distributed over the simulated section of the genome. Twenty percent of the insertions duplicated sections of the DNA. Translocations were modelled using the shoulder exchange mechanism. Insertions and deletions were 20 nucleotides long and inversions 200 nucleotides long; the length of duplications was 500. The number of variations of each type was 5 in the first set, 6 in the second, 7 in the third, and so on up to 11 in the last set. Reference [13] reports how the number of feedback arcs and the cut width depend on the number of variations of a single type. For that study, the number of variations of every type except the one examined was fixed at 7, and the number of variations of the examined type took the values 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, and 31.
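The two experimental designs described above can be written out as explicit parameter schedules. The sketch below only enumerates the counts given in the text and does not call RSVSim or vg. The names `VARIATION_TYPES`, `REPEATS`, `uniform_sets` and `single_type_sweeps` are hypothetical.

```python
VARIATION_TYPES = ["insertions", "deletions", "duplications",
                   "inversions", "translocations"]
REPEATS = 20  # each set of variations is simulated 20 times and averaged


def uniform_sets():
    """Sets in which every variation type has the same count: 5, 6, ..., 11."""
    return [{t: n for t in VARIATION_TYPES} for n in range(5, 12)]


def single_type_sweeps(fixed=7, counts=range(7, 32, 2)):
    """Sets in which one examined type varies (7, 9, ..., 31), the rest stay at 7.

    Returns a list of (examined_type, counts_by_type) pairs.
    """
    return [
        (examined, {t: (n if t == examined else fixed) for t in VARIATION_TYPES})
        for examined in VARIATION_TYPES
        for n in counts
    ]
```

Under this reading, the first design yields 7 variation sets and the second yields 13 sets for each of the 5 examined types.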

Cite this paper

Haussler, D. et al. (2017). A Flow Procedure for the Linearization of Genome Sequence Graphs. In: Sahinalp, S. (eds) Research in Computational Molecular Biology. RECOMB 2017. Lecture Notes in Computer Science, vol 10229. Springer, Cham. https://doi.org/10.1007/978-3-319-56970-3_3

DOI: https://doi.org/10.1007/978-3-319-56970-3_3
Published: 12 April 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-56969-7
Online ISBN: 978-3-319-56970-3
eBook Packages: Computer Science, Computer Science (R0)
