RACCROCHE: Ancestral Flowering Plant Chromosomes and Gene Orders Based on Generalized Adjacencies and Chromosomal Gene Co-occurrences

Xu, Qiaoji; Jin, Lingling; Zheng, Chunfang; Leebens-Mack, James H.; Sankoff, David

doi:10.1007/978-3-030-79290-9_9

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 12686)

Included in the following conference series:

International Conference on Computational Advances in Bio and Medical Sciences

Abstract

Given the phylogenetic relationships of several extant species, the reconstruction of their ancestral genomes at the gene and chromosome level is made difficult by the cycles of whole genome doubling followed by fractionation in plant lineages. Fractionation scrambles the gene adjacencies that enable existing reconstruction methods. We propose an alternative approach that postpones the selection of gene adjacencies for reconstructing small ancestral segments and instead accumulates a very large number of syntenically validated candidate adjacencies to produce long ancestral contigs through maximum weight matching. Likewise, we do not construct chromosomes by successively piecing together contigs into larger segments, but instead count all contig co-occurrences on the input genomes and cluster these, so that chromosomal assemblies of contigs all emerge naturally ordered at each ancestral node of the phylogeny. These strategies result in substantially more complete reconstructions than existing methods. We deploy a number of quality measures: contig lengths, continuity of contig structure on successive ancestors, coverage of the reconstruction on the input genomes, and rearrangement implications of the chromosomal structures obtained. The reconstructed ancestors can be functionally annotated and are visualized by painting the ancestral projections on the descendant genomes, and by highlighting syntenic ancestor-descendant relationships. We apply our methods to genomes drawn from a broad range of monocot orders, confirming the tetraploidization event “tau” in the stem lineage between the alismatids and the lilioids.

Availability

The annotated genomic data is accessible on the CoGe platform https://genomevolution.org/coge/ and Phytozome. The pipeline is available at https://github.com/jin-repo/RACCROCHE.

References

Amborella Genome Project: The Amborella genome and the evolution of flowering plants. Science 342(6165), 1241089 (2013)
Anselmetti, Y., Luhmann, N., Bérard, S., Tannier, E., Chauve, C.: Comparative methods for reconstructing ancient genome organization. In: Setubal, J.C., Stoye, J., Stadler, P.F. (eds.) Comparative Genomics. MMB, vol. 1704, pp. 343–362. Springer, New York (2018). https://doi.org/10.1007/978-1-4939-7463-4_13
Avdeyev, P., Alexeev, N., Rong, Y., Alekseyev, M.A.: A unified ILP framework for core ancestral genome reconstruction problems. Bioinformatics 36(10), 2993–3003 (2020)
Badouin, H., et al.: The sunflower genome provides insights into oil metabolism, flowering and Asterid evolution. Nature 546(7656), 148–152 (2017)
Chauve, C., Tannier, E.: A methodological framework for the reconstruction of contiguous regions of ancestral genomes and its application to mammalian genomes. PLoS Comput. Biol. 4(11), e1000234 (2008)
Givnish, T.J., et al.: Monocot plastid phylogenomics, timeline, net rates of species diversification, the power of multi-gene analyses, and a functional model for the origin of monocots. Am. J. Bot. 105(11), 1888–1910 (2018)
Ma, J., et al.: Reconstructing contiguous regions of an ancestral genome. Genome Res. 16(12), 1557–1565 (2006)
Martí, R., Reinelt, G., Duarte, A.: A benchmark library and a comparison of heuristic methods for the linear ordering problem. Comput. Optim. Appl. 51(3), 1297–1317 (2012). https://doi.org/10.1007/s10589-010-9384-9
Mazowita, M., Haque, L., Sankoff, D.: Stability of rearrangement measures in the comparison of genome sequences. J. Comput. Biol. 13(2), 554–566 (2006)
Murat, F., Armero, A., Pont, C., Klopp, C., Salse, J.: Reconstructing the genome of the most recent common ancestor of flowering plants. Nat. Genet. 49, 490–496 (2017)
Perrin, A., Varré, J.S., Blanquart, S., Ouangraoua, A.: ProCARs: progressive reconstruction of ancestral gene orders. BMC Genomics 16(S5) (2015). Article number: S6. https://doi.org/10.1186/1471-2164-16-S5-S6
Rubert, D.P., Martinez, F.V., Stoye, J., Doerr, D.: Analysis of local genome rearrangement improves resolution of ancestral genomic maps in plants. BMC Genomics 21, 1–11 (2020). https://doi.org/10.1186/s12864-020-6609-x
Schiavinotto, T., Stützle, T.: The linear ordering problem: instances, search space analysis and algorithms. J. Math. Model. Algorithms 3(4), 367–402 (2004). https://doi.org/10.1007/s10852-005-2583-1
Tannier, E., Bazin, A., Davín, A., Guéguen, L., Bérard, S., Chauve, C.: Ancestral genome organization as a diagnosis tool for phylogenomics (2020)
Wang, Y., et al.: MCScanX: a toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic Acids Res. 40(7), e49 (2012)
Xu, Q., Jin, L., Zheng, C., Leebens-Mack, J.H., Sankoff, D.: Validation of automated chromosome recovery in the reconstruction of ancestral gene order. Algorithms 14, 160 (2021)
Xu, X., Sankoff, D.: Tests for gene clusters satisfying the generalized adjacency criterion. In: Bazzan, A.L.C., Craven, M., Martins, N.F. (eds.) BSB 2008. LNCS, vol. 5167, pp. 152–160. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-85557-6_14
Yang, Z., Sankoff, D.: Natural parameter values for generalized gene adjacency. J. Comput. Biol. 17(9), 1113–1128 (2010)
Zheng, C., Chen, E., Albert, V.A., Lyons, E., Sankoff, D.: Ancient eudicot hexaploidy meets ancestral eurosid gene order. BMC Genomics 14(S7), S3 (2013)

Acknowledgements

We thank the Department of Energy Joint Genome Institute staff and collaborators, including David Kudrna, Jerry Jenkins, Jane Grimwood, Shengqiang Shu, and Jeremy Schmutz, for pre-publication access to the Acorus genome sequence and annotation. Thanks to Aïda Ouangraoua for much help in implementing ProCARs [11] and to Haibao Tang for prompt replies to queries about MCScanX [15].

Funding

Research supported by Discovery grants to LJ and DS from the Natural Sciences and Engineering Research Council of Canada. DS holds the Canada Research Chair in Mathematical Genomics.

Author information

Authors and Affiliations

University of Ottawa, Ottawa, ON, K1N 6N5, Canada
Qiaoji Xu, Chunfang Zheng & David Sankoff
University of Saskatchewan, Saskatoon, SK, S7N 5C9, Canada
Lingling Jin
University of Georgia, Athens, GA, 30602, USA
James H. Leebens-Mack

Corresponding author

Correspondence to David Sankoff.

Editor information

Editors and Affiliations

The University of Texas at San Antonio, San Antonio, TX, USA
Sumit Kumar Jha
University of Connecticut, Storrs, CT, USA
Ion Măndoiu
University of Connecticut, Storrs Mansfield, CT, USA
Sanguthevar Rajasekaran
Department of Computer Science, Georgia State University, Roswell, GA, USA
Pavel Skums
Department of Computer Science, Georgia State University, Atlanta, GA, USA
Alex Zelikovsky

Appendices

A Redistributing Genes from Families Exceeding Upper Size Limits

As an optional second “redistribution” step, all families with more than NF members in total, or more than NG members in any particular genome, are flagged. The construction of the families is then repeated, with the restriction that no gene can be recruited to a family solely by virtue of a similarity below some threshold homology level $\theta$ to a gene already in the family. The intent is to break up large families held together by a few weak links, and thus to retrieve some better-supported smaller families.
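The redistribution step can be illustrated with a minimal sketch. It assumes, purely for illustration, that families are the connected components of a gene similarity graph and that only the flagged families are rebuilt; the function names, the `(gene_a, gene_b, score)` edge format, and the `genome_of` lookup are hypothetical and are not taken from the RACCROCHE pipeline.

```python
from collections import defaultdict

def build_families(genes, similarities, theta=0.0):
    """Connected components of the similarity graph, ignoring links below theta."""
    parent = {g: g for g in genes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, score in similarities:
        if score >= theta:
            parent[find(a)] = find(b)

    families = defaultdict(set)
    for g in genes:
        families[find(g)].add(g)
    return list(families.values())

def oversized(family, genome_of, nf, ng):
    """True if the family has more than nf genes, or more than ng in one genome."""
    per_genome = defaultdict(int)
    for g in family:
        per_genome[genome_of[g]] += 1
    return len(family) > nf or max(per_genome.values()) > ng

def redistribute(genes, similarities, genome_of, nf, ng, theta):
    """Keep acceptable families; rebuild flagged ones using only links >= theta."""
    kept, flagged = [], set()
    for fam in build_families(genes, similarities):
        if oversized(fam, genome_of, nf, ng):
            flagged |= fam
        else:
            kept.append(fam)
    # Assumption: only genes of flagged families are re-clustered.
    links = [(a, b, s) for a, b, s in similarities if a in flagged and b in flagged]
    kept.extend(build_families(sorted(flagged), links, theta))
    return kept
```

In this sketch a family that is still too large after the weak links are removed is simply returned as is; how such families should be treated is left open here.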

B Modes of Contig Construction

RACCROCHE can be run for a single set of W, NF, NG parameters, or for a range of values of W and NG. In the latter case, there is an option, designed to increase coherence among the sets of contigs for successive ancestors, under which the maximum weight matching (MWM) for any combination of W and NG is restricted to include, insofar as possible, all adjacencies already recovered for smaller values of W or NG. Thus, starting with some small W and NG, we can construct MWM solutions, and hence sets of contigs, for larger window sizes and/or larger gene family sizes by incrementing one or the other of the parameters.

It is possible, however, to have conflicts between the $(W, NG-1)$ and $(W-1, NG)$ analyses. For example, if adjacencies (a, b) and (b, c) are in the MWM for $(W, NG-1)$ and (a, b) and (b, d) are in the MWM for $(W-1, NG)$, then a matching for $(W, NG)$ cannot be forced to include all adjacencies from the two previous MWM. To accommodate this possibility, when we restrict the MWM for $(W, NG)$ to include all adjacencies from $(W, NG-1)$ and $(W-1, NG)$, we make an exception for any adjacencies from either that are in potential conflict with adjacencies from the other. Thus (a, b) in the example above would be obligatorily included, but (b, c) and (b, d) would not. The MWM for $(W, NG)$ might then include (b, c) or (b, d), but not both.
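The conflict rule can be sketched as follows. This is an illustration only: it assumes that each adjacency is stored as a `frozenset` of two gene extremities (hypothetical labels such as `"b.h"` for the head of gene b), so that two different adjacencies conflict exactly when they share an extremity. The function name is also hypothetical.

```python
def forced_adjacencies(mwm_a, mwm_b):
    """Adjacencies from two earlier matchings that can be forced into the next one.

    An adjacency is dropped from the forced set when one of its extremities
    is used by a different adjacency in the other matching.
    """
    def by_extremity(matching):
        return {end: adj for adj in matching for end in adj}

    ends_a, ends_b = by_extremity(mwm_a), by_extremity(mwm_b)

    def conflicts(adj, other_ends):
        return any(end in other_ends and other_ends[end] != adj for end in adj)

    forced = {adj for adj in mwm_a if not conflicts(adj, ends_b)}
    forced |= {adj for adj in mwm_b if not conflicts(adj, ends_a)}
    return forced

# The example from the text: (a, b) is shared, (b, c) and (b, d) compete.
ab = frozenset({"a.h", "b.t"})
bc = frozenset({"b.h", "c.t"})
bd = frozenset({"b.h", "d.t"})
print(forced_adjacencies({ab, bc}, {ab, bd}) == {ab})
```

The adjacencies left out of the forced set remain ordinary candidates, so the subsequent matching is free to choose at most one of them.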

C Matching Contigs to Chromosomes of Extant Genomes

For the ancestor genome, A, computed from a set of extant genomes neighbouring A, $G_{1\cdots n}$, perform the following steps.

1. Extract gene features of ancestor A in descendant genomes.

For every gene, g, in ancestor A computed from Step 2, retrieve six features of this gene in every extant genome $G_{1\cdots n}$ involved in constructing ancestor A. The features of a gene are the chromosome ID, the start and end chromosomal positions, the distance from g to its next adjacent gene in $G_i$, the gene family ID assigned in Step 1, and the contig ID in A, denoted as $g^{A\rightarrow G_i}(chr,start,end,distance,gf,ctg)$.

2. Map ancestor A to each of the descendant genomes.

The ancestor is mapped as ancestral syntenic blocks on the descendant genome in two steps. The first step initializes a syntenic block by merging two adjacent genes, given a distance threshold DIS: two genes, $g_1$ and $g_2$, are merged into one ancestral syntenic block on $G_i$ if they satisfy the following conditions:

(a) $g_1$ and $g_2$ are located on the same chromosome of $G_i$;
(b) $g_1$ and $g_2$ are adjacent to each other; in other words, there may be a non-coding region, but no other gene, between $g_1$ and $g_2$;
(c) the distance between the two adjacent genes is less than or equal to the distance threshold DIS (e.g. $DIS=1$ Mbp).

The second step extends the block identified above by merging flanking genes into it, provided they satisfy the same three conditions. Extension stops when no flanking gene can be merged into the block. After the two steps, an ancestral synteny block mapping A to $G_i$ is denoted as syntenyBlk(chr, start, end, ctg, len). The set of synteny blocks between A and $G_i$ is

$$syntenyBlkSet^{A\rightarrow G_i}=\{syntenyBlk_k(chr,start,end,ctg,len) \mid 1\le k\le m\},$$

where m is the total number of synteny blocks mapping from A to $G_i$.
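The two mapping steps above can be sketched as a single left-to-right pass over the genes of one extant genome. This is a minimal illustration under stated assumptions: the `Gene` and `SyntenyBlock` classes and the `rank` field (the position of the gene in the full gene order of its chromosome, used to test condition (b)) are hypothetical. The sketch also assumes that merged genes must belong to the same ancestral contig, which the conditions above do not state but the block attribute ctg suggests.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Gene:
    chr: str
    start: int
    end: int
    rank: int   # hypothetical: index in the full gene order of the chromosome
    ctg: int    # ancestral contig ID

@dataclass
class SyntenyBlock:
    chr: str
    start: int
    end: int
    ctg: int
    n_genes: int = 1

    @property
    def len(self):
        return self.end - self.start

def build_synteny_blocks(genes, dis=1_000_000):
    """Merge runs of adjacent ancestral genes into blocks (conditions a to c)."""
    blocks, prev = [], None
    for g in sorted(genes, key=lambda x: (x.chr, x.start)):
        mergeable = (
            prev is not None
            and g.chr == prev.chr            # (a) same chromosome
            and g.rank == prev.rank + 1      # (b) no other gene in between
            and g.start - prev.end <= dis    # (c) within the distance threshold
            and g.ctg == prev.ctg            # assumption: same ancestral contig
        )
        if mergeable:
            blocks[-1].end = max(blocks[-1].end, g.end)
            blocks[-1].n_genes += 1
        else:
            blocks.append(SyntenyBlock(g.chr, g.start, g.end, g.ctg))
        prev = g
    # A block is initialized by merging two genes, so singletons are dropped.
    return [b for b in blocks if b.n_genes >= 2]
```

Because genes are visited in positional order, initializing a block and extending it by flanking genes collapse into the same comparison between each gene and its predecessor.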

D Construction of Ancestral Chromosomes

1. Filter the set of blocks, keeping those longer than a block length threshold.

Given a block length threshold, blockLEN, $\overline{syntenyBlkSet}^{A\rightarrow G_i}$ is the subset of $syntenyBlkSet^{A\rightarrow G_i}$ in which each block is longer than blockLEN (e.g. $blockLEN = 150$ Kbp).

2. Count co-occurrences of ancestral contigs on the same chromosomes.

Based on syntenyBlk.chr and syntenyBlk.ctg of each pair of synteny blocks in $\overline{syntenyBlkSet}^{A\rightarrow G_i}$, count the co-occurrences of ancestral contigs on the same extant chromosome. Write the result into the lower triangle of an $NC\times NC$ matrix, m, whose rows and columns are contigs with IDs from 0 to $NC-1$; $m_{i,j}$ is the number of co-occurrences of contigs i and j, where $0\le j<i\le NC-1$. The maximum co-occurrence frequency in m is denoted as $\max _{freq}$.

3. Cluster ancestral contigs into ancestral chromosomes according to a pairwise distance matrix based on co-occurrence.

An $NC\times NC$ distance matrix, dmat, is calculated as

$$dmat_{i,j}= -\log \left(\frac{\max _{freq}-m_{i,j}}{\max _{freq}}\right).$$

This distance matrix is fed into the complete-link clustering algorithm. The resulting hierarchy can then be cut into K clusters, according to the user's preference. The resultant clusters of contigs correspond to ancestral chromosomes and their compositions.

Last, attach the ancestral chromosome number as an attribute to each synteny block:

$$syntenyBlkSet^{A\rightarrow G_{1\cdots n}}=\{syntenyBlk_k(chr,start,end,ctg,len,ancestral\_chr)\},$$

where $ancestral\_chr$ is the ID of the cluster to which syntenyBlk.ctg belongs.
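The counting and clustering steps can be sketched in plain Python. This is an illustration under stated assumptions, not the pipeline's code: blocks are given as hypothetical `(genome, chr, ctg, length)` tuples, a pair of contigs is counted once per extant chromosome on which both have a retained block, and the distance matrix `dmat` is taken as a precomputed symmetric input rather than derived here.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence(blocks, nc, block_len=150_000):
    """Lower-triangular co-occurrence counts m[i][j] (j < i) of contigs.

    Assumption: each pair of contigs is counted once per extant chromosome
    that carries a retained block from both of them.
    """
    contigs_on = defaultdict(set)
    for genome, chrom, ctg, length in blocks:
        if length > block_len:                      # step 1: length filter
            contigs_on[(genome, chrom)].add(ctg)
    m = [[0] * nc for _ in range(nc)]
    for ctgs in contigs_on.values():
        for j, i in combinations(sorted(ctgs), 2):  # j < i
            m[i][j] += 1
    return m

def complete_link(dmat, k):
    """Naive agglomerative complete-link clustering, stopped at k clusters."""
    clusters = [[i] for i in range(len(dmat))]

    def dist(a, b):
        # Complete link: distance between the two farthest members.
        return max(dmat[i][j] for i in a for j in b)

    while len(clusters) > k:
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: dist(clusters[p[0]], clusters[p[1]]),
        )
        clusters[x] += clusters.pop(y)              # x < y, so x stays valid
    return [sorted(c) for c in clusters]
```

The naive clustering loop recomputes cluster distances at every merge, which is adequate for a few hundred contigs; a library implementation of hierarchical clustering would be the practical choice for larger inputs.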

To order the contigs along each chromosome, we proceed as follows.

After $syntenyBlkSet^{A\rightarrow G_{1\cdots n}}$ is generated in Step 3, the relative ordering of every pair of contigs is counted. The number of times each contig appears upstream or downstream of each other contig is recorded in an $NC\times NC$ ordering matrix, C, whose rows and columns are contig IDs from 0 to $NC-1$; $c_{i,j}$ is the number of times contig i occurs upstream of contig j on the extant chromosomes.

Given the ordering matrix C, the linear ordering problem (LOP) is the problem of finding a permutation $\pi $ of the column and row indices $\{1, \cdots , NC\}$, such that the value

$$f(\pi ) = \sum _{i=1}^{NC} \sum _{j=i+1}^{NC} c_{\pi (i),\pi (j)} \tag{3}$$

is maximized [13]. In other words, the goal is to find a permutation of the columns and rows of C such that the sum of the elements in the upper triangle is maximized.

We solve the LOP with a meta-heuristic, Tabu Search [8]. The resulting permutation gives the order of the contigs along each ancestral chromosome.
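The ordering step can be illustrated on a toy instance. The sketch below builds C from the contig orders observed on extant chromosomes, evaluates the objective $f(\pi)$, and improves a starting permutation with a plain insertion-move local search. The local search is a simplified stand-in chosen for brevity; it is not the Tabu Search solver cited above, and all function names and the input format (one list of contig IDs per extant chromosome, in positional order) are assumptions.

```python
def ordering_matrix(chrom_orders, nc):
    """c[i][j] = number of times contig i is seen upstream of contig j."""
    c = [[0] * nc for _ in range(nc)]
    for order in chrom_orders:
        for a in range(len(order)):
            for b in range(a + 1, len(order)):
                if order[a] != order[b]:
                    c[order[a]][order[b]] += 1
    return c

def lop_value(c, perm):
    """Sum of the upper triangle of C after permuting rows and columns by perm."""
    n = len(perm)
    return sum(c[perm[a]][perm[b]] for a in range(n) for b in range(a + 1, n))

def local_search(c, perm):
    """Move one contig at a time to a new position while the objective improves."""
    perm = list(perm)
    best = lop_value(c, perm)
    improved = True
    while improved:
        improved = False
        for src in range(len(perm)):
            for dst in range(len(perm)):
                if src == dst:
                    continue
                cand = perm[:]
                cand.insert(dst, cand.pop(src))
                val = lop_value(c, cand)
                if val > best:
                    perm, best, improved = cand, val, True
    return perm, best

# Toy input: three extant chromosomes carrying contigs 0, 1 and 2.
c = ordering_matrix([[2, 0, 1], [2, 0, 1], [0, 2, 1]], nc=3)
print(local_search(c, [0, 1, 2]))
```

A local search of this kind can stop at a local optimum, which is the limitation that meta-heuristics such as Tabu Search are designed to overcome.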

E Functional Annotation of Ancestral Genes

We create a set of all genes in all families represented by ancestral genes in the reconstructed ancestor. This is the background set. For each gene family, all the genes in the family constitute a query set for GO-term enrichment analysis against the background set. Significant terms that emerge constitute the functional annotation for the ancestral gene.
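The enrichment test for one gene family can be sketched as follows. The text does not specify the statistical test, so this illustration assumes a one-sided hypergeometric test, a common choice for GO-term enrichment. The function names, the `go_terms` mapping from gene to annotated terms, and the `alpha` cutoff are hypothetical, and no multiple-testing correction is applied.

```python
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when drawing n genes from N, of which K carry the term."""
    upper = min(n, K)
    return sum(comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1)) / comb(N, n)

def enriched_terms(family, background, go_terms, alpha=0.05):
    """GO terms over-represented in a family relative to the background set.

    family must be a subset of background; go_terms maps gene -> set of terms.
    """
    N, n = len(background), len(family)
    results = {}
    for term in {t for g in family for t in go_terms.get(g, ())}:
        K = sum(term in go_terms.get(g, ()) for g in background)
        k = sum(term in go_terms.get(g, ()) for g in family)
        p = hypergeom_pvalue(k, n, K, N)
        if p <= alpha:
            results[term] = p
    return results
```

In practice, the p-values over all terms and families would normally be adjusted for multiple testing before any term is reported as significant.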

About this paper

Cite this paper

Xu, Q., Jin, L., Zheng, C., Leebens-Mack, J.H., Sankoff, D. (2021). RACCROCHE: Ancestral Flowering Plant Chromosomes and Gene Orders Based on Generalized Adjacencies and Chromosomal Gene Co-occurrences. In: Jha, S.K., Măndoiu, I., Rajasekaran, S., Skums, P., Zelikovsky, A. (eds) Computational Advances in Bio and Medical Sciences. ICCABS 2020. Lecture Notes in Computer Science, vol 12686. Springer, Cham. https://doi.org/10.1007/978-3-030-79290-9_9

DOI: https://doi.org/10.1007/978-3-030-79290-9_9
Published: 03 July 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-79289-3
Online ISBN: 978-3-030-79290-9
eBook Packages: Computer Science, Computer Science (R0)
