RACCROCHE: Ancestral Flowering Plant Chromosomes and Gene Orders Based on Generalized Adjacencies and Chromosomal Gene Co-occurrences

  • Conference paper
Computational Advances in Bio and Medical Sciences (ICCABS 2020)

Abstract

Given the phylogenetic relationships of several extant species, the reconstruction of their ancestral genomes at the gene and chromosome level is made difficult by the cycles of whole genome doubling followed by fractionation in plant lineages. Fractionation scrambles the gene adjacencies that enable existing reconstruction methods. We propose an alternative approach that postpones the selection of gene adjacencies for reconstructing small ancestral segments and instead accumulates a very large number of syntenically validated candidate adjacencies to produce long ancestral contigs through maximum weight matching. Likewise, we do not construct chromosomes by successively piecing together contigs into larger segments, but instead count all contig co-occurrences on the input genomes and cluster these, so that chromosomal assemblies of contigs all emerge naturally ordered at each ancestral node of the phylogeny. These strategies result in substantially more complete reconstructions than existing methods. We deploy a number of quality measures: contig lengths, continuity of contig structure on successive ancestors, coverage of the reconstruction on the input genomes, and rearrangement implications of the chromosomal structures obtained. The reconstructed ancestors can be functionally annotated and are visualized by painting the ancestral projections on the descendant genomes, and by highlighting syntenic ancestor-descendant relationships. We apply our methods to genomes drawn from a broad range of monocot orders, confirming the tetraploidization event “tau” in the stem lineage between the alismatids and the lilioids.

Availability

The annotated genomic data is accessible on the CoGe platform https://genomevolution.org/coge/ and Phytozome. The pipeline is available at https://github.com/jin-repo/RACCROCHE.

References

  1. Amborella Genome Project: The Amborella genome and the evolution of flowering plants. Science 342(6165), 1241089 (2013)

  2. Anselmetti, Y., Luhmann, N., Bérard, S., Tannier, E., Chauve, C.: Comparative methods for reconstructing ancient genome organization. In: Setubal, J.C., Stoye, J., Stadler, P.F. (eds.) Comparative Genomics. MMB, vol. 1704, pp. 343–362. Springer, New York (2018). https://doi.org/10.1007/978-1-4939-7463-4_13

  3. Avdeyev, P., Alexeev, N., Rong, Y., Alekseyev, M.A.: A unified ILP framework for core ancestral genome reconstruction problems. Bioinformatics 36(10), 2993–3003 (2020)

  4. Badouin, H., et al.: The sunflower genome provides insights into oil metabolism, flowering and Asterid evolution. Nature 546(7656), 148–152 (2017)

  5. Chauve, C., Tannier, E.: A methodological framework for the reconstruction of contiguous regions of ancestral genomes and its application to mammalian genomes. PLoS Comput. Biol. 4(11), e1000234 (2008)

  6. Givnish, T.J., et al.: Monocot plastid phylogenomics, timeline, net rates of species diversification, the power of multi-gene analyses, and a functional model for the origin of monocots. Am. J. Bot. 105(11), 1888–1910 (2018)

  7. Ma, J., et al.: Reconstructing contiguous regions of an ancestral genome. Genome Res. 16(12), 1557–1565 (2006)

  8. Martí, R., Reinelt, G., Duarte, A.: A benchmark library and a comparison of heuristic methods for the linear ordering problem. Comput. Optim. Appl. 51(3), 1297–1317 (2012). https://doi.org/10.1007/s10589-010-9384-9

  9. Mazowita, M., Haque, L., Sankoff, D.: Stability of rearrangement measures in the comparison of genome sequences. J. Comput. Biol. 13(2), 554–566 (2006)

  10. Murat, F., Armero, A., Pont, C., Klopp, C., Salse, J.: Reconstructing the genome of the most recent common ancestor of flowering plants. Nat. Genet. 49, 490–496 (2017)

  11. Perrin, A., Varré, J.S., Blanquart, S., Ouangraoua, A.: ProCARs: progressive reconstruction of ancestral gene orders. BMC Genomics 16(S5) (2015). Article number: S6. https://doi.org/10.1186/1471-2164-16-S5-S6

  12. Rubert, D.P., Martinez, F.V., Stoye, J., Doerr, D.: Analysis of local genome rearrangement improves resolution of ancestral genomic maps in plants. BMC Genomics 21, 1–11 (2020). https://doi.org/10.1186/s12864-020-6609-x

  13. Schiavinotto, T., Stützle, T.: The linear ordering problem: instances, search space analysis and algorithms. J. Math. Model. Algorithms 3(4), 367–402 (2004). https://doi.org/10.1007/s10852-005-2583-1

  14. Tannier, E., Bazin, A., Davín, A., Guéguen, L., Bérard, S., Chauve, C.: Ancestral genome organization as a diagnosis tool for phylogenomics (2020)

  15. Wang, Y., et al.: MCScanX: a toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic Acids Res. 40(7), e49 (2012)

  16. Xu, Q., Jin, L., Zheng, C., Leebens-Mack, J.H., Sankoff, D.: Validation of automated chromosome recovery in the reconstruction of ancestral gene order. Algorithms 14, 160 (2021)

  17. Xu, X., Sankoff, D.: Tests for gene clusters satisfying the generalized adjacency criterion. In: Bazzan, A.L.C., Craven, M., Martins, N.F. (eds.) BSB 2008. LNCS, vol. 5167, pp. 152–160. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-85557-6_14

  18. Yang, Z., Sankoff, D.: Natural parameter values for generalized gene adjacency. J. Comput. Biol. 17(9), 1113–1128 (2010)

  19. Zheng, C., Chen, E., Albert, V.A., Lyons, E., Sankoff, D.: Ancient eudicot hexaploidy meets ancestral eurosid gene order. BMC Genomics 14(S7), S3 (2013)

Acknowledgements

We thank the Department of Energy Joint Genome Institute staff and collaborators, including David Kudrna, Jerry Jenkins, Jane Grimwood, Shengqiang Shu, and Jeremy Schmutz, for pre-publication access to the Acorus genome sequence and annotation. We also thank Aïda Ouangraoua for much help in implementing ProCARs [11] and Haibao Tang for prompt replies to queries about MCScanX [15].

Funding

Research supported by Discovery grants to LJ and DS from the Natural Sciences and Engineering Research Council of Canada. DS holds the Canada Research Chair in Mathematical Genomics.

Author information

Corresponding author

Correspondence to David Sankoff.

Appendices

A Redistributing Genes from Families Exceeding Upper Size Limits

As an optional second “redistribution” step, all families with more than NF members, or with more than NG members in any particular genome, are flagged. The construction of the families is then repeated, with the restriction that no gene can be recruited to a family solely by virtue of a similarity below some threshold homology level \(\theta \) to a gene already in the family. The intent is to break up large families held together by a few weak links, and thus to retrieve some better supported smaller families.
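The redistribution step can be illustrated with a minimal sketch. Several things here are assumptions not stated in the text: families are modelled as single-linkage clusters over pairwise similarity scores, NF is read as a limit on total family size, only the flagged families are rebuilt, and the names `build_families`, `is_oversized` and `redistribute` are hypothetical.

```python
from collections import defaultdict

def build_families(genes, similarities, theta=float("-inf")):
    """Single-linkage families: two genes are linked when similarity >= theta."""
    parent = {g: g for g in genes}

    def find(g):
        # Union-find lookup with path halving.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for a, b, score in similarities:
        if score >= theta:
            parent[find(a)] = find(b)

    families = defaultdict(set)
    for g in genes:
        families[find(g)].add(g)
    return list(families.values())

def is_oversized(family, genome_of, NF, NG):
    """Flag a family with more than NF members or more than NG in one genome."""
    per_genome = defaultdict(int)
    for g in family:
        per_genome[genome_of[g]] += 1
    return len(family) > NF or max(per_genome.values()) > NG

def redistribute(genes, similarities, genome_of, NF, NG, theta):
    """Rebuild each flagged family using only links of similarity >= theta."""
    result = []
    for fam in build_families(genes, similarities):
        if is_oversized(fam, genome_of, NF, NG):
            inner = [(a, b, s) for a, b, s in similarities
                     if a in fam and b in fam]
            result.extend(build_families(sorted(fam), inner, theta))
        else:
            result.append(fam)
    return result

# Toy data: a weak link (0.3) joins two well supported pairs.
genes = ["a1", "a2", "b1", "b2"]
genome_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
similarities = [("a1", "b1", 0.9), ("a2", "b2", 0.8), ("b1", "a2", 0.3)]
families = redistribute(genes, similarities, genome_of, NF=10, NG=1, theta=0.5)
```

In this toy input the single initial family has two members per genome, so it is flagged for NG = 1; dropping the 0.3 link splits it into two smaller families.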

B Modes of Contig Construction

RACCROCHE executes for a single set of parameters \((W, NF, NG)\), or for a range of values of W and NG. In the latter case, there is an option, designed to increase coherence among sets of contigs for successive ancestors, under which the MWM for any combination of W and NG is restricted to include, insofar as possible, all adjacencies already recovered for smaller values of W or NG. Thus, starting with some small W and NG, we can construct MWM solutions, and hence sets of contigs, for larger window size and/or larger gene family size by incrementing one or the other of the parameters.

It is possible, however, to have conflicts between the \((W,NG-1)\) and \((W-1,NG)\) analyses. For example, if adjacencies (ab) and (bc) are in the MWM for \((W, NG-1)\) and (ab) and (bd) are in the MWM for \((W-1, NG)\), then a matching for \((W, NG)\) cannot be forced to include all adjacencies from the two previous MWMs. To accommodate this possibility, when we restrict the MWM for \((W, NG)\) to include all adjacencies from \((W,NG-1)\) and \((W-1,NG)\), we make an exception for any adjacencies from either that are in potential conflict with adjacencies from the other. Thus (ab) in the example above might be obligatorily included, but (bc) and (bd) would not. The MWM for \((W, NG)\) might then include (bc) or (bd), but not both.
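The conflict exception can be sketched as a small set computation. This is an illustration under assumptions: each adjacency is represented as a frozenset of two gene extremities (strings such as `"b_t"` for the tail of gene b), two adjacencies are taken to conflict when they differ but share an extremity, and the function name `obligatory_adjacencies` is hypothetical. The matching step itself is not shown.

```python
def obligatory_adjacencies(mwm_a, mwm_b):
    """Adjacencies from two earlier matchings that the next one must keep.

    An adjacency from one matching is dropped from the obligatory set when
    it shares a gene extremity with a different adjacency of the other.
    """
    def conflicts(adj, other_set):
        return any(adj != other and adj & other for other in other_set)

    keep = set()
    for adj in mwm_a:
        if not conflicts(adj, mwm_b):
            keep.add(adj)
    for adj in mwm_b:
        if not conflicts(adj, mwm_a):
            keep.add(adj)
    return keep

# The example from the text: (ab), (bc) versus (ab), (bd).
ab = frozenset({"a_t", "b_h"})
bc = frozenset({"b_t", "c_h"})
bd = frozenset({"b_t", "d_h"})

mwm_w_ng_minus_1 = {ab, bc}   # matching for (W, NG-1)
mwm_w_minus_1_ng = {ab, bd}   # matching for (W-1, NG)
forced = obligatory_adjacencies(mwm_w_ng_minus_1, mwm_w_minus_1_ng)
```

Here only (ab) ends up in `forced`; (bc) and (bd) compete for the same extremity of b, so both are left for the new matching to decide.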

C Matching Contigs to Chromosomes of Extant Genomes

For the ancestor genome, A, computed from a set of extant genomes neighbouring A, \(G_{1\cdots n}\), perform the following steps.

  1. Extract gene features of ancestor A in descendant genomes.

    For every gene, g, in ancestor A computed in Step 2, retrieve six features of this gene in every extant genome \(G_{1\cdots n}\) involved in constructing ancestor A. The features of a gene are its chromosome ID, start and end chromosomal positions, the distance between g and its next adjacent gene in \(G_i\), the gene family ID assigned in Step 1, and the contig ID in A, denoted as \(g^{A\rightarrow G_i } (chr,start,end,distance,gf,ctg)\).

  2. Map ancestor A to each of the descendant genomes.

    The ancestor is mapped as ancestral syntenic blocks on the descendant genome in two steps. The first step initializes a syntenic block by merging two adjacent genes given a distance threshold DIS: merge two genes, \(g_1\) and \(g_2\), into one ancestral syntenic block on \(G_i\) if they satisfy the following conditions:

    (a) \(g_1\) and \(g_2\) are located on the same chromosome of \(G_i\);

    (b) \(g_1\) and \(g_2\) are adjacent to each other; in other words, there may be a non-coding region but no other gene(s) between \(g_1\) and \(g_2\);

    (c) the distance between the two adjacent genes is less than or equal to the distance threshold DIS (e.g. \(DIS=1\) Mbp).

    The second step extends the ancestral syntenic block identified above by merging flanking gene(s) into the block if they satisfy the same three conditions. Extension stops when no flanking gene can be merged into the block. After the two steps, an ancestral synteny block mapping A to \(G_i\) is denoted as syntenyBlk(chr, start, end, ctg, len). The set of synteny blocks between A and \(G_i\) is

    \(syntenyBlkSet^{A\rightarrow G_i}=\{syntenyBlk_k(chr,start,end,ctg,len) \mid 1\le k\le m\}\), where m is the total number of synteny blocks mapping from A to \(G_i\).
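The two-step merge and extend procedure above can be sketched as a single left-to-right sweep. This is a sketch under assumptions rather than the pipeline's implementation: the input holds only the genes of \(G_i\) that map to ancestor A, consecutive entries in positional order are treated as adjacent, genes are merged only when they carry the same contig ID, and a gene with no mergeable neighbour is kept as a one-gene block. The `Gene` class and `build_synteny_blocks` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Gene:
    chr: str    # chromosome ID in the extant genome
    start: int  # start position in bp
    end: int    # end position in bp
    ctg: int    # ancestral contig ID

def build_synteny_blocks(genes, DIS=1_000_000):
    """Merge neighbouring genes into blocks (chr, start, end, ctg, len)."""
    blocks = []
    ordered = sorted(genes, key=lambda g: (g.chr, g.start))
    for g in ordered:
        last = blocks[-1] if blocks else None
        if (last is not None
                and last["chr"] == g.chr            # condition (a)
                and last["ctg"] == g.ctg            # assumed: same contig
                and g.start - last["end"] <= DIS):  # condition (c)
            # Extend the current block with the flanking gene.
            last["end"] = max(last["end"], g.end)
        else:
            blocks.append({"chr": g.chr, "start": g.start,
                           "end": g.end, "ctg": g.ctg})
    for b in blocks:
        b["len"] = b["end"] - b["start"]
    return blocks

genes = [
    Gene("chr1", 100, 200, 0),
    Gene("chr1", 300, 400, 0),
    Gene("chr1", 2_000_000, 2_000_100, 0),  # too far from the previous gene
    Gene("chr2", 50, 150, 1),
]
blocks = build_synteny_blocks(genes)
```

With the default DIS of 1 Mbp, the first two genes merge into one block and the other two each start a new block.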

D Construction of Ancestral Chromosomes

  1. Retain only the blocks longer than a block length threshold.

    Given a block length threshold, blockLEN, \(\overline{syntenyBlkSet}^{A\rightarrow G_i}\) is the subset of \(syntenyBlkSet^{A\rightarrow G_i}\) consisting of the blocks longer than blockLEN (e.g. \(blockLEN = 150\) Kbp).

  2. Count co-occurrences of ancestral contigs on the same chromosomes.

    Based on syntenyBlk.chr and syntenyBlk.ctg of each pair of synteny blocks in \(\overline{syntenyBlkSet}^{A\rightarrow G_i}\), gather the co-occurrences of ancestral contigs on the same extant chromosome. Write the result into the lower triangle of an \(NC\times NC\) matrix, m, whose rows and columns are contigs with IDs from 0 to \(NC-1\), and where \(m_{i,j}\) is the number of co-occurrences of contigs i and j, with \(0\le j<i\le NC-1\). The maximum co-occurrence frequency in m is denoted \(\max _{freq}\).

  3. Cluster ancestral contigs into ancestral chromosomes according to a pairwise distance matrix based on co-occurrence.

    An \(NC\times NC\) distance matrix, dmat, is calculated as

    $$dmat_{i,j}= -\log (\frac{\max _{freq}-m_{i,j} }{\max _{freq}}).$$

    This distance matrix is fed into the complete-link clustering algorithm. The resulting hierarchy can then be cut into K clusters, according to the user's preference. The resulting clusters of contigs correspond to ancestral chromosomes and their compositions.

Last, attach the ancestral chromosome number as an attribute to each synteny block:

$$ syntenyBlkSet^{A\rightarrow G_{1\cdots N}}=\{syntenyBlk_k(chr,start,end,ctg,len,ancestral\_chr)\}, $$

where \(ancestral\_chr\) is the ID of the cluster to which syntenyBlk.ctg belongs.
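The filtering and co-occurrence counting in steps 1 and 2 above can be sketched as follows. The text does not say whether a pair of contigs is counted once per chromosome or once per pair of blocks; this sketch assumes once per extant chromosome. Blocks are plain dictionaries with the keys `chr`, `ctg` and `len`, and `cooccurrence_matrix` is a hypothetical name.

```python
from itertools import combinations

def cooccurrence_matrix(blocks_by_genome, NC, blockLEN=150_000):
    """Lower-triangular co-occurrence counts m[i][j] (j < i) and max_freq."""
    m = [[0] * NC for _ in range(NC)]
    for blocks in blocks_by_genome.values():
        # Contigs present on each chromosome, using only long blocks (step 1).
        contigs_on_chr = {}
        for b in blocks:
            if b["len"] > blockLEN:
                contigs_on_chr.setdefault(b["chr"], set()).add(b["ctg"])
        # Each pair of contigs sharing a chromosome counts once (step 2).
        for contigs in contigs_on_chr.values():
            for j, i in combinations(sorted(contigs), 2):  # j < i
                m[i][j] += 1
    max_freq = max(max(row) for row in m)
    return m, max_freq

blocks_by_genome = {
    "G1": [
        {"chr": "chr1", "ctg": 0, "len": 200_000},
        {"chr": "chr1", "ctg": 2, "len": 300_000},
        {"chr": "chr1", "ctg": 1, "len": 100_000},  # filtered out: too short
        {"chr": "chr2", "ctg": 1, "len": 200_000},
        {"chr": "chr2", "ctg": 2, "len": 200_000},
    ],
    "G2": [
        {"chr": "chrA", "ctg": 0, "len": 400_000},
        {"chr": "chrA", "ctg": 2, "len": 250_000},
    ],
}
m, max_freq = cooccurrence_matrix(blocks_by_genome, NC=3)
```

In this toy input contigs 0 and 2 share a chromosome in both genomes, so `m[2][0]` is 2 and becomes `max_freq`; the upper triangle is left at zero.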

To order the contigs along each chromosome, we proceed as follows.

After \(syntenyBlkSet^{A\rightarrow G_{1\cdots N}}\) is generated in Step 3, the relative ordering of every pair of contigs is counted. The number of times each contig appears upstream or downstream of each other contig is recorded in an \(NC\times NC\) ordering matrix, C, whose rows and columns are contig IDs from 0 to \(NC-1\). The entry \(c_{i,j}\) is the number of times contig i occurs upstream of contig j in the extant chromosomes.

Given the ordering matrix C, the linear ordering problem (LOP) is the problem of finding a permutation \(\pi \) of the column and row indices \(\{1, \cdots , NC\}\), such that the value

$$\begin{aligned} f(\pi ) = \sum _{i=1}^{NC} \sum _{j=i+1}^{NC} c_{\pi (i),\pi (j)} \end{aligned} \tag{3}$$

is maximized [13]. In other words, the goal is to find a permutation of the columns and rows of C such that the sum of the elements in the upper triangle is maximized.

We solve the LOP with a meta-heuristic solver, Tabu Search [8]. The resulting permutation gives the order of the contigs along each ancestral chromosome.
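The ordering matrix and the LOP objective can be illustrated on a toy input. This sketch does not implement Tabu Search: it swaps in an exhaustive search over all permutations, which is only feasible for a handful of contigs, purely to show what the objective rewards. Each extant chromosome is given as a list of contig IDs in positional order, and the function names are hypothetical.

```python
from itertools import permutations

def ordering_matrix(chromosomes, NC):
    """c[i][j] = number of times contig i lies upstream of contig j."""
    c = [[0] * NC for _ in range(NC)]
    for order in chromosomes:  # contig IDs in chromosomal order
        for pos, i in enumerate(order):
            for j in order[pos + 1:]:
                if i != j:
                    c[i][j] += 1
    return c

def lop_value(c, perm):
    """Sum of the upper triangle of c after permuting rows and columns."""
    return sum(c[perm[a]][perm[b]]
               for a in range(len(perm))
               for b in range(a + 1, len(perm)))

def best_order_bruteforce(c, contigs):
    """Exhaustive stand-in for a heuristic LOP solver; tiny inputs only."""
    return max(permutations(contigs), key=lambda p: lop_value(c, p))

# Three extant chromosomes carrying contigs 0, 1 and 2.
chromosomes = [[0, 1, 2], [0, 1, 2], [1, 0, 2]]
c = ordering_matrix(chromosomes, NC=3)
best = best_order_bruteforce(c, [0, 1, 2])
```

Two of the three chromosomes place contig 0 before contig 1, and all place contig 2 last, so the order (0, 1, 2) scores 8 against 7 for (1, 0, 2).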

E Functional Annotation of Ancestral Genes

We create a set of all genes in all families represented by ancestral genes in the reconstructed ancestor. This is the background set. For each gene family, all the genes in the family constitute a query set for GO-term enrichment analysis against the background set. Significant terms that emerge constitute the functional annotation for the ancestral gene.
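The text does not name the statistical test behind the enrichment analysis. As one common choice, and purely as an assumption, the sketch below uses a one-sided hypergeometric test of each query set against the background set, with no multiple-testing correction. The term labels, the `go_terms` mapping and the function names are hypothetical.

```python
from math import comb

def hypergeom_pvalue(N, K, n, k):
    """P(X >= k) when drawing n genes from N, of which K carry the term."""
    upper = min(K, n)
    tail = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1))
    return tail / comb(N, n)

def enriched_terms(family, background, go_terms, alpha=0.05):
    """Terms over-represented in one gene family relative to the background."""
    N, n = len(background), len(family)
    hits = {}
    terms = {t for g in family for t in go_terms.get(g, ())}
    for t in terms:
        K = sum(1 for g in background if t in go_terms.get(g, ()))
        k = sum(1 for g in family if t in go_terms.get(g, ()))
        p = hypergeom_pvalue(N, K, n, k)
        if p <= alpha:
            hits[t] = p
    return hits

# Toy background of 20 genes; "term_Y" is on every gene, "term_X" on three.
background = [f"g{i}" for i in range(20)]
go_terms = {g: {"term_Y"} for g in background}
for g in ("g0", "g1", "g2"):
    go_terms[g] = {"term_X", "term_Y"}

family = ["g0", "g1", "g2"]
hits = enriched_terms(family, background, go_terms)
```

A term carried by every background gene can never be enriched, so only `term_X` is reported for this family.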

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Xu, Q., Jin, L., Zheng, C., Leebens-Mack, J.H., Sankoff, D. (2021). RACCROCHE: Ancestral Flowering Plant Chromosomes and Gene Orders Based on Generalized Adjacencies and Chromosomal Gene Co-occurrences. In: Jha, S.K., Măndoiu, I., Rajasekaran, S., Skums, P., Zelikovsky, A. (eds) Computational Advances in Bio and Medical Sciences. ICCABS 2020. Lecture Notes in Computer Science, vol 12686. Springer, Cham. https://doi.org/10.1007/978-3-030-79290-9_9

  • DOI: https://doi.org/10.1007/978-3-030-79290-9_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-79289-3

  • Online ISBN: 978-3-030-79290-9

  • eBook Packages: Computer Science, Computer Science (R0)
