DOI: 10.1145/974614.974658
Article

A structural perspective on genome evolution

Published: 27 March 2004

ABSTRACT

At UCL we have developed several automated protocols for generating protein family resources (CATH; Gene3D). These resources can be used to perform comparative genome analyses in order to understand the evolution of protein families. They can also be used to identify biologically and/or medically interesting families for which no structural data currently exist and which may therefore be important targets for structural genomics initiatives.

The CATH domain structure database, established by Orengo and Thornton in 1993, now contains a significant proportion of protein structures from the PDB clustered into 1400 evolutionary families. Relationships have been identified using robust structure comparison methods (SSAP, CATHEDRAL). We have also benchmarked and optimised various 1D-profile and HMM-based protocols for assigning genome sequences to families within the resource (e.g. SAM-T99, SAMOSA, CATH-ISL).

In this way we can assign structural data to a large proportion (up to 60%) of whole or partial sequences in completed genomes and to >80% of genes coding for enzymes and other proteins in biochemical pathways. However, in order to include all families regardless of whether their structure is known or not, a new protein family resource has been developed (Gene3D). In Gene3D, complete genes have been clustered according to sequence similarity alone, using a robust clustering method (Pfscape). 120 completed genomes from all kingdoms have been clustered into 220,000 gene families, 70,000 of which contain 2 or more sequences. Subsequently, we have labelled those gene families for which CATH structural or Pfam functional domain annotations can be provided for all or part of the gene.

Preliminary analysis of the genome annotations reveals that a significant proportion (up to 70%) of CATH-annotated genes or gene regions in genomes are assigned to domain families that are common to all three kingdoms of life. However, only 20% of the genome sequences are assigned to gene families common to all kingdoms. Since a large proportion of these genes are multidomain proteins, this supports the view that a great deal of functional diversity within the genomes has been achieved by combining domain modules in different ways.

In collaboration with Professor Janet Thornton, we have analysed a subset of 56 bacterial genomes to determine the recurrence of specific domain structure families within the genomes. This revealed a small but essential group of universal, and in some cases highly recurring, domain families. For some families (size-dependent), domain recurrence is highly correlated with increase in genome size, whilst for others (size-independent) no correlation is observed. Statistical analysis allowed us to distinguish three groups. Within the size-dependent families we differentiated two groups: linearly distributed and non-linearly distributed. Functional annotation using the COGs revealed that these domains were predominantly involved in metabolism and regulation, respectively. A third group of evenly distributed, size-independent domains is primarily involved in protein translation and biosynthesis.

By mapping CATH and Pfam domain families onto all the genome sequences in Gene3D, we observe that a few hundred highly recurrent families dominate at least 50% of whole or partial genome sequences. Many of these families are common to both prokaryotes and eukaryotes and perform essential generic functions. In many of the largest families, significant divergence in sequence has been accompanied by modifications in structure and function. Targeting representatives in these families for structure determination will allow the structural genomics initiatives to map both fold and function space and reveal the mechanisms by which divergence in protein families promotes evolution of new functions.
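As a rough illustration of the size-dependence analysis described in the abstract, the Python sketch below sorts a domain family into one of the three groups from its per-genome occurrence counts. It is not the statistical procedure used in this work, which the abstract does not specify. The function names, the correlation threshold `r_min`, the log-log slope criterion and its tolerance `slope_tol` are all assumptions chosen for illustration.

```python
import math
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x), i.e. a power-law exponent."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = mean(lx), mean(ly)
    sxx = sum((x - mx) ** 2 for x in lx)
    return sum((x - mx) * (y - my) for x, y in zip(lx, ly)) / sxx


def classify_family(genome_sizes, counts, r_min=0.7, slope_tol=0.25):
    """Hypothetical three-way classification of one domain family.

    genome_sizes: number of genes per genome (positive values)
    counts:       occurrences of the family in each genome (positive values,
                  as for a family present in every genome)
    """
    if any(c <= 0 for c in counts) or any(g <= 0 for g in genome_sizes):
        raise ValueError("this sketch assumes strictly positive sizes and counts")
    # Assumed cut-off: a weak correlation is treated as size-independent.
    if pearson(genome_sizes, counts) < r_min:
        return "size-independent"
    # Assumed criterion: an exponent close to 1 means linear growth.
    if abs(loglog_slope(genome_sizes, counts) - 1.0) <= slope_tol:
        return "size-dependent, linear"
    return "size-dependent, non-linear"


# Toy data, invented for illustration only.
sizes = [1000, 2000, 4000, 8000]
linear_label = classify_family(sizes, [2, 4, 8, 16])
nonlinear_label = classify_family(sizes, [1, 4, 16, 64])
flat_label = classify_family(sizes, [3, 4, 3, 4])
```

With real data, the thresholds would need to be calibrated, and a formal model comparison would be more defensible than the fixed cut-offs assumed here.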

Published in

RECOMB '04: Proceedings of the eighth annual international conference on Research in computational molecular biology
March 2004
370 pages
ISBN: 1581137559
DOI: 10.1145/974614

      Copyright © 2004 ACM


      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Acceptance Rates

      Overall Acceptance Rate: 148 of 538 submissions, 28%