OMA, A Comprehensive, Automated Project for the Identification of Orthologs from Complete Genome Data: Introduction and First Achievements

Dessimoz, Christophe; Cannarozzi, Gina; Gil, Manuel; Margadant, Daniel; Roth, Alexander; Schneider, Adrian; Gonnet, Gaston H.

doi:10.1007/11554714_6

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 3678)

Included in the following conference series:

RECOMB Workshop on Comparative Genomics

Abstract

The OMA project is a large-scale effort to identify groups of orthologs from complete genome data, currently covering 150 species. The algorithm relies solely on protein sequence information and does not require any human supervision. It has several original features, in particular a verification step that detects paralogs and prevents them from being clustered together. Consistency checks and verification are performed throughout the process. Wherever a comparison could be made, the resulting groups are highly consistent both with EC assignments and with assignments from the manually curated database HAMAP. A highly accurate set of orthologous sequences forms the basis for several other investigations, including phylogenetic analysis and protein classification.
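
As an illustration of what a consistency comparison against EC assignments might look like, the sketch below counts how many groups have members that all share one EC number. It is only a minimal sketch and not the authors' evaluation procedure: the function name `group_consistency`, the data layout, the toy identifiers, and the rule that a group is comparable only when at least two of its members carry an EC annotation are all assumptions made for this example.

```python
def group_consistency(groups, ec_of):
    """Count groups whose annotated members agree on a single EC number.

    groups: dict mapping a group id to a list of protein ids (hypothetical layout).
    ec_of:  dict mapping a protein id to its EC number; unannotated proteins are absent.
    Returns (consistent, comparable). A group is "comparable" only if at least
    two of its members are annotated (an assumption of this sketch).
    """
    consistent = 0
    comparable = 0
    for members in groups.values():
        # Keep only the members that have an EC annotation.
        ecs = [ec_of[m] for m in members if m in ec_of]
        if len(ecs) < 2:
            continue  # no comparison can be made for this group
        comparable += 1
        if len(set(ecs)) == 1:
            consistent += 1
    return consistent, comparable


# Toy data, invented for illustration only.
groups = {
    "G1": ["a1", "b1", "c1"],  # two annotated members, same EC number
    "G2": ["a2", "b2"],        # two annotated members, different EC numbers
    "G3": ["a3", "b3"],        # only one annotated member, so not comparable
}
ec_of = {
    "a1": "1.1.1.1",
    "b1": "1.1.1.1",
    "a2": "2.7.1.1",
    "b2": "3.1.3.1",
    "a3": "4.2.1.1",
}

consistent, comparable = group_consistency(groups, ec_of)
print(f"{consistent} of {comparable} comparable groups are consistent")
```

Skipping groups with fewer than two annotated members mirrors the abstract's qualifier that agreement is reported only where a comparison could be made.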

Author information

Authors and Affiliations

Institute of Computational Science, ETH Zurich, CH-8092, Zürich
Christophe Dessimoz, Gina Cannarozzi, Manuel Gil, Daniel Margadant, Alexander Roth, Adrian Schneider & Gaston H. Gonnet

Editor information

Editors and Affiliations

University of Dublin, Smurfit Institute of Genetics, Trinity College, Ireland
Aoife McLysaght
Center for Bioinformatics ZBIT, Tübingen University, Sand 14, 72076, Tübingen, Germany
Daniel H. Huson

About this paper

Cite this paper

Dessimoz, C. et al. (2005). OMA, A Comprehensive, Automated Project for the Identification of Orthologs from Complete Genome Data: Introduction and First Achievements. In: McLysaght, A., Huson, D.H. (eds) Comparative Genomics. RCG 2005. Lecture Notes in Computer Science, vol 3678. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11554714_6

DOI: https://doi.org/10.1007/11554714_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-28932-6
Online ISBN: 978-3-540-31814-9
eBook Packages: Computer Science, Computer Science (R0)
