DOI: 10.1145/1621995.1622040
research-article

Experience report: issues in comparing gene function annotation in text

Published: 05 October 2009

Abstract

Accurately annotating gene function is one of the most important tasks in molecular biology and the medical sciences. Next-generation sequencing technology has made whole-genome sequencing possible at a fraction of the cost of traditional sequencing technology. As a result, the amount of sequence data has been growing very rapidly, but computational methods for gene function annotation are yet to be fully developed. Annotation of gene function is therefore a serious bottleneck for high-throughput genome projects. The most commonly used gene annotation technique transfers annotation on the basis of sequence similarity: the annotation of the genes that rank highest in sequence similarity to a target gene is simply transferred to the target gene. However, annotation based on sequence similarity alone is often incorrect. As a result, genome projects still rely on an expensive, error-prone, labor-intensive manual process. Combining annotation and sequence similarity can improve the accuracy of gene function annotation significantly. We have been developing a computational method for comparing gene annotations in text. In this paper, we discuss issues in comparing genome annotations in text format. To compute textual similarity, we used cosine similarity. Because cosine similarity is effective only after textual variations have been normalized by preprocessing, we used common text preprocessing techniques, such as stop-word removal and stemming, together with preprocessing specific to gene annotation, such as handling synonyms and gene symbols with biological terminology databases such as BioThesaurus and MeSH. In experiments with annotations of a number of bacterial genomes, our method correctly handled many difficult cases, that is, gene function annotations that are syntactically different but semantically equivalent.
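
The idea of preprocessing annotation text before computing cosine similarity can be illustrated with a minimal sketch. This is not the authors' implementation: the stop-word list, the suffix-stripping stemmer, and the one-entry synonym table are all hypothetical stand-ins chosen for illustration, and a real system would presumably draw synonyms and gene symbols from a resource such as BioThesaurus or MeSH.

```python
import math
import re
from collections import Counter

# Assumption: a tiny illustrative stop-word list, not the one used in the paper.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}

# Assumption: a hypothetical synonym table; a real one would be built from
# a terminology database rather than written by hand.
SYNONYMS = {"putative": "probable"}


def crude_stem(token):
    """Strip a few common suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, split on non-alphanumerics, drop stop words,
    map synonyms to one canonical form, stem, and count terms."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    tokens = [SYNONYMS.get(t, t) for t in tokens]
    return Counter(crude_stem(t) for t in tokens)


def cosine(vec_a, vec_b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


first = "Putative ATP-binding protein"
second = "probable ATP binding proteins"

# Without preprocessing, the whitespace-split tokens share no terms.
raw_score = cosine(Counter(first.lower().split()), Counter(second.lower().split()))

# With preprocessing, both annotations reduce to the same terms.
clean_score = cosine(preprocess(first), preprocess(second))

print(raw_score, clean_score)
```

The two example annotations differ in hyphenation, plural form, and choice of qualifier, so a comparison of raw tokens finds nothing in common, while the normalized term vectors coincide. This is the kind of syntactically different but semantically equivalent pair the abstract refers to, although the example strings themselves are invented here.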

    Published In

    SIGDOC '09: Proceedings of the 27th ACM international conference on Design of communication
    October 2009
    328 pages
    ISBN:9781605585598
    DOI:10.1145/1621995
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. cosine similarity
    2. genome annotation
    3. text comparison

    Qualifiers

    • Research-article

    Conference

    SIGDOC '09

    Acceptance Rates

    Overall Acceptance Rate 355 of 582 submissions, 61%
