DOI: 10.1145/1076034.1076060
Article

An application of text categorization methods to gene ontology annotation

Published: 15 August 2005

ABSTRACT

This paper applies IR and text categorization methods to a highly practical problem in biomedicine: Gene Ontology (GO) annotation. GO annotation, which describes gene functions using a controlled vocabulary, is a major activity in most model organism database projects. As a first step toward automatic GO annotation, we aim to assign GO domain codes given a specific gene and an article in which the gene appears, which is one of the tasks at the TREC 2004 Genomics Track. We approached the task with careful attention to the specialized terminology, in particular to the various forms of gene synonyms, so as to exhaustively locate occurrences of the target gene. We extracted the words around the gene occurrences and used them to represent the gene for GO domain code annotation. As a classifier, we adopted a variant of k-Nearest Neighbor (kNN) with supervised term weighting schemes to improve performance, which placed our method among the top-performing systems in the TREC official evaluation. Moreover, we show that the proposed framework applies successfully to another task of the Genomics Track, with results comparable to those of the best-performing system.
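
The pipeline outlined in the abstract (collect the words around gene mentions, weight terms using label information, classify with kNN) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the single-token synonym matching, the window size, the use of chi-square as the supervised weight, and the vote threshold are all assumptions made for illustration. BP, CC, and MF denote the three GO domains (biological process, cellular component, molecular function).

```python
import math
import re
from collections import Counter


def context_words(text, synonyms, window=5):
    """Count words within `window` tokens of any gene synonym occurrence.

    Assumption: synonyms are single alphanumeric tokens; real gene names
    need far more careful matching (multi-word names, hyphens, variants).
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    names = {s.lower() for s in synonyms}
    bag = Counter()
    for i, tok in enumerate(tokens):
        if tok in names:
            lo, hi = max(0, i - window), i + window + 1
            bag.update(t for t in tokens[lo:hi] if t not in names)
    return bag


def chi_square(train, label):
    """Chi-square score of each term for one label (2x2 contingency table).

    `train` is a list of (bag_of_words, set_of_labels) pairs.
    """
    n = len(train)
    n_pos = sum(1 for _, labels in train if label in labels)
    df, df_pos = Counter(), Counter()
    for bag, labels in train:
        for t in bag:
            df[t] += 1
            if label in labels:
                df_pos[t] += 1
    scores = {}
    for t in df:
        a = df_pos[t]        # term present, label present
        b = df[t] - a        # term present, label absent
        c = n_pos - a        # term absent, label present
        d = n - n_pos - b    # term absent, label absent
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[t] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def global_weights(train, labels):
    """Supervised term weight: the maximum chi-square score over all labels."""
    weights = Counter()
    for label in labels:
        for t, s in chi_square(train, label).items():
            weights[t] = max(weights[t], s)
    return weights


def weigh(bag, term_weight):
    """Term frequency multiplied by the supervised weight (used in place of idf)."""
    return {t: tf * term_weight.get(t, 0.0) for t, tf in bag.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def knn_predict(query, train_vecs, k=3, threshold=0.5):
    """Return every label whose similarity-weighted vote share among the
    k nearest neighbours reaches `threshold` (multi-label decision)."""
    ranked = sorted(
        ((cosine(query, vec), labels) for vec, labels in train_vecs),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    total = sum(sim for sim, _ in ranked)
    votes = Counter()
    for sim, labels in ranked:
        for label in labels:
            votes[label] += sim
    return {label for label, v in votes.items() if total and v / total >= threshold}
```

In use, each training pair (gene, article) would be turned into a bag with `context_words`, weighted with `weigh` using the output of `global_weights`, and stored with its known GO domain codes. A new pair is processed the same way and passed to `knn_predict`.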

Published in

SIGIR '05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
August 2005
708 pages
ISBN: 1595930345
DOI: 10.1145/1076034

Copyright © 2005 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

Overall acceptance rate: 792 of 3,983 submissions, 20%
