DOI: 10.1145/2350176.2350180
research-article

Learning to extract chemical names based on random text generation and incomplete dictionary

Published: 12 August 2012

ABSTRACT

Automatically extracting chemical names from text has significant value for biomedical and life science research. A major barrier in this task is the difficulty of obtaining a sizable, good-quality training set with which to train a reliable entity extraction model. Leveraging well-studied random text generation techniques based on formal grammars, we explore the idea of automatically creating training sets for chemical named entity extraction. Assuming the availability of an incomplete list of chemical names, we are able to generate well-controlled, random, yet realistic chemical-like training documents. Compared with state-of-the-art models learned from manually labeled data and with rule-based systems, using real-world data, our solutions achieve comparable or better results with the least human effort.
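The general idea of grammar-driven training-set generation can be illustrated with a minimal sketch. This is not the authors' grammar or implementation: the toy grammar, the four-entry name list, the BIO label names, and the functions `expand` and `generate_document` are all hypothetical, chosen only to show how a formal grammar plus an incomplete dictionary can yield labeled text with no manual annotation.

```python
import random

# Hypothetical incomplete dictionary of chemical names (illustrative only).
CHEMICAL_NAMES = ["acetic acid", "benzene", "sodium chloride", "2-methylpropan-1-ol"]

# Hypothetical toy grammar: each nonterminal maps to a list of alternative
# productions. "CHEM" is a special slot that is filled from the dictionary.
GRAMMAR = {
    "S": [["NP", "VP", "."]],
    "NP": [["the", "N"], ["CHEM"], ["a", "solution", "of", "CHEM"]],
    "VP": [["V", "NP"], ["was", "V_PP", "with", "CHEM"]],
    "N": [["mixture"], ["residue"], ["catalyst"]],
    "V": [["dissolves"], ["contains"], ["yields"]],
    "V_PP": [["treated"], ["washed"], ["diluted"]],
}


def expand(symbol, rng):
    """Recursively expand a symbol into a list of (token, BIO label) pairs."""
    if symbol == "CHEM":
        # Sample a dictionary entry; its tokens are labeled automatically.
        tokens = rng.choice(CHEMICAL_NAMES).split()
        return [(tok, "B-CHEM" if i == 0 else "I-CHEM") for i, tok in enumerate(tokens)]
    if symbol not in GRAMMAR:
        # Terminal word: always outside a chemical name.
        return [(symbol, "O")]
    production = rng.choice(GRAMMAR[symbol])
    labeled = []
    for child in production:
        labeled.extend(expand(child, rng))
    return labeled


def generate_document(n_sentences, seed=0):
    """Generate a list of labeled sentences from the start symbol "S"."""
    rng = random.Random(seed)
    return [expand("S", rng) for _ in range(n_sentences)]


if __name__ == "__main__":
    for sentence in generate_document(3, seed=42):
        print(" ".join(f"{tok}/{lab}" for tok, lab in sentence))
```

Because the labels are produced at the moment a dictionary entry is substituted into the `CHEM` slot, every generated token carries a label by construction. Output in this token/label form could then be fed to a sequence labeler; that downstream step is outside the scope of this sketch.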

Published in

BIOKDD '12: Proceedings of the 11th International Workshop on Data Mining in Bioinformatics
August 2012
38 pages
ISBN: 9781450315524
DOI: 10.1145/2350176
General Chairs: Jake Chen, Mohammed J. Zaki
Program Chairs: Tamer Kahveci, Saeed Salem, Mehmet Koyutürk

Copyright © 2012 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 7 of 16 submissions, 44%
